Parsr

Parsr is a minimal-footprint toolchain for cleaning, parsing, and extracting data from documents (images and PDFs). It generates organized, ready-to-use data for data scientists and developers: clean, structured, and label-enriched information for applications such as data entry automation, document analysis, and archival, among others.

Currently, Parsr can perform:

- Document Hierarchy Regeneration (Words, Lines and Paragraphs)
- Headings Detection
- Table Detection and Reconstruction
- Lists Detection
- Text Order Detection
- Table of Contents Detection
- Named Entity Recognition (Dates, Percentages, etc.)
- Key-Value Pair Detection (for the extraction of specific form-based entries)
- Page Number Detection
- Header-Footer Detection
- Link Detection
- Whitespace Removal

Parsr takes an image (.JPG, .PNG, .TIFF, ...) or a PDF as input and generates the following output formats:

- JSON
- Markdown
- Text
- CSV (for tables), or Pandas Dataframes
- PDF
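
As an illustration of how the CSV table output might be consumed downstream, here is a minimal sketch using only the Python standard library. It does not call Parsr itself: the sample data, the comma delimiter, the column names, and the `load_table` helper are all assumptions made for the example, so adjust them to match the files you actually get.

```python
import csv
import io

# Hypothetical sample standing in for a table exported as CSV.
# The columns and values are invented for illustration only.
SAMPLE_TABLE_CSV = "Item,Quantity,Price\nWidget,2,9.50\nGadget,1,24.00\n"


def load_table(csv_text, delimiter=","):
    """Parse CSV text into a list of dicts keyed by the header row.

    The delimiter is an assumption; change it if your export uses another one.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


rows = load_table(SAMPLE_TABLE_CSV)

# CSV cells arrive as strings, so convert before doing arithmetic.
total = sum(int(r["Quantity"]) * float(r["Price"]) for r in rows)
print(rows[0]["Item"], total)
```

Because each row is a plain dictionary, the same rows can be passed to `pandas.DataFrame(rows)` if pandas is installed and a DataFrame is preferred.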