Many to One OCR Pre-Processing

Chris_Hoch · Nov 22, 2020

Before getting into my lengthy question: please let me know if anyone is, or knows of, an expert Airtable consultant who has programming experience and is well versed in OCRing text-based documents and running basic NLP and text-mining queries on them.

I’m looking for advice on pre-processing multiple images of unstructured text that will be OCR’d, whether via Textract, Vision, or a better option someone may recommend.

In my use case there can be several thousand images, each of which is one page of a separate document (a document can be one page or many). For example, one use case may have 100 documents of 10 pages each, so the entire set contains 1,000 individual pages. Typically the pages are photographed and then provided to the end user in JPEG, TIFF, or PDF format. The end user typically receives the images ‘raw’: the person doing the imaging usually does not take the time to label files or combine the pages of each document into a single PDF. So the end user spends a lot of time reviewing, categorizing, and cataloging the images into documents.

What processes or packages are out there that can help combine the individual images into their respective documents? One could query the metadata on each image (for example, look at the timestamps and combine images that were taken close together in time). One could also try to combine pages through some sort of text-mining query after OCRing each page, but that would be more difficult given that the text is unstructured.
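
To make the timestamp idea concrete, here is a minimal sketch of what I have in mind. It assumes the capture time of each image has already been read (from EXIF or file metadata) into a `datetime`; the file names, the sample times, the `group_by_time_gap` helper, and the 60-second threshold are all made up for illustration and would need tuning against real data.

```python
from datetime import datetime, timedelta


def group_by_time_gap(pages, max_gap=timedelta(seconds=60)):
    """Group (filename, taken_at) pairs into candidate documents.

    A new group starts whenever the gap between two consecutive
    capture times is larger than max_gap (an assumed threshold).
    """
    ordered = sorted(pages, key=lambda page: page[1])
    groups = []
    for name, taken_at in ordered:
        # Compare against the capture time of the last page in the last group.
        if groups and taken_at - groups[-1][-1][1] <= max_gap:
            groups[-1].append((name, taken_at))
        else:
            groups.append([(name, taken_at)])
    return groups


# Hypothetical sample data: two bursts of photos five minutes apart.
pages = [
    ("img_003.jpg", datetime(2020, 11, 22, 10, 0, 40)),
    ("img_001.jpg", datetime(2020, 11, 22, 10, 0, 0)),
    ("img_002.jpg", datetime(2020, 11, 22, 10, 0, 20)),
    ("img_004.jpg", datetime(2020, 11, 22, 10, 5, 0)),
    ("img_005.jpg", datetime(2020, 11, 22, 10, 5, 15)),
]

groups = group_by_time_gap(pages)
for index, group in enumerate(groups, start=1):
    print(f"document {index}: {[name for name, _ in group]}")
```

This only works if the photographer pauses longer between documents than between pages, which may not hold for every batch.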

The end goal is to be able to:

1. Combine the separate images into their respective documents.
   a. I’m not sure whether it would be easier to treat each image as its own document or ‘row’ in a database and then group or link it to its master document, or to group the images into their own files. This also affects the OCR step: I’m unsure whether it makes more sense to OCR each page or file individually or to combine them beforehand.
   b. The only other ‘manual’ alternative I can think of is shipping the work off to Mechanical Turk.
2. OCR them.
3. Categorize (using data extraction), sort, filter, and group the various documents in Airtable.
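
To illustrate the ‘row per page’ option from (a), here is a rough sketch in which each image is its own record linked to a parent document, and the document text is assembled from per-page OCR output afterwards. The `PageRecord` fields, the `assemble_documents` helper, and the sample values are hypothetical, not an Airtable or OCR API; the OCR text is assumed to have been produced elsewhere.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PageRecord:
    """One row per image, linked to its parent document by document_id."""
    filename: str
    document_id: str
    page_number: int
    ocr_text: str = ""


def assemble_documents(records):
    """Join per-page OCR text into one string per document, in page order."""
    pages_by_document = defaultdict(list)
    for record in records:
        pages_by_document[record.document_id].append(record)
    return {
        document_id: "\n".join(
            page.ocr_text
            for page in sorted(pages, key=lambda page: page.page_number)
        )
        for document_id, pages in pages_by_document.items()
    }


# Hypothetical rows, deliberately out of order.
records = [
    PageRecord("img_002.jpg", "doc-1", 2, "Page two text"),
    PageRecord("img_001.jpg", "doc-1", 1, "Page one text"),
    PageRecord("img_004.jpg", "doc-2", 1, "Another document"),
]

documents = assemble_documents(records)
print(documents["doc-1"])
```

With this layout each page can be OCR’d individually and regrouped later if a page turns out to belong to a different document, without re-running OCR.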

They will mainly be sorted by dates, names (of referenced individuals/entities within the documents), and document type.

Apologies if this is somewhat vague. I know it’s hard to conceptualize without an example. Any help would be greatly appreciated.