What is the best way to extract text from many PDFs

July 25, 2022

Alexey_Gusev
Brainy

Hi,

I need to extract text from many (thousands of) PDF files and load it into Airtable so that an Airtable script can read the text for each file.
The files are stored locally, and I wrote an uploader that can, for example, take each batch of 200 files (file1, file2, etc.) and create a new table with one record per file.
I can bulk-convert the files with Acrobat (to DOC, HTML, or TXT) and upload the converted files the same way. The question is: which format can an Airtable script retrieve the text from?
For now, it looks like I need to either write a Node.js script (with pdfjs) that extracts the text, or remix an Airtable extension that uses the API to upload the text. Can it be done with less effort?

I haven't considered third-party tool subscriptions, at least for now, because I can't tell how often this will be used: maybe once, maybe more and more often, with up to hundreds of thousands of files in total.

Note: the files have already been OCR-processed and are editable PDFs. Maybe a file can be loaded as binary and the text extracted from it somehow?
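On the binary idea: the text in a PDF usually lives inside content streams that are Flate (zlib) compressed, so it is not visible in the raw bytes until those streams are inflated. Below is a rough sketch of that, assuming simple `(…) Tj` text operators and Flate-only streams. The function name `naive_pdf_text` is made up for illustration. Real OCR'd files often use `TJ` arrays, hex strings, or custom font encodings, which this does not handle, so a proper parser such as pdfjs is the safer route.

```python
import re
import zlib

# Naive stream finder; a real parser would use the /Length entry instead.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Matches only the simplest text operator: a literal string followed by Tj.
TEXT_RE = re.compile(rb"\((.*?)\)\s*Tj")


def naive_pdf_text(data: bytes) -> list[str]:
    """Inflate each stream and collect literal strings shown with Tj."""
    chunks = []
    for match in STREAM_RE.finditer(data):
        raw = match.group(1)
        try:
            # decompressobj tolerates the newline left before "endstream"
            content = zlib.decompressobj().decompress(raw)
        except zlib.error:
            content = raw  # not Flate-compressed (or another filter is used)
        for literal in TEXT_RE.findall(content):
            chunks.append(literal.decode("latin-1"))
    return chunks
```

Calling `naive_pdf_text(open(path, "rb").read())` on a real file will often return little or nothing useful, which is the main argument for converting to text first.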

It's OK for me to run the bulk conversion and upload manually.

To make the question clearer: which format can be used to retrieve the text with a script that runs once per record?

I don't care much about text formatting, because I will only use the text to extract values.
All files are small, 1-2 pages.
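For the format question, plain .txt is probably the least effort: the text can go straight into a long text field, and a per-record script then reads it as an ordinary string. Here is a minimal sketch of the local side, assuming the Acrobat-converted .txt files sit in one folder. The field names `Name` and `Text`, the `MAX_CHARS` cap, and the function names are assumptions for illustration. The endpoint and the 10-records-per-request limit are from Airtable's REST API.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

BATCH_SIZE = 10      # Airtable's REST API accepts at most 10 records per create request
MAX_CHARS = 100_000  # assumed cap for a long text field; adjust if yours differs


def build_batches(folder, name_field="Name", text_field="Text"):
    """Yield request payloads, each holding up to BATCH_SIZE records."""
    files = sorted(Path(folder).glob("*.txt"))
    for start in range(0, len(files), BATCH_SIZE):
        records = []
        for path in files[start:start + BATCH_SIZE]:
            text = path.read_text(encoding="utf-8", errors="replace")
            records.append(
                {"fields": {name_field: path.stem, text_field: text[:MAX_CHARS]}}
            )
        yield {"records": records}


def upload(payload, base_id, table_name, token):
    """POST one batch to Airtable; base_id, table_name and token are yours."""
    url = f"https://api.airtable.com/v0/{base_id}/{urllib.parse.quote(table_name)}"
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A loop such as `for batch in build_batches("converted"): upload(batch, base_id, "PDF text", token)` would push everything, ideally with a short pause between calls to stay under the API rate limit. Inside Airtable, a script can then read each record's text with `record.getCellValueAsString("Text")`.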

Justin_Barrett
Inspiring
July 26, 2022

I once used CloudConvert to convert PDF to text. Using CloudConvert via its API is slightly tricky to set up (multiple steps, waiting for the processing step to complete, etc.), and processing your file volume will require purchasing one of their packages, but it's the option that comes to mind most readily. Because manually run scripts have no time limits (to my knowledge), you could write one script that could, in theory, process all records in a table.
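The "waiting for the processing step to complete" part usually comes down to a polling loop. Here is a generic sketch of that pattern, not CloudConvert's actual client: `get_status` is a hypothetical callable that you would back with a real status request, and the status strings `"finished"` and `"error"` are assumptions to check against the service's documentation.

```python
import time


def wait_until_finished(get_status, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call get_status() repeatedly until the job finishes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "finished":   # assumed name of the success state
            return status
        if status == "error":      # assumed name of the failure state
            raise RuntimeError("conversion failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(interval)            # pause before asking again
```

With one such wait per file, a single manually run script can work through a whole table sequentially, which matches the no-time-limit approach described above.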

paulo12
Participating Frequently
December 17, 2023

We have created a product/extension that integrates with Airtable. It can extract text from a PDF file in an attachment field and save that text in a text field. It can process records in bulk, and you can set it to run regularly.

Andy_Cloke
Known Participant
August 12, 2025

For anyone still looking to extract text from PDFs in Airtable: we use Data Fetcher with the OpenAI Assistants API. It handles different PDF formats with high accuracy. No coding needed.

Quick setup:

1. Add the Data Fetcher extension
2. Create an assistant in OpenAI
3. Connect Airtable to OpenAI via API key
4. Map extracted text to Airtable fields

You can pull either raw text or structured data (like specific fields from invoices into separate Airtable fields). Set up triggers, and new attachments are processed automatically.

Guide here: https://datafetcher.com/blog/extract-data-pdfs-airtable-openai
