Why is Extracting Text from PDFs Challenging?
Extract all the prices of these products from these catalogs. Oh, and they are in PDF format.
UPDATE AS OF 2024: A really cool Python library enables you to extract text from PDFs: https://github.com/Unstructured-IO
UPDATE AS OF 2022: You can upload PDF documents into IBM Cloud’s Watson Discovery, process them as text files, and extract text patterns from them: https://maeda-han.medium.com/pattern-induction-what-is-a-pattern-part-1-79ee1bd5adc6
Sooner or later, most of us need to extract data from PDFs. How hard could it be? Don’t you just write an extraction script for that? You know how to program in Python, right?
Well, the number of research papers written on the problem says otherwise. Extracting text from PDFs is challenging for a number of reasons:
- How varied are your PDFs? Are they just cooking instructions, where each document has an ingredients list followed by a list of steps? Are they like contracts, with paragraphs and bullet points of dense text and a set of signatures towards the end? Or are they something more varied, such as resumes, where the layout may be one column or two and the sections differ heavily from one resume to the next?
- What do you wish to extract? Do you want to extract universities from every section of the resumes? Or only from the section titled “Educational History”?
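To make the second question concrete, here is a minimal Python sketch of how the scope of extraction changes the result. It assumes the resume text has already been pulled out of the PDF as plain text, which is the hard part in practice. The sample resume, the list of section headings, and the university pattern are all made up for illustration and are far too naive for real documents.

```python
import re

# Hypothetical resume text, assumed to be already extracted from a PDF.
RESUME_TEXT = """\
Jane Doe

Work Experience
Research Assistant, University of Tokyo, 2019-2021

Educational History
B.S. Computer Science, Kyoto University, 2015-2019

Skills
Python, SQL
"""

# Assumed section headings; real resumes name and order these differently.
SECTION_HEADINGS = ["Work Experience", "Educational History", "Skills"]

# Deliberately naive pattern: "University of X" or "X University".
UNIVERSITY_PATTERN = re.compile(
    r"University of [A-Z][a-z]+|[A-Z][a-z]+ University"
)


def split_sections(text, headings):
    """Group non-empty lines under the most recent heading seen.

    Lines that appear before any heading are stored under the key None.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in headings:
            current = stripped
            sections.setdefault(current, [])
        elif stripped:
            sections.setdefault(current, []).append(stripped)
    return {name: "\n".join(lines) for name, lines in sections.items()}


def find_universities(text):
    """Return every substring that matches the naive university pattern."""
    return UNIVERSITY_PATTERN.findall(text)


sections = split_sections(RESUME_TEXT, SECTION_HEADINGS)

# Scope 1: search the whole resume.
everywhere = find_universities(RESUME_TEXT)

# Scope 2: search only the "Educational History" section.
education_only = find_universities(sections.get("Educational History", ""))

print(everywhere)
print(education_only)
```

Searching the whole resume also picks up the university where Jane worked, while the section-scoped search returns only the one she attended. The sketch only works because the headings sit on their own lines in a known order, which is exactly what a two-column or oddly formatted PDF tends to break.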