Case Study 2 - PDF Data Extraction

Pulling data from pesky PDFs

Project Summary

PDFs are a common file type in business processes, but they are among the most difficult to work with. Whether your process involves internal reports, financial statements, or invoices (as shown in this case study), receiving data in PDF format can slow down or entirely block the use of that data for other purposes. In the best-case scenario, you can cleanly copy and paste the data from your PDF into Excel. In the worst-case scenario, you or a member of your team spends unnecessary hours manually entering the data from each PDF file into an Excel workbook.

The best solution to this issue is usually to collect the data in a better format at the source, for example by using an API to pull it from the source system (see Case Study 1B). If that is not an option, a custom solution can be built to read the PDF files and extract the necessary information. For this solution to be repeatable, the PDFs need to be in the same or a very similar format each time the script is run.

The scenario below shows how a Python script can be built to extract the details from a collection of PDF invoices. The data is then saved to Excel, where it can be used in the required business process or visualized using a tool like Power BI.
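As a rough illustration of the approach, here is a minimal sketch rather than the script used in this case study. It assumes the invoices have a text layer and carry the labels "Invoice Number:", "Invoice Date:", and "Total Due:"; those labels, the function names, and the choice of the pypdf and openpyxl libraries are all assumptions made for the example.

```python
import re
from pathlib import Path

# Hypothetical field labels: adjust these patterns to match your own invoice layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "invoice_date": re.compile(r"Invoice Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_due": re.compile(r"Total Due:\s*\$?([\d,]+\.\d{2})"),
}


def parse_invoice_text(text):
    """Return a dict of extracted fields; a field is None if its label is not found."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1) if match else None
    # Convert the total to a number so Excel treats it as numeric.
    if record["total_due"] is not None:
        record["total_due"] = float(record["total_due"].replace(",", ""))
    return record


def read_pdf_text(pdf_path):
    """Concatenate the text of every page (assumes the PDF has a text layer)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def invoices_to_excel(pdf_folder, output_path):
    """Parse every PDF in a folder and write one row per invoice to a workbook."""
    from openpyxl import Workbook  # third-party: pip install openpyxl

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["file", *FIELD_PATTERNS])  # header row
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        record = parse_invoice_text(read_pdf_text(pdf_path))
        sheet.append([pdf_path.name, *(record[name] for name in FIELD_PATTERNS)])
    workbook.save(output_path)
```

A call such as invoices_to_excel("invoices", "invoices.xlsx") would process a folder of PDFs (both names are placeholders). Fields that come back empty are a useful signal that an invoice does not match the expected layout and needs manual review.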

Not all PDF files can be read reliably by Python or similar tools. Files with a lot of complexity, or whose layout changes from one version to the next (such as internal financial statements that change format each period), will require collecting the necessary data before the PDF file is created.

If your team is struggling with PDF files in its business processes, let’s connect and design a solution for your needs!
