How do you analyze years’ worth of contract agreements when those contracts exist only in hard copy, in filing cabinets? And what if the data you need isn’t located in the same place on each document because those contracts took on different layouts throughout the years? What if some documents were originally printed in poor quality? Or if the scans are crooked or grainy, or have lines running through them, background noise, or other artifacts caused by bad scanner elements?
And finally, what if there isn’t a one-size-fits-all library available to solve your needs?
Understanding the Contract Documents
In our case, we needed to extract text from a specific table nested somewhere inside each contract. The cells in that table would effectively alternate, making for nice key/value pairs (e.g. “Contract start date” and “January 1, 2020”). However, various cells were row-spanning or column-spanning.
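The alternating label/value structure can be illustrated with a minimal sketch. It assumes the cell texts have already been read out in order and that spanning cells have been resolved. The function name `pair_cells` and the second sample row are hypothetical and used only for illustration.

```python
from typing import Dict, List


def pair_cells(cell_texts: List[str]) -> Dict[str, str]:
    """Pair alternating cells: even positions are labels, odd positions are values."""
    if len(cell_texts) % 2 != 0:
        raise ValueError("expected an even number of cells (label, value, ...)")
    labels = cell_texts[0::2]
    values = cell_texts[1::2]
    return {label.strip(): value.strip() for label, value in zip(labels, values)}


# Hypothetical sample data; only the first pair comes from the article.
cells = [
    "Contract start date", "January 1, 2020",
    "Contract end date", "December 31, 2020",
]
fields = pair_cells(cells)
```

Row-spanning and column-spanning cells break this simple ordering, which is why the cell layout has to be mapped before any pairing can happen.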
Divide and Conquer
Our starting point was understanding which tools were available, as well as their respective capabilities and limitations. While Amazon Textract was our tool of choice for OCR, we found that Amazon’s tools struggled to accurately organize the text into table cells, especially when the layout involved cells spanning rows and columns. We needed a tool to map the cells and thus delineate where one block of text starts and stops. In the absence of such a tool on the market, we developed one in-house, resulting in a “Form Extractor.”
Source PDFs
The source documents were multi-page contracts. Unfortunately, the contract layout was not homogeneous due to formatting and legal language changes over time. Our first step was breaking the PDFs into separate pages in order to generate an image for each page.
Amazon Textract
Our second step was to extract the text from each image. If an image was found to have the required table, we stored the page. All other pages were ignored.
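One way to decide whether a page is worth keeping is sketched below. It works on a dictionary shaped like a Textract text-detection response (a `Blocks` list whose `LINE` entries carry a `Text` field). The article does not say how the required table was recognized, so the marker-phrase check and the names `page_lines` and `has_required_table` are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List


def page_lines(response: Dict[str, Any]) -> List[str]:
    """Collect the text of every LINE block in a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def has_required_table(response: Dict[str, Any], marker_phrases: Iterable[str]) -> bool:
    """Assumed heuristic: keep the page only if every marker phrase appears on it."""
    text = " ".join(page_lines(response)).lower()
    return all(phrase.lower() in text for phrase in marker_phrases)


# Hand-built sample responses; in practice these would come from the OCR service.
table_page = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Contract start date"},
        {"BlockType": "LINE", "Text": "January 1, 2020"},
    ]
}
other_page = {"Blocks": [{"BlockType": "LINE", "Text": "General terms and conditions"}]}

markers = ["contract start date"]  # hypothetical marker phrase
kept = [page for page in (table_page, other_page) if has_required_table(page, markers)]
```

Pages that fail the check are simply dropped, so the more expensive image analysis runs only on pages likely to contain the table.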
Six Feet Up’s “Form Extractor”
The stored pages, along with the text output provided by Amazon Textract, were then run through our custom-built “Form Extractor.” First, our Form Extractor made adjustments for color correction and various other types of image repair. Then, the Form Extractor looked for lines and constructed what it believed was the layout of the table. Cells were then clustered into forms, and the forms were checked for fiducial values to validate which form had been detected. Finally, a repair process made inferences from the surrounding cells to ensure the form definition was complete. For example, perhaps there was a cell along the bottom row with borders that were too faded for Form Extractor to detect. If Form Extractor identified cells on two or more sides of an otherwise unidentified region, it could infer that a cell existed and account for it.
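The repair step can be illustrated with a simplified sketch. It is not the Form Extractor's actual code: it assumes cells sit on a uniform grid addressed by (row, column), ignores spanning cells, and runs a single pass. The name `infer_missing_cells` and the `min_sides` parameter are hypothetical.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, column) position on an assumed uniform grid


def infer_missing_cells(detected: Set[Cell], min_sides: int = 2) -> Set[Cell]:
    """Return grid positions that were not detected but have detected
    cells on at least `min_sides` of their four sides."""
    if not detected:
        return set()
    rows = [r for r, _ in detected]
    cols = [c for _, c in detected]
    inferred: Set[Cell] = set()
    # Only look inside the bounding box of the cells that were found.
    for r in range(min(rows), max(rows) + 1):
        for c in range(min(cols), max(cols) + 1):
            if (r, c) in detected:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            sides = sum(1 for n in neighbours if n in detected)
            if sides >= min_sides:
                inferred.add((r, c))
    return inferred


# A 3x3 table where the bottom-middle cell's borders were too faded to detect.
detected_cells = {(r, c) for r in range(3) for c in range(3)} - {(2, 1)}
repaired = infer_missing_cells(detected_cells)
```

Here the missing cell has detected neighbours above, to the left, and to the right, so it is added back to the form definition.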
Final Output
The program ultimately returned a JSON object with the recognized text placed into key/value pairs for further analysis.
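As a rough illustration of what such an object might look like, the sketch below serializes extracted pairs with Python's standard `json` module. The schema (`document`, `page`, `fields`), the file name, and the function name `build_output` are assumptions; the article does not describe the actual output format.

```python
import json
from typing import Dict


def build_output(document_id: str, page_number: int, fields: Dict[str, str]) -> str:
    """Serialize extracted key/value pairs into a JSON string (hypothetical schema)."""
    payload = {"document": document_id, "page": page_number, "fields": fields}
    return json.dumps(payload, indent=2, sort_keys=True)


# Hypothetical document name and page number, for illustration only.
output = build_output(
    "contract-0001.pdf",
    3,
    {"Contract start date": "January 1, 2020"},
)
```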
FactSet deployed the program on AWS with resources sufficient to process each document in 0.1 seconds. As a result, years’ worth of contracts were processed quickly and the relevant data was digitized for further use.