We partnered with a fast-growing digital tax preparation and financial information platform that lets consumers prepare tax returns approved by tax professionals and securely access their information. With an event-driven system and Optical Character Recognition (OCR), a consumer's tax review can be processed entirely by a computer system in real time. The highly scalable solution streamlines the process for consumers and, through Data Lake integration, serves as a source of valuable client insights for professionals in the financial industry.
It is no secret that tax documents come in a variety of formats and levels of quality. When managing a high volume of variable tax documents, non-standardized formats create a risk of process failure. The challenge was to develop a highly scalable, reliable, and durable system with redundancies for failed processing attempts, including retries and fallbacks. The system would need to process millions of documents a month during tax season, using highly concurrent asynchronous processes for uploads, pre-processing, OCR, categorization, and storage. It would also need to provide multi-tenant data lake capabilities for documents and insights owned by particular institutions.
We began by reviewing the available DocAI parsers, specifically those for supported tax documents. We tested them for defects and limitations, assessed and scored them for quality, and built a solution that covered a wide variety of document types and formats. For example, when dealing with multiple-type document parsers, we customized pipelines to recognize a mixture of document formats and aggregated tax files found within a single document.
We designed and implemented a streaming data architecture in which each document is backed up at every stage of the processing lifecycle, so that processing can be retried from a failed stage. As a redundancy, when a document type was not recognized, a versatile OCR parsing solution processed that document instead. Within the parsing pipelines, we applied data scrubbing, data anonymization, access controls, and multi-tenant, globally unique blob storage. Serverless microservices were used for all compute workloads.
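To make the retry and fallback idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the platform's actual implementation: `CheckpointStore`, `run_stage`, `parse_document`, the stage names, and the `doc_type` field are all hypothetical, and an in-memory dictionary stands in for durable blob storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class CheckpointStore:
    """Hypothetical in-memory stand-in for durable per-stage backups."""
    saved: Dict[Tuple[str, str], dict] = field(default_factory=dict)

    def save(self, doc_id: str, stage: str, payload: dict) -> None:
        self.saved[(doc_id, stage)] = dict(payload)

    def load(self, doc_id: str, stage: str) -> Optional[dict]:
        return self.saved.get((doc_id, stage))


def run_stage(store: CheckpointStore, doc_id: str, stage: str,
              func: Callable[[dict], dict], payload: dict,
              max_attempts: int = 3) -> dict:
    """Back up the stage input, then retry the stage from that backup on failure."""
    store.save(doc_id, f"{stage}:input", payload)
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            # Each attempt starts from a fresh copy of the backed-up input.
            result = func(dict(store.load(doc_id, f"{stage}:input")))
            store.save(doc_id, f"{stage}:output", result)
            return result
        except Exception as exc:  # assumption: any error is treated as retryable
            last_error = exc
    raise RuntimeError(f"{stage} failed after {max_attempts} attempts") from last_error


def parse_document(payload: dict,
                   specialized_parsers: Dict[str, Callable[[dict], dict]],
                   generic_ocr: Callable[[dict], dict]) -> dict:
    """Use a type-specific parser when one exists, otherwise fall back to generic OCR."""
    parser = specialized_parsers.get(payload.get("doc_type"))
    if parser is None:
        return {**generic_ocr(payload), "parser": "generic_ocr"}
    return {**parser(payload), "parser": payload["doc_type"]}
```

Because the stage input is saved before the first attempt, a failure in one stage never requires re-running earlier stages; only the failed stage is repeated.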
From a process standpoint, we initiated two work streams, one for each project team, with an emphasis on frequent touch points. The Data Science team led the OCR process to extract valuable data from various document types using existing and newly built parsers, which included processing for image and data quality, cleaning, and organizing the data. In parallel, the engineering team built an event-driven architecture to upload, process, and store the extracted data. The new system was a series of asynchronous workloads running through custom data pipelines, with a reliable process and advanced monitoring to manage failures.
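The shape of such an asynchronous pipeline can be sketched in Python with `asyncio` queues standing in for the event system. This is a simplified illustration, not the production design: the stage functions (`preprocess`, `ocr`, `store`), the `corrupt` flag, and the `failures` list used for monitoring are all assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Stage = Callable[[dict], Awaitable[dict]]


async def worker(stage: Stage, inbox: asyncio.Queue, outbox: asyncio.Queue,
                 failures: List[Tuple[str, str, str]]) -> None:
    """Consume events for one stage; record failures instead of crashing."""
    while True:
        event = await inbox.get()
        try:
            await outbox.put(await stage(event))
        except Exception as exc:
            failures.append((event["doc_id"], stage.__name__, str(exc)))
        finally:
            inbox.task_done()


async def run_pipeline(events: List[dict], stages: List[Stage],
                       workers_per_stage: int = 2):
    """Chain stages with queues and run several concurrent workers per stage."""
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    failures: List[Tuple[str, str, str]] = []
    tasks = [
        asyncio.create_task(worker(stage, queues[i], queues[i + 1], failures))
        for i, stage in enumerate(stages)
        for _ in range(workers_per_stage)
    ]
    for event in events:
        await queues[0].put(event)
    # Drain in order: once a queue is joined, each of its events has either
    # been forwarded to the next queue or recorded as a failure.
    for queue in queues[:-1]:
        await queue.join()
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get_nowait())
    return results, failures


# Hypothetical stages for illustration only.
async def preprocess(event: dict) -> dict:
    return {**event, "clean": True}


async def ocr(event: dict) -> dict:
    if event.get("corrupt"):
        raise ValueError("unreadable image")
    return {**event, "text": "extracted text"}


async def store(event: dict) -> dict:
    return {**event, "stored": True}
```

Calling `asyncio.run(run_pipeline(events, [preprocess, ocr, store]))` returns the documents that completed every stage together with a list of `(doc_id, stage, error)` entries, which is the kind of record a monitoring or retry process could act on.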
Benefits of design decisions include:
Event-driven architecture
Serverless compute
Improved Security with:
Best Practices adopted:
Serverless OCR and Document Type Extraction