Large Scale Tax Document Processing Systems

Highlights

We partnered with a fast-growing digital tax preparation and financial information platform that lets consumers prepare tax returns approved by tax professionals and securely access their information. With an event-driven system and Optical Character Recognition (OCR), consumers can have their tax review processed entirely by a computer system in real time. The highly scalable solution streamlines the process for consumers and, through Data Lake integration, serves as a source of valuable client insights for professionals in the financial industry.

Success Metrics

Event-driven architecture ingests over 100,000 documents per hour, orchestrating serverless OCR to extract characters from documents
A process that typically takes accountants up to 2 weeks to deliver for their clients can be delivered the same day, including time for human verification
End-to-end ingestion, processing, and storage of document results in minutes
Highly reliable (retries for failures and process redundancies), highly scalable (event-driven architecture and serverless compute), and highly available (API gateway and serverless ingestion)

Industry

Financial Services

Headquarters

United States

Challenge

It is no secret that tax documents come in a variety of formats and levels of quality. When managing a high volume of variable tax documents, there is a risk of process failure due to non-standardized formats. The challenge was to develop a highly scalable, reliable, and durable system with redundancies for failed processing attempts, retries, and fallbacks. The system would need to process millions of documents a month during tax season with highly concurrent asynchronous processes including uploads, pre-processing, OCR, categorization, and storage. The system would also need to support multi-tenant data lake capabilities for documents and insights owned by particular institutions.

Solution

We began by reviewing the available DocAI parsers, specifically those for supported tax documents. We tested them for defects and limitations, assessed and scored their quality, and built a solution that covered a wide variety of document types and formats. For example, when dealing with multiple-type document parsers, we customized pipelines to recognize a mixture of document formats and aggregated tax files found within a single document.

We designed and implemented a streaming data architecture in which a document is backed up at each stage of the processing lifecycle, so that a failed stage can be retried from the last good copy. Redundancies were in place: when a document type was not recognized, a versatile OCR parsing solution processed that document instead. Within the parsing pipelines, we applied data scrubbing, anonymization of the data, access controls, and multi-tenant, globally unique blob storage. Serverless microservices were used for all compute workloads.
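The stage-level backup, retry, and fallback behavior described above can be illustrated with a minimal sketch. It is not the project's implementation: the parser functions, the W-2 check, and the in-memory `checkpoints` dictionary (standing in for durable blob storage) are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (name, function) pair; each function takes and returns a document dict.
Stage = Tuple[str, Callable[[dict], dict]]


class UnrecognizedDocument(Exception):
    """Raised by a specialized parser that does not recognize the document type."""


def specialized_parser(doc: dict) -> dict:
    # Hypothetical form-specific parser: in this sketch it only understands W-2 forms.
    if doc.get("form") != "W-2":
        raise UnrecognizedDocument(doc.get("form"))
    return {**doc, "parser": "specialized", "fields": {"wages": doc["text"]}}


def generic_ocr_parser(doc: dict) -> dict:
    # Hypothetical general-purpose OCR fallback: returns the raw text only.
    return {**doc, "parser": "generic_ocr", "fields": {"raw_text": doc["text"]}}


def parse_with_fallback(doc: dict) -> dict:
    """Try the specialized parser first, then fall back to generic OCR."""
    try:
        return specialized_parser(doc)
    except UnrecognizedDocument:
        return generic_ocr_parser(doc)


def run_pipeline(
    doc_id: str,
    doc: dict,
    stages: List[Stage],
    checkpoints: Dict[str, Dict[str, dict]],
    max_attempts: int = 3,
) -> dict:
    """Run stages in order, saving each stage's output so a rerun resumes
    after the last stage that succeeded instead of starting over."""
    saved = checkpoints.setdefault(doc_id, {})
    current = doc
    for name, stage in stages:
        if name in saved:
            # This stage already succeeded in an earlier run: reuse its output.
            current = saved[name]
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                current = stage(current)
                break
            except Exception:
                # A real system would retry only transient errors; this sketch retries all.
                if attempt == max_attempts:
                    raise
        saved[name] = current
    return current
```

Calling `run_pipeline` a second time with the same `doc_id` and `checkpoints` skips every stage that already has a saved result, which is the property that makes retrying at a failed stage cheap.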

From a process standpoint, we initiated two work streams, one for each project team, with an emphasis on frequent touch points. The Data Science team led the OCR process to extract valuable data from various document types using existing and newly built parsers, covering image and data quality processing, cleaning, and organization of data. In parallel, the engineering team built an event system architecture to upload, process, and store the extracted data. The new system was a series of asynchronous workloads that ran through custom data pipelines, with a reliable process and advanced monitoring to manage failures.
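To make "a series of asynchronous workloads" concrete, here is a small sketch using Python's `asyncio`. The stage functions, the in-memory `sink`, and the concurrency limit are assumptions made for illustration; the production system ran on serverless microservices rather than a single event loop.

```python
import asyncio
from typing import Dict, List, Tuple


async def preprocess(doc: dict) -> dict:
    await asyncio.sleep(0)  # stands in for I/O such as downloading the upload
    return {**doc, "preprocessed": True}


async def ocr(doc: dict) -> dict:
    await asyncio.sleep(0)  # stands in for a call to an OCR service
    text = doc["text"]  # raises KeyError when the document has no content
    return {**doc, "extracted": text.upper()}


async def store(doc: dict, sink: Dict[str, dict]) -> None:
    await asyncio.sleep(0)  # stands in for a write to blob storage
    sink[doc["id"]] = doc


async def process_document(doc: dict, sink: Dict[str, dict], limit: asyncio.Semaphore) -> None:
    # The semaphore caps how many documents are in flight at the same time.
    async with limit:
        doc = await preprocess(doc)
        doc = await ocr(doc)
        await store(doc, sink)


async def process_batch(docs: List[dict], max_concurrency: int = 10) -> Tuple[Dict[str, dict], List[str]]:
    """Process documents concurrently; return stored results and the ids that failed."""
    sink: Dict[str, dict] = {}
    limit = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(
        *(process_document(d, sink, limit) for d in docs),
        return_exceptions=True,  # one bad document must not cancel the others
    )
    failures = [d["id"] for d, r in zip(docs, results) if isinstance(r, Exception)]
    return sink, failures


stored, failed_ids = asyncio.run(
    process_batch([{"id": "a", "text": "w-2"}, {"id": "b"}, {"id": "c", "text": "1099"}])
)
```

Collecting failures instead of raising keeps one unreadable document from blocking the rest of the batch, and the list of failed ids is what a monitoring or retry component would consume.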

Results

Benefits of design decisions include:

Event-driven architecture

Unlimited node additions for new document types and workflows
Real-time stream processing to reduce the industry-standard turnaround from weeks to minutes
Highly concurrent, asynchronous compute with minimal bottlenecks
Record keeping of events at different nodes to enable reliability and redundancies
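The points above about adding nodes and keeping a record of events can be sketched with a tiny in-process event bus. This is an illustration only: the `EventBus` class and the event names are hypothetical, and a production system would use a managed messaging service rather than an in-memory queue.

```python
from collections import defaultdict, deque
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[str, dict]
Handler = Callable[[dict], Optional[List[Event]]]


class EventBus:
    """Routes events to subscribed handlers and records every event it sees."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)
        self.log: List[Event] = []  # record of every event, in processing order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        # Adding a node for a new document type or workflow is one more subscription.
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        queue = deque([(event_type, payload)])
        while queue:
            etype, data = queue.popleft()
            self.log.append((etype, data))
            for handler in self._handlers[etype]:
                # A handler may return follow-up events for downstream nodes.
                for follow_up in handler(data) or []:
                    queue.append(follow_up)


bus = EventBus()
stored_ids: List[str] = []

# Classification node: turns an upload into a type-specific event.
bus.subscribe("document.uploaded", lambda d: [("document.classified." + d["form"], d)])
# Storage node for W-2 forms.
bus.subscribe("document.classified.w2", lambda d: stored_ids.append(d["id"]))

bus.publish("document.uploaded", {"id": "doc-1", "form": "w2"})

# Supporting a new document type later only needs one more subscription.
bus.subscribe("document.classified.1099", lambda d: stored_ids.append(d["id"]))
bus.publish("document.uploaded", {"id": "doc-2", "form": "1099"})
```

Because existing handlers are untouched when a new subscription is added, new document types can be introduced without redeploying the nodes that already work.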

Serverless compute

Elastic scaling to meet the required document throughput and system availability

Improved Security with:

Short-lived signed URLs
Least privilege IAM and bucket access
Shared VPC
Isolation of production and testing environments
API Gateway and credential-led access
Vertical integration of document processing removed reliance on third-party APIs, keeping more traffic within the VPC
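As an illustration of the short-lived signed URL idea in the list above, the sketch below signs a URL and its expiry time with an HMAC. It shows the general concept only and is not the signing scheme of any cloud storage provider; the `expires` and `signature` parameter names and the shared secret are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse


def _signature(base_url: str, expires: int, secret: bytes) -> str:
    # The signature covers both the resource and its expiry time.
    message = f"{base_url}?expires={expires}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url: str, secret: bytes, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return base_url with an expiry timestamp and an HMAC-SHA256 signature appended."""
    issued_at = time.time() if now is None else now
    expires = int(issued_at) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(base_url, expires, secret)})
    return f"{base_url}?{query}"


def verify_url(url: str, secret: bytes, now: Optional[float] = None) -> bool:
    """Accept the URL only if the signature matches and it has not expired."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
    expected = _signature(base_url, expires, secret)
    current = time.time() if now is None else now
    # compare_digest avoids leaking how many characters of the signature matched.
    return hmac.compare_digest(signature.encode(), expected.encode()) and current <= expires
```

A URL signed with a five-minute lifetime stops verifying once that window passes, and changing the path (for example to another tenant's document) invalidates the signature, so a leaked link has limited value.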

Best Practices adopted:

Event reprocessing for disaster recovery
Terraform IaC (Infrastructure as Code)
Custom SQL based logging of events
Database metrics
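SQL-based event logging and event reprocessing fit together naturally: once each stage outcome is a row, a failed stage can be found and replayed with a query. The sketch below uses SQLite from the Python standard library; the table layout, status values, and `handlers` mapping are hypothetical, and the database actually used in the project is not stated.

```python
import json
import sqlite3
from typing import Callable, Dict, List, Tuple


def open_event_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               doc_id TEXT NOT NULL,
               stage TEXT NOT NULL,
               status TEXT NOT NULL,   -- 'succeeded' or 'failed' in this sketch
               payload TEXT NOT NULL   -- JSON-encoded stage input or output
           )"""
    )
    return conn


def record_event(conn: sqlite3.Connection, doc_id: str, stage: str, status: str, payload: dict) -> None:
    conn.execute(
        "INSERT INTO events (doc_id, stage, status, payload) VALUES (?, ?, ?, ?)",
        (doc_id, stage, status, json.dumps(payload)),
    )
    conn.commit()


def failed_events(conn: sqlite3.Connection) -> List[Tuple[str, str, dict]]:
    """Return (doc_id, stage, payload) where the most recent event for that pair failed."""
    rows = conn.execute(
        """SELECT e.doc_id, e.stage, e.payload
           FROM events AS e
           WHERE e.status = 'failed'
             AND e.id = (SELECT MAX(id) FROM events
                         WHERE doc_id = e.doc_id AND stage = e.stage)
           ORDER BY e.id"""
    ).fetchall()
    return [(doc_id, stage, json.loads(payload)) for doc_id, stage, payload in rows]


def reprocess(conn: sqlite3.Connection, handlers: Dict[str, Callable[[dict], dict]]) -> int:
    """Replay every stage whose latest attempt failed and log the new outcome."""
    replayed = 0
    for doc_id, stage, payload in failed_events(conn):
        result = handlers[stage](payload)
        record_event(conn, doc_id, stage, "succeeded", result)
        replayed += 1
    return replayed
```

Keeping the stage input in the log is what makes replay possible after an outage: recovery becomes a query plus a loop rather than a request to re-upload documents.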

Serverless OCR and Document Type Extraction

Document types supported increased by 300% through Google parser types
Fallback parsers raised the share of documents processed by AI to 100%

