Scalable Smart Document Management with Amazon Bedrock Data Automation

Intelligent document processing (IDP) refers to technology that automates the extraction, analysis, and interpretation of essential data from various documents. Utilizing advanced machine learning (ML) and natural language processing algorithms, IDP solutions effectively extract and process structured data from unstructured text, thereby streamlining workflows that are document-centric.

When enhanced with generative AI features, IDP empowers organizations to revolutionize their document workflows through sophisticated understanding, structured data extraction, and automated categorization. IDP solutions driven by generative AI are adept at accommodating document varieties that traditional ML models may not have encountered. This advanced technological synergy proves beneficial across numerous industries, such as child support services, insurance, healthcare, financial services, and the public sector. While traditional manual processing causes bottlenecks and heightens the risk of errors, implementing these advanced solutions allows organizations to significantly boost their document workflow efficiency and information retrieval capabilities. AI-augmented IDP solutions enhance service delivery while alleviating the administrative burden in various document processing scenarios.

This methodology for document processing affords scalable, efficient, and high-value processing, leading to improved productivity, cost reductions, and better decision-making. Enterprises leveraging IDP augmented with generative AI can attain greater efficiency, an enhanced experience for customers, and accelerated growth.

In the blog post Scalable intelligent document processing using Amazon Bedrock, we illustrated the creation of a scalable IDP pipeline using Anthropic foundation models on Amazon Bedrock. While that method provided solid performance, the advent of Amazon Bedrock Data Automation brings a new level of efficiency and adaptability to IDP solutions. This post examines how Amazon Bedrock Data Automation enhances document processing capabilities and simplifies the automation journey.

Benefits of Amazon Bedrock Data Automation

Amazon Bedrock Data Automation introduces numerous features that notably enhance the scalability and accuracy of IDP solutions:

  • Confidence scores and bounding box data – Amazon Bedrock Data Automation offers confidence scores and bounding box data, which bolster data explainability and transparency. These features allow you to evaluate the reliability of extracted information, leading to more informed decision-making. For instance, low confidence scores may indicate the necessity for further human review or verification of specific data fields.
  • Blueprints for rapid development – Amazon Bedrock Data Automation supplies pre-built blueprints that ease the development of document processing pipelines, enabling quick creation and deployment of solutions. It also provides flexible output configurations to cater to diverse processing needs. For straightforward extraction tasks like OCR and layout, or for linearized text output from documents, standard outputs are available. For customized outputs, you can either design a unique extraction schema from scratch or utilize preconfigured blueprints from our catalog as a foundation, tailoring your blueprint to meet specific document types and business needs for more precise and targeted information retrieval.
  • Automatic classification support – With Amazon Bedrock Data Automation, documents are split and matched to the appropriate blueprints, ensuring exact document categorization. This intelligent routing reduces the need for manual document sorting, significantly cutting down on human intervention and expediting processing time.
  • Normalization – Amazon Bedrock Data Automation addresses a common challenge in IDP by implementing a comprehensive normalization framework that manages both key normalization (mapping various field labels to standardized names) and value normalization (converting extracted data into consistent formats, units, and data types). This method reduces the complexities of data processing so that organizations can automatically convert raw document extractions into standardized data that smoothly integrates with their existing systems and workflows.
  • Transformation – The transformation feature of Amazon Bedrock Data Automation converts complex document fields into structured, business-ready data by automatically decomposing combined information (such as addresses or names) into discrete, meaningful components. This functionality simplifies how organizations manage diverse document formats and aids teams in defining custom data types and field relationships that correspond with their existing database schemas and business applications.
  • Validation – Amazon Bedrock Data Automation bolsters document processing accuracy through the use of automated validation rules for extracted data, incorporating numeric ranges, date formats, string patterns, and cross-field checks. This validation framework aids organizations in automatically identifying data quality issues, initiating human reviews when necessary, and ensuring that extracted data meets specific business rules and compliance standards prior to entering downstream systems.

Solution overview

The diagram below depicts a fully serverless architecture that employs Amazon Bedrock Data Automation alongside AWS Step Functions and Amazon Augmented AI (Amazon A2I) to provide cost-effective scaling for document processing workloads of varying sizes.

AWS Architecture Diagram Showing Document Processing using Amazon Bedrock Data Automation and Human in the Loop

The Step Functions workflow processes various document types, including multipage PDFs and images, utilizing Amazon Bedrock Data Automation. It employs diverse Amazon Bedrock Data Automation blueprints (both standard and custom) within a single project to enable the processing of various document types such as immunization documents, conveyance tax certificates, child support services enrollment forms, and driver’s licenses.

The workflow processes a file (PDF, JPG, PNG, TIFF, DOC, DOCX) that may contain a single document or multiple documents according to the following steps:

  1. For multi-page documents, it splits along logical document boundaries
  2. Matches each document to the corresponding blueprint
  3. Applies the specific extraction instructions of the blueprint to retrieve information from each document
  4. Performs normalization, transformation, and validation on the extracted data based on the specifications in the blueprint

The Step Functions Map state is employed to process each document. If a document meets the confidence threshold, the output is sent to an Amazon Simple Storage Service (Amazon S3) bucket. If any extracted data falls below the confidence threshold, the document is referred to Amazon A2I for human review. Reviewers utilize the Amazon A2I UI with bounding box highlighting for selected fields to verify the extraction outcomes. Once the human review is finalized, a callback task token is utilized to resume the state machine, and human-reviewed output is dispatched to an S3 bucket.

To deploy this solution within an AWS account, please follow the instructions provided in the accompanying GitHub repository.

In the upcoming sections, we will discuss the specific features of Amazon Bedrock Data Automation implemented using this solution, employing the example of a child support enrollment form.

Automated Classification

In our implementation, we designate the document class name for each custom blueprint created, as demonstrated in the following screenshot. When processing multiple document types like driver’s licenses and child support enrollment forms, the system automatically applies the appropriate blueprint based on content analysis, ensuring that the correct extraction logic is utilized for each document type.

Bedrock Data Automation interface showing Child Support Form classification detail

Data Normalization

The data normalization process ensures that downstream systems receive consistently formatted data. We utilize both explicit extractions (for well-defined information clearly visible in the document) and implicit extractions (for information that requires transformation). For example, as depicted in the following screenshot, dates of birth are standardized to the YYYY-MM-DD format.

Bedrock Data Automation interface displaying extracted and normalized Date of Birth data

Likewise, Social Security Numbers are reformatted to XXX-XX-XXXX.

Data Transformation

For the child support enrollment application, we have implemented custom data transformations to align extracted data with specific requirements. For instance, our custom data type for addresses breaks down single-line addresses into structured fields (Street, City, State, ZipCode). These structured fields are reused across different address fields in the enrollment form (employer address, home address, other parent address), leading to consistent formatting and straightforward integration with existing systems.

Amazon Bedrock Data Automation interface displaying custom address type configuration with explicit field mappings

Data Validation

Our implementation encompasses validation rules to ensure data accuracy and compliance. In our example use case, we have implemented two validations: 1. verify the presence of the enrollee’s signature and 2. verify that the signed date isn’t in the future.

Bedrock extraction interface showing signature and date validation configurations

The subsequent screenshot illustrates the results of the validation rules applied to the document.

Amazon Bedrock-powered document automation showing form field validation, signature verification, and confidence scoring

Human-in-the-loop validation

The following screenshot demonstrates the extraction process, which includes a confidence score and is integrated with a human-in-the-loop process. It also illustrates the normalization applied to the date of birth format.

bda Human in the loop

Conclusion

Amazon Bedrock Data Automation greatly enhances IDP by introducing confidence scoring, bounding box data, automatic classification, and rapid development through blueprints. In this post, we highlighted how to leverage its advanced capabilities for data normalization, transformation, and validation. Upgrading to Amazon Bedrock Data Automation allows organizations to considerably reduce development time, enhance data quality, and create more robust, scalable IDP solutions that integrate with human review processes.

Stay informed about new capabilities and use cases for Amazon Bedrock by following the AWS Machine Learning Blog.


About the authors

Abdul NavazAbdul Navaz is a Senior Solutions Architect in the Amazon Web Services (AWS) Health and Human Services team, located in Dallas, Texas. With over a decade of experience at AWS, he focuses on modernization solutions for child support and child welfare agencies utilizing AWS services. Before taking on his role as a Solutions Architect, Navaz served as a Senior Cloud Support Engineer, focusing on networking solutions.

Venkata Kampana is a senior solutions architect in the Amazon Web Services (AWS) Health and Human Services team, based in Sacramento, Calif. In his role, he assists public sector customers in accomplishing their mission objectives with well-designed solutions on AWS.

Sanjeev PulapakaSanjeev Pulapaka is a principal solutions architect and AI lead for the public sector. Sanjeev is a published author with numerous blogs and a book on generative AI. He is also a noted speaker at various events including re:Invent and Summit. Sanjeev holds an undergraduate degree in engineering from the Indian Institute of Technology and an MBA from the University of Notre Dame.



Source link

Alex Parker

Alex Parker is a tech enthusiast and digital tools reviewer with over a decade of experience exploring software solutions that boost productivity. He specializes in file management, conversion technologies, and emerging AI-driven applications, helping readers choose the right tools for their needs.