A post from Amazon AWS : Automate invoice processing with Streamlit and Amazon Bedrock

Invoice processing is a critical yet often cumbersome task for businesses of all sizes, especially for large enterprises dealing with invoices from multiple vendors with varying formats. The sheer volume of data, coupled with the need for accuracy and efficiency, can make invoice processing a significant challenge. Invoices can vary widely in format, structure, and content, making efficient processing at scale difficult. Traditional methods relying on manual data entry or custom scripts for each vendor’s format can not only lead to inefficiencies, but can also increase the potential for errors, resulting in financial discrepancies, operational bottlenecks, and backlogs.

To extract key details such as invoice numbers, dates, and amounts, we use Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

In this post, we provide a step-by-step guide with the building blocks needed for creating a Streamlit application to process and review invoices from multiple vendors. Streamlit is an open source framework for data scientists to efficiently create interactive web-based data applications in pure Python. We use Anthropic’s Claude 3 Sonnet model in Amazon Bedrock and Streamlit for building the application front-end.

Solution overview

This solution uses the Amazon Bedrock Knowledge Bases chat with document feature to analyze and extract key details from your invoices, without needing a knowledge base. The results are shown in a Streamlit app, with the invoices and extracted information displayed side-by-side for quick review. Importantly, your document and data are not stored after processing.

The storage layer uses Amazon Simple Storage Service (Amazon S3) to hold the invoices that business users upload. After uploading, you can set up a regular batch job to process these invoices, extract key information, and save the results in a JSON file. In this post, we save the data in JSON format, but you can also choose to store it in your preferred SQL or NoSQL database.

The application layer uses Streamlit to display the PDF invoices alongside the extracted data from Amazon Bedrock. For simplicity, we deploy the app locally, but you can also run it on Amazon SageMaker Studio, Amazon Elastic Compute Cloud (Amazon EC2), or Amazon Elastic Container Service (Amazon ECS) if needed.

Prerequisites

To perform this solution, complete the following:

Install dependencies and clone the example

To get started, install the necessary packages on your local machine or on an EC2 instance. If you’re new to Amazon EC2, refer to the Amazon EC2 User Guide. In this tutorial, we use the local machine for project setup.

To install dependencies and clone the example, follow these steps:

  1. Clone the repository into a local folder:
    git clone https://github.com/aws-samples/genai-invoice-processor.git
  2. Install Python dependencies
    • Navigate to the project directory:
      cd </path/to/your/folder>/genai-invoice-processor
    • Upgrade pip
      python3 -m pip install --upgrade pip
    • (Optional) Create a virtual environment to isolate dependencies:
      python3 -m venv venv
    • Activate the virtual environment:
      1. Mac/Linux:
        source venv/bin/activate
      2. Windows:
        venv\Scripts\activate
  3. In the cloned directory, invoke the following to install the necessary Python packages:
    pip install -r requirements.txt

    This will install the necessary packages, including Boto3 (AWS SDK for Python), Streamlit, and other dependencies.

  4. Update the region in the config.yaml file to the same Region set for your AWS CLI where Amazon Bedrock and Anthropic’s Claude 3 Sonnet model are available.

After completing these steps, the invoice processor code will be set up in your local environment and will be ready for the next stages to process invoices using Amazon Bedrock.

Process invoices using Amazon Bedrock

Now that the environment setup is done, you’re ready to start processing invoices and deploying the Streamlit app. To process invoices using Amazon Bedrock, follow these steps:

Store invoices in Amazon S3

Store invoices from different vendors in an S3 bucket. You can upload them directly using the console, API, or as part of your regular business process. Follow these steps to upload using the CLI:

  1. Create an S3 bucket:
    aws s3 mb s3://<your-bucket-name> --region <your-region>

    Replace <your-bucket-name> with the name of the bucket you want to create and <your-region> with the Region set for your AWS CLI and in config.yaml (for example, us-east-1).

  2. Upload invoices to the S3 bucket. Use one of the following commands:
    • To upload invoices to the root of the bucket:
      aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/ --recursive
    • To upload invoices to a specific folder (for example, invoices):
      aws s3 cp </path/to/your/folder> s3://<your-bucket-name>/<prefix>/ --recursive
    • Validate the upload:
      aws s3 ls s3://<your-bucket-name>/
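
The same upload can also be done through the API. The following is a minimal sketch, assuming a Boto3 S3 client is passed in; the function name upload_invoices is hypothetical and not part of the sample repository. Unlike aws s3 cp --recursive, this sketch uploads only files ending in .pdf, because those are the only files processed later.

```python
import os
from typing import List


def upload_invoices(s3_client, local_folder: str, bucket_name: str, prefix: str = "") -> List[str]:
    """Upload every PDF under local_folder to the bucket and return the S3 keys used."""
    uploaded_keys = []
    clean_prefix = prefix.strip("/")
    for root, _dirs, files in os.walk(local_folder):
        for file_name in sorted(files):
            if not file_name.lower().endswith(".pdf"):
                continue  # Skip anything that is not a PDF invoice
            local_path = os.path.join(root, file_name)
            # Keep the folder layout, using forward slashes as S3 keys expect
            relative_path = os.path.relpath(local_path, local_folder).replace(os.sep, "/")
            key = f"{clean_prefix}/{relative_path}" if clean_prefix else relative_path
            # upload_file(Filename, Bucket, Key) is the Boto3 S3 client call
            s3_client.upload_file(local_path, bucket_name, key)
            uploaded_keys.append(key)
    return uploaded_keys
```

With a real client, a call could look like upload_invoices(boto3.client('s3'), '</path/to/your/folder>', '<your-bucket-name>', 'invoices').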

Process invoices with Amazon Bedrock

In this section, you will process the invoices in Amazon S3 and store the results in a JSON file (processed_invoice_output.json). You will extract the key details from the invoices (such as invoice numbers, dates, and amounts) and generate summaries.

You can trigger the processing of these invoices using the AWS CLI or automate the process with an Amazon EventBridge rule or AWS Lambda trigger. For this walkthrough, we will use the AWS CLI to trigger the processing.
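
If you choose the AWS Lambda route, the handler first needs to work out which invoices triggered it. The following sketch assumes the function receives S3 event notifications directly (EventBridge delivers a different event shape); extract_pdf_objects and the handler body are hypothetical and not part of the sample repository.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def extract_pdf_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for PDF objects named in an S3 event notification."""
    pdf_objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(s3_info.get("object", {}).get("key", ""))
        if bucket and key.lower().endswith(".pdf"):
            pdf_objects.append((bucket, key))
    return pdf_objects


def lambda_handler(event, context):
    """Hypothetical Lambda entry point: report which invoices would be processed."""
    targets = extract_pdf_objects(event)
    # A real handler would call the invoice processing logic for each target here
    return {"invoices": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```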

We packaged the processing logic in the Python script invoices_processor.py, which can be run as follows:

python invoices_processor.py --bucket_name=<your-bucket-name> --prefix=<your-folder>

The --prefix argument is optional. If omitted, all of the PDFs in the bucket will be processed. For example:

python invoices_processor.py --bucket_name='gen_ai_demo_bucket'

or

python invoices_processor.py --bucket_name='gen_ai_demo_bucket' --prefix='invoice'
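
The command-line interface described above can be expressed with argparse, which the script imports. This is a sketch of how the two arguments might be declared, not the repository's exact code; the help strings and the parse_args function name are assumptions.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the bucket name (required) and the S3 prefix (optional)."""
    parser = argparse.ArgumentParser(
        description="Batch process PDF invoices stored in Amazon S3"
    )
    parser.add_argument("--bucket_name", required=True,
                        help="Name of the S3 bucket that holds the invoices")
    # An empty prefix means every PDF in the bucket is processed
    parser.add_argument("--prefix", default="",
                        help="Optional S3 prefix (folder) to restrict processing")
    return parser.parse_args(argv)
```

Passing argv explicitly keeps the function easy to exercise without touching sys.argv; calling parse_args() with no arguments reads the real command line.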

Use the solution

This section examines the invoices_processor.py code. You can chat with your document either on the Amazon Bedrock console or by using the Amazon Bedrock RetrieveAndGenerate API (SDK). In this tutorial, we use the API approach.

    1. Initialize the environment: The script imports the necessary libraries and initializes the Amazon S3 and Amazon Bedrock clients.
      import boto3
      import os
      import json
      import shutil
      import argparse
      import time
      import datetime
      import yaml
      from typing import Dict, Any, Tuple
      from concurrent.futures import ThreadPoolExecutor, as_completed
      from threading import Lock
      from mypy_boto3_bedrock_runtime.client import BedrockRuntimeClient
      from mypy_boto3_s3.client import S3Client
      
      # Load configuration from YAML file
      def load_config():
          """
          Load and return the configuration from the 'config.yaml' file.
          """
          with open('config.yaml', 'r') as file:
              return yaml.safe_load(file)
      
      CONFIG = load_config()
      
      write_lock = Lock() # Lock for managing concurrent writes to the output file
      
      def initialize_aws_clients() -> Tuple[S3Client, BedrockRuntimeClient]:
          """
          Initialize and return AWS S3 and Bedrock clients.
      
          Returns:
              Tuple[S3Client, BedrockRuntimeClient]
          """
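          # Note: retrieve_and_generate is exposed by the 'bedrock-agent-runtime' client;
          # the BedrockRuntimeClient annotation above serves only as a type hint.
          return (
              boto3.client('s3', region_name=CONFIG['aws']['region_name']),
              boto3.client(service_name='bedrock-agent-runtime', 
                           region_name=CONFIG['aws']['region_name'])
          )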
    2. Configure: The config.yaml file specifies the model ID, Region, prompts for entity extraction, and the output file location for processing.
      aws: 
          region_name: us-west-2 
          model_id: anthropic.claude-3-sonnet-20240229-v1:0
          prompts: 
              full: Extract data from attached invoice in key-value format. 
              structured: | 
                  Process the pdf invoice and list all metadata and values in json format for the variables with descriptions in <variables></variables> tags. The result should be returned as JSON as given in the <output></output> tags. 
      
                  <variables> 
                      Vendor: Name of the company or entity the invoice is from. 
                      InvoiceDate: Date the invoice was created.
                      DueDate: Date the invoice is due and needs to be paid by. 
                      CurrencyCode: Currency code for the invoice amount based on the symbol and vendor details.
                      TotalAmountDue: Total amount due for the invoice
                      Description: a concise summary of the invoice description within 20 words 
                  </variables> 
      
                  Format your analysis as a JSON object in following structure: 
                      <output> {
                      "Vendor": "<vendor name>", 
                      "InvoiceDate":"<DD-MM-YYYY>", 
                      "DueDate":"<DD-MM-YYYY>",
                      "CurrencyCode":"<Currency code based on the symbol and vendor details>", 
                      "TotalAmountDue":"<100.90>", # should be a decimal number in a string 
                      "Description":"<Concise summary of the invoice description within 20 words>" 
                      } </output> 
                  Please proceed with the analysis based on the above instructions. Please don't state "Based on the .."
              summary: Process the pdf invoice and summarize the invoice under 3 lines 
      
      processing: 
          output_file: processed_invoice_output.json
          local_download_folder: invoices
    3. Set up API calls: The RetrieveAndGenerate API fetches the invoice from Amazon S3 and processes it using the FM. It takes several parameters, such as prompt, source type (S3), model ID, AWS Region, and S3 URI of the invoice.
      def retrieve_and_generate(bedrock_client: BedrockRuntimeClient, input_prompt: str, document_s3_uri: str) -> Dict[str, Any]: 
          """ 
          Use AWS Bedrock to retrieve and generate invoice data based on the provided prompt and S3 document URI.
      
          Args: 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              input_prompt (str): Prompt for the AI model
              document_s3_uri (str): S3 URI of the invoice document 
      
          Returns: 
              Dict[str, Any]: Generated data from Bedrock 
          """ 
          model_arn = f'arn:aws:bedrock:{CONFIG["aws"]["region_name"]}::foundation-model/{CONFIG["aws"]["model_id"]}' 
          return bedrock_client.retrieve_and_generate( 
              input={'text': input_prompt}, retrieveAndGenerateConfiguration={ 
                  'type': 'EXTERNAL_SOURCES',
                  'externalSourcesConfiguration': { 
                      'modelArn': model_arn, 
                      'sources': [ 
                          { 
                              "sourceType": "S3", 
                              "s3Location": {"uri": document_s3_uri} 
                          }
                      ] 
                  } 
              } 
          )
    4. Batch processing: The batch_process_s3_bucket_invoices function processes the invoices in the specified S3 bucket in parallel and writes the results to the output file (processed_invoice_output.json, as specified by output_file in config.yaml). It relies on the process_invoice function, which calls the Amazon Bedrock RetrieveAndGenerate API for each invoice and prompt.
      def process_invoice(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, pdf_file_key: str) -> Dict[str, str]: 
          """ 
          Process a single invoice by downloading it from S3 and using Bedrock to analyze it. 
      
          Args: 
              s3_client (S3Client): AWS S3 client 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              bucket_name (str): Name of the S3 bucket
              pdf_file_key (str): S3 key of the PDF invoice 
      
          Returns: 
              Dict[str, str]: Model output text for each prompt name 
          """ 
          document_uri = f"s3://{bucket_name}/{pdf_file_key}"
          local_file_path = os.path.join(CONFIG['processing']['local_download_folder'], pdf_file_key) 
      
          # Ensure the local directory exists and download the invoice from S3
          os.makedirs(os.path.dirname(local_file_path), exist_ok=True) 
          s3_client.download_file(bucket_name, pdf_file_key, local_file_path) 
      
          # Process invoice with different prompts 
          results = {} 
          for prompt_name in ["full", "structured", "summary"]:
              response = retrieve_and_generate(bedrock_client, CONFIG['aws']['prompts'][prompt_name], document_uri)
              results[prompt_name] = response['output']['text']
      
          return results
      def batch_process_s3_bucket_invoices(s3_client: S3Client, bedrock_client: BedrockRuntimeClient, bucket_name: str, prefix: str = "") -> int: 
          """ 
          Batch process all invoices in an S3 bucket or a specific prefix within the bucket. 
      
          Args: 
              s3_client (S3Client): AWS S3 client 
              bedrock_client (BedrockRuntimeClient): AWS Bedrock client 
              bucket_name (str): Name of the S3 bucket 
              prefix (str, optional): S3 prefix to filter invoices. Defaults to "". 
      
          Returns: 
              int: Number of processed invoices 
          """ 
          # Clear and recreate local download folder
          shutil.rmtree(CONFIG['processing']['local_download_folder'], ignore_errors=True)
          os.makedirs(CONFIG['processing']['local_download_folder'], exist_ok=True) 
      
          # Prepare to iterate through all objects in the S3 bucket
          continuation_token = None # Pagination handling
          pdf_file_keys = [] 
      
          while True: 
              list_kwargs = {'Bucket': bucket_name, 'Prefix': prefix}
              if continuation_token:
                  list_kwargs['ContinuationToken'] = continuation_token 
      
              response = s3_client.list_objects_v2(**list_kwargs)
      
              for obj in response.get('Contents', []): 
                  pdf_file_key = obj['Key'] 
                  if pdf_file_key.lower().endswith('.pdf'): # Skip folders or non-PDF files
                      pdf_file_keys.append(pdf_file_key) 
      
              if not response.get('IsTruncated'): 
                  break 
              # More pages remain: request the next one on the following iteration
              continuation_token = response.get('NextContinuationToken') 
      
          # Process invoices in parallel 
          processed_count = 0 
          with ThreadPoolExecutor() as executor: 
              future_to_key = { 
                  executor.submit(process_invoice, s3_client, bedrock_client, bucket_name, pdf_file_key): pdf_file_key
                  for pdf_file_key in pdf_file_keys 
              } 
      
              for future in as_completed(future_to_key):
                  pdf_file_key = future_to_key[future] 
                  try: 
                      result = future.result() 
                      # Write result to the JSON output file as soon as it's available 
                      write_to_json_file(CONFIG['processing']['output_file'], {pdf_file_key: result}) 
                      processed_count += 1 
                      print(f"Processed file: s3://{bucket_name}/{pdf_file_key}") 
                  except Exception as e: 
                      print(f"Failed to process s3://{bucket_name}/{pdf_file_key}: {str(e)}") 
      
          return processed_count
    5. Post-processing: The extracted data in processed_invoice_output.json can be further structured or customized to suit your needs.
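
The batch function above calls write_to_json_file, which is not shown in this walkthrough. The following is one possible sketch, not the repository's implementation. It assumes the output file holds a single JSON object keyed by S3 object key, and it uses a lock in the same role as the write_lock defined during initialization so that worker threads do not overwrite each other's results.

```python
import json
import os
from threading import Lock
from typing import Any, Dict

write_lock = Lock()  # Same role as the write_lock defined in the script


def write_to_json_file(output_file: str, data: Dict[str, Any]) -> None:
    """Merge data into the JSON object stored in output_file, creating the file if needed."""
    with write_lock:  # Only one thread may read, modify, and write at a time
        existing: Dict[str, Any] = {}
        if os.path.exists(output_file):
            with open(output_file, "r", encoding="utf-8") as file:
                try:
                    existing = json.load(file)
                except json.JSONDecodeError:
                    existing = {}  # Treat an empty or corrupt file as no prior results
        existing.update(data)
        with open(output_file, "w", encoding="utf-8") as file:
            json.dump(existing, file, indent=4)
```

Rewriting the whole file on each call is simple but slows down as the file grows; for large batches, collecting results in memory and writing once at the end may be preferable.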

This approach allows invoice handling from multiple vendors, each with its own unique format and structure. By using large language models (LLMs), it extracts important details such as invoice numbers, dates, amounts, and vendor information without requiring custom scripts for each vendor format.
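
As one way to post-process the output, the reply to the structured prompt can be converted into typed values. The sketch below assumes the model follows the JSON layout requested in config.yaml (dates as DD-MM-YYYY, amount as a decimal string); the function name parse_structured_output is hypothetical, and fields that cannot be parsed are set to None for manual review rather than guessed.

```python
import json
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Any, Dict


def parse_structured_output(model_text: str) -> Dict[str, Any]:
    """Extract the JSON object from the model's reply and normalize dates and the amount."""
    # Take the outermost {...} span, ignoring <output> tags or surrounding prose
    match = re.search(r"\{.*\}", model_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    record = json.loads(match.group(0))

    for field in ("InvoiceDate", "DueDate"):
        try:
            # Convert DD-MM-YYYY into ISO 8601 (YYYY-MM-DD)
            record[field] = datetime.strptime(record.get(field), "%d-%m-%Y").date().isoformat()
        except (TypeError, ValueError):
            record[field] = None  # Leave unparseable dates for manual review

    try:
        amount_text = str(record.get("TotalAmountDue")).replace(",", "")
        record["TotalAmountDue"] = Decimal(amount_text)
    except InvalidOperation:
        record["TotalAmountDue"] = None
    return record
```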

Run the Streamlit demo

Now that you have the components in place and the invoices processed using Amazon Bedrock, it’s time to deploy the Streamlit application. You can launch the app by invoking the following command:

streamlit run review-invoice-data.py

or

python -m streamlit run review-invoice-data.py

When the app is up, it will open in your default web browser. From there, you can review the invoices and the extracted data side-by-side. Use the Previous and Next arrows to seamlessly navigate through the processed invoices so you can interact with and analyze the results efficiently. The following screenshot shows the UI.
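
The Previous and Next navigation can be reduced to two small helpers. This is a sketch of the idea only, not the code in review-invoice-data.py; both function names are hypothetical, and it assumes the output file is a JSON object keyed by S3 object key.

```python
import json
from typing import Any, Dict, List, Tuple


def load_processed_invoices(output_file: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Return (s3_key, extracted_data) pairs sorted by key for a stable navigation order."""
    with open(output_file, "r", encoding="utf-8") as file:
        return sorted(json.load(file).items())


def step_index(current: int, delta: int, total: int) -> int:
    """Move by delta (-1 for Previous, +1 for Next), clamped to the valid range."""
    if total <= 0:
        return 0
    return max(0, min(total - 1, current + delta))
```

In a Streamlit app, the current index could be kept in st.session_state and updated with step_index when a button is clicked, so that the selection survives Streamlit's script reruns.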

There are quotas for Amazon Bedrock (of which some are adjustable) that you need to consider when building at scale with Amazon Bedrock.
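
One common way to stay within request quotas is to retry throttled calls with exponential backoff. The helper below is a generic sketch under that assumption; call_with_backoff and its parameters are hypothetical, and the caller decides which errors count as retryable.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            # Give up on the last attempt or when the error is not worth retrying
            if attempt == max_attempts or not is_retryable(error):
                raise
            # Delay doubles each attempt (1x, 2x, 4x base_delay) plus up to 10% jitter
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

With Boto3, is_retryable could, for example, check whether the error is a botocore ClientError whose response["Error"]["Code"] is "ThrottlingException", and operation could wrap the retrieve_and_generate call in a lambda. Limiting max_workers on the ThreadPoolExecutor is a complementary way to reduce throttling.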

Cleanup

To clean up after running the demo, follow these steps:

  • Delete the S3 bucket containing your invoices using the following command:
    aws s3 rb s3://<your-bucket-name> --force
  • If you set up a virtual environment, deactivate it by invoking deactivate
  • Remove any local files created during the process, including the cloned repository and output files
  • If you used any AWS resources such as an EC2 instance, terminate them to avoid unnecessary charges

Conclusion

In this post, we walked through a step-by-step guide to automating invoice processing using Streamlit and Amazon Bedrock, addressing the challenge of handling invoices from multiple vendors with different formats. We showed how to set up the environment, process invoices stored in Amazon S3, and deploy a user-friendly Streamlit application to review and interact with the processed data.

If you are looking to further enhance this solution, consider integrating additional features or deploying the app on scalable AWS services such as Amazon SageMaker, Amazon EC2, or Amazon ECS. Due to this flexibility, your invoice processing solution can evolve with your business, providing long-term value and efficiency.

We encourage you to learn more by exploring Amazon Bedrock, Access Amazon Bedrock foundation models, RetrieveAndGenerate API, and Quotas for Amazon Bedrock, and to build a solution using the sample implementation provided in this post with a dataset relevant to your business. If you have questions or suggestions, leave a comment.


About the Authors

Deepika Kumar is a Solution Architect at AWS. She has over 13 years of experience in the technology industry and has helped enterprises and SaaS organizations build and securely deploy their workloads on the cloud. She is passionate about using Generative AI in a responsible manner, whether that is driving product innovation, boosting productivity, or enhancing customer experiences.

Jobandeep Singh is an Associate Solution Architect at AWS specializing in Machine Learning. He supports customers across a wide range of industries to leverage AWS, driving innovation and efficiency in their operations. In his free time, he enjoys playing sports, with a particular love for hockey.

Ratan Kumar is a solutions architect based out of Auckland, New Zealand. He works with large enterprise customers, helping them design and build secure, cost-effective, and reliable internet-scale applications using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.
