A post from Amazon AWS : Unify structured data in Amazon Aurora and unstructured data in Amazon S3 for insights using Amazon Q

A post from Amazon AWS : Unify structured data in Amazon Aurora and unstructured data in Amazon S3 for insights using Amazon Q

In today’s data-intensive business landscape, organizations face the challenge of extracting valuable insights from diverse data sources scattered across their infrastructure. Whether it’s structured data in databases or unstructured content in document repositories, enterprises often struggle to efficiently query and use this wealth of information.

In this post, we explore how you can use Amazon Q Business, the AWS generative AI-powered assistant, to build a centralized knowledge base for your organization, unifying structured and unstructured datasets from different sources to accelerate decision-making and drive productivity. The solution combines data from an Amazon Aurora MySQL-Compatible Edition database and data stored in an Amazon Simple Storage Service (Amazon S3) bucket.

Solution overview

Amazon Q Business is a fully managed, generative AI-powered assistant that helps enterprises unlock the value of their data and knowledge. The key to using the full potential of Amazon Q lies in its ability to seamlessly integrate and query multiple data sources, from structured databases to unstructured content stores. In this solution, we use Amazon Q to build a comprehensive knowledge base that combines sales-related data from an Aurora MySQL database and sales documents stored in an S3 bucket. Aurora MySQL-Compatible is a fully managed, MySQL-compatible, relational database engine that combines the speed and reliability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.

This custom knowledge base that connects these diverse data sources enables Amazon Q to seamlessly respond to a wide range of sales-related questions using the chat interface. The following diagram illustrates the solution architecture.

Solution architecture

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up your VPC

Establishing a VPC provides a secure, isolated network environment for hosting the data sources that Amazon Q Business will access to index. In this post, we use an Aurora MySQL database in a private subnet, and Amazon Q Business accesses the private DB instance in a secure manner using an interface VPC endpoint.

Complete the following steps:

  1. Choose an AWS Region Amazon Q supports (for this post, we use the us-east-1 Region).
  2. Create a VPC or use an existing VPC with at least two subnets. These subnets must be in two different Availability Zones in the Region where you want to deploy your DB instance.
    1. Refer to Steps 1 and 2 in Configuring Amazon VPC support for Amazon Q Business connectors to configure your VPC so that you have a private subnet to host an Aurora MySQL database along with a security group for your database.
    2. Additionally, create a public subnet that will host an EC2 bastion server, which we create in the next steps.
  3. Create an interface VPC endpoint for Aurora powered by AWS PrivateLink in the VPC you created. For instructions, refer to Access an AWS service using an interface VPC endpoint.
    1. Specify the private subnet where the Aurora MySQL database resides along with the database security group you created.

Each interface endpoint is represented by one or more elastic network interfaces in your subnets, which is then used by Amazon Q Business to connect to the private database.

Set up an Aurora MySQL database

Complete the following steps to create an Aurora MySQL database to host the structured sales data:

  1. On the Amazon RDS console, choose Databases in the navigation pane.
  2. Choose Create database.
  3. Select Aurora, then Aurora (MySQL compatible).
  4. For Templates, choose Production or Dev/test.
  5. Under Settings, enter a name for your database cluster identifier. For example, q-aurora-mysql-source.
  6. For Credentials settings, choose Self-managed, give the admin user a password, and keep the rest of the parameters as default.
  7. Under Connectivity, for Virtual private cloud (VPC), choose the VPC that you created.
  8. For DB subnet group, create a new subnet group or choose an existing one. Keep the rest of the parameters as default.
  9. For Publicly accessible, choose NO.
  10. Under VPC security group (firewall), choose Existing and choose the existing security group that you created for the Aurora MySQL DB instance.
  11. Leave the remaining parameters as default and create the database.

Create an EC2 bastion host to connect to the private Aurora MySQL DB instance

In this post, you connect to the private DB instance from the MySQL Workbench client on your local machine through an EC2 bastion host. Launch the EC2 instance in the public subnet of the VPC you configured. The security group attached to this EC2 bastion host instance should be configured to allow SSH traffic (port 22) from your local machine’s IP address. To facilitate the connection between the EC2 bastion host and the Aurora MySQL database, the security group for the Aurora MySQL database should have an inbound rule to allow MySQL traffic (port 3306) from the security group of the EC2 bastion host. Conversely, the security group for the EC2 bastion host should have an outbound rule to allow traffic to the security group of the Aurora MySQL database on port 3306. Refer to Controlling access with security groups for more details.

Configure IAM Identity Center

An Amazon Q Business application requires you to use IAM Identity Center to manage user access. IAM Identity Center is a single place where you can assign your workforce users, also known as workforce identities, to provide consistent access to multiple AWS accounts and applications. In this post, we use IAM Identity Center as the SAML 2.0-aligned identity provider (IdP). Make sure you have enabled an IAM Identity Center instance, provisioned at least one user, and provided each user with a valid email address. The Amazon Q Business application needs to be in the same Region as the IAM Identity Center instance. For more information on enabling users in IAM Identity Center, see Add users to your Identity Center directory.

Create an S3 bucket

Create a S3 bucket in the us-east-1 Region with the default settings and create a folder with a name of your choice inside the bucket.

Create and load sample data

In this post, we use two sample datasets: a total sales dataset CSV file and a sales target document in PDF format. The total sales dataset contains information about orders placed by customers located in various geographical locations, through different sales channels. The sales document contains information about sales targets for the year for each of the sales channel. Complete the steps in the section below to load both datasets.

Aurora MySQL database

In the Amazon Q Business application, you create two indexes for the same Aurora MySQL table: one on the total sales dataset and another on an aggregated view of the total sales data, to cater to the different type of queries. Complete the following steps:

  1. Securely connect to your private Aurora MySQL database using an SSH tunnel through an EC2 bastion host.

This enables you to manage and interact with your database resources directly from your local MySQL Workbench client.

  1. Create the database and tables using the following commands on the local MySQL Workbench client:
CREATE DATABASE sales;
USE sales;
CREATE TABLE total_sales_data (customer_name text, product_name text, state_code text, state text, region text, order_number text, sales_channel text, warehouse_code text, procure_date date DEFAULT NULL, order_date date DEFAULT NULL, ship_date date DEFAULT NULL, delivery_date date DEFAULT NULL, currency_code text, sales_team_id text, customer_id text, store_id text, product_id text, order_quantity int DEFAULT NULL, discount_applied double DEFAULT NULL, unit_price double DEFAULT NULL, unit_cost double DEFAULT NULL, sales_team text, city_name text, county text, type text, latitude text, longitude text, area_code text, population text, household_income text, median_income text, land_area text, water_area text, time_zone text) ;
  1. Download the sample file csv in your local environment.
  2. Use the following code to insert sample data in your MYSQL client:
LOAD DATA LOCAL INFILE '/path/to/the/file/total_sales_dataset.csv' INTO TABLE sales.total_sales_data FIELDS TERMINATED BY ',' ENCLOSED BY '"' IGNORE 1 LINES;

If you encounter the error LOAD DATA LOCAL INFILE file request rejected due to restrictions on access when running the statements in MySQL Workbench 8.0, you might need to edit the connection. On the Connection tab, go to the Advanced sub-tab, and in the Others field, add the line OPT_LOCAL_INFILE=1 and start a new query tab after testing the connection.

  1. Verify the data load by running a select statement:
select count (*) from sales.total_sales_data;

This should return 7,991 rows.

The following screenshot shows the database table schema and the sample data in the table.

Preview of total_sales_data

Amazon S3 bucket

Download the sample file 2020_Sales_Target.pdf in your local environment and upload it to the S3 bucket you created. This sales target document contains information about the sales target for four sales channels and looks like the following screenshot.

Preview of sales-target PDF

Create an Amazon Q application

Complete the following steps to create an Amazon Q application:

  1. On the Amazon Q console, choose Applications in the navigation pane.
  2. Choose Create application.
  3. Provide the following details:
    1. In the Application details section, for Application name, enter a name for the application (for example, sales_analyzer).
    2. In the Service access section, for Choose a method to authorize Amazon Q, select Create and use a new service role.
    3. Leave all other default options and choose Create.

Create Q application

  1. On the Select retriever page, you configure the retriever. The retriever is an index that will be used by Amazon Q to fetch data in real time.
    1. For Retrievers, select Use native retriever.
    2. For Index provisioning, select Starter.
    3. For Number of units, use the default value of 1. Each unit can support up to 20,000 documents. For a database, each database row is considered a document.
    4. Choose Next.

Select retriever

Configure Amazon Q to connect to Aurora MySQL-Compatible

Complete the following steps to configure Amazon Q to connect to Aurora MySQL-Compatible:

  1. On the Connect data sources page, under Data sources, choose the Aurora (MySQL) data source.
  2. Choose Next.

Connect data sources

  1. In the Name and description section, configure the following parameters:
    1. For Data source name, enter a name (for example, aurora_mysql_sales).
    2. For Description, enter a description.
  2. In the Source section, configure the following parameters:
    1. For Host, enter the database endpoint (for example, <databasename>.<ID>.<region>.rds.amazonaws.com).

You can obtain the endpoint on the Amazon RDS console for the instance on the Connectivity & security tab.

    1. For Port, enter the Amazon RDS port for MySQL: 3306.
    2. For Instance, enter the database name (for example, sales).
    3. Select Enable SSL Certificate location.

Connect data sources MySQL

  1. For Authentication, choose Create a new secret with a name of your choice.
  2. Provide the user name and password for your MySQL database to create the secret.
  3. In the Configure VPC and security group section, choose the VPC and subnets where your Aurora MySQL database is located, and choose the default VPC security group.

Authorization

  1. For IAM role, choose Create a new service role.
  2. For Sync scope, under SQL query, enter the following query:
SELECT order_number, sales_channel, concat('customer_name: ',customer_name,' product_name: ',product_name,' state_code: ',state_code,' state: ',state, ' region: ',region,' order_number: ',' order_number: ',order_number,' sales_channel: ',sales_channel, ' warehouse_code: ',warehouse_code,' procure_date: ',procure_date,' order_date: ',order_date,' ship_date: ',ship_date, ' delivery_date: ',delivery_date,' currency_code: ', currency_code,' sales_team_id: ',sales_team_id, ' customer_id: ',customer_id,' store_id: ',store_id,' product_id: ',product_id,' order_quantity: ',order_quantity, ' discount_applied: ',discount_applied,' unit_price: ',unit_price,' unit_cost: ',unit_cost, ' sales_team: ',sales_team,' city_name: ',city_name, 'time_zone: ',time_zone) as sales_details FROM `sales`.total_sales_data

This select statement returns a primary key column, a document title column, and a text column that serves your document body for Amazon Q to answer questions. Make sure you don’t put ; at the end of the query.

  1. For Primary key column, enter order_number.
  2. For Title column, enter sales_channel.
  3. For Body column, enter sales_details.

IAM role

  1. Under Sync run schedule, for Frequency, choose Run on demand.
  2. Keep all other parameters as default and choose Add data source.

Sync mode

This process may take a few minutes to complete. After the aurora_mysql_sales data source is added, you will be redirected to the Connect data sources page.

  1. Repeat the steps to add another Aurora MySQL data source, called aggregated_sales, for the same database but with the following details in the Sync scope This data source will be used by Amazon Q for answering questions on aggregated sales.
    1. Use the following SQL query:
select scoy_id, sales_channel, concat('scoy_id: ',scoy_id,' order_year: ',order_year, ' sales_channel: ',sales_channel, ' total_order_quantity: ',total_order_quantity,' total_sales_amount: ',total_sales_amount, ' total_cost_amount: ', total_cost_amount, ' total_profit : ', total_profit, ' last_order_date: ',last_order_date) as sales_aggregates from ( select concat(sales_channel,year(order_date)) as scoy_id, year(order_date) as order_year, sales_channel, sum(order_quantity) as total_order_quantity, sum(unit_price*order_quantity) as total_sales_amount, sum(unit_cost*order_quantity) as total_cost_amount, sum((unit_price-unit_cost)*order_quantity) as total_profit, max(order_date) as last_order_date from sales.total_sales_data group by 1,2,3 ) aggregated_sales
    1. For Primary key column, enter scoy_id.
    2. For Title column, enter sales_channel.
    3. For Body column, enter sales_aggregates.

Sync scope

After adding the aggregated_sales data source, you will be redirected to the Connect data sources page again.

Configure Amazon Q to connect to Amazon S3

Complete the following steps to configure Amazon Q to connect to Amazon S3:

  1. On the Connect data sources page, under Data sources, choose Amazon S3.
  2. Under Name and description, enter a data source name (for example, s3_sales_targets) and a description.
  3. Under Configure VPC and security group settings, choose No VPC.

Connect data sources S3

  1. For IAM role, choose Create a new service role.
  2. Under Sync scope, for the data source location, enter the S3 bucket name containing the sales target PDF document.
  3. Leave all other parameters as default.

IAM role

  1. Under Sync run schedule, for Frequency, choose Run on demand.
  2. Choose Add data source.

Sync run schedule

  1. On the Connect data sources page, choose Next.
  2. In the Update groups and users section, choose Add users and groups.
  3. Choose the user as entered in IAM Identity Center and choose Assign.

Assign users and groups

  1. After you add the user, you can choose the Amazon Q Business subscription to assign to the user. For this post, we choose Q Business Lite.
  2. Under Web experience service access, select Create and use a new service role and enter a service role name.
  3. Choose Create application.

Add groups and users

After few minutes, the application will be created and you will be taken to the Applications page on the Amazon Q Business console.

Applications

Sync the data sources

Choose the name of your application and navigate to the Data sources section. For each of the three data sources, select the data source and choose Sync now. It will take several minutes to complete. After the sources have synced, you should see the Last sync status show as Completed.

Data sources

Customize and interact with the Amazon Q application

At this point, you have created an Amazon Q application, synced the data source, and deployed the web experience. You can customize your web experience to make it more intuitive to your application users.

  1. On the application details page, choose Customize web experience.

How it works

  1. For this post, we have customized the Title, Subtitle and Welcome message fields for our assistant.

Previewing sales analyzer

  1. After you have completed your customizations for the web experience, go back to the application details page and choose the web experience URL.
  2. Sign in with the IAM Identity Center user name and password you created earlier to start the conversation with assistant.

Welcome

You can now test the application by asking different questions, as shown in the following screenshot. You can observe in the following question that the channel names were fetched from the Amazon S3 sales target PDF.

Question 1

The following screenshots show more example interactions.

Question 2

Question 3

Question 4

The answer in the preceding example was derived from the two sources: the S3 bucket and the Aurora database. You can verify the output by cross-referencing the PDF, which has a target as $12 million for the in-store sales channel in 2020. The following SQL shows the actual sales achieved in 2020 for the same channel:

SELECT YEAR(order_date) AS order_year, sales_channel, SUM(unit_price*order_quantity) AS total_sales_amount FROM sales.total_sales_data WHERE YEAR(order_date)='2020' AND sales_channel='In-Store' GROUP BY 1,2;

In-store sales amount

As seen from the sales target PDF data, the 2020 sales target for the distributor sales channel was $7 million.

Question 5

The following SQL in the Aurora MySQL database shows the actual sales achieved in 2020 for the same channel:

SELECT YEAR(order_date) AS order_year, sales_channel, SUM(unit_price*order_quantity) AS total_sales_amount FROM sales.total_sales_data WHERE YEAR(order_date)='2020' AND sales_channel='Distributor' GROUP BY 1,2;

Distributor sales amount

The following screenshots show additional questions.

Question 6

Question 7

Question 8

You can verify the preceding answers with the following SQL:

SELECT order_date, order_number, order_quantity, state, warehouse_code, sales_channel, sales_team FROM sales.total_sales_data WHERE customer_name='Amylin Group' AND YEAR(order_date)='2020' AND product_name='outdoor furniture';

Amylin group data preview

Clean up

To avoid incurring future charges, clean up any resources you created as part of this solution, including the Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane, select the application you created, and on the Actions menu, choose Delete.
  2. Delete the AWS Identity and Access Management (IAM) roles created for the application and data retriever. You can identify the IAM roles used by the Amazon Q Business application and data retriever by inspecting the associated configuration using the AWS console or AWS Command Line Interface (AWS CLI).
  3. Delete the IAM Identity Center instance you created for this walkthrough.
  4. Empty the bucket you created and then delete the bucket.
  5. Delete the Aurora MySQL instance and Aurora cluster.
  6. Shut down the EC2 bastion host instance.
  7. Delete the VPC and related components—the NAT gateway and interface VPC endpoint.

Conclusion

In this post, we demonstrated how organizations can use Amazon Q to build a unified knowledge base that integrates structured data from an Aurora MySQL database and unstructured data from an S3 bucket. By connecting these disparate data sources, Amazon Q enables you to seamlessly query information from two data sources and gain valuable insights that drive better decision-making.

We encourage you to try this solution and share your experience in the comments. Additionally, you can explore the many other data sources that Amazon Q for Business can seamlessly integrate with, empowering you to build robust and insightful applications.


About the Authors

Monjumi Sarma is a Technical Account Manager at Amazon Web Services. She helps customers architect modern, scalable, and cost-effective solutions on AWS, which gives them an accelerated path towards modernization initiatives. She has experience across analytics, big data, ETL, cloud operations, and cloud infrastructure management.

Akchhaya Sharma is a Sr. Data Engineer at Amazon Ads. He builds and manages data-driven solutions for recommendation systems, working together with a diverse and talented team of scientists, engineers, and product managers. He has experience across analytics, big data, and ETL.

Read More

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *