Data is the foundation for capturing maximum value from AI technology and solving business problems quickly. To unlock the potential of generative AI technologies, however, there is one key prerequisite: your data must be properly prepared. In this post, we describe how to use generative AI to update and scale your data pipeline using Amazon SageMaker Canvas for data preparation.
Data pipeline work typically requires specialized skills to prepare and organize data before security analysts can extract value from it, which takes time, increases risk, and lengthens time to value. With SageMaker Canvas, security analysts can securely and effortlessly access leading foundation models to prepare their data faster and remediate cybersecurity risks.
Data preparation involves careful formatting and thoughtful contextualization, working backward from the customer's problem. Now, with the SageMaker Canvas chat for data preparation capability, analysts with domain knowledge can quickly prepare, organize, and extract value from data using a chat-based experience.
Solution Overview
Generative AI is revolutionizing the security space by providing natural language and personalized experiences, improving risk identification and remediation, and increasing business productivity. For this use case, we use SageMaker Canvas, Amazon SageMaker Data Wrangler, Amazon Security Lake, and Amazon Simple Storage Service (Amazon S3). Amazon Security Lake allows you to aggregate and normalize security data so you can analyze it and gain a better understanding of security across your organization. Amazon S3 allows you to store and retrieve any amount of data anytime, anywhere. It offers industry-leading scalability, data availability, security, and performance.
SageMaker Canvas now supports comprehensive data preparation capabilities powered by SageMaker Data Wrangler. With this integration, SageMaker Canvas provides an end-to-end no-code workspace to prepare data, build and use machine learning (ML) models, and use Amazon Bedrock foundation models to accelerate time from data to business insights. You can now discover and aggregate data from more than 50 data sources and explore and prepare data using more than 300 analyses and transformations built into the SageMaker Canvas visual interface. You'll also see faster performance for transformations and analyses, and benefit from a natural language interface for exploring and transforming data for ML.
In this post, we demonstrate three key transformations: filtering, renaming columns, and extracting text from a column in the security findings dataset. We also demonstrate how to use the chat for data preparation feature in SageMaker Canvas to analyze the data and visualize your findings.
Prerequisites
Before you get started, you need an AWS account. You also need to set up an Amazon SageMaker Studio domain. For instructions on setting up SageMaker Canvas, see Generate machine learning predictions without code.
Access the SageMaker Canvas chat interface
Complete the following steps to start using the SageMaker Canvas chat feature:
- On the SageMaker Canvas console, choose Data Wrangler.
- Under Datasets, choose Amazon S3 as the source and specify the Amazon Security Lake security findings dataset.
- Choose your data flow and choose Chat for data preparation, which displays a chat interface with guided prompts.
Filter data
For this post, we first want to keep only critical and high severity findings, so we enter instructions in the chat box to remove findings that are not critical or high severity. SageMaker Canvas removes the rows, shows a preview of the transformed data, and provides the option to use the generated code. We can add the transformation to the list of steps in the Steps pane.
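To make the filter concrete, here is a minimal plain-Python sketch of the same logic. It is not the code SageMaker Canvas generates; the `severity` field name, its values, and the sample rows are assumptions made for illustration.

```python
# Hypothetical sample rows; the "severity" field name and values are assumptions.
findings = [
    {"uid": "f-1", "severity": "Critical"},
    {"uid": "f-2", "severity": "Low"},
    {"uid": "f-3", "severity": "High"},
    {"uid": "f-4", "severity": "Informational"},
]

# Severities to keep, compared case-insensitively.
KEEP = {"critical", "high"}


def filter_by_severity(rows, keep=KEEP):
    """Return only the rows whose severity is in `keep`."""
    return [
        row for row in rows
        if str(row.get("severity", "")).strip().lower() in keep
    ]


filtered = filter_by_severity(findings)
print([row["uid"] for row in filtered])
```

Comparing in lowercase keeps the filter tolerant of inconsistent capitalization in the source data.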
Rename columns
Next, we want to rename two columns, so we enter a prompt in the chat box to rename the download and qualification columns to Finding and Remediation. SageMaker Canvas generates a preview, and if you're happy with the results, you can add the transformed data to the data flow steps.
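The rename step amounts to mapping old column names to new ones. The sketch below shows that idea in plain Python, assuming each row is a dictionary. The source column names are taken from the text as written and may differ in your dataset; the sample row is hypothetical.

```python
def rename_columns(rows, mapping):
    """Return new row dicts with keys renamed per `mapping`; other keys are kept."""
    return [
        {mapping.get(key, key): value for key, value in row.items()}
        for row in rows
    ]


# Hypothetical sample row for illustration only.
rows = [
    {
        "uid": "f-1",
        "download": "Port 22 open to the internet",
        "qualification": "Restrict the security group",
    }
]

renamed = rename_columns(
    rows, {"download": "Finding", "qualification": "Remediation"}
)
print(renamed[0])
```

Because the function builds new dictionaries, the original rows stay unchanged, which mirrors how a preview can be inspected before the step is added to the flow.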
Extract text
To determine the source Regions of the findings, you can enter instructions in the chat to extract the Region text from the UID column based on the pattern arn:aws:security:securityhub:region:* and create a new column called Region. SageMaker Canvas then generates code to create the new Region column. The data preview shows that the findings originate from one Region: us-west-2. You can add this transformation to the data flow for later analysis.
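One way to picture this extraction is a regular expression that captures the segment following `securityhub:` in each UID. The sketch below is an illustration under stated assumptions, not the code SageMaker Canvas produces: the `UID` key, the sample ARN, and the helper names are hypothetical.

```python
import re

# Capture the Region segment that follows "securityhub:" in an ARN-style UID.
REGION_RE = re.compile(r"securityhub:([a-z0-9-]+):")


def extract_region(uid):
    """Return the Region found in `uid`, or None if the pattern does not match."""
    match = REGION_RE.search(uid or "")
    return match.group(1) if match else None


def add_region_column(rows, uid_key="UID"):
    """Return new rows with an added "Region" key derived from the UID."""
    return [{**row, "Region": extract_region(row.get(uid_key))} for row in rows]


# Hypothetical sample UID for illustration only.
rows = [{"UID": "arn:aws:securityhub:us-west-2:111122223333:finding/example"}]
with_region = add_region_column(rows)
print(with_region[0]["Region"])
```

Returning `None` for non-matching UIDs makes malformed rows easy to spot instead of silently producing an empty string.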
Analyze the data
Finally, we want to analyze the data to determine whether there is a correlation between the time of day and the number of critical findings. You can enter a request in the chat to summarize critical findings by time of day, and SageMaker Canvas returns a summary you can use in your investigation and analysis.
Visualize findings
Next, we visualize the findings by severity over time for inclusion in a leadership report. You can ask SageMaker Canvas to generate a bar chart of severity versus time of day. In seconds, SageMaker Canvas creates the chart grouped by severity. You can add this visualization to the analysis in the data flow and download it for your report. The data shows that the findings originate from one Region and occur at specific times. This gives us confidence in where to focus our investigation of the security findings to determine root causes and corrective actions.
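A summary of this kind boils down to counting critical findings per hour. The following sketch shows one way to do that with the standard library; the `time` and `severity` field names, the ISO 8601 timestamp format, and the sample rows are assumptions.

```python
from collections import Counter
from datetime import datetime


def critical_by_hour(rows):
    """Count critical findings per hour of day, returned in hour order."""
    counts = Counter(
        datetime.fromisoformat(row["time"]).hour
        for row in rows
        if row["severity"].lower() == "critical"
    )
    return dict(sorted(counts.items()))


# Hypothetical sample rows for illustration only.
rows = [
    {"time": "2024-01-15T02:14:00", "severity": "Critical"},
    {"time": "2024-01-15T02:50:00", "severity": "Critical"},
    {"time": "2024-01-15T02:30:00", "severity": "High"},
    {"time": "2024-01-15T14:05:00", "severity": "Critical"},
]

summary = critical_by_hour(rows)
print(summary)
```

A per-hour count like this is only a starting point: it can suggest when critical findings cluster, but it does not by itself establish a correlation.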
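The grouping behind such a chart can be sketched without any plotting library. The example below counts findings per hour and severity and renders a simple text bar chart; it is a stand-in for the visual chart, and the field names, sample rows, and helper names are assumptions.

```python
from collections import Counter
from datetime import datetime


def severity_by_hour(rows):
    """Count findings for each (hour of day, severity) pair."""
    return Counter(
        (datetime.fromisoformat(row["time"]).hour, row["severity"])
        for row in rows
    )


def text_bar_chart(counts):
    """Render one text bar per (hour, severity) pair, ordered by hour."""
    lines = []
    for (hour, severity), total in sorted(counts.items()):
        lines.append(f"{hour:02d}:00 {severity:<8} {'#' * total} ({total})")
    return "\n".join(lines)


# Hypothetical sample rows for illustration only.
rows = [
    {"time": "2024-01-15T02:14:00", "severity": "Critical"},
    {"time": "2024-01-15T02:50:00", "severity": "Critical"},
    {"time": "2024-01-15T02:30:00", "severity": "High"},
    {"time": "2024-01-15T14:05:00", "severity": "High"},
]

counts = severity_by_hour(rows)
chart = text_bar_chart(counts)
print(chart)
```

In practice you would hand the same counts to a charting tool; the text rendering only shows how the data is grouped.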
Clean up
To avoid incurring unwanted charges, complete the following steps to clean up your resources:
- Empty the S3 bucket that you used as a source.
- Sign out of SageMaker Canvas.
Conclusion
In this post, we showed you how to use SageMaker Canvas as an end-to-end no-code workspace for data preparation, and how to build and use Amazon Bedrock foundation models to accelerate the time from data to business insights.
Note that this approach is not limited to security findings; you can apply it to any generative AI use case that uses data preparation as a core element.
The future belongs to companies that can effectively harness the power of generative AI and large language models. But to do so, we must first develop a solid data strategy and understand the art of data preparation. By using generative AI to intelligently structure our data and working backward from the customer, we can solve business problems faster. With SageMaker Canvas chat for data preparation, it's easy for analysts to get started and capture immediate value from AI.
About the authors
Sudeesh Sasidharan is a Senior Solutions Architect at AWS, within the Energy team. Sudeesh loves experimenting with new technologies and creating innovative solutions that solve complex business challenges. When he's not designing solutions or tinkering with the latest technologies, you can find him on the tennis court working on his backhand.
John Klacynski is a Senior Customer Solutions Manager within the AWS Independent Software Vendor (ISV) team. In this role, he programmatically helps ISV customers adopt AWS technologies and services to achieve their business goals faster. Prior to joining AWS, John led data product teams for large consumer packaged goods companies, helping them leverage valuable data insights to improve their operations and decision making.