Time series data is a distinct category that incorporates time as a fundamental element in its structure. In a time series, data points are collected sequentially, often at regular intervals, and typically exhibit certain patterns, such as trends, seasonal variations, or cyclical behavior. Common examples of time series data include sales revenue, system performance data (such as CPU and memory usage), credit card transactions, sensor readings, and user activity analytics.
Anomaly detection in time series is the process of identifying unexpected or unusual patterns in data that develop over time. An anomaly, also known as an outlier, occurs when a data point deviates significantly from an expected pattern.
For some time series, such as those with well-defined expected ranges like machine operating temperatures or CPU usage, a threshold-based approach might be sufficient. However, in areas like fraud detection and sales, simple rules cannot capture anomalies that arise from complex relationships in the data, so more sophisticated techniques are required to identify unexpected events.
In this post, we demonstrate how to build a robust real-time anomaly detection solution for streaming time series data using Amazon Managed Service for Apache Flink and other AWS managed services.
Solution Overview
The following diagram illustrates the main architecture of the Anomaly Detection Stack solution.
This solution uses machine learning (ML) for anomaly detection and does not require users to have prior AI experience. It offers an AWS CloudFormation template for easy deployment in an AWS account. Using the CloudFormation template, you can deploy an application stack with the AWS resources required to detect anomalies. When you configure a stack, you create an application with an anomaly detection task or detector. You can configure multiple such stacks to run simultaneously, and each will analyze data and report on anomalies.
Once deployed, the application builds an ML model using the Random Cut Forest (RCF) algorithm. It initially obtains input time series data from Amazon Managed Streaming for Apache Kafka (Amazon MSK), using this live stream for model training. After training, the model continues to process incoming data points from the stream, evaluating each point against the historical trends of the corresponding time series. The model generates an initial raw anomaly score during processing and maintains an internal threshold to filter out noisy data points. It then produces a normalized anomaly score for each data point that it treats as an anomaly. These scores, ranging from 0 to 100, indicate the degree of deviation from typical patterns; scores closer to 100 signify higher anomaly levels. You have the flexibility to set a custom threshold on these anomaly scores, allowing you to define what you consider anomalous.
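To make the scoring behavior concrete, the following is a minimal conceptual sketch in Python of how a raw RCF score could be normalized to the 0–100 range and compared against a custom threshold. This is not the solution's actual implementation; normalizing by a running maximum is an assumption for illustration only.

```python
# Conceptual sketch (not the solution's actual code): normalize a raw RCF
# anomaly score to a 0-100 scale and flag points above a custom threshold.

def normalize_score(raw_score: float, max_observed_score: float) -> float:
    """Map a raw anomaly score onto a 0-100 scale relative to the largest score seen so far."""
    if max_observed_score <= 0:
        return 0.0
    return min(100.0, 100.0 * raw_score / max_observed_score)

def is_anomaly(normalized_score: float, threshold: float = 70.0) -> bool:
    """Flag the data point when the normalized score crosses the user-defined threshold."""
    return normalized_score > threshold

print(is_anomaly(normalize_score(2.4, 3.0)))  # normalized score 80.0 -> True
```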
This solution uses a CloudFormation template, which takes inputs such as MSK broker endpoints and topics, AWS Identity and Access Management (IAM) roles, and other parameters related to the virtual private cloud (VPC) configuration. The template creates essential resources, such as the Apache Flink application and the Amazon SageMaker real-time endpoint, in the customer's account.
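As an illustration of what such a deployment could look like programmatically, the following sketch creates the stack with boto3. The stack name, template URL, and parameter keys below are hypothetical placeholders, not the template's actual interface.

```python
# Hypothetical deployment sketch: replace the template URL and parameter keys
# with the actual values defined by the Anomaly Detection Stack template.
import boto3

cfn = boto3.client("cloudformation")

cfn.create_stack(
    StackName="anomaly-detection-stack",
    TemplateURL="https://example-bucket.s3.amazonaws.com/anomaly-detection-stack.yaml",
    Parameters=[
        {"ParameterKey": "MskBrokerEndpoints", "ParameterValue": "b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"},
        {"ParameterKey": "InputTopic", "ParameterValue": "sales-input"},
        {"ParameterKey": "OutputTopic", "ParameterValue": "anomaly-output"},
        {"ParameterKey": "AnomalyDecisionThreshold", "ParameterValue": "70"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # needed because the template creates IAM roles
)
```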
To request access to this solution, please email anomalydetection-support-canvas@amazon.com.
In this post, we describe how a comprehensive solution can be built using the Anomaly Detection Stack. Consider a hypothetical sales scenario where AnyBooks, a bookstore on the campus of a large university, sells a variety of supplies to college students. Due to class schedules, their seasonality is such that they sell about 20 units of Item A and 30 units of Item B during even-numbered hours, and about half that amount during odd-numbered hours throughout the day. Recently, there have been some unexplained spikes in the number of items sold, and the management team wants to start tracking these anomalies in quantity so they can better plan their staffing and inventory levels.
The following diagram shows the detailed architecture for the end-to-end solution.
In the following sections, we discuss each layer shown in the diagram above.
Ingestion
At the ingestion layer, an AWS Lambda function retrieves the current minute's sales transactions from a PostgreSQL transactional database, transforms each record into a JSON message, and publishes it to a Kafka input topic. This Lambda function is configured to run every minute using Amazon EventBridge Scheduler.
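A minimal sketch of such an ingestion function might look like the following. The table and column names, connection settings, and topic name are assumptions for illustration, not the solution's actual code.

```python
# Illustrative ingestion Lambda: read the last minute of sales transactions
# from PostgreSQL and publish each row as a JSON message to a Kafka topic.
import json
import psycopg2
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["b-1.mycluster.kafka.us-east-1.amazonaws.com:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    conn = psycopg2.connect(host="sales-db.example.internal", dbname="sales",
                            user="reader", password="...")  # use Secrets Manager in practice
    with conn, conn.cursor() as cur:
        # Pull only the transactions recorded in the last minute.
        cur.execute("""
            SELECT product_name, quantity, sale_time
            FROM transactions
            WHERE sale_time >= now() - interval '1 minute'
        """)
        for product_name, quantity, sale_time in cur.fetchall():
            producer.send("sales-input", {
                "product_name": product_name,
                "quantity": quantity,
                "sale_time": sale_time.isoformat(),
            })
    producer.flush()
```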
Anomaly Detection Stack
The Flink application reads raw data from the input MSK topic, trains the model, and performs anomaly detection, finally writing the results to the output MSK topic.
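The following is a representative example of the output JSON. Because the exact record is not reproduced here, the field names and values are reconstructed from the field descriptions that follow.

```json
{
  "measure": "quantity",
  "aggregatedMeasureValue": 42.0,
  "timeseriesId": "product_name:Item-A",
  "anomalyConfidenceScore": 95.5,
  "anomalyScore": 82.3,
  "modelStage": "INFERENCE",
  "anomalyDecisionThreshold": 70,
  "anomalyDecision": 1
}
```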
A brief explanation of the output fields is provided below:
- measure – The metric we are tracking to detect anomalies. In our case, the measure field is the quantity of sales for Item-A.
- aggregatedMeasureValue – The aggregated value of quantity over the time window.
- timeseriesId – A unique identifier corresponding to a combination of unique values for the dimensions and the measure. In this case, it is the product name Item-A within the product_name dimension.
- anomalyConfidenceScore – As the model evolves through learning and inference, this confidence score progressively improves.
- anomalyScore – The anomaly detection score. With an anomalyThreshold of 70, any value above 70 is considered a potential anomaly.
- modelStage – While the model is in the learning phase, the anomalyScore is 0.0 and the value of this field is set to LEARNING. After learning is complete, the value changes to INFERENCE.
- anomalyDecisionThreshold – The decision threshold, provided as an input to the CloudFormation stack. If you determine that there are too many false positives, you can increase this threshold to reduce the sensitivity.
- anomalyDecision – If the anomalyScore exceeds the anomalyDecisionThreshold, this field is set to 1, indicating that an anomaly was detected.
Transform
At the transformation layer, an Amazon Data Firehose stream is configured to consume data from the Kafka output topic and invoke a Lambda function for transformation. The Lambda function flattens the nested JSON data from the Kafka topic. The transformed results are then partitioned by date and stored in an Amazon Simple Storage Service (Amazon S3) bucket in Parquet format. An AWS Glue crawler crawls the data in the Amazon S3 location and catalogs it in the AWS Glue Data Catalog, making it ready for querying and analysis.
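A flattening transformation of this kind could be sketched as follows. The nested record shape assumed by flatten() is illustrative, while the record and response structure follows the standard contract Firehose uses for transformation Lambdas.

```python
# Illustrative Firehose transformation Lambda: flatten nested JSON records.
import base64
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON objects into dot-separated top-level keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flat = flatten(payload)
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            # Firehose expects the transformed payload re-encoded as base64;
            # a trailing newline keeps records line-delimited downstream.
            "data": base64.b64encode((json.dumps(flat) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```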
Visualize
To visualize the data, we created an Amazon QuickSight dashboard that connects to the data in Amazon S3 via the Data Catalog and queries it using Amazon Athena. The dashboard can be refreshed to display the latest detected anomalies, as shown in the following screenshot.
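The same cataloged data can also be queried directly. For example, a hypothetical Athena query for the strongest recent anomalies could be issued with boto3 as follows; the database, table, and output-location names are placeholders.

```python
# Hypothetical Athena query over the cataloged anomaly results.
import boto3

athena = boto3.client("athena")

athena.start_query_execution(
    QueryString="""
        SELECT timeseriesId, aggregatedMeasureValue, anomalyScore, anomalyDecision
        FROM anomaly_db.anomaly_results
        WHERE anomalyDecision = 1
        ORDER BY anomalyScore DESC
        LIMIT 20
    """,
    QueryExecutionContext={"Database": "anomaly_db"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```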
In this example, the darker blue line on the line graph represents the seasonality of the quantity measure for Item-A over time, showing higher values during even hours and lower values during odd hours. The pink line represents the anomaly detection score, plotted on the right Y-axis. The anomaly score approaches 100 when the quantity value deviates significantly from its seasonal pattern. The blue line represents the anomaly threshold, set at 70. When anomalyScore exceeds this threshold, anomalyDecision is set to 1.
The “Number of time series tracked” KPI shows how many time series the model is currently monitoring. In this case, because we are tracking two products (Item-A and Item-B), the count is 2. The “Number of data points processed” KPI shows the total number of data points the model has processed, and the “Anomaly confidence score” indicates the level of confidence in predicting anomalies. Initially, this score is low, but it will approach 100 as the model matures over time.
Notification
While visualization is valuable for investigating anomalies, data analysts often prefer to receive near real-time notifications about critical anomalies. This is achieved by adding a Lambda function that reads the results from the Kafka output topic and analyzes them. If the anomalyScore exceeds the defined threshold, the function publishes to an Amazon Simple Notification Service (Amazon SNS) topic to send email or SMS notifications to a designated list, alerting the team about the anomaly in near real time.
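A minimal sketch of such a notification function, assuming an MSK event source mapping on the output topic, might look like the following; the SNS topic ARN and threshold value are placeholders.

```python
# Illustrative notification Lambda triggered by an MSK event source mapping.
import base64
import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:anomaly-alerts"  # placeholder
THRESHOLD = 70

def handler(event, context):
    # An MSK event groups records by topic-partition; each record value is base64-encoded.
    for records in event["records"].values():
        for record in records:
            result = json.loads(base64.b64decode(record["value"]))
            if result.get("anomalyScore", 0) > THRESHOLD:
                sns.publish(
                    TopicArn=TOPIC_ARN,
                    Subject="Anomaly detected",
                    Message=json.dumps(result, indent=2),
                )
```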
Conclusion
In this post, we demonstrated how to build a robust real-time anomaly detection solution for streaming time series data using Managed Service for Apache Flink and other AWS services. We walked through an end-to-end architecture that ingests data from a source database, passes it through an Apache Flink application that trains an ML model and detects anomalies, and then puts the anomaly data into an S3 data lake. Anomaly scores and decisions are visualized through a QuickSight dashboard connected to Amazon S3 data using AWS Glue and Athena. Additionally, a Lambda function analyzes the results and sends notifications in near real time.
With AWS managed services like Amazon MSK, Data Firehose, Lambda, and SageMaker, you can quickly deploy and scale this anomaly detection solution for your own time series use cases. This allows you to automatically identify unexpected behaviors or patterns in your real-time data streams without manual rules or thresholds.
Try this solution and explore how real-time anomaly detection on AWS can reveal insights and optimize operations across your enterprise!
About the authors
Noah Soprala is a Solutions Architect based in Dallas. He is a trusted advisor to his clients, helping them build innovative solutions using AWS technologies. Noah has over 20 years of experience in consulting, development, architecture, and solution delivery.
Dan Sinnreich is a Senior Product Manager at Amazon SageMaker, focused on expanding no-code and low-code services. He is dedicated to making machine learning and generative AI more accessible and applying them to solve complex problems. Outside of work, he can be found playing hockey, scuba diving, and reading science fiction.
Syed Furqhan is a Senior Software Engineer for AI and ML at AWS. He has been involved in many AWS service launches, such as Amazon Lookout for Metrics, Amazon SageMaker, and Amazon Bedrock. Currently, he is focused on generative AI initiatives as part of Amazon Bedrock Core Systems. He is a clean code advocate and a subject matter expert on serverless and event-driven architecture. You can follow him on LinkedIn at syedfurqhan.
Nirmal Kumar is a Senior Product Manager for the Amazon SageMaker service. Committed to expanding access to AI and machine learning, he leads the development of no-code and low-code machine learning solutions. Outside of work, he enjoys traveling and reading nonfiction.