This post is co-written with Curtis Maher and Anjali Thatte of Datadog.
This post walks you through the new Datadog integration with AWS Neuron, which helps you monitor your AWS Trainium and AWS Inferentia instances by providing deep, real-time observability of resource utilization, model execution performance, latency, and infrastructure health, so you can optimize your machine learning (ML) workloads and achieve high performance at scale.
Neuron is the SDK used to run deep learning workloads on Trainium and Inferentia based instances. The AWS AI chips, Trainium and Inferentia, enable you to build and deploy generative AI models with higher performance and lower cost. With the increasing use of large models, which require a large number of accelerated computing instances, observability plays a critical role in ML operations, allowing you to improve performance, diagnose and fix faults, and optimize resource utilization.
Datadog, an observability and security platform, provides real-time monitoring for cloud infrastructure and ML operations. Datadog is pleased to launch its Neuron integration, which pulls the metrics collected by the Neuron SDK's neuron-monitor tool into Datadog, enabling you to track the performance of your Trainium and Inferentia based instances. By providing real-time visibility into model performance and hardware usage, Datadog helps you achieve efficient training and inference, optimize resource utilization, and prevent service slowdowns.
Comprehensive monitoring of Trainium and Inferentia
Datadog's integration with the Neuron SDK automatically collects metrics and logs from Trainium and Inferentia instances and sends them to the Datadog platform. Upon enabling the integration, users will find a ready-to-use dashboard in Datadog, making it easy to start monitoring quickly. You can also modify the pre-existing dashboards and monitors and add new ones tailored to your specific monitoring requirements.
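Under the hood, the integration relies on the Neuron SDK's neuron-monitor tool, which emits instance telemetry as JSON. As a minimal illustrative sketch only (the sample report below and its field layout are simplified assumptions, not the documented neuron-monitor schema), a script could extract per-core utilization from one such report:

```python
import json

# Illustrative, simplified neuron-monitor-style report. The real tool
# streams JSON with a fuller, different schema; this is only a sketch.
SAMPLE_REPORT = """
{
  "neuroncore_counters": {
    "neuroncores_in_use": {
      "0": {"neuroncore_utilization": 42.5},
      "1": {"neuroncore_utilization": 87.0}
    }
  }
}
"""

def core_utilization(report_json: str) -> dict:
    """Return per-NeuronCore utilization percentages from a report."""
    report = json.loads(report_json)
    cores = report["neuroncore_counters"]["neuroncores_in_use"]
    return {core_id: data["neuroncore_utilization"]
            for core_id, data in cores.items()}

if __name__ == "__main__":
    # Prints {'0': 42.5, '1': 87.0} for the sample above
    print(core_utilization(SAMPLE_REPORT))
```

In practice the Datadog Agent performs this collection for you; the sketch is only meant to show the kind of per-core data the dashboard surfaces.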
The Datadog dashboard provides a detailed view of your AWS AI chip (Trainium or Inferentia) performance, including the number of instances, availability, and AWS Region. Real-time metrics provide an immediate snapshot of infrastructure health, with pre-configured monitors alerting teams to critical issues such as latency, resource utilization, and execution errors. The following screenshot shows an example dashboard.
For example, when latency increases on a specific instance, a monitor in the monitor summary section of the dashboard will turn red and trigger alerts through Datadog or other paging mechanisms (such as Slack or email). High latency can indicate high user demand or inefficient data pipelines, which can slow response times. By identifying these signals early, teams can respond quickly to maintain high-quality user experiences.
Datadog's Neuron integration enables tracking of key performance aspects, providing crucial information for troubleshooting and optimization:
- NeuronCore Counters: Monitoring NeuronCore utilization helps ensure cores are being used efficiently and helps you identify whether you need to rebalance workloads or optimize performance.
- Execution Status: You can monitor the progress of training jobs, including completed tasks and failed executions. This data ensures that models are trained smoothly and reliably. If failures increase, it may indicate issues with data quality, model configurations, or resource limitations that need to be addressed.
- Memory Used: You can get a granular view of memory usage on both the host and the Neuron device, including memory allocated for tensors and model execution. This helps you understand how effectively resources are being used and when it might be time to rebalance workloads or scale resources to prevent bottlenecks from causing interruptions during training.
- vCPU Usage in Neuron Runtime: You can monitor vCPU usage to ensure that your models do not overload the infrastructure. When vCPU usage exceeds a certain threshold, you will be notified to decide whether to redistribute workloads or update instance types to avoid training slowdowns.
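The threshold-based alerting described above can be sketched as a small piece of logic. In this hypothetical example, the metric names and threshold values are illustrative assumptions, not Datadog defaults (in practice you would configure these as Datadog monitors rather than write code):

```python
# Hypothetical alert-threshold logic mirroring the monitors described above.
# Metric names and threshold values are illustrative, not Datadog defaults.
THRESHOLDS = {
    "neuroncore_utilization_pct": 90.0,  # cores near saturation
    "vcpu_usage_pct": 85.0,              # risk of host-side bottleneck
    "memory_used_pct": 80.0,             # rebalance or scale before OOM
}

def evaluate_metrics(metrics: dict) -> list:
    """Return alert messages for any metric that breaches its threshold."""
    alerts = []
    for name, value in metrics.items():
        limit = THRESHOLDS.get(name)
        if limit is not None and value > limit:
            alerts.append(f"{name}={value:.1f}% exceeds threshold {limit:.1f}%")
    return alerts

if __name__ == "__main__":
    sample = {"neuroncore_utilization_pct": 95.2, "vcpu_usage_pct": 60.0}
    for alert in evaluate_metrics(sample):
        print(alert)
```

The design point is simply that each metric gets its own threshold and its own alert, so a team can tell at a glance whether to rebalance workloads, scale instances, or investigate the data pipeline.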
By consolidating these metrics into a single view, Datadog provides a powerful tool for maintaining efficient, high-performing Neuron workloads, helping teams identify issues in real time and optimize infrastructure as needed. Using the Neuron integration combined with Datadog's LLM Observability capabilities, you can gain complete visibility into your large language model (LLM) applications.
Get started with Datadog, Inferentia and Trainium
Datadog's integration with Neuron provides real-time visibility into Trainium and Inferentia, helping you optimize resource utilization, troubleshoot issues, and achieve seamless performance at scale. To get started, see Monitoring AWS Inferentia and AWS Trainium.
To learn more about how Datadog integrates with Amazon ML services and Datadog LLM Observability, see <a target="_blank" href="https://www.datadoghq.com/blog/monitor-amazon-bedrock-with-datadog/#get-started-with-amazon-bedrock-and-datadog" rel="noopener">Monitor Amazon Bedrock with Datadog</a> and Monitoring Amazon SageMaker with Datadog.
If you don't have a Datadog account yet, you can sign up for a <a target="_blank" href="https://www.datadoghq.com/blog/monitor-amazon-bedrock-with-datadog/" rel="noopener">14-day free trial</a> today.
About the authors
Curtis Maher is a Product Marketing Manager at Datadog, focused on the platform's cloud and AI/ML integrations. Curtis works closely with Datadog's product, marketing, and sales teams to coordinate product launches and help customers monitor and secure their cloud infrastructure.
Anjali Thatte is a Product Manager at Datadog. She is currently focused on building technology to monitor AI infrastructure and ML tooling and helping customers gain visibility into their AI application technology stacks.
Jason Mimick is a Senior Partner Solutions Architect at AWS who works closely with the product, engineering, marketing, and sales teams on a daily basis.
Anuj Sharma is a Principal Solutions Architect at Amazon Web Services. He specializes in application modernization with hands-on technologies such as serverless, containers, generative AI, and observability. With over 18 years of application development experience, he currently leads co-builds with AWS software partners focused on containers and observability.