Building hardware resilience into your training infrastructure is critical to mitigating risks and enabling uninterrupted model training. With features such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.
In this post, we introduce the AWS Neuron node problem detection and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect the rare occurrences in which Neuron devices fail by monitoring the node logs. It marks worker nodes with a failed Neuron device as unhealthy and quickly replaces them with new worker nodes. By accelerating issue detection and resolution, you increase the reliability of your ML training and reduce the time and cost lost to hardware failures.
This solution is applicable if you are using managed node groups or self-managed node groups (backed by Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.
Solution Overview
The solution is based on the node problem detector and recovery DaemonSet, a tool designed to automatically detect and report various node-level issues in a Kubernetes cluster.
The node problem detector component continuously monitors the kernel message (kmsg) logs on the worker nodes. When it detects error messages related specifically to the Neuron device (the Trainium or AWS Inferentia chip), it sets the NodeCondition to NeuronHasError on the Kubernetes API server.
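For illustration, once the DaemonSet is running you can inspect this condition directly with kubectl; the node name below is only an example (it reappears later in this walkthrough), and the exact condition fields may vary:
# Show the NeuronHasError condition reported by the node problem detector
# (replace the node name with one of your Trainium or Inferentia worker nodes)
kubectl get node ip-100-64-58-151.us-east-2.compute.internal \
  -o jsonpath='{.status.conditions[?(@.type=="NeuronHasError")]}'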
The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating a problem with the Neuron device, it takes automated action. First, it marks the affected instance in the relevant Auto Scaling group as unhealthy, which prompts the Auto Scaling group to terminate the instance and launch a replacement. Additionally, the node recovery agent publishes Amazon CloudWatch metrics so users can monitor and alert on these events.
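Conceptually, this recovery path maps to two AWS API calls. The following AWS CLI commands are only a sketch of what the agent automates, not its actual implementation; the instance ID is a placeholder, and the namespace and metric name are the ones referenced later in this post:
# Mark the affected instance as unhealthy so its Auto Scaling group replaces it
aws autoscaling set-instance-health \
  --instance-id i-0123456789abcdef0 \
  --health-status Unhealthy

# Publish a CloudWatch metric so the event can be monitored and alerted on
aws cloudwatch put-metric-data \
  --namespace NeuronHealthCheck \
  --metric-name NeuronHasError_DMA_ERROR \
  --value 1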
The following diagram illustrates the solution architecture and workflow.
In the following tutorial, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for node problem detector, and inject a failure message into the node. We then observe that the failing node is stopped and replaced by a new one, and find a metric in CloudWatch that indicates the failure.
Prerequisites
Before you begin, make sure you have the following tools installed on your machine:
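At a minimum, the steps and examples that follow use the AWS CLI, kubectl, and Terraform; you can quickly confirm they are available:
aws --version
kubectl version --client
terraform version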
Deploy the node problem detection and recovery plugin
Complete the following steps to configure the Node Problem Detection and Recovery plugin:
- Create an EKS cluster using the Data on EKS Terraform module:
- Install the AWS Identity and Access Management (IAM) role required for the service account, and install the node problem detector plugin.
- Create a policy as shown below. Update the Resource key value to match the ARN of the node group that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.
You can get these values from the Amazon EKS console: choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.
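If you prefer the AWS CLI, you can retrieve the same values programmatically; the node group name below is a placeholder:
# List the node groups in the cluster
aws eks list-nodegroups --cluster-name trainium-inferentia

# Get the node group ARN and the name of the backing Auto Scaling group
aws eks describe-nodegroup \
  --cluster-name trainium-inferentia \
  --nodegroup-name <your-node-group-name> \
  --query '{arn: nodegroup.nodegroupArn, asg: nodegroup.resources.autoScalingGroups[0].name}'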
This component will be installed as a DaemonSet on your EKS cluster.
The container images referenced in the Kubernetes manifests are hosted in public registries such as registry.k8s.io and public.ecr.aws. For production environments, we recommend that customers limit such external dependencies by hosting container images in a private registry and syncing images from the public repositories. For detailed implementation guidance, see the blog post Announcing pull through cache for registry.k8s.io on Amazon Elastic Container Registry.
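As a sketch of that approach, Amazon ECR supports pull through cache rules for both upstream registries; the repository prefixes below are arbitrary examples:
# Cache images from registry.k8s.io in your private Amazon ECR registry
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix registry-k8s-io \
  --upstream-registry-url registry.k8s.io

# Cache images from the Amazon ECR Public registry
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix ecr-public \
  --upstream-registry-url public.ecr.aws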
By default, the node problem detector will not take any action in case of node failure. If you want the agent to automatically terminate the EC2 instance, update the DaemonSet as follows:
kubectl edit -n neuron-healthcheck-system ds/node-problem-detector
...
env:
- name: ENABLE_RECOVERY
value: "true"
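Alternatively, you can make the same change non-interactively with kubectl set env, assuming the DaemonSet name and namespace shown above:
# Set the recovery flag on the DaemonSet's containers
kubectl set env -n neuron-healthcheck-system ds/node-problem-detector ENABLE_RECOVERY=true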
Try the node problem detection and recovery solution
Once the plugin is installed, you can see the Neuron conditions appear when you run the kubectl describe node command.
We simulate a device failure by injecting error logs into the instance:
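One way to inject a message into the kernel log is to write to /dev/kmsg on the worker node; the message text below is only a placeholder, and it must match an error pattern that the Neuron node problem detector recognizes:
# Run on the affected worker node (for example, via SSM Session Manager);
# the message is a placeholder for a Neuron hardware error line
echo "<neuron-error-message-recognized-by-the-detector>" | sudo tee /dev/kmsg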
After about 2 minutes, you will be able to see that the error has been identified:
kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7
Now that the node problem detector has detected the failure and the recovery agent has automatically marked the node as unhealthy, Amazon EKS will cordon the node and evict the pods running on it:
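To follow along from the command line, you can watch node and pod status with standard kubectl commands; the node name is the example used earlier:
# Watch the node being cordoned (SchedulingDisabled) and later replaced
kubectl get nodes -w

# Check recent events for the affected node
kubectl get events --field-selector involvedObject.name=ip-100-64-58-151.us-east-2.compute.internal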
You can open the CloudWatch console and check the metrics in the NeuronHealthCheck namespace. You can see that the NeuronHasError_DMA_ERROR metric has the value 1.
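You can also check the same namespace from the AWS CLI; this is a generic example:
# List the Neuron health metrics published by the recovery agent
aws cloudwatch list-metrics --namespace NeuronHealthCheck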
After the replacement, you can see that a new worker node has been created:
Let's look at a real-world scenario in which a distributed training job is run using an MPI operator, as described in Llama-2 on Trainium, and an unrecoverable Neuron error occurs on one of the nodes. Before the plugin is deployed, the training job becomes stuck, resulting in wasted time and computational costs. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. In your training scripts, save checkpoints periodically so that training can resume from the previous checkpoint.
The following screenshot shows example logs from a distributed training job.
Training has started. (You can ignore loss=nan for now; it is a known issue and will be fixed. In the meantime, refer to the reduction_train_loss metric.)
The following screenshot shows the checkpoint created at step 77.
Training stopped at step 86, after one of the nodes had a problem. (The error was injected manually for testing purposes.)
After the faulty node was detected and replaced by the Neuron node problem detection and recovery plugin, the training process resumed from step 77, which was the last checkpoint.
Although Auto Scaling groups will stop nodes that are not functioning properly, they may encounter issues that prevent replacement nodes from being launched. In such cases, training jobs will stall and require manual intervention. However, the stopped node will not incur further charges for the associated EC2 instance.
If you want to take custom actions beyond stopping instances, you can create CloudWatch alarms that watch the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query such as SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to sum these values and evaluate the alarms. The following screenshots show an example.
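The following AWS CLI command is a minimal sketch of such an alarm, assuming the NeuronHealthCheck namespace shown above; the alarm name is arbitrary, and you should adjust the threshold, period, and alarm actions for your environment:
# Create an alarm on a CloudWatch Metrics Insights query over the Neuron error metric
aws cloudwatch put-metric-alarm \
  --alarm-name neuron-dma-error \
  --comparison-operator GreaterThanThreshold \
  --threshold 0 \
  --evaluation-periods 1 \
  --metrics '[{"Id":"q1","Expression":"SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck","Period":60,"ReturnData":true}]'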
Clean up
To clean up all the resources provisioned for this post, run the cleanup script:
Conclusion
In this post, we showed how the Neuron node problem detection and recovery DaemonSet for Amazon EKS works on EC2 instances powered by Trainium and AWS Inferentia. If you are running Neuron-based EC2 instances with managed node groups or self-managed node groups, you can deploy the detection and recovery DaemonSet on your EKS cluster and benefit from improved reliability and fault tolerance for your machine learning training workloads in the event of node failure.
About the authors
Harish Rao is a Senior Solutions Architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Ziwen Ning is a Software Development Engineer at AWS. He is currently focused on improving the AI/ML experience by integrating AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming, and other sports, and immersing himself in music.
Geeta Gharpure is a Senior Software Developer on the Annapurna ML Engineering team. She focuses on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, California, and enjoys listening to Audible in her spare time.
Darren Lin is a Cloud Native Solutions Architect at AWS focusing on domains such as Linux, Kubernetes, Containers, Observability, and Open Source technologies. In his free time, he enjoys working out and having fun with his family.