Describe the bug
NVIDIA GPU Operator Daemonset causes Cloudwatch Agent to no longer report container insights
Steps to reproduce
1. Create a cluster in EKS that uses EC2 Auto Scaling groups with g* family instance types
2. Install cloudwatch-agent via the EKS add-on
3. Install the NVIDIA GPU Operator (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
4. Scale the nodes in the Auto Scaling group down to 0
5. Scale the nodes back up to N > 0
6. Notice that Container Insights metrics for pods scheduled on the new nodes are not present
What did you expect to see?
I expected to see Container Insights metrics for pods on my GPU nodes.
What did you see instead?
I did not see any Container Insights metrics for those pods.
What version did you use?
1.300058.0b1191
What config did you use?
{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"kubernetes":{"cluster_name":"","enhanced_container_insights":true,"metrics_collection_interval":60}}},"metrics":{"metrics_collected":{"cpu":{"measurement":["cpu_usage_idle","cpu_usage_iowait","cpu_usage_user","cpu_usage_system"[],"metrics_collection_interval":60,"totalcpu":false},"disk":{"measurement":["used_percent"],"metrics_collection_interval":60,"resources":[""]},"diskio":{"measurement":["io_time","read_bytes","write_bytes","reads","writes"],"metrics_collection_interval":60 ,"resources":[""]},"mem":{"measurement":["mem_used_percent"],"metrics_collection_interval":60},"netstat":{"measurement":["tcp_established","tcp_time_wait"],"metrics_collection_interval":60},"swap":{"measurement":["swap_used_percent"],"metrics_collection_interval":60}},"namespace":"ContainerInsights"}}
Environment
EKS
Official Ubuntu 22.04 EKS Image - AMI ami-0bc9e5f6e68a2ea03
Additional context
The problem is that the NVIDIA GPU Operator is responsible for installing the nvidia-container-toolkit so that pods can access GPU resources, and it then restarts containerd.
Because the CloudWatch agent mounts /run/containerd/containerd.sock directly, this mount breaks when containerd restarts.
From then on, the CloudWatch agent gets "connection refused" on every gRPC call to this socket, forever.
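For context, here is a minimal sketch of the kind of hostPath mount involved; the volume name and image are assumptions for illustration, not the exact EKS add-on manifest:

```yaml
# Illustrative excerpt of the cloudwatch-agent DaemonSet pod spec.
# Because the hostPath points at the socket *file*, the bind mount keeps
# referencing the old socket inode after containerd restarts and recreates it.
spec:
  containers:
    - name: cloudwatch-agent
      image: public.ecr.aws/cloudwatch-agent/cloudwatch-agent:latest  # placeholder tag
      volumeMounts:
        - name: containerdsock
          mountPath: /run/containerd/containerd.sock
          readOnly: true
  volumes:
    - name: containerdsock
      hostPath:
        path: /run/containerd/containerd.sock
```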
I won't prescribe a solution, but internal discussions on our side were hoping for one of the following:
- The CloudWatch agent DaemonSet on Kubernetes could mount the socket directories instead of the sockets directly (possibly a security concern, but it would solve the bug; a rough sketch follows this list)
- The cloudwatch-agent pods could restart themselves if container runtime sockets that used to work stop working for an extended period of time
I don't see any other real options, but I'm hoping to have a discussion.
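A rough sketch of the first idea, assuming the directory mount is acceptable from a security standpoint (untested, names illustrative, not a proposed patch):

```yaml
# Mount the parent directory rather than the socket file. The bind mount then
# targets /run/containerd itself, so a containerd.sock recreated on restart is
# immediately visible inside the pod without restarting the agent.
spec:
  containers:
    - name: cloudwatch-agent
      volumeMounts:
        - name: containerd-run
          mountPath: /run/containerd
          readOnly: true
  volumes:
    - name: containerd-run
      hostPath:
        path: /run/containerd
        type: Directory
```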
This bug requires a kubectl rollout restart of the CloudWatch agent DaemonSet each time a new node autoscales.