Nvidia GPU Operator Daemonset causes Cloudwatch Agent to lose Containerd socket on node autoscale up #1818

Description

@acwrenn53

Describe the bug
The NVIDIA GPU Operator DaemonSet causes the CloudWatch agent to stop reporting Container Insights metrics.

Steps to reproduce
0. Create an EKS cluster that uses EC2 Auto Scaling groups with g* family instance types

  1. Install the CloudWatch agent via the EKS add-on
  2. Install the NVIDIA GPU Operator (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
  3. Scale the nodes in the Auto Scaling group down to 0
  4. Scale the nodes back up to N > 0 (steps 3 and 4 are sketched right after this list)
  5. Notice that Container Insights metrics for pods scheduled on the new nodes are not present
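For steps 3 and 4, a minimal sketch using boto3; the Auto Scaling group name and capacities below are placeholders, not values from my setup:

    # Scale the GPU Auto Scaling group to 0 and back up to reproduce the issue.
    import time
    import boto3

    asg = boto3.client("autoscaling", region_name="us-west-2")
    ASG_NAME = "my-gpu-node-asg"  # placeholder ASG name

    # Step 3: scale down to zero (the ASG's MinSize must allow 0).
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=0)

    # Give the nodes time to terminate before scaling back up.
    time.sleep(600)

    # Step 4: scale back up. On the fresh node the GPU Operator installs the
    # nvidia-container-toolkit and restarts containerd.
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=2)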

What did you expect to see?
I expected to see container insights for pods on my GPU nodes

What did you see instead?
I did not see insights

What version did you use?
1.300058.0b1191

What config did you use?
{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"kubernetes":{"cluster_name":"","enhanced_container_insights":true,"metrics_collection_interval":60}}},"metrics":{"metrics_collected":{"cpu":{"measurement":["cpu_usage_idle","cpu_usage_iowait","cpu_usage_user","cpu_usage_system"[],"metrics_collection_interval":60,"totalcpu":false},"disk":{"measurement":["used_percent"],"metrics_collection_interval":60,"resources":[""]},"diskio":{"measurement":["io_time","read_bytes","write_bytes","reads","writes"],"metrics_collection_interval":60 ,"resources":[""]},"mem":{"measurement":["mem_used_percent"],"metrics_collection_interval":60},"netstat":{"measurement":["tcp_established","tcp_time_wait"],"metrics_collection_interval":60},"swap":{"measurement":["swap_used_percent"],"metrics_collection_interval":60}},"namespace":"ContainerInsights"}}
Environment
EKS
Official Ubuntu 22.04 EKS Image - AMI ami-0bc9e5f6e68a2ea03

Additional context
The problem is that the NVIDIA GPU Operator is responsible for installing the nvidia-container-toolkit so that pods can access GPU resources, and it restarts containerd after doing so.
Because the CloudWatch agent DaemonSet mounts /run/containerd/containerd.sock (the socket file) directly, that mount goes stale when containerd restarts: the pod keeps pointing at the old socket while containerd listens on a newly created one at the same host path.
The CloudWatch agent then logs "connection refused" for every gRPC call to that socket, indefinitely. A sketch of the failure mode follows.
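To illustrate, a minimal probe (not the agent's actual code) that dials the socket path from inside a pod that bind-mounts the socket file; after the GPU Operator restarts containerd on the host, it fails with "connection refused" forever:

    # Dial the containerd socket the same way the agent's mount exposes it.
    import socket
    import time

    SOCK_PATH = "/run/containerd/containerd.sock"  # path as mounted into the pod

    while True:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect(SOCK_PATH)
            print("containerd socket reachable")
        except OSError as e:
            # Once containerd has been restarted on the host, this loops on
            # "connection refused": the bind mount still points at the old,
            # now-dead socket rather than the recreated one.
            print(f"dial failed: {e}")
        finally:
            s.close()
        time.sleep(10)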

I won't prescribe a solution, but internal discussion on our side converged on one of the following:

  1. The CloudWatch agent DaemonSet could mount the socket's parent directory instead of the socket file itself (possibly a security concern, but it would fix the bug, since the recreated socket would appear inside the mounted directory).
  2. The CloudWatch agent pods could restart themselves when a container runtime socket that used to work has been failing for an extended period (a rough sketch of this idea is at the end of this section).

I don't see any other realistic options, but I'm hoping to have a discussion.
As it stands, this bug requires a kubectl rollout restart of the CloudWatch agent DaemonSet every time a new node autoscales up.
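As a rough sketch of option 2 (this is not existing agent behavior, just an illustration; the socket path and timeout are placeholders): a watchdog loop that exits non-zero once a previously working containerd socket has been unreachable for a sustained window, so the kubelet recreates the container and the hostPath bind mount is set up again against the current socket:

    # Watchdog: exit if the containerd socket stays unreachable for too long.
    import socket
    import sys
    import time

    SOCK_PATH = "/run/containerd/containerd.sock"
    FAILURE_WINDOW = 5 * 60  # seconds of continuous failure before giving up

    def socket_alive(path):
        """Return True if a unix-socket connect to `path` succeeds."""
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect(path)
            return True
        except OSError:
            return False
        finally:
            s.close()

    first_failure = None
    while True:
        if socket_alive(SOCK_PATH):
            first_failure = None
        else:
            first_failure = first_failure or time.monotonic()
            if time.monotonic() - first_failure > FAILURE_WINDOW:
                # Exit non-zero so the container is restarted and the socket
                # mount is re-established against the live socket on the host.
                sys.exit(1)
        time.sleep(15)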
