Nvidia GPU Operator Daemonset causes Cloudwatch Agent to lose Containerd socket on node autoscale up #1818

Description

@acwrenn53

Describe the bug
The NVIDIA GPU Operator DaemonSet causes the CloudWatch agent to stop reporting Container Insights metrics.

Steps to reproduce
0. Create an EKS cluster that uses EC2 Auto Scaling groups with g* family instance types

  1. Install the CloudWatch agent via the EKS add-on
  2. Install the NVIDIA GPU Operator (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html)
  3. Scale the nodes in the Auto Scaling group down to 0
  4. Scale the nodes back up to N > 0 (steps 3 and 4 are sketched right after this list)
  5. Notice that Container Insights metrics for pods scheduled on the new nodes are not present
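For steps 3 and 4, a minimal sketch using boto3; the Auto Scaling group name and capacities below are placeholders, not values from my setup:

    # Scale the GPU Auto Scaling group to 0 and back up to reproduce the issue.
    import time
    import boto3

    asg = boto3.client("autoscaling", region_name="us-west-2")
    ASG_NAME = "my-gpu-node-asg"  # placeholder ASG name

    # Step 3: scale down to zero (the ASG's MinSize must allow 0).
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=0)

    # Give the nodes time to terminate before scaling back up.
    time.sleep(600)

    # Step 4: scale back up. On the fresh node the GPU Operator installs the
    # nvidia-container-toolkit and restarts containerd.
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=2)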

What did you expect to see?
I expected to see container insights for pods on my GPU nodes

What did you see instead?
I did not see insights

What version did you use?
1.300058.0b1191

What config did you use?
{"agent":{"region":"us-west-2"},"logs":{"metrics_collected":{"kubernetes":{"cluster_name":"","enhanced_container_insights":true,"metrics_collection_interval":60}}},"metrics":{"metrics_collected":{"cpu":{"measurement":["cpu_usage_idle","cpu_usage_iowait","cpu_usage_user","cpu_usage_system"[],"metrics_collection_interval":60,"totalcpu":false},"disk":{"measurement":["used_percent"],"metrics_collection_interval":60,"resources":[""]},"diskio":{"measurement":["io_time","read_bytes","write_bytes","reads","writes"],"metrics_collection_interval":60 ,"resources":[""]},"mem":{"measurement":["mem_used_percent"],"metrics_collection_interval":60},"netstat":{"measurement":["tcp_established","tcp_time_wait"],"metrics_collection_interval":60},"swap":{"measurement":["swap_used_percent"],"metrics_collection_interval":60}},"namespace":"ContainerInsights"}}
Environment
EKS
Official Ubuntu 22.04 EKS Image - AMI ami-0bc9e5f6e68a2ea03

Additional context
The problem is that the NVIDIA GPU Operator is responsible for installing the nvidia-container-toolkit so that pods can access GPU resources, and it restarts containerd after doing so.
Because the CloudWatch agent DaemonSet mounts /run/containerd/containerd.sock (the socket file) directly, that mount goes stale when containerd restarts: the pod keeps pointing at the old socket while containerd listens on a newly created one at the same host path.
The CloudWatch agent then logs "connection refused" for every gRPC call to that socket, indefinitely. A sketch of the failure mode follows.
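To illustrate, a minimal probe (not the agent's actual code) that dials the socket path from inside a pod that bind-mounts the socket file; after the GPU Operator restarts containerd on the host, it fails with "connection refused" forever:

    # Dial the containerd socket the same way the agent's mount exposes it.
    import socket
    import time

    SOCK_PATH = "/run/containerd/containerd.sock"  # path as mounted into the pod

    while True:
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect(SOCK_PATH)
            print("containerd socket reachable")
        except OSError as e:
            # Once containerd has been restarted on the host, this loops on
            # "connection refused": the bind mount still points at the old,
            # now-dead socket rather than the recreated one.
            print(f"dial failed: {e}")
        finally:
            s.close()
        time.sleep(10)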

I won't prescribe a solution, but internal discussion on our side converged on one of the following:

  1. The CloudWatch agent DaemonSet could mount the socket's parent directory instead of the socket file itself (possibly a security concern, but it would fix the bug, since the recreated socket would appear inside the mounted directory).
  2. The CloudWatch agent pods could restart themselves when a container runtime socket that used to work has been failing for an extended period (a rough sketch of this idea is at the end of this section).

I don't see any other realistic options, but I'm hoping to have a discussion.
As it stands, this bug requires a kubectl rollout restart of the CloudWatch agent DaemonSet every time a new node autoscales up.
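As a rough sketch of option 2 (this is not existing agent behavior, just an illustration; the socket path and timeout are placeholders): a watchdog loop that exits non-zero once a previously working containerd socket has been unreachable for a sustained window, so the kubelet recreates the container and the hostPath bind mount is set up again against the current socket:

    # Watchdog: exit if the containerd socket stays unreachable for too long.
    import socket
    import sys
    import time

    SOCK_PATH = "/run/containerd/containerd.sock"
    FAILURE_WINDOW = 5 * 60  # seconds of continuous failure before giving up

    def socket_alive(path):
        """Return True if a unix-socket connect to `path` succeeds."""
        s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        s.settimeout(2)
        try:
            s.connect(path)
            return True
        except OSError:
            return False
        finally:
            s.close()

    first_failure = None
    while True:
        if socket_alive(SOCK_PATH):
            first_failure = None
        else:
            first_failure = first_failure or time.monotonic()
            if time.monotonic() - first_failure > FAILURE_WINDOW:
                # Exit non-zero so the container is restarted and the socket
                # mount is re-established against the live socket on the host.
                sys.exit(1)
        time.sleep(15)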
