Tolerations not applied to the DcgmExporter DaemonSet #290

@NicolasGrosjeanProbayes

Description

What I have done

I added a toleration that I found here to the DcgmExporter resource, using the DcgmExporter definition from your repository.

apiVersion: cloudwatch.aws.amazon.com/v1alpha1
kind: DcgmExporter
metadata:
  name: dcgm-exporter
  namespace: amazon-cloudwatch
  labels:
    k8s-app: dcgm-exporter
    version: v1
spec:
  image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04
  nodeSelector:
    kubernetes.io/os: linux
  serviceAccount: dcgm-exporter-service-acct
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
[...]

Problem

The toleration is missing from the generated DaemonSet.
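
One way to confirm this is to print the tolerations on the DaemonSet's pod template. This is a minimal sketch that assumes the generated DaemonSet is named dcgm-exporter in the amazon-cloudwatch namespace, matching the resource metadata above; the helper name show_tolerations is hypothetical.

```shell
# Print the tolerations set on the generated DaemonSet's pod template.
# Assumption: the DaemonSet is named "dcgm-exporter" in "amazon-cloudwatch".
# Empty output means the pod template has no tolerations.
show_tolerations() {
  kubectl get daemonset dcgm-exporter -n amazon-cloudwatch \
    -o jsonpath='{.spec.template.spec.tolerations}'
}
```

Calling show_tolerations against the cluster should print nothing while the problem is present, and a list containing the nvidia.com/gpu entry once the toleration is in place.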

Workaround

Our AWS Solutions Architect found a workaround: manually adding the toleration to the DaemonSet.

kubectl patch daemonset dcgm-exporter -n amazon-cloudwatch --type='json' -p='[ { "op": "add", "path": "/spec/template/spec/tolerations", "value": [ { "key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule" } ] } ]'
