README.md (+43 −0)

@@ -383,6 +383,49 @@ Finally, you can run the following to cleanup your environment and delete the

```bash
./demo/delete-cluster.sh
```

## Installing the example driver on a GKE cluster

> **Reviewer comment (Contributor):** It would be great to also run the e2e test on GKE to make sure we don't accidentally break this. I'm not sure if there's an existing pattern to do that in Prow or if GitHub Actions makes that easy. We can definitely address that later.

It is also possible to run the example driver on a GKE cluster. For this, we
will use the pre-built image for the kubelet plugin, so there is no need
to build anything. All that is needed is a Google Cloud Platform account,
the gcloud CLI, and Helm.

To keep things simple and identical to the Kind example, we will use a
single-node GKE cluster.

CDI must be enabled in containerd for the DRA driver to work. CDI is
enabled by default in GKE since 1.32.1-gke.1489001, so we will create
a cluster in the rapid channel to make sure we get a recent version.
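
GKE patch versions can be compared with a plain version sort. The sketch
below (a hypothetical check, using a hard-coded version string in place of
one read from `kubectl get nodes`) tests whether a node runs at least the
release where CDI became the default:

```bash
# Minimum GKE version where CDI is enabled by default in containerd.
min="1.32.1-gke.1489001"
# In a real check this would come from, e.g.:
#   kubectl get nodes -o jsonpath='{.items[0].status.nodeInfo.kubeletVersion}'
node_version="1.32.2-gke.1182001"

# sort -V orders version strings; if the minimum sorts first, we are new enough.
if [ "$(printf '%s\n' "$min" "$node_version" | sort -V | head -n1)" = "$min" ]; then
  echo "CDI enabled by default"
else
  echo "GKE version too old for default CDI"
fi
```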

Since DRA is still a beta feature, we need to explicitly enable it
when the cluster is created.

First, create a GKE cluster with gcloud.
```bash
gcloud container clusters create dra-example-driver-cluster \
--location=us-central1-c \
--release-channel=rapid \
--num-nodes=1 \
--enable-kubernetes-unstable-apis=resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices
```
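
Once the create command finishes, point kubectl and Helm at the new
cluster by fetching credentials (a sketch; it assumes the same cluster
name and location as above):

```bash
# Fetch credentials so kubectl and helm target the new cluster.
gcloud container clusters get-credentials dra-example-driver-cluster \
  --location=us-central1-c

# Sanity check: the single node should be Ready.
kubectl get nodes
```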

Once the cluster is ready, we can install the DRA driver using Helm.

The kubelet plugin in the example driver is set up to run with priority class
`system-node-critical`. On GKE, pods are by default restricted from running
with this priority class, so we need to use a ResourceQuota to allow it. The
Helm chart supports this; we just have to enable it.
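
With the quota enabled and the chart's default values, the rendered
ResourceQuota looks roughly like this (an illustrative rendering; the
actual name comes from the chart's fullname template):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: dra-example-driver-resourcequota
  namespace: dra-example-driver
spec:
  hard:
    pods: 10
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - system-node-critical
          - system-cluster-critical
```

Because the quota is scoped to the two system priority classes, it only
limits pods that request them, which is what lifts GKE's default
restriction for the kubelet plugin.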

```bash
helm upgrade -i \
--create-namespace \
--namespace dra-example-driver \
--set=resourcequota.enabled=true \
dra-example-driver \
deployments/helm/dra-example-driver
```

> **Reviewer comment (Contributor):** Using the chart from the main branch might not always be compatible with the latest released version of the container image. Ideally we'd stick to either building the image from source for the checked out version of the chart (like the e2e CI tests do), or use compatible published release versions of each. I see published releases won't work here immediately without these unreleased changes to the chart, but we could merge the chart changes, cut a chart release, then merge the docs.
>
> Seamless upgrade support for 1.33 is one thing I'm anticipating which will involve changes to both the image and the chart. Installing a chart that is set up for seamless upgrades and an image that isn't will likely cause issues, though that might only be in marginal cases users walking through the demo wouldn't normally hit. I haven't thought that particular scenario all the way through. For sensitive changes like that I'm confident we can work things out such that a little bit of skew is probably fine, but zero skew is of course preferred.

The examples in `demo/gpu-test{1,2,3,4,5}.yaml` work just like with Kind.

## Anatomy of a DRA resource driver

TBD

deployments/helm/dra-example-driver/Chart.yaml (+3 −1)

@@ -25,4 +25,6 @@ version: 0.0.0-dev
# It is recommended to use it with quotes.
appVersion: "v0.1.0"

kubeVersion: "1.32.x"
# The "-0" suffix is to make sure the chart works on GKE clusters, which use versions of
# the form 1.32.1-gke.1234567.
kubeVersion: "1.32.x-0"
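
The reason the `-0` is needed: under SemVer, everything after the first
`-` (like `-gke.1489001`) is a pre-release tag, and pre-release versions
are excluded from a plain range such as `1.32.x` unless the range itself
names a pre-release bound; `-0` is the lowest possible one. A small
sketch (a hypothetical `has_prerelease` helper, not part of the chart) of
how such versions are classified:

```bash
# SemVer treats everything after the first "-" (before any "+") as a
# pre-release tag, so GKE's 1.32.1-gke.1489001 is a "pre-release" of
# 1.32.1 and would not match the constraint "1.32.x" on its own.
has_prerelease() {
  core="${1%%+*}"           # drop build metadata, if any
  case "$core" in
    *-*) echo "pre-release" ;;
    *)   echo "release" ;;
  esac
}

has_prerelease "1.32.1-gke.1489001"   # pre-release: needs the "-0" bound
has_prerelease "1.32.1"               # release: matched by plain "1.32.x"
```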

deployments/helm/dra-example-driver/templates/resourcequota.yaml (+15 −0)

@@ -0,0 +1,15 @@
{{- if .Values.resourcequota.enabled }}
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {{ include "dra-example-driver.fullname" . }}-resourcequota
  namespace: {{ include "dra-example-driver.namespace" . }}
spec:
  hard:
    pods: {{ .Values.resourcequota.pods }}
  {{- with .Values.resourcequota.scopeSelector.matchExpressions }}
  scopeSelector:
    matchExpressions:
      {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}

deployments/helm/dra-example-driver/values.yaml (+11 −0)

@@ -87,3 +87,14 @@ webhook:
# The name of the service account to use.
# If not set and create is true, a name is generated using the fullname template
name: ""

resourcequota:
  enabled: false
  pods: 10
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - system-node-critical
          - system-cluster-critical

> **Reviewer comment (Contributor):** I'd vote to enable this for e2e tests for that extra bit of coverage like we do for the webhook which is disabled by default:
>
>     --set webhook.enabled=true \