WIP: Add instructions for running the example driver on GKE #93
base: main
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -383,6 +383,49 @@ Finally, you can run the following to cleanup your environment and delete the | |||||
| ./demo/delete-cluster.sh | ||||||
| ``` | ||||||
|
|
||||||
| ## Installing the example driver on a GKE cluster | ||||||
| It is also possible to run the example driver on a GKE cluster. For this, we | ||||||
| will use the pre-built image for the kubelet plugin, so there is no need | ||||||
| to build anything. All that is needed is a Google Cloud Platform account, | ||||||
| the gcloud CLI and Helm. | ||||||
|
|
||||||
| To keep things simple and identical to the Kind example, we will use a | ||||||
| single-node GKE cluster. | ||||||
|
|
||||||
| CDI must be enabled in containerd for the DRA driver to work. CDI is | ||||||
| enabled by default in GKE since 1.32.1-gke.1489001, so we will create | ||||||
| a cluster in the rapid channel to make sure we get a recent version. | ||||||
|
|
||||||
| Since DRA is still a beta feature, we need to explicitly enable it | ||||||
|
Contributor
Suggested change
|
||||||
| when the cluster is created. | ||||||
|
|
||||||
| First, create a GKE cluster with gcloud. | ||||||
| ```bash | ||||||
| gcloud container clusters create dra-example-driver-cluster \ | ||||||
| --location=us-central1-c \ | ||||||
| --release-channel=rapid \ | ||||||
| --num-nodes=1 \ | ||||||
| --enable-kubernetes-unstable-apis=resource.k8s.io/v1beta1/deviceclasses,resource.k8s.io/v1beta1/resourceclaims,resource.k8s.io/v1beta1/resourceclaimtemplates,resource.k8s.io/v1beta1/resourceslices | ||||||
| ``` | ||||||
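The `--enable-kubernetes-unstable-apis` value above is one long comma-separated string, which is easy to mistype. As a minimal sketch (the variable names `api_group_version` and `unstable_apis` are illustrative and not part of the PR), the same value can be assembled from the four resource names:

```bash
# Sketch: build the flag value from the resource names used in the command above.
# Variable names here are hypothetical helpers, not part of the PR.
api_group_version="resource.k8s.io/v1beta1"

unstable_apis=""
for resource in deviceclasses resourceclaims resourceclaimtemplates resourceslices; do
  # Prefix with a comma only when the list already has an entry.
  unstable_apis="${unstable_apis:+${unstable_apis},}${api_group_version}/${resource}"
done

# Print the flag so it can be compared with, or pasted into, the gcloud command.
echo "--enable-kubernetes-unstable-apis=${unstable_apis}"
```

The printed flag should match the one passed to `gcloud container clusters create` in the diff; the script itself does not call gcloud.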
|
|
||||||
| Once the cluster is ready, we can install the DRA driver using Helm. | ||||||
|
|
||||||
| The kubelet plugin in the example driver is set up to run with priority class | ||||||
| `system-node-critical`. On GKE, pods are by default restricted from running | ||||||
| with this priority class, so we need to use a ResourceQuota to allow it. The | ||||||
| Helm chart supports this; we just have to enable it. | ||||||
|
|
||||||
| ```bash | ||||||
| helm upgrade -i \ | ||||||
| --create-namespace \ | ||||||
| --namespace dra-example-driver \ | ||||||
| --set=resourcequota.enabled=true \ | ||||||
| dra-example-driver \ | ||||||
| deployments/helm/dra-example-driver | ||||||
|
Contributor
Using the chart from the main branch might not always be compatible with the latest released version of the container image. Ideally we'd stick to either building the image from source for the checked out version of the chart (like the e2e CI tests do), or use compatible published release versions of each.

I see published releases won't work here immediately without these unreleased changes to the chart, but we could merge the chart changes, cut a chart release, then merge the docs.

Seamless upgrade support for 1.33 is one thing I'm anticipating which will involve changes to both the image and the chart. Installing a chart that is set up for seamless upgrades and an image that isn't will likely cause issues, though that might only be in marginal cases users walking through the demo wouldn't normally hit. I haven't thought that particular scenario all the way through. For sensitive changes like that I'm confident we can work things out such that a little bit of skew is probably fine, but zero skew is of course preferred.
||||||
| ``` | ||||||
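After the Helm install, a quick check can confirm that the quota exists and the kubelet plugin was admitted. The helper below is a hedged sketch, not part of the PR: the function name `check_driver_install` is hypothetical, and it assumes `kubectl` already points at the new GKE cluster and that the namespace is the `dra-example-driver` one used in the Helm command above.

```bash
# Sketch: inspect the objects the Helm release is expected to create.
# check_driver_install is a hypothetical helper name, not part of the PR.
check_driver_install() {
  ns="${1:-dra-example-driver}"

  # The ResourceQuota that permits the system-node-critical priority class.
  kubectl get resourcequota --namespace "$ns"

  # The driver pods; the kubelet plugin should reach Running once admitted.
  kubectl get pods --namespace "$ns"

  # ResourceSlices are cluster-scoped and are published by the kubelet plugin.
  kubectl get resourceslices
}
```

Calling `check_driver_install` with no argument checks the default namespace from the Helm command; pass another namespace as the first argument if the release was installed elsewhere.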
|
|
||||||
| The examples in `demo/gpu-test{1,2,3,4,5}.yaml` work just like with Kind. | ||||||
|
Contributor
Suggested change
|
||||||
|
|
||||||
| ## Anatomy of a DRA resource driver | ||||||
|
|
||||||
| TBD | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| {{- if .Values.resourcequota.enabled }} | ||
| apiVersion: v1 | ||
| kind: ResourceQuota | ||
| metadata: | ||
| name: {{ include "dra-example-driver.fullname" . }}-resourcequota | ||
| namespace: {{ include "dra-example-driver.namespace" . }} | ||
| spec: | ||
| hard: | ||
| pods: {{ .Values.resourcequota.pods }} | ||
| {{- with .Values.resourcequota.scopeSelector.matchExpressions }} | ||
| scopeSelector: | ||
| matchExpressions: | ||
| {{- toYaml . | nindent 4 }} | ||
| {{- end }} | ||
| {{- end }} |
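To see what this template produces without touching a cluster, the chart can be rendered locally. This is a sketch under assumptions: the function name `render_chart` is hypothetical, the chart path is the one used in the README's Helm command, and the command is assumed to run from the repository root.

```bash
# Sketch: render the chart locally with the ResourceQuota enabled.
# render_chart is a hypothetical helper name, not part of the PR.
render_chart() {
  helm template dra-example-driver deployments/helm/dra-example-driver \
    --namespace dra-example-driver \
    --set resourcequota.enabled=true
}

# Example use (run from the repository root):
#   render_chart | grep -A 15 'kind: ResourceQuota'
```

With `resourcequota.enabled` left at its default of `false`, the outer `if` in the template should drop the object from the rendered output entirely.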
| Original file line number | Diff line number | Diff line change | ||
|---|---|---|---|---|
|
|
@@ -87,3 +87,14 @@ webhook: | |||
| # The name of the service account to use. | ||||
| # If not set and create is true, a name is generated using the fullname template | ||||
| name: "" | ||||
|
|
||||
| resourcequota: | ||||
| enabled: false | ||||
|
Contributor
I'd vote to enable this for e2e tests for that extra bit of coverage like we do for the webhook which is disabled by default: dra-example-driver/test/e2e/setup-e2e.sh Line 36 in e82f291
|
||||
| pods: 10 | ||||
| scopeSelector: | ||||
| matchExpressions: | ||||
| - operator: In | ||||
| scopeName: PriorityClass | ||||
| values: | ||||
| - system-node-critical | ||||
| - system-cluster-critical | ||||
It would be great to also run the e2e test on GKE to make sure we don't accidentally break this. I'm not sure if there's an existing pattern to do that in Prow or if GitHub Actions makes that easy. We can definitely address that later.