Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SGX: set EPC limits via NRI annotations #1582

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Conversation

mythi
Copy link
Contributor

@mythi mythi commented Oct 30, 2023

No description provided.

@mythi mythi added the sgx SGX device plugin related issue label Oct 30, 2023
@mythi mythi force-pushed the PR-2023-050 branch 3 times, most recently from 66288ca to 56c0a9f Compare October 31, 2023 06:41
@codecov-commenter
Copy link

codecov-commenter commented Oct 31, 2023

Codecov Report

Attention: 30 lines in your changes are missing coverage. Please review.

Comparison is base (eb0de47) 51.94% compared to head (430a729) 51.68%.

❗ Current head 430a729 differs from pull request most recent head c876e22. Consider uploading reports for the commit c876e22 to get more accurate results

Files Patch % Lines
pkg/controllers/sgx/controller.go 0.00% 29 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1582      +/-   ##
==========================================
- Coverage   51.94%   51.68%   -0.26%     
==========================================
  Files          42       42              
  Lines        4911     4941      +30     
==========================================
+ Hits         2551     2554       +3     
- Misses       2210     2237      +27     
  Partials      150      150              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -32,10 +32,22 @@ spec:
- name: sgx-provision
mountPath: /dev/sgx_provision
readOnly: true
- name: nri-sgx-epc
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the node selector for the NRI plugin could also including the intel.feature.node.kubernetes.io/sgx label that adds the NFD add-on to SGX enabled nodes?

This label is already used for the NFD NodeFeatureRule (see here). This more specific node selector could be an optional kustomize patch (or added to the epc-nfd overlay) and perhaps used for the entire DaemonSet? This would ensure that the components only run on the nodes on which they are required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can address this in the next rev. The current draft just pulls the NRI bits + the SGX webhook modifications and does not add support for any other deployment options just yet.

I was also checking if NFD can check "NRI is enabled" but that does not seem to be available.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in case an additional node selector would be added, you would like to add it to all deployment options and not just for the one based on the kustomize files (described here)?

Should I create an official feature request for that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This more specific node selector could be an optional kustomize patch (or added to the epc-nfd overlay) and perhaps used for the entire DaemonSet?

I could add intel.feature.node.kubernetes.io/sgx to the epc-nfd overlay by default. The "base" daemonSet is kept minimal and I'm likely going to drop the NRI bits to an overlay in the final version to make it user "opt-in".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ffuerste I made some updates to this. let me know how it looks

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply.

I really like the implementation of the NFD label as a kustomize patch as part of the overlay for the intel-sgx-plugin DaemonSet.

Also, I think the defining the NFD label as a nodeSelector for the new cAdvisor DaemonSet is a pretty good idea since you named the directory for the resource files sgx_epc_metrics. This clearly points to the use case of providing metrics on EPC consumption.

However, maybe we should add a short comment to the label defined as the nodeSelector in the patches? E.g. # intel.feature.node.kubernetes.io/sgx: 'true' # added to SGX enabled nodes by NFD add-on. Maybe this is not necessary for the epc-nfd overlay, but I think it's beneficial for the cAdvisor resource file, because currently the dependency to the NFD add-on is not specified anywhere.

Moreover, maybe the new cAdvisor component should be mentioned in the README of the sgx plugin? Or you don't want to add it there, because it's maybe only needed until the cAdvisor in the kubelet is supporting the misc metrics?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think it's beneficial for the cAdvisor resource file, because currently the dependency to the NFD add-on is not specified anywhere.

I can add the comment my next rebase. I'm not sure the label is going to be needed in the final version because cAdvisor could be used for cluster telemetry in general. Unless of course the desire is to only get the EPC+memory metrics... Let's see.

Moreover, maybe the new cAdvisor component should be mentioned in the README of the sgx plugin? Or you don't want to add it there, because it's maybe only needed until the cAdvisor in the kubelet is supporting the misc metrics?

I will add it later once I have better visibility on the PR readiness (which depends on runc 1.2 release I believe). We are most likely going to need the "standalone" deployment for a while.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you don't want to add it there, because it's maybe only needed until the cAdvisor in the kubelet is supporting the misc metrics?

I do not think kubelet will be adding metrics, as there are plans on reducing them:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I had checked with the KEP author on what the best approach for me would be and the suggestion was to start with runc+cAdvisor side of things. Longer run we'd need misc to be part of runtimes' metrics which is the future anyway but for now we'd use a standalone cAdvisor.

deployments/sgx_plugin/base/intel-sgx-plugin.yaml Outdated Show resolved Hide resolved
pkg/webhooks/sgx/sgx.go Outdated Show resolved Hide resolved
@mythi mythi force-pushed the PR-2023-050 branch 2 times, most recently from 77acf3e to bc4654f Compare November 13, 2023 12:41
@mythi mythi force-pushed the PR-2023-050 branch 3 times, most recently from c876e22 to a817f63 Compare February 8, 2024 14:20
@mythi mythi force-pushed the PR-2023-050 branch 3 times, most recently from 4cbfdfc to 69ddd56 Compare April 15, 2024 04:13
@mythi mythi force-pushed the PR-2023-050 branch 2 times, most recently from 7b3235e to 0b70ce4 Compare May 8, 2024 10:18
@rdower rdower closed this Jun 7, 2024
@rdower rdower reopened this Jun 7, 2024
@rdower rdower closed this Jun 7, 2024
@mythi mythi reopened this Jun 7, 2024
@mythi mythi closed this Jun 7, 2024
@rdower rdower reopened this Jun 10, 2024
@mythi mythi force-pushed the PR-2023-050 branch 3 times, most recently from cb3cd37 to ce38f8d Compare August 21, 2024 13:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sgx SGX device plugin related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants