Releases: containers/nri-plugins
v0.8.0
This is a new major release of NRI Reference Plugins. It brings several new features, a number of bug fixes, improvements to the build system, to CI, end-to-end tests, and test coverage.
What's New
Balloons Policy
- New `preserve` policy option enables matching containers whose CPU and memory affinity must not be modified by the resource policy. This allows selected containers to access all CPUs and memories. For example, allow pcm-sensor-server to access MSRs on every CPU for low-level metrics:

      preserve:
        matchExpressions:
          - key: pod/labels/app.kubernetes.io/name
            operator: In
            values:
              - pcm-sensor-server

  Earlier this required `cpu.preserve.resource-policy.nri.io` and `memory.preserve.resource-policy.nri.io` pod annotations.
- New `freqGovernor` CPU class option enables setting the CPU frequency governor based on the CPU class of a balloon. Example:

      balloonTypes:
        - name: powersaving
          cpuClass: mypowersave
      control:
        cpu:
          classes:
            mypowersave:
              freqGovernor: powersave

- New `memoryTypes` balloon type option specifies required memory types when setting memory affinity. For example, containers in high-memory-bandwidth balloons will use only HBM when configured as:

      balloonTypes:
        - name: high-memory-bandwidth
          memoryTypes:
            - HBM

- Support `memory-type.resource-policy.nri.io` pod annotation for setting memory affinity into closest HBM, DRAM, PMEM, or any combination. This annotation is a pod level override to the `memoryTypes` balloon type option.
- L2-cache group aware CPU allocation and sharing. For example, containers in a balloon can be allowed to burst on idle (unallocated) CPUs that share the same L2 cache as CPUs allocated to the balloon.

      balloonTypes:
        - name: l2burst
          shareIdleCPUsInSame: l2cache

- Balloon-type-level override for the `pinMemory` policy option. Enables setting memory affinity of containers only in certain balloons while others are not set, and vice versa. Example:

      pinMemory: false
      balloonTypes:
        - name: latency-sensitive
          pinMemory: true
          preferIsolCpus: true
          preferNewBalloons: true

- New default configuration runs Guaranteed containers on dedicated CPUs while BestEffort and Burstable containers are allowed to share remaining CPUs on the same socket, but not to cross socket boundaries.
- Balance BestEffort containers between balloons with equal amounts of available resources.
- Lower risk of OOMs with `pinMemory: true`, as memory affinity was refactored to use smart libmem.
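As a rough illustration of the `memory-type.resource-policy.nri.io` annotation mentioned in the list above, a pod could be annotated along these lines. This is a sketch under assumptions: the container-scoped key form and the lowercase `hbm,dram` value follow the v0.7.0 notes further down, while the pod name, container name, and image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bandwidth-hungry                    # hypothetical pod name
  annotations:
    # Key form and value syntax as shown in the v0.7.0 release notes;
    # "compute" must match the container name below.
    memory-type.resource-policy.nri.io/container.compute: hbm,dram
spec:
  containers:
    - name: compute                         # hypothetical container name
      image: registry.example.com/compute:latest   # hypothetical image
```

Because the annotation overrides the `memoryTypes` option of the balloon type, it is probably best reserved for individual pods that need different memory than the rest of their balloon.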
Topology Aware Policy
The Topology Aware policy can now export Prometheus metrics per topology zone. Exported metrics include pool CPU set and memory set, shared CPU subpool total capacity, allocations and available capacity, memory total capacity, allocations and available amount, number of assigned containers and containers in the shared subpool.
To enable exporting these metrics, make sure that you are running with the latest policy configuration custom resource definition and you have `policy` included in the `spec/instrumentation/metrics/enabled` slice, like this:

    ...
    spec:
      ...
      instrumentation:
        ...
        metrics:
          enabled:
            - policy
      ...

The Topology Aware policy can now use data from the kubelet's Pod Resource API to generate extra topology hints for resource allocation and alignment. These hints are disabled in the default configuration installed by Helm charts. To enable them, make sure that you are running with the latest policy configuration custom resource definition and you have `spec/agent/podResourceAPI` set to true in the configuration, like this:

    spec:
      agent:
        ...
        podResourceAPI: true
    ...

- Support `memory-type.resource-policy.nri.io` pod annotation for setting memory affinity into closest HBM, DRAM or PMEM, or any combination.
What's Changed
Balloons Policy Fixes and Improvements
- balloons: add "preserve" option to match containers whose pinning must not be modified by @askervin in #368
- balloons: add support for cpu frequency governor tuning by @fmuyassarov in #374
- balloons: set frequency scaling governor only when requested by @fmuyassarov in #379
- balloons: improve handling of containers with no CPU requests by @askervin in #386
- balloons: add debug logging to selecting a balloon type by @askervin in #396
- balloons: support for L2 cache cluster allocation by @askervin in #384
- balloons: add memoryTypes to balloon types by @askervin in #395
- Add balloon type specific pinMemory option by @askervin in #451
Topology Aware Policy Fixes and Improvements
- metrics: add topology-aware policy metrics collection. by @klihub in #406
- topology-aware: correctly reconfigure implicit affinities for configuration changes. by @klihub in #394
- fixes: copy assigned memory zone in grant clone. by @klihub in #413
New Policy Agnostic Metrics, Common De Facto Exporters
- metrics: cleanup metrics registration, collection and gathering. by @klihub in #403
- metrics: add de-facto standard collectors. by @klihub in #404
- metrics: simplify policy/backend metrics collection interface. by @klihub in #408
- metrics: add policy system collector. by @klihub in #405
Topology Hints Based on Pod Resource API
- podresapi: agent,config,helm: make agent runtime configurable. by @klihub in #418
- podresapi: resmgr,agent: generate topology hints from Pod Resource API. by @klihub in #419
- podresapi: topology-aware: use Pod Resource API hints if present. by @klihub in #420
- agent,resmgr: merge PodResources{List,Map}, cache last List() result. by @klihub in #423
Common Resource Management Fixes and Improvements
- resmgr: fix "qosclass" in policy expressions by @askervin in #387
- resmgr,agent: propagate startup config error back to CR. by @klihub in #416
- libmem: implement policy-agnostic memory allocation/accounting. by @klihub in #332
- libmem: typo and thinko fixes. by @klihub in #381
- sysfs: enable faking CPU cache configurations using OVERRIDE_SYS_CACHES by @askervin in #383
- cpuallocator, plugins: handle priority as an option. by @klihub in #414
- Fix typos in expression code doc and matchExpression yamls by @askervin in #370
Helm Chart and Configuration Fixes and Improvements
- helm: enable prometheus autodiscovery by @klihub in #393
- helm: new balloons default configuration by @askervin in #391
- apis/config: use consistent assignment in +kubebuilder:validation tags. by @klihub in #397
- sample-configs: fix a copy-pasted comment thinko. by @klihub in #402
End-to-end Testing Fixes and Improvements
- e2e: pull and save runtime logs after each test. by @klihub in #367
- e2e: adjust metrics test for updated PrettyName(). by @klihub in #366
- e2e: switch default test distro to fedora/40-cloud-base. by @klihub in #375
- e2e: fix provisioning for Ubuntu cloud image. by @klihub in #377
- e2e: enable vagrant debugging. by @klihub in #376
- e2e: adjust $VM_HOSTNAME for policy node config usage. by @klihub in #378
- e2e: skip long running tests by default. by @klihub in #373
- e2e: fix command filenames in test output directories by @askervin in #390
- e2e: containerd 2.0.0. provisioning fixup. by @klihub in #400
- e2e/balloons: remove unknown/unused helm-launch argument. by @klihub in #407
Build Environment Fixes and Improvements
v0.7.1
This release of NRI Reference Plugins brings new features, a few bug fixes, and updates to the documentation.
Highlights
- balloons policy now supports assigning kernel-isolated CPU cores to balloons when available. To prefer isolated CPU cores for a balloon, use the new `preferIsolCpus` boolean configuration option. For instance,

      balloonTypes:
        - name: high-prio-physical-core
          minCPUs: 2
          maxCPUs: 2
          preferNewBalloons: true
          preferIsolCpus: true
          hideHyperthreads: true
          ...

- balloons policy now supports assigning performance optimized or energy efficient CPU cores to balloons when available. For instance, to define a balloon with energy efficient core preference and another one with performance core preference, use the new `preferCoreType` configuration option like this:

      balloonTypes:
        - name: low-prio
          namespaces:
            - logging
            - monitoring
          preferCoreType: efficient
          ...
        - name: high-prio
          preferCoreType: performance
          ...

- Topology-aware policy now allocates CPU cores in clusters of shared last-level cache. Whenever this provides different grouping than the rest of the topology, for instance hyperthreads, the CPU allocator now divides cores into groups defined by shared last-level cache. The topology-aware policy tries to allocate as few LLC groups to a container as possible and tries to avoid sharing an LLC group by multiple containers.
What's New
- balloons: add support for isolated cpus. by @fmuyassarov in #344
- balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
- cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343
What Changed
Resource assignment policies should now try harder to detect when a new container is a restarted instance of an existing container which has just exited or crashed. This should fix problems where a crashing container could not be restarted on a nearly fully allocated node.
- deps: bump NRT dependencies to v0.1.2. by @fmuyassarov in #348
- topology-aware: add missing SingleThreadForCPUs() to mockSysfs. by @klihub in #349
- balloons: add support for isolated cpus. by @fmuyassarov in #344
- cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343
- fixes: fix host-wait-vm-ssh-server, improve vm-reboot. by @klihub in #350
- fix: clean up plugin at the beginning/end of tests. by @klihub in #351
- doc: add availableResources in the balloons policy documentation by @askervin in #355
- build: allow building a single plugin image. by @klihub in #357
- balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
- e2e: fix cni_plugin=bridge in provisioning a vm by @askervin in #359
- e2e: bridge CNI setup fixes for Fedora/containerd. by @klihub in #361
- e2e: use bridge CNI plugin by default. by @klihub in #362
- CI: verify in smaller steps, verify binary builds. by @klihub in #364
- resmgr: lifecycle overlap detection and workaround. by @klihub in #358
Full Changelog: v0.7.0...v0.7.1
v0.7.0
This release of NRI Reference Plugins brings in new features and important bug fixes.
Highlights
- Topology-aware and balloons resource policies now support soft-disabling of hyperthreads per container. This improves the performance of some classes of workloads. Both policies support a new pod annotation:

      hide-hyperthreads.resource-policy.nri.io/container.<CONTAINER-NAME>: "true"

  and the balloons policy has a new balloon-type option `hideHyperthreads` that soft-disables hyperthreads on all containers assigned to a balloon of this type.
- The topology-aware policy supports pinning containers to high-bandwidth memory (HBM), or both HBM and DRAM, when pods are annotated with one of the following, respectively:

      memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm
      memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm,dram

- Automatic hardware topology hint generation has been fixed in the topology-aware policy. For instance, if a container uses a PCI device, the policy prefers pinning the container to CPUs and memory that are close to the device.
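To make the hyperthread highlight more concrete, here is a minimal pod sketch that uses the `hide-hyperthreads` annotation. Only the annotation key and value come from the notes above; the pod name, container name, image, and resource figures are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ht-sensitive                        # hypothetical pod name
  annotations:
    # <CONTAINER-NAME> is replaced with the name of the container below.
    hide-hyperthreads.resource-policy.nri.io/container.worker: "true"
spec:
  containers:
    - name: worker                          # hypothetical container name
      image: registry.example.com/worker:latest   # hypothetical image
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
```

With the balloons policy, setting `hideHyperthreads: true` on a balloon type should have the same effect for every container in balloons of that type, without per-pod annotations.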
What's New
- balloons: hideHyperthreads balloon type option and annotation by @askervin in #338
- topology-aware: add support for hide-hyperthreads annotation. by @askervin in #331
What Changed
- topology-aware: don't ignore HBM memory nodes without close CPUs. by @klihub in #329
- topology-aware: relax NUMA node topology checks. by @klihub in #336
- resmgr: exit when ttrpc connection goes down. by @klihub in #319
- cpuallocator: don't filter based on single CoreKind. by @klihub in #345
- sysfs,cpuallocator: fix CPU cluster discovery. by @klihub in #337
- sysfs: survive NUMA nodes without memory. by @klihub in #339
- sysfs: allow non-uniform thread count. by @klihub in #340
- helm: flip podPriorityClassNodeCritical to true. by @klihub in #312
- config-manager: allow configuring NRI timeouts. by @klihub in #318
Full Changelog: v0.5.0...v0.7.0
v0.5.1
This release of the NRI Reference Plugins brings a few improvements to hardware topology detection and resource assignment.
What's New
- cpuallocator: topology discovery fixes and improvements. by @klihub in #206
- cpuallocator: add support for hybrid core discovery, preferred allocation. by @klihub in #295
- topology-aware: configurable allocation priority by @klihub in #282
- resmgr: enable opentelemetry tracing (span propagation) over the NRI ttrpc connection. by @klihub in #293
Updates, Fixes, and Other Improvements
- sysfs: dump system discovery results in a more predictable order. by @klihub in #294
- github: package and publish interim unstable Helm charts from the main and release branches by @marquiz, @klihub in #303
Full Changelog: v0.4.1...v0.5.1
v0.4.1
This major release of the NRI Reference Plugins brings new features to a few plugins, numerous other smaller improvements, and several bug fixes.
Highlights
- balloons policy: add `groupBy` balloon type option
  Group containers into the same balloon instances if their `groupBy` expressions evaluate to the same group. For example, the following expression prefers assigning all containers in the pod to a balloon that already contains containers from the same namespace and with the same `nsballoon` pod label value:

      ...
      balloonTypes:
        - name: my-pods
          groupBy: ${pod/namespace}-${pod/labels/nsballoon}
      ...

  If there is no such balloon, or if such instances do not have enough CPUs, then finding a suitable balloon continues as before: assign to some other existing balloon or create a new balloon if that is preferred.
- balloons policy: add balloon `matchExpressions` option
  Assign containers to balloon instances by balloon match expressions, similar to affinity expression of the topology-aware policy. Expressions are evaluated for containers which are not explicitly assigned to any balloon by an annotation. If an expression matches a container, the container is assigned to an instance of the corresponding balloon. For instance, the following matchExpression will grab all containers with matching pod names to the associated balloon:

      ...
      balloonTypes:
        - name: my-pods
          matchExpressions:
            - key: pod/name
              operator: MatchesAny
              values: [ myPod*, nginx* ]
      ...

- balloons & topology-aware policies: allow preserving existing resource assignments
Containers and pods can now be annotated to prevent the policy from touching their existing CPU or memory pinning.
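As a sketch of what such annotations might look like on a pod: the annotation keys below are the ones named in the v0.8.0 notes earlier on this page, while the `"true"` value, the pod name, the container name, and the image are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: externally-pinned                   # hypothetical pod name
  annotations:
    # Assumed value syntax; check the policy documentation for the
    # exact accepted values and any container-scoped key variants.
    cpu.preserve.resource-policy.nri.io: "true"
    memory.preserve.resource-policy.nri.io: "true"
spec:
  containers:
    - name: agent                           # hypothetical container name
      image: registry.example.com/agent:latest    # hypothetical image
```

The intent is that the policy leaves the CPU and memory pinning of this pod's containers as it found them instead of applying its own assignment.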
What's New
- balloons: implement groupBy option by @askervin in #278
- balloons: allow assigning containers to balloons by runtime-evaluated expressions by @klihub in #260
- balloons policy: more regular built-in balloons, treat them much like user-defined ones
  Built-in `reserved` and `default` balloon types are no longer special cases. They can be configured with the same parameters as user-defined balloons.
- balloons: support preserving CPU and memory pinning by @askervin in #257
- topology-aware: support preserving CPU and memory pinning by @askervin in #258
- feat(helm): Introduce priorityClassName system-node-critical by @ffuerste in #220
- helm: allow setting NRI plugin index via values by @klihub in #227
Updates, Fixes And Other Improvements
- balloons: fix the order of assigning containers into balloons by @askervin in #273
- balloons: fix logged balloon name by @klihub in #259
- balloons & topology-aware policies: better handling of `UpdateContainer[Resources]` requests
  Fill in missing bits in partial container resource updates from the current resource assignment. Filter out redundant resource updates without invoking the policy.
- memtierd: update the nri-memtierd plugin to use memtierd v0.1.1 by @askervin in #287
- operator: ensure to kustomize operator manifests before local deployment by @fmuyassarov in #240
- resmgr: better expression validation, cleaner key resolution by @klihub in #256
- resmgr: inject mount before container state update by @klihub in #223
- resmgr: log containers by pretty name during startup by @klihub in #245
- instrumentation: fix resource creation, use parent-based sampler by @klihub in #233
- instrumentation: allow proper reconfiguration of tracing by @klihub in #234
- cache: support annotations to preserve CPU and memory pinning by @askervin in #249
- cache, resmgr: expose key evaluation, implement key substitution by @klihub in #277
- cache: fix generated pod scope and simple affinity expressions by @klihub in #285
- cache: store creation time of pod and containers cache objects by @askervin in #272
- topology-aware: log resource operations at info level by @klihub in #252
- doc: clarify selecting balloon type by @askervin in #281
- doc: more consistent terminology in balloons documentation by @askervin in #269
- fixes: rename default config group label, support/fall back to deprecated labels. by @klihub in #231
Full List of Merged PRs
For a full list of changes see v0.3.2...v0.4.1
v0.3.2
This patch release fixes image versioning for the operator.
What's New
operator: point containerImage tag to the latest release (v0.3.2) by @fmuyassarov in #218
v0.3.1
This patch release fixes the operator category to use the community-defined one instead of a custom category.
What's New
operator: switch to community defined category by @fmuyassarov in #217
v0.3.0
This release brings a few new plugins, the NRI Plugins Operator, and a new configuration mechanism to configure plugins using CRDs.
What's New
- Ansible-based operator created with operator-sdk to manage the life cycle of the nri-plugins
  - Operator by @fmuyassarov in #208
- New plugins:
- Improvements in policies:
What's Changed
- Configuration for plugins switched from ConfigMaps to per-plugin CRDs
Note that existing deployments need to be updated, converting any old ConfigMap to the corresponding custom resource of the policy used in the deployment. Please see the provided sample configurations or the custom resource definitions for policy-specific CRD details.
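For orientation, a converted configuration might look roughly like the custom resource below. This is a sketch, not the authoritative schema: the `apiVersion`, `kind`, namespace, and field names are recalled from the project's sample configurations and should be verified against the custom resource definitions shipped with the release; the balloon type itself is hypothetical.

```yaml
# Sketch only: verify apiVersion, kind, and field names against the CRDs.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system        # assumed: namespace where the plugin runs
spec:
  reservedResources:
    cpu: 750m
  pinCPU: true
  pinMemory: true
  balloonTypes:
    - name: my-pods             # hypothetical balloon type
      minCPUs: 2
      maxCPUs: 2
```

The settings that previously lived in the policy ConfigMap move under `spec`, so a migration mostly consists of moving the old keys there and applying the resource with `kubectl apply -f`.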
List of PRs
- docs: fix link to "all releases" by @marquiz in #81
- docs: convert all rst to md by @marquiz in #80
- pkg: eliminate usage of obsolete github.com/pkg/errors. by @klihub in #83
- build: Fix baseurl when building documentation version selector by @jukkar in #84
- instrumentation: use a more appropriate propagator. by @klihub in #86
- Helm improvments by @fmuyassarov in #85
- config: Fix configuration files for kustomize by @changzhi1990 in #88
- gihub-action: fix helm linter action trigger path by @fmuyassarov in #89
- Helm chart configmap fixes and some YAML clean up by @fmuyassarov in #90
- github: Disable e2e-test workflow by @jukkar in #102
- helm: fix topology-aware helm chart naming by @fmuyassarov in #95
- helm/kustomize: fix configmaps by @poussa in #104
- build(deps): bump pygments from 2.13.0 to 2.15.0 in /docs by @dependabot in #98
- sample-configs: drop rdt from the sample configuration by @marquiz in #106
- pkg/multierror: switch to native golang multierror handling. by @klihub in #108
- Add utility to patch containerd config for enabling nri by @fmuyassarov in #107
- config-manager: bump go version to 1.20 by @fmuyassarov in #110
- build: band-aid build system to allow building binaries other than resource policy plugins. by @klihub in #109
- Automate Helm charts releasing process via GitHub action by @fmuyassarov in #112
- github/workflows: ditch softprops/turnstyle action by @marquiz in #113
- github: use the git checkout action to fetch all refs by @marquiz in #114
- Memory QoS example plugin by @askervin in #115
- Add nri-memtierd plugin by @askervin in #117
- memory-qos: fix image building by @askervin in #122
- docs: improve plugins installation instructions by @fmuyassarov in #119
- deployment: fix the nri volume type to DirectoryOrCreate by @fmuyassarov in #123
- helm: add charts for memtierd and memory-qos by @fmuyassarov in #118
- helm: improve labels and selector labels set on the DaemonSet by @fmuyassarov in #124
- helm: fix config-manager image name reference in charts defaults by @fmuyassarov in #125
- nri-memtierd: mount only needed dirs from host by @askervin in #121
- helm: don't set image.tag in values.yaml by @marquiz in #127
- github: set chart version when packaging Helm charts by @marquiz in #128
- github-action: update charts path to include memtierd and memory-qos by @fmuyassarov in #129
- deployment: refactor config manager to support NRI enabling in CRI-O by @fmuyassarov in #120
- helm: support NRI enabling prior to running memtierd & memory-qos plugins by @fmuyassarov in #131
- github: issue templates by @marquiz in #130
- Bump golang version to 1.20.10 by @marquiz in #133
- go.mod: update deps by @marquiz in #134
- helm: don't set initContainerImage:.tag in values.yaml by @marquiz in #136
- cleanup: nuke unused scripts, Makefile bits and defunct functionality. by @klihub in #135
- e2e: document e2e test dependencies by @askervin in #139
- github: fix publish-helm-charts workflow by @marquiz in #143
- scripts: update helm-publisher by @marquiz in #141
- workflow: fix the tagging docker images build from new tags by @fmuyassarov in #145
- docs: update charts installation instructions by @fmuyassarov in #147
- deployment: adjust Pod restartPolicy to OnFailure by @fmuyassarov in #152
- Add a helper script for preparing the release by @marquiz in #138
- fixes: fix a bunch of problems in bootstrapping e2e tests. by @klihub in #154
- deployment: drop DaemonSet unsupported OnFailure restartPolicy by @fmuyassarov in #155
- fixes: fix image tagging scheme in publishing workflow. by @klihub in #158
- scripts: ditch versions for .rpm and .deb packages. by @klihub in #161
- balloons: add PreferSpreadOnPhysicalCores by @askervin in #126
- deployment/helm: always use DirectoryOrCreate for /var/run/nri. by @klihub in #162
- build: don't try to gofmt generated files. by @klihub in #165
- helm: add an option to set tolerations by @fmuyassarov in #157
- build/docs: make site-build work again from git worktrees. by @klihub in #168
- docs: add basic README for each Helm chart on their own directory by @fmuyassarov in #167
- plugins: add sgx-epc plugin. by @klihub in #156
- deployment/helm/sgx-epc: fix a small README typo. by @klihub in #170
- deployment: fix trailing whitespace errors. by @klihub in #169
- docs: drop unneeded and outdated resource-policy/README.md by @marquiz in #173
- docs: fix md style by @marquiz in #172
- deployment/helm: add default .helmignores. by @klihub in #174
- docs: update the main readme with the list of available plugins by @fmuyassarov in #175
- config-manager: don't crash if no runtime is found running. by @klihub in #180
- .github: use release-* release branch pattern. by @klihub in #183
- fixes: remove unused gRPC bits, stop forcing gRPC to 1.38.0. by @klihub in #186
- balloons: use new config more consistently, fix incorrect test config keys. by @klihub in #188
- e2e: fix balloons test02-prometheus-metrics by @askervin in #191
- test/e2e: test VM provisioning fixes by @klihub in #192
- test/e2e: fix vagrant usage, make cilium version configurable. by @klihub in #193
- helm: fix missing tolerations option in sgx-epc plugin by @fmuyassarov in https://github.com/containers/...
v0.2.3
This patch release updates Helm Chart documentation, fixes automatic enabling of NRI support in CRI-O runtimes, and updates obsolete image dependencies.
What's Changed
- config-manager: don't crash if no runtime is found running by @marquiz in #181
- github: use release-* release branch pattern. by @klihub in #184
- docs: cherry-pick doc changes from main branch by @fmuyassarov in #182
- deployment: update files for v0.2.3 by @fmuyassarov in #185
- deps: cherry-pick #186: update gRPC bits, stop forcing version. by @klihub in #187
Full Changelog: v0.2.2...v0.2.3
v0.2.2
This patch release adds support for specifying tolerations with Helm and fixes some issues in the Helm charts.
What's Changed
- github: fix publish-helm-charts workflow by @fmuyassarov in #151
- deployment: adjust Pod restartPolicy to OnFailure by @fmuyassarov in #153
- drop DaemonSet unsupported OnFailure restartPolicy by @fmuyassarov in #159
- cherry-pick #156: fix image tagging scheme in publishing workflow. by @klihub in #160
- always use DirectoryOrCreate for /var/run/nri by @fmuyassarov in #163
- scripts: don't make unknown version invalid (cherry-picked 97a3e65). by @klihub in #177
- scripts: update helm-publisher by @marquiz in #176
- helm: add an option to set tolerations (backport of #157) by @klihub in #178
- deployment: update files for v0.2.2 by @marquiz in #179
Full Changelog: v0.2.1...v0.2.2