Releases: containers/nri-plugins

v0.8.0

18 Dec 08:15

This is a new major release of NRI Reference Plugins. It brings several new features, a number of bug fixes, and improvements to the build system, CI, end-to-end tests, and test coverage.

What's New

Balloons Policy

  • New preserve policy option enables matching containers whose CPU
    and memory affinity must not be modified by the resource policy.

    This lets selected containers access all CPUs and memories. For
    example, allow pcm-sensor-server to access MSRs on every CPU for
    low-level metrics:

    preserve:
      matchExpressions:
        - key: pod/labels/app.kubernetes.io/name
          operator: In
          values:
            - pcm-sensor-server
    

    Earlier this required cpu.preserve.resource-policy.nri.io and
    memory.preserve.resource-policy.nri.io pod annotations.

  • New freqGovernor CPU class option enables setting CPU frequency
    governor based on the CPU class of a balloon. Example:

    balloonTypes:
    - name: powersaving
      cpuClass: mypowersave
    control:
      cpu:
        classes:
          mypowersave:
            freqGovernor: powersave
    
  • New memoryTypes balloon type option specifies required memory
    types when setting memory affinity. For example, containers in
    high-memory-bandwidth balloons will use only HBM when configured as:

    balloonTypes:
    - name: high-memory-bandwidth
      memoryTypes:
      - HBM
    
  • Support memory-type.resource-policy.nri.io pod annotation for
    setting memory affinity to the closest HBM, DRAM, PMEM, or any
    combination of them. This annotation is a pod-level override of the
    memoryTypes balloon type option.

  • L2-cache group aware CPU allocation and sharing. For example,
    containers in a balloon can be allowed to burst on idle
    (unallocated) CPUs that share the same L2 cache as CPUs allocated to
    the balloon.

    balloonTypes:
    - name: l2burst
      shareIdleCPUsInSame: l2cache
    
  • Balloon-type-level override of the pinMemory policy option. This
    enables pinning the memory of containers only in certain balloons
    while leaving others unpinned, and vice versa. Example:

    pinMemory: false
    balloonTypes:
    - name: latency-sensitive
      pinMemory: true
      preferIsolCpus: true
      preferNewBalloons: true
    
  • New default configuration runs Guaranteed containers on dedicated
    CPUs while BestEffort and Burstable containers are allowed to share
    remaining CPUs on the same socket, but not cross socket boundaries.

  • Balance BestEffort containers between balloons with an equal amount
    of available resources.

  • Lower risk of OOMs with pinMemory: true, as memory affinity was
    refactored to use smart libmem.
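
For comparison with the preserve option above, the older annotation-based approach might have looked roughly like the following pod. This is only a sketch: the annotation keys come from these notes, while the "true" values, the pod name, and the image are assumptions made for illustration.

```yaml
# Hypothetical pod using the older, annotation-based way of preserving
# CPU and memory affinity. Only the annotation keys are taken from the
# release notes; the values, names, and image are assumed.
apiVersion: v1
kind: Pod
metadata:
  name: pcm-sensor-server                         # hypothetical pod name
  labels:
    app.kubernetes.io/name: pcm-sensor-server     # label matched by the preserve example
  annotations:
    cpu.preserve.resource-policy.nri.io: "true"     # assumed value
    memory.preserve.resource-policy.nri.io: "true"  # assumed value
spec:
  containers:
    - name: pcm-sensor-server
      image: example.com/pcm-sensor-server:latest   # placeholder image
```

With the preserve policy option, the same pod would need only the label, since matching happens in the policy configuration rather than in each pod.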

Topology Aware Policy

The Topology Aware policy can now export Prometheus metrics per topology zone. Exported metrics include: the pool CPU set and memory set; total capacity, allocations, and available capacity of the shared CPU subpool; total capacity, allocations, and available amount of memory; and the number of assigned containers and of containers in the shared subpool.

To enable exporting these metrics, make sure that you are running with the latest policy configuration custom resource definition and you have policy included in the spec/instrumentation/metrics/enabled slice, like this:

...
spec:
...
  instrumentation:
  ...
    metrics:
      enabled:
      - policy
...

The Topology Aware policy can now use data from the kubelet's Pod Resource API to generate extra topology hints for resource allocation and alignment. These hints are disabled in the default configuration installed by Helm charts. To enable them, make sure that you are running with the latest policy configuration custom resource definition and you have spec/agent/podResourceAPI set to true in the configuration, like this:

spec:
  agent:
    ...
    podResourceAPI: true
...
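
Both settings described above live in the same policy custom resource. A minimal sketch combining them could look like this; the apiVersion, kind, and metadata values are assumptions that should be checked against the custom resource definition installed in the cluster, and all other policy settings are omitted.

```yaml
# Sketch only. apiVersion, kind, name, and namespace are assumptions;
# verify them against the installed policy CRD before use.
apiVersion: config.nri/v1alpha1
kind: TopologyAwarePolicy
metadata:
  name: default
  namespace: kube-system
spec:
  # Other policy settings are omitted from this sketch.
  instrumentation:
    metrics:
      enabled:
        - policy            # export per-zone policy metrics
  agent:
    podResourceAPI: true    # generate extra hints from the Pod Resource API
```
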
  • Support memory-type.resource-policy.nri.io pod annotation for
    setting memory affinity to the closest HBM, DRAM, or PMEM, or any
    combination of them.
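
As an illustration of the memory-type annotation, the pod below asks for HBM and DRAM for one of its containers. It is a sketch that uses the per-container annotation form shown in the v0.7.0 notes; the pod name, container name, image, and resource amounts are hypothetical.

```yaml
# Hypothetical pod requesting HBM and DRAM for its "worker" container.
apiVersion: v1
kind: Pod
metadata:
  name: bandwidth-hungry                  # hypothetical pod name
  annotations:
    # Per-container form of the annotation, as in the v0.7.0 notes.
    memory-type.resource-policy.nri.io/container.worker: hbm,dram
spec:
  containers:
    - name: worker                        # must match the name in the annotation key
      image: example.com/worker:latest    # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```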

What's Changed

Balloons Policy Fixes and Improvements

  • balloons: add "preserve" option to match containers whose pinning must not be modified by @askervin in #368
  • balloons: add support for cpu frequency governor tuning by @fmuyassarov in #374
  • balloons: set frequency scaling governor only when requested by @fmuyassarov in #379
  • balloons: improve handling of containers with no CPU requests by @askervin in #386
  • balloons: add debug logging to selecting a balloon type by @askervin in #396
  • balloons: support for L2 cache cluster allocation by @askervin in #384
  • balloons: add memoryTypes to balloon types by @askervin in #395
  • Add balloon type specific pinMemory option by @askervin in #451

Topology Aware Policy Fixes and Improvements

  • metrics: add topology-aware policy metrics collection. by @klihub in #406
  • topology-aware: correctly reconfigure implicit affinities for configuration changes. by @klihub in #394
  • fixes: copy assigned memory zone in grant clone. by @klihub in #413

New Policy Agnostic Metrics, Common De Facto Exporters

  • metrics: cleanup metrics registration, collection and gathering. by @klihub in #403
  • metrics: add de-facto standard collectors. by @klihub in #404
  • metrics: simplify policy/backend metrics collection interface. by @klihub in #408
  • metrics: add policy system collector. by @klihub in #405

Topology Hints Based on Pod Resource API

  • podresapi: agent,config,helm: make agent runtime configurable. by @klihub in #418
  • podresapi: resmgr,agent: generate topology hints from Pod Resource API. by @klihub in #419
  • podresapi: topology-aware: use Pod Resource API hints if present. by @klihub in #420
  • agent,resmgr: merge PodResources{List,Map}, cache last List() result. by @klihub in #423

Common Resource Management Fixes and Improvements

  • resmgr: fix "qosclass" in policy expressions by @askervin in #387
  • resmgr,agent: propagate startup config error back to CR. by @klihub in #416
  • libmem: implement policy-agnostic memory allocation/accounting. by @klihub in #332
  • libmem: typo and thinko fixes. by @klihub in #381
  • sysfs: enable faking CPU cache configurations using OVERRIDE_SYS_CACHES by @askervin in #383
  • cpuallocator, plugins: handle priority as an option. by @klihub in #414
  • Fix typos in expression code doc and matchExpression yamls by @askervin in #370

Helm Chart and Configuration Fixes and Improvements

  • helm: enable prometheus autodiscovery by @klihub in #393
  • helm: new balloons default configuration by @askervin in #391
  • apis/config: use consistent assignment in +kubebuilder:validation tags. by @klihub in #397
  • sample-configs: fix a copy-pasted comment thinko. by @klihub in #402

End-to-end Testing Fixes and Improvements

  • e2e: pull and save runtime logs after each test. by @klihub in #367
  • e2e: adjust metrics test for updated PrettyName(). by @klihub in #366
  • e2e: switch default test distro to fedora/40-cloud-base. by @klihub in #375
  • e2e: fix provisioning for Ubuntu cloud image. by @klihub in #377
  • e2e: enable vagrant debugging. by @klihub in #376
  • e2e: adjust $VM_HOSTNAME for policy node config usage. by @klihub in #378
  • e2e: skip long running tests by default. by @klihub in #373
  • e2e: fix command filenames in test output directories by @askervin in #390
  • e2e: containerd 2.0.0. provisioning fixup. by @klihub in #400
  • e2e/balloons: remove unknown/unused helm-launch argument. by @klihub in #407

Build Environment Fixes and Improvements

  • build: enable building debug binaries and images by @askervin in #388
  • build: update controller-tools to v0.16.5. by @klihub in #398
  • build: enable race-detector in DEBUG=1 builds. by @klihub in #409
  • build: enable race-detector in image build, too. by @klihub in #410
  • d...

v0.7.1

23 Sep 08:21

This release of NRI Reference Plugins brings new features, a few bug fixes, and updates to the documentation.

Highlights

  • balloons policy now supports assigning kernel-isolated CPU cores to balloons when available. To prefer isolated CPU cores for a balloon, use the new preferIsolCpus boolean configuration option. For instance:
balloonTypes:
  - name: high-prio-physical-core
    minCPUs: 2
    maxCPUs: 2
    preferNewBalloons: true
    preferIsolCpus: true
    hideHyperthreads: true
...
  • balloons policy now supports assigning performance-optimized or energy-efficient CPU cores to balloons when available. For instance, to define one balloon type that prefers energy-efficient cores and another that prefers performance cores, use the new preferCoreType configuration option like this:
balloonTypes:
  - name: low-prio
    namespaces:
      - logging
      - monitoring
    preferCoreType: efficient
...
  - name: high-prio
    preferCoreType: performance
...
  • Topology-aware policy now allocates CPU cores in clusters of shared last-level cache. Whenever this provides different grouping than the rest of the topology, for instance hyperthreads, the CPU allocator now divides cores into groups defined by shared last-level cache. The topology-aware policy tries to allocate as few LLC groups to a container as possible and tries to avoid sharing an LLC group among multiple containers.
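
To see how the preferCoreType example above would apply to a workload, consider a pod created in one of the namespaces listed for the low-prio balloon type. This is a sketch: the pod name, image, and CPU request are hypothetical, and only the logging namespace comes from the example.

```yaml
# Hypothetical pod in the "logging" namespace. With the low-prio balloon
# type above, its containers would be expected to land in a low-prio
# balloon, which prefers energy-efficient cores when available.
apiVersion: v1
kind: Pod
metadata:
  name: log-shipper                           # hypothetical pod name
  namespace: logging                          # listed under namespaces of low-prio
spec:
  containers:
    - name: shipper
      image: example.com/log-shipper:latest   # placeholder image
      resources:
        requests:
          cpu: 500m
```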

What's New

  • balloons: add support for isolated cpus. by @fmuyassarov in #344
  • balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
  • cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343

What Changed

Resource assignment policies should now try harder to detect when a new container is a restarted instance of an existing container which has just exited or crashed. This should fix problems where a crashing container could not be restarted on a nearly fully allocated node.

  • deps: bump NRT dependencies to v0.1.2. by @fmuyassarov in #348
  • topology-aware: add missing SingleThreadForCPUs() to mockSysfs. by @klihub in #349
  • balloons: add support for isolated cpus. by @fmuyassarov in #344
  • cpuallocator: implement clustered allocation based on cache groups. by @klihub in #343
  • fixes: fix host-wait-vm-ssh-server, improve vm-reboot. by @klihub in #350
  • fix: clean up plugin at the beginning/end of tests. by @klihub in #351
  • doc: add availableResources in the balloons policy documentation by @askervin in #355
  • build: allow building a single plugin image. by @klihub in #357
  • balloons: add support for power efficient & high performance cores by @fmuyassarov in #354
  • e2e: fix cni_plugin=bridge in provisioning a vm by @askervin in #359
  • e2e: bridge CNI setup fixes for Fedora/containerd. by @klihub in #361
  • e2e: use bridge CNI plugin by default. by @klihub in #362
  • CI: verify in smaller steps, verify binary builds. by @klihub in #364
  • resmgr: lifecycle overlap detection and workaround. by @klihub in #358

Full Changelog: v0.7.0...v0.7.1

v0.7.0

03 Jul 07:30

This release of NRI Reference Plugins brings in new features and important bug fixes.

Highlights

  • Topology-aware and balloons resource policies now support soft-disabling of hyperthreads per container. This improves the performance of some classes of workloads. Both policies support a new pod annotation:
    hide-hyperthreads.resource-policy.nri.io/container.<CONTAINER-NAME>: "true"
    
    and the balloons policy has a new balloon-type option, hideHyperthreads, that soft-disables hyperthreads on all containers assigned to a balloon of this type.
  • The topology-aware policy supports pinning containers to high-bandwidth memory (HBM), or both HBM and DRAM, when pods are annotated with
    memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm
    memory-type.resource-policy.nri.io/container.<CONTAINER-NAME>: hbm,dram
    
  • Automatic hardware topology hint generation has been fixed in the topology-aware policy. For instance, if a container uses a PCI device, the policy prefers pinning the container to CPUs and memory that are close to the device.
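
The two ways of hiding hyperthreads described above can be sketched as follows. The first document annotates a single container in a pod; the second is a balloons policy fragment that applies the option to every container in balloons of one type. Pod, container, image, and balloon type names are hypothetical.

```yaml
# Per-container use: hide hyperthreads from the "solver" container only.
apiVersion: v1
kind: Pod
metadata:
  name: compute-job                         # hypothetical pod name
  annotations:
    hide-hyperthreads.resource-policy.nri.io/container.solver: "true"
spec:
  containers:
    - name: solver                          # annotated container
      image: example.com/solver:latest      # placeholder image
    - name: sidecar                         # not annotated, unaffected
      image: example.com/sidecar:latest     # placeholder image
---
# Balloons policy fragment: hide hyperthreads from all containers
# assigned to balloons of this type.
balloonTypes:
  - name: no-ht                             # hypothetical balloon type name
    hideHyperthreads: true
```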

What's New

  • balloons: hideHyperthreads balloon type option and annotation by @askervin in #338
  • topology-aware: add support for hide-hyperthreads annotation. by @askervin in #331

What Changed

  • topology-aware: don't ignore HBM memory nodes without close CPUs. by @klihub in #329
  • topology-aware: relax NUMA node topology checks. by @klihub in #336
  • resmgr: exit when ttrpc connection goes down. by @klihub in #319
  • cpuallocator: don't filter based on single CoreKind. by @klihub in #345
  • sysfs,cpuallocator: fix CPU cluster discovery. by @klihub in #337
  • sysfs: survive NUMA nodes without memory. by @klihub in #339
  • sysfs: allow non-uniform thread count. by @klihub in #340
  • helm: flip podPriorityClassNodeCritical to true. by @klihub in #312
  • config-manager: allow configuring NRI timeouts. by @klihub in #318

New Contributors

Full Changelog: v0.5.0...v0.7.0

v0.5.1

29 Mar 15:54
194c433

This release of the NRI Reference Plugins brings a few improvements to hardware topology detection and resource assignment.

What's New

  • cpuallocator: topology discovery fixes and improvements. by @klihub in #206
  • cpuallocator: add support for hybrid core discovery, preferred allocation. by @klihub in #295
  • topology-aware: configurable allocation priority by @klihub in #282
  • resmgr: enable opentelemetry tracing (span propagation) over the NRI ttrpc connection. by @klihub in #293

Updates, Fixes, and Other Improvements

  • sysfs: dump system discovery results in a more predictable order. by @klihub in #294
  • github: package and publish interim unstable Helm charts from the main and release branches by @marquiz, @klihub in #303

Full Changelog: v0.4.1...v0.5.1

v0.4.1

16 Mar 14:48
659d042

This major release of the NRI Reference Plugins brings new features to a few plugins, numerous other smaller improvements, and several bug fixes.

Highlights

  • balloons policy: add groupBy balloon type option
    Group containers into the same balloon instances if their groupBy expressions evaluate to the same group. For example, the following expression prefers assigning all containers in a pod to a balloon that already contains containers from the same namespace with the same nsballoon pod label value:
  ...
  balloonTypes:
    - name: my-pods
      groupBy: ${pod/namespace}-${pod/labels/nsballoon}
  ...

If there is no such balloon, or if such instances do not have enough CPUs, then finding a suitable balloon continues as before: assign to some other existing balloon or create a new balloon if that is preferred.

  • balloons policy: add balloon matchExpressions option
    Assign containers to balloon instances by balloon match expressions, similar to the affinity expressions of the topology-aware policy. Expressions are evaluated for containers which are not explicitly assigned to any balloon by an annotation. If an expression matches a container, the container is assigned to an instance of the corresponding balloon. For instance, the following matchExpression will grab all containers with matching pod names to the associated balloon:
  ...
  balloonTypes:
    - name: my-pods
      matchExpressions:
        - key: pod/name
          operator: MatchesAny
          values: [ myPod*, nginx* ]
  ...
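
To make the two options above concrete, here is a hypothetical pod and how each example configuration would be expected to treat it. The pod name, namespace, label value, and image are made up for illustration.

```yaml
# Hypothetical pod used to illustrate groupBy and matchExpressions.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-frontend      # matches the nginx* pattern in the matchExpressions example
  namespace: web
  labels:
    nsballoon: blue         # with the groupBy example, the group evaluates to "web-blue"
spec:
  containers:
    - name: nginx
      image: example.com/nginx:latest   # placeholder image
```

With the groupBy example, this pod would prefer a my-pods balloon that already holds containers whose group is also web-blue. With the matchExpressions example, it would be assigned to a my-pods balloon because its name matches one of the listed patterns.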

What's New

  • balloons: implement groupBy option by @askervin in #278
  • balloons: allow assigning containers to balloons by runtime-evaluated expressions by @klihub in #260
  • balloons policy: more regular built-in balloons, treat them much like user-defined ones
    Built-in reserved and default balloon types are no longer special cases. They can be configured with the same parameters as user-defined balloons.
  • balloons: support preserving CPU and memory pinning by @askervin in #257
  • topology-aware: support preserving CPU and memory pinning by @askervin in #258
  • feat(helm): Introduce priorityClassName system-node-critical by @ffuerste in #220
  • helm: allow setting NRI plugin index via values by @klihub in #227

Updates, Fixes And Other Improvements

  • balloons: fix the order of assigning containers into balloons by @askervin in #273
  • balloons: fix logged balloon name by @klihub in #259
  • balloons & topology-aware policies: better handling of UpdateContainer[Resources] requests
    Fill in missing bits in partial container resource updates from the current resource assignment. Filter out redundant resource updates without invoking the policy.
  • memtierd: update the nri-memtierd plugin to use memtierd v0.1.1 by @askervin in #287
  • operator: ensure to kustomize operator manifests before local deployment by @fmuyassarov in #240
  • resmgr: better expression validation, cleaner key resolution by @klihub in #256
  • resmgr: inject mount before container state update by @klihub in #223
  • resmgr: log containers by pretty name during startup by @klihub in #245
  • instrumentation: fix resource creation, use parent-based sampler by @klihub in #233
  • instrumentation: allow proper reconfiguration of tracing by @klihub in #234
  • cache: support annotations to preserve CPU and memory pinning by @askervin in #249
  • cache, resmgr: expose key evaluation, implement key substitution by @klihub in #277
  • cache: fix generated pod scope and simple affinity expressions by @klihub in #285
  • cache: store creation time of pod and containers cache objects by @askervin in #272
  • topology-aware: log resource operations at info level by @klihub in #252
  • doc: clarify selecting balloon type by @askervin in #281
  • doc: more consistent terminology in balloons documentation by @askervin in #269
  • fixes: rename default config group label, support/fall back to deprecated labels. by @klihub in #231

New Contributors

Full List of Merged PRs

For a full list of changes see v0.3.2...v0.4.1

v0.3.2

29 Dec 10:28

This patch release fixes image versioning for the operator.

What's New

operator: point containerImage tag to the latest release (v0.3.2) by @fmuyassarov in #218

v0.3.1

28 Dec 11:47
398fdf5

This patch release changes the operator category to a community-defined one instead of a custom category.

What's New

operator: switch to community defined category by @fmuyassarov in #217

v0.3.0

22 Dec 18:46

This release brings a few new plugins, the NRI Plugins Operator, and a new configuration mechanism to configure plugins using CRDs.

What's New

  • Ansible-based operator created with operator-sdk to manage the life cycle of the nri-plugins

  • New plugins:

  • Improvements in policies:

    • balloons: preferCloseToDevices balloon type option by @askervin in #203
    • balloons: add PreferSpreadOnPhysicalCores by @askervin in #126

What's Changed

  • Configuration for plugins switched from ConfigMaps to per-plugin CRDs

Note that existing deployments need to be updated, converting any old ConfigMap to the corresponding custom resource of the policy used in the deployment. Please see the provided sample configurations or the custom resource definitions for policy-specific CRD details.
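
As a rough idea of what such a custom resource could look like for the balloons policy, consider the skeleton below. It is a sketch under assumptions: the apiVersion, kind, and metadata values are not taken from these notes and should be verified against the installed custom resource definitions and the sample configurations, and the balloon type shown is hypothetical.

```yaml
# Sketch of a per-plugin custom resource replacing an old ConfigMap.
# apiVersion, kind, name, and namespace are assumptions; check them
# against the installed CRDs and the provided sample configurations.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  # Settings that used to live in the ConfigMap go under spec.
  # Other required or optional settings are omitted from this sketch.
  balloonTypes:
    - name: example-balloon     # hypothetical balloon type
      minCPUs: 2
      maxCPUs: 2
```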

List of PRs

  • docs: fix link to "all releases" by @marquiz in #81
  • docs: convert all rst to md by @marquiz in #80
  • pkg: eliminate usage of obsolete github.com/pkg/errors. by @klihub in #83
  • build: Fix baseurl when building documentation version selector by @jukkar in #84
  • instrumentation: use a more appropriate propagator. by @klihub in #86
  • Helm improvments by @fmuyassarov in #85
  • config: Fix configuration files for kustomize by @changzhi1990 in #88
  • gihub-action: fix helm linter action trigger path by @fmuyassarov in #89
  • Helm chart configmap fixes and some YAML clean up by @fmuyassarov in #90
  • github: Disable e2e-test workflow by @jukkar in #102
  • helm: fix topology-aware helm chart naming by @fmuyassarov in #95
  • helm/kustomize: fix configmaps by @poussa in #104
  • build(deps): bump pygments from 2.13.0 to 2.15.0 in /docs by @dependabot in #98
  • sample-configs: drop rdt from the sample configuration by @marquiz in #106
  • pkg/multierror: switch to native golang multierror handling. by @klihub in #108
  • Add utility to patch containerd config for enabling nri by @fmuyassarov in #107
  • config-manager: bump go version to 1.20 by @fmuyassarov in #110
  • build: band-aid build system to allow building binaries other than resource policy plugins. by @klihub in #109
  • Automate Helm charts releasing process via GitHub action by @fmuyassarov in #112
  • github/workflows: ditch softprops/turnstyle action by @marquiz in #113
  • github: use the git checkout action to fetch all refs by @marquiz in #114
  • Memory QoS example plugin by @askervin in #115
  • Add nri-memtierd plugin by @askervin in #117
  • memory-qos: fix image building by @askervin in #122
  • docs: improve plugins installation instructions by @fmuyassarov in #119
  • deployment: fix the nri volume type to DirectoryOrCreate by @fmuyassarov in #123
  • helm: add charts for memtierd and memory-qos by @fmuyassarov in #118
  • helm: improve labels and selector labels set on the DaemonSet by @fmuyassarov in #124
  • helm: fix config-manager image name reference in charts defaults by @fmuyassarov in #125
  • nri-memtierd: mount only needed dirs from host by @askervin in #121
  • helm: don't set image.tag in values.yaml by @marquiz in #127
  • github: set chart version when packaging Helm charts by @marquiz in #128
  • github-action: update charts path to include memtierd and memory-qos by @fmuyassarov in #129
  • deployment: refactor config manager to support NRI enabling in CRI-O by @fmuyassarov in #120
  • helm: support NRI enabling prior to running memtierd & memory-qos plugins by @fmuyassarov in #131
  • github: issue templates by @marquiz in #130
  • Bump golang version to 1.20.10 by @marquiz in #133
  • go.mod: update deps by @marquiz in #134
  • helm: don't set initContainerImage:.tag in values.yaml by @marquiz in #136
  • cleanup: nuke unused scripts, Makefile bits and defunct functionality. by @klihub in #135
  • e2e: document e2e test dependencies by @askervin in #139
  • github: fix publish-helm-charts workflow by @marquiz in #143
  • scripts: update helm-publisher by @marquiz in #141
  • workflow: fix the tagging docker images build from new tags by @fmuyassarov in #145
  • docs: update charts installation instructions by @fmuyassarov in #147
  • deployment: adjust Pod restartPolicy to OnFailure by @fmuyassarov in #152
  • Add a helper script for preparing the release by @marquiz in #138
  • fixes: fix a bunch of problems in bootstrapping e2e tests. by @klihub in #154
  • deployment: drop DaemonSet unsupported OnFailure restartPolicy by @fmuyassarov in #155
  • fixes: fix image tagging scheme in publishing workflow. by @klihub in #158
  • scripts: ditch versions for .rpm and .deb packages. by @klihub in #161
  • balloons: add PreferSpreadOnPhysicalCores by @askervin in #126
  • deployment/helm: always use DirectoryOrCreate for /var/run/nri. by @klihub in #162
  • build: don't try to gofmt generated files. by @klihub in #165
  • helm: add an option to set tolerations by @fmuyassarov in #157
  • build/docs: make site-build work again from git worktrees. by @klihub in #168
  • docs: add basic README for each Helm chart on their own directory by @fmuyassarov in #167
  • plugins: add sgx-epc plugin. by @klihub in #156
  • deployment/helm/sgx-epc: fix a small README typo. by @klihub in #170
  • deployment: fix trailing whitespace errors. by @klihub in #169
  • docs: drop unneeded and outdated resource-policy/README.md by @marquiz in #173
  • docs: fix md style by @marquiz in #172
  • deployment/helm: add default .helmignores. by @klihub in #174
  • docs: update the main readme with the list of available plugins by @fmuyassarov in #175
  • config-manager: don't crash if no runtime is found running. by @klihub in #180
  • .github: use release-* release branch pattern. by @klihub in #183
  • fixes: remove unused gRPC bits, stop forcing gRPC to 1.38.0. by @klihub in #186
  • balloons: use new config more consistently, fix incorrect test config keys. by @klihub in #188
  • e2e: fix balloons test02-prometheus-metrics by @askervin in #191
  • test/e2e: test VM provisioning fixes by @klihub in #192
  • test/e2e: fix vagrant usage, make cilium version configurable. by @klihub in #193
  • helm: fix missing tolerations option in sgx-epc plugin by @fmuyassarov in https://github.com/containers/...

v0.2.3

27 Oct 13:28
8b4beb6
Pre-release

This patch release updates Helm Chart documentation, fixes automatic enabling of NRI support in CRI-O runtimes, and updates obsolete image dependencies.

What's Changed

  • config-manager: don't crash if no runtime is found running by @marquiz in #181
  • github: use release-* release branch pattern. by @klihub in #184
  • docs: cherry-pick doc changes from main branch by @fmuyassarov in #182
  • deployment: update files for v0.2.3 by @fmuyassarov in #185
  • deps: cherry-pick #186: update gRPC bits, stop forcing version. by @klihub in #187

Full Changelog: v0.2.2...v0.2.3

v0.2.2

25 Oct 13:47
4e91e62
Pre-release

This patch release adds support for specifying tolerations with Helm and fixes some issues in the Helm charts.

What's Changed

Full Changelog: v0.2.1...v0.2.2