v0.8.0

@github-actions released this 18 Dec 08:15

This is a new major release of NRI Reference Plugins. It brings several new features, a number of bug fixes, and improvements to the build system, CI, end-to-end tests, and test coverage.

What's New

Balloons Policy

  • New preserve policy option enables matching containers whose CPU
    and memory affinity must not be modified by the resource policy.

    This allows selected containers to access all CPUs and memory
    nodes. For example, to allow pcm-sensor-server to access MSRs on
    every CPU for low-level metrics:

    preserve:
      matchExpressions:
        - key: pod/labels/app.kubernetes.io/name
          operator: In
          values:
            - pcm-sensor-server
    

    Previously, this required the cpu.preserve.resource-policy.nri.io
    and memory.preserve.resource-policy.nri.io pod annotations.

  • New freqGovernor CPU class option enables setting the CPU
    frequency governor based on the CPU class of a balloon. Example:

    balloonTypes:
    - name: powersaving
      cpuClass: mypowersave
    control:
      cpu:
        classes:
          mypowersave:
            freqGovernor: powersave
    
  • New memoryTypes balloon type option specifies required memory
    types when setting memory affinity. For example, containers in
    high-memory-bandwidth balloons will use only HBM when configured as:

    balloonTypes:
    - name: high-memory-bandwidth
      memoryTypes:
      - HBM
    
  • Support for the memory-type.resource-policy.nri.io pod annotation,
    which sets memory affinity to the closest HBM, DRAM, PMEM, or any
    combination of these. This annotation is a pod-level override of
    the memoryTypes balloon type option.

  • L2-cache group aware CPU allocation and sharing. For example,
    containers in a balloon can be allowed to burst on idle
    (unallocated) CPUs that share the same L2 cache as CPUs allocated to
    the balloon.

    balloonTypes:
    - name: l2burst
      shareIdleCPUsInSame: l2cache
    
  • The pinMemory policy option can now be overridden at the balloon
    type level. This enables setting the memory affinity of containers
    in certain balloons only, while leaving it unset in others, or the
    other way around. Example:

    pinMemory: false
    balloonTypes:
    - name: latency-sensitive
      pinMemory: true
      preferIsolCpus: true
      preferNewBalloons: true
    
  • New default configuration runs Guaranteed containers on dedicated
    CPUs while BestEffort and Burstable containers are allowed to share
    remaining CPUs on the same socket, but not cross socket boundaries.

  • Balance BestEffort containers between balloons that have equal
    amounts of available resources.

  • Lower risk of OOMs with pinMemory: true, because memory affinity
    handling was refactored to use libmem, the new policy-agnostic
    memory allocator.
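
The fragments above can be combined into a single policy configuration. The following is a rough sketch only: the apiVersion, kind, metadata, and reservedResources values are assumptions not stated in these notes, and the options under spec are copied from the examples above. Check the names against the custom resource definition installed in your cluster before using it.

```yaml
# Sketch only. apiVersion, kind, metadata and reservedResources are
# assumptions; verify them against your installed policy CRD.
apiVersion: config.nri/v1alpha1
kind: BalloonsPolicy
metadata:
  name: default
  namespace: kube-system
spec:
  reservedResources:
    cpu: 1000m              # assumed value, adjust for your nodes
  # Policy-level default: do not pin memory unless a balloon type asks for it.
  pinMemory: false
  # Leave the CPU and memory affinity of matching containers untouched.
  preserve:
    matchExpressions:
      - key: pod/labels/app.kubernetes.io/name
        operator: In
        values:
          - pcm-sensor-server
  balloonTypes:
    # CPUs of this balloon use the governor set in the mypowersave class.
    - name: powersaving
      cpuClass: mypowersave
    # Containers in this balloon are pinned to HBM only.
    - name: high-memory-bandwidth
      pinMemory: true       # balloon type level override of the default
      memoryTypes:
        - HBM
    # Containers may burst onto idle CPUs that share an L2 cache.
    - name: l2burst
      shareIdleCPUsInSame: l2cache
  control:
    cpu:
      classes:
        mypowersave:
          freqGovernor: powersave
```

Setting pinMemory: true on the high-memory-bandwidth balloon type is an illustrative choice here, made so that the memoryTypes restriction applies even though the policy-level default is false.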

Topology Aware Policy

The Topology Aware policy can now export Prometheus metrics per topology zone. Exported metrics include the pool CPU set and memory set; the total capacity, allocations, and available capacity of the shared CPU subpool; the total capacity, allocations, and available amount of memory; and the number of assigned containers and of containers in the shared subpool.

To enable exporting these metrics, make sure that you are running with the latest policy configuration custom resource definition and that policy is included in the spec/instrumentation/metrics/enabled slice, like this:

...
spec:
...
  instrumentation:
  ...
    metrics:
      enabled:
      - policy
...

The Topology Aware policy can now use data from the kubelet's Pod Resource API to generate extra topology hints for resource allocation and alignment. These hints are disabled in the default configuration installed by Helm charts. To enable them, make sure that you are running with the latest policy configuration custom resource definition and you have spec/agent/podResourceAPI set to true in the configuration, like this:

spec:
  agent:
    ...
    podResourceAPI: true
...
  • Support for the memory-type.resource-policy.nri.io pod annotation,
    which sets memory affinity to the closest HBM, DRAM, PMEM, or any
    combination of these.
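
As an illustration of the annotation, a pod manifest might look like the sketch below. The pod name, container name, and image are hypothetical. The /pod and /container.NAME key suffixes and the comma-separated value syntax are assumptions based on the usual resource-policy.nri.io annotation conventions; these notes do not spell them out, so confirm them in the policy documentation.

```yaml
# Sketch only: pod, container and image names are hypothetical, and the
# annotation key suffixes and value syntax are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: memory-type-example
  annotations:
    # Assumed: pin the container named "worker" to the closest HBM and DRAM.
    memory-type.resource-policy.nri.io/container.worker: HBM,DRAM
    # Assumed: pin all other containers in the pod to the closest DRAM.
    memory-type.resource-policy.nri.io/pod: DRAM
spec:
  containers:
    - name: worker
      image: registry.example.com/worker:latest
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          cpu: "2"
          memory: 4Gi
```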

What's Changed

Balloons Policy Fixes and Improvements

  • balloons: add "preserve" option to match containers whose pinning must not be modified by @askervin in #368
  • balloons: add support for cpu frequency governor tuning by @fmuyassarov in #374
  • balloons: set frequency scaling governor only when requested by @fmuyassarov in #379
  • balloons: improve handling of containers with no CPU requests by @askervin in #386
  • balloons: add debug logging to selecting a balloon type by @askervin in #396
  • balloons: support for L2 cache cluster allocation by @askervin in #384
  • balloons: add memoryTypes to balloon types by @askervin in #395
  • Add balloon type specific pinMemory option by @askervin in #451

Topology Aware Policy Fixes and Improvements

  • metrics: add topology-aware policy metrics collection. by @klihub in #406
  • topology-aware: correctly reconfigure implicit affinities for configuration changes. by @klihub in #394
  • fixes: copy assigned memory zone in grant clone. by @klihub in #413

New Policy Agnostic Metrics, Common De Facto Exporters

  • metrics: cleanup metrics registration, collection and gathering. by @klihub in #403
  • metrics: add de-facto standard collectors. by @klihub in #404
  • metrics: simplify policy/backend metrics collection interface. by @klihub in #408
  • metrics: add policy system collector. by @klihub in #405

Topology Hints Based on Pod Resource API

  • podresapi: agent,config,helm: make agent runtime configurable. by @klihub in #418
  • podresapi: resmgr,agent: generate topology hints from Pod Resource API. by @klihub in #419
  • podresapi: topology-aware: use Pod Resource API hints if present. by @klihub in #420
  • agent,resmgr: merge PodResources{List,Map}, cache last List() result. by @klihub in #423

Common Resource Management Fixes and Improvements

  • resmgr: fix "qosclass" in policy expressions by @askervin in #387
  • resmgr,agent: propagate startup config error back to CR. by @klihub in #416
  • libmem: implement policy-agnostic memory allocation/accounting. by @klihub in #332
  • libmem: typo and thinko fixes. by @klihub in #381
  • sysfs: enable faking CPU cache configurations using OVERRIDE_SYS_CACHES by @askervin in #383
  • cpuallocator, plugins: handle priority as an option. by @klihub in #414
  • Fix typos in expression code doc and matchExpression yamls by @askervin in #370

Helm Chart and Configuration Fixes and Improvements

  • helm: enable prometheus autodiscovery by @klihub in #393
  • helm: new balloons default configuration by @askervin in #391
  • apis/config: use consistent assignment in +kubebuilder:validation tags. by @klihub in #397
  • sample-configs: fix a copy-pasted comment thinko. by @klihub in #402

End-to-end Testing Fixes and Improvements

  • e2e: pull and save runtime logs after each test. by @klihub in #367
  • e2e: adjust metrics test for updated PrettyName(). by @klihub in #366
  • e2e: switch default test distro to fedora/40-cloud-base. by @klihub in #375
  • e2e: fix provisioning for Ubuntu cloud image. by @klihub in #377
  • e2e: enable vagrant debugging. by @klihub in #376
  • e2e: adjust $VM_HOSTNAME for policy node config usage. by @klihub in #378
  • e2e: skip long running tests by default. by @klihub in #373
  • e2e: fix command filenames in test output directories by @askervin in #390
  • e2e: containerd 2.0.0. provisioning fixup. by @klihub in #400
  • e2e/balloons: remove unknown/unused helm-launch argument. by @klihub in #407

Build Environment Fixes and Improvements

  • build: enable building debug binaries and images by @askervin in #388
  • build: update controller-tools to v0.16.5. by @klihub in #398
  • build: enable race-detector in DEBUG=1 builds. by @klihub in #409
  • build: enable race-detector in image build, too. by @klihub in #410
  • dev: add Tiltfile for local development by @fmuyassarov in #382
  • Tilt: turn on prometheus metrics exporting by default for local development by @fmuyassarov in #411
  • images: fix FromAsCasing warnings by @fmuyassarov in #380
  • fixes: fix vagrant dotenv loading and default qemu directory. by @klihub in #389
  • Migrate code-gen to kube_codegen.sh by @fmuyassarov in #412
  • operator: ensure tree is restored to a clean state by @fmuyassarov in #415
  • docs: fix build error, avoid testdata scan infinite loop. by @klihub in #421

Dependency Updates:

Codespell Fixes, Codespell Now Enabled in CI

  • .codespell*,.github,*: add codespell configuration, workflow, fix codespell errors. by @klihub in #356
  • .github: one more codespell fix. by @klihub in #371
  • codespell: ignore more files. by @klihub in #372
  • .github: add workflow for rejecting PRs that introduce whitespace errors. by @klihub in #171

Golangci-lint Fixes, Golangci-lint Now Enabled in CI

Full Changelog: v0.7.1...v0.8.0