
Excessive number of credentials lookups when contributing environment variables #967

Open
Dohbedoh opened this issue Oct 17, 2024 · 19 comments

Comments

@Dohbedoh

Dohbedoh commented Oct 17, 2024

Jenkins and plugins versions report

When the option Export OpenTelemetry configuration as environment variables is enabled, every single pipeline step causes a credentials lookup. This might seem insignificant, but when using an external credentials system (such as the HashiCorp Vault Credentials Provider), it results in an excessive number of fetch requests.

While the credentials provider could arguably be blamed for not implementing a cache, the OpenTelemetry plugin seems to be abusing the Credentials API at every single step when that option is enabled.

Environment
Jenkins: 2.479.1
OS: Linux - 6.1.100+
Java: 17.0.13 - Red Hat, Inc. (OpenJDK 64-Bit Server VM)
---
credentials:1381.v2c3a_12074da_b_
opentelemetry:3.1391.vcb_a_a_b_9779d75
opentelemetry-api:1.40.0-36.v1e02b_b_4db_8f4

What Operating System are you using (both controller, and any agents involved in the problem)?

UBI docker images

Reproduction steps

  • Spin up a controller
  • Set up an external credentials provider, for example the HashiCorp Vault plugin
  • Create a Vault credential for OpenTelemetry
  • Configure OpenTelemetry with Header Authentication (any authentication would likely do) and use the Vault credential
  • Enable Export OpenTelemetry configuration as environment variables
  • Create a pipeline like the following:
node {
    def stages = [:]
    for (int i = 0; i < 1000; i++) {
        // parallel expects branch names to be Strings, so use a String key
        stages["stage-" + i] = {
            echo "ok"
        }
    }
    parallel stages
}
  • Build the pipeline and notice all the calls made to the external provider
  • Disable Export OpenTelemetry configuration as environment variables
  • See that it’s not happening anymore (except for maybe one for the run)

Expected Results

The external credentials provider should not be hammered with credentials fetch requests.

Actual Results

The external credentials provider is hammered with credentials fetch requests.

Anything else?

No response

Are you interested in contributing a fix?

No response

@cyrille-leclerc
Contributor

Valid problem, complicated to solve in the OTel plugin.

Questions:

  • What would be the TTL of the credentials?
  • Did you submit an RFE to the Jenkins HashiCorp Vault plugin?

Workaround ideas:

  • Did you consider deploying an OTel Collector with an OTLP receiver that doesn't require authentication on the Jenkins controller?
  • Did you consider hardcoding the authentication secret in the otel.exporter.otlp.headers config param (see OTEL_EXPORTER_OTLP_HEADERS)?
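
For reference, the second workaround boils down to setting the standard OTLP headers variable. A minimal sketch, where the header name and token value are placeholders:

# Illustrative only: hardcode the OTLP auth header instead of looking up a credential.
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"

The obvious trade-off is that the secret then lives in plain plugin configuration instead of the credentials store.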

@cyrille-leclerc
Contributor

cyrille-leclerc commented Nov 6, 2024

@Dohbedoh or @timja can you connect us to Jenkins contributors who have a view on the Jenkins Credentials Provider and the question of TTL for credentials lookups in external vaults? I feel it's more a question for the remote credentials provider than for consumers like the OTel plugin.

@timja
Member

timja commented Nov 6, 2024

It's @jenkinsci/credentials-plugin-developers, most likely @jtnord.

From memory, remote credentials themselves aren't supposed to be cached.
You can cache the metadata, e.g. the results of a list-credentials call (without the secret data), but the actual secret value isn't supposed to be cached.

The OpenTelemetry plugin seems to be abusing the Credentials API at every single step when that option is enabled.

@Dohbedoh / @cyrille-leclerc do you know where the code is that's causing this problem?


If this is just for the OTel credential, I would expect it to be loaded once for the run and cached for the duration of the run (I could be convinced of a longer cache).
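
A minimal sketch of what that per-run caching could look like on the consumer side. OtelCredentialCache is hypothetical; findCredentialById is the real Credentials API call, which also records the credential usage against the run:

import com.cloudbees.plugins.credentials.CredentialsProvider;
import hudson.model.Run;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.jenkinsci.plugins.plaincredentials.StringCredentials;

// Hypothetical helper: resolve the OTel credential once per run and reuse it
// for every step of that run, instead of hitting the provider on each step.
class OtelCredentialCache {
    // Keyed by the run's externalizable id; evicted when the run completes.
    private static final Map<String, StringCredentials> CACHE = new ConcurrentHashMap<>();

    static StringCredentials forRun(Run<?, ?> run, String credentialsId) {
        return CACHE.computeIfAbsent(run.getExternalizableId(), key ->
            CredentialsProvider.findCredentialById(credentialsId, StringCredentials.class, run));
    }

    // Call from e.g. a RunListener.onCompleted so secrets aren't held after the run.
    static void evict(Run<?, ?> run) {
        CACHE.remove(run.getExternalizableId());
    }
}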

@kuisathaverat
Contributor

kuisathaverat commented Nov 6, 2024

The following pipeline causes the same behavior without the OpenTelemetry plugin installed: there is no TTL for the credentials, and they are always requested from the provider, which can cause quota exhaustion on some providers. The solution is a TTL in the credentials provider, which fixes the issue for all use cases. Implementing something in the OpenTelemetry plugin does not make sense because it does not resolve the real issue; it only patches it for a single plugin.

node {
  def stages = [:]
  for (int i = 0; i < 1000; i++) {
    // parallel expects branch names to be Strings, so use a String key
    stages["stage-" + i] = {
      def secrets = [
        [path: 'secret/token_secret', engineVersion: 1, secretValues: [
            [envVar: 'token', vaultKey: 'value_one']]],
      ]
      def configuration = [vaultUrl: 'http://my-very-other-vault-url.com',
                           vaultCredentialId: 'my-vault-cred-id',
                           engineVersion: 1]
      withVault([configuration: configuration, vaultSecrets: secrets]) {
        sh 'curl -H "Token: $token" https://some.api/'
      }
    }
  }
  parallel stages
}
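
For illustration, a minimal sketch of the provider-side TTL being argued for here. TtlSecretCache and fetchFromVault are hypothetical; the rotation trade-off this introduces is discussed below:

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical TTL cache a remote credentials provider could put in front of
// its vault client, so N concurrent lookups collapse into one fetch per TTL.
class TtlSecretCache {
    private record Entry(String secret, Instant fetchedAt) {}

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;
    private final Function<String, String> fetchFromVault; // the real remote call

    TtlSecretCache(Duration ttl, Function<String, String> fetchFromVault) {
        this.ttl = ttl;
        this.fetchFromVault = fetchFromVault;
    }

    String get(String secretPath) {
        // compute() is atomic per key, so concurrent callers share one fetch.
        return cache.compute(secretPath, (path, cached) -> {
            if (cached != null && cached.fetchedAt().plus(ttl).isAfter(Instant.now())) {
                return cached; // still fresh: no remote call
            }
            return new Entry(fetchFromVault.apply(path), Instant.now());
        }).secret();
    }
}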

@timja
Member

timja commented Nov 6, 2024

Yes, but most plugins won't call it on every step; they'll only call it once...

@cyrille-leclerc
Contributor

cyrille-leclerc commented Nov 6, 2024

I support @kuisathaverat's points:

  • If a Jenkins component is to implement caching of credentials, it should be the credentials provider.
    • If downstream plugins implement credentials caching, we will suffer inconsistencies in behaviour, security vulnerabilities, and code duplication.
    • If the Jenkins credentials providers state that credentials should not be cached, downstream plugins should not try to cache those credentials.
  • Due to the imperative nature of Jenkins pipelines, steps using credentials are likely to be called in loops and thus to require credentials caching.

@kuisathaverat
Contributor

kuisathaverat commented Nov 6, 2024

There is another use case that causes the same behavior. If you configure the OpenTelemetry plugin to use Vault credentials and you spin up 100, 200, 300, or 500 jobs, you will make the same number of requests to the Vault even though you did not enable Export OpenTelemetry configuration as environment variables. A TTL implemented in the plugin for the credentials used in a pipeline does not solve that issue either.
So maybe using external credentials for plugin access does not make much sense, because you can cause tons of requests to the external provider. I don't know how often those credentials change, but I guess not much, and if they do, you can implement a credentials-update pipeline on your controller.

@jtnord
Member

jtnord commented Nov 6, 2024

Hello :)

Whilst credential providers could cache, I would consider it bad form.
e.g. you could get jobs that fail because someone updated an SSH deployment credential. What would be appropriate here?

I'm not sure why OTel is using a credential here and for what means. Without exploring the why: a credential should only be used when it is to be shared (re-used) across different "things"; for something specific to one piece of code, a Secret used in the plugin's (global) configuration should be used.
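
For illustration, a minimal sketch of that Secret-in-plugin-configuration approach. The class and field names are hypothetical; GlobalConfiguration, Secret, and load()/save() are the standard Jenkins APIs:

import hudson.Extension;
import hudson.util.Secret;
import jenkins.model.GlobalConfiguration;
import org.kohsuke.stapler.DataBoundSetter;

// Sketch: the encrypted secret lives in the plugin's own global configuration,
// so reading it never touches a credentials provider, remote or otherwise.
@Extension
public class ExampleOtelConfiguration extends GlobalConfiguration {
    private Secret authToken;

    public ExampleOtelConfiguration() {
        load(); // populate the field from the persisted XML on startup
    }

    public Secret getAuthToken() {
        return authToken;
    }

    @DataBoundSetter
    public void setAuthToken(Secret authToken) {
        this.authToken = authToken;
        save();
    }
}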

The following pipeline causes the same behavior without the OpenTelemetry plugin installed: there is no TTL for the credentials, and they are always requested from the provider, which can cause quota exhaustion on some providers. The solution is a TTL in the credentials provider, which fixes the issue for all use cases. Implementing something in the OpenTelemetry plugin does not make sense because it does not resolve the real issue; it only patches it for a single plugin.

You are only trading one issue for another. The pipeline is abusing a system; as per the above, caching (introducing a TTL) could introduce random failures, by failing to connect to the service for $TTL after the password was rotated.

Whilst any credential provider could choose to implement this, to avoid random failures of CI (which is just a very bad thing) the only possible value would be -1, to disable caching.
You are then in the realms of "oh, I only want this specific credential to be cached"...

@jtnord
Member

jtnord commented Nov 6, 2024

Whilst credential providers could cache, I would consider it bad form. e.g. you could get jobs that fail because someone updated an SSH deployment credential. What would be appropriate here?

Having said that, if the external provider has some ability to support caching itself (e.g. an HTTP GET with If-Modified-Since headers), then it can make sense for the provider to use that information and have some limited caching. Whether conditional requests are exempt from rate-limiting, or actually reduce the load on the external system, is another matter entirely.
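
A rough sketch of that kind of limited, self-validating cache, using ETag / If-None-Match (the same idea as If-Modified-Since). The wrapper and endpoint are hypothetical, and whether a given vault product supports conditional requests at all is the open question:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical conditional-GET wrapper: revalidate a cached secret with
// If-None-Match so an unchanged secret costs a cheap 304 instead of a full fetch.
class ConditionalSecretFetcher {
    private final HttpClient client = HttpClient.newHttpClient();
    private String cachedBody;
    private String cachedEtag;

    synchronized String fetch(URI secretUri) throws IOException, InterruptedException {
        HttpRequest.Builder request = HttpRequest.newBuilder(secretUri);
        if (cachedEtag != null) {
            request.header("If-None-Match", cachedEtag);
        }
        HttpResponse<String> response =
            client.send(request.build(), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 304) {
            return cachedBody; // unchanged since the last fetch
        }
        cachedBody = response.body();
        cachedEtag = response.headers().firstValue("ETag").orElse(null);
        return cachedBody;
    }
}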

@timja
Member

timja commented Nov 6, 2024

for something specific to one piece of code, a Secret used in the plugin's (global) configuration should be used.

Maybe once; these days the credentials API is used for most secrets. It allows treating all secrets the same and using external providers to store and rotate them. I wouldn't go down the Secret approach in plugins these days; it's very rare to see that now.
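
For context, the lookup pattern this refers to looks roughly like this on the consumer side; the credential id is a placeholder, while lookupCredentials and the matchers are the real Credentials API:

import com.cloudbees.plugins.credentials.CredentialsMatchers;
import com.cloudbees.plugins.credentials.CredentialsProvider;
import hudson.security.ACL;
import java.util.Collections;
import jenkins.model.Jenkins;
import org.jenkinsci.plugins.plaincredentials.StringCredentials;

// Typical lookup-by-id through the Credentials API: every call like this goes
// back to the providers, which is exactly the traffic discussed in this issue.
class CredentialsLookupExample {
    static StringCredentials lookup(String credentialsId) {
        return CredentialsMatchers.firstOrNull(
            CredentialsProvider.lookupCredentials(
                StringCredentials.class,   // secret type being requested
                Jenkins.get(),             // global (controller-level) scope
                ACL.SYSTEM,                // authentication for the lookup
                Collections.emptyList()),  // no domain requirements
            CredentialsMatchers.withId(credentialsId));
    }
}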

@kuisathaverat
Contributor

kuisathaverat commented Nov 6, 2024

Another use case that could cause the same behavior is when you configure the logs backend in the OpenTelemetry plugin and use an external credentials provider. When you have 100-500 users checking logs, you will have a request per user to fetch the credentials from the external provider every time a user requests a log page. This could be a lot more problematic than the other two use cases. We all know the F5 users and the GB-sized console logs.

So, for large systems, I would not recommend using external credentials to configure any plugin that could trigger hundreds of requests on user interaction.

@timja
Member

timja commented Nov 6, 2024

But wouldn't the credential just be retrieved once to view the logs? and potentially retrieved again on unauthorised?

@kuisathaverat
Contributor

But wouldn't the credential just be retrieved once to view the logs? and potentially retrieved again on unauthorised?

retrieved again

@jtnord
Member

jtnord commented Nov 6, 2024

for something specific to one piece of code, a Secret used in the plugin's (global) configuration should be used.

Maybe once; these days the credentials API is used for most secrets. It allows treating all secrets the same and using external providers to store and rotate them. I wouldn't go down the Secret approach in plugins these days; it's very rare to see that now.

Use CasC for secret rotation (with the ability to link to Vault etc.).

The use case for Secrets is real, as per this (ab)use of the API.

@jtnord
Member

jtnord commented Nov 6, 2024

Another use case that could cause the same behavior is when you configure the logs backend in the OpenTelemetry plugin and use an external credentials provider. When you have 100-500 users checking logs, you will have a request per user to fetch the credentials from the external provider every time a user requests a log page. This could be a lot more problematic than the other two use cases. We all know the F5 users and the GB-sized console logs.

Please explain how a cache in the credentials provider would work here without giving these 100-500 users 403 errors and the inability to view logs for the cache period?

@kuisathaverat
Contributor

Please explain how a cache in the credentials provider would work here without giving these 100-500 users 403 errors and the inability to view logs for the cache period?

If a secret's TTL is 1 minute, you reduce the number of requests for that secret to one per minute. So, if you have 500 user requests for logs within that minute, you will make only one request to the external service; without the TTL, you would make 500.

@cyrille-leclerc
Contributor

A TTL of a few dozen seconds would make sense to me.
Yes, there is a risk with credentials rotation, but that risk always exists, and it would help us work around challenges like credentials lookups in for loops in pipelines.

@Dohbedoh
Author

Dohbedoh commented Nov 8, 2024

FWIW, I think OpenTelemetry was an obvious culprit because it retrieves credential details on every single step execution, which can easily reach a high rate in Jenkins. I don't think I have seen any other plugin do that.
Note that the for loop is just a scenario for reproduction with a single job. The environment where I detected this is a rather active one with a moderate amount of concurrent / parallel execution. And Jenkins is supposed to handle concurrent CI build execution at scale. So in an active environment, you are very likely to experience a rate of several hundred requests a minute.

I agree that credentials providers should probably implement a TTL solution. But what is concerning is that it takes one plugin setup, like OpenTelemetry, to realize that a remote credentials provider needs a TTL implementation. Why would a Jenkins integration with a remote credentials provider be fine until you integrate with OpenTelemetry? As if you need to pass the "OpenTelemetry test" to validate your remote credentials provider implementation, if you know what I mean :/
Or maybe this is a gotcha that should be added to the Credentials API documentation: that credentials providers should be able to handle N requests per second, which usually means caching / TTL for remote credentials providers?

@kuisathaverat
Contributor

I don't think I have seen any other plugin do that.

Any plugin configured to use an external credentials provider can trigger the same. You only need to trigger parallel stuff and you make a mess, so it is too easy to abuse. Indeed, I stopped using the Vault plugin because of that. For example, with the SSH build agents, if you configure the credentials in an external provider and you request 100 agents (I say 100 because it is not much), you will make 100 requests to the external provider.

Or maybe this is a gotcha that should be added to the Credentials API documentation: that credentials providers should be able to handle N requests per second, which usually means caching / TTL for remote credentials providers?

In the end, it depends on your infrastructure. If you can make 10000000000 requests without issues, it is not a big deal and you can configure it; but if you can only make 100 requests per minute (or something like that), you should know to configure another kind of credential or it will cause issues. Putting a warning on the credentials configuration dropdown would be an easy way to warn users and let them decide (it's up to you).
