Need to monitor and troubleshoot problems to launch build agents #95

Open
cyrille-leclerc opened this issue May 6, 2021 · 4 comments
Labels
enhancement New feature or request

Comments

@cyrille-leclerc
Contributor

cyrille-leclerc commented May 6, 2021

Problem Description

Launching build agents can fail. Administrators need to be alerted to these failures and be able to monitor and troubleshoot them.

Examples of problems

  • Problem with the configuration of dynamic agents (invalid credentials...)
  • Outage of underlying cloud provisioning infrastructure (no more VMs available, quota reached...)

Solutions

Probably related to hudson.slaves.ComputerListener#onLaunchFailure, but that alone is probably not enough.

See also hudson.slaves.CloudProvisioningListener#onFailure
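
As a rough illustration of how the two callbacks named above could feed monitoring, here is a minimal sketch in plain Java. AgentFailureCounters and its methods are hypothetical names, not part of Jenkins or of this plugin. The sketch assumes a real implementation would call recordLaunchFailure from ComputerListener#onLaunchFailure(Computer, TaskListener) and recordProvisioningFailure from CloudProvisioningListener#onFailure(NodeProvisioner.PlannedNode, Throwable), then export the counts as metrics.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical helper (an assumption, not Jenkins API): counts agent launch and
 * provisioning failures and remembers the most recent one for troubleshooting.
 */
final class AgentFailureCounters {

    /** Failure kinds, one per Jenkins callback that would report them. */
    enum Kind { LAUNCH_FAILURE, PROVISIONING_FAILURE }

    // Populated once in the constructor and never modified structurally afterwards.
    private final Map<Kind, LongAdder> counts = new EnumMap<>(Kind.class);
    private final AtomicReference<String> lastFailure = new AtomicReference<>("");

    AgentFailureCounters() {
        for (Kind kind : Kind.values()) {
            counts.put(kind, new LongAdder());
        }
    }

    /** Intended to be called from ComputerListener#onLaunchFailure. */
    void recordLaunchFailure(String agentName) {
        record(Kind.LAUNCH_FAILURE, agentName, null);
    }

    /** Intended to be called from CloudProvisioningListener#onFailure. */
    void recordProvisioningFailure(String plannedNodeName, Throwable cause) {
        record(Kind.PROVISIONING_FAILURE, plannedNodeName, cause);
    }

    /** Number of failures of the given kind recorded so far. */
    long count(Kind kind) {
        return counts.get(kind).sum();
    }

    /** Human-readable description of the most recent failure, or "" if none. */
    String lastFailure() {
        return lastFailure.get();
    }

    private void record(Kind kind, String name, Throwable cause) {
        counts.get(kind).increment();
        String reason = cause == null ? "unknown cause" : String.valueOf(cause.getMessage());
        lastFailure.set(kind + " " + name + ": " + reason);
    }
}
```

Counting per failure kind rather than per agent name keeps the number of metric series small, which matters when one-shot agents get a fresh name for every build.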

Issues

Cloud Providers might not report errors with the provisioner:

@cyrille-leclerc cyrille-leclerc added the enhancement New feature or request label May 6, 2021
@v1v
Member

v1v commented May 21, 2021

I've been testing out the CloudProvisioningListener extension point with the Google Compute Engine plugin.

What have I seen so far?

  • Google Compute Engine with the oneShot configuration logs warnings with stack traces:
2021-05-21 12:02:51.759+0000 [id=100]	WARNING	hudson.slaves.NodeProvisioner#lambda$update$6: Unexpected exception encountered while provisioning agent obs11-ubuntu-18-linux-370cgd
java.io.IOException: Agent failed to connect, even though the launcher didn't report it. See the log output for details.
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:320)
	at hudson.slaves.SlaveComputer$$Lambda$264/0x0000000000000000.call(Unknown Source)
Caused: java.util.concurrent.ExecutionException
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:206)
	at com.google.jenkins.plugins.computeengine.ComputeEngineCloud.lambda$getPlannedNodeFuture$0(ComputeEngineCloud.java:315)
	at com.google.jenkins.plugins.computeengine.ComputeEngineCloud$$Lambda$507/0x0000000000000000.call(Unknown Source)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:71)
	at java.util.concurrent.FutureTask.run$$$capture(FutureTask.java:266)
	at java.util.concurrent.FutureTask.run(FutureTask.java)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:823)

This happens even though the agent was created when the build requested it and was removed afterwards.

Pipeline:

pipeline {
    agent {
      label 'linux'
    }

    stages {
        stage('Build') {
            steps {
                echo 'hi'
            }
        }
    }
}

Screenshot 2021-05-21 at 12 55 49

Screenshot 2021-05-21 at 12 55 55

A bit of context:

As far as I can see, the Google Compute Engine plugin uses the OnceRetentionStrategy, so taskCompleted is executed for the AbstractCloudSlave once the task has finished.

I'm not sure we can use the CloudProvisioningListener without first fixing the root cause of these exceptions with one-shot workers.

What are your thoughts?

@v1v
Member

v1v commented Jun 22, 2021

I've just asked on the jenkins-dev mailing list.

@v1v
Member

v1v commented Jun 23, 2021

The issue with the oneShot provisioning seems to be fixed with the Jenkins core version in our test instance: 2.189.1.

image

@v1v
Member

v1v commented Jun 23, 2021

Metrics are now available in the plugin, so I'd now like to move forward with the comment in #101 (comment) and track the provisioning of the cloud agents using a distributed trace.

Should the two traces be linked: the one we are tracking so far, tied to the CI build, and the one for the cloud agents? I'm not sure whether that's possible, though.
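
One possible way to relate the two traces is sketched below in plain Java. It assumes the build's trace context can be handed to the code that provisions the agent, which this thread does not confirm. The context is carried in the W3C Trace Context traceparent format (version 00: a 32-hex-digit trace id, a 16-hex-digit parent span id, and 2 hex digits of flags). TraceParent is a hypothetical class written for illustration, not an OpenTelemetry or Jenkins API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical value class (an assumption for illustration) for a version-00
 * W3C Trace Context "traceparent" header.
 */
final class TraceParent {

    // "00" version, 32 hex trace-id, 16 hex parent-id, 2 hex trace-flags.
    // This sketch does not reject the all-zero ids that the specification forbids.
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final String flags;

    TraceParent(String traceId, String spanId, String flags) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.flags = flags;
    }

    /** Parses a header value, rejecting anything that is not version 00. */
    static TraceParent parse(String header) {
        Matcher m = FORMAT.matcher(header);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a version-00 traceparent: " + header);
        }
        return new TraceParent(m.group(1), m.group(2), m.group(3));
    }

    /** Context for a span in the same trace: same trace id, its own span id. */
    TraceParent withSpanId(String newSpanId) {
        return new TraceParent(traceId, newSpanId, flags);
    }

    /** Serializes back to the header form. */
    String format() {
        return "00-" + traceId + "-" + spanId + "-" + flags;
    }
}
```

If the provisioning spans reuse the build's trace id, as withSpanId does here, both end up in a single trace. The alternative is to keep a separate trace for the cloud agents and record the build's trace id and span id on it as a link. Which of the two fits better depends on whether an agent is provisioned for exactly one build.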


2 participants