branchprotector continuously failing #33900

Closed
BenTheElder opened this issue Dec 3, 2024 · 22 comments · Fixed by #34143
Labels
area/prow Issues or PRs related to prow help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.

Comments

@BenTheElder
Member

BenTheElder commented Dec 3, 2024

It's timing out after 5h20m: https://prow.k8s.io/?job=ci-test-infra-branchprotector

This is causing us to not sync branch protection rules, e.g. related to #33857 / #33880

We have a big problem with this tooling not scaling: it uses a ton of API quota, needs to run continuously, and its cost scales with the number of repos and branches, which is only growing.
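To make the scaling concern concrete, here is a rough back-of-the-envelope sketch. It assumes the job is bound by GitHub's documented primary limit of 5,000 REST requests per hour for an authenticated user token. The branch count and the number of calls per branch are hypothetical placeholders, not measurements from this job.

```go
package main

import "fmt"
import "time"

// minRunDuration returns a lower bound on how long one full sync takes when
// every branch costs callsPerBranch API requests and the token is limited to
// requestsPerHour requests.
func minRunDuration(branches, callsPerBranch, requestsPerHour int) time.Duration {
	totalCalls := branches * callsPerBranch
	return time.Duration(totalCalls) * time.Hour / time.Duration(requestsPerHour)
}

func main() {
	// Hypothetical inputs chosen only for illustration.
	const branches = 20000
	const callsPerBranch = 2
	// GitHub's primary REST rate limit for an authenticated user token.
	const requestsPerHour = 5000

	fmt.Println(minRunDuration(branches, callsPerBranch, requestsPerHour))
}
```

Under these assumed inputs the lower bound is already several hours, and it grows linearly with every added repo or branch.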

/area prow
/sig testing
/priority important-soon

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added area/prow Issues or PRs related to prow sig/testing Categorizes an issue or PR as relevant to SIG Testing. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 3, 2024
@BenTheElder BenTheElder pinned this issue Dec 3, 2024
@BenTheElder
Member Author

filed #33901 as a stopgap. it's not great.

@BenTheElder
Member Author

Aborted the current run and started a new re-run with the latest config and the 12h timeout, but we know it will be at least 5+ hours before we can tell if that worked as a stopgap ... https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-test-infra-branchprotector/1864084620887199744

@BenTheElder
Member Author

That run failed at 11h35m57s.

The logs are enormous and full of 404 errors:

protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.2 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.4 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.5 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-7.0 from protected=false: get current branch protection: getting branch protection 404: Not Found, update jobselector from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.1 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.3 from protected=true: get current branch protection: getting branch protection 404: Not Found], update kubectl-validate: [update alexzielenski-patch-1 from protected=true: get current branch protection: getting branch protection 404: Not Found, update alexzielenski-patch-2 from protected=true: get current branch protection: getting branch protection 404: Not Found, update apelisse-exit-code-suggestion from protected=true: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found], update kube-scheduler-wasm-extension: [update golang-sdk from protected=false: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.1 from protected=true: get current branch protection: getting branch protection 404: Not Found], update sigs-github-actions: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update knftables: update master from protected=true: get current branch protection: getting branch 
protection 404: Not Found, update noderesourcetopology-api: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update cloud-pv-admission-labeler: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update cluster-inventory-api: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update obscli: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update node-ipam-controller: [update readme from protected=true: get current branch protection: getting branch protection 404: Not Found, update update-codegen/fix from protected=true: get current branch protection: getting branch protection 404: Not Found, update init from protected=false: get current branch protection: getting branch protection 404: Not Found, update e2e_aojea from protected=true: get current branch protection: getting branch protection 404: Not Found, update integration-tests from protected=true: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found], update lws: [update release-0.2.0 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.4.0 from protected=false: get current branch protection: getting branch protection 404: Not Found, update release-0.4.1 from protected=false: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.1 from protected=true: get current branch protection: getting branch protection 404: Not Found, update release-0.4.2 from protected=false: get current branch protection: getting branch protection 404: Not Found, update release-0.3.0 from 
protected=true: get current branch protection: getting branch protection 404: Not Found, update rupliu/migrate from protected=true: get current branch protection: getting branch protection 404: Not Found, update rupliu/refactor from protected=true: get current branch protection: getting branch protection 404: Not Found], update testgrid: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update referencegrant-poc: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update karpenter-provider-cluster-api: [update multiplex-client from protected=false: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found], update kube-network-policies: [update aojea-patch-3 from protected=false: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-4 from protected=false: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-6 from protected=false: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-7 from protected=false: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-1 from protected=true: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-2 from protected=true: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-5 from protected=false: get current branch protection: getting branch protection 404: Not Found, update aojea-patch-8 from protected=false: get current branch protection: getting branch protection 404: Not Found], update secrets-store-sync-controller: [update gh-pages from protected=false: get policy: 
kubernetes-sigs/secrets-store-sync-controller=gh-pages defines a policy, which requires protect: true, update main from protected=true: get current branch protection: getting branch protection 404: Not Found], update wg-device-management: update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update etcd-manager: [update fix-release-process-docs from protected=false: get current branch protection: getting branch protection 404: Not Found, update main from protected=true: get current branch protection: getting branch protection 404: Not Found], update cve-feed-osv: [update main from protected=true: get current branch protection: getting branch protection 404: Not Found, update vulns-CVE-2020-10749.json from protected=true: get current branch protection: getting branch protection 404: Not Found, update vulns-CVE-2024-5321.json from protected=true: get current branch protection: getting branch protection 404: Not Found, update vulns-CVE-2020-8553.json from protected=false: get current branch protection: getting branch protection 404: Not Found, update vulns-CVE-2024-7646.json from protected=false: get current branch protection: getting branch protection 404: Not Found], update wg-serving: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update multi-network: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update multi-network-api: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update llm-instance-gateway: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update gwctl: [update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update release-0.1 from protected=false: get current branch protection: getting branch protection 404: Not Found], 
update cni-dra-driver: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update container-object-storage-interface: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update ingate: update main from protected=false: get current branch protection: getting branch protection 404: Not Found, update kjob: update main from protected=false: get current branch protection: getting branch protection 404: Not

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/ci-test-infra-branchprotector/1864084620887199744
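Because the error output is one huge flattened string, a quick tally per failure phrase can help separate the 404s from the policy errors. The following is a minimal sketch, not part of branchprotector: the helper name `tallyErrors` is made up, and the sample message is a shortened copy of the excerpt above.

```go
package main

import (
	"fmt"
	"strings"
)

// tallyErrors counts how often each failure phrase occurs in a flattened
// branchprotector error message.
func tallyErrors(msg string, phrases []string) map[string]int {
	counts := make(map[string]int, len(phrases))
	for _, p := range phrases {
		counts[p] = strings.Count(msg, p)
	}
	return counts
}

func main() {
	// Shortened sample in the same shape as the log excerpt in this comment.
	msg := "update lws: [update main from protected=true: get current branch protection: getting branch protection 404: Not Found, " +
		"update release-0.1 from protected=true: get current branch protection: getting branch protection 404: Not Found], " +
		"update secrets-store-sync-controller: [update gh-pages from protected=false: get policy: " +
		"kubernetes-sigs/secrets-store-sync-controller=gh-pages defines a policy, which requires protect: true]"

	// Both phrases are copied verbatim from the log.
	phrases := []string{
		"getting branch protection 404: Not Found",
		"defines a policy, which requires protect: true",
	}

	counts := tallyErrors(msg, phrases)
	for _, p := range phrases {
		fmt.Printf("%4d  %s\n", counts[p], p)
	}
}
```

Run against the full log, a tally like this would show at a glance whether almost everything is a 404 or whether several distinct problems are mixed together.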

@BenTheElder
Member Author

I think it at least barely completed running without being killed by the timeout, but the error output is massive and it feels like it's not running correctly.

NOTE: this has actually been ongoing for months AFAICT, there have been scattered threads in slack. It's just gotten enough attention now to make sure we have a tracking issue >.<

@BenTheElder
Member Author

some discussion but nothing conclusive in https://kubernetes.slack.com/archives/CDECRSC5U/p1733337720772809

@pohly
Contributor

pohly commented Dec 5, 2024

This causes the "Waiting for status to be reported" for the removed pull-kubernetes-verify-lint because it is the job of the branch protector to tell GitHub which statuses it should wait for - or in this case, not wait for anymore.

The removal of that job has not been mirrored to GitHub yet, despite that completed run.
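The mismatch described here can be pictured as a set difference between the status contexts GitHub currently requires and the ones the job config still defines. The sketch below is a hypothetical illustration, not prow's implementation: `staleContexts` is a made-up helper, and `pull-example-unit` is an invented context name. Only `pull-kubernetes-verify-lint` comes from this thread.

```go
package main

import (
	"fmt"
	"sort"
)

// staleContexts returns the status contexts that GitHub still requires but
// that the desired configuration no longer lists.
func staleContexts(current, desired []string) []string {
	want := make(map[string]bool, len(desired))
	for _, c := range desired {
		want[c] = true
	}
	var stale []string
	for _, c := range current {
		if !want[c] {
			stale = append(stale, c)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// What GitHub is still waiting for (hypothetical snapshot).
	current := []string{"pull-kubernetes-verify-lint", "pull-example-unit"}
	// What the job config defines after the lint job was removed.
	desired := []string{"pull-example-unit"}

	fmt.Println(staleContexts(current, desired))
}
```

Until a sync removes the stale entry on the GitHub side, every PR keeps showing "Waiting for status to be reported" for a job that no longer exists.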

@pacoxu
Member

pacoxu commented Dec 5, 2024

A quick proposal: split this CI job into several jobs grouped by SIG/project/org. Is this doable?

@BenTheElder
Member Author

A quick proposal: split this CI job into several jobs grouped by SIG/project/org. Is this doable?

That was discussed in the slack thread above. But currently the job is completing, so quota/throughput, while bad, is not the problem.

It's clearly also bugged. Someone will have to spend more time investigating this.

An immediate mitigation is reaching out to github management to manually update github settings, but these problems will continue to impact the organization if the tooling isn't fixed.

@BenTheElder
Member Author

Raised in #github-management: https://kubernetes.slack.com/archives/C01672LSZL0/p1733438889444729

@BenTheElder
Member Author

@MadhavJivrajani manually removed pull-kubernetes-verify-lint from kubernetes/kubernetes branch protection rules as a stop-gap mitigation. But this system is still fundamentally not working well.

@BenTheElder
Member Author

We're still seeing issues with blocking merge requirements not being removed from repos: https://kubernetes.slack.com/archives/C01672LSZL0/p1734188195640379

/help

@k8s-ci-robot
Contributor

@BenTheElder:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

We're still seeing issues with blocking merge requirements not being removed from repos: https://kubernetes.slack.com/archives/C01672LSZL0/p1734188195640379

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Dec 16, 2024
@BenTheElder
Member Author

filed kubernetes-sigs/prow#345

@dims
Member

dims commented Jan 6, 2025

👀

@BenTheElder
Member Author

After updating the PAT scopes thanks to debugging from @danilo-gemoli 🙏 https://kubernetes.slack.com/archives/CDECRSC5U/p1736268326525839?thread_ts=1736193846.842659&cid=CDECRSC5U

https://prow.k8s.io/?job=ci-test-infra-branchprotector

now fails in only 13 minutes, but with a new inscrutable error:

{"component":"branchprotector","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:101","func":"main.(*Errors).add","level":"info","msg":"update kubernetes-sigs: [update node-feature-discovery: update gh-pages from protected=false: get policy: kubernetes-sigs/node-feature-discovery=gh-pages defines a policy, which requires protect: true, update cluster-proportional-autoscaler: update gh-pages from protected=false: get policy: kubernetes-sigs/cluster-proportional-autoscaler=gh-pages defines a policy, which requires protect: true, update external-dns: update gh-pages from protected=false: get policy: required prow jobs require branch protection, update metrics-server: update gh-pages from protected=false: get policy: kubernetes-sigs/metrics-server=gh-pages defines a policy, which requires protect: true, update descheduler: update gh-pages from protected=false: get policy: kubernetes-sigs/descheduler=gh-pages defines a policy, which requires protect: true, update apisnoop: update gh-pages from protected=true: get policy: kubernetes-sigs/apisnoop=gh-pages defines a policy, which requires protect: true, update aws-ebs-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-ebs-csi-driver=gh-pages defines a policy, which requires protect: true, update aws-fsx-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-fsx-csi-driver=gh-pages defines a policy, which requires protect: true, update sig-storage-local-static-provisioner: update gh-pages from protected=false: get policy: required prow jobs require branch protection, update aws-efs-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-efs-csi-driver=gh-pages defines a policy, which requires protect: true, update secrets-store-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/secrets-store-csi-driver=gh-pages defines a policy, which requires protect: true, update node-feature-discovery-operator: update 
gh-pages from protected=false: get policy: kubernetes-sigs/node-feature-discovery-operator=gh-pages defines a policy, which requires protect: true, update nfs-subdir-external-provisioner: update gh-pages from protected=false: get policy: kubernetes-sigs/nfs-subdir-external-provisioner=gh-pages defines a policy, which requires protect: true, update nfs-ganesha-server-and-external-provisioner: update gh-pages from protected=false: get policy: kubernetes-sigs/nfs-ganesha-server-and-external-provisioner=gh-pages defines a policy, which requires protect: true, update secrets-store-sync-controller: update gh-pages from protected=false: get policy: kubernetes-sigs/secrets-store-sync-controller=gh-pages defines a policy, which requires protect: true]","severity":"info","time":"2025-01-08T16:07:26Z"}
{"component":"branchprotector","error":"update kubernetes: [update kube-state-metrics: update gh-pages from protected=false: get policy: kubernetes/kube-state-metrics=gh-pages defines a policy, which requires protect: true, update ingress-nginx: update gh-pages from protected=false: get policy: kubernetes/ingress-nginx=gh-pages defines a policy, which requires protect: true, update autoscaler: update gh-pages from protected=false: get policy: kubernetes/autoscaler=gh-pages defines a policy, which requires protect: true, update cloud-provider-aws: update gh-pages from protected=false: get policy: kubernetes/cloud-provider-aws=gh-pages defines a policy, which requires protect: true, update cloud-provider-openstack: update gh-pages from protected=false: get policy: kubernetes/cloud-provider-openstack=gh-pages defines a policy, which requires protect: true, update cloud-provider-vsphere: update gh-pages from protected=true: get policy: kubernetes/cloud-provider-vsphere=gh-pages defines a policy, which requires protect: true]","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:152","func":"main.main","level":"error","msg":"0","severity":"error","time":"2025-01-08T16:07:26Z"}
{"component":"branchprotector","error":"update kubernetes-sigs: [update node-feature-discovery: update gh-pages from protected=false: get policy: kubernetes-sigs/node-feature-discovery=gh-pages defines a policy, which requires protect: true, update cluster-proportional-autoscaler: update gh-pages from protected=false: get policy: kubernetes-sigs/cluster-proportional-autoscaler=gh-pages defines a policy, which requires protect: true, update external-dns: update gh-pages from protected=false: get policy: required prow jobs require branch protection, update metrics-server: update gh-pages from protected=false: get policy: kubernetes-sigs/metrics-server=gh-pages defines a policy, which requires protect: true, update descheduler: update gh-pages from protected=false: get policy: kubernetes-sigs/descheduler=gh-pages defines a policy, which requires protect: true, update apisnoop: update gh-pages from protected=true: get policy: kubernetes-sigs/apisnoop=gh-pages defines a policy, which requires protect: true, update aws-ebs-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-ebs-csi-driver=gh-pages defines a policy, which requires protect: true, update aws-fsx-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-fsx-csi-driver=gh-pages defines a policy, which requires protect: true, update sig-storage-local-static-provisioner: update gh-pages from protected=false: get policy: required prow jobs require branch protection, update aws-efs-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/aws-efs-csi-driver=gh-pages defines a policy, which requires protect: true, update secrets-store-csi-driver: update gh-pages from protected=false: get policy: kubernetes-sigs/secrets-store-csi-driver=gh-pages defines a policy, which requires protect: true, update node-feature-discovery-operator: update gh-pages from protected=false: get policy: kubernetes-sigs/node-feature-discovery-operator=gh-pages 
defines a policy, which requires protect: true, update nfs-subdir-external-provisioner: update gh-pages from protected=false: get policy: kubernetes-sigs/nfs-subdir-external-provisioner=gh-pages defines a policy, which requires protect: true, update nfs-ganesha-server-and-external-provisioner: update gh-pages from protected=false: get policy: kubernetes-sigs/nfs-ganesha-server-and-external-provisioner=gh-pages defines a policy, which requires protect: true, update secrets-store-sync-controller: update gh-pages from protected=false: get policy: kubernetes-sigs/secrets-store-sync-controller=gh-pages defines a policy, which requires protect: true]","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:152","func":"main.main","level":"error","msg":"1","severity":"error","time":"2025-01-08T16:07:26Z"}
{"component":"branchprotector","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:154","func":"main.main","level":"fatal","msg":"Encountered 2 errors protecting branches","severity":"fatal","time":"2025-01-08T16:07:26Z"}

@BenTheElder
Member Author

{"component":"branchprotector","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:101","func":"main.(*Errors).add","level":"info","msg":"update kubernetes: [update kube-state-metrics: update gh-pages from protected=false: get policy: kubernetes/kube-state-metrics=gh-pages defines a policy, which requires protect: true, update ingress-nginx: update gh-pages from protected=false: get policy: kubernetes/ingress-nginx=gh-pages defines a policy, which requires protect: true, update autoscaler: update gh-pages from protected=false: get policy: kubernetes/autoscaler=gh-pages defines a policy, which requires protect: true, update cloud-provider-aws: update gh-pages from protected=false: get policy: kubernetes/cloud-provider-aws=gh-pages defines a policy, which requires protect: true, update cloud-provider-openstack: update gh-pages from protected=false: get policy: kubernetes/cloud-provider-openstack=gh-pages defines a policy, which requires protect: true, update cloud-provider-vsphere: update gh-pages from protected=true: get policy: kubernetes/cloud-provider-vsphere=gh-pages defines a policy, which requires protect: true]","severity":"info","time":"2025-01-08T15:58:45Z"}

@BenTheElder
Member Author

So they're all for gh-pages branches? Maybe another permission needed on the token or something?

@danilo-gemoli
Contributor

danilo-gemoli commented Jan 9, 2025

I guess this time it's a different issue. The error states the following:

update kubernetes: [
	update kube-state-metrics:       update gh-pages from protected=false: get policy: kubernetes/kube-state-metrics=gh-pages       defines a policy, which requires protect: true,
	update ingress-nginx:            update gh-pages from protected=false: get policy: kubernetes/ingress-nginx=gh-pages            defines a policy, which requires protect: true,
	update autoscaler:               update gh-pages from protected=false: get policy: kubernetes/autoscaler=gh-pages               defines a policy, which requires protect: true,
	update cloud-provider-aws:       update gh-pages from protected=false: get policy: kubernetes/cloud-provider-aws=gh-pages       defines a policy, which requires protect: true,
	update cloud-provider-openstack: update gh-pages from protected=false: get policy: kubernetes/cloud-provider-openstack=gh-pages defines a policy, which requires protect: true,
	update cloud-provider-vsphere:   update gh-pages from protected=true:  get policy: kubernetes/cloud-provider-vsphere=gh-pages   defines a policy, which requires protect: true
]

the relevant config (here) for those repositories is:

branch-protection:
  orgs:
    kubernetes:
      protect: true
      repos:
        autoscaler:
          branches:
            gh-pages:
              protect: false
        cloud-provider-aws:
          branches:
            gh-pages:
              protect: false
        cloud-provider-openstack:
          branches:
            gh-pages:
              protect: false
        cloud-provider-vsphere:
          branches:
            gh-pages:
              protect: false
        ingress-nginx:
          branches:
            gh-pages:
              protect: false
        kube-state-metrics:
          branches:
            gh-pages:
              protect: false

it seems that branch protection is enabled at org level:

branch-protection:
  orgs:
    kubernetes:
      protect: true

but each of those repository defines protect: false on the gh-pages branch. The error is coming from here:

	if policy.Protect != nil && !*policy.Protect {
		...
		if policy.defined() && !boolValFromPtr(c.BranchProtection.AllowDisabledPolicies) {
			return nil, fmt.Errorf("%s/%s=%s defines a policy, which requires protect: true", org, repo, branch)
		}
		...
	}

if I'm reading this correctly it means that a child branch policy can't override its parent unless:

branch-protection:
  allow_disabled_policies: true

which is not the case.
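The reading above can be reproduced with a small self-contained sketch. It is a simplified stand-in, not prow's code: the `Policy` type, `merge`, and `check` are hypothetical, and it assumes the merged policy carries at least one setting inherited from the org level (here an invented `example-context`), which is what would make `defined()` true for a branch that only sets `protect: false`.

```go
package main

import "fmt"

// Policy is a simplified, hypothetical stand-in for a branch protection policy.
type Policy struct {
	Protect                *bool
	RequiredStatusContexts []string // stands in for "any other setting"
}

// defined reports whether the policy sets anything besides Protect.
func (p Policy) defined() bool { return len(p.RequiredStatusContexts) > 0 }

// merge applies the child's settings on top of the parent's.
func merge(parent, child Policy) Policy {
	merged := Policy{
		Protect:                parent.Protect,
		RequiredStatusContexts: append([]string{}, parent.RequiredStatusContexts...),
	}
	if child.Protect != nil {
		merged.Protect = child.Protect
	}
	merged.RequiredStatusContexts = append(merged.RequiredStatusContexts, child.RequiredStatusContexts...)
	return merged
}

// check mirrors the quoted condition: an unprotected branch may not carry
// other settings unless disabled policies are explicitly allowed.
func check(policy Policy, allowDisabled bool, org, repo, branch string) error {
	if policy.Protect != nil && !*policy.Protect {
		if policy.defined() && !allowDisabled {
			return fmt.Errorf("%s/%s=%s defines a policy, which requires protect: true", org, repo, branch)
		}
	}
	return nil
}

func main() {
	yes, no := true, false
	org := Policy{Protect: &yes, RequiredStatusContexts: []string{"example-context"}}
	branch := Policy{Protect: &no}
	merged := merge(org, branch)

	// Without allow_disabled_policies the merged policy is rejected.
	fmt.Println(check(merged, false, "kubernetes", "autoscaler", "gh-pages"))
	// With allow_disabled_policies: true the same policy passes.
	fmt.Println(check(merged, true, "kubernetes", "autoscaler", "gh-pages"))
}
```

In this model the branch-level `protect: false` is not rejected on its own. It is rejected because the inherited settings make the merged policy count as defined while protection is off.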

@pacoxu
Member

pacoxu commented Jan 13, 2025

https://github.com/kubernetes-sigs/prow/blob/8e8a5cfe7516358bf3e4449635c32b3935aed860/cmd/branchprotector/protect.go#L459-L462

	bp, err := p.cfg.GetPolicy(orgName, repo, branchName, branch, p.cfg.GetPresubmitsStatic(orgName+"/"+repo), &protected)
	if err != nil {
		return fmt.Errorf("get policy: %w", err)
	}

The permission to get the policy does not seem to be granted.
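For reading the log, it may help to see how a `get policy:` segment ends up inside the long message. In the quoted snippet the prefix is added around an error returned by `p.cfg.GetPolicy`, a call on the config object, so the underlying cause may lie in config evaluation rather than in a GitHub permission, though that is an inference from the snippet alone. The sketch below is hypothetical: `getPolicy`, `updateBranch`, and `updateRepo` are made-up names, and the outer prefixes are reconstructed from the log format rather than copied from prow.

```go
package main

import (
	"errors"
	"fmt"
)

// errPolicy stands in for the error returned by the config lookup.
var errPolicy = errors.New("kubernetes/autoscaler=gh-pages defines a policy, which requires protect: true")

// getPolicy is a hypothetical placeholder for the policy lookup.
func getPolicy() error { return errPolicy }

// updateBranch wraps the lookup error the same way the quoted snippet does,
// then adds a branch-level prefix shaped like the one seen in the logs.
func updateBranch(branch string, protected bool) error {
	if err := getPolicy(); err != nil {
		inner := fmt.Errorf("get policy: %w", err)
		return fmt.Errorf("update %s from protected=%t: %w", branch, protected, inner)
	}
	return nil
}

// updateRepo adds the repo-level prefix.
func updateRepo(repo, branch string, protected bool) error {
	if err := updateBranch(branch, protected); err != nil {
		return fmt.Errorf("update %s: %w", repo, err)
	}
	return nil
}

func main() {
	err := updateRepo("autoscaler", "gh-pages", false)
	fmt.Println(err)
	// The original cause is still reachable through the wrap chain.
	fmt.Println(errors.Is(err, errPolicy))
}
```

Each colon-separated prefix in the log therefore corresponds to one layer of wrapping, and the rightmost segment is the original cause.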

@BenTheElder
Member Author

ok now we just have:

{"component":"branchprotector","error":"update kubernetes-sigs: [update external-dns: update gh-pages from protected=false: get policy: required prow jobs require branch protection, update sig-storage-local-static-provisioner: update gh-pages from protected=false: get policy: required prow jobs require branch protection]","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:152","func":"main.main","level":"error","msg":"0","severity":"error","time":"2025-01-13T20:05:49Z"}
{"component":"branchprotector","file":"sigs.k8s.io/prow/cmd/branchprotector/protect.go:154","func":"main.main","level":"fatal","msg":"Encountered 1 errors protecting branches","severity":"fatal","time":"2025-01-13T20:05:49Z"}

After #34125

@BenTheElder
Member Author

Confirmed green: https://prow.k8s.io/?job=ci-test-infra-branchprotector

Thank you very much @danilo-gemoli !

@danilo-gemoli
Contributor

Finally!

7 participants
@pohly @dims @BenTheElder @pacoxu @danilo-gemoli @k8s-ci-robot and others