Downtime after upgrading to 1.12.0 - Open "/tmp/nginx/nginx.pid" failed #12645

Open
thomaspeitz opened this issue Jan 9, 2025 · 4 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@thomaspeitz
Contributor

thomaspeitz commented Jan 9, 2025

Keeping this open while our investigation is running. We cannot explain it yet.
I will add more details as soon as we understand it better.
As it broke only a few environments, it is harder to debug.
But take it as a warning to check your log lines during the upgrade.

/tmp/nginx/nginx.pid

What happened:
Upgraded our ingress-controller via helm from

version: 4.11.3

to

version: 4.12.0

This caused a major outage on 4 of 10 clusters. We cannot yet explain why.
Kubernetes version 1.31.x

| Timestamp | Message |
| -- | -- |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.000 | name=ingress-nginx-general-r6-controller-565c5966f7-8p4rq kind=Pod objectAPIversion=v1 objectRV=2931225444 eventRV=2931226571 reportingcontroller=nginx-ingress-controller sourcecomponent=nginx-ingress-controller reason=RELOAD type=Warning count=1 msg="Error reloading NGINX: exit status 1\n2025/01/09 07:45:35 [notice] 215#215: signal process started\n2025/01/09 07:45:35 [error] 215#215: open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\nnginx: [error] open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\n" |
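For anyone who wants to scan controller logs for this symptom during an upgrade, here is a minimal sketch (not part of the original report). It assumes the logs have been saved as plain text, for example from `kubectl logs`, and that the error lines follow the format shown above. The function name `count_open_failures` is made up for illustration.

```python
import re
from collections import Counter

# Matches nginx error lines such as:
#   2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
# as well as the shorter 'nginx: [error] open() ...' variant. Quotes may be
# backslash-escaped when the line is embedded in an event message.
OPEN_FAILED = re.compile(
    r'\[error\].*?open\(\) \\?"(?P<path>[^"\\]+)\\?" failed '
    r'\((?P<errno>\d+): (?P<reason>[^)]+)\)'
)


def count_open_failures(log_text: str) -> Counter:
    """Count log lines reporting a failed open(), keyed by (path, errno)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = OPEN_FAILED.search(line)  # first failure on the line only
        if match:
            counts[(match["path"], int(match["errno"]))] += 1
    return counts
```

A non-zero count for `("/tmp/nginx/nginx.pid", 2)` right after a rollout would correspond to the reload failures reported here.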

What you expected to happen:
Ingress controller continues to work.

I am not sure yet. I am keeping this open while we investigate further.

Kubernetes version (use kubectl version):
v1.31.3-eks-59bf375

Environment:
AWS / EKS

  • Cloud provider or hardware configuration:
    AWS
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
    • Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
  • Basic cluster related info:
    • kubectl version
    • kubectl get nodes -o wide

More data will follow once we have done a breakdown.

How to reproduce this issue:
Hard to reproduce as it is currently happening on the nodes which we cannot test again.

Update 10.01 - 00:10 - Tested a deployment of the faulty version again. On some domains the SSL certs were served as K8s fake certs, whereas the old version served the real Let's Encrypt certs. Looks like a TLS issue after the upgrade.
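A quick way to see which domains are affected is to fetch the served certificate per hostname and look for the fake one. The sketch below is an illustration, not something from the original report. It assumes the controller's default certificate uses the common name "Kubernetes Ingress Controller Fake Certificate", and it relies on a simple heuristic: the name appears as plain ASCII inside the DER-encoded certificate. The function names are hypothetical.

```python
import socket
import ssl

# Assumption: the controller's built-in default certificate carries this
# common name. Adjust if a custom default certificate is configured.
FAKE_CERT_CN = b"Kubernetes Ingress Controller Fake Certificate"


def looks_like_fake_cert(der_cert: bytes) -> bool:
    """Heuristic check: the CN is stored as plain ASCII in the DER bytes."""
    return FAKE_CERT_CN in der_cert


def fetch_der_cert(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Return the leaf certificate served for `host` (sent as SNI), unvalidated."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be disabled before CERT_NONE
    context.verify_mode = ssl.CERT_NONE  # the fake certificate is self-signed
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)


def hosts_serving_fake_cert(hosts):
    """Return the subset of `hosts` that currently serve the fake certificate."""
    return [h for h in hosts if looks_like_fake_cert(fetch_der_cert(h))]
```

Running `hosts_serving_fake_cert([...])` with the affected domains before and after the upgrade would show whether the set of hosts falling back to the default certificate changes.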

@thomaspeitz thomaspeitz added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 9, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@longwuyuan
Contributor

longwuyuan commented Jan 12, 2025

Some security features are now enabled out of the box. Please check the changelog. Highlight:

⚠️ Enable security features by default (https://github.com/kubernetes/ingress-nginx/pull/11819) ⚠️

This changes the default of the following CLI arguments:

    --enable-annotation-validation gets enabled by default.

It also changes the default of the following configuration options:

    allow-cross-namespace-resources gets disabled by default.
    annotations-risk-level gets lowered to "High" by default.
    strict-validate-path-type gets enabled by default.

So snippets and other related annotations may be the underlying cause: not directly visible, yet leading to the behavior you see.

/remove-kind bug

Since it cannot be reproduced yet, let's wait until more data is posted here before applying the bug label.

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 12, 2025
@thomaspeitz
Contributor Author

Thanks for the hints! Really appreciate it.

I tried an updated Helm config that sets all those values back to their old defaults.
Sadly, the result was the same.

controller:
  enableAnnotationValidations: false

  config:
    allow-cross-namespace-resources: "true"
    strict-validate-path-type: "false"
    annotations-risk-level: "Critical"
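To double-check that such overrides actually reached the controller, one could compare the live ConfigMap data against the old defaults. The following is a minimal sketch under stated assumptions: the old and new values are taken from this thread only, the ConfigMap data has already been loaded into a plain dict of strings, and the function name is hypothetical. The `--enable-annotation-validation` CLI flag is left out because it is not a ConfigMap key.

```python
# Values as quoted in this thread: (old default, new default in 1.12.0).
CHANGED_DEFAULTS = {
    "allow-cross-namespace-resources": ("true", "false"),
    "strict-validate-path-type": ("false", "true"),
    "annotations-risk-level": ("Critical", "High"),
}


def keys_not_pinned_to_old_default(configmap_data: dict) -> list:
    """Return the keys whose value is unset or differs from the old default."""
    not_pinned = []
    for key, (old, _new) in CHANGED_DEFAULTS.items():
        # A missing key means the controller falls back to its new default.
        if configmap_data.get(key, "").lower() != old.lower():
            not_pinned.append(key)
    return not_pinned
```

An empty result means all three options are pinned to their previous values, so the changed defaults alone would not explain the difference in behavior.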

I cannot reproduce it yet. Tomorrow I will try adding some certs, some domains, and some ingress objects to another cluster to make this reproducible.

Currently I can only test at night in a small timeframe, due to the impact this has and because it is only reproducible on a cluster with important traffic.

Maybe it has something to do with the CDN as well, as I see fewer requests on the ingress after the migration than before when I run some integration tests.

@thomaspeitz
Contributor Author

#11821
It could be the Lua plugin as well. I overlooked something.
