Downtime after upgrading to 1.12.0 - Open "/tmp/nginx/nginx.pid" failed #12645

thomaspeitz · 2025-01-09T08:36:21Z

Keeping this open while our investigation is running. We cannot explain it yet.
Will add more details as soon as we understand it better.
As it broke only a few environments, it is harder to debug.
But it is a warning to check your log lines during an upgrade for:

/tmp/nginx/nginx.pid

What happened:
Upgraded our ingress-controller via helm from

version: 4.11.3

to

version: 4.12.0

This caused a major outage on 4 of 10 clusters. We do not yet understand why.
Kubernetes version 1.31.x

| Timestamp | Message |
| -- | -- |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | nginx: [error] open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.583 | 2025/01/09 07:45:35 [error] 215#215: open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory) |
| 2025-01-09 07:45:35.000 | name=ingress-nginx-general-r6-controller-565c5966f7-8p4rq kind=Pod objectAPIversion=v1 objectRV=2931225444 eventRV=2931226571 reportingcontroller=nginx-ingress-controller sourcecomponent=nginx-ingress-controller reason=RELOAD type=Warning count=1 msg="Error reloading NGINX: exit status 1\n2025/01/09 07:45:35 [notice] 215#215: signal process started\n2025/01/09 07:45:35 [error] 215#215: open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\nnginx: [error] open() \"/tmp/nginx/nginx.pid\" failed (2: No such file or directory)\n" |
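
For anyone who wants to check their own controller logs for the same failure during an upgrade, here is a minimal sketch. It only assumes the message format shown in the log excerpt above; the function name `count_pid_errors` and the regular expression are illustrative and not part of ingress-nginx.

```python
import re
from collections import Counter

# Matches messages like:
#   open() "/tmp/nginx/nginx.pid" failed (2: No such file or directory)
# The optional backslashes cover the escaped form that appears inside event messages.
OPEN_FAILED = re.compile(
    r'open\(\) \\?"(?P<path>[^"\\]+)\\?" failed \((?P<errno>\d+): (?P<reason>[^)]+)\)'
)


def count_pid_errors(lines):
    """Count open() failures per (path, reason) across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for match in OPEN_FAILED.finditer(line):
            counts[(match.group("path"), match.group("reason"))] += 1
    return counts
```

One way to use it would be to save the controller pod logs to a file, pass the file's lines to `count_pid_errors`, and treat any non-zero count for `/tmp/nginx/nginx.pid` as a signal that reloads are failing.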

What you expected to happen:
Ingress controller continues to work.

I am not sure yet. I keep it open while we investigate deeper.

Kubernetes version (use kubectl version):
v1.31.3-eks-59bf375

Environment:
AWS / EKS

Cloud provider or hardware configuration:
AWS
OS (e.g. from /etc/os-release):
Kernel (e.g. uname -a):
Install tools:
- Please mention how/where was the cluster created like kubeadm/kops/minikube/kind etc.
Basic cluster related info:
- kubectl version
- kubectl get nodes -o wide

More data will follow once we have done a breakdown.

How to reproduce this issue:
Hard to reproduce, as it currently only happens on nodes that we cannot test on again.

Update 10.01, 00:10: Tested a deployment of the faulty version again. On some domains the TLS certificate served was the Kubernetes fake certificate, whereas the old version served the real Let's Encrypt certificates. Looks like a TLS issue after the upgrade.
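
A quick way to see which certificate a given hostname receives is to fetch it with SNI and look for the fake certificate's name. The sketch below assumes the controller's built-in default certificate carries the common name "Kubernetes Ingress Controller Fake Certificate"; the function names are illustrative, and the check is a plain byte search rather than a full X.509 parse.

```python
import socket
import ssl

# Assumption: common name of the controller's built-in default certificate.
# Adjust this if the deployment configures its own default certificate.
FAKE_CERT_MARKER = b"Kubernetes Ingress Controller Fake Certificate"


def is_fake_certificate(der_bytes: bytes) -> bool:
    """Return True if the DER-encoded certificate contains the fake-cert name."""
    return FAKE_CERT_MARKER in der_bytes


def fetch_certificate(host: str, port: int = 443, timeout: float = 5.0) -> bytes:
    """Fetch the leaf certificate (DER) served for `host` via SNI, without validating it."""
    context = ssl.create_default_context()
    # Validation is disabled on purpose: a self-signed fake certificate
    # would otherwise abort the handshake before it can be inspected.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert(binary_form=True)
```

Calling `is_fake_certificate(fetch_certificate(name))` for each affected domain before and after the upgrade would show which hosts fall back to the default certificate. If a CDN sits in front of the ingress, the connection has to target the ingress endpoint directly, otherwise the CDN's certificate is returned.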

k8s-ci-robot · 2025-01-09T08:36:30Z

This issue is currently awaiting triage.

If Ingress contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

longwuyuan · 2025-01-12T02:55:18Z

Some security features are now enabled out of the box. Please check the changelog. Highlight:

⚠️ Enable security features by default (https://github.com/kubernetes/ingress-nginx/pull/11819) ⚠️

This changes the default of the following CLI arguments:

    --enable-annotation-validation gets enabled by default.

It also changes the default of the following configuration options:

    allow-cross-namespace-resources gets disabled by default.
    annotations-risk-level gets lowered to "High" by default.
    strict-validate-path-type gets enabled by default.

So snippets and other related annotations may be the underlying cause: not directly visible, yet leading to the behavior you see.

/remove-kind bug

Since it cannot be reproduced yet, let's wait until more data is posted here before applying the bug label.

thomaspeitz · 2025-01-13T21:47:06Z

Thanks for the hints! Really appreciate it.

Tried an updated helm config that sets all those values back to their old defaults.
Sadly, with the same result.

controller:
  enableAnnotationValidations: false

  config:
    allow-cross-namespace-resources: "true"
    strict-validate-path-type: "false"
    annotations-risk-level: "Critical"

I cannot reproduce it yet. Tomorrow I will try adding some certs, some domains, and some ingress objects to another cluster to make this reproducible.

Currently I can only test at night in a small time window, because of the impact this has and because it is only reproducible on a cluster with important traffic.

Maybe it has something to do with the CDN as well, as I see fewer requests on the ingress after the migration than before when I run some integration tests.

thomaspeitz · 2025-01-14T07:26:30Z

#11821
Could be the Lua plugin as well. I may have overlooked something.

thomaspeitz added the kind/bug Categorizes issue or PR as related to a bug. label Jan 9, 2025

k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jan 9, 2025

k8s-ci-robot added the needs-priority label Jan 9, 2025

strongjz added this to [SIG Network] Ingress NGINX Jan 9, 2025

k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jan 12, 2025
