Workload is admitted, but job remains suspended #3936

Open
arvind-v opened this issue Jan 7, 2025 · 3 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/needs-information Indicates an issue needs more information in order to work on it.

arvind-v commented Jan 7, 2025

What happened:
I am trying to get a simple test case with Kueue running. The Workload is admitted, but the Job named test-job-one remains in a suspended state.

$ kubectl get workloads -A
NAMESPACE    NAME                     QUEUE               RESERVED IN     ADMITTED   FINISHED   AGE
kueue-jobs   job-test-job-one-5e3de   ml-training-queue   cluster-queue   True                  92s

$ kubectl get jobs -A
NAMESPACE        NAME              STATUS      COMPLETIONS   DURATION   AGE
kubemod-system   kubemod-crt-job   Complete    1/1           4s         64m
kueue-jobs       test-job-one      Suspended   0/1                      112s

What you expected to happen:
I expected the Job to be resumed and start running once its Workload was admitted.

How to reproduce it (as minimally and precisely as possible):

ResourceFlavor, ClusterQueue, LocalQueue and Job specs:

---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
spec: {}
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue
spec:
  namespaceSelector: {} # match all
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: "300m"
      - name: "memory"
        nominalQuota: "512Mi"
      - name: "pods"
        nominalQuota: 5
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: ml-training-queue
  namespace: kueue-jobs
spec:
  clusterQueue: cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: test-job-one
  namespace: kueue-jobs
  labels:
    kueue.x-k8s.io/queue-name: ml-training-queue
spec:
  suspend: true
  template:
    spec:
      containers:
      - name: test
        image: busybox
        command: ["sh", "-c", "echo 'Hello from Kueue!' && sleep 30"]
        resources:
          requests:
            cpu: "200m"
            memory: "256Mi"
      restartPolicy: Never
---
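As a sanity check on the manifests above, the Job's requests can be compared against the ClusterQueue's nominalQuota values. The sketch below is not part of Kueue: `to_millicores`, `to_mib`, and `fits` are hypothetical helpers that only handle the integer, `m`, `Mi`, and `Gi` quantity forms used in this report. It suggests that quota is not the obstacle, which is consistent with the Workload being admitted.

```shell
# Hypothetical helpers for illustration only; not a Kueue or kubectl feature.

# Convert a CPU quantity ("200m" or a whole number of cores) to millicores.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Convert a memory quantity with a Mi or Gi suffix to MiB.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

# fits REQUEST QUOTA: succeeds when the request does not exceed the quota.
# Both arguments must already be in the same unit.
fits() {
  [ "$1" -le "$2" ]
}

# Values from the Job (requests) and the ClusterQueue (nominalQuota).
# The Job also needs 1 pod against a "pods" quota of 5.
if fits "$(to_millicores 200m)" "$(to_millicores 300m)" &&
   fits "$(to_mib 256Mi)" "$(to_mib 512Mi)" &&
   fits 1 5; then
  echo "fits"
else
  echo "exceeds quota"
fi
```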

Here is the output from the troubleshooting steps. This is the only job in the cluster.

$ kubectl describe job -n kueue-jobs
Name:             test-job-one
Namespace:        kueue-jobs
Selector:         batch.kubernetes.io/controller-uid=d01cd96b-e986-4533-b9ab-50965226e5a0
Labels:           kueue.x-k8s.io/queue-name=ml-training-queue
Annotations:      <none>
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Suspend:          true
Backoff Limit:    6
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=d01cd96b-e986-4533-b9ab-50965226e5a0
           batch.kubernetes.io/job-name=test-job-one
           controller-uid=d01cd96b-e986-4533-b9ab-50965226e5a0
           job-name=test-job-one
  Containers:
   test:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      echo 'Hello from Kueue!' && sleep 30
    Requests:
      cpu:         200m
      memory:      256Mi
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Events:
  Type    Reason           Age                    From                        Message
  ----    ------           ----                   ----                        -------
  Normal  Suspended        4m22s                  job-controller              Job suspended
  Normal  CreatedWorkload  4m22s                  batch/job-kueue-controller  Created Workload: kueue-jobs/job-test-job-one-5e3de
  Normal  Started          4m22s (x2 over 4m22s)  batch/job-kueue-controller  Admitted by clusterQueue cluster-queue

Environment:

  • Kubernetes version (use kubectl version): Client Version: v1.30.2, Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version: v1.31.4-eks-2d5f260
  • Kueue version (use git describe --tags --dirty --always): v0.10.0
  • Cloud provider or hardware configuration: Amazon EKS
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Install tools: helm
@arvind-v arvind-v added the kind/bug Categorizes issue or PR as related to a bug. label Jan 7, 2025
mbobrovskyi (Contributor) commented Jan 7, 2025

I tested this locally and could not reproduce it. The Job starts as expected.

kind create cluster

helm install kueue kueue/ --create-namespace --namespace kueue-system
kubectl wait deploy/kueue-controller-manager -nkueue-system --for=condition=available --timeout=5m

kubectl create ns kueue-jobs
kubectl apply -f manifests.yaml

kubectl get po -n kueue-jobs
NAME                 READY   STATUS    RESTARTS   AGE
test-job-one-4scls   1/1     Running   0          24s

kubectl describe job -n kueue-jobs
Name:             test-job-one
Namespace:        kueue-jobs
Selector:         batch.kubernetes.io/controller-uid=4e82e59b-4aee-4446-8a8a-6004faabcf43
Labels:           kueue.x-k8s.io/queue-name=ml-training-queue
Annotations:      <none>
Parallelism:      1
Completions:      1
Completion Mode:  NonIndexed
Suspend:          false
Backoff Limit:    6
Start Time:       Tue, 07 Jan 2025 09:24:32 +0200
Pods Statuses:    1 Active (1 Ready) / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  batch.kubernetes.io/controller-uid=4e82e59b-4aee-4446-8a8a-6004faabcf43
           batch.kubernetes.io/job-name=test-job-one
           controller-uid=4e82e59b-4aee-4446-8a8a-6004faabcf43
           job-name=test-job-one
  Containers:
   test:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      echo 'Hello from Kueue!' && sleep 30
    Requests:
      cpu:         200m
      memory:      256Mi
    Environment:   <none>
    Mounts:        <none>
  Volumes:         <none>
  Node-Selectors:  <none>
  Tolerations:     <none>
Events:
  Type    Reason            Age   From                        Message
  ----    ------            ----  ----                        -------
  Normal  Suspended         15s   job-controller              Job suspended
  Normal  CreatedWorkload   15s   batch/job-kueue-controller  Created Workload: kueue-jobs/job-test-job-one-223a0
  Normal  Started           15s   batch/job-kueue-controller  Admitted by clusterQueue cluster-queue
  Normal  SuccessfulCreate  15s   job-controller              Created pod: test-job-one-k56js
  Normal  Resumed           15s   job-controller              Job resumed
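
Compared with the output in the report, two things differ here: `Suspend` is `false`, and the job-controller emitted `SuccessfulCreate` and `Resumed` events after admission. The following sketch classifies `kubectl describe job` output along those lines. `diagnose_job` is a hypothetical helper, not a Kueue or kubectl feature, and it relies on the text layout shown in this thread.

```shell
# diagnose_job: reads `kubectl describe job` output on stdin and prints one of
#   not-suspended            the Job's Suspend field is false
#   admitted-but-suspended   Kueue reported admission, but Suspend is still true
#   not-admitted             no admission event was found
diagnose_job() {
  awk '
    $1 == "Suspend:"            { suspend = $2 }
    /Admitted by clusterQueue/  { admitted = 1 }
    END {
      if (suspend == "false") { print "not-suspended" }
      else if (admitted)      { print "admitted-but-suspended" }
      else                    { print "not-admitted" }
    }'
}
```

Usage, assuming the Job name and namespace from this issue: `kubectl describe job test-job-one -n kueue-jobs | diagnose_job`. On the reporter's output this would print `admitted-but-suspended`, which narrows the problem to the step between admission and the Job being resumed.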

mbobrovskyi (Contributor) commented

Could you please check the Kueue controller logs?

kubectl logs -f -l app.kubernetes.io/name=kueue -n kueue-system
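
One thing to keep in mind with the command above: when a label selector is used, `kubectl logs` prints only the last 10 lines per pod unless `--tail` is set, so `--tail=-1` may be needed to see the original admission attempt. A minimal sketch for narrowing the output, where `lines_for` is a hypothetical helper:

```shell
# lines_for NAME: keep only the log lines on stdin that contain NAME.
# -F treats NAME as a fixed string, so dots and dashes are not regex syntax.
lines_for() {
  grep -F -- "$1"
}
```

Usage: `kubectl logs -l app.kubernetes.io/name=kueue -n kueue-system --tail=-1 | lines_for test-job-one`. Because the Workload name `job-test-job-one-5e3de` contains the Job name, this keeps lines about both objects.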

kannon92 (Contributor) commented

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label Jan 15, 2025