One of the things we noticed in our dev cluster at work during the initial stages of our OpenShift deployment is that while we were deliberately mucking with things, Pods would end up getting scheduled together on the same node. This didn’t seem prudent to us since we were purposely killing nodes, and such affinity could potentially have a negative impact on our clients. This is what led me down the path of anti-affinity. I have to admit that I find the entire premise and behavior of the Kubernetes scheduler fascinating.
The goal of this post is to describe how to spread your workload across your nodes for greater fault tolerance. We’ll discuss both hard and soft affinity. Unfortunately, I only have two worker nodes in my lab environment, so it’ll be a little more challenging to demonstrate, but I think you’ll get the gist.
Let’s say I just run a simple imperative command to create a quick Deployment with four NGINX replicas:
$ kubectl run no-affinity --replicas=4 --image=nginx --labels=app=no-affinity
deployment.apps/no-affinity created
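A quick aside: kubectl run only behaves this way on older clients. As of kubectl 1.18 it creates a bare Pod instead, and the --replicas flag is gone. On a newer client, a roughly equivalent command would be the one below (kubectl create deployment applies the app=no-affinity label for us; if memory serves, its --replicas flag needs kubectl 1.19 or newer):

$ kubectl create deployment no-affinity --image=nginx --replicas=4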
I can see that the scheduler actually did a great job of evenly distributing the workload.
$ kubectl get po -o wide -l app=no-affinity
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
no-affinity-6dc48758bf-lwdd5   1/1     Running   0          23s   10.244.2.25    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-nbgzl   1/1     Running   0          23s   10.244.1.111   runlevl42c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-sx6mr   1/1     Running   0          23s   10.244.2.24    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-xmq2r   1/1     Running   0          23s   10.244.1.112   runlevl42c.mylabserver.com   <none>           <none>
Let’s presume that I don’t want any identical Pods running together. This is where hard affinity comes into play. Let’s look at a Deployment descriptor that defines hard affinity. Since our goal is to separate our Pods rather than keep them together, we’re going to use anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  template:
    metadata:
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - affinity
            topologyKey: kubernetes.io/hostname
      containers:
      - image: nginx
        name: affinity
In this example, we’re telling the scheduler that we require that any Pods whose labels match app=affinity cannot be scheduled together. Let’s create the Deployment and see what happens.
$ kubectl apply -f req.yaml
$ kubectl get po -o wide -l app=affinity
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
affinity-dc6c5999-7n6bk   0/1     Pending   0          39s   <none>        <none>                       <none>           <none>
affinity-dc6c5999-brtsx   0/1     Pending   0          39s   <none>        <none>                       <none>           <none>
affinity-dc6c5999-gn4k8   1/1     Running   0          39s   10.244.2.26   runlevl43c.mylabserver.com   <none>           <none>
affinity-dc6c5999-mhjzt   1/1     Running   0          39s   10.244.1.113  runlevl42c.mylabserver.com   <none>           <none>
Notice that only two of our four declared Pods are now running. The first two listed are stuck in a Pending state because the scheduler is adhering to our rule that the Pods cannot be scheduled together. This probably isn’t what we really want; our aim is to have the scheduler do its best, not leave Pods unscheduled. Let’s change the descriptor to tell the scheduler that we prefer this behavior, but that it isn’t required.
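If you want to see the scheduler’s reasoning for yourself, describing one of the Pending Pods should show a FailedScheduling event that calls out the anti-affinity rule (the exact wording of the event varies by Kubernetes version):

$ kubectl describe po affinity-dc6c5999-7n6bk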
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  template:
    metadata:
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - affinity
      containers:
      - image: nginx
        name: affinity
In this example, we’re now saying that the anti-affinity pattern is preferred, not required. This is known as soft affinity. Note that the weight field[1] is required with preferred scheduling. If we delete the existing Deployment and re-create it with this new descriptor, we should see behavior similar to our initial example.
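Assuming we saved the soft-affinity manifest above as pref.yaml (the file name is just my choice for this example), that re-creation looks like:

$ kubectl delete deployment affinity
$ kubectl apply -f pref.yaml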
$ kubectl get po -o wide -l app=affinity
NAME                       READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
affinity-944d8c9f9-498hz   1/1     Running   0          10s   10.244.1.114   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-jljmq   1/1     Running   0          10s   10.244.1.115   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-m582c   1/1     Running   0          10s   10.244.2.27    runlevl43c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-plmbf   1/1     Running   0          10s   10.244.2.28    runlevl43c.mylabserver.com   <none>           <none>
And this is what we see. Again, if you have more worker nodes available, it will be easier to see how the scheduler spreads the Pods across them. Whether or not this would be a problem for us in production is TBD. However, I plan on accounting for it in our deployments regardless, to err on the side of safety and resiliency.
[1] The weight field in preferredDuringSchedulingIgnoredDuringExecution is in the range 1-100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding “weight” to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred. Source
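As a rough illustration of how those weights interact, here’s a hypothetical snippet with two preferred anti-affinity terms (the tier=cache term is invented for this example, not something from the Deployments above). It goes under the Pod template’s spec, as in the manifests earlier. A node already running a Pod that matches the first term takes a much bigger scoring penalty than one matching only the second, so the scheduler avoids co-locating with our own replicas far more strongly than with the cache Pods:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    # Strongly prefer not to share a node with another replica of this app.
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - affinity
    # Mildly prefer not to share a node with the (hypothetical) cache Pods.
    - weight: 10
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: tier
            operator: In
            values:
            - cache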