One of the things we noticed in our dev cluster at work during the initial stages of our OpenShift deployment is that while we were deliberately mucking with things, Pods would end up getting scheduled together on the same node. This didn’t seem prudent to us since we were purposely killing nodes, and such affinity could potentially have a negative impact on our clients. This is what led me down the path of anti-affinity. I have to admit that I find the entire premise and behavior of the Kubernetes scheduler fascinating.
The goal of this post is to describe how to spread your workload across your nodes for greater fault tolerance. We’ll discuss both hard and soft affinity. Unfortunately, I only have two worker nodes in my lab environment, so it’ll be a little more challenging to demonstrate, but I think you’ll get the gist.
Let’s say I just run a simple imperative command to create a quick Deployment with four NGINX replicas:
$ kubectl run no-affinity --replicas=4 --image=nginx --labels=app=no-affinity
deployment.apps/no-affinity created
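A quick aside: kubectl run only behaves this way on older clients. As of kubectl 1.18 it creates a bare Pod instead, and the --replicas flag is gone. On a newer client, a roughly equivalent command would be the one below (kubectl create deployment applies the app=no-affinity label for us; if memory serves, its --replicas flag needs kubectl 1.19 or newer):

$ kubectl create deployment no-affinity --image=nginx --replicas=4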
I can see that the scheduler actually did a great job of evenly distributing the workload.
$ kubectl get po -o wide -l app=no-affinity
NAME                           READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
no-affinity-6dc48758bf-lwdd5   1/1     Running   0          23s   10.244.2.25    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-nbgzl   1/1     Running   0          23s   10.244.1.111   runlevl42c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-sx6mr   1/1     Running   0          23s   10.244.2.24    runlevl43c.mylabserver.com   <none>           <none>
no-affinity-6dc48758bf-xmq2r   1/1     Running   0          23s   10.244.1.112   runlevl42c.mylabserver.com   <none>           <none>
Let’s presume that I don’t want any identical Pods running together. This is where hard affinity comes into play. Let’s look at a Deployment descriptor that defines hard affinity. Since our goal is to separate our Pods rather than keep them together, we’re going to use anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  template:
    metadata:
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - affinity
            topologyKey: kubernetes.io/hostname
      containers:
      - image: nginx
        name: affinity
In this example, we’re telling the scheduler that we require that any Pods whose labels match app=affinity cannot be scheduled together. Let’s create the Deployment and see what happens.
$ kubectl apply -f req.yaml
$ kubectl get po -o wide -l app=affinity
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
affinity-dc6c5999-7n6bk   0/1     Pending   0          39s   <none>        <none>                       <none>           <none>
affinity-dc6c5999-brtsx   0/1     Pending   0          39s   <none>        <none>                       <none>           <none>
affinity-dc6c5999-gn4k8   1/1     Running   0          39s   10.244.2.26   runlevl43c.mylabserver.com   <none>           <none>
affinity-dc6c5999-mhjzt   1/1     Running   0          39s   10.244.1.113  runlevl42c.mylabserver.com   <none>           <none>
Notice that only two of our four declared Pods are now running. The first two listed are stuck in a Pending state because the scheduler is adhering to our rule that the Pods cannot be scheduled together. This probably isn’t what we really want; our aim is to have the scheduler do its best, not leave Pods unscheduled. Let’s change the descriptor to tell the scheduler that we prefer this behavior, but that it isn’t required.
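If you want to see the scheduler’s reasoning for yourself, describing one of the Pending Pods should show a FailedScheduling event that calls out the anti-affinity rule (the exact wording of the event varies by Kubernetes version):

$ kubectl describe po affinity-dc6c5999-7n6bk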
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: affinity
  name: affinity
spec:
  replicas: 4
  selector:
    matchLabels:
      app: affinity
  template:
    metadata:
      labels:
        app: affinity
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - affinity
      containers:
      - image: nginx
        name: affinity
In this example, we’re now saying that the anti-affinity pattern is preferred, not required. This is known as soft affinity. Note that the weight field[1] is required with preferred scheduling. If we delete the existing Deployment and re-create it with this new descriptor, we should see behavior similar to our initial example.
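Assuming we saved the soft-affinity manifest above as pref.yaml (the file name is just my choice for this example), that re-creation looks like:

$ kubectl delete deployment affinity
$ kubectl apply -f pref.yaml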
$ kubectl get po -o wide -l app=affinity
NAME                       READY   STATUS    RESTARTS   AGE   IP             NODE                         NOMINATED NODE   READINESS GATES
affinity-944d8c9f9-498hz   1/1     Running   0          10s   10.244.1.114   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-jljmq   1/1     Running   0          10s   10.244.1.115   runlevl42c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-m582c   1/1     Running   0          10s   10.244.2.27    runlevl43c.mylabserver.com   <none>           <none>
affinity-944d8c9f9-plmbf   1/1     Running   0          10s   10.244.2.28    runlevl43c.mylabserver.com   <none>           <none>
And this is what we see. Again, if you have more worker nodes available, it will be easier to see how the scheduler spreads the Pods across them. Whether or not this would be a problem for us in production is TBD. However, I plan on accounting for it in our deployments regardless, to err on the side of safety and resiliency.
[1] The weight field in preferredDuringSchedulingIgnoredDuringExecution is in the range 1-100. For each node that meets all of the scheduling requirements (resource request, RequiredDuringScheduling affinity expressions, etc.), the scheduler will compute a sum by iterating through the elements of this field and adding “weight” to the sum if the node matches the corresponding MatchExpressions. This score is then combined with the scores of other priority functions for the node. The node(s) with the highest total score are the most preferred. Source
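As a rough illustration of how those weights interact, here’s a hypothetical snippet with two preferred anti-affinity terms (the tier=cache term is invented for this example, not something from the Deployments above). It goes under the Pod template’s spec, as in the manifests earlier. A node already running a Pod that matches the first term takes a much bigger scoring penalty than one matching only the second, so the scheduler avoids co-locating with our own replicas far more strongly than with the cache Pods:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    # Strongly prefer not to share a node with another replica of this app.
    - weight: 100
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - affinity
    # Mildly prefer not to share a node with the (hypothetical) cache Pods.
    - weight: 10
      podAffinityTerm:
        topologyKey: kubernetes.io/hostname
        labelSelector:
          matchExpressions:
          - key: tier
            operator: In
            values:
            - cache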