- https://blog.pipetail.io/posts/2020-05-04-most-common-mistakes-k8s/
- https://banzaicloud.com/blog/k8s-vertical-pod-autoscaler/
- https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
- https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler
- https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
- https://blog.marekbartik.com/posts/2018-06-29_kubernetes-in-production-poddisruptionbudget/
- https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
Heeeeello, everyone!
Intro
Some time ago, I was working with Kubernetes and MicroService deployments. That was not a bare-metal deployment of k8s, but AWS EKS. It was configured via Terraform and, maybe, someday I’ll write an article about that…
At that time, I read a lot of recommendations and collected the most important of them into one article. Now, let's look at them!
This article does not contain recommendations like "every service should have its own Service Account configured and RBAC should be explicitly defined for each account" or "network policies must be set for each application deployment". Those are a must-have.
Resource Requests/Limits
Root cause
Those values must be set!
The reason is quite simple: if you don't set limits, the container is allowed to consume all resources available on the worker node, which affects every other container running on the same node. We faced that during mass service starts, whether caused by a node dying, by the cluster-autoscaler scaling down unused nodes, or simply by a parallel deployment of newer service versions.
We had Java Spring Boot based services and, as all of you know, they’re quite greedy for the resources, especially while starting up.
After some testing, we came up with these values:
## ...skipped...
resources:
  limits:
    cpu: 1000m
    memory: 500Mi
  requests:
    cpu: 200m
    memory: 256Mi
CPU Requests/Limits
To determine a proper value, you need to know how many cores your node has! Note that 1000m means one whole core!
In this example, we requested a guaranteed 256Mi of RAM and 200m of CPU, and set the (burstable) limits to a maximum of 500Mi of RAM and 1 CPU. The m for CPU stands for milli, so 1000m is 1 CPU; based on this definition, you can write cpu: 1 and it will be equal to cpu: 1000m.
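If you are not sure how many cores your nodes actually have, one quick way to check is to inspect a node directly (the node name below is just a placeholder):
kubectl get nodes
kubectl describe node <node-name>
The Capacity and Allocatable sections of the describe output show how much CPU and memory the node offers for scheduling.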
Tricky Memory Limit
But I want to warn you about the memory limit: in most cases, at some point your container can be killed with OOM. We can only guarantee 256Mi of RAM (the request in our example), and that does not mean the node will actually have the additional memory available up to the 500Mi limit when the container needs it. In that case, your container will be killed with the OOM reason.
In the case of CPU, the node will just be overcommitted, which could affect other containers running on the same node (they get throttled), but nothing is killed.
Choosing the right node for CPU-heavy apps
For high-CPU-consuming apps, it's highly recommended to run them on separate nodes designated for high CPU usage, like AWS EC2 c-type instances. That requires an additional nodeSelector configuration in the Deployment resource, as shown in the sketch below.
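As a rough sketch, such a nodeSelector could look like the snippet below; the node-type label and its high-cpu value are made up for illustration and depend on how your node groups are labeled:
## ...skipped...
spec:
  template:
    spec:
      nodeSelector:
        node-type: high-cpu
## ...skipped...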
You can always get the current CPU and memory usage for each pod, for the containers inside pods, and for the worker nodes overall:
kubectl top pods
kubectl top pods --containers
kubectl top nodes
These commands require the Metrics Server to be deployed in the cluster.
Metrics Server
Metrics Server is a cluster-wide aggregator of resource usage data. By default, it is deployed as a Deployment object in clusters created by the kube-up.sh script. If you use a different Kubernetes setup mechanism, you can deploy it using the provided components.yaml deployment file.
Metrics Server collects metrics from the Summary API, exposed by Kubelet on each node, and is registered with the main API server via Kubernetes aggregator.
Learn more about the metrics server in the design doc.
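If your cluster doesn't have it yet, the Metrics Server project ships a components.yaml manifest that can usually be applied directly (verify the URL and version against the project's releases page first):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
After a minute or so, the kubectl top commands above should start returning data.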
Recommended configuration
It's hard to say what exact numbers should be set for your deployment; that heavily depends on the framework your application uses.
The only recommendation is to run load tests and measure the values…
For SpringBoot based applications, we used these:
## ...skipped...
resources:
  limits:
    cpu: 1000m
    memory: 500Mi
  requests:
    cpu: 200m
    memory: 384Mi
That was enough to run the application without high load, and those CPU limits helped the services start flawlessly.
Conclusion
Resources must be set. There is an option to make them dynamic – Vertical Pod Autoscaler.
Vertical Pod Autoscaler
Worth to read
- Vertical Pod Autoscaler article by Banzaicloud: https://banzaicloud.com/blog/k8s-vertical-pod-autoscaler/
- Google Cloud article about Vertical Pod Autoscaler: https://cloud.google.com/kubernetes-engine/docs/concepts/verticalpodautoscaler#limitations_for_vertical_pod_autoscaling
- Visualize VPA Recommendations: https://github.com/FairwindsOps/goldilocks/
What is Vertical Pod Autoscaler?
The VerticalPodAutoscaler resource can help you automate away the manual process of setting Resource Requests/Limits by looking at CPU/memory usage over time and setting or recommending new requests and limits based on it.
Vertical Pod Autoscaler is a set of services, so it needs to be deployed first. More details on the installation process can be found on this page.
VPA usage warning
The official documentation warns that VPA is not yet ready for use with JVM-based workloads due to limited visibility into the actual memory usage of the workload. More details on the limitations can be found on this page.
Vertical Pod Autoscaler consists of several components. The most interesting one here is the Recommender, which watches resource consumption and provides recommendations for Resource Requests and Limits.
Let's enable VPA for our service:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
  namespace: my-namespace
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"
After some time, you can inspect the created VPA resource, which contains the recommendations, by invoking the following command:
kubectl describe vpa my-service-vpa
and its output:
API Version:  autoscaling.k8s.io/v1
Kind:         VerticalPodAutoscaler
Name:         my-service-vpa
Namespace:    my-namespace
Labels:       <none>
Annotations:  <none>
Metadata:     <none>
Spec:
  Target Ref:
    API Version:  apps/v1
    Kind:         Deployment
    Name:         my-service
  Update Policy:
    Update Mode:  Off
Status:
  Conditions:
    Last Transition Time:  2020-05-23T14:28:35Z
    Status:                True
    Type:                  RecommendationProvided
  Recommendation:
    Container Recommendations:
      Container Name:  my-service
      Lower Bound:
        Cpu:     25m
        Memory:  350837913
      Target:
        Cpu:     143m
        Memory:  410771395
      Uncapped Target:
        Cpu:     143m
        Memory:  410771395
      Upper Bound:
        Cpu:     246m
        Memory:  670336299
At the end of this output, we can see the Container Recommendations, which can be used to set Requests/Limits manually; with an Update mode enabled, they will be applied automatically.
How does it work?
First of all, the VPA services should be deployed into the cluster. All instructions can be found here. It also requires the Metrics Server to be deployed. VPA can work in three different modes:
- Off: VPA just recommends its best estimate for resources in the status section of the VerticalPodAutoscaler CRD.
- Initial: VPA changes the container resource section to follow its estimate when new pods are created.
- Auto: VPA adjusts the container resource spec of running pods, destroying them and recreating them (this process will honor the Pod Disruption Budget, if set).
Update mode: Auto
Update mode: Auto will trigger the pod to be restarted on every Resource Request/Limit change! So configuring a Pod Disruption Budget for the service is highly recommended in order to avoid a service outage.
So, what can we get from the Container Recommendations for a Guaranteed and Burstable configuration?
Name | Description |
---|---|
Lower bound | Minimal amount of resources that should be set for the container. |
Upper bound | Maximum amount of resources that should be set (above which you are likely wasting resources). |
Target | VPA's recommendation based on the algorithm described here, considering any additional constraints specified in the VPA CRD. The constraints are not shown in the VPA example above; they allow you to set minimum and maximum caps on VPA's recommendation (see here for more details, and the sketch after this table). |
Uncapped target | VPA's recommendation without considering any additional constraints. |
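Those caps on the recommendation are configured via a resourcePolicy section in the VPA spec. A minimal sketch, with purely illustrative numbers, could look like this:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-service-vpa
  namespace: my-namespace
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-service
  updatePolicy:
    updateMode: "Off"
  resourcePolicy:
    containerPolicies:
    - containerName: my-service
      minAllowed:
        cpu: 100m
        memory: 256Mi
      maxAllowed:
        cpu: 1000m
        memory: 1Gi
With these caps in place, the Target recommendation will never go below minAllowed or above maxAllowed, while Uncapped Target keeps showing the raw estimate.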
Recommended configuration
As of now, I don't have a common configuration for this resource like for the others, so I'd prefer it to be deployed in recommendation mode (updateMode: "Off") to get some statistics and recommendations only.
Conclusion
Deploying VPA is recommended, but with automated resource changes enabled, it could trigger the scaling of worker nodes. So the current recommendation is to deploy it, but only to get recommendations for setting Resource Requests and Limits.
Horizontal Pod Autoscaler
Resource types
Requires some kind of Metrics API server deployed into the cluster, depending on which metrics we want to use to scale our application. See the Liveness/Readiness probes details below.
Requires Resource Requests/Limits defined
Resource Requests and Limits must be defined in the Deployment's pod template to use this.
Root cause
The main reason is that we want to automate the scaling of our service when we start receiving more traffic and our pods are getting too hot.
Recommended configuration
We will use simple scaling based on CPU and memory.
Since we can use percentage values from the Metrics API, I'll provide only a common configuration here that can be applied to all services as a starting point. You can get more precise values after some load testing.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 1
  maxReplicas: 7
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 150
API Version
We use API version autoscaling/v2beta2 here. This is important since the previous API versions have a different definition!
Resource values
Those values are percentages based on the Resource Request (Guaranteed) value! That is why we set averageUtilization to 150 for the memory resource. For example, with a memory request of 256Mi, an averageUtilization of 150 means the HPA targets an average usage of about 384Mi per pod before scaling out.
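Once the HPA is deployed, you can watch how the current utilization compares to the targets:
kubectl get hpa myapp-hpa
The TARGETS column shows the current vs. desired utilization for each metric, and REPLICAS shows what the autoscaler has currently scaled the Deployment to.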
Liveness and readiness probes
Root cause
How do we know when a service is up and can serve traffic? People usually do not know the difference between Liveness and Readiness probes.
First of all, probes help Kubernetes to understand two things:
- Pod is up in general (Liveness probe)
- Pod is ready to receive traffic (Readiness probe)
Moreover, probes work in tight integration with the Kubernetes Service resource. The Service always checks whether a Deployment's pods have a successful Readiness check; if not, the pod is excluded from receiving traffic until the Readiness check succeeds again. For example, one of the Deployment's pods starts handling too much traffic or an expensive computation; in that case, we might want to let this pod cool down and finish its current tasks, and only after that should it continue serving incoming requests. That's the purpose of the Readiness check. If you defined the same endpoint for Liveness, Kubernetes could decide that the pod is dead and restart it, losing all related in-flight data. Why would you restart a pod that is healthy and just doing a lot of work? This configuration should also work in tight integration with the Horizontal Pod Autoscaler.
You can find more details on this page.
Recommended configuration
Liveness probe
There are two ways to perform the Liveness check.
Liveness via a tcpSocket check of the application port:
livenessProbe:
  initialDelaySeconds: 60
  periodSeconds: 20
  tcpSocket:
    port: 8080
and via httpGet:
livenessProbe:
  initialDelaySeconds: 60
  periodSeconds: 20
  httpGet:
    path: /health/liveness
    port: 8080
Spring Boot based application
Spring Boot has special endpoints for Liveness and Readiness probes. You can find more details on this page.
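As a sketch, assuming Spring Boot 2.3+ with Actuator on the classpath, the probe health groups can be enabled explicitly in application.yaml (outside a detected Kubernetes environment they are not exposed by default):
# application.yaml -- a sketch, assuming Spring Boot 2.3+ with Actuator
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
Note that the default paths live under /actuator/health/...; the /health/... paths used in the probe examples here assume the Actuator base path was adjusted accordingly.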
Readiness probe
The Readiness probe will be the same regardless of the chosen Liveness check:
readinessProbe:
  httpGet:
    path: /health/readiness
    port: 8080
  failureThreshold: 10
  initialDelaySeconds: 30
  periodSeconds: 10
  successThreshold: 2
  timeoutSeconds: 10
Thresholds, delays, etc
To find proper values, measure your application's startup timing, how long initialization takes, etc. Those values can differ between frameworks, as well as between the classes, providers, external resources, or services your application uses.
Conclusion
Never use the same endpoint for the Liveness and Readiness probes. If the liveness probe is equal to the readiness probe, you are in big trouble. Probes must be set up with proper endpoints!
Do not fail either of the probes when one of your shared dependencies is down; that would cause a cascading failure of all the pods. You would be shooting yourself in the foot.
Pod Disruption Budget
This will only help with voluntary disruptions and will not work for involuntary disruptions!
Root cause
From time to time, your nodes and cluster have to be upgraded, decommissioned, or scaled up or down. That can make your service unresponsive: for example, all pods of the service/deployment were running on the same node, which got "marked" for deletion by the cluster-autoscaler. In this case, all pods of the Deployment will be down until they are recreated on another live worker node. But we can tell the cluster that we want at least one pod to stay up, and the "marked" node will not be deleted until at least one pod is ready to serve traffic on another live node; in other words, the pods are migrated by recreation without an outage.
Another example. A quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.
There are two main approaches (only one of them can be set in a single PodDisruptionBudget):
Option 1: minAvailable
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
Option 2: maxUnavailable
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: myapp
You can find more information on Pod Disruptions on this page.
Recommended configuration
This is a common configuration for any service. The only thing that should be changed is the number.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: myapp-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: myapp
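After applying it, you can check how many voluntary disruptions the budget currently allows:
kubectl get pdb myapp-pdb
The ALLOWED DISRUPTIONS column shows how many pods may currently be evicted voluntarily; with a single replica and minAvailable: 1 it stays at 0, which is exactly the situation described in the note below.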
Important note about `minAvailable`
The Deployment must have its replica count set to at least 2! Otherwise, such a PDB configuration will block automated pod eviction, and that will have to be resolved manually.
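In other words, the Deployment paired with this PDB should look roughly like this (only the relevant field shown):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 2   # at least 2, so minAvailable: 1 still leaves one pod free to be evicted
## ...skipped...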
Conclusion
Pod Disruption Budget must be defined!
Pod Anti Affinity
Root cause
You are running, for example, 3 pod replicas of your application's Deployment; a node goes down and all the replicas with it. Huh? All the replicas were running on one node? Wasn't Kubernetes supposed to be magical and provide HA?! Nope, you can't expect the Kubernetes scheduler to enforce anti-affinities for your pods. You have to define them explicitly.
Recommended configuration
## ...skipped...
  labels:
    app: myapp
## ...skipped...
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - myapp
        topologyKey: "kubernetes.io/hostname"
So, here we strictly defined that only one pod of this application can run per node, using a "hard" rule, prefixed with required*. To avoid triggering the autoscaler, we can use a "soft" rule instead, prefixed with preferred*:
## ...skipped...
  labels:
    app: myapp
## ...skipped...
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myapp
          topologyKey: "kubernetes.io/hostname"
This is a simple check of whether the node already has a pod with the label app set to myapp. That's it. No magic here, sorry :)
Conclusion
This is highly recommended to be defined but, bear in mind, if you use required* rules, HPA activity could cause additional nodes to be scaled up to at least the number of replicas defined by the HPA :)
So it's recommended to use the "soft" rule with the preferred* prefix.
Outro
I believe there should be more, but at this time, these look like a must-have!
Happy deployments!