Troubleshooting in Kubernetes

Ankur
3 min read · Apr 15, 2021

As technology evolves, troubleshooting becomes harder at the same pace. When we run microservices in Kubernetes, it becomes even more complex to find the root cause of why things are failing. Today we are discussing how we can troubleshoot issues in Kubernetes more easily.

Assume we have deployed an application in Kubernetes and, during a specific time window, we are getting 502 error codes, or some requests fail when we do deployments or restart Pods.

To start troubleshooting we have to identify a point from where we should begin debugging. So first we should know how our application is deployed and which hops are involved in the request flow.

Let’s take a simple example of a frontend application deployed in Kubernetes with the architecture below.

Pod with a replica count of two

In the above example, we have deployed an application behind a Service of type LoadBalancer, with a replica count of two. This means that at any point in time two application Pods are running.
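A minimal manifest for this kind of setup could look roughly like the sketch below. The names, image, and ports are placeholders for illustration, not taken from the actual application.

# Deployment with two replicas of a hypothetical frontend container
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example/frontend:latest   # placeholder image
          ports:
            - containerPort: 8080
---
# LoadBalancer Service that spreads traffic across the two Pods
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  selector:
    app: frontend
  ports:
    - port: 80
      targetPort: 8080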

To identify the issue we need to check the following things in Kubernetes:

  • Events
  • Restart time of the Pod
  • Traffic during the Pod restart
  • Reason for the Pod restart
  • Termination grace period
  • Readiness and liveness probes

First, view the events of the Pod. Events usually have all the data we need, but they are only kept for a short time; if they have already expired, try to debug using the other methods below.

kubectl get events -n namespace --field-selector involvedObject.name=pod_name

Next, check the restart time of the Pod. For that we have to describe the Pod:

kubectl describe pod -n namespace pod_name | grep 'Start Time'
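The Start Time field only tells you when the Pod itself started. For a quick count of restarts, the standard get output already shows a RESTARTS column, and the time and reason of the last container restart appear under the container's Last State block, which you can narrow down with grep (same namespace and pod_name placeholders as above):

kubectl get pod -n namespace pod_name

kubectl describe pod -n namespace pod_name | grep -A 7 'Last State'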

One possible reason is the amount of traffic: the Pod may not be able to handle that load. So check how much traffic was coming in at the time you received the alerts.
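One quick way to correlate that load with resource pressure on the Pods is kubectl top. This assumes the metrics-server add-on is installed in the cluster; without it the command returns an error.

# CPU and memory usage of the Pod, broken down per container
kubectl top pod pod_name -n namespace --containers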

Now check why the Pod restarted by looking at the exit code of its containers. You can find the exit code in the describe output, like below:

kubectl describe pod -n namespace pod_name

Below are the meanings of the most common exit codes:

Exit Code 0: The main process exited normally, often because there was no long-running foreground process attached
Exit Code 1: Indicates failure due to an application error
Exit Code 137: Indicates the container received SIGKILL (frequently an out-of-memory kill)
Exit Code 139: Indicates the container received SIGSEGV (segmentation fault)
Exit Code 143: Indicates the container received SIGTERM (graceful termination request)
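If you prefer to pull just the exit code and reason instead of reading the full describe output, a jsonpath query like the sketch below should work (same placeholders as before; the lastState field is only populated once a container has restarted at least once):

# Exit code of the last terminated container(s) in the Pod
kubectl get pod -n namespace pod_name -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'

# Termination reason, e.g. OOMKilled, Error, Completed
kubectl get pod -n namespace pod_name -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'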

Termination grace period: the most important factor when we get alerts during deployments or Pod restarts. By default the termination grace period is 30 seconds. When a Pod is terminating, Kubernetes removes it from the Service endpoints so no new traffic is sent to it, sends SIGTERM to its containers, and gives them this grace period to finish the requests they are already serving before they are killed.

So check how long your container needs to drain the requests it is already serving, and tune the termination grace period value accordingly, as in the sketch below.
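A minimal sketch of how this could look in the Pod spec, reusing the hypothetical frontend container from the earlier example; the 60-second grace period and 10-second sleep are illustrative values, not recommendations.

spec:
  # Give in-flight requests up to 60s to finish (default is 30s)
  terminationGracePeriodSeconds: 60
  containers:
    - name: frontend
      image: example/frontend:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Small delay so endpoint removal propagates before SIGTERM
            # (requires a sleep binary inside the image)
            command: ["sleep", "10"]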

Readiness and liveness probes: help us monitor the application running inside the Pod. The readiness probe tells Kubernetes that the application is ready, so traffic can be sent to it (and traffic is stopped while the probe is failing), while the liveness probe checks that the application is still healthy; if it keeps failing, Kubernetes restarts the container.
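A minimal sketch of both probes, assuming the application exposes an HTTP health endpoint; the /healthz path, port, and timings are placeholders to adapt to your application.

containers:
  - name: frontend
    image: example/frontend:latest   # placeholder image
    ports:
      - containerPort: 8080
    readinessProbe:
      # Pod receives Service traffic only while this succeeds
      httpGet:
        path: /healthz        # hypothetical health endpoint
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:
      # Container is restarted if this keeps failing
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
      failureThreshold: 3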

TL;DR

Events, liveness and readiness probes, the termination grace period, and Pod exit codes will help you troubleshoot issues quickly.


Ankur

DevOps Engineer with 10+ years of experience in the IT Industry. In-depth experience in building highly complex, scalable, secure and distributed systems.