Troubleshooting in Kubernetes

Technology advances every hour, and at the same time troubleshooting becomes really difficult. When we run microservices in Kubernetes it becomes even harder to discover the root cause of why things are failing. Today we are talking about how we can easily troubleshoot issues in Kubernetes.

Assume we have deployed an application in Kubernetes and, during a specific time span, we are getting a 502 error code, or some requests fail when we do deployments or restart Pods.

To start troubleshooting we have to identify a point from where we should begin debugging. So first we should know how our application is deployed and what hops are involved in the request flow.

Let’s take a simple example of a frontend application deployed in Kubernetes with the below architecture:

[Diagram: a Deployment with a replica count of two behind a LoadBalancer Service]

In the above example, we have deployed an application with a Service of type LoadBalancer and a replica count of two. This means that at any point in time two application containers will be running.
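The setup above can be sketched as a minimal manifest. This is only an illustrative sketch: the `frontend` name, the `example/frontend:1.0` image, and port 8080 are placeholders, not values from the article.

```shell
# Minimal sketch of the example setup: a Deployment with two replicas
# behind a LoadBalancer Service. All names, the image, and the ports
# are placeholders -- substitute your own.
cat > frontend.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: example/frontend:1.0   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer
  selector:
    app: frontend
  ports:
    - port: 80
      targetPort: 8080
EOF
# Apply with: kubectl apply -f frontend.yaml
```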

To identify the issue we need to check the below things in Kubernetes:

  • Events
  • Restart time of the Pod
  • Traffic during the Pod restart
  • Reason for the Pod restart
  • Termination grace period
  • Readiness and liveness probes

First, view the event logs of the Pod. Events usually contain all the required data, but if you no longer have them (events expire after a while), try to debug using the other methods below.
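Events can be listed cluster-wide or filtered down to a single object. The pod name below is a made-up placeholder:

```shell
# Cluster-wide events, sorted so the most recent appear last:
kubectl get events --sort-by=.metadata.creationTimestamp

# Events for a single pod only; this name is a placeholder --
# substitute the name of your failing pod.
POD=frontend-6d4f9bb7c9-abcde
kubectl get events --field-selector involvedObject.name="$POD"
```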

Next, check the restart time of the Pod. For that, we have to describe the Pod.
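A quick sketch of how to find the restart time, again with a placeholder pod name. The `jsonpath` query reads the first container's previous terminated state, which only exists if the container has restarted:

```shell
POD=frontend-6d4f9bb7c9-abcde   # placeholder: use your pod's name
# The RESTARTS column and the pod's age:
kubectl get pod "$POD"
# When the previous container instance started and finished:
kubectl get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}'
# Full details, including the "Last State" section and recent events:
kubectl describe pod "$POD"
```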

One reason could be the amount of traffic: our Pod may not be able to handle that load. So check the traffic volume at the time you were getting the alerts.
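The best source for traffic numbers is your load balancer or ingress metrics, but for a quick in-cluster view of resource pressure you can check per-pod usage, assuming the metrics-server addon is installed. The `app=frontend` label is an assumption carried over from the example above:

```shell
LABEL='app=frontend'   # assumption: the label used in the example Deployment
# Current CPU/memory per pod; requires the metrics-server addon.
kubectl top pod -l "$LABEL"
```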

Now check the reason why the Pod restarted by looking at the exit code of its container, which Kubernetes records in the Pod's status.
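One way to read the exit code, using a placeholder pod name:

```shell
POD=frontend-6d4f9bb7c9-abcde   # placeholder: use your pod's name
# Exit code of the previously terminated container:
kubectl get pod "$POD" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
# The same information appears under "Last State" in:
kubectl describe pod "$POD"
```

Signal-related exit codes follow the 128 + signal-number convention, e.g. 128 + 9 (SIGKILL) = 137.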

Common exit codes and their meanings:

  • 0 – the container exited successfully
  • 1 – generic application error
  • 137 – killed by SIGKILL (128 + 9), often the OOM killer or a kill after the grace period expired
  • 139 – segmentation fault (SIGSEGV, 128 + 11)
  • 143 – terminated by SIGTERM (128 + 15), e.g. a normal Pod shutdown

Termination Grace Period: the most important factor when we are getting alerts. By default the termination grace period is 30 seconds. During this time Kubernetes does not send any new traffic to the Pod and the Pod gets the chance to finish serving the requests already running in it; after the period expires, the container is killed.

So check how many requests your container can drain in that time, and update the termination grace period value based on that.
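A hypothetical example of raising the grace period, assuming a Deployment named `frontend`; the value 60 is illustrative, not a recommendation:

```shell
# Raise terminationGracePeriodSeconds from the default 30 to 60 on a
# Deployment named "frontend" (both the name and the value are
# assumptions for this example).
PATCH='{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":60}}}}'
kubectl patch deployment frontend -p "$PATCH"
```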

Readiness and Liveness Probes: these help us monitor the application running inside the Pod. The readiness probe tells Kubernetes that the application is up and traffic can be sent to it, while the liveness probe checks the application's health; if it finds that the application is not up and running, Kubernetes restarts the container.
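A sketch of what the two probes look like in a container spec. The `/healthz` path, port 8080, and the timing values are assumptions; use your application's real health endpoints and tune the delays to your startup time:

```shell
# Probe fragment that would go under a container in the Pod spec.
# Path, port, and timings are placeholder assumptions.
cat > probes.yaml <<'EOF'
readinessProbe:            # gate traffic until the app reports ready
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:             # restart the container if this keeps failing
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
EOF
```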

TL;DR

Events, liveness and readiness probes, the termination grace period, and Pod exit codes will help you troubleshoot issues quickly.

DevOps Engineer with 10+ years of experience in the IT Industry. In-depth experience in building highly complex, scalable, secure and distributed systems.
