Site reliability engineering with Kubernetes

In this week’s Writer’s Room, Andela Community member and Site Reliability Engineer Frank Adu discusses Site Reliability Engineering using Kubernetes, an open-source system for automating deployment, scaling, and management of containerized applications.

How big is the CLOUD?

There’s no doubt that microservices and the Cloud have changed in recent years, moving towards cloud-native and cloud-agnostic designs with Kubernetes running application workloads. As a result, several big questions have arisen:

  1. How reliably will services continue serving client calls, and how can monitoring help avoid service degradation? And how do I know if a service is failing before the client does?
  2. To avoid a single point of failure, which Disaster Recovery strategies should I adopt?
  3. How do I design systems for high availability and automate my Continuous Delivery pipelines?
  4. How do I trace HTTP errors to reduce the time spent debugging and reporting to members of the Engineering team?
  5. How do I monitor production workloads and alert on errors?
  6. How does a business measure the success of running applications with dashboard visualization?
  7. And… how do I set SLOs to meet SLAs?

Site Reliability Engineering best practices answer the above questions and more…

What does an SRE do?

One of the main day-to-day tasks of a Site Reliability Engineer or Consultant in an Engineering team is ensuring that software systems are reliable, which means that the systems meet performance requirements. This includes working with developers, architects, and data operations teams to maximize the reliability of applications as they move down the software delivery pipeline.

Key areas of focus on Kubernetes as a Site Reliability Engineer

  • Service uptime and availability (99.9% ~ 95.x%)
  • Reliability
  • System observability
  • HA and DR
  • Automation
  • Fault-tolerance
  • GitOps (CI – Automation)
  • Containers
  • Single point of failure
  • Monitoring
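
To make the uptime figures in the list above more concrete, here is a minimal sketch that converts an availability target into the downtime it allows. The 30-day window and the function name are assumptions chosen for illustration, not part of any standard.

```python
# Allowed downtime for an availability target over a fixed window.
# The 30-day window is an illustrative assumption.

def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Return the minutes of downtime permitted by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    for target in (99.9, 99.0, 95.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes down per 30 days")
```

A 99.9% target leaves roughly 43 minutes of downtime in a 30-day window, while 95% leaves about 36 hours, which is why the choice of target matters so much for on-call planning.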

Management Call

  • On-call plans
  • Incident management
  • RCAs
  • Blameless postmortems
  • Software Design

There’s nothing as fun and challenging as working as an SRE in a Kubernetes environment!

Below, we delve into some key aspects and best practices an SRE will implement in a Kubernetes production environment. Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. We’ll look at on-premises use cases that ensure high availability with fault tolerance, as well as the Cloud environment.


Kubernetes was built for production readiness with a fault-tolerant design. It is high-availability ready, with various Kubernetes components ensuring that the cluster matches the state defined in the manifest file. For an on-premises use case, you’ll need three more nodes in addition to your existing nodes: one to act as a load balancer (the proxy node), and two more to act as the secondary and third master nodes, so that the masters can form a quorum.
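
The reason for three masters can be sketched with the majority-quorum arithmetic used by etcd, the store that holds the cluster state. This is a minimal illustration that assumes each master also runs an etcd member; the function names are made up for the example.

```python
# Majority quorum arithmetic for a cluster of voting members (e.g. etcd).
# Assumption: one etcd member per master node.

def quorum_size(members: int) -> int:
    """Smallest number of members that forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} masters: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Two masters tolerate no failures at all, because losing either one loses the majority. Three is the smallest count that survives the loss of one master.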

If you’re running applications in the Cloud, just add nodes with the necessary configuration. On-premises, deploy a load balancer configured to pass traffic through (you can use HAProxy), then get K8s running on the second and third masters. See the rough sketch diagram below:


Managing resources is critical to service reliability. One strategy used at the service level is ‘vertical scaling’, which increases memory and CPU resources. You need to consider several elements to get started. At what utilization level should your scaling algorithm kick in as external service requests grow? Above 30%? 50%? This is a very common question and use case in eCommerce applications, e.g. Black Friday, promotional sales, or traffic surges at certain times of the day. When it comes to caching strategies, what caching system are you likely to use? Whichever one you choose must meet the query performance requirements at the database backend layer and the latency requirements at the API layer as well.

Resource Metrics

The code snippet below features an HPA (Horizontal Pod Autoscaler) with its target resource utilization set to 60%. You can also set targets that track the resource usage of individual containers.
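
As a minimal sketch of such an HPA, the Python below builds an `autoscaling/v2` manifest as a plain dict and applies the replica calculation described in the Kubernetes documentation. The Deployment name `web`, the replica bounds, and the function names are hypothetical, and the real controller also clamps the result to the min/max bounds and applies stabilization, which this sketch leaves out.

```python
import json
import math


def hpa_manifest(deployment: str, min_replicas: int, max_replicas: int,
                 cpu_target_pct: int = 60) -> dict:
    """Build an autoscaling/v2 HorizontalPodAutoscaler as a plain dict."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa"},  # hypothetical naming scheme
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_target_pct},
                },
            }],
        },
    }


def desired_replicas(current_replicas: int, current_pct: float,
                     target_pct: float = 60.0, tolerance: float = 0.1) -> int:
    """ceil(current * current/target), skipped when within the tolerance band."""
    ratio = current_pct / target_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # The JSON output can be saved to a file and applied with kubectl.
    print(json.dumps(hpa_manifest("web", 2, 10), indent=2))
    print(desired_replicas(4, 90))  # utilization well above the 60% target
```

To track a single container instead of the whole pod, the metric entry would use the `ContainerResource` type with a `containerResource` block that names the container.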

Observability and Resiliency with Mesh

Why do you need a Service Mesh? If you’re running or planning microservices in a cloud-agnostic space with Kubernetes, it’s essential. Here are two use cases:

  1. Using various mesh services running as sidecars, aggregating logs and pushing metrics to service monitoring with OSS tools such as Grafana and Prometheus, or with enterprise observability tools such as Datadog and New Relic, to mention a few, along with distributed tracing and Fluentd.
  2. Using cloud-native tooling to build APM, create dashboards, and set alerts on endpoint failures. Backend observability is critical to the business: create internal backend alerts on connection failures to the databases or backend systems that act as the source of truth for running services.

Using Service Mesh to debug and mitigate App Failures

Here are some best practices; the list is not exhaustive:

  • Service Mesh status check

In many situations, it’s helpful to first check the status of the service mesh components. If the mesh itself is failing, such as the control plane not working, then the app failures you’re seeing may actually be caused by a larger problem, and not an issue with the app itself. For example, Linkerd provides the linkerd check command.

  • Service Proxy status check

Sometimes you may want to do a status check for other aspects of the service mesh, in addition to or instead of the ones you’ve just reviewed. For instance, you may just want to check the status of the service proxies that your app is supposed to be using. In Linkerd, you can do that by adding the --proxy flag to the linkerd check command.

  • Service route Metrics

If your service mesh status checks don’t report any problems, a common next step in troubleshooting is to look at the metrics for the app’s service routes in the mesh. In other words, you want to see measurements of how each of the routes within the mesh that the app uses are performing. These measurements are often useful for purposes other than troubleshooting, such as determining how the performance of an app could be improved.

Let’s say that you’re troubleshooting an app that is experiencing intermittent slowdowns and failures. You could look at the per-route metrics for the app to see if there’s a particular route that is the cause or is involved somehow.
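
A minimal sketch of that per-route comparison is shown below. The `RouteStats` type, the thresholds, and the sample numbers are all hypothetical; in practice the figures would come from your mesh's route metrics.

```python
from dataclasses import dataclass


@dataclass
class RouteStats:
    route: str
    successes: int
    failures: int
    p99_latency_ms: float

    @property
    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 1.0


def suspicious_routes(stats, min_success_rate=0.99, max_p99_ms=500.0):
    """Return routes that miss either threshold, worst success rate first."""
    flagged = [s for s in stats
               if s.success_rate < min_success_rate or s.p99_latency_ms > max_p99_ms]
    return sorted(flagged, key=lambda s: s.success_rate)


if __name__ == "__main__":
    # Made-up sample numbers, for illustration only.
    sample = [
        RouteStats("GET /cart", 995, 5, 120.0),
        RouteStats("POST /checkout", 900, 100, 800.0),
        RouteStats("GET /search", 998, 2, 900.0),
    ]
    for s in suspicious_routes(sample):
        print(f"{s.route}: success={s.success_rate:.1%}, p99={s.p99_latency_ms} ms")
```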

  • Injecting debug container

If you need to take an even closer look at what’s happening inside a pod, you may be able to have your service mesh inject a debug container into that pod. A debug container is designed to monitor the activity within the pod and to collect information on that activity, such as capturing network packets. In the Linkerd example, you can inject a debug container by adding the --enable-debug-sidecar flag to the linkerd inject command. You can then open a shell to the debug container and issue commands within the container to gather more information and continue troubleshooting the problem.

For more on Linkerd’s debug container (called the “debug sidecar”), see the Linkerd documentation.

  • Request logging

If you need more detail about requests and responses than you can get from the service route metrics, you may want to log the individual requests and responses.

Warning: Logging requests can generate a rapidly growing amount of log data. In many cases, you will only need to see a few logged requests, and not massive volumes of them.

Here are examples of a few logged requests. This log was generated by running the linkerd tap command, and blank lines have been added between the log entries to improve readability. These three entries all involve the same request. The first one shows what the request was, and the second shows the status code that was returned (in this case, a 503, Service Unavailable). The second and third entries both contain metrics for how this request was handled. This additional information, beyond what could be seen in route-level metrics, may help narrow your search for the problem.

To make more sense of the metrics and create some recording rules, configure Prometheus to scrape the proxies’ endpoints.

  • Distributed tracing

Another way that service meshes provide observability is through distributed tracing. The idea behind distributed tracing is to have a special trace header added to each request, with a unique ID inside each header. Typically this unique ID is a universally unique identifier (UUID) that is added at the point of Ingress. This way each ID typically relates to a user-initiated request, which can be useful when troubleshooting. Each request can be uniquely identified and its flow through the mesh monitored—where and when it traverses.

Every service mesh implements distributed tracing in different ways, but they have a few things in common. They all require the app code to be modified so each request will have the unique trace header added and propagated through the entire call chain of services. They also require the use of a separate tracing backend.
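
The app-side change can be sketched as follows. This is a minimal illustration that assumes a single header named `x-request-id`; real meshes and tracing backends use their own header sets, so treat the header name and both function names as placeholders.

```python
import uuid

# Assumed header name for illustration; check which headers your mesh expects.
TRACE_HEADER = "x-request-id"


def ensure_trace_id(incoming_headers: dict) -> dict:
    """At the point of ingress: keep an existing trace ID or mint a new UUID."""
    headers = {k.lower(): v for k, v in incoming_headers.items()}
    if TRACE_HEADER not in headers:
        headers[TRACE_HEADER] = str(uuid.uuid4())
    return headers


def outgoing_headers(incoming_headers: dict, extra: dict = None) -> dict:
    """In each service: copy the trace ID onto every downstream call."""
    out = dict(extra or {})
    normalized = {k.lower(): v for k, v in incoming_headers.items()}
    if TRACE_HEADER in normalized:
        out[TRACE_HEADER] = normalized[TRACE_HEADER]
    return out


if __name__ == "__main__":
    inbound = ensure_trace_id({"Accept": "application/json"})
    downstream = outgoing_headers(inbound, {"content-type": "application/json"})
    print(downstream[TRACE_HEADER] == inbound[TRACE_HEADER])
```

If any service in the call chain forgets to copy the header, the trace breaks at that hop, which is why propagation has to be done in the app code.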

Distributed tracing is intended to be used when the metrics and other information already collected by the service mesh don’t provide enough detail to troubleshoot a problem or understand an unexpected behavior. When used, distributed tracing can provide valuable insights into what’s happening within a service mesh.

Security, mTLS

Service meshes can protect the communications between pods by using Transport Layer Security (TLS), a cryptographic protocol. TLS uses cryptography to ensure that the information being communicated can’t be monitored or altered by others. For example, if a malicious actor had access to the networks the service mesh uses, that actor wouldn’t be able to see the information being transferred in the microservice-to-microservice communications.

Load Balancing Algorithm

There are several algorithms for performing load balancing, including Round Robin, Least Request, and Session Affinity. There are numerous implementations and use cases of these algorithms. For example, weighting is often added to round robin and least request algorithms so that some microservice instances receive a larger or smaller share of the requests than others. For instance, you might favor microservice instances that typically process requests more quickly than others. In practice, load balancing algorithms alone often don’t provide enough resilience. For example, they will continue to send requests to microservice instances that have failed and no longer respond to requests. This is where adding strategies like timeouts and automatic retries can be beneficial.
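
As a rough illustration of the ideas above, the sketch below implements round robin and least request pickers plus a simple retry wrapper. The class and function names are made up for this example, and a real mesh proxy does considerably more (weighting, health checks, backoff, retry budgets).

```python
import itertools


class RoundRobin:
    """Hand out instances in a fixed rotation."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(list(instances))

    def pick(self):
        return next(self._cycle)


class LeastRequest:
    """Pick the instance with the fewest requests currently in flight."""

    def __init__(self, instances):
        self.in_flight = {name: 0 for name in instances}

    def pick(self):
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def done(self, name):
        self.in_flight[name] -= 1


def call_with_retries(pick, send, attempts=3):
    """Retry on another instance when a call times out or cannot connect."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        instance = pick()
        try:
            return send(instance)
        except (TimeoutError, ConnectionError) as err:
            last_error = err  # try the next instance the balancer hands out
    raise last_error


if __name__ == "__main__":
    balancer = RoundRobin(["pod-a", "pod-b", "pod-c"])

    def send(instance):
        if instance == "pod-a":  # pretend pod-a has failed
            raise ConnectionError(f"{instance} is not responding")
        return f"response from {instance}"

    print(call_with_retries(balancer.pick, send))
```

Without the retry wrapper, every third request in this example would fail, because round robin keeps sending traffic to the dead instance.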


Code compliance, Infrastructure-as-Code, and Helm deployment (app management) are among the cheapest practices to adopt, yet among the most expensive to neglect. Good practices here include enforcing code quality with linters, applying controls, and planning for failover use cases. Another way to get back up as quickly as possible is automating repetitive tasks on your infrastructure, such as backups, SSL certificates, installations, and more.

GitOpsCD – Automation

GitOps enhances speed while improving availability strategies in production environments, with quick service rollbacks and updates that are driven by release integrity and ownership. Cloud-native tools for this include, but are not limited to, ArgoCD. Its most interesting component is the Application controller, which continuously monitors running applications and compares the current live state with the desired target state (specified in the repository). If an OutOfSync state is detected, it can take corrective action. Flux, at least in its original (v1) design, observes one repo at a time, meaning you generally have one Flux instance running for each app. In contrast, ArgoCD can observe multiple repos, comes equipped with a GUI dashboard, can be federated with an identity provider, and is more enterprise-ready.
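
The compare-and-correct loop can be sketched in a few lines. This is a toy model over flat dicts, not ArgoCD's actual implementation; the function names and the sample state are hypothetical, and only the status names Synced and OutOfSync are borrowed from ArgoCD.

```python
# Toy GitOps reconciliation: compare desired state (from the repo) with
# live state (from the cluster) and correct any drift.

MISSING = object()  # marks a key that is absent on one side


def diff_state(desired: dict, live: dict) -> dict:
    """Map each drifted key to a (live_value, desired_value) pair."""
    changes = {}
    for key in desired.keys() | live.keys():
        have = live.get(key, MISSING)
        want = desired.get(key, MISSING)
        if have != want:
            changes[key] = (have, want)
    return changes


def reconcile(desired: dict, live: dict) -> str:
    """Report the sync status found, then bring live in line with desired."""
    changes = diff_state(desired, live)
    status = "OutOfSync" if changes else "Synced"
    for key, (_, want) in changes.items():
        if want is MISSING:
            live.pop(key, None)  # remove anything not declared in the repo
        else:
            live[key] = want
    return status


if __name__ == "__main__":
    desired = {"image": "web:1.2", "replicas": 3}
    live = {"image": "web:1.1", "replicas": 3, "debug": True}
    print(reconcile(desired, live))  # drift detected and corrected
    print(reconcile(desired, live))  # nothing left to do
```

A rollback in this model is simply a revert of the desired state in the repository, after which the same loop brings the cluster back.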

Container HealthCheck

  • Using init containers: these are specialized containers that run before app containers in a pod. Init containers can contain utilities or setup scripts not present in an app image. They have advantages over startup-related code in the app itself (for instance, a sleep shell command used when running a JAR alongside Kafka): the custom setup code stays out of the app image, and they provide a mechanism to block or delay app container startup until a set of preconditions is met.
  • Regular containers support lifecycle, livenessProbe, readinessProbe, and startupProbe. Init containers do not, because they must run to completion before the Pod can be ready.
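
The "block until preconditions are met" behavior of an init container can be sketched as a small wait loop. This is a minimal illustration with hypothetical names; the precondition checks are passed in as callables (for example, "can I open a connection to Kafka?"), and the attempt count and delay are arbitrary defaults.

```python
import time


def wait_for(preconditions, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Block until every precondition returns True; return the attempt number.

    Raises TimeoutError if the preconditions are still unmet after the last
    attempt, which would make the init container (and the Pod start) fail.
    """
    for attempt in range(1, attempts + 1):
        if all(check() for check in preconditions):
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)
    raise TimeoutError(f"preconditions not met after {attempts} attempts")


if __name__ == "__main__":
    state = {"calls": 0}

    def dependency_ready():
        # Stand-in for a real check; reports ready on the third call.
        state["calls"] += 1
        return state["calls"] >= 3

    tries = wait_for([dependency_ready], delay_seconds=0.1)
    print(f"dependency ready after {tries} attempts")
```

Unlike a fixed sleep in the app's startup script, the loop proceeds as soon as the dependency is actually available and fails loudly if it never is.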

Single Point of Failure

The best strategies to avoid single points of failure in your production environment are multi-region deployment, DR, and IaC. Kindly see a production environment in a multi-region deployment on the AWS cloud:

Watch this space for a future article: Developing SLO and Monitoring as an SRE!
