Site Reliability Engineering - Smart Way Of Scaling & Managing Enterprise By Supriyo Das, VP, Wipro Technologies

With the Covid-19 pandemic, more and more services are going online, whether consumer, media, education, enterprise collaboration or health. Even parts of manufacturing operations are moving online, into the cloud. Moving to the cloud is only the first step; it is equally important to make sure these services are kept up and running at 99.999% uptime.

There are two dimensions to this: reliability and cost. A Google study says 40-90% of the total costs of a system are incurred after birth. This calls for a specialized software engineering team focused on optimizing performance and ensuring stability and reliability across all services. Site Reliability Engineering (SRE) addresses this need: it applies software engineering practices to infrastructure and operations problems in order to create scalable and highly reliable software systems. Since the software system that an SRE team oversees is expected to be highly automated and self-healing, a site reliability engineer spends considerable time on development tasks such as adopting new features, scaling and automation. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response and capacity planning of its service(s).

Modern hyperscale products typically comprise many moving parts, and a multi-cloud setup adds even more complexity.

The traditional approach, based on the philosophy of "prevent the system from failing", doesn't quite work. With that many moving parts, there is bound to be a disruption somewhere, resulting in failures. The philosophy therefore needs to be closer to "expect failures to happen; build systems that are resilient to these failures". A site reliability engineer spends up to 50% of their time on "ops" work such as issues, on-call duty and manual intervention, and the remaining time on development tasks that address the root causes of failures, scale the automation and add features for system resiliency.

If you can't measure it, you can't manage it. In a Site Reliability Engineering framework, we create a Service Level Agreement (SLA) between provider and customer. The SLA is broken down into Service Level Objectives (SLOs), which are specific targets for metrics such as uptime and response time, and these are measured with Service Level Indicators (SLIs). For example, if your SLA specifies that your systems will be available 99.95% of the time, your SLO is likely 99.95% uptime and your SLI is the actual measurement of your uptime. Maybe it's 99.96%. Maybe 99.99%. To stay in compliance with your SLA, the SLI needs to meet or exceed the promises made in the SLA document.
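
The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the probe-based way of computing the SLI and the sample figures are hypothetical, not part of any particular SRE toolset.

```python
# Illustrative sketch only; names and sample numbers are assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(slo_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime that an availability SLO permits over the given period."""
    return period_minutes * (1 - slo_percent / 100)


def availability_sli(successful_probes: int, total_probes: int) -> float:
    """Measured availability (the SLI) as the percentage of successful probes."""
    if total_probes <= 0:
        raise ValueError("total_probes must be positive")
    return 100 * successful_probes / total_probes


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """The SLI must meet or exceed the SLO to stay in compliance."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    slo = 99.95                             # target taken from the example above
    sli = availability_sli(9_996, 10_000)   # hypothetical probe results
    print("allowed downtime per year (min):", allowed_downtime_minutes(slo))
    print("measured SLI (%):", sli, "| meets SLO:", meets_slo(sli, slo))
```

Under these assumptions, a 99.95% SLO leaves roughly 263 minutes of downtime per year, and a measured SLI of 99.96% stays within it.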

An SRE organization implements a system, independent of the product team, to measure the SLOs, often using third-party probes. It also has a rich set of tools and frameworks that provide visibility into the health and performance of services across all layers, covering availability, performance, monitoring dashboards and alerts.

Another important metric is the error budget. While the development team would like to push new releases, the SRE team will always want to ensure resilience and reliability. This creates an incentive to push back against a high rate of change, but doing so would stifle innovation. Most of the time we do not need 100% uptime. For example, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! Rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users' overall happiness with features, service and performance is optimized.

The error budget defines how much time you are willing to allow your systems to be down, and it depends on the SLA that you have defined with the product team. The development team can "spend" this error budget in whatever way it likes. If the product is currently running flawlessly, with few or no errors, the team can launch new releases. Conversely, if the team has met or exceeded the error budget and the service is operating at or below the defined SLA, all launches are frozen until the number of errors is reduced to a level that allows launches to proceed.
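
As a rough sketch of how this launch gate might be computed, the Python snippet below derives a time-based error budget from an availability target and freezes launches once the budget is used up. The class name, the 30-day window and the sample figures are assumptions made for illustration; real teams often define budgets over requests rather than minutes.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical time-based error budget for one measurement window."""
    slo_percent: float       # availability target, e.g. 99.9
    period_minutes: float    # length of the measurement window
    downtime_minutes: float  # downtime observed so far in the window

    @property
    def budget_minutes(self) -> float:
        # Total downtime the target tolerates over the window.
        return self.period_minutes * (1 - self.slo_percent / 100)

    @property
    def remaining_minutes(self) -> float:
        # A negative value means the budget has been overspent.
        return self.budget_minutes - self.downtime_minutes

    def launches_allowed(self) -> bool:
        # Launches are frozen once the budget is met or exceeded.
        return self.remaining_minutes > 0


if __name__ == "__main__":
    window = 30 * 24 * 60  # assumed 30-day window = 43,200 minutes
    healthy = ErrorBudget(slo_percent=99.9, period_minutes=window, downtime_minutes=10)
    exhausted = ErrorBudget(slo_percent=99.9, period_minutes=window, downtime_minutes=50)
    print(healthy.budget_minutes, healthy.launches_allowed())
    print(exhausted.remaining_minutes, exhausted.launches_allowed())
```

With a 99.9% target over 30 days the budget is about 43 minutes, so 10 minutes of downtime leaves room for launches while 50 minutes freezes them.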

A typical SRE life cycle is as follows:

• Planning & Requirement Finalization: SLO and SLI definition; architecture inputs and review for scalability, performance and reliability; capacity and scalability planning

• Design, Development & Testing: design of monitoring aspects; release process; configuration and performance engineering; network/service deployment topology design

• Product Launch & Operations: test, launch and deployment; monitoring of SLOs and metrics; error budget tracking; revisiting SLOs if needed; emergency response, alerts, logging, triage, and incident resolution/fixes

• Continuous Improvement & Automation: SRE accepts that incidents will happen. Incidents are analyzed for root cause, and the software is enhanced so that the system becomes more resilient to such failures.

Testing is a very important aspect of SRE. Complex systems today are built from several modules integrated together. While a lot of integration testing does happen, there are unknown areas of the system that surface in production at the most unwanted moments. SRE uses proactive testing methods such as chaos engineering to expose these parts of the system before a real incident happens. SRE also uses other forms of testing, such as canary testing and A/B testing, to test new releases without affecting overall system performance.
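
To make the canary idea concrete, the sketch below compares the error rate of a small canary deployment against the existing baseline and decides whether the rollout should continue. It is a simplified illustration: the function names, the 1.5x tolerance and the 0.1% floor are assumptions, and production canary analysis usually considers more signals (latency, saturation) and statistical significance.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return errors / requests


def canary_is_healthy(baseline_errors: int, baseline_requests: int,
                      canary_errors: int, canary_requests: int,
                      max_ratio: float = 1.5,
                      rate_floor: float = 0.001) -> bool:
    """Return True if the canary's error rate is acceptable.

    The canary passes when its error rate is at most max_ratio times the
    baseline rate. rate_floor (an assumed absolute tolerance) keeps a
    near-zero baseline from failing the canary on a single stray error.
    """
    baseline_rate = error_rate(baseline_errors, baseline_requests)
    canary_rate = error_rate(canary_errors, canary_requests)
    threshold = max(baseline_rate * max_ratio, rate_floor)
    return canary_rate <= threshold


if __name__ == "__main__":
    # Hypothetical traffic: the canary receives a small share of requests.
    print(canary_is_healthy(10, 10_000, 1, 1_000))  # canary rate equals baseline
    print(canary_is_healthy(10, 10_000, 5, 1_000))  # canary rate five times baseline
```

A failing check would stop the rollout while only the canary's share of traffic has been exposed to the new release.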

A frequently asked question is what the difference between DevOps and SRE is. DevOps, as practiced in most organizations, focuses more on automation, whereas SRE focuses more on aspects such as system availability, observability and scale.

In summary, while it is easy to understand and manage a small piece of a system, the complexity is of a different order at the level of the entire system for today's enterprises, which are distributed across the globe, cloud native and microservices based, and which need frequent releases and updates. Traditional IT operations will not be able to cope with this complexity and scale. It requires deep software engineering skills, product knowledge, automation experience and software problem-solving skills. Systems need to heal and scale automatically. Enterprises need a strong SRE team to manage these complexities and keep the lights on within budget.