Understanding SLI, SLO, and SLA

Service Level Indicator (SLI):

Think of an SLI as a measure of how well something is doing what it is supposed to do. In technical terms, it is something you can “feel” when using a product built with this approach. For instance, from a user’s perspective, response time, error rate, and availability are all indicators.

Going a little deeper, here are some classic examples of SLIs:

Request latency: requests should complete in under 330 milliseconds
Availability: the server should be available 99.9% of the time over a given period
Throughput: for an e-commerce endpoint, for instance, the number of successful purchases per minute
Error rate: the error rate for the service should be below 1%

All the points above help us measure the service level delivered by a system. When thinking about SLIs, remember that they are associated with Product Managers, Product Owners, and SREs, who work together to design clear technical objectives.
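To make the indicators above concrete, here is a minimal sketch of how two of them could be computed from raw request data. The `Request` record, the sample values, and the function names are assumptions made for illustration; real data would come from your logs or monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    status: int        # HTTP status code returned


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served faster than threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def error_rate_sli(requests: list[Request]) -> float:
    """Percentage of requests that ended in a server error (HTTP 5xx)."""
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    return 100.0 * errors / len(requests)


# Hypothetical sample: four requests, one slow and one failed.
sample = [
    Request(120, 200),
    Request(340, 200),
    Request(90, 500),
    Request(210, 200),
]
print(latency_sli(sample, 330))  # 75.0
print(error_rate_sli(sample))    # 25.0
```

Both functions assume a non-empty list of requests. Note that an SLI is only the measurement; whether 75% is good enough is a question for the SLO.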

Service Level Objective (SLO):

The SLO, on the other hand, is tied to the word “promise.” It states how the service must perform most of the time and quantifies the reliability of a product, which is directly related to the customer experience.

Some cases that can be associated with SLOs:

Response time of 100 milliseconds for all requests
System uptime of 99.99%
An error rate of less than 0.8%
Error budget

Generally, SLO targets tend to be aggressive. However, chasing perfection may not be worth it. In the end, the customers need to be happy. If 99.99% gives customers happiness and peace of mind, there is no need to aim for a higher value.
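The error budget mentioned in the list above follows directly from the SLO: it is whatever the target leaves over. A minimal sketch, assuming a 30-day window and a time-based availability SLO (the function name is ours, not from any library):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.2f} min per 30 days")
```

Over 30 days this works out to roughly 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.32 minutes at 99.99%. Each extra nine cuts the budget by a factor of ten, which is why a higher target is not automatically a better one.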

In the Preface of the book Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets[7], the author gives a great example on the idea that “You Don’t Have to Be Perfect,” which could help you on your journey with SLOs.

Service Level Agreement (SLA):

If an “agreement” built on the promises above is broken, something tangible, such as a value or a price, must be on the table. In other words, an SLA is a contract. Almost all of the consequences are financial, but the commitments can vary, for instance:

If uptime falls below 99.9% during Black Friday week, the provider will issue a 40% discount to the customer.
Support requests will be responded to within 1 hour.
Maintenance will be scheduled outside of business hours.
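The first commitment above can be expressed as a simple rule. This is only a sketch: the function name and the monthly fee are made up for illustration, and a real contract would define exactly how uptime is measured and how the discount is applied.

```python
def service_credit(measured_uptime: float, sla_uptime: float,
                   credit_percent: float, monthly_fee: float) -> float:
    """Credit owed to the customer, or 0.0 when the SLA was met."""
    if measured_uptime >= sla_uptime:
        return 0.0
    return monthly_fee * credit_percent / 100.0


# Hypothetical customer paying 1000.0 per month, 99.9% SLA, 40% discount.
print(service_credit(99.95, 99.9, 40, 1000.0))  # 0.0
print(service_credit(99.80, 99.9, 40, 1000.0))  # 400.0
```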

Real-World SLI, SLO, and SLA Examples in Public Clouds

AWS Example

SLI: API Gateway latency (measured via CloudWatch):
- SLI = Percentage of requests with latency < 200ms
SLO: 99.5% of API requests must have latency < 200ms over a rolling 30-day window.
SLA: If monthly API uptime drops below 99.9%, the customer receives a 10% service credit.

Implementation:

Use CloudWatch metrics and alarms to monitor latency and availability.
Define SLOs in documentation and dashboards.
Reference: AWS Service Level Agreements
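The “rolling 30-day window” in the SLO above means that each new day pushes the oldest day out of the calculation. The sketch below shows that logic in plain Python; it does not call CloudWatch, and the `RollingSLO` class and the daily counts are assumptions used only to illustrate the idea.

```python
from collections import deque


class RollingSLO:
    """Tracks (fast, total) request counts per day over a rolling window."""

    def __init__(self, target_percent: float, window_days: int = 30):
        self.target_percent = target_percent
        # With maxlen set, the oldest day drops off automatically.
        self.days = deque(maxlen=window_days)

    def record_day(self, fast_requests: int, total_requests: int) -> None:
        self.days.append((fast_requests, total_requests))

    def sli(self) -> float:
        """Percentage of fast requests across the whole window."""
        fast = sum(f for f, _ in self.days)
        total = sum(t for _, t in self.days)
        return 100.0 * fast / total

    def is_met(self) -> bool:
        return self.sli() >= self.target_percent


# Hypothetical month: 29 good days and one bad day.
slo = RollingSLO(target_percent=99.5, window_days=30)
for _ in range(29):
    slo.record_day(fast_requests=9_980, total_requests=10_000)
slo.record_day(fast_requests=9_000, total_requests=10_000)
print(round(slo.sli(), 2), slo.is_met())  # 99.47 False
```

A single bad day is enough to miss the 99.5% target here, even though every other day was at 99.8%.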

Azure Example

SLI: Azure App Service HTTP success rate, based on HTTP 5xx errors (measured via Azure Monitor):
- SLI = Percentage of successful (non-5xx) HTTP requests
SLO: 99.95% of requests must succeed each month.
SLA: If uptime falls below 99.95%, a service credit is issued per Azure SLA.

Implementation:

Use Azure Monitor and Application Insights for real-time tracking.
Set up alerts and dashboards for SLO compliance.
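One common way to alert on SLO compliance is to look at the burn rate, that is, how fast the error budget is being consumed compared with the pace the SLO allows. The sketch below is tool-agnostic; the function names and the alert threshold of 2.0 are assumptions for illustration, not values taken from Azure.

```python
def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """How fast the error budget is being consumed.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it
    would be exhausted in half the window.
    """
    observed_error_rate = failed / total
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return observed_error_rate / allowed_error_rate


def should_alert(failed: int, total: int, slo_percent: float,
                 max_burn_rate: float = 2.0) -> bool:
    """True when errors arrive faster than the assumed threshold allows."""
    return burn_rate(failed, total, slo_percent) > max_burn_rate


# With a 99.95% SLO, 0.05% of requests are allowed to fail.
print(should_alert(failed=2, total=10_000, slo_percent=99.95))   # False
print(should_alert(failed=20, total=10_000, slo_percent=99.95))  # True
```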

GCP Example

SLI: Google Cloud Storage availability (measured via Cloud Monitoring, formerly Stackdriver):
- SLI = Percentage of successful object retrievals
SLO: 99.99% monthly availability for object retrievals.
SLA: If availability drops below 99.99%, customers receive credits as per GCP SLA.

Implementation:

Use Cloud Monitoring to track and alert on SLI breaches.
Document SLOs and SLAs in internal and customer-facing docs.

How a Private Company Defines and Follows SLI, SLO, and SLA

1. Define SLIs (What to Measure)

Identify key user journeys (e.g., login, checkout, API call).
Choose measurable indicators (latency, error rate, availability, throughput).
Example: SLI = Percentage of checkout requests completed in < 1s.

2. Set SLOs (Targets for SLIs)

Set realistic, customer-focused targets (e.g., 99.9% of checkouts < 1s).
Involve product, engineering, and business teams.
Document SLOs in runbooks and dashboards.

3. Establish SLAs (External Commitments)

Define contractual obligations (e.g., 99.9% uptime per month, 1-hour support response).
Specify remedies for breaches (service credits, penalties).
Communicate SLAs to customers and stakeholders.

4. Monitor and Enforce

Use cloud-native tools (CloudWatch, Azure Monitor, Google Cloud Monitoring) to track SLIs.
Automate alerting for SLO breaches.
Review SLO/SLA performance in regular ops meetings.

5. Iterate and Improve

Analyze incidents and error budgets.
Adjust SLOs as business needs evolve.
Share learnings with engineering and product teams.

Example: SLI/SLO/SLA Table for a SaaS API

Metric	SLI Definition	SLO Target	SLA Commitment
Latency	% requests < 300ms (API Gateway)	99.5% per month	99.0% per month, 10% credit if breached
Error Rate	% HTTP 5xx errors (App Service)	<0.5% per month	<1% per month, 5% credit if breached
Availability	% successful requests (Cloud Storage)	99.99% per month	99.9% per month, 10% credit if breached
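The table can also be read as a small decision rule: a measurement can meet the SLO, miss the SLO while still meeting the SLA, or breach the SLA. Here is a sketch of that rule using the numbers from the table; the `Commitment` class, the `evaluate` function, and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Commitment:
    metric: str
    slo: float              # internal target, in percent
    sla: float              # contractual threshold, in percent
    credit_percent: float   # credit owed if the SLA is breached
    higher_is_better: bool  # False for error rate, where lower is better


TABLE = [
    Commitment("Latency", 99.5, 99.0, 10, True),
    Commitment("Error Rate", 0.5, 1.0, 5, False),
    Commitment("Availability", 99.99, 99.9, 10, True),
]


def evaluate(c: Commitment, measured: float) -> str:
    """Classify a monthly measurement against the SLO and the SLA."""
    def ok(threshold: float) -> bool:
        if c.higher_is_better:
            return measured >= threshold
        return measured < threshold

    if ok(c.slo):
        return "SLO met"
    if ok(c.sla):
        return "SLO missed, SLA met"
    return f"SLA breached, {c.credit_percent}% credit"


# Hypothetical measurements for one month.
measured = {"Latency": 99.2, "Error Rate": 0.3, "Availability": 99.5}
for c in TABLE:
    print(c.metric, "->", evaluate(c, measured[c.metric]))
```

The gap between the SLO and the SLA in each row is the safety margin: the team gets a warning (a missed SLO) before the customer is owed a credit.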

Best Practices:

Use Infrastructure as Code (Terraform, ARM, Deployment Manager) to automate monitoring setup.
Store SLO definitions in version control and keep them visible to all teams.
Regularly review and update SLOs and SLAs as your product and customer needs change.
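As an illustration of keeping SLO definitions in version control, the definitions can live in a small, reviewable file that is validated before use. The sketch below assumes a JSON file with made-up field names (`name`, `sli`, `target`, `window_days`); it is one possible layout, not a standard format.

```python
import json

# Hypothetical contents of a file such as slos.json kept in the repository.
SLO_FILE = """
[
  {"name": "checkout-latency", "sli": "% of checkout requests < 1s",
   "target": 99.9, "window_days": 30},
  {"name": "api-availability", "sli": "% of successful requests",
   "target": 99.99, "window_days": 30}
]
"""

REQUIRED_KEYS = {"name", "sli", "target", "window_days"}


def load_slos(text: str) -> list[dict]:
    """Parse SLO definitions and reject malformed entries."""
    slos = json.loads(text)
    for slo in slos:
        missing = REQUIRED_KEYS - set(slo)
        if missing:
            raise ValueError(f"{slo.get('name', '?')}: missing {sorted(missing)}")
        if not 0 < slo["target"] <= 100:
            raise ValueError(f"{slo['name']}: target must be in (0, 100]")
    return slos


for slo in load_slos(SLO_FILE):
    print(slo["name"], slo["target"])
```

Running a check like this in a pull request means a change to a target is reviewed and recorded like any other code change.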

For more, see: