freundcloud

Understanding SLI, SLO, and SLA

Service Level Indicator (SLI):

An SLI measures how well a service is doing what it is supposed to do. It is something you can “feel” when using a product: from a user’s perspective, response time, error rate, and availability are all indicators.

Going a bit deeper, here are classic examples of SLIs, each paired with the threshold it is usually compared against:

  • Request latency should be under 330 milliseconds.
  • Server availability should be 99.9% over a given period.
  • Throughput for an e-commerce endpoint, measured for instance as the number of successful purchases per minute.
  • The error rate for the service should be below 1%.

All of the points above help us measure the service level a system delivers. SLIs are usually defined together by Product Managers, Product Owners, and SREs, who use them to design clear, technical objectives.
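
As a minimal sketch of how indicators like these could be computed, the snippet below derives a latency SLI and an error-rate SLI from a handful of request records. The `Request` type, its field names, and the sample values are hypothetical, and the functions assume a non-empty sample; in practice the data would come from your monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request; the field names are hypothetical."""
    latency_ms: float
    status: int  # HTTP status code


def latency_sli(requests, threshold_ms):
    """Percentage of requests answered faster than threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def error_rate_sli(requests):
    """Percentage of requests that ended in a server error (HTTP 5xx)."""
    failed = sum(1 for r in requests if r.status >= 500)
    return 100.0 * failed / len(requests)


# Illustrative sample, not real traffic.
sample = [
    Request(120, 200),
    Request(250, 200),
    Request(400, 200),
    Request(90, 500),
]

print(latency_sli(sample, threshold_ms=330))  # 75.0
print(error_rate_sli(sample))                 # 25.0
```

Expressing each SLI as a percentage of "good" events keeps it directly comparable with a percentage target.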

Service Level Objectives (SLO):

An SLO, on the other hand, is tied to the word “promise.” It states how the service must perform most of the time and so quantifies the reliability of a product, which is directly related to the customer experience.

Some examples that can be associated with SLOs:

  • Response time of 100 milliseconds for all requests
  • System uptime of 99.99%
  • An error rate of less than 0.8%
  • An error budget, that is, the amount of unreliability the SLO still allows (0.01% for a 99.99% target)

Generally, SLO targets tend to be aggressive. However, chasing perfection may not be worth it. In the end, what matters is that customers are happy. If 99.99% keeps customers happy, there is no need to aim for a higher value.
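
To make the cost of each extra "nine" concrete, here is a small sketch that converts an availability target into the downtime it allows. The function name and the 30-day window are assumptions chosen for illustration; targets are passed as strings so that `Fraction` can parse them exactly.

```python
from fractions import Fraction


def allowed_downtime_minutes(slo_percent: str, window_days: int = 30) -> float:
    """Minutes of downtime an availability target tolerates in the window."""
    budget = 1 - Fraction(slo_percent) / 100  # allowed unavailable fraction
    return float(budget * window_days * 24 * 60)


for target in ("99.9", "99.99", "99.999"):
    print(target, allowed_downtime_minutes(target))
# 99.9 43.2
# 99.99 4.32
# 99.999 0.432
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why a higher target is only worth it if customers actually notice the difference.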

In the preface of the book Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets[7], the author gives a great example under the heading “You Don’t Have to Be Perfect” that could help you on your journey with SLOs.

Service Level Agreement (SLA):

If a promise like the ones above is broken, something tangible must be on the table: a value or a price. In other words, an SLA is a contract. Almost all of the consequences are financial, but the commitments themselves can vary, for instance:

  • If uptime falls below 99.9% during Black Friday week, the provider issues a 40% discount to the customer.
  • Support requests will be responded to within 1 hour.
  • Maintenance will be scheduled outside of business hours.
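
The financial side of an SLA can be sketched as a simple rule. The function below is a hypothetical illustration of the uptime example above (threshold 99.9%, 40% discount); the invoice amount is invented, and real contracts usually define tiers and exclusions that are not modeled here.

```python
def sla_credit(measured_uptime_percent: float,
               threshold_percent: float,
               credit_percent: float,
               invoice: float) -> float:
    """Amount credited to the customer for one billing period."""
    if measured_uptime_percent < threshold_percent:
        return invoice * credit_percent / 100
    return 0.0


# Uptime of 99.7% misses the 99.9% commitment, so 40% of the invoice is credited.
print(sla_credit(99.7, 99.9, 40, 1000.0))   # 400.0
# Uptime of 99.95% meets the commitment, so nothing is owed.
print(sla_credit(99.95, 99.9, 40, 1000.0))  # 0.0
```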

Real-World SLI, SLO, and SLA Examples in Public Clouds

AWS Example

  • SLI: API Gateway latency (measured via CloudWatch):

    • SLI = Percentage of requests with latency < 200ms
  • SLO: 99.5% of API requests must have latency < 200ms over a rolling 30-day window.
  • SLA: If monthly API uptime drops below 99.9%, the customer receives a 10% service credit.

Implementation:

  • Use CloudWatch metrics and alarms to monitor latency and availability.
  • Define SLOs in documentation and dashboards.
  • Reference: AWS Service Level Agreements
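
The "rolling 30-day window" in the SLO above can be sketched without any cloud API. The class below is an assumption-laden illustration: the daily counts of good and total requests are invented, and in a real setup they would be exported from CloudWatch rather than typed in.

```python
from collections import deque


class RollingSLO:
    """Tracks a request-based SLI over the last `window_days` daily buckets."""

    def __init__(self, target_percent: float, window_days: int = 30):
        self.target_percent = target_percent
        # With maxlen set, the oldest day drops out when a new one arrives.
        self.days = deque(maxlen=window_days)

    def add_day(self, good: int, total: int) -> None:
        self.days.append((good, total))

    def sli_percent(self) -> float:
        good = sum(g for g, _ in self.days)
        total = sum(t for _, t in self.days)
        return 100.0 * good / total if total else 100.0

    def is_met(self) -> bool:
        return self.sli_percent() >= self.target_percent


slo = RollingSLO(target_percent=99.5)
for _ in range(30):
    slo.add_day(good=9_980, total=10_000)  # 99.8% of requests under 200ms
print(slo.is_met())  # True

slo.add_day(good=8_000, total=10_000)      # one bad day replaces the oldest day
print(slo.is_met())  # False
```

A single bad day can push a rolling window out of compliance even when every other day was healthy, which is the behavior a calendar-month report tends to hide until month end.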

Azure Example

  • SLI: Azure App Service HTTP success rate, derived from the HTTP 5xx error count (measured via Azure Monitor):

    • SLI = Percentage of HTTP requests that do not return a 5xx response
  • SLO: 99.95% of requests must succeed each month.
  • SLA: If uptime falls below 99.95%, a service credit is issued per Azure SLA.

Implementation:

  • Use Azure Monitor and Application Insights for real-time tracking.
  • Set up alerts and dashboards for SLO compliance.
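
For a request-based target such as 99.95%, compliance can be tracked as a count of failures that are still allowed. The sketch below assumes hypothetical monthly numbers; the request and error counts would normally be read from Azure Monitor.

```python
from fractions import Fraction


def failure_budget(slo_percent: str, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates for this traffic volume."""
    allowed_fraction = 1 - Fraction(slo_percent) / 100
    # Round down: a fraction of a request cannot fail.
    return int(allowed_fraction * total_requests)


total = 2_000_000     # hypothetical monthly request count
server_errors = 700   # hypothetical count of HTTP 5xx responses

budget = failure_budget("99.95", total)
print(budget)                  # 1000
print(budget - server_errors)  # 300
```

Here 300 more failures fit within the month's budget before the 99.95% target is missed.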

GCP Example

  • SLI: Google Cloud Storage availability (measured via Cloud Monitoring, formerly Stackdriver):

    • SLI = Percentage of successful object retrievals
  • SLO: 99.99% monthly availability for object retrievals.
  • SLA: If availability drops below 99.99%, customers receive credits as per GCP SLA.

Implementation:

  • Use Cloud Monitoring to track and alert on SLI breaches.
  • Document SLOs and SLAs in internal and customer-facing docs.

How a Private Company Defines and Follows SLI, SLO, and SLA

1. Define SLIs (What to Measure)

  • Identify key user journeys (e.g., login, checkout, API call).
  • Choose measurable indicators (latency, error rate, availability, throughput).
  • Example: SLI = Percentage of checkout requests completed in < 1s.

2. Set SLOs (Targets for SLIs)

  • Set realistic, customer-focused targets (e.g., 99.9% of checkouts < 1s).
  • Involve product, engineering, and business teams.
  • Document SLOs in runbooks and dashboards.

3. Establish SLAs (External Commitments)

  • Define contractual obligations (e.g., 99.9% uptime per month, 1-hour support response).
  • Specify remedies for breaches (service credits, penalties).
  • Communicate SLAs to customers and stakeholders.

4. Monitor and Enforce

  • Use cloud-native tools (CloudWatch, Azure Monitor, Google Cloud Monitoring) to track SLIs.
  • Automate alerting for SLO breaches.
  • Review SLO/SLA performance in regular ops meetings.
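
One common way to automate alerting is to page on the burn rate, meaning how fast the error budget is being spent, rather than on the raw error rate. This is a technique added here for illustration and is not prescribed by the steps above. The threshold of 14.4 is an assumption: it corresponds to spending 2% of a 30-day budget within one hour (0.02 × 720 hours). The function also assumes the target is below 100%.

```python
from fractions import Fraction

# Assumed paging threshold: 2% of a 30-day budget burned in one hour.
PAGE_THRESHOLD = 14.4


def burn_rate(bad: int, total: int, slo_percent: str) -> float:
    """Observed failure fraction divided by the allowed one; 1.0 is on budget."""
    budget = 1 - Fraction(slo_percent) / 100  # allowed failure fraction
    observed = Fraction(bad, total)           # failure fraction in the alert window
    return float(observed / budget)


def should_page(bad: int, total: int, slo_percent: str) -> bool:
    return burn_rate(bad, total, slo_percent) >= PAGE_THRESHOLD


print(burn_rate(50, 10_000, "99.9"))     # 5.0
print(should_page(50, 10_000, "99.9"))   # False
print(should_page(200, 10_000, "99.9"))  # True
```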

5. Iterate and Improve

  • Analyze incidents and error budgets.
  • Adjust SLOs as business needs evolve.
  • Share learnings with engineering and product teams.
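
When analyzing incidents against an error budget, it helps to express each review period as a share of the budget consumed. The sketch below assumes a 99.9% availability SLO over 30 days; the incident names and durations are invented.

```python
from fractions import Fraction


def budget_consumed_percent(slo_percent: str, incident_minutes,
                            window_days: int = 30) -> float:
    """Share of the downtime budget used up by the given incidents."""
    budget_minutes = (1 - Fraction(slo_percent) / 100) * window_days * 24 * 60
    return float(sum(incident_minutes) / budget_minutes * 100)


# Hypothetical incidents and their durations in minutes.
incidents = {"database failover": 12, "bad deploy": 20}

consumed = budget_consumed_percent("99.9", incidents.values())
print(round(consumed, 1))  # 74.1
```

A figure like this gives the review meeting a concrete basis for deciding whether to slow down releases or whether the target itself needs adjusting.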

Example: SLI/SLO/SLA Table for a SaaS API

| Metric | SLI Definition | SLO Target | SLA Commitment |
|---|---|---|---|
| Latency | % requests < 300ms (API Gateway) | 99.5% per month | 99.0% per month, 10% credit if breached |
| Error Rate | % HTTP 5xx errors (App Service) | <0.5% per month | <1% per month, 5% credit if breached |
| Availability | % successful requests (Cloud Storage) | 99.99% per month | 99.9% per month, 10% credit if breached |
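
The table can also be read as data. The sketch below encodes its three rows and checks one month of measurements against both the internal SLO and the contractual SLA. The `Commitment` type, the `higher_is_better` flag, and the measured values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commitment:
    metric: str
    slo: float              # internal target, in percent
    sla: float              # contractual threshold, in percent
    credit: int             # service credit in percent if the SLA is breached
    higher_is_better: bool  # False for error rate, where lower is better


TABLE = [
    Commitment("latency", slo=99.5, sla=99.0, credit=10, higher_is_better=True),
    Commitment("error_rate", slo=0.5, sla=1.0, credit=5, higher_is_better=False),
    Commitment("availability", slo=99.99, sla=99.9, credit=10, higher_is_better=True),
]


def evaluate(c: Commitment, measured: float):
    """Return (slo_met, credit_percent) for one month's measured value."""
    if c.higher_is_better:
        slo_met, sla_met = measured >= c.slo, measured >= c.sla
    else:
        slo_met, sla_met = measured < c.slo, measured < c.sla
    return slo_met, (0 if sla_met else c.credit)


# Invented measurements for one month.
measured = {"latency": 99.2, "error_rate": 1.3, "availability": 99.995}
for c in TABLE:
    print(c.metric, evaluate(c, measured[c.metric]))
# latency (False, 0)
# error_rate (False, 5)
# availability (True, 0)
```

The latency row shows why the SLO is stricter than the SLA: the internal target is missed at 99.2%, which prompts action before the 99.0% contractual threshold is reached.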

Best Practices:

  • Use Infrastructure as Code (Terraform, ARM, Deployment Manager) to automate monitoring setup.
  • Store SLO definitions in version control and keep them visible to all teams.
  • Regularly review and update SLOs and SLAs as your product and customer needs change.
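
As one possible shape for SLO definitions kept in version control, the sketch below stores them as JSON and validates them on load. The schema (field names such as `target_percent` and `window_days`) and the example entries are assumptions, not an established format.

```python
import json

# Hypothetical contents of a file such as slos.json kept in the repository.
SLO_FILE = """
[
  {"name": "checkout-latency",
   "sli": "% of checkout requests completed in < 1s",
   "target_percent": 99.9, "window_days": 30},
  {"name": "api-availability",
   "sli": "% of successful requests",
   "target_percent": 99.99, "window_days": 30}
]
"""

REQUIRED = {"name", "sli", "target_percent", "window_days"}


def load_slos(text: str):
    """Parse SLO definitions and reject incomplete or out-of-range entries."""
    slos = json.loads(text)
    for slo in slos:
        missing = REQUIRED - set(slo)
        if missing:
            raise ValueError(f"{slo.get('name', '?')}: missing {sorted(missing)}")
        if not 0 < slo["target_percent"] <= 100:
            raise ValueError(f"{slo['name']}: target_percent must be in (0, 100]")
    return slos


for slo in load_slos(SLO_FILE):
    print(slo["name"], slo["target_percent"])
```

Running a check like this in CI means a malformed or missing target is caught in review rather than discovered during an incident.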

For more, see: