Site Reliability Engineering

- September 08, 2018

1) SLA (Service Level Agreement) -> SLO + money (a penalty if the SLO is missed)

2) SLO (Service Level Objective)
-What are the bounds?
-Up 90% of time
-HTTP 5xx < 10/min
-size > 10 MB
- < 100 packet drops/a
- < 5s replication latency

3) SLI (Service Level Indicator)
-basically any metric
-up or down
-latency (a metric of how quickly the site responds, e.g. ping)
-TTFB (time to first byte: the time after which the page starts loading)
-size
-replication lag
-bandwidth
-errors
-HTTP 2xx, 5xx
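
As a rough illustration of turning one of these metrics into a number, the sketch below computes an availability SLI from a list of HTTP status codes. The function name `availability_sli` and the rule "only 5xx counts as a failure" are assumptions made for this example, not a standard definition.

```python
from collections import Counter


def availability_sli(status_codes):
    """Return the fraction of requests that did not end in a 5xx error."""
    if not status_codes:
        raise ValueError("need at least one request")
    # Group codes by their class: 2 for 2xx, 4 for 4xx, 5 for 5xx, ...
    classes = Counter(code // 100 for code in status_codes)
    return 1 - classes[5] / len(status_codes)


if __name__ == "__main__":
    # Hypothetical sample: 100 requests, two of them server errors.
    sample = [200] * 97 + [404] + [500, 503]
    print(f"availability SLI: {availability_sli(sample):.2%}")
```

Client errors (4xx) are treated as successful here because the server answered correctly; a real service may want to count them differently.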

4) Count the nines

No such thing as 100%

-99.0% - 7.3 h/month
-99.9% - 43 min/month
-99.99% - 4.3 min/month
-99.999% - 26 s/month
-99.9999% - 2.6 s/month
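
The downtime figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes an average month of 730 hours (365 days * 24 h / 12); the function name `allowed_downtime_seconds` is made up for the example.

```python
HOURS_PER_MONTH = 365 * 24 / 12  # assumption: average month of 730 hours


def allowed_downtime_seconds(availability_percent, hours=HOURS_PER_MONTH):
    """Seconds of downtime per period permitted by an availability target."""
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * hours * 3600


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        seconds = allowed_downtime_seconds(target)
        print(f"{target}% -> {seconds:.1f} s/month ({seconds / 60:.2f} min)")
```

Each extra nine divides the allowed downtime by ten, which is why the step from minutes to seconds happens so quickly.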

5) Error budget (100% - SLO)
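
A small sketch of the error budget expressed in requests rather than time. The helper names `error_budget` and `budget_remaining` and the sample numbers are assumptions for illustration only.

```python
def error_budget(slo_percent, total_requests):
    """Number of requests allowed to fail while still meeting the SLO."""
    return (100 - slo_percent) / 100 * total_requests


def budget_remaining(slo_percent, total_requests, failed_requests):
    """Budget left after the observed failures; negative means the SLO is blown."""
    return error_budget(slo_percent, total_requests) - failed_requests


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 99.9% SLO, 250 failures so far.
    print(f"budget:    {error_budget(99.9, 1_000_000):.0f} requests")
    print(f"remaining: {budget_remaining(99.9, 1_000_000, 250):.0f} requests")
```

A positive remainder can be spent on risky changes such as deployments or experiments; once it reaches zero, reliability work takes priority.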

6) How to fulfil the SLA

-you can make more reliable systems from less reliable parts

Three approaches:
n+1 (one spare)
n+m (m spares)
n+n (a full duplicate set)

n - the number of servers needed
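
To see why spare servers make a system more reliable than its parts, the sketch below estimates the chance that at least n out of n + spares servers are up. It assumes every server has the same availability and that failures are independent, which real systems only approximate; the function name `system_availability` is hypothetical.

```python
from math import comb


def system_availability(n, spares, server_availability):
    """Probability that at least n of (n + spares) independent servers are up."""
    total = n + spares
    p = server_availability
    # Binomial sum over every outcome with k >= n working servers.
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(n, total + 1)
    )


if __name__ == "__main__":
    p = 0.9  # assumed availability of a single server
    print(f"n=2, no spares: {system_availability(2, 0, p):.4f}")
    print(f"n+1 (n=2):      {system_availability(2, 1, p):.4f}")
    print(f"n+n (n=2):      {system_availability(2, 2, p):.4f}")
```

With servers that are each up only 90% of the time, two required servers without spares give 81%, while the n+n layout reaches about 99.6%.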

-removing SPOFs (single points of failure)

-graceful degradation

Cache:
1) speeds up whatever sits beneath it (scales the backend)
2) scales the backend, and if the backend goes down we can still serve something from the cache
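
Both roles of a cache can be sketched in one small class: fresh entries are served without touching the backend, and when the backend fails an expired entry is served instead of an error. The class name `FallbackCache`, its parameters and the in-memory dict are all assumptions for this example, not a reference to any particular library.

```python
import time


class FallbackCache:
    """Cache that offloads the backend and serves stale data when it is down."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable that loads a value from the backend
        self._ttl = ttl_seconds  # how long an entry counts as fresh
        self._clock = clock      # injectable clock, handy for testing
        self._store = {}         # key -> (value, time stored)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # role 1: fresh hit, backend is not called
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # role 2: backend failed, serve stale value
            raise  # nothing cached, so the failure is passed on
        self._store[key] = (value, now)
        return value
```

Serving a slightly outdated value is a form of graceful degradation: the user gets something useful while the backend recovers. Whether stale data is acceptable depends on the service.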

7) It's not always about uptime
