Site Reliability Engineering

Ideas from the Google SRE book

  • Fail sanely

  • Progressive rollouts

  • Monitoring

    • Pages
    • Tickets
    • Logging
  • Measure in terms of the user

  • Error budgets

  • Postmortems

  • Capacity planning
    • N+2 instances
    • Load testing
    • Validate forecasts with reality
  • Exponential backoff
  • SRE is half operations, half development
  • Disaster role playing
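
Three of the ideas above (error budgets, N+2 capacity, exponential backoff) reduce to small calculations. The sketch below is illustrative only: the function names, the 30-day window, and the default base and cap values are assumptions made for this example, not figures taken from the book. The backoff variant shown adds random jitter, a common refinement so that retrying clients do not all retry in lockstep.

```python
import math
import random


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes.

    Example: a 99.9% SLO over 30 days leaves about 43.2 minutes of budget.
    """
    return (1.0 - slo) * window_days * 24 * 60


def instances_needed(peak_qps: float, qps_per_instance: float, spare: int = 2) -> int:
    """N instances to carry peak load, plus spares (N+2 by default).

    The two spares cover, for instance, one planned and one unplanned outage.
    """
    return math.ceil(peak_qps / qps_per_instance) + spare


def backoff_delays(retries: int, base: float = 0.5, cap: float = 30.0,
                   rng=random.random) -> list:
    """Exponential backoff with jitter.

    The upper bound doubles each attempt (base, 2*base, 4*base, ...) up to
    `cap`; the actual delay is that bound scaled by rng(), a value in [0, 1).
    """
    return [rng() * min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

Usage: `instances_needed(1000, 150)` gives 7 instances for the load plus 2 spares. Passing a fixed `rng` such as `lambda: 1.0` to `backoff_delays` exposes the upper bounds themselves, which helps when checking forecasts against load tests.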

Preparedness and disaster testing

  • Figure out ways to make it more robust
  • Ensure that systems will react as we think
  • Determine weaknesses
  • Live drills
  • Swing capacity
  • Focus on safety
  • Attention to detail
  • Simulations
  • Training and certs
  • Detailed requirements and design
  • Defense in depth and breadth

Postmortem culture

  • What happened
  • The effectiveness of the response
  • What we would do differently next time
  • What actions will be taken to make sure it doesn’t happen again
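
One way to make sure none of these four questions is skipped is to treat them as required fields of a record. A minimal sketch, with a hypothetical class and field names chosen for this example:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """One record per incident, mirroring the four questions above."""
    what_happened: str
    response_effectiveness: str
    do_differently: str
    action_items: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True only when every question is answered and at least one
        follow-up action has been recorded."""
        answers = (self.what_happened, self.response_effectiveness,
                   self.do_differently)
        return all(a.strip() for a in answers) and len(self.action_items) > 0
```

A draft with an empty `action_items` list reports `is_complete()` as false, which is a cheap check before a postmortem is closed.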

Automation and reduced overhead

  • Automation is a double-edged sword
  • Methodical human approach is an alternative
  • Slow and steady is better in high-stakes cases
  • Multiple checks and balances
  • Human oversight

Structured and rational decision making

  • The inputs are clear
  • The basis is agreed upon in advance
  • Any assumptions are explicitly stated
  • Data driven over feelings and opinions

May 29, 2020