Google SRE book
- Fail sanely
- Progressive rollouts
- Monitoring
- Pages
- Tickets
- Logging
- Measure in terms of the user
- Error budgets
- Postmortems
- Capacity planning
- N+2 instances
- Load testing
- Validate forecasts with reality
- Exponential backoff
- SRE work is half operations, half development
- Disaster role playing
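Three of the items above (error budgets, N+2 instances, exponential backoff) come down to simple arithmetic, so a rough sketch may help. This is only an illustration: the function names, parameters, and default values are hypothetical and not taken from the book, and the random jitter in the backoff helper is an added assumption rather than something these notes state.

```python
import math
import random


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def instances_needed(peak_load, per_instance_capacity, spare=2):
    """N instances to serve peak load, plus spares (N+2 by default)."""
    n = math.ceil(peak_load / per_instance_capacity)
    return n + spare


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5, rng=random):
    """Exponential backoff with jitter (jitter is an assumption here).

    The upper bound for attempt i is min(cap, base * factor**i);
    the actual delay is drawn uniformly between 0 and that bound.
    """
    return [
        rng.uniform(0.0, min(cap, base * factor ** i))
        for i in range(attempts)
    ]


if __name__ == "__main__":
    # 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    print(error_budget_remaining(0.999, 1_000_000, 250))
    # Peak of 950 requests/s at 100 requests/s per instance.
    print(instances_needed(950, 100))
    print(backoff_delays())
```

The two spare instances in N+2 are usually read as one for a planned upgrade and one for an unplanned failure happening at the same time.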
Preparedness and disaster testing
- Figure out ways to make it more robust
- Ensure that systems will react as we think
- Determine weaknesses
- Live drills
- Swing capacity
- Focus on safety
- Attention to detail
- Simulations
- Training and certs
- Detailed requirements and design
- Defense in depth and breadth
Postmortem culture
- What happened
- The effectiveness of the response
- What we would do differently next time
- What actions will be taken to make sure it doesn’t happen again
Automation and reduced overhead
- Automation is a double-edged sword
- A methodical human approach is an alternative
- Slow and steady is better in high-stakes cases
- Multiple checks and balances
- Human oversight
Structured and rational decision making
- The inputs are clear
- The basis is agreed upon in advance
- Any assumptions are explicitly stated
- Data-driven decisions over feelings and opinions