Chaos Engineering
With so many interacting components, the number of things that can go wrong in a distributed system is enormous. You’ll never be able to prevent all possible failure modes, but you can identify many of the weaknesses in your system before they’re triggered by these events. This report introduces you to Chaos Engineering, a method of experimenting on infrastructure that lets you expose weaknesses before they become a real problem.
Members of the Netflix team that developed Chaos Engineering explain how to apply these principles to your own system. By introducing controlled experiments, you’ll learn how emergent behavior from component interactions can cause your system to drift into an unsafe, chaotic state.
- Hypothesize about steady state by collecting data on the health of the system
- Vary real-world events by turning off a server to simulate regional failures
- Run your experiments as close to the production environment as possible
- Ramp up your experiment by automating it to run continuously
- Minimize the effects of your experiments to keep from blowing everything up
- Learn the process for designing Chaos Engineering experiments
- Use the Chaos Maturity Model to map the state of your chaos program, including realistic goals
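About the Translator
Hou Jie (侯杰) is Vice President of Technology at Meili Financial Group (美利金融集团) and a member of TGO Kunpeng Club (TGO鲲鹏会). A graduate of Nanjing University, he previously worked at IBM China, IBM Australia, and iClick (爱点击). He has led R&D and management work at large organizations in several industries, and has more than ten years of experience designing, developing, and implementing large-scale distributed information systems.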
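Technical Reviewer
Zhou Yang (周洋), known internally as Zhongting (中亭), is a Senior Technical Expert on Alibaba's High Availability Architecture team, a Chaos Engineering evangelist, and the initiator of the open-source project ChaosBlade. He has many years of experience in high-availability assurance, product development, and system architecture, and served as the stability lead for the 2015 Double 11 (双11) event. He is currently responsible for delivering high-availability technology as cloud services, and leads both Application High Availability Service (AHAS) and the group's surprise failure drills (突袭演练).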