The Multiple Meanings of “弹性” (Elasticity) in Software Architecture

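In Chinese-language documentation and books on software architecture, the technical term “弹性” (usually rendered in English as “elasticity”) appears frequently, but its meaning depends on the context.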

In English, both elastic and resilient can be translated as “弹性的”, yet the two words mean quite different things in software architecture and should not be confused.

Elastic

“弹性” in the sense of elastic emphasizes scalability.

The book Designing Data-Intensive Applications defines elastic as follows:

Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.

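In other words, an elastic system adds computing resources automatically when it detects increased load, whereas a non-elastic system has to be scaled by hand: a person analyzes capacity and decides to add machines. Elasticity helps most when load is hard to predict; manual scaling is simpler and may bring fewer operational surprises.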

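An elastic system adjusts its computing resources automatically according to load and configured policies, which is why this capability is also called Auto Scaling. For example, an e-commerce application sees far higher load during a major sales promotion and needs servers and other resources added automatically to keep serving traffic; when load is low, resources are removed automatically to control cost. In the term “弹性伸缩” (elastic scaling), “弹性” means elastic.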

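For example, Kubernetes provides HorizontalPodAutoscaler, which scales the number of Pods horizontally and automatically. Commercial cloud platforms such as Alibaba Cloud offer similar services (Elastic Scaling Service) that adjust computing capacity, that is, the number of instances, according to load and policy.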
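
The proportional rule behind this kind of autoscaling can be sketched in a few lines. The core formula, desired replicas = ceil(current replicas × current metric / target metric), is the one documented for HorizontalPodAutoscaler; the function name, parameter names, and default bounds below are illustrative assumptions, and details such as tolerance and stabilization windows are deliberately left out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Suggest a replica count using a proportional scaling rule.

    Hypothetical helper for illustration only; not a Kubernetes API.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Scale the replica count by the ratio of observed load to target load.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds so the system never scales without limit.
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas running at 90% average CPU against a 60% target, the rule suggests scaling out to 6 replicas; at 20% CPU it suggests scaling in to 2.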

Resilient

“弹性” in the sense of resilient emphasizes the ability to recover.

The book Designing Data-Intensive Applications defines resilient as follows:

The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

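In other words, faults are the things that can go wrong, and a system that anticipates faults and copes with them is called fault-tolerant or resilient.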

A resilient system can tolerate faults and recover from failures, which is what makes it reliable.

For example, Hystrix, a well-known open-source Java library, describes itself as follows:

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.

Here resilience clearly refers to fault tolerance, not to elastic scaling.

Besides Hystrix, other open-source fault tolerance libraries include:

Resilience4j: Resilience4j is a fault tolerance library designed for Java8 and functional programming.
Sentinel: A powerful flow control component enabling reliability, resilience and monitoring for microservices.
Polly: Polly is a .NET resilience and transient-fault-handling library that allows developers to express policies such as Retry, Circuit Breaker, Timeout, Bulkhead Isolation, and Fallback in a fluent and thread-safe manner.
go-resiliency: Resiliency patterns for golang.
Semian: Resiliency toolkit for Ruby for failing fast.

Strategies for achieving resilience typically include Circuit Breaker, Rate Limiter, Retry, and Bulkhead; for more, see https://github.com/App-vNext/Polly#resilience-policies .

The most commonly used fault tolerance strategies are briefly described below:

Retry: Many errors are transient and clear up on their own; a retry strategy handles this kind of problem.
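Circuit Breaker: Analogous to a breaker in an electrical circuit or a trading halt on a stock market. When a dependency fails badly (many timeouts or errors), a continuing stream of requests would overload the failing system, while calls waiting on timeouts tie up network and thread resources and can eventually cause an avalanche of cascading failures. To prevent this, the breaker fails fast for a period of time: it returns an error immediately instead of calling the faulty module.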
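Bulkhead: The film Titanic includes a description of the hull: it is divided into 16 mutually isolated watertight compartments, and the ship stays afloat even if 4 of them are breached and flooded. The bulkhead pattern in architecture design borrows this idea and isolates resources. For example, a consumer that calls several services can be given a separate connection pool for each service, so that a failure affects only the resources assigned to it and does not cascade.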
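Rate Limiter: Uses a rate-limiting algorithm (such as token bucket or leaky bucket) to cap the number of executions, the volume of data, or a similar metric within a given time window, thereby preventing overload.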
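
The Retry strategy can be sketched as a small helper that re-invokes an operation with exponential backoff. This is a minimal illustration, not the API of any library listed above; the name retry and the parameters max_attempts, base_delay, retry_on, and sleep are assumptions made for the example.

```python
import time


def retry(operation, max_attempts=3, base_delay=0.1,
          retry_on=(Exception,), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper: in real code, retry_on should list only the
    error types that are known to be transient.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the helper testable without real waiting. Retrying only makes sense for operations that are safe to repeat.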
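
The Circuit Breaker strategy can be sketched as a small state machine: closed while calls succeed, open (fail fast) after too many consecutive failures, and half-open once a timeout has passed so that a trial call can probe the dependency. The class and parameter names below (CircuitBreaker, failure_threshold, reset_timeout, clock) are assumptions for illustration, and the sketch is single-threaded; a production breaker would also need locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed.

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the faulty dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let this trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                # Trip the breaker, or re-trip it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers receive CircuitOpenError immediately, so no threads or connections are spent waiting on a dependency that is known to be failing.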
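
The Bulkhead strategy can be sketched with one bounded semaphore per downstream service, so that a slow service can exhaust only its own slots. The class name Bulkhead, the max_concurrent parameter, and the service names are hypothetical; a connection pool per service, as described above, achieves the same isolation.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slot."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Non-blocking acquire: reject immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this compartment")
        try:
            return operation()
        finally:
            # Always return the slot, even if the operation raised.
            self._slots.release()


# One compartment per downstream service (service names are hypothetical).
bulkheads = {"orders": Bulkhead(2), "payments": Bulkhead(2)}
```

If every slot for "orders" is held by calls stuck on a slow response, further "orders" calls are rejected at once, while "payments" calls continue to run in their own compartment.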
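
The token bucket algorithm mentioned under Rate Limiter can be sketched as follows. Tokens are added at a fixed rate up to a maximum capacity, and each request spends one token or is rejected. The names TokenBucket, rate, capacity, allow, and clock are assumptions for the example, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # Tokens added per second.
        self.capacity = capacity  # Maximum burst size.
        self.clock = clock
        self.tokens = float(capacity)  # Start with a full bucket.
        self.updated_at = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity bounds how large a burst is accepted, while the rate bounds the sustained throughput. A leaky bucket differs mainly in that it smooths output to a constant rate instead of permitting bursts.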