Examining the Microsoft Sydney (Australia East) data center outage

Data Center Operations Management

2024-06-21 21:14

In this case, Microsoft's Australia East data center region experienced a 46-hour outage: a utility power problem knocked part of the cooling system offline, which in turn affected services. Microsoft's post-incident reflections and remediation focused on improving its emergency operating procedures (EOPs), in particular the automatic restart mechanism for the chillers, so as to reduce the need for manual intervention.


The incident underlines that, even in a highly automated environment, the ability to respond quickly at critical moments remains essential to service continuity. As the saying goes, "the key to solving a problem is finding the key problem."

Staffing levels: are data centers at risk of unnecessary outages?

With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.

Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney, lasting 46 hours.

Customers experienced issues with accessing or using Azure, Microsoft 365, and Power Platform services. The outage was triggered by a utility power sag at 08:41 UTC and impacted one of the three Availability Zones of the region.

Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”

Despite this, the vast majority of services were recovered by 22:40 UTC, but a full mitigation was not completed until 20:00 UTC on 3rd September 2023. Microsoft says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”

Voltage sag cause

The utility voltage sag was caused, according to the company, by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. They add: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”

What was the impact?

“By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”
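
To make that self-protection behaviour concrete, here is a minimal sketch of such a restart guard in Python. It is illustrative only: the temperature limit, names, and chiller IDs are assumptions for the example, not Microsoft's actual chiller controller logic.

```python
# Minimal sketch of a restart guard of the kind described above.
# The 25 °C limit and all names are illustrative assumptions, not real controller values.

MAX_RESTART_LOOP_TEMP_C = 25.0  # assumed chilled-water temperature above which a restart is blocked

def restart_allowed(loop_temp_c: float) -> bool:
    """Allow a restart only while the chilled-water loop is still cool enough to be safe."""
    return loop_temp_c <= MAX_RESTART_LOOP_TEMP_C

def attempt_restart(chiller_id: str, loop_temp_c: float) -> str:
    # Self-protection: starting against overheated water risks damaging the chiller,
    # so the controller refuses the command rather than run outside its safe envelope.
    if not restart_allowed(loop_temp_c):
        return f"{chiller_id}: restart inhibited, loop at {loop_temp_c:.1f} °C"
    return f"{chiller_id}: restart issued"

print(attempt_restart("CH-17", 31.2))  # restart inhibited
print(attempt_restart("CH-03", 18.4))  # restart issued
```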

Microsoft says the two impacted data halls require at least four chillers to be operational. The cooling capacity before the voltage sag consisted of seven chillers, with five of them in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased. This temperature increase impacted service availability. However, the onsite data center team had to begin a remote shutdown of any remaining networking, compute, and storage infrastructure at 11:34 UTC to protect data durability and infrastructure health, and to address the thermal runaway.
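
The shutdown decision described here boils down to two simple checks, sketched below as a hedged illustration. Only the chiller counts come from the report; the temperature threshold and the function itself are assumptions.

```python
# Illustrative only: shed IT load when cooling capacity drops below the documented minimum
# or when hall temperature passes an (assumed) operational threshold.

REQUIRED_CHILLERS = 4    # the two impacted data halls need at least four chillers (per the report)
MAX_HALL_TEMP_C = 35.0   # assumed threshold; the real operational limit is not published here

def must_shed_load(operational_chillers: int, hall_temp_c: float) -> bool:
    return operational_chillers < REQUIRED_CHILLERS or hall_temp_c > MAX_HALL_TEMP_C

print(must_shed_load(operational_chillers=5, hall_temp_c=24.0))  # False: normal operation (5 duty, 2 standby)
print(must_shed_load(operational_chillers=0, hall_temp_c=41.0))  # True: power down compute/storage to stop thermal runaway
```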

Staffing review

Amongst the many mitigations, Microsoft says it increased its technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians so that they could properly understand the underlying issues and put appropriate mitigations in place. Microsoft nevertheless believes that staffing levels at the time “would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented.”

It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”

Yet in the deep-dive ‘Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments that more staff had been onsite than the company originally stated. It was also suggested that the real fix was not necessarily more people onsite, but rather a ‘load based’ restart sequence in the emergency operating procedures (EOPs), which might not change staffing levels at all.

Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were operations staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to put ourselves in the moment: 20 chillers hit by three voltage sags, all in an error state, with 13 of them then requiring a manual restart, which meant deploying manpower across a very large site.

“You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock,” he adds. With chillers impacted and temperatures rising, staff had to scramble across the site trying to reset the chillers, but they did not quite reach the last pod in time, which led to the thermal runaway. The optimization, he says, is to go first to the data centers with the highest thermal load and the largest number of racks operating, and recover cooling there.
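
Hughes' load-based ordering can be pictured as a simple sort: restart the chillers serving the hottest, most heavily loaded halls first, instead of walking building to building. The sketch below is hypothetical; the data structure, hall names, and load figures are invented for illustration and are not the actual EOP.

```python
# Hypothetical sketch of a load-based restart sequence: serve the hottest halls first.

from dataclasses import dataclass

@dataclass
class Chiller:
    chiller_id: str
    data_hall: str
    hall_thermal_load_kw: float   # heat load of the hall this chiller cools
    needs_manual_restart: bool

chillers = [
    Chiller("CH-01", "Hall-A", 3200.0, True),
    Chiller("CH-02", "Hall-B", 1100.0, True),
    Chiller("CH-03", "Hall-C", 2600.0, True),
    Chiller("CH-04", "Hall-A", 3200.0, False),  # came back automatically, no action needed
]

# Order the pending restarts by descending hall load rather than by physical walking order.
restart_order = sorted(
    (c for c in chillers if c.needs_manual_restart),
    key=lambda c: c.hall_thermal_load_kw,
    reverse=True,
)
print([c.chiller_id for c in restart_order])  # ['CH-01', 'CH-03', 'CH-02']
```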

So the focus was to recover the chillers serving the highest thermal load first. This amounts to a tweak in how Microsoft’s EOP is executed, but it is really about what the system is supposed to do, which Hughes says should have been taken care of by the software: the auto-restart should have happened, and there should not have needed to be any manual intervention. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you,” which is why the chiller management system was updated to stop the incident from occurring again.
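
As a rough picture of what "let the software do it" might look like, the sketch below retries an automatic restart a few times before escalating to people. It is an assumption about the general shape of such logic, not the actual chiller management system.

```python
# Hedged sketch: bounded automatic restart attempts, escalating to humans only as a last resort.

import random
import time

def send_restart_command(chiller_id: str) -> bool:
    """Stand-in for a real building-management call; here we simulate a ~70% success rate."""
    return random.random() < 0.7

def auto_restart(chiller_id: str, attempts: int = 3, delay_s: float = 30.0) -> bool:
    for attempt in range(1, attempts + 1):
        if send_restart_command(chiller_id):
            print(f"{chiller_id}: restarted automatically on attempt {attempt}")
            return True
        time.sleep(delay_s)  # let the unit clear its fault state before the next try
    print(f"{chiller_id}: auto-restart failed after {attempts} attempts, paging on-call technicians")
    return False

auto_restart("CH-07", delay_s=0.0)  # delay shortened so the example runs instantly
```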

Industry issue and risk

Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, adds that it’s important to point out that these issues, and the risks associated with them, exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,”

he comments, before adding: “This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.” Then there is the question of the staffing shortage the industry is experiencing. He says the industry is maturing from an equipment, systems, and infrastructure perspective, and even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and actions of critical operations technicians, especially during an emergency response like the one outlined in the Microsoft case.



Final thoughts



Against the backdrop of ever-greater data center automation, customers' demand for availability approaching 100 percent is prompting the industry to re-examine staffing and operating strategies. In many cases a single root cause triggers compounding problems, so staffing should be planned with business continuity requirements in mind, and emergency response procedures should be continuously improved. Only with such a multi-dimensional strategy can data centers be better prepared for the challenges ahead and keep services highly available and customer data safe.


Looking ahead, the data center industry will put more emphasis on intelligent management and preventive maintenance: making automation tools more scenario-aware, optimizing how people and tools work together, and using artificial intelligence and machine learning to predict and resolve potential problems, reducing sensitivity to external events. Ultimately, combining technical innovation with better use of human resources to deliver more stable and reliable data center operations is a goal the whole industry shares.
