Netflix Scales "Human Infrastructure" to Manage Global Live Operations

Mark Silvester — Thu, 30 Apr 2026 08:10:00 GMT

Netflix has introduced a "human infrastructure" layer to manage live broadcasts at scale. Using a low-latency "telemetry hot path" and a Live Operations Centre, the company now balances automated scaling with human oversight. This shift, which mirrors strategies at AWS and Disney+, focuses on maintaining reliability through expert intervention during high-concurrency global events.

Article: When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World

Rohan Vardhan — Wed, 22 Apr 2026 09:00:00 GMT

Sovereign fault domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology. The article maps geopolitical events to known distributed-systems failure modes, argues that multi-region should replace multi-AZ as the high-availability baseline for systems that cross jurisdictions, and outlines design patterns, chaos experiments, and an annualized loss expectancy (ALE) model to justify the spend.

InfoQ - Reliability
