AWS Went Down? Multi-Cloud Isn't the Answer

Pope Kim Oct 23, 2025

Over the past few days, many people have lost their illusion of "safety."

On October 20 (local time), AWS's US-EAST-1 region suffered a massive outage.

Countless apps and services went down one after another. The cause was traced to DNS resolution failures and issues that originated in internal subsystems and data layers (such as EC2 and DynamoDB APIs).
Social media, gaming, productivity tools—even parts of government and education systems—were shaken. It took nearly an entire day to recover, and the ripple effects lingered.

The very next day, a service our company uses on Azure started slowing down, flooding our office with alerts. (For reference: whenever we have a server outage, a disco ball spins and we have an impromptu dance party. Red, blue, dance-dance~) Come to think of it, there was also a large-scale Azure Front Door issue earlier this month (October 9), which even affected the management portal. Microsoft's own status page said they mitigated it by rerouting traffic and purging caches.
Clearly, this isn't just a "single-vendor" problem.

And of course, today's news articles, blogs, and think-pieces are full of the same claim:

"Multi-cloud is safer."

It sounds convincing—but in reality, multi-vendor setups can't fix fundamental control-plane (CP) issues or upstream dependencies like global DNS, identity systems, or routing. Even Wall Street touts multi-cloud as the answer, yet that doesn't change the fact that AWS itself was shaking across the board on that very same day.

My point is simple: "Cloud means safer" is a myth.
Let's break it down again. I've said this countless times on my YouTube live streams.

1) Multiply the SLAs and you'll see reality

Let's say you run one web server and one database.
(That's the bare minimum architecture, right?)

  • SLA 99.9% × 99.9% = 99.8001%
    About 17.5 hours of downtime per year (0.1999% × 8760h)
  • SLA 99.9% × 99.9% × 99.9% = 99.7003%
    About 26.3 hours of downtime per year
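
To make the arithmetic above easy to replay with your own numbers, here is a minimal Python sketch. It assumes every component must be up for the service to work (a serial chain) and that failures are independent, which is a simplification. The function names are mine, not from any vendor SDK.

```python
HOURS_PER_YEAR = 8760  # 365 days, the same figure used in the bullets above


def composite_availability(*slas: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for sla in slas:
        result *= sla
    return result


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


for count in (2, 3):
    availability = composite_availability(*[0.999] * count)
    hours = downtime_hours_per_year(availability)
    print(f"{count} components: {availability:.4%} available, ~{hours:.1f} h/year down")
# 2 components: 99.8001% available, ~17.5 h/year down
# 3 components: 99.7003% available, ~26.3 h/year down
```

Add a load balancer, a queue, and an identity provider to the argument list and watch the number keep dropping.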

"Serverless means I'm fine," you say?

Serverless still has servers.
The name is just marketing. In the end, it all depends on the reliability of physical and virtual resources, control planes, and rollout mechanisms. Unless you believe in "chicken-free chicken," you can't escape the multiplication of failure probabilities.

2) The real risk isn't hardware — it's software changes

Cloud providers love to tell high-availability stories like
"Even if one AZ goes down, we'll stay up." That promise covers hardware and facilities.

But the truth is, most major outages start from software or control-plane changes. If a release goes wrong, healthy hardware can receive bad configuration at scale, and multiple regions or services can collapse at once. The latest AWS incident? It started with a shared dependency between data paths and DNS resolution—classic domino effect.

3) "The vendor doesn't deploy on my timeline"

The fundamental risk of the cloud is the loss of change control.

Vendors push rolling updates on their schedule, and you have almost no authority to verify release quality. If your relatively simple service breaks because of a cloud update, it means something broke at the baseline feature level (which means the vendor didn't test it well enough). And guess who pays for it? Your service.

4) The forgotten advantages of on-prem / colocation

  • You control when to update.
    No surprise 3 AM deployments behind your back.
  • If something breaks mid-deploy, you can roll back or hot-fix immediately.
  • A human (with skills) is right there to respond in real time.

Giving up that control is like performing surgery without blood reserves ready. No matter how great the hospital (cloud) infrastructure is, if your team has no control over the contingency plan, you're in danger.

5) "Multi-cloud is the answer"? — That sounded cool back when we were newbies.

Multi-cloud overlooks two big realities:

  1. Data gravity
    Keeping transactional data consistently replicated and failover-ready across clouds is a nightmare of latency, consistency, and locking strategies.
  2. Shared dependencies
    Global DNS, identity, SaaS CI/CD, logging, alerts, CDNs, version control—different vendors, yes, but shared upper-layer dependencies mean they can all fall together.

Sure, for certain workloads (read-only caches, content delivery, or non-critical back-office tasks), multi-vendor setups can reduce risk.
But for core transactional paths, multi-cloud often increases complexity, cost, and the blast radius of outages.
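
The shared-dependency point can be put into numbers with a rough Python sketch. The 99.9% figures are hypothetical, failures are assumed independent, and the model ignores the cross-cloud replication and failover machinery itself, which would add yet another term to the chain.

```python
def serial(*parts: float) -> float:
    """Every part must be up."""
    result = 1.0
    for part in parts:
        result *= part
    return result


def redundant(*replicas: float) -> float:
    """Up if at least one replica is up, assuming independent failures."""
    all_down = 1.0
    for replica in replicas:
        all_down *= 1.0 - replica
    return 1.0 - all_down


CLOUD_STACK = 0.999  # hypothetical availability of one vendor's stack
SHARED_DEP = 0.999   # hypothetical shared layer: global DNS, identity, CI/CD

single_cloud = serial(SHARED_DEP, CLOUD_STACK)
multi_cloud = serial(SHARED_DEP, redundant(CLOUD_STACK, CLOUD_STACK))

print(f"single cloud: {single_cloud:.4%}")  # single cloud: 99.8001%
print(f"multi-cloud:  {multi_cloud:.4%}")   # multi-cloud:  99.8999%
```

Under these assumptions the second cloud helps a little, but the result can never exceed the availability of the shared layer in front of both. When that layer falls, both clouds fall with it.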

So how can we be "safer"?

This isn't a "ditch the cloud" rant. Cloud is fantastic for initial launches and experiments. But once your product hits a stable phase, consider the following:

  1. Escape the single region, minimize dependence on a single control plane
    Even within the same vendor, use region/account separation to reduce blast radius. Split control and data planes (e.g., separate management/audit accounts).
  2. The courage to use "boring" tech
    Simplify core systems with proven components instead of shiny new managed ones. Migrate in controlled, incremental phases.
  3. Strict change management
    Canary releases, circuit breakers, feature flags, graceful degradation—non-negotiable. Align your own release calendar with vendor change windows and avoid peak-time changes.
  4. Off-cloud backstops
    Provide read-only static or cached fail-safes (e.g., critical notice pages, cached order history) accessible through alternate paths.
  5. Real-world game days
    Your team should physically rehearse failure scenarios (DNS down, identity down, storage down, message broker down) and internalize the runbook, RTO, and RPO.
  6. Mature-phase strategy: Hybrid / On-prem relocation
    Bring core transactional and stateful tiers on-prem or to colocation. Keep edge, burst, and analytics in the cloud.
    The cloud should become a tool you use when needed, not the place you live in.
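
The circuit breaker from item 3 and the cached fail-safe from item 4 fit together naturally. Below is a minimal Python sketch, not a production implementation: the class, the thresholds, and the two order-history functions are all hypothetical names made up for illustration. After a few consecutive failures the breaker stops calling the broken upstream and serves the fallback until a cooldown passes.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures, allows a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def _is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cooldown has elapsed, let one trial call through.
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._is_open():
            return fallback()  # degrade gracefully without touching upstream
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_orders_live() -> List[str]:
    raise ConnectionError("upstream unavailable")  # simulated outage


def fetch_orders_cached() -> List[str]:
    return ["order-1001 (cached)"]  # hypothetical read-only fail-safe


breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
for _ in range(3):
    # The third call never reaches the failing upstream.
    print(breaker.call(fetch_orders_live, fetch_orders_cached))
```

The injectable clock is there so a game day (item 5) or a unit test can fast-forward the cooldown instead of waiting for it.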

In summary: assuming "people never make mistakes" is the real danger

The cloud is a great tool. I still happily use it for early-stage products.

But once a service stabilizes, we should reduce cloud dependency and take back control of our core systems. That's what creates real safety.

It's not "Cloud = Safe."
It's "Controlled change + Verifiable design = Safe."