Azariah said:
I would be curious to hear from someone who paid for the multi-site tier of service. Did it work?
Also, from a reputation perspective, I would think Azure could benefit from simply offering multi-site as its base package. The reputational damage done by this probably costs far more than the revenue generated by bumping people up a tier.
I can't speak to customers affected by this specific outage. However, I can speak to the general question.
Having a multi-site base package is problematic for many customers... and many applications. Many are just not written for that type of resiliency. At the end of the day... you have to have a single-source-of-truth...
Overall, the question is whether you can tolerate the risk of a single point of failure and, if so, where. One server. One cable. One firewall. One IP.
The 1st hurdle is redundant content. Can you replicate your content across sites? Do you push or pull that content? Imagine you are on a TexAgs forum hosted in San Antonio and NYC. You make a post to a thread. How do the people on the NYC site see your post?
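To make the push vs. pull question concrete, here is a rough in-memory sketch. Everything in it (the ForumSite class, the method names, the post IDs) is made up for illustration and is not how TexAgs or any real forum replicates content. With push, the site that takes the post forwards it to its peers right away. With pull, the other site has to come and fetch it later.

```python
class ForumSite:
    """Hypothetical stand-in for one hosting location of a forum."""

    def __init__(self, name):
        self.name = name
        self.posts = []   # list of (post_id, text) in arrival order
        self.peers = []   # other ForumSite objects to push to

    def post(self, post_id, text, push=True):
        # Store locally, then optionally push the post to every peer site.
        self.posts.append((post_id, text))
        if push:
            for peer in self.peers:
                peer.receive(post_id, text)

    def receive(self, post_id, text):
        # Ignore posts this site already has, so repeated syncs are harmless.
        if all(pid != post_id for pid, _ in self.posts):
            self.posts.append((post_id, text))

    def pull_from(self, other):
        # Pull model: copy over whatever the other site has that we lack.
        for post_id, text in other.posts:
            self.receive(post_id, text)


san_antonio = ForumSite("san-antonio")
nyc = ForumSite("nyc")
san_antonio.peers.append(nyc)

san_antonio.post(1, "pushed post")                 # NYC sees this right away
san_antonio.post(2, "local only", push=False)      # NYC does not see this yet
nyc.pull_from(san_antonio)                         # now NYC has both posts
```

The sketch ignores the hard parts: conflicting edits, ordering, and what happens when the link between sites is down. That is where the single-source-of-truth problem comes from.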
Getting users to multiple sites is simple and reliable. DNS itself provides this redundancy because it can hand out multiple IP addresses, and if you update your authoritative name server, the transition should be relatively quick. You saw that when TexAgs was able to change the landing page from a 404 to their Twitter feed. GSLB/GLB does this automatically: health-checking algorithms change which IP addresses are handed out. The remaining issue is the TTL on cached resolutions.
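The health-check idea can be sketched in a few lines. This is a simplified illustration, not any vendor's GSLB implementation: the Site class, the function names, and the IP addresses (taken from the documentation-only ranges) are all assumptions. The second function is a rough worst-case estimate of how long clients can keep hitting a dead site, assuming resolvers honor the TTL.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Site:
    name: str
    ip: str


def dns_answers(sites: List[Site], is_healthy: Callable[[Site], bool]) -> List[str]:
    """Return the IPs to hand out: only sites that pass the health check."""
    healthy = [s.ip for s in sites if is_healthy(s)]
    # If every check fails, hand out all addresses rather than an empty answer.
    return healthy or [s.ip for s in sites]


def worst_case_failover_seconds(check_interval: int, failures_to_mark_down: int, ttl: int) -> int:
    """Time to detect the outage plus time for cached answers to expire."""
    return check_interval * failures_to_mark_down + ttl


sites = [Site("san-antonio", "192.0.2.10"), Site("nyc", "198.51.100.10")]

# Pretend San Antonio is down; a real checker would probe each site over the network.
answers = dns_answers(sites, lambda s: s.name != "san-antonio")
delay = worst_case_failover_seconds(check_interval=10, failures_to_mark_down=3, ttl=60)
```

A short TTL shrinks the failover window but means more DNS queries, and some resolvers hold answers longer than the TTL says, so treat the estimate as a floor.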
Azure, AWS, and GCP are just data centers where people host content. In the old days, you could host your equipment at Exodus, Inflow, et al. You bought floor space and a rack, put your "on-prem" equipment there, and the hoster provided power, connectivity, A/C, etc. Everything was touted as redundant, including internet access and power, along with a diesel generator. You could choose one server, one rack/stack, or redundancy within the hoster. You could also use GLB/GSLB, which is DNS-based, to fail over between different hosters or locations. Two Exodus locations, for instance.
Multi-site resiliency was pretty standard over 20 years ago.
There is nothing new to learn here, IMO.
Disclosure... I work with AWS, Azure, GCP, On-Prem and I deal in products that address these issues.
I agree the cost is nominal compared to the cost of an outage, like yesterday... there are thousands of App Owners who looked at the Azure issue and are either saying, "We have the insurance (aka multi-site) to not be affected," or are saying, "We dodged a bullet this time. Hope it isn't us next time."
Even a manual intervention could have been quick, provided the content was stored off-site.