Azure outage

2,767 Views | 26 Replies | Last: 7 yr ago by AGSPORTSFAN07
Tailgate88
So as we all know, Azure was down in our region yesterday for most of the day. The explanation is that a lightning strike caused a voltage spike that resulted in cooling issues, so systems started to shut down automatically.

I'd love to know more details about what really happened. I thought data centers had redundant systems and backup generators to prevent this type of outage.
Wildmen03
Apparently it was causing Xbox Live issues as well. I also thought there would have been at least one redundant copy of sites that they could switch to.

But maybe they did for sites that are considered essential, and sites that aren't as vital don't get that luxury.
ntxVol
Seems odd to me as well. A rare sequence of events brings everything down. Bet they take a hard look at what could be done differently.
CapCity12thMan
I got the impression that the San Antonio data center was the one hit, and the A/C issues triggered a graceful shutdown of a ton of services. The problems I see are:

1) they didn't have backup A/C or generators to even keep things up.
2) This completely brought down this Azure region, with apparently no redundancy from another region to support the services there.
3) This region apparently housed the Azure AD services which run other Azure/MS services as well, effectively preventing people from authenticating to their Azure-based services from other regions, including the UK. It strikes me as odd that - again - there was no redundancy for the AD service. This is a huge hole IMHO.

It seems that for the amount of time things were "down" all over the world, it proves there was no redundancy in place for a lot of things. A simple Tier 1 data center is supposed to have 12 hours of backup power. Maybe they had backup power but no backup A/C - I am not sure. Thankful we run our business on AWS.

jagouar1
Quote:

Thankful we run our business on AWS.
AWS has had just as many "major" outages as Microsoft has had (and Google is no better). The key for all of these providers is to learn from these mistakes and never let this particular incident affect them in the same way again.
hph6203
Our document generation system at work was down for half the day because of this as well. Good thing it was a slow day.
SteveA
It's largely up to the clients to provide failover redundancy. Microsoft allows you to run on multiple sites, but it costs more.
CapCity12thMan

Quote:

It's largely up to the clients to provide fail over redundancy.

This was my point, but from the perspective that MS should have built redundancy in their services - especially something like Azure AD, and they didn't. In this instance they are their own client, and failed.

redd38
If Azure was never down then they'd never be able to upsell their multi-region solutions.
kb2001
If a site is running solely in one region, then they've accepted the risk of that region going down and the site being offline for the duration, whether that acceptance was intentional or not. The absence of planning for such an outage is entirely on the company that failed to do so. When AWS had their S3 outage in N. Virginia in spring of 2017, it took down a number of high-profile sites; Tinder was a visible one. The AWS status page was even down - the site was hosted out of a N. Virginia S3 bucket with no cross-region redundancy, which left them with egg on their face. When this outage happened, we were down for about 30 minutes while we failed over to Oregon, then we were fine.

Each company must do their own analysis to determine whether or not it's worth the cost to build multi-region. Oftentimes, a company will decide after the fact that the outage was a great enough impact that they should have been multi-region, and will start to act accordingly.

While it is certainly frustrating when your service provider has issues that leave you stranded with nothing to do but wait, it should also invoke frustration at yourself or your organization for not having planned accordingly. Realistically, most companies don't put serious efforts into their DR plans until they've been burned, or when that huge potential client insists on it before signing the deal.
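As a back-of-the-envelope sketch of that cost analysis - every number here is a made-up placeholder, not anyone's real figures:

```python
# Back-of-the-envelope check on whether multi-region is worth the cost.
# Every number here is a made-up placeholder; plug in your own figures.

def multi_region_worth_it(outages_per_year: float,
                          hours_per_outage: float,
                          cost_per_hour_down: float,
                          extra_multi_region_cost_per_year: float) -> bool:
    """True if the expected annual cost of single-region outages
    exceeds the annual premium of running a second region."""
    expected_outage_cost = outages_per_year * hours_per_outage * cost_per_hour_down
    return expected_outage_cost > extra_multi_region_cost_per_year

# Example: one ~10-hour regional outage expected every two years,
# $5k/hour of lost revenue, $20k/year to stay warm in a second region.
print(multi_region_worth_it(0.5, 10.0, 5_000.0, 20_000.0))  # True (25k > 20k)
```

The point of the exercise is that the answer flips depending on your own downtime cost, which is why each company has to run it themselves.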

ABATTBQ11
CapCity12thMan said:

I got the impression that the San Antonio data center was the one hit, and the A/C issues triggered a graceful shutdown of a ton of services. The problems I see are:

1) they didn't have backup A/C or generators to even keep things up.
2) This completely brought down this Azure region, with apparently no redundancy from another region to support the services there.
3) This region apparently housed the Azure AD services which run other Azure/MS services as well, effectively preventing people from authenticating to their Azure-based services from other regions, including the UK. It strikes me as odd that - again - there was no redundancy for the AD service. This is a huge hole IMHO.

It seems that for the amount of time things were "down" all over the world, it proves there was no redundancy in place for a lot of things. A simple Tier 1 data center is supposed to have 12 hours of backup power. Maybe they had backup power but no backup A/C - I am not sure. Thankful we run our business on AWS.


I can speak a little to number 1...

The cooling loads on a data center are enormous. So much so that companies (Microsoft) have explored putting small server farms in shallow ocean waters to efficiently keep them cool. With loads that big, there is no such thing as redundant A/C. Redundant power? Yes. A/C? No. You a) would have no space for it, and b) the cost would be prohibitive compared to the mere possibility of something going wrong. You might have backup systems that will pick up slack for maintenance, but you won't have an entire system backup.

Our A/C is actually down right now due to a voltage spike tripping alarms. We're waiting on someone to come out and figure out exactly what needs to be reset, because the alarms cannot be cleared and the system will not restart until that happens. A commercial A/C system with air handlers, chillers, hydronic pumps, etc. is a lot more complicated than your residential unit, and you can't just flip a switch to turn it on. Hell, some of these systems have their own buildings because they're so large. I'm sure a data center's systems are even more complex because they're extremely large (I've never worked on a data center, but I have worked on central utility plants).

That being said, it's better to gracefully shut down the A/C system when there's a bad enough voltage spike that could damage critical components than to keep it running to maintain uptime. If something really breaks, the downtime to fix it could be way worse than the downtime to restart everything. Data centers do have backup generators, but those don't mean anything if the entire A/C system shuts down and they have nothing to power.


ETA: For an example of the size of some of these systems, look at the NSA's Utah data center. Of those four buildings, the outer two are dedicated power and pump buildings, with 12 cooling towers outside. That's a huge chunk of real estate when looking at its share of the entire facility.
CapCity12thMan

Quote:

The AWS status page was even down - the site was hosted out of a N. Virginia S3 bucket with no cross-region redundancy, which left them with egg on their face.


Same situation... AWS was their own client and failed with regard to the status page.

FYI - We run out of US West (Oregon) and fail over to US-East (Virginia) for our US based customers. Ireland/Frankfurt for our UK customers.

Our business is such we have SLAs of a 12 hour RTO and a 1 hour RPO. With those parameters we don't need 24x7 HA to the Nth degree, so we have architected our solution accordingly. In our DR failover testing, we are averaging ~1 hour RTO, so we have plenty of cushion to meet our SLAs.
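A trivial sketch of that kind of SLA check, using hypothetical numbers mirroring the ones above (12-hour RTO, 1-hour RPO, drills averaging about an hour):

```python
from datetime import timedelta

# Hypothetical figures mirroring the post: 12-hour RTO / 1-hour RPO SLAs,
# with DR drills averaging about an hour to restore service.

SLA_RTO = timedelta(hours=12)  # max time allowed to restore service
SLA_RPO = timedelta(hours=1)   # max window of data loss allowed

def meets_sla(measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    return measured_rto <= SLA_RTO and measured_rpo <= SLA_RPO

drill_rto = timedelta(hours=1, minutes=5)  # observed failover time
drill_rpo = timedelta(minutes=30)          # observed replication lag
print(meets_sla(drill_rto, drill_rpo))  # True -- plenty of cushion
```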

CapCity12thMan
appreciate the insight on the A/C, that's good info

Azariah
I would be curious to hear from someone who paid for the multi-site tier of service. Did it work?

Also, from a reputation perspective, I would think Azure could benefit from simply offering multi-site as its base package. The reputational damage done by this probably costs far more than the revenue generated by bumping people up a tier.
CapCity12thMan
I am not familiar with Azure... is there such a thing as simply purchasing a "multi-site" tier of service?

Within AWS - they are just providing the IaaS and some PaaS... it's up to the customer to architect for high availability, but some services have this somewhat built in... so I don't think this is a matter of just purchasing something different.

I could build a website and have it in AWS, but that doesn't really make any guarantees about failover at all. I could be a hardware crash away from being down despite being "in AWS". But that is not the right way to build a website, so it is up to me to make sure I don't let hardware cripple my business.

DallasAg 94
Azariah said:

I would be curious to hear from someone who paid for the multi-site tier of service. Did it work?

Also, from a reputation perspective, I would think Azure could benefit from simply offering multi-site as its base package. The reputational damage done by this probably costs far more than the revenue generated by bumping people up a tier.
I can't speak to customers affected by this specific outage, however, I can speak to the general question.

Having a multi-site base package is problematic for many customers... and many applications. Many are just not written for that type of resiliency. At the end of the day... you have to have a single-source-of-truth...

Overall, the question is: can you tolerate the risk of a single point of failure, and if so, where? One server. One cable. One firewall. One IP.
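For a rough sense of why each single point of failure matters: in a serial dependency chain the availabilities multiply, so the chain is always worse than its weakest link. The numbers below are purely illustrative:

```python
from math import prod

# In a serial dependency chain the whole site is up only if every
# component is up, so the availabilities multiply. Numbers illustrative.

def chain_availability(availabilities):
    return prod(availabilities)

# server, cable, firewall, IP/DNS path -- each "three nines" on its own
chain = [0.999, 0.999, 0.999, 0.999]
print(round(chain_availability(chain), 5))  # 0.99601 -- worse than any part

# Two independent sites, each running that chain: down only if BOTH fail.
one_site = chain_availability(chain)
two_sites = 1 - (1 - one_site) ** 2
print(two_sites > one_site)  # True
```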

The 1st hurdle is redundant content. Can you replicate your content across sites? Do you push or pull that content? Imagine you are on a TexAgs forum hosted in San Antonio and NYC. You make a post to a thread. How do the people on the NYC site see your post?

Getting users to multiple sites is very simple, easy and reliable. DNS in itself provides such redundancy, as it hands out multiple IP addresses; if you update your authoritative NS, it should be a relatively quick transition. You saw that when TexAgs was able to change the landing page from a 404 to their Twitter feed. GSLB/GLB does that through health-checking algorithms that automatically change the IP addresses handed out. The remaining issue is TTL for resolution.
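A toy model of that health-check-driven GSLB behavior - the site names and private IPs are hypothetical, and real-world failover is additionally gated by resolvers caching the old answer for up to the TTL:

```python
# Toy model of health-check-driven GSLB: the authoritative NS hands out
# only the IPs of sites that pass their health checks. Site names and
# addresses are hypothetical; in practice clients may keep a stale
# answer for up to TTL seconds, which is why low TTLs matter for DR.

SITES = {"san-antonio": "10.0.1.10", "nyc": "10.0.2.10"}

def resolve(healthy: dict) -> list:
    """IPs the nameserver would currently hand out."""
    return [ip for site, ip in SITES.items() if healthy[site]]

print(resolve({"san-antonio": True, "nyc": True}))   # both IPs
print(resolve({"san-antonio": False, "nyc": True}))  # ['10.0.2.10'] only
```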

Azure, AWS, GCP are just data centers where people host content. In the old days, you could host your equipment in Exodus, Inflow, et al. You buy floor space and a rack... put your "on-prem" equipment there, and the hoster provides power, connectivity, A/C, etc. Everything was touted as redundant, including internet access and power, along with a diesel generator. You could choose one server, or one rack/stack, or redundancy within the hoster. You could also use GLB/GSLB, which is DNS based, to fail over between different hosters. Two Exodus locations, for instance.

Multi-site resiliency was pretty standard over 20 years ago.

There is nothing new to learn here, IMO.

Disclosure... I work with AWS, Azure, GCP, On-Prem and I deal in products that address these issues.

I agree the cost is nominal compared to the cost of an outage like yesterday's... there are thousands of App Owners who looked at the Azure issue and are either saying, "We have the insurance (aka multi-site) to not be affected," or are saying, "We dodged a bullet this time. Hope it isn't us next time."

Even a manual intervention could have been quick, provided the content was stored off-site.
DallasAg 94
CapCity12thMan said:

I am not familiar with Azure... is there such a thing as simply purchasing a "multi-site" tier of service?

Within AWS - they are just providing the IaaS and some PaaS... it's up to the customer to architect for high availability, but some services have this somewhat built in... so I don't think this is a matter of just purchasing something different.

I could build a website and have it in AWS, but that doesn't really make any guarantees about failover at all. I could be a hardware crash away from being down despite being "in AWS". But that is not the right way to build a website, so it is up to me to make sure I don't let hardware cripple my business.
Azure was originally designed more as a PaaS, but depending on the service you are using, you can also deploy IaaS.

AFAIK, there is no "easy button" that says "duplicate my content to two sites." You would need to replicate your content, and then use Azure's version of the GSLB found in AWS's "Route 53," which is called "Traffic Manager."

AWS, Azure and GCP are just Infrastructure for you to deploy on.
SeattleAgJr
^
|
^

NERD!
ABATTBQ11
No problem. I just wish I wasn't speaking from experience right now.
AtlAg05
Sometimes people just don't think things through. Take, for example, the Atlanta airport power outage.

The main power and the backup lines ran through the same area, so when there was a fire in that area both went down.

It could have just been bad planning.
91AggieLawyer
Texas, with all its potential natural weather issues, is just really not a good place for data centers. Maybe high points in the deep hill country, but other than that, I can't think of where I'd want to park 10,000 servers in Texas and hope for the best.
kb2001
You make a lot of good points, but there's an idea you stated a couple of times that I disagree with completely, and a few others worth commenting on.

Quote:

Azure, AWS, GCP are just data centers where people host content.
This is wrong on so many levels. The companies that treat these cloud services platforms as just hosted infrastructure are the ones who typically have a lot of trouble getting value out of it, and who have trouble designing to run on it properly. They are much more than just data centers where people host content.

Quote:

AWS, Azure and GCP are just Infrastructure for you to deploy on.
Same thing, completely wrong, no need to go again
Quote:

Overall, the question is can you tolerate the risk of a single point of failure and if so, where. One server. One Cable. One firewall. One IP.
Exactly. In this case, people who failed to follow best practices and implement multi-region sites got in trouble. The argument about whether this is cost effective seems moot after an outage, but it generally comes down to cost barriers up front.
Quote:

The 1st hurdle is redundant content. Can you replicate your content across sites? Do you push or pull that content? Imagine you are on a TexAgs forum hosted in San Antonio and NYC. You make a post to a thread. How do the people on the NYC site see your post?
Yes; however, this is an oversimplified way of viewing it on cloud services platforms. The content is meaningless without the means to deliver it, and vice versa. Doing a lift-and-shift cloud migration is going to leave you with the same headaches, just in somebody else's datacenter. If that's what happens, you've already failed in migrating to the cloud. Designing your applications and datasets to take advantage of the service offerings not only lets you realize the potential cost savings, it also makes DR easy.

The single-source-of-truth challenge can be solved a number of ways quite easily. You shouldn't have to think about this too much; it's really a simple thing to do if you're using the cloud services instead of running on hosted infrastructure. The old RDBMS databases are more problematic, but frankly even those are solvable.
Quote:

Even a manual intervention could have been quick, provided the content was stored off-site.
Perhaps; it depends on how long it takes to restore your datasets, and what you need back to hit your RTO requirements. If you've designed well to take advantage of the cloud services platforms, it can be incredibly easy to deploy your applications. In our case, everything we deploy is automated and fully dynamic. If we need to change regions, we just update the target region and hit go. It takes about 30 minutes to deploy the entire application and web layer of our entire platform. Since we keep the datasets warm, that is just a scale-up and go. Even during Amazon's S3 outage, since we have our asset bucket replicated, we just changed one more flag and deployments looked there instead.
Quote:

AFAIK, there is no "easy-button" that says "duplicate my content to two sites."
I can't speak to Azure, but I know for AWS there are several Easy buttons to do just this, depending on which service you're referring to. RDS supports cross-region replication, as one example. S3 supports bucket replication.

I deal in on-prem and AWS, so I can't speak to Azure. I can speak personally that our biggest failures in AWS have come when people try to think of it as hosted infrastructure rather than taking advantage of the services.
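For reference, this is the general shape of an S3 cross-region replication configuration as passed to boto3's put_bucket_replication - the bucket names and role ARN below are hypothetical, and versioning must already be enabled on both buckets:

```python
# Shape of an S3 cross-region replication (CRR) configuration, as passed
# to boto3's put_bucket_replication. The bucket names and role ARN are
# hypothetical, and versioning must already be enabled on both buckets.

replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
    "Rules": [
        {
            "ID": "replicate-everything",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},  # empty filter = replicate the whole bucket
            "Destination": {"Bucket": "arn:aws:s3:::my-dr-bucket-us-west-2"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
        }
    ],
}

# Applied with boto3 (not run here):
# boto3.client("s3").put_bucket_replication(
#     Bucket="my-primary-bucket",
#     ReplicationConfiguration=replication_config)
print(replication_config["Rules"][0]["Destination"]["Bucket"])
```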
CapCity12thMan
RDS does not support cross-region replication on all of its RDBMS engines, though.

techno-ag
91AggieLawyer said:

Texas, with all its potential natural weather issues, is just really not a good place for data centers. Maybe high points in the deep hill country, but other than that, I can't think of where I'd want to park 10,000 servers in Texas and hope for the best.
I gotta say Bryan-College Station isn't bad. Far enough inland to avoid the worst of hurricanes. BTU and its municipal partners do a decent job of keeping the lights on. People drive from Houston to fill up with gas and stay in the many hotels. No place is perfect but BCS is not bad.
UmustBKidding
Bryan is actually well suited as a DR site for the downtown Houston area. Basically it's right at the delay limit for native Fibre Channel to work without issues. Last time I was in Fibertown, there were rows of EMC SAN boxes that were mates of nodes around Houston. I know Insperity told my BIL to take his laptop with him and head to Bryan to ride out Ike.
Texas is fine for data center placement. Things like cheap hydro power attract sites, but unfortunately the speed of light basically makes it important to keep sites close to users in many cases.
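The speed-of-light point is easy to quantify: light in fiber travels at roughly c / 1.47, about 204 km per millisecond, so round-trip delay scales directly with route distance. The route distances below are rough approximations:

```python
# Light in fiber travels at roughly c / 1.47 (refractive index of glass),
# about 204 km per millisecond. Route distances are rough approximations.

FIBER_KM_PER_MS = 299_792.458 / 1.47 / 1000  # ~203.9 km/ms

def round_trip_ms(route_km: float) -> float:
    return 2 * route_km / FIBER_KM_PER_MS

print(round(round_trip_ms(160), 2))   # Houston <-> Bryan: ~1.57 ms
print(round(round_trip_ms(2400), 1))  # Texas <-> Pacific NW: ~23.5 ms
```

Real links add serialization and switching delay on top, but the propagation floor alone explains why synchronous replication wants nearby sites.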
DallasAg 94
kb2001 said:

You make a lot of good points, but there's an idea you stated a couple of times that I disagree with completely, and a few others worth commenting on.

Quote:

Azure, AWS, GCP are just data centers where people host content.
This is wrong on so many levels. The companies that treat these cloud services platforms as just hosted infrastructure are the ones who typically have a lot of trouble getting value out of it, and who have trouble designing to run on it properly. They are much more than just data centers where people host content.

Quote:

AWS, Azure and GCP are just Infrastructure for you to deploy on.
Same thing, completely wrong, no need to go again
Quote:

Overall, the question is can you tolerate the risk of a single point of failure and if so, where. One server. One Cable. One firewall. One IP.
Exactly. In this case, people who failed to follow best practices and implement multi region sites got in trouble. The argument about whether this is cost effective seems moot after an outage, but it generally comes down to cost barriers up front
Quote:

The 1st hurdle is redundant content. Can you replicate your content across sites? Do you push or pull that content? Imagine you are on a TexAgs forum hosted in San Antonio and NYC. You make a post to a thread. How do the people on the NYC site see your post?
Yes; however, this is an oversimplified way of viewing it on cloud services platforms. The content is meaningless without the means to deliver it, and vice versa. Doing a lift-and-shift cloud migration is going to leave you with the same headaches, just in somebody else's datacenter. If that's what happens, you've already failed in migrating to the cloud. Designing your applications and datasets to take advantage of the service offerings not only lets you realize the potential cost savings, it also makes DR easy.

The single-source-of-truth challenge can be solved a number of ways quite easily. You shouldn't have to think about this too much; it's really a simple thing to do if you're using the cloud services instead of running on hosted infrastructure. The old RDBMS databases are more problematic, but frankly even those are solvable.
Quote:

Even a manual intervention could have been quick, provided the content was stored off-site.
Perhaps; it depends on how long it takes to restore your datasets, and what you need back to hit your RTO requirements. If you've designed well to take advantage of the cloud services platforms, it can be incredibly easy to deploy your applications. In our case, everything we deploy is automated and fully dynamic. If we need to change regions, we just update the target region and hit go. It takes about 30 minutes to deploy the entire application and web layer of our entire platform. Since we keep the datasets warm, that is just a scale-up and go. Even during Amazon's S3 outage, since we have our asset bucket replicated, we just changed one more flag and deployments looked there instead.
Quote:

AFAIK, there is no "easy-button" that says "duplicate my content to two sites."
I can't speak to Azure, but I know for AWS there are several Easy buttons to do just this, depending on which service you're referring to. RDS supports cross-region replication, as one example. S3 supports bucket replication.

I deal in on-prem and AWS, so I can't speak to Azure. I can speak personally that our biggest failures in AWS have come when people try to think of it as hosted infrastructure rather than taking advantage of the services.

My comments were intentionally an oversimplification. In its basic form, and in how most customers have deployed in the cloud, it is lift-and-shift. When a CEO says X% of our applications are going to move to the cloud, they speak of existing Apps, which will not be redeveloped in the cloud using native cloud services.

IaaS, IMO, dominates these deployments. SaaS has been successful... SFDC, O365, etc., but there you are talking about Apps which the developer companies themselves have migrated to the cloud.

Azure is more PaaS (outside O365), and falls along the lines of what you are referencing.

Most migrations I've dealt with are of the lift-and-shift type. Customers essentially replicate what they have on-prem in the cloud. I agree customers often do not see the real value, and the idea of "cost savings" is generally abandoned before the decision of which cloud to use even takes place. They start with the idea of moving to the cloud with native services, and then abandon it when they realize that investing the money to redevelop the app is too costly and will take too long.

The problem with using cloud-native services for mission-critical Apps is that 1) the native services are often rudimentary and ultimately require the pre-existing infrastructure for functionality that isn't available, and 2) they take a long time to develop in the cloud, prolonging adoption. Cloud-native Apps also require customization of existing Apps, which means they are not quick to replace on-prem solutions.

nwspmp
UmustBKidding said:

Bryan is actually well suited as a DR site for the downtown Houston area. Basically it's right at the delay limit for native Fibre Channel to work without issues. Last time I was in Fibertown, there were rows of EMC SAN boxes that were mates of nodes around Houston. I know Insperity told my BIL to take his laptop with him and head to Bryan to ride out Ike.
Texas is fine for data center placement. Things like cheap hydro power attract sites, but unfortunately the speed of light basically makes it important to keep sites close to users in many cases.



I rode out Ike at work. I was working at The Eagle at the time, and we were told that we'd be the emergency site for the AP since they were leaving Houston. Don't think anyone ever showed, but we did get things going well. I work in more conventional IT now and would have to agree; B/CS is a good location. Short drive from major population centers in Texas, robust power with in-area generating capacity, good quality fiber hubs, relatively inexpensive commercial real estate, an educated but cheap workforce - a lot of good assets available.
AGSPORTSFAN07
CapCity12thMan said:

I got the impression that the San Antonio data center was the one hit, and the A/C issues triggered a graceful shutdown of a ton of services. The problems I see are:

1) they didn't have backup A/C or generators to even keep things up.
2) This completely brought down this Azure region, with apparently no redundancy from another region to support the services there.
3) This region apparently housed the Azure AD services which run other Azure/MS services as well, effectively preventing people from authenticating to their Azure-based services from other regions, including the UK. It strikes me as odd that - again - there was no redundancy for the AD service. This is a huge hole IMHO.

It seems that for the amount of time things were "down" all over the world, it proves there was no redundancy in place for a lot of things. A simple Tier 1 data center is supposed to have 12 hours of backup power. Maybe they had backup power but no backup A/C - I am not sure. Thankful we run our business on AWS.

At the least you'd think they'd have a Liebert on a redundant power supply.