AWS is a comprehensive, evolving cloud computing platform built to provide IT infrastructure for businesses, with an active user base of roughly 1,000,000 customers. With a customer base that large depending on it, AWS needs its services to be reliable and trustworthy.
Unfortunately, AWS recently suffered a major outage in one of its largest regions, which lasted multiple hours before it was resolved. Customers started reporting issues when US-EAST-1, the Northern Virginia region and AWS’s most popular, began failing. The region started behaving unexpectedly after an automated capacity-scaling activity overwhelmed the networking devices between the internal network and the AWS network. Companies affected by the outage included Netflix, Disney+, and Delta Air Lines, along with hundreds of others.
This isn’t the first time an AWS outage has made services around the world inaccessible. The US-EAST-1 region has faced recurring outages dating back to 2008. On December 6, 2021, the same region went down, taking with it Amazon subsidiaries such as IMDb and Ring and games such as Valorant, Clash of Clans, Destiny 2, and more.
Over three hours, Amazon gradually brought the region back up to speed, though it ultimately couldn’t determine why the outage happened. Leaked reports led to theories that it was a targeted attack.
Shortly before this, on September 26, 2021, AWS US-EAST-1 faced a series of failures over the course of eight hours, from 8 p.m. to 4 a.m. The failures first hit AWS services such as Redshift, OpenSearch, ElastiCache, and RDS databases, degrading their performance, and then spread to applications such as Signal and the New York Times Games page. Repairs were made around 9 p.m., but users continued to see disruptions. AWS then reported the issue as resolved around midnight, only for a glitch to take the region back down at around 1 a.m. That failure wasn’t resolved until 4 a.m., and even then AWS noted that not all EBS volumes had been restored.
Alerting is the cornerstone of any monitoring tool. When your website goes down or there’s a problem, you want to know about it before it affects your business or customers. SolarWinds® Pingdom® combines synthetic and real user monitoring for full visibility and faster troubleshooting. Pingdom offers insights into where an outage is happening, as well as which services are affected. With recurring outages like these, Pingdom can help by alerting you to issues with your servers as soon as they occur, before they reach your customers. Try it free.
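To make the idea of synthetic monitoring concrete, here is a minimal Python sketch of an uptime check that probes a URL and raises an alert only after several consecutive failures. It is an illustration under stated assumptions, not how Pingdom works internally and not its API: the `probe` function, the `FailureTracker` class, and the threshold of three are all hypothetical names and values chosen for this example.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx), and socket timeouts all derive from OSError.
        return False


class FailureTracker:
    """Hypothetical helper: alert only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerting = False

    def record(self, ok: bool) -> Optional[str]:
        """Record one probe result; return "DOWN" or "RECOVERED" on a state change."""
        if ok:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERED"
            return None
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.alerting:
            self.alerting = True
            return "DOWN"
        return None
```

In practice you would call `tracker.record(probe(url))` on a fixed interval and forward any returned event to email, SMS, or chat. Requiring several consecutive failures is one simple way to avoid paging on a single dropped request, at the cost of a slightly slower first alert.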