On January 16, 2024, several reports indicated that tech giant Oracle experienced an outage impacting regions across multiple continents. Fortunately for Oracle, service was restored in less than an hour, and the incident didn’t generate as much internet chatter as other outage case studies (such as Facebook in February 2024 or iCloud in January 2024). Still, much can be learned from the Oracle outage.
In this article, we’ll look at the details of the outage and what IT operations teams and website administrators can learn from it.
Scope of the outage
The January 16, 2024, outage was first observed at approximately 8:45 a.m. EST and lasted about 35 minutes. According to reports, nodes in North America, Asia, Europe, and Australia were affected. A second outage on Oracle infrastructure began around 9:10 a.m. EST, affecting nodes in North America, Europe, and Australia. Reports indicated that Oracle cloud service customers and downstream partners were affected.
Notably, beyond the reporting by media outlets, there was little noise about the outage on social media platforms or forums.
Root cause
There was no public root cause analysis (RCA) published regarding the Oracle outage. From the reports and associated data, it seems that entire portions of Oracle networks became inaccessible via the internet. This implies several possible causes, including:
- Routing issues: Problems with routing protocols like BGP, such as erroneous route withdrawals or misconfigurations, could cause a network to go offline.
- Failed updates: Untested or under-tested patches with unexpected consequences can cause network issues.
- Equipment failures: A device failing or malfunctioning could lead to routing failures that create a domino effect and cause a network outage.
What can you learn from the Oracle outage?
The Oracle outage is interesting for several reasons. First, the limited information from official sources makes it a good thought experiment: What could cause your systems to show the same symptoms? Additionally, the limited chatter from Oracle users suggests the organization probably did something right, and that’s something we should learn from.
With this in mind, let’s jump into our top three takeaways.
Takeaway #1: Focus on users
What we know about the incident is that there were reports of an Oracle-related network outage. That implies some systems went down. However, the limited uproar from users implies that Oracle didn’t let things get too far out of hand. They must have had mechanisms in place to limit user impact and/or recover quickly.
Granted, you might not have an infrastructure setup like Oracle’s. Nonetheless, this gives us several key points to consider when implementing user-focused website monitoring and infrastructure:
- Monitor from the user’s perspective. Users are often spread across multiple geographic regions, and they interact with websites differently than traditional monitoring tools do. Real user monitoring (RUM) from multiple regions can help teams implement user-centric monitoring.
- Failover fast. If you can’t afford downtime, then you should invest in fast, reliable failover mechanisms. Technologies such as load balancers, database clusters, and DNS failover solutions can help; however, the implementation details matter less than the principle: Make downtime as painless for your users as is practical for your business.
- Keep users informed if something goes wrong. If your users don’t know whether something is wrong (Is the site down, or is it just me?), then they may waste their time and your support staff’s time chasing down a known issue. When you have a service interruption, make sure to communicate clearly. For example, a public status page can provide users with a simple, easy-to-understand source of truth while you work on addressing a service incident.
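To make the first point concrete, here’s a minimal Python sketch of a user-perspective check: it times a real HTTP request the way a user would experience it and classifies the result as up, degraded, or down. The URL, latency threshold, and state names are illustrative assumptions, not a prescribed implementation; a real RUM or synthetic-monitoring setup would run checks like this from many regions.

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

def classify(status_code, latency_s, slow_threshold_s=2.0):
    """Map one check result to a simple health state (thresholds are illustrative)."""
    if status_code is None or status_code >= 500:
        return "down"
    if latency_s > slow_threshold_s:
        return "degraded"
    return "up"

def check(url, timeout=5.0):
    """Run one synthetic check, timing the request as a user would experience it."""
    start = time.monotonic()
    try:
        with urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, time.monotonic() - start)
    except URLError:
        return "down"  # unreachable from this vantage point
```

Running `check()` from probes in several regions, rather than one data center, is what surfaces regional outages like the one reported here.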
Takeaway #2: Monitor your dependencies
The reports indicated that Oracle customers and downstream partners were affected. Many modern businesses have multiple dependencies that could bring down key systems. In some cases, a service provider is responsible for monitoring dependencies. For example, monitoring the underlying AWS infrastructure falls on AWS’s side of its shared responsibility model.
However, when a third party is responsible for the service, you’re still accountable to your users for the services you maintain. When you have monitoring in place to identify where an issue lies, it can help you to:
- Save time when your engineers are troubleshooting an outage
- Set proper expectations with your users
- Enable workarounds, such as switching to another service provider during an extended outage
Dependency monitoring can also help you detect issues with services that don’t directly impact your site’s “uptime,” such as:
- Payment APIs
- Analytics services
- Email services
These services could go down for your users (for example, due to an expired API key or license) without indicating a complete outage. Exactly how you monitor a third-party dependency will vary. One approach is transaction monitoring, which can help you detect when something breaks in a user journey that involves third-party services.
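As a rough illustration of transaction monitoring, the Python sketch below walks a user journey step by step and reports the first step that fails. The step names and probe callables are hypothetical placeholders; in practice each probe would be a real HTTP request or API call against the service in question.

```python
# Minimal transaction-monitoring sketch: run each step of a user journey
# and report the first dependency that fails.

def run_journey(steps):
    """steps: list of (name, probe) pairs; each probe returns True on success."""
    for name, probe in steps:
        try:
            ok = probe()
        except Exception:
            ok = False  # treat exceptions (timeouts, DNS errors) as failures
        if not ok:
            return f"FAILED at step: {name}"
    return "OK"

# Hypothetical checkout journey where the third-party payment API is down.
journey = [
    ("load product page", lambda: True),
    ("add to cart",       lambda: True),
    ("payment API",       lambda: False),  # simulated dependency failure
    ("confirmation page", lambda: True),
]
```

A failure report that names the broken step ("payment API") tells your engineers and your users far more than a generic "site is up" check would.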
Takeaway #3: Set realistic expectations
Oracle chairman and cofounder Larry Ellison once said Oracle’s cloud “never ever goes down.” Taken literally, this isn’t a realistic expectation to set for your users. Modern systems are complex, and many intertwined dependencies can take a service offline. Set realistic expectations for your users and work to meet or exceed them.
For many services, this boils down to getting the following expectations right:
- Recovery time objective (RTO): How long a system can be down
- Recovery point objective (RPO): How much data can be lost during an incident
- Service level agreement (SLA): How much cumulative downtime a system can have over a given period (such as monthly, quarterly, or annually)
The more aggressive your SLAs, RTOs, and RPOs, the more you should invest in fault tolerance and high availability. Consider these numbers:
- A 99.5% availability SLA allows for 43.8 hours of downtime per year.
- A 99.9% availability SLA brings that down to about 8.76 hours of downtime per year.
- A 99.999% (“five nines”) availability SLA allows for about 5.26 minutes of downtime per year.
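The downtime budgets above follow from simple arithmetic: the budget is the period length multiplied by the fraction of time the SLA allows the service to be down. A small helper makes it easy to compute the budget for any availability target (assuming a 365-day year):

```python
def allowed_downtime_minutes(sla_percent, period_hours=365 * 24):
    """Cumulative downtime budget (in minutes) for an availability SLA
    over a given period; defaults to one 365-day year."""
    return period_hours * 60 * (1 - sla_percent / 100)
```

For example, `allowed_downtime_minutes(99.9)` returns about 525.6 minutes, which matches the 8.76 hours per year quoted above.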
If you offer your users “five nines,” then you leave your systems and technicians with a much smaller margin for error. If your business can’t afford the corresponding investment in redundancy and high availability, don’t overpromise. After all, outages impact even the biggest players in the tech industry.
How SolarWinds Pingdom can help improve your website monitoring
SolarWinds® Pingdom® is a simple and powerful website monitoring tool that supports user-centric monitoring capabilities like RUM and transaction monitoring. SolarWinds Pingdom also enables teams to run checks from multiple regions across the globe to help you avoid blind spots in your monitoring strategy.