Twitter has been in the news a lot lately. It’s one of the biggest internet companies in the world, so the public notices right away when its servers are down, and its brand reputation, stock price, revenue, and users can be affected. In recent months, users have noticed more Twitter outages and service disruptions than usual.
While it’s impossible to avoid outages, looking at why Twitter may be experiencing issues can be a good learning experience for all web development teams. In this post, we’ll cover some of the most recent Twitter outages and help you understand why they occur and how they could have been prevented.
Significant Twitter outages
First, let’s look at some of the most recent and significant Twitter outages.
December 28 outage
On December 28, 2022, Twitter users could not use Twitter for almost five hours. Instead, they saw an error message.
Some users also reported seeing “Rate Limit Exceeded” when they tried to access Twitter during this time, which suggests Twitter’s servers could not handle the number of incoming requests, leading to a complete service disruption. Even users who could access the website reported extremely slow load times and frequent connection issues.
Twitter reported upgrading its back-end server architecture shortly after the outage to optimize speed and performance.
January 23 outage
On January 23, some Android users reported they could not post tweets, and others reported that some tweets weren’t loading at all. Users saw the message, “Oops, something went wrong.” It appeared that neither loading nor sending tweets was working as intended in Twitter’s Android app.
February 8 and February 18 outages
A similar problem where tweets wouldn’t load also occurred on February 8. Users reported seeing error messages and were unable to post.
Another outage occurred roughly ten days later when timelines and replies on Twitter threads started breaking.
March 6 outage
On March 6, 2023, Twitter again went down for a few hours. Users couldn’t use it normally and had problems accessing links, images, and videos. Thousands were affected by this outage, and people in some regions reported the website was slower than usual.
Compared to 2022, when Twitter experienced nine service disruptions for the entire year, the frequency of Twitter outages has increased significantly during the past eight months. According to NetBlocks, an organization devoted to tracking internet outages, Twitter experienced at least four widespread outages in February 2023.
Potential causes of recent Twitter outages
Many of these outages appear to have happened because of architecture changes in Twitter’s APIs, configuration changes in its back-end system, and routine server updates. Let’s look at each cause in detail.
Internal configuration changes
The outage on March 6, 2023, was caused by a configuration change. An engineer attempting to fix a separate issue made the change in only one part of the system, a stand-alone API. So how did a slight misconfiguration bring down an entire service? Because the effect of the change wasn’t isolated within Twitter’s back-end system, it spread to other services, eventually causing an outage that lasted several hours and affected millions of users.
You can isolate the effects of internal configuration changes by refining the back-end architecture. If your back-end system consists of discrete microservices communicating with each other, a change in one microservice shouldn’t bring another down. Even in a monolithic architecture, process separation and proper error handling can help contain the effects of an internal configuration change.
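One common way to keep a failure in one service from spreading to its callers is a circuit breaker. The sketch below is a minimal illustration in Python, not a description of how Twitter’s systems work. The class name, thresholds, and fallback value are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """

    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: T) -> T:
        # While the breaker is open, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback
            # Cool-down elapsed: close the breaker and try the dependency again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                # Too many consecutive failures: open the breaker.
                self.opened_at = self.clock()
            return fallback
        # A success resets the consecutive-failure count.
        self.failures = 0
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(lambda: load_profile(user_id), fallback=cached_profile)`, where `load_profile` and `cached_profile` are placeholders. When the dependency misbehaves after a bad configuration change, callers get the fallback instead of piling up errors of their own.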
External integrations
Another possible cause of Twitter outages is external integrations. We know Twitter integrates many third-party APIs and services for some of its features. Third-party integrations allow users to share YouTube videos and Xbox gameplay and to send tweets directly to Slack, Discord, and ServiceNow. But if your system is too tightly coupled with third-party integrations, you are at a higher risk of creating a problem. A bug in a third-party service shouldn’t disrupt your whole server, but it appears this is what happened to Twitter on February 8.
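Loose coupling with a third-party service often comes down to two rules: never wait on it indefinitely, and always have something to show if it fails. The following Python sketch illustrates both under stated assumptions. The function name `call_integration`, the pool size, and the default timeout are hypothetical choices for the example, not part of any real integration.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared worker pool for outbound integration calls (size is an arbitrary choice).
_pool = ThreadPoolExecutor(max_workers=4)


def call_integration(func: Callable[[], T], fallback: T, timeout: float = 2.0) -> T:
    """Run a third-party call, returning `fallback` if it is slow or fails."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        # Best effort only: a call that has already started keeps running
        # in its worker thread, but the caller no longer waits for it.
        future.cancel()
        return fallback
    except Exception:
        # Any error raised by the integration is contained here.
        return fallback
```

With a wrapper like this, a page that embeds a video preview could fall back to showing a plain link when the preview service is slow or returns an error, rather than failing the whole request.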
Poorly maintained code
As you scale to your next million users, your codebase will grow immensely in size and complexity. The scale at which Twitter operates is beyond imagination. Picture this: tweets are constantly fetched, content is dynamically generated for millions of users, threads are loading up with messages, video is streaming worldwide, and more. However, having a perfect codebase to handle this might be impossible, which is where routine refactoring can help.
Over the last few years, Twitter has continuously evolved its architecture, scale, and features. The codebase is mindbogglingly large, which can contribute to the complexity of feature updates and routine maintenance. But neglecting to update or maintain code leads to errors and unanticipated side effects.
Routinely refactoring the codebase, especially the more complex pieces, is essential. Maintaining code quality helps prevent unknown errors and bugs from surfacing and leading to a total outage. It also helps you iterate on and update your codebase faster.
Brittle APIs
APIs are the backbone of back-end services in any system. When designing and developing an API, you must handle exceptions properly. If an API is too brittle, it can easily break and cause a service outage. The Twitter outage on March 1, 2023, reportedly resulted from an issue in the timeline API. Because of this issue, the timeline stopped working altogether, and tweets and replies broke or disappeared.
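To illustrate what handling exceptions properly can look like, here is a small Python sketch of an endpoint that returns a partial result instead of failing outright. It is a hypothetical example, not Twitter’s timeline API. The function `build_timeline`, the `load_tweet` callback, and the response fields are all assumptions.

```python
from typing import Any, Callable, Dict, Iterable, List


def build_timeline(
    tweet_ids: Iterable[int],
    load_tweet: Callable[[int], Dict[str, Any]],
) -> Dict[str, Any]:
    """Assemble a timeline response, skipping items that fail to load."""
    items: List[Dict[str, Any]] = []
    failed: List[int] = []
    for tweet_id in tweet_ids:
        try:
            items.append(load_tweet(tweet_id))
        except Exception:
            # One bad item should not break the whole response.
            failed.append(tweet_id)
    # Report what was skipped so clients and monitoring can react.
    return {"items": items, "failed_ids": failed, "partial": bool(failed)}
```

The design choice here is to degrade gracefully: users still see most of their timeline, and the `failed_ids` field gives the team a signal to investigate.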
Preventing outages
The best defense against outages is a strong back-end architecture and system design, with robust and reliable APIs forming the stable backbone of a disruption-free system. However, there will always be edge cases, flukes, errors, and bugs that you can’t plan or test for and that can cause an outage later. You and your development team must be the first to know about those cases. You want to avoid learning about problems from your users or reading about them on a social media site. There are plenty of availability and performance monitoring tools you can easily integrate to help you understand your system better.
Pingdom® is one such tool designed to alert you about potential problems. It can also detect early signs of downtime before they turn into an outage. It can provide actionable insights into your application’s uptime and performance, which you can use to prevent a potential outage from occurring.
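To make the idea of availability monitoring concrete, here is a minimal Python sketch of a single uptime probe using only the standard library. It does not use or represent Pingdom’s API. The function names, the latency threshold, and the result fields are assumptions for illustration, and a real monitoring tool does far more, such as scheduling, alerting, and checking from multiple locations.

```python
import time
import urllib.request
from typing import Any, Callable, Dict


def fetch_status(url: str, timeout: float) -> int:
    """Return the HTTP status code for a GET request to `url`."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_endpoint(
    url: str,
    timeout: float = 5.0,
    max_latency: float = 1.0,
    fetch: Callable[[str, float], int] = fetch_status,
) -> Dict[str, Any]:
    """Probe one URL and report whether it looks healthy."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
    except Exception as exc:
        # urlopen raises on connection errors, timeouts, and 4xx/5xx responses.
        return {"url": url, "healthy": False, "reason": f"request failed: {exc}"}
    latency = time.monotonic() - started
    if not 200 <= status < 300:
        return {"url": url, "healthy": False, "reason": f"unexpected status {status}"}
    if latency > max_latency:
        # The site is up but slow, which is often an early warning sign.
        return {"url": url, "healthy": False, "reason": "slow response"}
    return {"url": url, "healthy": True, "latency": latency}
```

Running a probe like this on a schedule and alerting on unhealthy results is the basic loop that lets your team learn about a problem before your users do.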
Why preventing outages matters
The recent outages experienced by Twitter show the effect outages can have on a large tech company and serve as a warning to developers and engineers. However, the varying causes of Twitter’s recent outages aren’t unusual. They are common problems many development and engineering teams face daily.
The overall lesson from these incidents is that you need to design back-end systems from the start to handle internal configuration changes and external integration changes. The code must also be well-designed and built around robust APIs. Investing in strong monitoring tools is as important as investing in a reliable technical architecture. In today’s fast-paced digital world, even a brief outage can have significant consequences for businesses and their customers. The organizations that survive and thrive are those that take proactive steps to prevent outages and help ensure their services remain reliable and accessible.
This post was written by Siddhant Varma. Siddhant is a full-stack JavaScript developer with expertise in front-end engineering. He’s worked with scaling multiple startups in India and has experience building products in the Ed-Tech and healthcare industries. Siddhant has a passion for teaching and a knack for writing. He’s also taught programming to many graduates, helping them become better future