The BookMyShow Coldplay Conundrum
A Lesson in High-Scale System Design
On September 22, 2024, BookMyShow, India's leading ticketing platform, faced a significant challenge when tickets for Coldplay's highly anticipated Mumbai concerts went on sale. The platform crashed just minutes before the scheduled ticket release, leaving thousands of eager fans frustrated and unable to reach the booking page.
This incident serves as a prime example of the challenges software professionals face when designing large-scale ticketing systems, particularly in handling the "thundering herd" problem.
Understanding the Thundering Herd
The thundering herd problem occurs when a large number of processes or threads, all waiting for a single event, are awakened simultaneously when that event occurs, even though only a few of them can actually make progress. The rest compete for the same resource and waste capacity doing so. In BookMyShow's case, this manifested as millions of users attempting to access the ticketing system at precisely 12 PM when sales opened.
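The effect can be reproduced in miniature with threads. The sketch below is illustrative only: the function name `release_one_ticket` and the "fan" threads are hypothetical, and it uses Python's standard `threading.Condition` to compare waking every waiter with waking just one when a single ticket becomes available.

```python
import threading
import time


def release_one_ticket(num_waiters: int, wake_all: bool) -> int:
    """Release one ticket and return how many waiting threads were woken for it."""
    cond = threading.Condition()
    state = {"waiting": 0, "tickets": 0, "sold": 0, "wakeups": 0, "closed": False}

    def fan() -> None:
        with cond:
            state["waiting"] += 1
            cond.wait()                  # sleep until the sale is announced
            if state["closed"]:          # shutdown signal, not a sale wakeup
                return
            state["wakeups"] += 1
            if state["tickets"] > 0:     # only one fan can win the single ticket
                state["tickets"] -= 1
                state["sold"] += 1

    threads = [threading.Thread(target=fan) for _ in range(num_waiters)]
    for t in threads:
        t.start()

    # Wait until every fan is parked inside cond.wait().
    while True:
        with cond:
            if state["waiting"] == num_waiters:
                break
        time.sleep(0.001)

    with cond:
        state["tickets"] = 1
        if wake_all:
            cond.notify_all()            # the herd: everyone wakes for one ticket
        else:
            cond.notify(1)               # wake only as many fans as there are tickets

    if wake_all:
        for t in threads:                # every fan was woken, so all of them finish
            t.join()
    else:
        while True:                      # wait for the single winner to buy
            with cond:
                if state["sold"] == 1:
                    break
            time.sleep(0.001)

    with cond:
        woken = state["wakeups"]
        state["closed"] = True
        cond.notify_all()                # let any still-sleeping fans exit
    for t in threads:
        t.join()
    return woken
```

With 20 waiting threads and one ticket, `wake_all=True` wakes all 20 even though 19 find nothing to buy, while `wake_all=False` wakes exactly one. A real ticketing backend faces the same imbalance across network clients rather than threads.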
Imagine a scenario where thousands of people are waiting outside a store for a limited-edition product. When the doors open, everyone rushes in at once, potentially causing chaos and overwhelming the store's capacity. This real-world analogy closely mirrors what happened to BookMyShow's servers.
The Challenges of High-Scale Ticketing Systems
Designing a system to handle such massive concurrent traffic presents several challenges:
Load Balancing: Distributing incoming requests evenly across multiple servers so that no single server becomes a bottleneck or a single point of failure.
Caching: Implementing efficient caching mechanisms to reduce database load and improve response times.
Queue Management: Creating a robust queueing system to handle excess traffic and prevent server overload.
Database Optimisation: Ensuring database operations can handle numerous simultaneous read and write operations.
Scalability: Designing the system to scale horizontally to accommodate traffic spikes.
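To make the load-balancing item above concrete, here is a minimal sketch of one common policy, least connections, where each new request goes to the server currently handling the fewest. The class name and server labels are hypothetical, and production load balancers add health checks, weights, and timeouts that are omitted here.

```python
class LeastConnectionsBalancer:
    """Route each request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        # Map of server name -> number of requests currently in flight.
        self.active = {server: 0 for server in servers}

    def acquire(self) -> str:
        # Ties are broken by the order in which servers were registered.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call this when the request finishes so the server can take more work.
        self.active[server] -= 1


balancer = LeastConnectionsBalancer(["app-1", "app-2", "app-3"])
first_six = [balancer.acquire() for _ in range(6)]
```

Six back-to-back requests are spread two per server. Once one server finishes a request and `release` is called for it, that server is the next to be picked.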
Mitigating the Thundering Herd
To address these challenges, software professionals usually employ several strategies, including:
Implement a Virtual Waiting Room
Create a holding area for users before they enter the actual ticketing system. This helps manage traffic flow and prevents server overload.
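A minimal sketch of the idea, assuming a single in-memory process: a hypothetical `WaitingRoom` class admits users up to a fixed capacity and holds everyone else in a first-come, first-served queue. A real waiting room would persist this state in a shared store and hand out signed tokens, which is out of scope here.

```python
from collections import deque


class WaitingRoom:
    """Admit at most max_active users to the booking flow; queue the rest."""

    def __init__(self, max_active: int):
        self.max_active = max_active
        self.active = set()      # users currently inside the ticketing system
        self.queue = deque()     # users waiting, in arrival order

    def join(self, user_id: str) -> int:
        """Return 0 if the user is admitted immediately, else their 1-based queue position."""
        if len(self.active) < self.max_active and not self.queue:
            self.active.add(user_id)
            return 0
        self.queue.append(user_id)
        return len(self.queue)

    def leave(self, user_id: str) -> list:
        """Remove a user who finished or gave up; return the users admitted in their place."""
        self.active.discard(user_id)
        admitted = []
        while self.queue and len(self.active) < self.max_active:
            next_user = self.queue.popleft()
            self.active.add(next_user)
            admitted.append(next_user)
        return admitted
```

The backend only ever sees `max_active` concurrent users, however many are queued. The capacity value is an assumption that would in practice be derived from load testing.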
Use Exponential Backoff with Jitter
When retrying failed requests, implement an exponential backoff algorithm with added randomness (jitter). This approach, similar to what PayPal used to solve their thundering herd problem, helps spread out retry attempts and prevents synchronised floods of requests.
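As a sketch of what this might look like on the client side, the example below uses the "full jitter" variant, where each wait is drawn uniformly between zero and an exponentially growing ceiling. The function names, the default `base` and `cap` values, and the choice to retry only on `ConnectionError` are assumptions made for illustration, not details from PayPal's or BookMyShow's systems.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0, rng=random) -> float:
    """Full-jitter delay: a random wait between 0 and min(cap, base * 2**attempt)."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep, rng=random):
    """Run operation(), retrying on ConnectionError with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

Without the random component, every client that failed at 12:00:00 would retry at exactly the same instants and recreate the original spike. With it, retries are spread across the whole backoff window.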
Leverage Cloud Auto-scaling
Utilise cloud services that can automatically scale resources based on demand. This ensures the system can handle traffic spikes without manual intervention.
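Many autoscalers use a proportional rule: scale the replica count by the ratio of the observed metric to its target. The sketch below follows the formula documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the parameter names, and the minimum and maximum bounds are hypothetical choices for illustration.

```python
import math


def desired_replicas(current_replicas: int, observed_metric: float, target_metric: float,
                     min_replicas: int = 2, max_replicas: int = 200) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to the allowed range."""
    raw = math.ceil(current_replicas * observed_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 10 replicas averaging 90% CPU against a 60% target -> scale out to 15.
scaled = desired_replicas(current_replicas=10, observed_metric=90, target_metric=60)
```

Reactive scaling of this kind only responds after the metric has risen. Because a ticket sale has a known start time, an operator could also raise the `min_replicas` floor ahead of it rather than wait for the spike.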
Employ Caching Strategies
Implement intelligent caching to reduce the load on backend services. This could include caching frequently accessed data such as event details, and seat availability with a short expiry time because it changes quickly during a sale.
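A cache can itself trigger a small thundering herd: when a popular entry expires, every request misses at once and hits the backend together. One way to limit this, sketched below, is to let only one caller per key reload the value while the others wait for the result. The class name `CoalescingTTLCache` and the `loader` callback are hypothetical, and the sketch is in-process only, whereas a real deployment would more likely use a shared cache.

```python
import threading
import time


class CoalescingTTLCache:
    """TTL cache where concurrent misses for the same key trigger a single load."""

    def __init__(self, loader, ttl_seconds: float, clock=time.monotonic):
        self._loader = loader          # function(key) -> value; the expensive backend call
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}             # key -> (expires_at, value)
        self._key_locks = {}           # key -> lock guarding reloads of that key
        self._guard = threading.Lock() # protects _key_locks

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]            # fast path: cache hit
        with self._guard:
            lock = self._key_locks.setdefault(key, threading.Lock())
        with lock:
            # Another caller may have reloaded the value while we waited for the lock.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = self._loader(key)
            self._entries[key] = (self._clock() + self._ttl, value)
            return value
```

With this in place, twenty simultaneous requests for the same event page result in one backend call rather than twenty.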
Optimise Database Operations
Use database sharding, read replicas, and other optimisation techniques to handle high-volume concurrent operations efficiently.
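The routing side of sharding and read replicas can be sketched as follows, assuming data is partitioned by event ID. The `ShardRouter` class, the host names, and the hash-modulo placement are illustrative assumptions. Modulo placement in particular reshuffles most keys when the shard count changes, so real systems often prefer consistent hashing or a lookup table.

```python
import hashlib
import itertools


class ShardRouter:
    """Pick a shard by hashing the event ID; send writes to its primary, reads to its replicas."""

    def __init__(self, shards):
        # shards: list of {"primary": host, "replicas": [host, ...]}
        self._shards = shards
        # Round-robin over each shard's replicas; fall back to the primary if it has none.
        self._replica_cycles = [
            itertools.cycle(shard["replicas"] or [shard["primary"]]) for shard in shards
        ]

    def shard_index(self, event_id: str) -> int:
        # A stable hash, so every application server maps an event to the same shard.
        digest = hashlib.sha256(event_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self._shards)

    def for_write(self, event_id: str) -> str:
        return self._shards[self.shard_index(event_id)]["primary"]

    def for_read(self, event_id: str) -> str:
        return next(self._replica_cycles[self.shard_index(event_id)])


shards = [
    {"primary": "db-0-primary", "replicas": ["db-0-r1", "db-0-r2"]},
    {"primary": "db-1-primary", "replicas": ["db-1-r1", "db-1-r2"]},
]
router = ShardRouter(shards)
```

Seat reservations, which must be strongly consistent, would go through `for_write`. Browsing traffic that can tolerate slightly stale data can use `for_read` and spread across the replicas.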
No system can guarantee 100% resilience; there will always be a tug of war between cost and availability. As we continue to push the boundaries of large-scale system design, incidents like this one are a reminder that systems must keep improving and adapting to ever-growing user demand.