On Wednesday 30.10.2024, between 13:23 and 13:40 UTC, one of our primary infrastructure providers, OVHcloud, experienced network disruptions across multiple data centers. This resulted in a partial or complete service outage for a large part of the traffic destined for both Labrador CMS and Labrador Front.
Our internal monitoring systems reported the first unavailable services and sites at 13:25 UTC. Initial investigation revealed that a network outage was ongoing at one of our infrastructure providers, causing connectivity disruptions for all their services.
Network services began to recover at 13:40 UTC, and systems started coming back online. At 13:45 UTC, all systems were confirmed to be operational.
The services affected by this incident are listed in the table below. All Labrador CMS customers were affected to varying degrees. CMS access was down, but for most customers without their own Varnish cache layer, cached pages remained available to readers.
Service name | Duration (minutes) | Time (UTC) |
---|---|---|
Labrador CMS | 20 | 13:25 - 13:45 |
Labrador Front | 20 | 13:25 - 13:45 |
The following timeline describes the entire incident handling process. All times are in UTC.
2024.10.30 13:25
Initial service outage alerts registered.
2024.10.30 13:27
Large-scale network outage confirmed.
2024.10.30 13:32
Statuspage updated and customers notified.
2024.10.30 13:40
Network connectivity restored.
2024.10.30 13:45
All services confirmed operational and customers notified.

The root cause of the incident was determined to be network disruptions at our infrastructure provider, caused by one of their peering partners pushing a faulty network update.
We are continuously working on improving and decentralizing our infrastructure to make us less vulnerable to large-scale network outages like this one.
One of our largest current efforts in this regard is moving more parts of the Labrador CMS and Front infrastructure to the cloud. Storage, image rendering and Varnish caching have already been moved to AWS, and the rest of Labrador Front will follow in the coming months.
For more information on the incident, the OVHcloud incident report can be found here: https://network.status-ovhcloud.com/incidents/qgb1ynp8x0c4
In addition, Cloudflare has published an interesting blog post with further details: https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage/