Service outage due to network issues

Incident Report for Labrador CMS

Postmortem

Summary

On Wednesday 30.10.2024, between 13:23 and 13:40 UTC, one of our primary infrastructure providers, OVHcloud, experienced network disruptions across multiple data centers. This resulted in a partial or complete service outage for a large part of the traffic destined for both Labrador CMS and Labrador Front.

Details

Our internal monitoring systems reported the first unavailable services and sites at 13:25 UTC. Initial investigation revealed that a network outage was ongoing at one of our infrastructure providers, causing connectivity disruptions for all their services.

Network services started to return at 13:40 UTC and all systems came back online. At 13:45 UTC, all systems were confirmed to be operational.

Impacted services

The services affected by this incident are listed in the table below. All Labrador CMS customers were affected to varying degrees. CMS access was down, but the sites of most customers without their own Varnish cache layers remained available to readers for cached pages.

Service name	Minutes	Time from — to
Labrador CMS	20	13:25 — 13:45
Labrador Front	20	13:25 — 13:45

Incident timeline

The following timeline describes the entire incident handling process. All times are in UTC.

2024.10.30 13:25 Initial service outage alerts registered
2024.10.30 13:27 Large-scale network outage confirmed
2024.10.30 13:32 Statuspage updated and customers notified
2024.10.30 13:40 Network back up again
2024.10.30 13:45 All services confirmed operational and customers notified

Root cause

The root cause of the incident was determined to be network disruptions at our infrastructure provider, caused by one of their peering partners pushing a faulty network update.

Planned actions

We are continuously working on improving and decentralizing our infrastructure so that we are less vulnerable to large-scale network outages like this one.

One of our largest current efforts in this regard is moving more parts of the Labrador CMS and Front infrastructure to the cloud. So far, storage, image rendering and Varnish caching have been moved to AWS, with the rest of Labrador Front following in the coming months.

Additional reading

For more information on the incident, the OVHcloud incident report can be found here: https://network.status-ovhcloud.com/incidents/qgb1ynp8x0c4

In addition, Cloudflare has an interesting blog post with some more details here: https://blog.cloudflare.com/cloudflare-perspective-of-the-october-30-2024-ovhcloud-outage/

Posted Oct 31, 2024 - 07:53 CET

Resolved

Our infrastructure provider reports that the issue has been identified and fixed. The incident is resolved and we will update with more details when we have them.

Posted Oct 30, 2024 - 15:17 CET

Monitoring

The problem was caused by issues at one of our infrastructure providers. Services are back online now, but we are actively monitoring. A more thorough description of the issues will follow when we have more information from our server provider.

Posted Oct 30, 2024 - 14:46 CET

Investigating

We are experiencing issues across all services and are actively investigating.

Posted Oct 30, 2024 - 14:32 CET

This incident affected: Labrador Editor, Labdevs Development, and Labrador Frontend.