LMS outage

Incident Report for Xen Learning

Postmortem

What happened?
On May 7 at 2:49 PM UTC (7:19 PM IST, 10:49 AM US EST), a production deployment was started regarding the fix for restricting concurrent save_user_state API calls.

At 3:16 PM UTC (7:46 PM IST, 11:16 AM US EST), the deployment was completed successfully.

After the deployment, database migrations were initiated as part of the deployment process. During migration execution, a network connection issue caused the migration process to fail. As a result, database-hit queries across the platform started returning 500 errors, which caused the LMS site to become inaccessible to users.

The development team immediately started investigating the issue and worked on restoring the service.

At 3:55 PM UTC (8:25 PM IST, 11:55 AM US EST), the issue was resolved by successfully re-running the migrations, and the LMS site was restored and functioning normally.

 

Why did this happen?
The incident occurred because the database migrations failed during the production deployment process due to a network connection issue.

Since the migrations were not completed successfully, database-related queries started failing and returning 500 errors, which caused the LMS outage.

 

What was the incident’s duration?
The incident lasted approximately 39 minutes, from discovery to remediation.

 

Were there any consequences or damages?
Users were unable to access the LMS platform during the outage period because database-hit APIs and queries were failing with 500 errors.

Is the incident resolved?
Yes. The incident was resolved by successfully re-running the failed migrations, which restored the LMS functionality.

 

What will prevent further similar incidents?

To prevent similar incidents in the future, the following measures will be implemented:

  • Additional verification steps will be added after deployment and before migration execution.
  • Migration execution status will be monitored more closely during production deployments.
  • Network connection stability checks will be performed before and during migration execution wherever applicable.
  • Rollback and recovery procedures will be further documented and streamlined for faster remediation during deployment-related failures.
Posted May 12, 2026 - 22:04 AEST

Resolved

Users in the US region are unable to access the LMS platform during the outage period because database-hit APIs and queries were failing with 500 errors.
Posted May 07, 2026 - 23:30 AEST