LMS outage

Incident Report for Xen Learning

Postmortem

What was the incident’s duration?
The incident lasted approximately 12 minutes, from 8:18 PM IST (10:48 AM US EDT, 2:48 PM UTC) to 8:30 PM IST (11:00 AM US EDT, 3:00 PM UTC).

Were there any consequences or damages?
Users were temporarily unable to access or properly use the LMS platform during the outage period.

Multiple APIs returned 504 Gateway Timeout errors due to the excessive concurrent load generated by repeated save_user_state API requests.

How was the incident addressed?
The development team immediately started investigating the issue after identifying the production outage.

During the investigation, the issue was traced to duplicate and excessive save_user_state API calls originating from the video embed player.

Kubernetes autoscaling increased the LMS pod count, which stabilized the platform and restored service availability.

A permanent fix was then implemented and deployed to production on May 7th, 2026.

The fix includes:

  • The save_user_state API is now triggered only once after the user completes dragging the video slider.
  • The save_user_state API is now triggered only once for pause actions.
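As an illustration of the behavior described in the fix, the following is a minimal sketch and not the deployed code. The `StateSaver` class, the `UserState` shape, and the handler names are hypothetical; the report does not describe the player's event API or the save_user_state request payload. The sketch assumes the player can signal slider movement, slider release, and pause separately.

```typescript
// Hypothetical payload; the real save_user_state request shape is not given in the report.
interface UserState {
  videoId: string;
  positionSeconds: number;
}

type SaveFn = (state: UserState) => void;

// Wraps a save function so that continuous slider movement produces no calls,
// while a completed drag or a pause produces at most one call each.
class StateSaver {
  private readonly save: SaveFn;
  private dragging = false;
  private lastSavedKey: string | null = null;

  constructor(save: SaveFn) {
    this.save = save;
  }

  // Called repeatedly while the slider is being dragged: only records the drag, never saves.
  onDragMove(): void {
    this.dragging = true;
  }

  // Called once when the user releases the slider: saves the final position.
  onDragEnd(state: UserState): void {
    this.dragging = false;
    this.saveOnce(state);
  }

  // Called on pause events; ignored mid-drag in case the player emits pause while seeking.
  onPause(state: UserState): void {
    if (this.dragging) return;
    this.saveOnce(state);
  }

  // Skips the call when the same video position was already saved by the previous call.
  private saveOnce(state: UserState): void {
    const key = `${state.videoId}:${state.positionSeconds}`;
    if (key === this.lastSavedKey) return;
    this.lastSavedKey = key;
    this.save(state);
  }
}
```

In this sketch, a drag that fires many movement events followed by one release results in a single save, and a repeated pause event at the same position is dropped as a duplicate.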

After deployment, the excessive save_user_state API calls no longer occurred.

Is the incident resolved?
Yes. The incident has been resolved.

The LMS platform was automatically restored after Kubernetes pod autoscaling stabilized the environment, and the permanent fix was deployed to production on May 7th, 2026, to prevent recurrence.

What will prevent further similar incidents?

To prevent similar incidents in the future, the following measures will be implemented:

  • Additional validation for frontend-triggered API calls to prevent duplicate requests.
  • Improved throttling and request handling for high-frequency user interaction events.
  • Additional monitoring and alerting for abnormal spikes in API request counts.
  • Improved incident communication processes to provide faster customer-facing updates during active production events.
  • Continued monitoring of autoscaling behavior and API performance metrics in production environments.
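
The monitoring measure for abnormal spikes in API request counts could take the form of a sliding-window counter. The following is a rough sketch only; the report does not name the monitoring stack, and the `SpikeDetector` class, the window length, and the threshold are all assumptions chosen for illustration.

```typescript
// Counts requests inside a sliding time window and reports when the count
// exceeds a threshold. Window and threshold values are illustrative assumptions.
class SpikeDetector {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records one request at time nowMs (milliseconds) and returns true
  // when the number of requests still inside the window exceeds the threshold.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop requests that have fallen out of the window.
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    return this.timestamps.length > this.threshold;
  }
}
```

A caller would invoke `record(Date.now())` for each save_user_state request and raise an alert when it returns true. Passing the timestamp in, instead of reading the clock inside the class, keeps the logic easy to test.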
Posted May 12, 2026 - 22:03 AEST

Resolved

Our monitoring systems detected degraded performance and errors for the LMS in the US region, escalating to 504 Gateway Timeout responses and abnormal API traffic patterns.
Posted May 06, 2026 - 23:30 AEST