App is getting 504 Gateway timeouts

Incident Report for Air

Postmortem

Overview

  • Incident name: Internal Cleanup Led to API Request Handling Issues
  • Date and time: 2025-09-30, roughly 12:45 PM–1:45 PM ET
  • Affected areas: Consumers of Air’s API
  • Status: Resolved

Customer impact

  • What customers experienced: During this period, foreground apps could reach the API only intermittently, with the greatest impact on operations that modify data.
  • Scope: Actions that modified data were affected, with intermittent impact on actions that read data. Some background tasks such as media processing, AI enrichments, indexing, and downloads were delayed during the incident but resumed following resolution without degradation.
  • Duration: 2025-09-30, roughly 12:45 PM–1:45 PM ET
  • Data and security: No data loss or security exposure occurred.

What happened

An internal cleanup of a dataset increased load on Air’s primary database, which caused subsequent queries to slow down or time out.

Root cause

  • Primary cause: The cleanup logic was not throttled correctly to limit the load it placed on the primary database.
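
For illustration only, one common way to throttle this kind of cleanup is to delete in small batches and pause between them so foreground queries get room to run. The sketch below is not Air's actual cleanup code: the `stale_items` table, the batch size, the pause length, and the use of SQLite are all assumptions chosen to keep the example self-contained.

```python
import sqlite3
import time
from typing import Callable


def throttled_cleanup(
    delete_batch: Callable[[int], int],
    batch_size: int = 500,
    pause_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete rows in small batches, pausing between batches.

    delete_batch(limit) is a hypothetical callable that deletes at most
    `limit` rows and returns how many rows it actually deleted.
    """
    total = 0
    while True:
        deleted = delete_batch(batch_size)
        total += deleted
        if deleted < batch_size:
            # A short batch means nothing is left to clean up.
            return total
        # Yield to foreground traffic before taking the next batch.
        sleep(pause_seconds)


def make_sqlite_deleter(conn: sqlite3.Connection) -> Callable[[int], int]:
    """Build a batch deleter for a hypothetical `stale_items` table."""

    def delete_batch(limit: int) -> int:
        with conn:  # one short transaction per batch
            cur = conn.execute(
                "DELETE FROM stale_items WHERE id IN "
                "(SELECT id FROM stale_items LIMIT ?)",
                (limit,),
            )
            return cur.rowcount

    return delete_batch
```

Keeping each batch in its own short transaction limits how long any lock is held, and the pause between batches caps the sustained load the cleanup can place on the database.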

Timeline (high level)

  • 12:45 PM: Degradation from cleanup observed and cleanup terminated
  • 1:00 PM: Load on primary decreases and intermittent API access is observed
  • 1:08 PM: Long-running queries created by the cleanup were terminated
  • 1:25 PM: Additional read capacity added to the database to support volume of requests during recovery
  • 1:45 PM: All app functionality returned to nominal behavior and async processing resumed

Preventative actions

  • Immediate fixes completed

    • Increased environment capacity to handle load
    • Cleanup process placed back in review for further refinement
  • Near-term improvements

    • Refine approval and application of out-of-band processing (e.g., cleanup)
    • Alert on early indicators (e.g., database locks) for faster detection
  • Long-term investments

    • Dedicated system for out-of-band processing to provide additional guardrails
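
As a rough sketch of what alerting on an early indicator such as database locks could look like, the example below fires once the number of lock-blocked queries stays above a threshold for several consecutive samples. The `LockSample` type, the threshold of 20, and the three-sample window are hypothetical values for illustration, not Air's monitoring configuration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LockSample:
    minute: int           # minutes since monitoring started
    waiting_queries: int  # queries blocked on a lock at that moment


def first_alert_minute(
    samples: Iterable[LockSample],
    threshold: int = 20,
    consecutive: int = 3,
) -> Optional[int]:
    """Return the minute at which an alert would fire, or None.

    The alert fires when `waiting_queries` exceeds `threshold` for
    `consecutive` samples in a row, which filters out brief spikes.
    """
    streak = 0
    for sample in samples:
        if sample.waiting_queries > threshold:
            streak += 1
            if streak >= consecutive:
                return sample.minute
        else:
            # Any sample at or below the threshold resets the streak.
            streak = 0
    return None
```

Requiring a sustained breach rather than a single high sample trades a little detection speed for fewer false alarms; the right window depends on how quickly lock contention escalates in practice.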

Frequently asked questions

  • Was any customer data lost?

    • No. We confirmed no data loss or security exposure.
  • Do customers need to take any action?

    • No. All queued work has been safely processed. If anything still looks off, please let us know and we will investigate immediately.
  • How will we keep you updated?

    • Your account team will share any follow-ups on improvements. We will also post future status updates through our standard channels if needed.
Posted Oct 06, 2025 - 19:37 UTC

Resolved

This incident has been resolved.
Posted Sep 30, 2025 - 17:45 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 30, 2025 - 17:25 UTC

Investigating

We are currently investigating this issue.
Posted Sep 30, 2025 - 16:45 UTC
This incident affected: Web App.