Reindexing/Uploading Issues

Incident Report for Air

Postmortem

Overview

  • Incident name: Delayed media asset processing
  • Date and time: 2025-09-10, roughly 2:03 PM–6:10 PM ET (with safe reprocessing continuing through 12:30 AM on 9/11)
  • Affected areas: Background processing of certain actions such as search reindexing, upload processing, and zip downloads. To clarify, uploading of originals continued to function as usual.
  • Status: Resolved

Customer impact

  • What customers experienced: Some background tasks such as media processing, AI enrichments, indexing, and downloads were delayed. Foreground app access remained available.
  • Scope: A subset of actions across customers experienced delays; not every action was impacted.
  • Duration: Delays occurred primarily between ~2:03 PM and ~6:10 PM ET on 9/10. Backlogs were fully cleared by ~12:30 AM ET on 9/11.
  • Data and security: No data loss or security exposure occurred.

What happened

A spike in background activity coincided with limited available infrastructure capacity and a rolling deployment. This combination created a large queue of background work. While the service remained available, some background tasks took longer than normal to start and complete. We mitigated the issue by increasing capacity, adjusting infrastructure to remove bottlenecks, and carefully reprocessing queued work to restore expected performance.

Root cause

  • Primary cause: Background work was scheduled across many dynamic queues. With capacity already at its maximum, a sudden increase in queued work reduced per-queue processing rate, allowing a backlog to build and slow processing further.
  • Contributing factors: A rolling deployment temporarily reduced the number of available workers, and incoming event volume was significantly above typical peaks.
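
The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of Air's actual scheduler: it assumes a fixed worker pool shared evenly across active queues, and the function name, the overhead parameter, and all numbers are hypothetical.

```python
def per_queue_rate(total_capacity: float, active_queues: int,
                   overhead_per_queue: float = 0.0) -> float:
    """Tasks/sec each queue receives when a fixed pool is shared evenly.

    total_capacity     -- tasks/sec the whole worker pool can process (assumed fixed,
                          i.e. capacity is already at its maximum)
    active_queues      -- number of dynamic queues currently holding work
    overhead_per_queue -- hypothetical scheduling cost, in tasks/sec of capacity
                          lost for every active queue
    """
    if active_queues <= 0:
        raise ValueError("active_queues must be positive")
    # Capacity left after scheduling overhead; never below zero.
    usable = max(total_capacity - overhead_per_queue * active_queues, 0.0)
    return usable / active_queues


if __name__ == "__main__":
    # Same pool, growing number of queues: each queue drains more slowly.
    for queues in (100, 1000):
        print(queues, per_queue_rate(1000.0, queues, overhead_per_queue=0.5))
```

In this model, a tenfold increase in active queues cuts each queue's share by at least tenfold, and any per-queue overhead shrinks the usable pool further. That is one way a growing backlog could slow processing rather than merely lengthen it.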

Timeline (high level)

  • 2:03 PM: Issue reported; we observed a growing background queue
  • 2:34 PM: Incident declared; investigation and mitigation began
  • 3:15 PM: First mitigation reduced the queue briefly; additional steps identified
  • 5:10 PM: Second mitigation increased capacity and stabilized the system
  • 6:10 PM: Backlog returned to normal levels; background processing restored
  • 12:30 AM (9/11): All queued background work safely reprocessed

Preventative actions

  • Immediate fixes completed

    • Increased environment capacity and tuned scaling to better handle spikes
    • Adjusted deployment approach to avoid temporary under-provisioning during rollouts
  • Near-term improvements

    • Add safeguards to prevent queue growth from reducing throughput
    • Alert on early indicators (e.g., concurrent queue count and processing rate) for faster detection
  • Long-term investments

    • Optimize scheduling to keep processing performance stable as the number of queues grows
    • Evaluate architecture options (partitioning and mediated queueing) for additional resilience
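
As an illustration of alerting on early indicators, the sketch below flags a sustained breach of either signal named above (concurrent queue count and processing rate). It is a minimal example under assumptions: the QueueSample fields, the early_warning function, and every threshold are hypothetical and are not taken from Air's monitoring setup.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class QueueSample:
    concurrent_queues: int    # queues holding work at sample time
    processed_per_sec: float  # observed processing rate
    enqueued_per_sec: float   # observed arrival rate


def early_warning(samples: Sequence[QueueSample],
                  max_queues: int = 500,      # hypothetical threshold
                  min_ratio: float = 0.9,     # hypothetical threshold
                  sustained: int = 3) -> bool:
    """Return True when the last `sustained` samples all breach a threshold."""
    if len(samples) < sustained:
        return False

    def breached(s: QueueSample) -> bool:
        # Falling behind: processing less than min_ratio of what arrives.
        falling_behind = (s.enqueued_per_sec > 0
                          and s.processed_per_sec / s.enqueued_per_sec < min_ratio)
        return s.concurrent_queues > max_queues or falling_behind

    return all(breached(s) for s in samples[-sustained:])
```

Requiring several consecutive breaches is a design choice that trades a little detection latency for fewer alerts on momentary spikes.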

Frequently asked questions

  • Was any customer data lost?

    • No. We confirmed no data loss or security exposure.
  • Do customers need to take any action?

    • No. All queued work has been safely processed. If anything still looks off, please let us know and we will investigate immediately.
  • How will we keep you updated?

    • Your account team will share any follow-ups on improvements. We will also post future status updates through our standard channels if needed.
Posted Sep 16, 2025 - 21:08 UTC

Resolved

This incident has been resolved.
Posted Sep 11, 2025 - 01:28 UTC

Monitoring

We are starting to see tasks being processed again. Search indexing is slightly delayed, which is expected given the current processing load. While recovery has begun, we still have a backlog of tasks to reprocess and are continuing to monitor until we are fully in the clear.
Posted Sep 11, 2025 - 01:02 UTC

Investigating

We’re currently investigating an issue where reindexing and uploading are not working in some workspaces. Our engineering team is working to identify the cause and restore full functionality as quickly as possible.
Posted Sep 10, 2025 - 18:43 UTC
This incident affected: Web App.