Azure status history
September 2025
Service Management Operation Failures - East US 2 (Tracking ID VKY3-PF8)
What happened?
Between 09:12 UTC and 18:50 UTC on 10 September 2025, a platform issue impacted multiple Azure services in the East US 2 region, specifically in two availability zones (Az02 and Az03). Impacted customers may have experienced error notifications when performing service management operations - such as create, delete, update, scale, start, or stop - for resources hosted in this region. The primary services affected were Virtual Machines and Virtual Machine Scale Sets, but this also caused issues for services dependent on those Compute resources, such as Azure Databricks, Azure Kubernetes Service, Azure Synapse Analytics, Backup, and Data Factory.
Customers who still see failed or unhealthy resources should attempt to update or redeploy those resources.
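As an illustration, a failed VM can be checked and redeployed with the Azure SDK for Python. This is a minimal sketch, assuming the azure-identity and azure-mgmt-compute packages are installed; the subscription, resource group, and VM names are placeholders for your own values.

```python
# Sketch: inspect a VM's provisioning state and redeploy it if it is still failed.
# Assumptions: azure-identity and azure-mgmt-compute installed; placeholders below
# are hypothetical and must be replaced with your own values.
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
VM_NAME = "<vm-name>"                   # placeholder

credential = DefaultAzureCredential()
compute = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# Check the current provisioning state before deciding whether to act.
vm = compute.virtual_machines.get(RESOURCE_GROUP, VM_NAME)
print(f"{VM_NAME} provisioning state: {vm.provisioning_state}")

# If the VM is still in a failed state, redeploy it to a different host node.
if vm.provisioning_state == "Failed":
    poller = compute.virtual_machines.begin_redeploy(RESOURCE_GROUP, VM_NAME)
    poller.result()  # block until the redeploy operation completes
    print(f"Redeploy of {VM_NAME} finished.")
```

The same pattern applies to other resource types: query the resource's state first, then trigger an update or redeploy only if it has not already recovered on its own.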
What do we know so far?
Our investigation identified that the issue impacting resource provisioning in East US 2 was linked to a failure in the platform component responsible for managing resource placement. The system is designed to recover quickly from transient issues, but in this case, the prolonged performance degradation caused recovery mechanisms themselves to become a source of instability.
The incident was primarily driven by a combination of platform recovery behavior and sustained performance degradation. While customer-generated load remained within expected limits, internal platform services began retrying failed operations aggressively when performance issues emerged. These retries, intended to support resilience, instead created a surge in internal system activity.
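To illustrate the dynamic described above, the sketch below (not Azure's internal code, just a generic resilience pattern) shows how a retry helper with a bounded attempt count and capped exponential backoff with jitter avoids the kind of retry amplification that can turn a transient degradation into a surge of internal activity.

```python
# Illustrative sketch only: a retry wrapper with capped exponential backoff and
# full jitter, bounded by a maximum attempt count, so retries spread out over
# time instead of hammering an already-degraded dependency.
import random
import time

class TransientError(Exception):
    """Stand-in for a retryable failure, e.g. a timed-out placement call."""

def call_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Invoke `operation`, retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up after a bounded number of attempts
            # Exponential backoff, capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, delay))
```

An aggressive loop that retries immediately and without a cap does the opposite: every failure produces more traffic, which deepens the degradation that caused the failure in the first place.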
How did we respond?
- 09:12 UTC on 10 September 2025 – Customer impact began.
- 09:13 UTC on 10 September 2025 – Our monitoring systems observed a rise in failure rates, triggering an alert and prompting our team to initiate an investigation.
- 12:08 UTC on 10 September 2025 – We identified unhealthy dependencies in core infrastructure components as initial contributing factors.
- 13:34 UTC on 10 September 2025 – Began mitigation efforts, which included restarting critical service components to restore functionality, rerouting workloads away from affected infrastructure, initiating multiple recovery cycles for the impacted backend service, allowing internal workloads to work through backlogs to reach a healthy state after recovery, and executing capacity operations to free up resources.
- 18:50 UTC on 10 September 2025 – After a period of monitoring to validate the health of services, we were confident that the control plane service was restored, and no further impact was observed to downstream services for this issue.
What happens next?
- Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
- To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
- For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
- The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
- Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness