Azure status history

This page contains Post Incident Reviews (PIRs) of previous service issues, each retained for 5 years. From November 20, 2019, this has included PIRs for all issues about which we communicated publicly. From June 1, 2022, it also includes PIRs for broad issues as described in our documentation.

February 2026

2/3

Mitigated – Managed identities for Azure resources, and dependent services – Operation failures in East US and West US (Tracking ID _M5B-9RZ)

What happened?

Between 00:10 UTC and 06:05 UTC on 03 February 2026, a platform issue with the 'Managed identities for Azure resources' (formerly Managed Service Identity) service in the East US and West US regions impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens.

What do we know so far?

Following the mitigation of an earlier outage, a large spike in traffic overwhelmed a platform service for managed identities in East US and West US regions. This impacted the creation and use of Azure resources with assigned managed identities, including but not limited to Azure Synapse Analytics, Azure Databricks, Azure Stream Analytics, Azure Kubernetes Service, Microsoft Copilot Studio, Azure Chaos Studio, Azure Database for PostgreSQL Flexible Servers, Azure Container Apps, Azure Firewall and Azure AI Video Indexer.

Once the failures began, many infrastructure components began to retry aggressively, overwhelming service capacity and limits. Although we were able to scale up our service, the new capacity was quickly overwhelmed by the backlog of retries, and the service compensated further by shedding load.
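The retry storm described above is a common failure amplifier: when every caller retries immediately on failure, a recovering service faces the original load plus the accumulated backlog. The sketch below is purely illustrative (it is not the platform's retry implementation, and fetch_token is a hypothetical callable); it shows the exponential backoff with jitter that keeps clients from retrying in synchronized waves against a service that is trying to recover.

```python
import random
import time

def acquire_token_with_backoff(fetch_token, max_attempts=6, base_delay=1.0, max_delay=60.0):
    """Call fetch_token(), backing off exponentially with jitter between failures.

    Illustrative only: real managed identity clients (such as the Azure Identity SDKs)
    ship their own retry policies; this sketch just shows why spacing out retries
    matters when capacity is constrained.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_token()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap,
            # so thousands of clients do not hit the service in lockstep.
            cap = min(max_delay, base_delay * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, cap))
```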

How did we respond?

  • 00:10 UTC on 03 February 2026 - Customer impact began.
  • 00:14 UTC on 03 February 2026 - Automated alerting identified the issue. Engineers quickly recognized that the service was overloaded and began to scale up.
  • 00:50 UTC on 03 February 2026 - The first set of infrastructure scale-ups completed, but the new capacity was unable to handle the traffic volume due to an increasing backlog of retried requests.
  • 02:00 UTC on 03 February 2026 - A second, much larger set of infrastructure scale-ups completed. Once again, the capacity was unable to handle the volume of backlogs and retries.
  • 03:55 UTC on 03 February 2026 - To recover infrastructure capacity, we began rolling out a change to remove all traffic from the service so that the infrastructure could be repaired without load.
  • 04:20 UTC on 03 February 2026 - All infrastructure nodes became healthy, serving no traffic.
  • 04:25 UTC - 6:05 UTC on 03 February 2026 - We gradually ramped traffic to the infrastructure, allowing all backlogged identity operations to complete safely.
  • 06:05 UTC on 03 February 2026 - With all backlogged operations complete and current, service was restored, and customer impact was mitigated.

What happens next?

  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: 
  • For more information on Post Incident Reviews, refer to 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, for broader guidance on preparing for cloud incidents, refer to 
2/2

Mitigated – Virtual Machines and dependent services – Service management issues in multiple regions (Tracking ID FNJ8-VQZ)

What happened?

Between 19:46 UTC on 02 February 2026 and 06:05 UTC on 03 February 2026, a platform issue resulted in an impact for multiple Azure services in multiple regions.

  • Azure Virtual Machines – Customers may have experienced failures when deploying or scaling virtual machines, including errors during provisioning and lifecycle operations.
  • Azure Virtual Machine Scale Sets – Customers may have experienced failures when scaling instances or applying configuration changes.
  • Azure Kubernetes Service – Customers may have experienced failures in node provisioning and extension installation.
  • Azure DevOps and GitHub Actions – Customers may have experienced pipeline failures when tasks required virtual machine extensions or related packages.
  • Managed identities for Azure resources – Customers may have experienced authentication failures for workloads relying on this service.
  • Other dependent services – Customers may have experienced degraded performance or failures in operations that required downloading extension packages from Microsoft-managed storage accounts. (Azure Arc enabled servers, Azure Database for PostgreSQL)

What do we know so far?

A policy change was unintentionally applied to a subset of Microsoft-managed storage accounts, including ones used to host virtual machine extension packages. The policy blocked public read access, which disrupted scenarios such as virtual machine extension package downloads. This triggered widespread extension installation failures and caused downstream impact for services that rely on virtual machine scale set provisioning.
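As a rough illustration of this failure mode (the package URL below is hypothetical, and this is not the platform's extension-download code), an anonymous blob download fails once public read access is blocked on the hosting storage account, which is how dependent provisioning workflows surfaced the error.

```python
# Sketch only: requires the azure-storage-blob package; the blob URL is a placeholder.
from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobClient

package_url = "https://exampleextensions.blob.core.windows.net/packages/extension.zip"

# No credential is supplied, so the download relies on public (anonymous) read access.
blob = BlobClient.from_blob_url(package_url)

try:
    package_bytes = blob.download_blob().readall()
except HttpResponseError as err:
    # With public access blocked at the account level, anonymous reads are rejected,
    # which is roughly what extension installation saw downstream.
    print(f"Package download failed with status {err.status_code}")
```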

How did we respond?

  • 19:46 UTC on 02 February 2026 – Customers began experiencing issues while attempting to complete service management operations in multiple regions.
  • 19:55 UTC on 02 February 2026 – Service monitoring detected failure rates exceeding alerting thresholds.
  • 20:10 UTC on 02 February 2026 – We started collaborating with additional teams to devise a mitigation solution.
  • 21:15 UTC on 02 February 2026 – We applied a primary mitigation and validated it was successful.
  • 21:50 UTC on 02 February 2026 – Began broader mitigation to impacted storage accounts.
  • 21:53 UTC on 02 February 2026 – In parallel, we completed an additional workstream to ensure the triggering process could not recur.
  • 01:52 UTC on 03 February 2026 – We completed mitigation to impacted storage accounts.
  • 02:15 UTC on 03 February 2026 – Reviewed additional data and monitored downstream services to ensure that all mitigations were in place for all impacted storage accounts.
  • 06:05 UTC on 03 February 2026 – We concluded our monitoring and confirmed that all customer impact was mitigated.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.

  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: 
  • For more information on Post Incident Reviews, refer to 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, for broader guidance on preparing for cloud incidents, refer to 

January 2026

1/27

Preliminary Post Incident Review (PIR) - Azure OpenAI Service - Issues in Sweden Central (Tracking ID BMVM-7_G)

During this incident, we temporarily used our public Azure Status page because it was taking too long to determine exactly which customers, regions, or services were affected. Since the issue has been mitigated and customers who were impacted can read relevant communications in Azure Service Health, this public post will be removed once the final PIR is published. For details on how this public Status page is used, refer to 

What happened?

Between 09:22 UTC and 16:12 UTC on 27 January 2026, and again between 11:14 UTC and 13:35 UTC on 29 January 2026, a platform issue resulted in an impact to the Azure OpenAI Service in the Sweden Central region. Impacted customers may have experienced HTTP 500/503 errors, failed inference requests, and/or issues with model deployment metadata. This issue also affected the Agent Service and other downstream AI Services dependent on Azure OpenAI in this region.

What went wrong and why?

The Azure OpenAI service uses Azure Managed Redis to cache metadata used when serving incoming requests. Cache entries expiring and being re-fetched from the metadata service is a normal part of operations.

During this incident, however, certain cache entries with larger-than-usual payloads and higher usage triggered a faster refresh cycle than normal. The resulting spike in requests to Redis slowed cache responses significantly, leading to request timeouts.

As a result, the Azure OpenAI service fell back to heavily querying the backend that stores the metadata, which eventually became overwhelmed. This caused the customer availability impact.
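This pattern is essentially a cache stampede: when a hot entry expires or refreshes aggressively, many requests fall through to the backing store at once. One of the repair items listed below limits cache refresh concurrency; a common way to express that idea is a 'single-flight' refresh, sketched here as an illustrative example (not the service's actual implementation).

```python
import threading
import time

_cache = {}            # key -> (value, expiry_timestamp)
_refresh_locks = {}    # key -> lock guarding refresh of that key
_locks_guard = threading.Lock()

def get_with_single_flight(key, fetch_from_backend, ttl=30.0):
    """Return a cached value, letting only one caller refresh an expired entry.

    Other callers reuse the stale value (if any) instead of piling onto the backend,
    which is the behavior that was missing during the refresh spike described above.
    """
    now = time.time()
    entry = _cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                 # fresh cache hit

    with _locks_guard:
        lock = _refresh_locks.setdefault(key, threading.Lock())

    if lock.acquire(blocking=False):
        try:
            value = fetch_from_backend(key)
            _cache[key] = (value, time.time() + ttl)
            return value
        finally:
            lock.release()

    # Another caller is already refreshing this key.
    if entry:
        return entry[0]                 # serve slightly stale data instead of hammering the backend
    with lock:                          # no cached value at all: wait for the in-flight refresh
        refreshed = _cache.get(key)
        return refreshed[0] if refreshed else fetch_from_backend(key)
```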

How did we respond?

  • 09:22 UTC on 27 January 2026 – The issue was detected through service monitoring; this is also when customers began to experience intermittent availability issues.
  • 12:36 UTC on 27 January 2026 – Initiated mitigation to restart the IRM service on the Sweden Central clusters.
  • 12:46 UTC on 27 January 2026 – Identified that the Sweden Central cluster was seeing pods crash with out-of-memory errors.
  • 13:02 UTC on 27 January 2026 – Initiated mitigation workflow by scaling out nodes in the cluster to improve request handling and resilience.
  • 15:30 UTC on 27 January 2026 – Started to increase the memory available in the pods to alleviate memory load on the cluster.
  • 15:53 UTC on 27 January 2026 – Completed increase in memory in the pods to alleviate memory load on the cluster.
  • 16:12 UTC on 27 January 2026 – Service(s) restored, and customer impact mitigated.


  • 11:14 UTC on 29 January 2026 – Customer impact began again.
  • 11:21 UTC on 29 January 2026 – Our internal service monitoring observed an increase in error messages, reduced service performance, and occasional interruptions in results, prompting our team to investigate.
  • 12:07 UTC on 29 January 2026 – We identified that the cause of the issue appeared to be system instability and memory pressure.
  • 12:34 UTC on 29 January 2026 – To address the issue, we increased memory and pod capacity on the backend cluster supporting Sweden Central, which restored service availability and stabilized request handling.
  • 13:35 UTC on 29 January 2026 – Services restored, and customer impact mitigated.

How are we making incidents like this less likely or less impactful?

  • We have increased memory limits for service pods hosting the metadata services, to improve stability and prevent out-of-memory crashes. (Completed)
  • We have scaled out the metadata service instances, to handle increased load and improve request resilience. (Completed)
  • We have enhanced our monitoring and alerting across regions, to detect abnormal memory usage patterns more quickly. (Completed)
  • We have initiated detailed investigations to identify and fix the factors contributing to the underlying resource exhaustion issues. (Estimated completion: February 2026)
  • We are developing automation scripts and systems for regular pod management and mitigation, to maintain service stability. These measures collectively aim to improve service resilience and reduce both the frequency and severity of similar incidents in the future. (Estimated completion: February 2026)
  • We are rolling out an optimization to cache refresh concurrency behavior that will limit load on backend services while improving performance. (Estimated completion: February 2026)

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR to Azure Service Health with additional details.


How can customers make incidents like this less impactful?

  • Consider managing endpoints in multiple regions, to enable failover in case the primary region is ever unavailable (see the sketch after this list). For guidance on Business Continuity and Disaster Recovery (BCDR) scenarios, review 
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: .
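As a minimal illustration of the multi-region pattern in the first bullet, a client can fall back to a secondary region when the primary returns errors. Everything below is a hedged sketch: the endpoints, deployment name, environment variables, and API version are placeholders, and it assumes the openai Python package with Azure support.

```python
# Sketch only: endpoints, deployment name, env vars, and api_version are placeholders.
import os

from openai import APIConnectionError, APIStatusError, AzureOpenAI

REGIONS = [
    {"endpoint": "https://example-swedencentral.openai.azure.com", "key_env": "AOAI_KEY_PRIMARY"},
    {"endpoint": "https://example-westeurope.openai.azure.com", "key_env": "AOAI_KEY_SECONDARY"},
]

def chat_with_failover(messages, deployment="gpt-4o"):
    """Try the primary region first; on connection or server errors, fail over to the next."""
    last_error = None
    for region in REGIONS:
        client = AzureOpenAI(
            azure_endpoint=region["endpoint"],
            api_key=os.environ[region["key_env"]],
            api_version="2024-06-01",
        )
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except (APIConnectionError, APIStatusError) as err:
            last_error = err   # e.g. HTTP 500/503 during a regional incident
    raise last_error
```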

How can we make our incident communications more useful?

December 2025

12/22

Post Incident Review (PIR) – Entra Privileged Identity Management – Customers experiencing API failures (Tracking ID FV31-PQG)

Watch our 'Azure Incident Retrospective' video about this incident:

What happened?

Between 08:05 UTC and 18:30 UTC on 22 December 2025, the Microsoft Entra Privileged Identity Management (PIM) service experienced issues that impacted role activations for a subset of customers. Some requests returned server errors (5xx) and, for a portion of customers, some requests returned unexpected client errors (4xx). This issue manifested as API failures, elevated latency, and various activation errors.

Impacted actions could have included:

  • Reading eligible or active role assignments
  • Eligible or active role assignment management
  • PIM Policy and PIM Alerts management
  • Activation of eligible or active role assignments
  • Deactivation of role assignments initiated by the customer (deactivations triggered by the expiration of a previous activation were not impacted)
  • Approval of role assignment activation or extension

Customer-initiated PIM operations from various Microsoft management portals, the mobile app, and API calls were likely impacted. During this incident period, operations that were retried may have succeeded.

What went wrong and why?

Under normal conditions, Privileged Identity Management processes requests—such as role activations—by coordinating activity across its front-end APIs, traffic routing layer, and the backend databases that store role activation information. These components work together to route requests to healthy endpoints, complete required checks, and process activations without delay. To support this flow, the system maintains a pool of active connections for responsiveness and automatically retries brief interruptions to keep error rates low, helping ensure customers experience reliable access when reading, managing, or activating role assignments.

As part of ongoing work to improve how the PIM service stores and manages data, configuration changes which manage the backend databases were deployed incrementally using safe deployment practices. The rollout progressed successfully through multiple rings, with no detected errors and all monitored signals remaining within healthy thresholds.

When the deployment reached a ring operating under higher workload, the additional per-request processing increased demand on the underlying infrastructure that hosts the API which manages connections to the backend databases. Although the service includes throttling mechanisms designed to protect against spikes in API traffic, this scenario led to elevated CPU utilization without exceeding request-count thresholds, so throttling did not engage. Over time, the sustained load caused available database connections to be exhausted, and the service became unable to process new requests efficiently. This resulted in delays, timeouts, and errors for customers attempting to view or activate privileged roles.
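To make the failure mode concrete: a throttle keyed only on request counts cannot see a workload whose cost per request has grown, so a bounded connection pool can still be exhausted at a 'normal' request rate. The sketch below is illustrative only (it is not PIM's implementation) and shows both pieces side by side.

```python
import threading
import time

class RequestCountThrottle:
    """Admits requests while the count in the current window stays under a limit.

    The blind spot illustrated by this incident: if each request becomes more
    expensive (higher CPU, longer-held connections), the count never trips even
    as the backend saturates.
    """
    def __init__(self, max_requests_per_window, window_seconds=1.0):
        self.max_requests = max_requests_per_window
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0
        self._lock = threading.Lock()

    def admit(self):
        with self._lock:
            now = time.monotonic()
            if now - self.window_start >= self.window:
                self.window_start, self.count = now, 0
            if self.count >= self.max_requests:
                return False
            self.count += 1
            return True

class BoundedConnectionPool:
    """A fixed number of database connections; callers time out when all are held."""
    def __init__(self, size, acquire_timeout=5.0):
        self._slots = threading.BoundedSemaphore(size)
        self.acquire_timeout = acquire_timeout

    def acquire_slot(self):
        if not self._slots.acquire(timeout=self.acquire_timeout):
            raise TimeoutError("connection pool exhausted")  # surfaces as timeouts/errors to callers

    def release_slot(self):
        self._slots.release()
```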

How did we respond?

  • 21:36 UTC on 15 December 2025 – The configuration change deployment was initiated.
  • 22:08 UTC on 19 December 2025 – The configuration change was progressed to the ring with the heavy workload.
  • 08:05 UTC on 22 December 2025 – Initial customer impact began.
  • 08:26 UTC on 22 December 2025 – Automated alerts were received for an intermittent, low volume of errors, prompting us to start investigating.
  • 10:30 UTC on 22 December 2025 – We attempted isolated restarts on impacted database instances in an effort to mitigate low-level impact.
  • 13:03 UTC on 22 December 2025 – Automated monitoring alerted us to elevated error rates and a full incident response was initiated.
  • 13:22 UTC on 22 December 2025 – We identified that calls to the database were intermittently timing out. Traffic volume appeared to be normal with no significant surge detected however we observed the spike in CPU utilization.
  • 13:54 UTC on 22 December 2025 – Mitigation efforts began, including beginning to scale out the impacted environments.
  • 15:05 UTC on 22 December 2025 – Scale-out efforts were observed to decrease error rates but did not completely eliminate failures. Further instance restarts provided temporary relief.
  • 15:25 UTC on 22 December 2025 – Scaling efforts continued. We engaged our database engineering team to help investigate.
  • 16:37 UTC on 22 December 2025 – While we had not yet correlated the deployment with this incident, we initiated a rollback of the configuration change.
  • 17:20 UTC on 22 December 2025 – Scale-out efforts completed.
  • 17:45 UTC on 22 December 2025 – Service availability telemetry was showing improvements. Some customers began to report recovery.
  • 18:30 UTC on 22 December 2025 – Customer impact confirmed as mitigated, after rollback of configuration change had completed and error rates had returned to normal levels.

How are we making incidents like this less likely or less impactful?

  • We have rolled back the problematic configuration change across all regions. (Completed)
  • For outages that manifest later as a result of configuration updates, we are developing a mechanism to help engineers correlate these signals more quickly. (Estimated completion: January 2026)
  • We are working to ensure this configuration change will not inadvertently introduce excessive load before we redeploy it. (Estimated completion: January 2026)
  • We are working on updating our auto-scale configuration to be more responsive to changes in CPU usage. (Estimated completion: January 2026)
  • We are enabling monitoring and runbooks for available database connections to respond to emerging issues sooner. (Estimated completion: February 2026)

How can customers make incidents like this less impactful?

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

12/8

Post Incident Review (PIR) – Azure Resource Manager – Service management failures affecting Azure Government (Tracking ID ML7_-DWG)

What happened?

Between 11:04 and 14:13 EST on 08 December 2025, customers using any of the Azure Government regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure App Service, Azure Backup, Azure Communication Services, Azure Data Factory, Azure Databricks, Azure Functions, Azure Kubernetes Service, Azure Maps, Azure Migrate, Azure NetApp Files, Azure OpenAI Service, Azure Policy (including Machine Configuration), Azure Resource Manager, Azure Search, Azure Service Bus, Azure Site Recovery, Azure Storage, Azure Virtual Desktop, Microsoft Fabric, and Microsoft Power Platform (including AI Builder and Power Automate).

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM performs authorization for these operations based on authorization policies, stored in Cosmos DB accounts that are replicated to all regions. On 08 December 2025, an inadvertent automated key rotation caused ARM to fail to fetch the authorization policies needed to evaluate access. As a result, ARM was temporarily unable to communicate with underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. This issue surfaced as authentication failures and 500 Internal Server errors to customers across all clients. Because the content of the Cosmos DB accounts for authorization policies is replicated globally, all regions within the Azure Government cloud were affected.

Microsoft services use an internal system to manage keys and secrets, which also makes it easy to perform regular needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the Microsoft internal service that governs the account key lifecycle, but were unintentionally configured with automatic key rotation enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.
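For context on why an unplanned rotation is disruptive for a key used in ‘manual mode’: clients hold the account key directly, so regenerating it invalidates every request signed with the old key until clients pick up the new one. The sketch below is a hedged illustration with the azure-cosmos SDK; the account URL, database and container names, and secret-store lookup are hypothetical, and this is not ARM's actual code.

```python
# Sketch only: requires the azure-cosmos package; all names below are placeholders.
from azure.cosmos import CosmosClient
from azure.cosmos.exceptions import CosmosHttpResponseError

ACCOUNT_URL = "https://example-authz-policies.documents.azure.com:443/"

def load_current_key():
    """Placeholder for fetching the current account key from a secret store."""
    raise NotImplementedError

def read_policy(container_name, policy_id, partition_key):
    client = CosmosClient(ACCOUNT_URL, credential=load_current_key())
    try:
        return (client.get_database_client("authz")
                      .get_container_client(container_name)
                      .read_item(item=policy_id, partition_key=partition_key))
    except CosmosHttpResponseError as err:
        # After an uncoordinated key regeneration, requests signed with the old key are
        # rejected until the caller reloads the new key, which is roughly the failure
        # mode described above. A coordinated rotation moves traffic to the other key
        # before regenerating.
        print(f"Cosmos DB request failed with status {err.status_code}")
        raise
```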

Earlier on the same day of this incident, a similar key rotation issue affected services in the Azure in China sovereign cloud. Both the Azure Government and Azure in China sovereign clouds had their (separate but equivalent) keys created on the same day, back in February 2025, starting completely independent timers – so each was inadvertently rotated on its respective timer, approximately three hours apart. As such, the key used by ARM for the Azure Government regions was automatically rotated before the same key rotation issue affecting the Azure in China regions was fully mitigated. Although potential impact to other sovereign clouds was discussed as part of the initial investigation, we did not yet have a sufficient understanding of the inadvertent key rotation to prevent impact in the second sovereign cloud, Azure Government.

How did we respond?

  • 11:04 EST on 08 December 2025 – Customer impact began.
  • 11:07 EST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
  • 11:38 EST on 08 December 2025 – We began applying a fix for the impacted authentication components.
  • 13:58 EST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
  • 14:13 EST on 08 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

  • First and foremost, our ARM team have conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
  • Our internal secret management system has paused automated key rotations for managed keys, until usage signals are made available on key usage – see the Cosmos DB change safety repair item below. (Completed)
  • We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
  • Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey:

12/8

Post Incident Review (PIR) – Azure Resource Manager – Service management failures affecting Azure in China (Tracking ID JSNV-FBZ)

What happened?

Between 16:50 CST on 08 December 2025 and 02:00 CST on 09 December 2025 (China Standard Time) customers using any of the Azure in China regions may have experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations attempted through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

Affected services included but were not limited to: Azure AI Search, Azure API Management, Azure App Service, Azure Application Insights, Azure Arc, Azure Automation, Azure Backup, Azure Data Factory, Azure Databricks, Azure Database for PostgreSQL Flexible Server, Azure Kubernetes Service, Azure Logic Apps, Azure Managed HSM, Azure Marketplace, Azure Monitor, Azure Policy (including Machine Configuration), Azure Portal, Azure Resource Manager, Azure Site Recovery, Azure Stack HCI, Azure Stream Analytics, Azure Synapse Analytics, and Microsoft Sentinel.

What went wrong and why?

Azure Resource Manager (ARM) is the gateway for management operations for Azure services. ARM performs authorization for these operations based on authorization policies, stored in Cosmos DB accounts that are replicated to all regions. On 08 December 2025, an inadvertent automated key rotation caused ARM to fail to fetch the authorization policies needed to evaluate access. As a result, ARM was temporarily unable to communicate with underlying storage resources, causing failures in service-to-service communication and affecting resource management workflows across multiple Azure services. This issue surfaced as authentication failures and 500 Internal Server errors to customers across all clients. Because the content of the Cosmos DB accounts for authorization policies is replicated globally, all regions within the Azure in China cloud were affected.

Azure services use an internal system to manage keys and secrets, which also makes it easy to perform regular needed maintenance activities, such as rotating secrets. Protecting identities and secrets is a key pillar in our Secure Future Initiative to reduce risk, enhance operational maturity, and proactively prepare for emerging threats to identity infrastructure – by prioritizing secure authentication and robust key management. In this case, our ARM service was using a key in ‘manual mode’, which means that any key rotations would need to be manually coordinated, so that traffic could be moved to use a different key before the key could be regenerated. The Cosmos DB accounts that ARM uses for accessing authorization policies were intentionally onboarded to the internal service which governs the account key lifecycle, but were unintentionally configured with automatic key rotation enabled. This automated rotation should have been disabled as part of the onboarding process, until such time as it was ready to be fully automated.

How did we respond?

  • 16:50 CST on 08 December 2025 – Customer impact began.
  • 16:59 CST on 08 December 2025 – Engineering was engaged to investigate based on automated alerts.
  • 18:37 CST on 08 December 2025 – We identified the underlying cause as the incorrect key rotation.
  • 19:16 CST on 08 December 2025 – We identified mitigation steps and began applying a fix for the impacted authentication components. This was tested and validated before being applied.
  • 22:00 CST on 08 December 2025 – We began to restart ARM instances, to speed up the mitigation process.
  • 23:53 CST on 08 December 2025 – Many services had recovered but residual impact remained for some services.
  • 02:00 CST on 09 December 2025 – All customer impact confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

  • First and foremost, our ARM team have conducted an audit to ensure that there are no other manual keys that are misconfigured to be auto-rotated, across all clouds. (Completed)
  • Our internal secret management system has paused automated key rotations for managed keys, until usage signals are made available on key usage – see the Cosmos DB change safety repair item below. (Completed)
  • We will complete the migration to auto-rotated Cosmos DB account keys for ARM authentication accounts, across all clouds. (Estimated completion: February 2026)
  • Our Cosmos DB team will introduce change safety controls that block regenerating keys that have usage, by emitting a relevant usage signal. (Estimated completion: Public Preview by April 2026, General Availability to follow by August 2026)

How can customers make incidents like this less impactful?

  • There was nothing that customers could have done to avoid or minimize impact from this specific service incident. 
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: 
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: 
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: