Field Report / Hyperconverged Infrastructure

Two Years with Azure Local: Why I Chose Proxmox

April 2026 · Tags: proxmox, azure-local, ceph, hci, storage
I have been working in IT as a systems and network engineer since 2007, with virtualization being a core part of that work from early on. My first VMware deployment was in 2008, and over the years that expanded into hyperconverged infrastructure, starting with Storage Spaces Direct in 2018 and later Proxmox with Ceph in 2022. I say this not to establish credentials but to make the point that when I decided to evaluate Azure Local seriously, I was not going in blind or looking for reasons to fail it. I came in with direct S2D production experience, familiarity with the Microsoft HCI ecosystem, and a genuine interest in making it work. I gave it a real chance, with real hardware, real workloads, and real vendor and Microsoft engineering support. After two years, I made the decision to deploy Proxmox with Ceph instead. This is an account of why.

VMware with vSAN was never a candidate. The Broadcom acquisition and what followed needs no further explanation for anyone working in this industry.

Background: S2D Was Actually Fine

Before Azure Local there was Storage Spaces Direct, which I ran in production from 2018 onward with RDMA over Converged Ethernet. S2D as a storage architecture is technically sound: the write path, the cache-tier behavior with NVMe, SMB Direct over RoCEv2 giving you genuine kernel-bypass RDMA. There is real engineering in there, and the performance reflected it.

The important distinction is that S2D in that era was transparent because the CLI was the intended management path. PowerShell gave you real information and using it felt like working with the product, not around it. The failure modes were understandable and recoverable without vendor involvement for most issues.

Azure Local still runs S2D under the hood and the PowerShell surface for the storage layer still works if you go looking for it. But it is no longer how the product expects to be managed. The intended path is WAC and Arc, and dropping to PowerShell for troubleshooting increasingly feels like circumventing the product rather than operating it. The underlying storage engine did not regress. The management philosophy around it did.

What Changed with Azure Local

The hardware requirements are strict and deliberate. Azure Local only runs on certified nodes or integrated systems, with firmware versions validated by Microsoft. The installer validates the entire environment before proceeding and will refuse to continue if anything is out of spec. In theory this should mean every successfully deployed cluster is known-good. In practice it means that when things go wrong post-deployment, the operator has nothing to blame themselves for, and Microsoft’s engineers have nothing to point at either.

The core problem
The entire Azure control plane for an on-premises Azure Local cluster runs as a single VM on the cluster it manages. This VM runs Kubernetes internally and handles updates, RBAC, Arc connectivity, and the management layer for everything. If this VM is corrupted or enters a bad state during an update (which happens), the cluster loses its management plane entirely. The underlying Hyper-V and S2D may be perfectly healthy. Your data may be intact. But the answer from Microsoft engineering is still: redeploy.

Windows Admin Center is the primary management interface for hardware health and cluster status. It has been in preview since its introduction and remains there. When it works, the information it surfaces for drive health is minimal. When it does not work, which is often, you are operating blind.

The Update Problem

Azure Local updates are delivered through the Lifecycle Manager, orchestrated via Azure Arc. Firmware and driver updates are not bundled by Microsoft directly. Instead, hardware vendors provide what Microsoft calls a Solution Builder Extension, a vendor-supplied package containing firmware, drivers, and hardware-specific validation logic. Before the cluster will show a combined solution update as ready to install, the SBE package for your hardware must be obtained from the vendor and made available to the update pipeline. The cluster health checks gate the update on this being present and matched to your hardware model. Once the SBE is in place, the Lifecycle Manager orchestrates everything together: OS updates, agent updates, and the vendor firmware pack, as a single coordinated operation across all nodes.

In theory this is a sensible design. Firmware validation is the hardware vendor’s responsibility and tying it into the update readiness check means you cannot accidentally run an OS update on outdated firmware. In practice the SBE integration is another dependency in a chain that has many places to fail, and when it fails the error surface is no better than the rest of the update pipeline.

When an update fails, the error output is a stack trace. Not a meaningful one. A stack trace that told me nothing actionable and told the Microsoft engineers on the call nothing actionable either. The troubleshooting process for a failed update invariably followed this pattern: collect logs, send to Microsoft, wait, get a response that amounted to “we’re not sure, try this,” it does not work, escalate, engineer joins the call, spends time on the system, concludes the only path forward is to redeploy.

  • 4 full redeployments on the same hardware
  • 2 years of production evaluation
  • 0 successful update recoveries

Four redeployments on the same hardware. Each one preceded by Microsoft engineering involvement. Each one ending in the same conclusion. By the fourth, our vendor, who had been on every support call and had their own Microsoft relationship, stopped pushing back on the decision to move on. When the people who sell you the product stop arguing for it, that is information.

I was also active in the official Azure Local Slack community throughout this period. When I posted about the issues I was hitting, the responses were not troubleshooting suggestions or “have you tried this.” They were laugh reactions. Not because anyone found it funny, but because everyone in that community recognised the same wall. The laugh reaction is what you reach for when commiserating seriously about the same unfixed problems for the hundredth time stops feeling productive. That is not a healthy sign for a product’s community.

To be specific about what a redeploy means in this context: it is not a reinstall that preserves VM data. It is a complete teardown. VMs need to be backed up, cluster destroyed, nodes reimaged, cluster rebuilt, VMs restored. On production infrastructure this is a significant event, not a routine maintenance operation.

The Monitoring Gap That Actually Matters

Hardware failures are not a crisis in a well-designed cluster. They are an expected operational event that the architecture is built to tolerate. A drive fails, the cluster degrades gracefully, you get an alert, you order a replacement, you swap it. The failure itself is not the problem. Not knowing about it is.

Azure Local’s hardware monitoring relies on two surfaces: Windows Admin Center for cluster health, and the vendor BMC for hardware events. Neither caught a problem that cost us significantly in the form of undetected performance degradation.

After migrating to Proxmox and adding the drives from the old cluster as OSDs in Ceph, two Solidigm NVMe drives were flagged with slow OSD warnings immediately, literally at the moment of first add, before those OSDs had served a single client request. Ceph detected them during initial PG assignment and scrubbing. The latency on those drives was not marginally elevated. Operations were taking upward of 30 seconds on hardware where sub-millisecond is the baseline.

What Azure Local showed for those drives
Both drives: present, healthy, no errors. Lenovo XClarity reported no faults. Windows Admin Center showed no warnings. The BMC has no visibility into SSD internal performance state. It knows the drive exists and answers basic queries. An SSD with 30 second op latency looks identical to a healthy one from the BMC's perspective.
What Ceph showed
Slow OSDs detected. Yellow cluster health. Specific OSD IDs flagged. Latency statistics available immediately via ceph osd perf. Time to detection from first add: seconds.

Those drives ran in the Azure Local cluster for an unknown period in that degraded state. The cluster had no mechanism to detect it. VMs running on storage backed by those drives were experiencing elevated latency that had no visible cause in any monitoring surface. We were troubleshooting performance issues without knowing the underlying reason.

Ceph’s detection method is not magic. It works because the system continuously measures actual IO performance of every OSD and compares it against peers handling the same workload. A degraded drive cannot hide in that environment because the anomaly is immediately visible relative to baseline. This is a byproduct of Ceph’s architecture, not a special monitoring feature. The telemetry exists because the distributed system needs it internally to manage data placement and replication.
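Ceph surfaces this comparison directly through `ceph osd perf`. As a hedged illustration of the peer-comparison idea (a sketch of the logic, not Ceph's actual implementation), the detection can be expressed in a few lines of Python; the JSON shape below is modeled on `ceph osd perf -f json` output, and the field names and the threshold values are illustrative assumptions:

```python
# Sketch of peer-relative slow-OSD detection. The JSON layout mirrors
# `ceph osd perf -f json`; field names and thresholds are assumptions.
import json
from statistics import median

def flag_slow_osds(perf_json: str, factor: float = 10.0, floor_ms: float = 50.0):
    """Return IDs of OSDs whose commit latency is far above the peer median."""
    stats = json.loads(perf_json)["osd_perf_infos"]
    lats = {o["id"]: o["perf_stats"]["commit_latency_ms"] for o in stats}
    base = median(lats.values())
    # An OSD is "slow" when it exceeds both an absolute floor and a
    # multiple of what its peers are doing for the same workload.
    return sorted(i for i, l in lats.items() if l > floor_ms and l > factor * base)

sample = json.dumps({"osd_perf_infos": [
    {"id": 0, "perf_stats": {"commit_latency_ms": 1}},
    {"id": 1, "perf_stats": {"commit_latency_ms": 2}},
    {"id": 2, "perf_stats": {"commit_latency_ms": 30000}},  # the 30 s outlier
]})
print(flag_slow_osds(sample))  # -> [2]
```

A drive doing 30-second operations next to peers doing sub-millisecond ones is not a subtle anomaly; any peer-relative comparison finds it instantly, which is exactly why Ceph flagged those OSDs at first add.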

The Migration Incident

When migrating VMs off Azure Local, the reasonable assumption is that VM disk images live in the cluster storage directory. That is where you configured storage, that is where you expect your data. I copied the cluster storage directory and proceeded with decommissioning.

Three VMs had disk images stored outside the cluster storage path. Azure Local’s management layer had placed them elsewhere without surfacing this in any obvious way. We had backups, so no data was lost, but the situation illustrates the opacity problem clearly: a storage platform where the authoritative location for data is not actually authoritative is a platform you cannot reason about safely.

The Proxmox equivalent of this problem does not exist in the same way. Storage configuration is explicit, data locations are visible, and the relationship between a VM and its backing storage is transparent at every layer.

MAC Address Spoofing Simply Does Not Work

This one is worth its own section because it is not a performance limitation or an operational inconvenience. It is a hard architectural dead end that rules out entire categories of workloads.

MAC address spoofing on virtual NICs is a requirement for running any network appliance that needs to source traffic from a MAC address other than the one assigned by the hypervisor. Wireless LAN controllers, VRRP and CARP setups, certain VPN concentrators, and various routing appliances all depend on it. In Proxmox, enabling this on a vNIC is a checkbox and it works exactly as expected.

In Azure Local, it does not work. The reason is structural: Azure Local requires SR-IOV capable hardware as part of its certified node specification, and Switch Embedded Teaming (the virtual switch type used) with SR-IOV enabled is fundamentally incompatible with MAC address spoofing. Microsoft’s Hyper-V simply does not implement MAC spoofing on SR-IOV enabled virtual switches. The option may appear in the interface but enabling it silently falls back to software switching, and in practice the spoofed MAC never reaches the network correctly. This is a known and documented issue going back to Windows Server 2016 when SET switches were introduced, and it has not been fixed.

The practical consequence is that you cannot run a wireless LAN controller VM on Azure Local, or anything else that depends on MAC spoofing. There is no workaround that preserves SR-IOV, and disabling SR-IOV across the board is not a realistic option on certified hardware where it is expected to be active. The issue has now gone unfixed for the better part of a decade at the time of writing. For a platform marketed at enterprise infrastructure, that is a remarkable thing to leave broken.

Proxmox with Ceph: What the Same Hardware Delivers

Running Proxmox VE 9 with Ceph on the same all-NVMe hardware, with a tuned network configuration including jumbo frames, BBR congestion control, and a lossless fabric treatment using PFC on priority 3 with DSCP 26 marking via nftables:

Azure Local
  • Strong storage performance
  • Single control plane VM
  • Update failures unrecoverable
  • No drive performance monitoring
  • WAC perpetually in preview
  • Arc dependency for everything
Proxmox + Ceph
  • Strong storage performance
  • Fully distributed management
  • Rolling upgrades, zero downtime
  • Continuous OSD perf telemetry
  • Stable, visible management plane
  • No external dependencies

Azure Local’s storage performance was strong, as expected from SMB Direct over actual RDMA. Ceph’s RDMA messenger was experimental and has been effectively abandoned upstream. The performance gap between the two is real, and what can be done on the Ceph side is maximizing what TCP delivers, which is what the network tuning achieves.

The roughly 20% improvement from vanilla Ceph to the tuned configuration comes primarily from the lossless fabric treatment during saturation workloads. On all-NVMe OSDs, replication traffic during heavy rebalancing can genuinely approach link saturation on 100G LACP bonds. PFC preventing drops and ECN signalling early congestion translates into measurable throughput during those events, and into more consistent VM latency during recovery operations. In day-to-day operation the difference is not visible. The machinery sits idle waiting for congestion that does not occur. During rebalancing after an OSD failure, it earns its place.
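The tuning described above can be sketched roughly as follows. This is a hedged illustration, not the author's actual configuration (that is documented separately): interface and bond names, the Ceph OSD port range, the nftables chain layout, and the Mellanox tool invocations are all illustrative assumptions.

```shell
# Hedged sketch; names, ports, and values are placeholders.

# BBR congestion control and jumbo frames on the storage bond
sysctl -w net.ipv4.tcp_congestion_control=bbr
ip link set bond0 mtu 9000

# Mark Ceph OSD traffic with DSCP 26 (AF31) via nftables
nft add table inet mangle
nft add chain inet mangle output '{ type filter hook output priority mangle ; }'
nft add rule inet mangle output tcp sport 6800-7300 ip dscp set af31

# Trust DSCP on the NIC and enable PFC on priority 3 (lossless class)
mlnx_qos -i ens1f0 --trust dscp
mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0
```

PFC only works if the switch enables pause for the same priority end to end; a lossless class that stops at the first hop buys nothing.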

On Zero-Downtime Upgrades

I have upgraded production clusters across two major versions of Proxmox VE as rolling upgrades, one node at a time, with live migration handling VM continuity during each node’s maintenance window. The observable impact on running VMs is a brief pause during live migration, typically under a second. This includes clusters where the VPN appliance providing remote access was itself a VM on the cluster being upgraded, with no secondary access path available if something went wrong.
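The rolling procedure described above, reduced to one node's maintenance window, looks roughly like this. A hedged sketch, not a complete runbook: VM IDs and node names are placeholders, and each step assumes you wait for the cluster to settle before moving on.

```shell
# Hedged sketch of one node's window; repeat per node,
# waiting for HEALTH_OK between nodes.
ceph osd set noout                 # keep Ceph from rebalancing during the reboot
qm migrate 101 pve2 --online       # live-migrate each VM off the node
apt update && apt dist-upgrade -y  # apply the PVE and Ceph packages
reboot
# after the node rejoins and 'ceph -s' reports HEALTH_OK:
ceph osd unset noout
```

The `noout` flag is what keeps a planned reboot from being treated as a failure: the OSDs go down, but the cluster does not start re-replicating their data during the window.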

The upgrade process across PVE 7 to PVE 9 with Ceph running throughout has never produced a failure requiring intervention beyond the documented upgrade steps. Hardware problems have occasionally introduced complications, but the software upgrade path itself has been reliable across every deployment that went through it.

This is not a claim that Proxmox is infallible. It is a statement about the upgrade reliability of a platform I have run for over 15 years across multiple client environments, compared directly with a platform where upgrades on certified hardware with Microsoft engineering support consistently ended in redeploy.

Conclusions

Azure Local is not a bad product because of hype or tribal preference against Microsoft. It is a product with real architectural problems: a single-VM control plane that becomes an unrecoverable failure point, an update pipeline that cannot reliably complete on its own validated hardware, a monitoring stack that cannot detect drive performance degradation, and a management surface that has been in preview for years.

S2D as a storage architecture is genuinely capable. The RDMA performance advantage is real and the gap between it and Ceph on TCP is not trivial. If you can tolerate the operational model and the update risk, and if the performance ceiling justifies it, there are environments where it makes sense. Mine was not one of them after two years of evidence.

The performance delta I accepted when moving to Ceph is the right trade for a platform where I can detect degraded drives in seconds, upgrade without downtime, recover from any failure state without a vendor call, and operate with full visibility into every layer of the system. Hardware failures are workdays. Management plane failures that require complete redeployment are not.

Proxmox with Ceph is not the performance ceiling. It is the operational ceiling, the point where the platform stops being the variable and hardware becomes the only thing you need to think about. For most infrastructure operators, that is the right place to be.


Written from over 15 years of Proxmox production experience and 2 years of Azure Local production evaluation. Hardware: all-NVMe nodes, Mellanox ConnectX-6 DX NICs, Mellanox switches in CLAG. Network configuration documented separately at GitHub.