VMware snapshot of vSZ causing AP disconnects

  • Question
  • Updated 5 months ago
We have two Ruckus Virtual SmartZone (vSZ) controllers, both running 3.5.1.
We use Veeam for backup and replication. As part of the backup and replication process, each VM gets a VMware snapshot. The snapshot stays open for about 3 minutes and is then deleted.

Not every time, but quite often, the VMware snapshot process causes AP disconnects. The APs disconnect for 20-30 seconds and then reconnect. They do not restart; they just lose the connection to the controller for a short period of time.

Has anyone else seen this?

David Henderson

  • 92 Posts
  • 13 Reply Likes

Posted 5 months ago


Dave Watkins

  • 64 Posts
  • 13 Reply Likes
At a guess, you're seeing VM stun, either when the snapshot is created or, more likely, when it is consolidated after deletion. What version of VMware are you running? ESXi 6 had significant improvements in VM stun around snapshots.

The other factor affecting VM stun is the speed of your storage. The faster the storage, the smaller the effect.

David Henderson

  • 92 Posts
  • 13 Reply Likes
We are running ESXi 6, update 3a.
We are using a Nimble all-flash array in production, which has very high IOPS and very low latency.
I thought about stun as well, but it only takes a second to take a snapshot, and even deleting the snapshot, when consolidation occurs, only takes a second.

Under Events in the Ruckus controller I am seeing lots of "AP lost heartbeat" entries, which does make sense. My guess is that the AP loses the heartbeat to the controller for just a second or two, and I would not think that is long enough to cause AP disconnects. We have been running this setup for about 9 months, and it is only recently that we have seen this behavior. We were running Ruckus firmware 3.4.x for much of that time before upgrading to 3.5.0 and finally to 3.5.1, which is the latest.

David Henderson

  • 92 Posts
  • 13 Reply Likes
Here are the exact times from yesterday's snapshot, which resulted in a large number of AP disconnects:

Create virtual machine snapshot
Requested Start Time - 4:17:23
Start Time - 4:17:23
Completed Time - 4:17:24

Remove snapshot
Requested Start Time - 4:20:24
Start Time - 4:20:24
Completed Time - 4:20:26

When a VM gets a snapshot, or when a snapshot is removed, the VM is stunned for a period of time and no I/O happens. The snapshot took 1 second to create and 2 seconds to remove. Seeing "AP lost heartbeat" during this time did not surprise me, but a 1- or 2-second stun should not be long enough to cause an AP disconnect.
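
For anyone who wants to check the arithmetic, here is a minimal Python sketch that works out the durations from the task times listed above. The dictionary layout and the helper name `seconds_between` are just for illustration, and it assumes all four times fall on the same day. Note that these are task durations with one-second resolution, so they only approximate the actual stun time.

```python
from datetime import datetime

# Task times copied from the snapshot log above (time of day only).
events = {
    "Create virtual machine snapshot": ("4:17:23", "4:17:24"),
    "Remove snapshot": ("4:20:24", "4:20:26"),
}

def seconds_between(start, end):
    """Return whole seconds between two H:M:S times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

# Duration of each task (start to completion).
durations = {name: seconds_between(s, e) for name, (s, e) in events.items()}

# How long the snapshot stayed open: end of create to start of remove.
snapshot_open = seconds_between(
    events["Create virtual machine snapshot"][1],
    events["Remove snapshot"][0],
)

for name, secs in durations.items():
    print(f"{name}: {secs} s")
print(f"Snapshot open for: {snapshot_open} s")
```

This gives 1 second for the create task, 2 seconds for the remove task, and 180 seconds with the snapshot open, which matches the "about 3 minutes" mentioned in the original post.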