How to Troubleshoot High CPU Utilization on the Ruckus vSZ-D Virtual Machine

We are currently running our environment on two virtual machines: Ruckus vSZ-E and vSZ-D.
According to the web UI, we're running the following:
Controller Version 3.5.1.0.296
Control Plane Software Version 3.5.1.0.205
AP Firmware Version 3.5.1.0.419

We have a total of 10 x R720 and R610 APs in 1 zone with 3 AP Groups. Not a large deployment. Our subject matter expert left the company recently and I'm trying to walk through the training/documentation to become familiar with Ruckus.

Recently, our vSZ-D instance has started running at 98% CPU for long periods of time. I'm at a loss as to where to begin troubleshooting this. I am able to SSH to the device and log in/enable.

I've tried a few things from various posts, as well as the tried-and-true graceful shutdown/restart. Could someone point me to a KB article or troubleshooting steps to start digging deeper?

Scott Crace

Posted 4 weeks ago


Andrew Giancola

Have you tried re-provisioning to tier-two specs (4 processors, 16 GB RAM)? That's where I had to go to get my CPU off the redline.

Andrew Giancola

This may not be helpful if you simply never noticed your vSZ-D was redlined (like me). I would personally start by configuring syslog and having support review your logs from the vSZ-D.

Scott Crace

I did look through the documentation for the recommended specs to compare against what's in our environment. I couldn't find anything specific to the vSZ-D, but I did use the specifications found for the overall release.

Per Ruckus documents for our release at the base Essentials install (1-2 nodes, 1-100 APs), they recommend:
100 GB HD, 2 vCPU, 13 GB RAM
Ours was:
10 GB HD, 8 vCPU, 13 GB RAM
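
For anyone wanting to repeat this comparison, here is a minimal sketch in Python. The numbers are copied from this post; the dictionary keys and the shortfalls helper are illustrative names, not anything from Ruckus tooling.

```python
# Values copied from the post; key names are illustrative assumptions.
recommended = {"disk_gb": 100, "vcpu": 2, "ram_gb": 13}
provisioned = {"disk_gb": 10, "vcpu": 8, "ram_gb": 13}


def shortfalls(provisioned, recommended):
    """Return {resource: (have, want)} for each resource below the recommendation."""
    return {
        key: (provisioned[key], want)
        for key, want in recommended.items()
        if provisioned[key] < want
    }


print(shortfalls(provisioned, recommended))  # {'disk_gb': (10, 100)}
```

By this check, only the disk was under the recommendation; vCPU and RAM already met or exceeded it.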

I did increase the HD size on the vSZ-D instance from 10 GB to 100 GB while it was gracefully shut down. Another forum article indicated the device should detect this additional space and start using it.

Appreciate the response.

Scott Crace

This is a somewhat recent development, though, as our VM environment sends notifications when CPU thresholds are crossed.

That being said, it doesn't appear to have been configured to forward syslogs. I will work on getting something set up to be able to review them.
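
If no syslog server is available yet, a throwaway receiver can be enough to confirm that messages arrive. The following is a minimal sketch, assuming the device is pointed at this host over UDP; the port 5514 and the names SyslogHandler and make_server are my own choices for illustration, and the vSZ-D side of the configuration is not shown here.

```python
import socketserver


class SyslogHandler(socketserver.BaseRequestHandler):
    """Record and print each UDP datagram together with the sender's address."""

    def handle(self):
        data, _sock = self.request  # for UDP servers, request is (bytes, socket)
        line = data.decode("utf-8", errors="replace").rstrip()
        self.server.messages.append((self.client_address[0], line))
        print(f"{self.client_address[0]} {line}")


def make_server(host="0.0.0.0", port=5514):
    """Create a UDP listener; 5514 is an assumed unprivileged port, not a default."""
    server = socketserver.UDPServer((host, port), SyslogHandler)
    server.messages = []  # collected (sender_ip, message) tuples
    return server
```

To run it, call make_server().serve_forever() and stop it with Ctrl+C. Listening on the standard syslog port 514 normally requires elevated privileges, which is why the sketch uses a high port instead.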

Scott Crace

As a follow-up: from the CLI on the vSZ-D virtual machine, running show stats while VMware is showing 98% only shows around 75%. The syslogs aren't showing anything out of the ordinary either.

I have opened a case and provided some initial information. I'll post more once I've worked with support a little more. However, I suspect the old 'upgrade the version' approach will be the recommended step. That may indeed solve the issue, but I wanted to find out why it suddenly started, especially since the solution hasn't changed much since the initial deployment.

JSo

I'm interested in hearing whether support has a solution to this issue. I've been told it is normal for the CPU to be at ~100%; that is how the Intel DPDK-based vSZ-D is supposed to work. I've been wondering what the risks are if you try to limit CPU resources in VMware. Especially in small, low-user-density networks, it feels like a bit of a waste of CPU resources.

Scott Crace

I got the same sort of response so you're not alone. Posting an overall reply as well.

Scott Crace

The final answer from support is that the VM is behaving as anticipated and the CPU is expected to be near 100% based on how the Intel DPDK poll mode driver performs.
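
To illustrate why a poll mode driver keeps a core busy even with no traffic, here is a small Python analogy (not DPDK code, and not how the vSZ-D is implemented). It compares the CPU time used by a loop that checks an empty queue continuously with one that blocks until something arrives; the function names are my own.

```python
import queue
import time


def busy_poll(q, duration):
    """Check the queue in a tight loop without sleeping; return CPU seconds used."""
    start_cpu = time.process_time()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        try:
            q.get_nowait()
        except queue.Empty:
            pass  # nothing arrived; check again immediately
    return time.process_time() - start_cpu


def blocking_wait(q, duration):
    """Block until an item arrives or the timeout expires; return CPU seconds used."""
    start_cpu = time.process_time()
    try:
        q.get(timeout=duration)
    except queue.Empty:
        pass
    return time.process_time() - start_cpu


idle = queue.Queue()  # never receives anything, like a link with no traffic
print(f"busy poll: {busy_poll(idle, 0.5):.3f} CPU s")
print(f"blocking:  {blocking_wait(idle, 0.5):.3f} CPU s")
```

On an otherwise idle machine, the polling loop should typically report CPU time close to the full half second, while the blocking wait should report close to zero. The polling approach trades CPU for lower latency, which would be consistent with a hypervisor seeing the core as nearly 100% busy regardless of load.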

It doesn't necessarily explain why VMware wasn't complaining about this previously, or what might suddenly cause it to start alarming if it had been running this way all along. We did observe that the VM stopped alarming for several hours after we migrated it to a different host because of maintenance activities on the original host. A few hours later, on the new host, it started alarming again.

I plan on dropping the priority of this in my workload but will keep fiddling and possibly escalate to VMware support. I expect their answer will be that it must be the lack of VMware Tools, or to consult Ruckus.