Smartzone 100 and R730 Reboots/Offline Issue

  • Question
  • Updated 4 days ago
  • Acknowledged
Is anyone having issues with R730s randomly going offline and requiring a reboot to get them back online? We have been seeing this issue for about 30 days now, and support has been unable to tell us why the R730s are doing this. The APs are inaccessible most of the time when the issue happens, but sometimes we can still ping the AP and SSH to the login prompt, where the admin account/password then fails. Once we power cycle the AP it works fine again. While the issue is happening the AP still accepts clients, but it has no network access, so those clients are broken. It is very frustrating and Ruckus support has been no help. The R730s are connected to Ruckus ICX 7650s via 5Gb multigig ports. The switches report no problems, and other APs on the same switch have no issue at the time, so the problem is specific to individual access points. The issue is completely random; no pattern can be found, other than support telling us they are seeing AP kernel panics and that they can't tell us why or how to make them stop.

Smartzone 100 version is 5.1.2.0.302, which support had us upgrade to because they said it would fix the kernel panics. It has not.

R730 version -  5.1.2.0.373

A few of the R730s have not been able to recover from this issue after a reboot and have had to be RMA'd. Some of them will automatically reboot after 15-30 mins, but if we manually reboot them they typically come back online and work. I was curious whether anyone else is experiencing this issue with R730s, Smartzone 100s, and ICX 7650s.
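
For anyone who wants to tell the "still answers on SSH but is otherwise dead" state described above apart from a fully unreachable AP, here is a minimal Python sketch that can run from a monitoring host. It is not anything Ruckus provides. The function name `tcp_port_open` and the 3-second default timeout are assumptions for illustration, and it only shows that the SSH port accepts a TCP connection, not that the login works.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `tcp_port_open("192.0.2.10")` (a documentation address standing in for a real AP) on a schedule and logging the result next to the controller's online/offline status would at least show which of the two states an AP was in when it dropped.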


Kevin

  • 15 Posts
  • 0 Reply Likes
  • frustrated with support and Ruckus hardware

Posted 1 month ago


Mario

  • 11 Posts
  • 1 Reply Like

Hi Kevin, I have a similar situation. My AP and controller models are different, but the behavior is the same, and like you I could not get a diagnosis from Ruckus support.

Kevin

  • 15 Posts
  • 0 Reply Likes
Thanks Mario. Would you mind telling me the AP models and the AP and controller versions you are running? It may help me argue with support, as they have been very unhelpful and just keep asking for more and more logs. The logs are all the same and show their APs having kernel panics, after which the AP either locks up completely or reboots. Three times the units have bricked themselves. When we ask what went wrong with the RMA'd units, we are told that the units were destroyed and that no reason can be provided because we have to ask for one when we open the RMA. We do that and still don't get a reason. Horrible support.

Mario

  • 11 Posts
  • 1 Reply Like
Kevin, our network has APs of different models (R720 and T310, among others), and the problem appears on all models with the same behavior you describe. We have tested with vSZ both in the cloud and on-premise, and the problem continues. The version currently in the group is 5.1.2.0.302, but we have experienced the same thing with several firmware versions since the first 5.1.1 release.

Kevin

  • 15 Posts
  • 0 Reply Likes
Do you still have your case open with support, and have they provided you any guidance to fix the issue? We are getting nowhere with them.

Mario

  • 11 Posts
  • 1 Reply Like

I had one open for approximately 40 days. I was asked for the controller and AP logs, but the conclusion was that no fault was found, which is why we chose to migrate to an on-premise controller. The failure continued, though, and today I had to open a new case.

Kevin

  • 15 Posts
  • 0 Reply Likes
Please update if they provide anything that helps, as having random APs reboot, go offline, or die is extremely painful.

Mario

  • 11 Posts
  • 1 Reply Like
Of course, Kevin. Also, if you find anything, please update me. I have had this problem for a long time and have not been able to solve it. Thank you.

Kevin

  • 15 Posts
  • 0 Reply Likes
Definitely. I am 99.9% sure the problem is a memory leak in their 5.x code which Ruckus needs to fix. I think it has to do with Wifi6 (11ax) connections and something they are doing that causes the APs to leak memory until they crash, though I have not been able to pinpoint it well enough to cause the issue intentionally. Why support can't figure this out and make it stop is beyond me. All they do is request more and more logs and make us run packet captures. Glad to know someone else is experiencing the issue, but not happy to see how long you have been dealing with it without a fix.

Mario

  • 11 Posts
  • 1 Reply Like

That's right Kevin, I hope we have positive news soon to solve the issue, best regards.


Sven Kessler

  • 5 Posts
  • 1 Reply Like
We have a similar issue with R730 APs. Everything works fine for some days, but then clients suddenly do not have network access when connected to these APs. Only an AP reboot fixes the issue for some time, until it happens again. We have APs on firmware 5.1.1.0.624 and vSZ-H on 5.1.1.0.598.
For now, we have replaced the APs with R510s and everything is working without issues.

Kevin

  • 15 Posts
  • 0 Reply Likes
Sven, do you have a case open with support? If you don't, could you please open one so they understand this is impacting multiple customers? They tried telling us we are the only customer with this issue. It now appears to affect anyone running Smartzone 5.x with APs of 610 and higher. I don't have the luxury of replacing all our brand new R730s with anything else, so I need Ruckus to fix the issue. I believe the issue is related to Wifi6 (11ax), since more and more 11ax devices have been coming online in the past 30 days, especially with the iPhone 11. I think Ruckus has a memory leak that they don't know how to fix, and I have little faith they will fix this soon, as Mario posted about the problem 4 months ago and they still have not resolved it. Their customer support for issues like this is horrible.

Sven Kessler

  • 5 Posts
  • 1 Reply Like
We do not have an open case for this right now, because I'm afraid the "standard solution steps" will take a lot of time for little outcome, so we are currently waiting for a fix, hopefully in the next firmware update.
But I get your point: if nobody opens a case, there will be no solution.
I will be on vacation next week and will open a case after that.

Mario

  • 11 Posts
  • 1 Reply Like
Hi Sven,

Thank you very much for the contribution. Like Kevin, I hope you can help us by opening a case with Ruckus support, since that will show that it is a general problem and should help get a quicker solution.


Thank you,

Michael Brado, Official Rep

  • 3049 Posts
  • 438 Reply Likes
Mario, Kevin, Sven, please tell me your case numbers.  We ought to be able to collect logs and AP support info to identify the cause of any problem(s), especially if you see it happening frequently.

Michael Brado, Official Rep

  • 3049 Posts
  • 438 Reply Likes
Mario, I saw your ticket 985751, and it says you performed an SZ 5.1.2.0.302 upgrade.  Please let us know if you see another situation, and then try to grab AP support info and SZ logs for tech support, thanks!

Kevin

  • 15 Posts
  • 0 Reply Likes
Ticket 965817. As of late yesterday our issue has finally been escalated, and we are scheduled to work with an escalation engineer this afternoon who is planning to enable some additional debugging/logging on the R730s and Smartzone. Hoping they are able to figure this out soon, as we still have R730s randomly rebooting. I will post updates here as we make progress so others hopefully don't have to deal with this issue.

Mario

  • 11 Posts
  • 1 Reply Like
Best regards to all,

I have not yet shared information because the WiFi network is at a university. There was a long holiday weekend in Colombia and nobody was working there, so it is not yet possible to draw conclusions about the service.


Thank you

Kevin

  • 15 Posts
  • 0 Reply Likes
R730, reboot, kernel panic, apHeartbeatLost, smartzone - Just adding some of the keywords that I was using to search for others with this problem, so that anyone else with this random access point issue can hopefully find this thread and add a comment, and Ruckus is aware of all the customers impacted by this issue.

ian johnson

  • 2 Posts
  • 0 Reply Likes
This is very interesting. I have 3 R730s: two display an uptime of 38 days, and one restarts daily. I've been meaning to troubleshoot, and this thread prompts me to do that sooner.

Kevin

  • 15 Posts
  • 0 Reply Likes
Thanks Ian. Please open a case with them and tell them to reference my case 965817 for details. They are now trying to capture the APs' memory and CPU state with a custom script prior to the reboots. Hopefully, the more customers they see opening cases for R730s rebooting, the sooner they will figure out the issue and fix it. We ran R610s and R710s for 3 years without ever experiencing a random reboot. We were also running ZoneDirector instead of Smartzone back then, so I am not sure which is the actual culprit.
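
The custom script support is using was not shared in this thread, so the following is only a rough Python sketch of the general idea: take memory readings at intervals and check whether free memory trends downward, which is what a leak would look like. It assumes the readings are already available as text in the Linux `/proc/meminfo` format (how to get them off an AP is not covered here), and the function names and sample numbers are made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:  <number> [unit]" lines.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def slope_per_sample(values: list) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


# Made-up readings: MemFree drops by 2000 kB between each sample.
samples = [
    f"MemTotal:  512000 kB\nMemFree:  {free} kB\n"
    for free in (200000, 198000, 196000, 194000)
]
free_kb = [parse_meminfo(s)["MemFree"] for s in samples]
trend = slope_per_sample(free_kb)
print(f"MemFree trend: {trend:.0f} kB per sample")  # MemFree trend: -2000 kB per sample
```

A steadily negative slope over many hours would support the memory leak theory; a flat line right up to a reboot would point somewhere else.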

RF0V1K

  • 6 Posts
  • 1 Reply Like
Hey Kevin, the R730 I was having trouble with was powered by a secondary switch which doesn't appear to have been able to provide enough power to keep the R730 happy. Moving it to my primary Juniper ex4300-48p for power has resolved my reboot issues. I know this doesn't help with your issue but wanted to follow up anyhow.

Leonardo Ferreira

  • 1 Post
  • 0 Reply Likes
I am also having some problems with the R730 AP.
I have two separate sites with the following problems:
At the first site, with the same vSZ version and the same firmware, the graph in the Traffic tab of the Access Point page constantly shows clients disconnecting from and reconnecting to the AP, which looks to me like a false positive.
At the second site, running vSZ-H with firmware 5.1.1 and APs on 5.1.1, all clients on 2.4 GHz get disconnected and reconnected; the drop lasts at most 3 seconds.
In neither case did support give me a solution, just log collection and more log collection.

Mario

  • 11 Posts
  • 1 Reply Like
Hello Leonardo,

It is indeed a situation similar to that of everyone in this thread. If the update we are doing works, I will share it to see if it solves your problem too. If you have an open case with Ruckus, please share it, since Michael Brado is collecting the cases so support can analyze them.

Thank you

Malcolm Chai

  • 1 Post
  • 0 Reply Likes
We are running vSZ-H 5.1.1.0.589 and R730 5.1.1.0.3028

We had a known memory leak causing APs to reboot randomly.  They just had a temp patch fix about a week ago.

Now we are dealing with random disconnecting.  Student and staff machines will have either a true disconnect from the AP, or will be connected to the AP but not be able to access the internet.  Still working with Ruckus on this.



We are still working on this as we are not sure if this is a compatibility issue with new hardware or another problem in the AP.  


Please continue to update as I am interested in seeing how everyone's problems get resolved.


Malcolm

Mario

  • 11 Posts
  • 1 Reply Like

Best regards to all,

I can tell you that with version 5.1.2.0.373 on the APs and 5.1.2.0.302 on the vSZ the service has been stable. If you want, you can try updating to these versions and tell us how it goes.

Have a good day.

Kevin

  • 14 Posts
  • 0 Reply Likes
They provided us 5.1.2.0.1013 for the APs on Wednesday, and we are hoping that fixes our AP kernel panic issues. I believe there are still 2 other issues we are aware of that Ruckus support is trying to fix: apHeartbeatLost alerts that are not accurate, and APs going offline while they still have connectivity to the Smartzone, per the debug logs support captured on Wednesday. We did not have any reboots yesterday while running Smartzone 5.1.2.0.302 and AP 5.1.2.0.1013. We were definitely having problems with AP version 5.1.2.0.373.
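
With several builds in play in this thread, one quick sanity check is confirming that every AP actually took the patch build. Below is a small Python sketch of that check. The inventory dict and AP names are hypothetical, and it assumes versions are plain dotted integers like the ones quoted here. It also shows why comparing the raw strings is misleading: as text, "5.1.2.0.373" sorts after "5.1.2.0.1013".

```python
def parse_version(version: str) -> tuple:
    """Turn '5.1.2.0.1013' into (5, 1, 2, 0, 1013) for numeric comparison."""
    return tuple(int(part) for part in version.strip().split("."))


def aps_below(inventory: dict, target: str) -> list:
    """Return the names of APs whose version is numerically below target."""
    wanted = parse_version(target)
    return sorted(
        name for name, ver in inventory.items() if parse_version(ver) < wanted
    )


# Hypothetical inventory; the version strings are the builds named above.
inventory = {
    "ap-lobby": "5.1.2.0.1013",
    "ap-lab": "5.1.2.0.373",
}
print(aps_below(inventory, "5.1.2.0.1013"))  # ['ap-lab']
print("5.1.2.0.373" < "5.1.2.0.1013")        # False: string order misleads
```

A numerically higher build number is not proof that a particular fix is included, so treat this as an inventory check only.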

Mark Channer

  • 1 Post
  • 0 Reply Likes
Any updates with the latest firmware version on this issue? I'm delaying my purchases until I know the 730 is working as expected. Thanks and sorry for the issues all of you have experienced with these APs.

Michael Brado, Official Rep

  • 3008 Posts
  • 424 Reply Likes
Have you tried SZ 5.1.2.0.302 (MR2), like Mario above?

Kevin

  • 10 Posts
  • 0 Reply Likes
Yes, and that's when the issues seemed to get even worse. They provided us an AP patch firmware, version 5.1.2.0.1013, on Wednesday, which we applied to all our APs, and yesterday we had NO APs reboot but still had several "apHeartbeatLost" alerts. They were also able to capture additional debug logs on Wednesday from an AP that went offline, and confirmed that it still had an SSH tunnel connected to the Smartzone, so it obviously had network connectivity. We have not been provided a fix for that issue but were told that 5.1.2.0.1013 does fix one of the R730 kernel panic issues. We are just waiting for the offline issue or some other R730 problem to happen again so we can provide more logs, and hoping they come up with a fix. We currently have Smartzone 5.1.2.0.302 and APs all on 5.1.2.0.1013.

Kevin

  • 15 Posts
  • 0 Reply Likes
Adding an update that we are still having problems. The AP firmware 5.1.2.0.1013 they provided us seems to have fixed one of the AP kernel panic issues. We still have at least 2 other AP problems/bugs. They are now only happening a few times per day, which is an improvement, but it is still horrible that we have R730s randomly going offline. We discovered another issue Friday, which Ruckus support/engineering believes is a problem in their current Smartzone code that causes their Cassandra database to error and keeps the control plane from responding properly to the AP heartbeats. We are still providing logs and packet captures to support daily, with no expectation of when they will fix all these bugs and provide us with a stable solution. Extremely frustrating and disappointing that Ruckus is not providing a stable WiFi solution and that we are having to spend countless hours troubleshooting their product.

Michael Brado, Official Rep

  • 3047 Posts
  • 435 Reply Likes
Hi Kevin, what's your current open case number please? I want to have a look (from the inside), to see what I can see for you.

Kevin

  • 15 Posts
  • 0 Reply Likes
Noted above but it is 965817

Michael Brado, Official Rep

  • 3047 Posts
  • 435 Reply Likes
Thanks, bug ID = ER-7736, and I see Eng has a test bed and is reviewing logs collected.
Still active... did they ask about running a Debug image (that collects extra stuff) on at least a couple APs?

Kevin

  • 15 Posts
  • 0 Reply Likes
They did, and my management refused, as you have no way of predicting which AP will have the issue. Support also could not explain the impact of this debug image on our APs. All they could say was that they don't recommend putting it on all the APs, only on a few, which makes no sense when it is completely random which APs have the issue. Without knowing the impact, and having been told the image has not been tested by your engineering team, my management refused to be your testers. This image could possibly make our wifi even worse than it currently is. Every AP has, at some point over the past 6 weeks, had one or more of the 4 issues we have discovered, with no pattern detected, so putting an unknown image onto a few APs seemed pointless. The odds of guessing the correct AP were not worth the risk of running an unknown image.
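
To put rough numbers on those odds: if failures really do hit APs uniformly at random and independently (an assumption, in line with the "no pattern" observation), the chance that at least one of m failures lands on one of k debug-image APs out of N is 1 - (1 - k/N)^m. A small Python sketch follows; the AP count, the number of debug APs, and the failure counts are purely hypothetical, since the thread does not give the size of the deployment.

```python
def p_debug_ap_hit(total_aps: int, debug_aps: int, failures: int) -> float:
    """Chance that at least one of `failures` lands on a debug-image AP.

    Assumes each failure picks an AP uniformly at random, independently
    of the others. That is a simplification, not a measured property.
    """
    if total_aps <= 0 or not 0 <= debug_aps <= total_aps:
        raise ValueError("need total_aps > 0 and 0 <= debug_aps <= total_aps")
    miss_one = 1 - debug_aps / total_aps
    return 1 - miss_one ** failures


# Hypothetical site: 100 APs, 3 of them on the debug image.
for failures in (1, 10, 30):
    print(failures, round(p_debug_ap_hit(100, 3, failures), 3))
```

With those made-up numbers the chance is roughly 3% for a single failure, 26% after 10 failures, and 60% after 30. Whether that is worth running an untested image is still a judgment call.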

Michael Brado, Official Rep

  • 3047 Posts
  • 435 Reply Likes
ACK your concerns. 


Typically an Engineering test build will turn on additional debugs, which might slow an AP processing a little.
Random / occasional problems are quite difficult to pinpoint sometimes.

Kevin

  • 15 Posts
  • 0 Reply Likes
Wanted to provide an update since this has been going on for over a month and Ruckus has only been able to fix 1 of what are now 5 issues/problems we are experiencing with their product. I hope people evaluating Ruckus see this post and are warned of what to expect from Ruckus support/engineering. They should not have products that kernel panic with no way of understanding why, or that take weeks to fix.
 
This is the only item they have fixed 
1) R730 - Kernel Panic - DNS issue >>>>[Resolved] - They fixed this with AP firmware 5.1.2.0.1013

These still have no fix and randomly happen daily
2) R730 - Kernel Panic - Soft Lockup - generates apHeartbeatLost alert and reboots after 1 minute

3) R730 - Kernel Panic - Watchdog Timeout - generates apHeartbeatLost alert and reboots after 1 minute

4) R730 - AP hangs and no longer works. Generates apHeartbeatLost and goes offline. It reboots itself after 30 mins

5) Smartzone Cassandra Database issue - APs show offline in SZ - They believe this is a false positive and a bug, but no fix provided yet.
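
Going by the list above, how long an AP stays down is a rough hint at which issue hit it: about 1 minute for the soft lockup and watchdog panics, about 30 minutes for the hang. Here is a minimal Python sketch of that triage, assuming the "heartbeat lost" and "back online" timestamps per AP are already available. The record layout, AP names, dates, and the 5-minute and 20-minute cutoffs are hypothetical placeholders, not anything taken from SmartZone.

```python
from datetime import datetime, timedelta


def classify_outage(lost: datetime, back: datetime) -> str:
    """Guess the issue type from how long the AP stayed down."""
    down = back - lost
    if down < timedelta(0):
        raise ValueError("recovery time is before loss time")
    if down <= timedelta(minutes=5):
        return "quick reboot (consistent with issues 2 and 3)"
    if down >= timedelta(minutes=20):
        return "long hang (consistent with issue 4)"
    return "unclear"


# Hypothetical events: (AP name, heartbeat lost, AP back online).
events = [
    ("ap-lobby", datetime(2019, 11, 4, 9, 0), datetime(2019, 11, 4, 9, 1)),
    ("ap-lab", datetime(2019, 11, 4, 13, 0), datetime(2019, 11, 4, 13, 31)),
]
for name, lost, back in events:
    print(name, "->", classify_outage(lost, back))
```

This cannot separate out issue 5, where the AP is reported offline but is actually still up; for that one the AP's own uptime would have to be checked as well.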

Ruckus is building custom scripts daily, which never seem to work until after multiple attempts, to try to figure out the root of their problems. We feel like we are acting as their QA department, finding bug after bug with the SmartZone 100 and R730s. Another issue is that all their escalation people seem to be on the West Coast, so we hear nothing until 1-2PM Eastern time, and then it is just a remote session to collect more and more logs with no resolution in sight.

(Edited)