SmartZone 100 and R730 Reboots / Offline Issue

  • Question
  • Updated 1 month ago
  • Answered
Is anyone having issues with R730s randomly going offline and requiring a reboot to get the AP back online? We have been seeing this for about 30 days now, and support has been unable to tell us why the R730s are doing this. The APs are inaccessible most of the time when the issue happens, but sometimes we can still ping the AP and even SSH to the login prompt, where the admin account/password then fails. Once we power cycle the AP it works fine again. While the issue is happening the AP still accepts clients, but it has no network access, so those clients are broken. It is very frustrating and Ruckus support has been no help. The R730s are connected to Ruckus ICX 7650s via 5Gb multigig ports. The switches report no problems, and other APs on the same switch have no issue at the same time, so the problem is specific to individual access points. The issue is completely random; no pattern can be found, other than support telling us they are seeing AP kernel panics and that they can't tell us why or how to make them stop.

SmartZone 100 version is 5.1.2.0.302, which support had us upgrade to because they said it would fix the kernel panics. It has not.

R730 version -  5.1.2.0.373

A few of the R730s have not been able to recover from this issue after a reboot and have had to be RMA'd. Some of them will automatically reboot after 15-30 mins, but if we manually reboot them they typically come back online and work. Was curious if anyone else is experiencing this issue with R730s, SmartZone 100s, and ICX 7650s?


Kevin

  • 29 Posts
  • 7 Reply Likes
  • frustrated with support and Ruckus hardware

Posted 10 months ago


Mario

  • 11 Posts
  • 1 Reply Like

Hi Kevin, I have a similar situation. My AP and controller models are different, but the behavior is the same, and like you I could not get a diagnosis from Ruckus support.

Kevin

  • 29 Posts
  • 7 Reply Likes
Thanks Mario. Would you mind telling me the AP models and the AP and controller versions you are running? It may help me argue with support, as they have been very unhelpful and just keep asking for more and more logs. The logs are all the same and show their APs kernel panicking, after which the AP either locks up completely or reboots. Three times units have bricked themselves. When we ask for the failure reason on the RMA, we are told the units were destroyed and no reason can be given because we have to ask for one when we open the RMA. We do that and still don't get a reason. Horrible support.

Mario

  • 11 Posts
  • 1 Reply Like
Kevin, there are APs of different models in the network (R720, T310, among others), and the problem appears on all models with the same behavior you describe. We have tested with vSZ in the cloud and on-premise, and the problem continues. The version currently running is 5.1.2.0.302, but we have experienced the same with several firmware versions since the first 5.1.1 release.

Kevin

  • 29 Posts
  • 7 Reply Likes
Do you still have your case open with support, and have they provided you any guidance to fix the issue? We are getting nowhere with them.

Mario

  • 11 Posts
  • 1 Reply Like

I had one open for approximately 40 days. I was asked for the controller and AP logs, but the conclusion was that no fault was found, which is why we chose to migrate to an on-premise controller. The failure continued, though, and today I had to open a new case.

Kevin

  • 29 Posts
  • 7 Reply Likes
Please update if they provide anything that helps, as having random APs reboot, go offline, or die is extremely painful.

Mario

  • 11 Posts
  • 1 Reply Like
Of course, Kevin. Also, if you find anything, please update me. I have had this problem for a long time and have not been able to solve it. Thank you.

Kevin

  • 29 Posts
  • 7 Reply Likes
Definitely. I am 99.9% sure the problem is a memory leak in their 5.x code which Ruckus needs to fix. I think it has to do with WiFi 6 (11ax) connections: something the APs do with them leaks memory until they crash, but I have not been able to pinpoint it well enough to trigger the issue intentionally. Why support can't figure this out and make it stop is beyond me. All they do is request more and more logs and make us run packet captures. Happy to know someone else is experiencing the issue, but not happy to see how long you have been dealing with it without a fix.

Mario

  • 11 Posts
  • 1 Reply Like

That's right Kevin, I hope we have positive news soon to solve the issue, best regards.


Sven Kessler

  • 6 Posts
  • 3 Reply Likes
We have a similar issue with R730 APs. Everything works fine for some days, but then clients connected to these APs suddenly have no network access. Only an AP reboot fixes the issue for some time, until it happens again. We have APs on firmware 5.1.1.0.624 and vSZ-H on 5.1.1.0.598.
For now, we have replaced the APs with R510s and everything is working without issues.

Kevin

  • 29 Posts
  • 7 Reply Likes
Sven, do you have a case open with support? If you don't, could you please open one so they understand they are impacting multiple customers. They tried telling us we are the only customer with this issue. It now appears to affect anyone running SmartZone 5.x and APs of 610 and higher. I don't have the luxury of replacing all our brand new R730s with anything else, so I need Ruckus to fix the issue. I believe the issue is related to WiFi 6 (11ax), since more and more 11ax devices have been coming online in the past 30 days, especially with the iPhone 11. I think Ruckus has a memory leak that they don't know how to fix, and I have little faith they will fix this soon, as Mario posted about the problem 4 months ago and they still have not resolved it. Their customer support for issues like this is horrible.

Sven Kessler

  • 6 Posts
  • 3 Reply Likes
We do not have an open case regarding this right now, because I'm afraid the "standard solution steps" will take too much time compared to the outcome, so we are currently waiting for a fix, hopefully in the next firmware update.
But I get your point: if nobody opens a case, there will be no solution.
I will be on vacation next week and open a call after that.

Mario

  • 11 Posts
  • 1 Reply Like
Hi Sven,

Thank you very much for the contribution. I agree with Kevin: we hope you will help us by opening a case with Ruckus support, since that will show that it is a general problem and will help get a quick solution.


Thank you,

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
Mario, Kevin, Sven, please tell me your case numbers.  We ought to be able to collect logs and AP support info to identify the cause of any problem(s), especially if you see it happening frequently.

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
Mario, I saw your ticket 985751, and it says you performed an SZ 5.1.2.0.302 upgrade.  Please let us know if you see another situation, and then try to grab AP support info and SZ logs for tech support, thanks!

Kevin

  • 29 Posts
  • 7 Reply Likes
Ticket 965817 - as of late yesterday our issue has finally been escalated, and we are scheduled to work with an escalation engineer this afternoon who is planning to enable some additional debugging/logging on the R730s and SmartZone. Hoping they are able to figure this out soon, as we still have R730s randomly rebooting. I will post updates as we make progress so others hopefully don't have to deal with this issue.

Mario

  • 11 Posts
  • 1 Reply Like
Best regards to all,

I have not shared any information yet because the WiFi network is at a university. There was a long holiday weekend in Colombia and nobody was on campus, so it is not yet possible to draw conclusions about the service.


Thank you

Kevin

  • 29 Posts
  • 7 Reply Likes
R730, reboot, kernel panic, apHeartbeatLost, smartzone - Just adding some of the keywords I was using to search for others with this problem, so that anyone else with this random access point issue might add a comment and Ruckus is aware of all the customers impacted by it.

ian johnson

  • 2 Posts
  • 0 Reply Likes
This is very interesting. I have 3 R730s: two show an uptime of 38 days, and one restarts daily. I've been meaning to troubleshoot, and this thread prompts me to do that sooner.

Kevin

  • 29 Posts
  • 7 Reply Likes
Thanks Ian. Please open a case with them and tell them to reference my case 965817 for details. They are now trying to capture the APs' memory and CPU state with a custom script prior to the reboots. Hopefully the more customers they see opening cases for R730s rebooting, the sooner they will figure out the issue and fix it. We ran R610s and R710s for 3 years without ever experiencing a random reboot. We were also running ZoneDirector instead of SmartZone back then, so I am not sure which is the actual culprit.
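For anyone who wants to watch for the suspected leak themselves while waiting on support, here is a rough Python sketch of the idea behind that kind of script: fit a straight line through periodic memory readings and estimate how long until the AP runs out. It is not the script Ruckus support uses. The sample format (seconds since boot, used memory in KB) and the function names are assumptions for illustration, and how you collect the readings from the AP is left out.

```python
from statistics import mean


def leak_slope(samples):
    """Least-squares slope of used memory, in KB per second.

    samples: list of (seconds_since_boot, used_kb) tuples (assumed format).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [t for t, _ in samples]
    ys = [kb for _, kb in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("timestamps must not all be equal")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / denom


def seconds_until_exhausted(samples, total_kb):
    """Rough time left before used memory reaches total_kb.

    Returns None when usage is flat or shrinking (no leak-like trend).
    """
    slope = leak_slope(samples)
    if slope <= 0:
        return None
    last_used_kb = samples[-1][1]
    return (total_kb - last_used_kb) / slope
```

A steadily positive slope across several APs would at least be consistent with a leak; a flat or noisy slope would point elsewhere.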

RF0V1K

  • 10 Posts
  • 3 Reply Likes
Hey Kevin, the R730 I was having trouble with was powered by a secondary switch which doesn't appear to have been able to provide enough power to keep the R730 happy. Moving it to my primary Juniper ex4300-48p has resolved my reboot issues. I know this doesn't help with your issue, but I wanted to follow up anyhow.

Leonardo Ferreira

  • 1 Post
  • 0 Reply Likes
I am also having some problems with the R730 AP.
I have two separate sites with the following problems:
At the first site, with the same vSZ version and the same firmware, the graph on the access point's Traffic tab constantly shows clients disconnecting from and reconnecting to the AP, which looks to me like a false positive.
At the second site, on vSZ-H 5.1.1 with APs on 5.1.1, all 2.4 GHz clients are disconnected and reconnected; each drop lasts at most 3 seconds.
In neither case did I get a solution from support, just log collection, more log collection, and no fix.

Mario

  • 11 Posts
  • 1 Reply Like
Hello Leonardo,

It is indeed a situation similar to everyone else's in this thread. If the update we are doing works, I will share it so you can see whether it solves your problem. If you have an open case with Ruckus, please share it, since Michael Brado is collecting the cases for support to analyze.

Thank you

Malcolm Chai

  • 1 Post
  • 0 Reply Likes
We are running vSZ-H 5.1.1.0.589 and R730 5.1.1.0.3028

We had a known memory leak causing APs to reboot randomly.  They just had a temp patch fix about a week ago.

Now we are dealing with random disconnecting.  Student and staff machines will have either a true disconnect from the AP, or will be connected to the AP but not be able to access the internet.  Still working with Ruckus on this.



We are still working on this as we are not sure if this is a compatibility issue with new hardware or another problem in the AP.  


Please continue to update as I am interested in seeing how everyone's problems get resolved.


Malcolm

Mario

  • 11 Posts
  • 1 Reply Like

Best regards to all,

I can report that with version 5.1.2.0.373 on the APs and 5.1.2.0.302 on the vSZ the service has been stable. If you want, you can try updating to these versions and tell us how it goes.

 Happy day.

Kevin

  • 29 Posts
  • 7 Reply Likes
They provided us 5.1.2.0.1013 for the APs on Wednesday, and we are hoping that fixes our AP kernel panic issues. I believe there are still 2 other issues we are aware of that Ruckus support is trying to find a fix for: apHeartbeatLost alerts that are not accurate, and APs going offline while they still have connectivity to the SmartZone, per the debug logs they were able to capture on Wednesday. We did not have any reboots yesterday while running SmartZone 5.1.2.0.302 and AP 5.1.2.0.1013. We were definitely having problems with AP version 5.1.2.0.373.
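A side note for anyone tracking which AP build is on which unit: these build strings do not sort correctly as plain text, because "1013" compares lower than "373" character by character. A small Python sketch that compares them field by field is below. It assumes a higher number in each field means a later build, which is my assumption and not something Ruckus has confirmed; the function names are made up for the example.

```python
def parse_build(version):
    """Split a dotted build string such as '5.1.2.0.373' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_newer(candidate, installed):
    """True if candidate orders after installed when compared field by field."""
    return parse_build(candidate) > parse_build(installed)


# Build strings quoted in this thread.
builds = ["5.1.2.0.1013", "5.1.1.0.624", "5.1.2.0.373"]
print(sorted(builds, key=parse_build))
print(is_newer("5.1.2.0.1013", "5.1.2.0.373"))
```

Sorting with `key=parse_build` puts 5.1.2.0.373 before 5.1.2.0.1013, which a plain string sort would get backwards.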

Mark Channer

  • 1 Post
  • 0 Reply Likes
Any updates with the latest firmware version on this issue? I'm delaying my purchases until I know the 730 is working as expected. Thanks and sorry for the issues all of you have experienced with these APs.

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
Have you tried SZ 5.1.2.0.302 (MR2), like Mario above?

Kevin

  • 29 Posts
  • 7 Reply Likes
Yes, and that's when the issues seemed to get even worse. They provided us an AP patch firmware version 5.1.2.0.1013 Wednesday, which we applied to all our APs, and yesterday we had NO APs reboot but still had several "apHeartbeatLost" alerts. They were also able to capture additional debug logs on Wednesday from an AP that went offline and confirmed that it still had an SSH tunnel connected to the SmartZone, so it obviously had network connectivity. We have not been provided a fix for that issue but were told that 5.1.2.0.1013 does fix one of the R730 kernel panic issues. We are just waiting for the offline issue or some other R730 problem to happen again so we can provide them more logs and hope they come up with a fix. We currently have SmartZone on 5.1.2.0.302 and all APs on 5.1.2.0.1013.

Kevin

  • 29 Posts
  • 7 Reply Likes
Adding an update that we are still having problems. The AP firmware 5.1.2.0.1013 they provided us seems to have fixed one of the AP kernel panic issues. We still have at least 2 other AP problems/bugs, which are only happening a few times per day. That is an improvement, but it is still horrible that we randomly have R730s go offline. We discovered another issue Friday, which Ruckus support/engineering believes is a problem in their current SmartZone code that causes their Cassandra database to error and keeps the control plane from responding properly to the AP heartbeats. We are still providing logs and packet captures daily to support, with no expectation of when they will fix all these bugs and provide us with a stable solution. Extremely frustrating and disappointing that Ruckus is not providing a stable WiFi solution and we are having to spend countless hours troubleshooting their product.

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
Hi Kevin, what's your current open case number please? I want to have a look (from the inside), to see what I can see for you.

Kevin

  • 29 Posts
  • 7 Reply Likes
Noted above but it is 965817

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
Thanks, bug ID = ER-7736, and I see Eng has a test bed and is reviewing logs collected.
Still active... did they ask about running a Debug image (that collects extra stuff) on at least a couple APs?

Kevin

  • 29 Posts
  • 7 Reply Likes
They did, and my management refused, because you have no way of predicting which AP will have the issue. Support also could not explain the impact of this debug image on our APs. All they could say was that they don't recommend putting it on all the APs, only on a few, which makes no sense when it is completely random which APs have the issue. Between not knowing the impact and being told the image has not been tested by your engineering team, my management refused to be your testers. This image could possibly make our WiFi perform even worse than it currently does. Every AP has, at some point over the past 6 weeks, had one or more of the 4 issues we have discovered, with no pattern detected so far, so putting an unknown image onto a few APs seemed pointless. The odds of guessing the correct AP were not worth the risk of running an unknown image.

Michael Brado, Official Rep

  • 3298 Posts
  • 523 Reply Likes
ACK your concerns. 


Typically an Engineering test build will turn on additional debugs, which might slow an AP processing a little.
Random / occasional problems are quite difficult to pinpoint sometimes.

Kevin

  • 29 Posts
  • 7 Reply Likes
Wanted to provide an update since this has been going on for over a month and Ruckus has only been able to fix 1 of what are now 5 issues/problems we are experiencing with their product. I hope people evaluating Ruckus see this post and are warned of what to expect from Ruckus support/engineering. They should not have products that kernel panic while they have no way of understanding why, or that take weeks to fix.
 
This is the only item they have fixed 
1) R730 - Kernel Panic - DNS issue >>>>[Resolved] - They fixed this with AP firmware 5.1.2.0.1013

These still have no fix and randomly happen daily
2) R730 - Kernel Panic - Soft Lockup - generates apHeartbeatLost alert and reboots after 1 minute

3) R730 - Kernel Panic - Watchdog Timeout - generates apHeartbeatLost alert and reboots after 1 minute

4) R730 - AP Hangs and no longer works. generates apHeartbeatLost and goes offline. It reboots itself after 30 mins

5) SmartZone Cassandra database issue - APs show offline in SZ - They believe this is a false positive and a bug, but no fix has been provided yet.
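Since the categories above differ mainly in how long the AP stays down, a rough Python sketch for sorting apHeartbeatLost events by downtime could look like the following. The 5-minute and 20-minute thresholds, the labels, and the function name are assumptions picked for illustration around the 1-minute and 30-minute behaviors listed above; they are not values from Ruckus.

```python
from datetime import datetime, timedelta


def classify_outage(lost_at, back_at,
                    fast_limit=timedelta(minutes=5),
                    hang_limit=timedelta(minutes=20)):
    """Label one outage by how long the AP was unreachable."""
    downtime = back_at - lost_at
    if downtime < timedelta(0):
        raise ValueError("back_at is earlier than lost_at")
    if downtime <= fast_limit:
        return "fast reboot (kernel-panic-like)"
    if downtime >= hang_limit:
        return "long hang before reboot"
    return "unclear"


# Hypothetical timestamps, not taken from real logs.
lost = datetime(2019, 11, 1, 9, 0, 0)
print(classify_outage(lost, lost + timedelta(minutes=1)))
print(classify_outage(lost, lost + timedelta(minutes=30)))
```

This only groups events for counting; it cannot tell a soft lockup from a watchdog timeout, which still needs the AP's own logs.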

Ruckus is building custom scripts daily, which never seem to work until after multiple attempts, to try to figure out the root of their problems. We definitely feel like we are acting as their QA department, finding bug after bug with the SmartZone 100 and R730s. Another issue is that all their escalation people seem to be on the West Coast, so we hear nothing until 1-2PM Eastern time, and then it is just a remote session to collect more and more logs with no resolution in sight.

(Edited)

Mark Rock

  • 1 Post
  • 0 Reply Likes
We have 3 R730s in our area along with R710, R700 and R500 APs. Two of the R730s in our main meeting area have started to reboot, run for about 5 to 10 minutes, and then lose heartbeat and reboot again. We are running 5.1.2.0.203 on our VM controller and 5.1.2.0.373 on the AP firmware. It seemed to start after we did the upgrade to 5.1.2.0.203. We have not opened a case yet, but will tomorrow.


Kevin

  • 29 Posts
  • 7 Reply Likes
Over 4 months and Ruckus has still not resolved their problems with the R730's and SmartZone 100.  We have helped them discover at least 7 bugs/issues between the SmartZone 100's and Ruckus R730's but they have only resolved 2 of those issues.   All seem to point to memory issues with Ruckus code and yet their escalated "tiger" engineering team has still yet to be able to fix the issues so we continually have AP's reboot daily.   Ruckus needs a new tag line as "Simply Better" is no longer accurate.

Daniel

  • 7 Posts
  • 0 Reply Likes
We are facing the same issue with two R730, with kernel panics and sudden reboots within a few minutes or hours. We opened a case already in August 2019.

Is there any update from any side? Kevin?

Kevin

  • 29 Posts
  • 7 Reply Likes
We still have the issues. They have tried several patch releases but now think the issue has something to do with the Qualcomm chipset used in the R730s and R750s. We provided them new debug-level output of the issues and are hoping they find something in it and can come up with another patch for us to try. It is absolutely ridiculous that this has been going on for almost 6 months with no end in sight. After having almost zero problems with the R610s and R710s, we have had nothing but issue after issue with the R730s, and we were told the R750s would have the same issues. To anyone trying to decide between Ruckus and another vendor: I would stay very far away from Ruckus R730s and R750s.
(Edited)

Daniel

  • 7 Posts
  • 0 Reply Likes
Hi everyone,
I just heard some rumours that Ruckus is about to release SmartZone version 5.2 within the next few days and that it should include firmware updates, especially for the WiFi 6 APs (even the not-certified R730 AP).

Has anyone else heard this?

Kevin

  • 29 Posts
  • 7 Reply Likes
We have been told the same about 5.2 being released very soon; however, we do not believe it will have the fixes for the kernel panic hardware issues we are still experiencing on the R730s. They also told us the same issue has been reported on the R750s. The last info we were given indicated it is a problem with the Qualcomm chipset, and we provided additional custom debug script data to Ruckus in the hope they can figure out the actual root of their problem. If 5.2 somehow fixes all our issues I will definitely update this post accordingly.

Daniel

  • 7 Posts
  • 0 Reply Likes
@Kevin, have you triple checked the power issues? https://forums.ruckuswireless.com/ruckuswireless/topics/high-drops-and-retries-on-r730

We are using Cisco SG350X PoE+ switches, and of course all APs and ports are set to provide 802.3at (PoE+) power. But those ports are explicitly NOT UPoE or a similar 60W standard.
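To make the power question concrete, here is a small Python sketch that compares an AP's worst-case draw against what each PoE standard guarantees at the powered device. The per-standard wattages are the usual IEEE 802.3af/at/bt figures; the 30 W draw in the example is a placeholder, not the R730's rating, so take the real number from the datasheet. The names are made up for the example.

```python
# Power guaranteed at the powered device (watts), after cable loss.
PD_POWER_W = {
    "802.3af": 12.95,       # PoE
    "802.3at": 25.5,        # PoE+
    "802.3bt-type3": 51.0,  # 60 W at the switch port
}


def has_headroom(standard, ap_draw_w, margin_w=0.0):
    """True if the standard delivers at least ap_draw_w plus margin_w."""
    return PD_POWER_W[standard] >= ap_draw_w + margin_w


placeholder_draw_w = 30.0  # hypothetical; replace with the datasheet value
for name in PD_POWER_W:
    print(name, has_headroom(name, placeholder_draw_w))
```

If the datasheet draw is above what PoE+ delivers, the AP may run in a reduced mode or become unstable on such ports, which would be worth ruling out before blaming firmware.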

Kevin

  • 29 Posts
  • 7 Reply Likes
Yes, and it would be really sad if it was a power issue considering we have all our R730s connected to Ruckus ICX 7650-48ZPs. The engineering team at Ruckus believes 5.2 will resolve some of our R730 issues, but not the kernel panics that appear to be a Qualcomm chipset issue. We provided new custom debug logs this week and hope they help them find the root cause. We still have R730s randomly reboot, or hang and then reboot, multiple times per day. No pattern can be found so far.

Daniel

  • 7 Posts
  • 0 Reply Likes
@Kevin: Do you have any update from your side regarding the R730?

Kevin

  • 29 Posts
  • 7 Reply Likes
The original 5.2 release was pulled as it was bricking some AP models (supposedly not the R730), so we waited. Ruckus provided us a custom 5.2.x build last week, and due to scheduling we are looking at implementing it on March 16, 2020. I will update this post later next week, but with everyone working remotely I am not sure we will experience all the same issues until we are back under full load.

Skye Moroney

  • 1 Post
  • 0 Reply Likes
Just wanted to add that we are experiencing the "R730 - AP Hangs and no longer works. It reboots itself after 30 mins" issue too. But we use Ruckus Cloud and Ruckus ICX7150-48Z switches instead of SmartZone.

Kevin

  • 29 Posts
  • 7 Reply Likes
Update - We had to cancel our upgrade for tonight because Ruckus discovered problems with the custom 5.2.x build they provided us, via another customer that tried it before us. We are back in a holding pattern waiting for them to provide a fix for the ton of bugs we have discovered with them. They also informed us that new bugs have been discovered on their ICX-7650-48ZPs which may also be contributing to our issues. Horrible how long this has been dragged out and that they can't make a stable product. Extremely frustrating.

Daniel

  • 7 Posts
  • 0 Reply Likes
Thanks for the update. Our R730s are in vacation mode as well, disconnected from the SZ. Fingers crossed. I am happy to help with any kind of testing too.

Sebastian Jansen

  • 2 Posts
  • 0 Reply Likes
@Kevin: Any news from your side? I am planning to give our R730 a chance for resurrection with latest 5.2.0.0.5030 patch available for download.

Kevin

  • 29 Posts
  • 7 Reply Likes
We upgraded to version 5.2.0.0.699 and then applied AP patch 5.2.0.0.5030 last week.  Problem is that 99% of my company is working from home so we have no actual way to confirm if these have resolved our issues.   We had about 16 discovered bugs/problems and we have not seen any random AP hangs/reboots since applying these versions.  No way to know for sure until we get everyone back in the office and have a couple weeks of actual user load.  I will update once that happens.

EightOhTwoEleven

  • 167 Posts
  • 39 Reply Likes
Same for us. We are running 5030 at most of our sites and have no idea if the changes made any difference. I do see in our RDD that some APs show an ever-growing number of associations, and it only goes back to normal once you reboot the AP. Not sure if that's a bug in RDD or if the APs are actually accumulating associations and essentially DoSing themselves.

Sanjay Kumar, Employee

  • 198 Posts
  • 74 Reply Likes
Hi,

The 5.2 GA release is out and available on the support site.

RF0V1K

  • 10 Posts
  • 3 Reply Likes
Upgraded vSZ-E and R730s to 5.2. Fingers crossed for some improvements. Upgrade was painless.

Daniel

  • 7 Posts
  • 0 Reply Likes
I checked the release notes. They quietly say on page 36 "ER-7674 - Resolved an AP reboot issue on 11ax AP models". I don't want to be negative, but I guess that refers to only one of the kernel panic issues mentioned here.

If "ER-7665 - Resolved an issue where clients intermittently failed to pass any traffic when connected to 802.11ax (R750/R730)AP's" is helping, I really cannot say.

Furthermore, the list of unsupported features on the R730 model is still very long! No beamforming, no MU-MIMO in downlink or uplink, no OFDMA, no cell sizing, no 160 MHz channels, and many others (which are not available on any of the other WiFi 6 (11ax) APs either, so I won't complain about those).

Jeronimo

  • 387 Posts
  • 49 Reply Likes
I have upgraded to version 5.2 and it seems to be working quite well.

Daniel

  • 7 Posts
  • 0 Reply Likes
There are rumours that much more WiFi 6 functionality incl. OFDMA and so on are coming with next minor SZ software release published after 5.2, so I guess something like 5.2.1 or so.

Keep fingers crossed.

Paul Ainslie

  • 8 Posts
  • 1 Reply Like
Hey folks, did GA 5.2 resolve the kernel panic issue?

Kevin

  • 29 Posts
  • 7 Reply Likes
GA 5.2 appears to have fixed our kernel panic issues, at least initially. I say initially because we are still 90% WFH because of COVID-19, so we do not have the normal load on our Ruckus gear. We should be getting more load over the next 2 months, and once things are back to normal load I will post whether everything still seems good or not.

Milan Gvozdenovic

  • 16 Posts
  • 1 Reply Like
Dear Kevin, I see that you have been dealing with this problem for 8 months now. Please update us when you can. I am very interested in seeing how this issue is going to be solved, since we are implementing R510, R610, R720, T310 and T710 APs on SZ124. I hope you will find a solution. Thank you!