"heartbeats lost" after upgrade to Unleashed 200.7.10.2.339

I have a tiny network with 2 R600s in Unleashed mode.  Last night, I upgraded to the latest firmware (200.7.10.2.339), which apparently came out about a week ago. Since then, I've seen "heartbeats lost" messages multiple times.

Looking at the switch ports, I'm not losing link - the last up/down of the links was the reboot after the upgrade.

Also weird is that this is a 2 AP setup, and there are reports for both APs. How does an AP lose a heartbeat from itself?




Bway NOC


Posted 9 months ago


Albert Pierson, Employee

Hello Bway NOC,

The heartbeat starts from the non-master AP and is sent to the Master AP, which acts as the controller and responds back.  So you will see heartbeat loss messages on either side if one does not see the other's messages.  Usually a single lost heartbeat is a momentary loss of connectivity between the two devices; either the AP or (more likely) the Master AP may be too busy on its Ethernet or CPU to respond.  This could also be caused by flooding multicast messages.

If this is a continuous outage you will see an AP disconnect message on the Master AP.
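
To illustrate the difference between a momentary "heartbeat lost" and a sustained disconnect as described above, here is a minimal Python sketch. It is not Ruckus code: the interval, the miss threshold, and every name in it are assumptions made for illustration, since the real Unleashed timers are not stated in this thread.

```python
from dataclasses import dataclass, field

# Assumed values for illustration only; the actual Unleashed timers are unknown here.
HEARTBEAT_INTERVAL_S = 5.0
DISCONNECT_AFTER_MISSES = 3


@dataclass
class HeartbeatMonitor:
    """Hypothetical controller-side record of when each AP was last heard from."""
    last_seen: dict = field(default_factory=dict)

    def record(self, ap: str, now: float) -> None:
        # Called whenever a heartbeat from `ap` arrives.
        self.last_seen[ap] = now

    def status(self, ap: str, now: float) -> str:
        if ap not in self.last_seen:
            return "unknown"
        # Number of whole heartbeat intervals that have elapsed without a heartbeat
        # (no grace period is modelled).
        missed = int((now - self.last_seen[ap]) // HEARTBEAT_INTERVAL_S)
        if missed == 0:
            return "ok"
        if missed < DISCONNECT_AFTER_MISSES:
            return "heartbeat lost"
        return "disconnected"


monitor = HeartbeatMonitor()
monitor.record("ap-2", now=100.0)
print(monitor.status("ap-2", now=107.0))
```

In this model a busy controller that skips one or two replies produces only "heartbeat lost" entries, while a longer gap is reported as a disconnect.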

You could pull the support info diagnostic files from both APs, which report the CPU, memory and network statistics at that moment.

Hope this clarifies your question, even if it does not solve your issue.

Thanks for choosing Ruckus Networks Products.


hitesh patel

I've been experiencing this exact thing with my R600 APs right after upgrading the firmware to 200.7.

Prior to this update, I had no issues with this heartbeat "loss". It seems to happen at random intervals, with no apparent correlation to network load. The logs show the disconnect happening both when there is nearly zero traffic on the network and when traffic is at a peak.




Bway NOC

Albert - I think you're missing part of my question here.  Besides the fact that no cabling has changed and we had ZERO messages about heartbeat loss prior to the upgrade, if you look at the logs, the master claims it's missing heartbeats from itself.  That makes no sense...

Albert Pierson, Employee

Hi Bway,

The Master AP is running two pieces of code, AP code and controller code (based on Zone Director).

It may be that the AP code is sending heartbeats to the controller section (probably via loopback) but I have never personally verified this.

But if that is the case I think this points to an issue in the "controller" section of code in receiving, processing or replying to the AP's heartbeats. 

Please get the AP support files and the controller diagnostic file when the event happens (within a few minutes if possible) and open a support case to have them analyzed.

Thanks

Bway NOC

I'm pretty sure this customer doesn't have a support contract, so I'm just noting there's probably a bug and offering the info up if you want to investigate (and leaving other customers at 200.6).

I have a debug file (Admin -> Diagnostics -> Debug Info -> Save Debug Info).  Not sure what the AP support files are.

Bway NOC

We have multiple sites seeing this, even with the latest firmware. One is ready to just swap out Ruckus for another vendor as this event seems to interrupt access.

All sites we upgraded from 200.6 to 200.7 show some variation of this "heartbeat loss" error when they did not previously, so I am not suspecting cabling problems (although we have tested cables at these locations).

hitesh patel

I don't think Ruckus is interested in addressing this problem especially for older AP models. I believe the documentation for the 200.7 firmware states that this will be the last firmware update for some of the older models.

However, on the bright side, it does seem that at least in my case, over time the heartbeat loss error is happening less and less frequently. I checked the logs on my APs and the last heartbeat loss error was on September 1st, and prior to that on July 22nd. This is in contrast to heartbeat loss errors every few hours after I first updated to 200.7.

Bway NOC

Dang, the R-600 is now outdated? I mean is there a huge difference between this and an R-610?

Anyhow, as I suspected, this is absolutely looking like a bug in the firmware.  I downgraded a site from the latest 200.7 release to the latest 200.6 release and the following things have happened since then:

  • The slave unit has stopped rebooting itself nightly (we tested the cable multiple times, swapped POE injectors prior to this).
  • The "heartbeat lost" messages, both from the slave and from the master to itself, have not happened yet (17 hours so far; we would normally see this at least 4-5 times a day)
  • The client had issues with occasional drops, which I think happened concurrent with the "heartbeat lost" messages
I think we bought a contract for this location today, any tips for getting an actual resolution on this when we open a ticket? I can't keep these on 200.6 forever...

michael

I’ve got the same issue with a pair of R500s.  The heartbeat issue just started about a month or so ago.  It sounds like the solution is to back rev them to 200.6.  I did confirm that I’m on 200.7.  I’m not sure how to back-rev them.  Any hints?

Thanks.

hitesh patel

The only problem with downgrading the firmware to 200.6 is that the nicer built-in captive portal and customization options are gone too.

In my case at least, the heartbeat errors have become much less frequent over time, as I stated above. So I’m staying with 200.7.

michael

Problem solved.  For me, it was a router problem.  Normally, my infrastructure has reserved IP addresses and specific host name configurations.  As the result of a few router issues, I’ve been swapping some routers around, got lazy and my infrastructure’s been pulling IP addresses from the DHCP pool with no special configuration.  I reserved IP addresses for my two problematic APs and configured their host names as I usually do.  I haven’t had a heartbeat problem since.

I don’t set a preference on which AP serves as the master, primarily for failover purposes.  What I notice in this configuration is that they mask themselves as each other.  For example, if I try to connect to the IP address of the non-master, it will connect me to the master.  Looking at my router, I noticed two things.
  1. About the time the APs would lose their heartbeat, they would also disappear out of the DHCP table.
  2. When the APs were present in the DHCP table (in the pool), they had the same host name.
My theory is that as the two APs try to mask themselves as each other, at least from the router’s perspective, the router got confused.  When the APs would attempt a heartbeat, the packets would occasionally go to the wrong AP and the heartbeat would fail.
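
The second router observation above (two leases sharing one host name) is easy to check with a short script. Here is a minimal Python sketch, assuming the router's DHCP lease table can be exported as (MAC, IP, hostname) tuples; the function name and the sample leases are hypothetical, and finding a duplicate would not by itself confirm the theory.

```python
from collections import defaultdict


def duplicate_hostnames(leases):
    """Return {hostname: sorted MACs} for hostnames claimed by more than one MAC."""
    by_name = defaultdict(set)
    for mac, _ip, hostname in leases:
        # Compare case-insensitively, since routers vary in how they store names.
        by_name[hostname.lower()].add(mac.lower())
    return {name: sorted(macs) for name, macs in by_name.items() if len(macs) > 1}


# Hypothetical lease table: two APs in the pool that report the same host name.
leases = [
    ("AA:BB:CC:00:00:01", "192.168.1.50", "RuckusAP"),
    ("AA:BB:CC:00:00:02", "192.168.1.51", "RuckusAP"),
    ("AA:BB:CC:00:00:03", "192.168.1.60", "laptop"),
]
print(duplicate_hostnames(leases))
```

Any host name this reports is a candidate for a reserved address and a unique name, which is the change described above.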

So far so good.  I’m still running 200.7.

Bway NOC

It would still be really nice if Ruckus would fix this. The current solution of running really outdated firmware is really suboptimal.

It's really easy to reproduce - every site we have with Unleashed was doing this until we rolled back.