Radius server unreachable events

  • Question
  • Updated 4 weeks ago
  • Acknowledged
vSZ version 5.2.0.0.699 with 411 R710 APs on campus. Every few days we get a bunch of RADIUS server unreachable events. What is odd is that the event details do not even point to our RADIUS server. All events are reported like this:

AP [[email protected]:E7:1E:2A:A4:40] is unable to reach radius server [127.0.0.1].

Of course 127.0.0.1 is not our RADIUS server; we use Cloudpath for RADIUS. Our Wi-Fi is rock solid, so there are no symptoms other than a rash of these events every few days.
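For anyone who wants to check whether their own events point at loopback or at a real server, here is a minimal sketch. It assumes the events have been exported as plain text lines in the format quoted above; the export step, the function name, and the sample AP labels are hypothetical and not taken from vSZ.

```python
import re
from collections import Counter

# Pattern modeled on the event text quoted above; the AP label is treated as opaque.
EVENT_RE = re.compile(
    r"AP \[(?P<ap>[^\]]*)\] is unable to reach radius server \[(?P<server>[^\]]+)\]"
)

def tally_unreachable(lines):
    """Count 'unable to reach radius server' events per reported server address."""
    counts = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            counts[match.group("server")] += 1
    return counts

# Hypothetical sample lines (made-up AP labels and addresses).
sample = [
    "AP [ap-01@AA:BB:CC:DD:EE:01] is unable to reach radius server [127.0.0.1].",
    "AP [ap-02@AA:BB:CC:DD:EE:02] is unable to reach radius server [127.0.0.1].",
    "AP [ap-03@AA:BB:CC:DD:EE:03] is unable to reach radius server [192.0.2.10].",
]
print(tally_unreachable(sample))
```

If nearly every event reports 127.0.0.1, that points toward something local to the AP or controller rather than the real RADIUS server being down.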

Any ideas what causes this?

David Henderson

  • 114 Posts
  • 31 Reply Likes

Posted 5 months ago


EightOhTwoEleven

  • 167 Posts
  • 39 Reply Likes
We get the same thing, except that the message does show the correct RADIUS server (Windows NPS), and we are on vSZ 5.1.2.

We have noticed that VM stun sometimes occurs while backups are being performed, but the events also show up at other random times, with APs reported as unable to reach the server.

Dave Watkins

  • 74 Posts
  • 14 Reply Likes
We're seeing the same thing. All our RADIUS requests are proxied through the vSZ, so there shouldn't even be RADIUS processing on our APs. It _appears_ to be causing RADIUS server failovers for us, but it's not anything the end user notices, and we're busy enough at the moment that I haven't had time to dig any further into it.

It started when we upgraded to 5.2.0.0.699, so I'd say it is a bug in that code (or the relevant AP code).

EightOhTwoEleven

  • 167 Posts
  • 39 Reply Likes
After upgrading to 5.2.0.0.699, we are seeing the same thing as well. It's not causing a server fail-over because we don't see attempted RADIUS connections on our secondary server (we log and graph this).

David Henderson

  • 114 Posts
  • 31 Reply Likes
This past Sunday, April 12th, starting at 2:20am and continuing to 12:45pm (about 10 hours), I received well over 8,000 emails from the controller about RADIUS server unreachable events. These are scattered across nearly all of our APs (411 total), which are located in 6 separate buildings and attached to 22 different switch stacks. The emails stopped Sunday afternoon. I then opened a ticket with Ruckus on this. Working the ticket now, no resolution yet.

EightOhTwoEleven

  • 166 Posts
  • 38 Reply Likes
Would be curious to know the outcome of said ticket.

David Black

  • 99 Posts
  • 52 Reply Likes
The problems with version 5 seem to surface on larger networks. We have several clients with very large production networks, each with several hundred sites, thousands of APs, and multiple multi-node clusters worldwide.  We are responsible for managing and maintaining these production networks so we are very cautious with upgrades.  We do our own testing and we've not found a single version 5 release that we like.  We've kept all client production networks on 3.6.2 with one exception - a single 4-node cluster managing APs and switches that's running 5.1.0.0.496 (the version we dislike the least).

We very much look forward to having a stable and tolerable v5 release one of these days, but because our neck is on the line, we intend to keep our clients on 3.6.2 until there is a release that passes our testing. If you check the Ruckus support site, you'll also find that TAC's recommended version for SZ or vSZ is 3.6.2.0.222.

Dave Watkins

  • 74 Posts
  • 14 Reply Likes
Sadly we started our vSZ journey on 5, and we need support for new APs, so we have to keep up to date. We're at a couple of hundred APs and I've found issues on every release, but we need the new AP support and so need to upgrade. I should go and look at whether they have fixed the email address fields to accept gTLDs longer than 6 characters, or fixed the multiple realm support for RADIUS-based admin logins that they broke in the previous release.

David Henderson

  • 114 Posts
  • 31 Reply Likes
I was the one who opened the ticket with support about this issue, and they still have not resolved it. We have 411 R710 APs and are seeing two things. Occasionally, maybe once or twice each week, we get a dozen or so "RADIUS server cannot be reached" events. Twice now, though, there has been a cascade of these events. Just yesterday I received literally thousands of emails with this same event, and they continued overnight with thousands more. The last time this happened, the only way I could stop them was to reboot both of my vSZ controllers, which I am in the process of doing right now.

I think this is a bug in the code, but I have not heard this from support.

Dave Watkins

  • 74 Posts
  • 14 Reply Likes
I've done a lot of digging on this one and I _think_ our primary issue is that NPS ignores/discards packets with attributes it doesn't support or understand. There isn't any way to tell it to send reject messages for these requests, so if you deal with a lot of BYOD devices that aren't configured correctly, you're at their mercy. Enough of these requests cause failovers.

Longer description here
https://community.jisc.ac.uk/groups/eduroam/article/improving-reliability-microsoft-nps-authentication-provider-eduroam
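To illustrate why silently dropped requests can look like a dead server, here is a toy model. It assumes the client marks a server unreachable after a fixed number of consecutive unanswered requests; that rule, the threshold of 3, and all names here are illustrative assumptions, not the documented behaviour of vSZ, the APs, or NPS.

```python
from dataclasses import dataclass

@dataclass
class ServerState:
    """Tracks consecutive unanswered requests for one RADIUS server."""
    name: str
    max_missed: int = 3  # hypothetical threshold, not a vSZ or NPS setting
    missed: int = 0
    dead: bool = False

    def record(self, answered: bool) -> None:
        # Any reply (accept or reject) proves the server is alive.
        if answered:
            self.missed = 0
            return
        # A silently discarded request is indistinguishable from an outage.
        self.missed += 1
        if self.missed >= self.max_missed:
            self.dead = True

def simulate(outcomes, max_missed=3):
    """Feed answered/unanswered results in order; return True if failover triggers."""
    state = ServerState("primary", max_missed=max_missed)
    for answered in outcomes:
        state.record(answered)
        if state.dead:
            return True
    return False

# Three dropped requests in a row: the server is marked dead.
print(simulate([True, False, False, False]))
# The same number of drops interleaved with replies: no failover.
print(simulate([False, True, False, True, False]))
```

Under this model, a server that answered bad requests with a reject would reset the counter each time, which is why discarding them is the harmful case.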

Sadly we're also seeing GPO-configured Windows clients not being assigned the right VLAN when they roam between APs, which might be related to the RADIUS failovers. They drop down to the default VLAN assigned on the SSID, not the RADIUS-assigned one, which of course changes their IP and causes havoc.

Anders Grandt

  • 3 Posts
  • 0 Reply Likes
I'm seeing the exact same problem with our vSZ with version 5.2.0.0.699. 
Unable to reach radius server (127.0.0.1).

After a vSZ reboot it works again, but after a while the problem comes back.
This really needs to be solved ASAP!

David Henderson

  • 114 Posts
  • 31 Reply Likes
I have had a case open for weeks about this. Ruckus support told me two things just a few days ago:
1. This is a bug in vSZ firmware 5.2.0.0.699, which is listed as GA (General Availability). The bug manifests itself as the AP trying to reach our RADIUS server, which happens to be Cloudpath. The AP should only be trying to reach our vSZ, not trying to reach Cloudpath directly.
2. I should be using an MR (Maintenance Release) of vSZ, which, as the name implies, is a release that has had more bugs worked out.

On our call a few days ago the engineer disabled the emailing of RADIUS unreachable events. I have since reached back out to them (I should have asked on the call but did not) to ask whether I should roll our controllers back to an MR release. I should hear an answer this week.

Anders Grandt

  • 3 Posts
  • 0 Reply Likes
Yeah, I know about the GA and MR releases, but RADIUS server support is pretty essential and "should" work correctly even in a GA release, in my opinion.

Anyway - if you get some more info in this case about solutions or updates/patches coming soon, please put it up here

Dave Watkins

  • 74 Posts
  • 14 Reply Likes
I find it somewhat surprising you've been directed to use an MR release. How exactly are you supposed to use new APs on an MR? As far as I'm aware the R650 is only supported on 5.2.

Also, GA is GA; it's released to the public. Sure, an MR is going to have issues fixed, but any GA release should have been based on the previous MR. It's like they are building each GA from the ground up with new code.

Just disappointing really. I keep upgrading because I keep hoping more bugs are fixed than are introduced. So far, not a lot of luck in that regard.

Diego Garcia del Rio

  • 121 Posts
  • 43 Reply Likes
On my side, APs not using proxy are also showing the alarm (RADIUS server unreachable, with the real RADIUS IP). From a user experience standpoint it _seems_ to be OK, but the alarms are quite annoying, to say the least.

It _seems_ that the newer code is very aggressive about RADIUS response latency. Also, they don't perform any liveness checks (at least not in non-proxy mode), which other devices usually do.

Plus, in direct mode at least, the timeouts / retries are not configurable at all.
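As an example of the kind of liveness check other devices use, here is a rough sketch of a RADIUS Status-Server probe (RFC 5997) that could be run from a test host. The function names, default port, and timeout are my own assumptions, and whether a given server (Cloudpath, NPS, or anything else) actually answers Status-Server is not something this thread confirms.

```python
import hashlib
import hmac
import os
import socket
import struct

STATUS_SERVER = 12          # RADIUS packet code defined in RFC 5997
MESSAGE_AUTHENTICATOR = 80  # attribute type, required in Status-Server requests

def build_status_server(secret: bytes, identifier: int = 0) -> bytes:
    """Build a Status-Server request carrying only a Message-Authenticator."""
    authenticator = os.urandom(16)  # random Request Authenticator
    # Attribute: type, length (2-byte header + 16-byte value), zeroed value.
    attr = struct.pack("!BB", MESSAGE_AUTHENTICATOR, 18) + bytes(16)
    length = 20 + len(attr)         # 20-byte RADIUS header plus attributes
    header = struct.pack("!BBH", STATUS_SERVER, identifier, length) + authenticator
    # HMAC-MD5 over the whole packet with the attribute value zeroed.
    digest = hmac.new(secret, header + attr, hashlib.md5).digest()
    return header + attr[:2] + digest

def probe(host: str, secret: bytes, port: int = 1812, timeout: float = 2.0) -> bool:
    """Return True if any reply arrives before the timeout (hypothetical helper)."""
    packet = build_status_server(secret)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (host, port))
            sock.recvfrom(4096)
            return True
        except OSError:  # covers timeouts and ICMP-induced socket errors
            return False
```

A call such as `probe("192.0.2.10", b"shared-secret")` (placeholder address and secret) would only show whether the server responds at all; it does not validate the reply.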

David Henderson

  • 114 Posts
  • 31 Reply Likes
I was told by Ruckus support that this is a bug in the newest vSZ code and will be rectified when an MR release of the 5.2 code comes out. He was not sure of the timeframe.

David Henderson

  • 114 Posts
  • 31 Reply Likes
Still waiting for the next release of firmware for vSZ which I was told was going to fix this issue

David Henderson

  • 114 Posts
  • 31 Reply Likes
Still waiting for the next release of firmware for vSZ that I was told would fix this issue

Jeronimo

  • 384 Posts
  • 49 Reply Likes
Has the MR version with the fix still not been released?

We have also seen those logs at several sites.

It's very serious.