ICX 6610 high CPU usage

Hi guys,

We have a pair of stacked ICX 6610s, both running version 08.0.30tT7f3 (latest).

We are facing very strange behavior with these boxes: pings to any IP configured on them return high latency, even from the local network.

Below are two examples of the problem we are facing.

##########################
[email protected][19:55][~]: ping 191.252.191.1
PING 191.252.191.1 (191.252.191.1) 56(84) bytes of data.
64 bytes from 191.252.191.1: icmp_seq=1 ttl=60 time=90.6 ms
64 bytes from 191.252.191.1: icmp_seq=2 ttl=60 time=230 ms
64 bytes from 191.252.191.1: icmp_seq=3 ttl=60 time=1.69 ms
64 bytes from 191.252.191.1: icmp_seq=4 ttl=60 time=3.59 ms
64 bytes from 191.252.191.1: icmp_seq=5 ttl=60 time=0.753 ms

PING 191.252.203.1 (191.252.203.1) 56(84) bytes of data.
64 bytes from 191.252.203.1: icmp_seq=1 ttl=60 time=1.00 ms
64 bytes from 191.252.203.1: icmp_seq=2 ttl=60 time=1.49 ms
64 bytes from 191.252.203.1: icmp_seq=3 ttl=60 time=121 ms
64 bytes from 191.252.203.1: icmp_seq=4 ttl=60 time=106 ms

Another strange thing is the CPU usage: the 1-second statistic shows spikes, and maybe that is the cause of the latency (a quick way to correlate the two is sketched after the output below).

###########################
65 percent busy, from 1 sec ago
1   sec avg: 65 percent busy
5   sec avg:  1 percent busy
60  sec avg:  1 percent busy
300 sec avg:  1 percent busy

spcrdc2ita001#sh cpu-utilization 
Less than a second from the last call, abort
1   sec avg:  1 percent busy
5   sec avg:  1 percent busy
60  sec avg:  1 percent busy
300 sec avg:  1 percent busy

spcrdc2ita001#sh cpu-utilization 
1 percent busy, from 1 sec ago
1   sec avg:  1 percent busy
5   sec avg:  1 percent busy
60  sec avg:  1 percent busy
300 sec avg:  1 percent busy

spcrdc2ita001#sh cpu-utilization 
1 percent busy, from 39 sec ago
1   sec avg: 73 percent busy
5   sec avg:  1 percent busy
60  sec avg:  1 percent busy
300 sec avg:  1 percent busy

spcrdc2ita001#sh cpu-utilization 
Less than a second from the last call, abort
1   sec avg:  7 percent busy
5   sec avg:  3 percent busy
60  sec avg:  1 percent busy
300 sec avg:  1 percent busy
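
Since the spikes only show up in the 1-second average, one quick way to line them up with the latency is to timestamp every ping reply from a nearby host and flag the slow ones. A minimal sketch (the target address and the 50 ms threshold are just examples):

##########################
#!/bin/sh
# Timestamp each ping reply and flag slow ones, so latency spikes can
# be correlated with the switch's 1-second CPU samples.
TARGET=191.252.191.1   # one of the addresses from the examples above
THRESHOLD=50           # ms; anything slower gets flagged

ping "$TARGET" | while IFS= read -r line; do
    ms=$(printf '%s\n' "$line" | sed -n 's/.*time=\([0-9.]*\) ms.*/\1/p')
    [ -n "$ms" ] || continue
    flag=""
    awk -v m="$ms" -v t="$THRESHOLD" 'BEGIN { exit !(m > t) }' && flag="  <-- SPIKE"
    echo "$(date +%H:%M:%S)  ${ms} ms${flag}"
done
##########################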

When we run "show cpu tasks" during one of these moments, most tasks look fine, but the appl task accounts for nearly all of the usage, as below.

spcrdc2ita001#show cpu tasks       
... Usage average for all tasks in the last 1 second ...
==========================================================
Name                 %

idle                 9
con                  0
mon                  0
flash                0
dbg                  0
boot                 0
main                 0
stkKeepAlive         0
keygen               0
itc                  0
poeFwdfsm            0
tmr                  0
scp                  0
appl                 91
snms                 0
rtm                  0
rtm6                 0
rip                  0
bgp                  0
bgp_io               0
ospf                 0
ospf_r_calc          0
openflow_ofm         0
openflow_opm         0
mcast_fwd            0
mcast                0
msdp                 0
ripng                0
ospf6                0
ospf6_rt             0
mcast6               0
ipsec                0
dhcp6                0
snmp                 0
rmon                 0
web                  0
acl                  0
flexauth             0
ntp                  0
rconsole             0
console              0
ospf_msg_task        0
ssh_0                0

All the interfaces are fine (bandwidth consumption is low) and we have no problems with packets-per-second (PPS) rates.
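
(For anyone wanting to run the same checks: we verified this with the usual FastIron commands, along the lines of the below; exact syntax can vary by release, so treat it as a sketch.)

##########################
show interfaces brief            (link state and speed on every port)
show interfaces ethernet 1/1/1   (per-port rate/utilization counters)
show statistics ethernet 1/1/1   (packet and error counters)
##########################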

You can find an image from our monitoring tool showing CPU usage, memory usage, and response time.

Has anyone experienced this? Do you have any suggestions for troubleshooting or fixing it?




Thanks,
Marcelo Tadeu



Marcelo Araujo

Posted 8 months ago

Ben, Employee
I wouldn't test latency by pinging the ICX itself, as we do not prioritize replying to ping. Do you still see latency spikes when pinging through the ICX? That said, if you are seeing CPU spikes, that could certainly lead to issues if the ICX is momentarily overwhelmed. I would look for any kind of control traffic that could be hitting the CPU, or any logs that might show a trigger at the moment of a CPU spike. Your best bet may be to open a support case and have our TAC team try to help track it down.
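
For the through-the-box test, an mtr report from a host behind the switch works well; the target and cycle count here are just examples:

##########################
# From a host behind the ICX, toward something on the far side:
mtr --report --report-cycles 100 191.252.191.1
# Per-hop loss/latency: a spike that first appears at the ICX hop and
# persists to the later hops points at the switch itself.
##########################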
Marcelo Araujo
Yes, when we pass through the box the latency increases. It's not as bad as when we ping the ICX itself, but it affects the machines behind it.

This is an MTR from a machine behind the ICX going to a customer IP (inside -> outside).

[image: percasJPG]

From outside to inside:

We'll open a TAC case to check it out and will post the fix here.

Thanks for your attention.

Regards,
Marcelo Tadeu



Al Mat
Same problem with an ICX 6610. The console is very slow over SSH.
Sam Abbott
Same issue with a stack of two ICX 6610s. It happened after generating a crypto key for SSH use... Looking at CPU tasks before reloading, appl and tmr were the only tasks causing the high CPU. Basically all network traffic was dropping... Uptime was less than 14 days on the stack.

Since the only change I had made was generating the crypto keys for SSH, I zeroized them out... Under similar network load testing it's not happening now. Not sure what the deal is. Anyone else have this issue or have a resolution?
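
For reference, the key handling was just the stock FastIron commands (from memory, so double-check the syntax on your release):

##########################
! generated originally:
crypto key generate rsa modulus 2048
! removed afterwards with:
crypto key zeroize rsa
##########################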

Copyright (c) 1996-2016 Brocade Communications Systems, Inc. All rights reserved.

    UNIT 1: compiled on Feb 13 2019 at 18:30:50 labeled as FCXR08030t
      (10545807 bytes) from Primary FCXR08030t.bin
        SW: Version 08.0.30tT7f3
    UNIT 2: compiled on Feb 13 2019 at 18:30:50 labeled as FCXR08030t
      (10545807 bytes) from Primary FCXR08030t.bin
        SW: Version 08.0.30tT7f3

  Boot-Monitor Image size = 370695, Version:10.1.00T7f5 (grz10100)

Al Mat
I will open a new case after the Christmas vacation. I saw that we lost communication with many interfaces when the CPU reached 100%.
Sam Abbott
Likewise, if it happens again I'll open a ticket as well. Since we had jumped from v7.3 code to the 8.0.30 code, I ended up getting a maintenance window to pull the power and do a cold boot. Still no crypto key or SSH... so maybe I won't experience it again. SSH isn't that important on our LAN at this time anyway; telnet will work.
Al Mat
Hello!
I replaced the ICX 6610 and it did not solve the problem. We still constantly hit 100% CPU on the stack. I have SSH and crypto keys on all 10 of my other routers and there is no problem with that. I found documentation that might help and will take a look at it. Here's the link:
https://support.ruckuswireless.com/articles/000007306

If you find a solution, please share.
Thank you!

Marcelo Araujo
Hi guys,

Just an update: in our case the problem was caused by a configuration mismatch on the server side. We opened a TAC case and found that NIC teaming on the server side without LACP (802.3ad) causes the issue. Our servers run XCP (the open-source Citrix XenServer), and if they are not running vSwitch (802.3ad is only possible with vSwitch), we get a mess.

The solution we found was switching the NIC teaming to active-passive mode. That way, the switches learn the server's MAC address on only one port at a time.

If the same MAC arrives from two different ports at the same time, the CPU goes high and causes all these problems.
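
If it helps anyone, here is a rough Linux equivalent of that teaming change using iproute2 (eth0/eth1 are placeholder interface names; XCP has its own tooling for this):

##########################
# Active-backup bonding: only one NIC carries traffic at a time, so the
# switch sees the server's MAC on a single port instead of on two at once.
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down
ip link set eth1 down
ip link set eth0 master bond0
ip link set eth1 master bond0
ip link set bond0 up
##########################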

Hope this helps.

Regards,