I broke a 7450 stack

  • Question
  • Updated 11 months ago
  • Answered
We had a stack set up with a 7650 and 5 7450s.  There's an issue with the primary and secondary flash on one of the 7450s and, in trying to fix it, I rebooted the stack.  Now, the last two switches have been dropped from the stack and I would like some help getting them back.

[email protected]#show spx
T=4h39m10.5: alone: standalone, D: dynamic cfg, S: static
ID   Type          Role    Mac Address    Pri State   Comment                   
1  S ICX7650-48ZP  alone   d4c1.9e14.78c3   0 local   Ready
21 S ICX7450-48P   spx-pe  d4c1.9e08.2440 N/A remote  Ready
22 S ICX7450-48P   spx-pe  d4c1.9e07.f680 N/A remote  Ready
23 S ICX7450-48P   spx-pe  d4c1.9e08.1f00 N/A remote  Ready
24 S ICX7450-48P   spx-pe  0000.0000.0000 N/A reserve 
25 S ICX7450-48P   spx-pe  0000.0000.0000 N/A reserve 

                                                                               
     +---+                                                                     
  3/1| 1 |3/3                                                                  
     +---+                                                                     
            +----+        +----+        +----+                                 
  1/3/2--3/1| 21 |4/1--3/1| 22 |4/1--3/1| 23 |4/1-                             
            +----+        +----+        +----+                                 
                                                                               
This is from before the reboot:

[email protected]#show flash
Stack unit 1:
  Compressed Pri Code size = 56904124, Version:08.0.80bT233 (TNR08080b.bin)
  Compressed Sec Code size = 56904124, Version:08.0.80bT233 (TNR08080b.bin)
  Compressed Boot-Monitor Image size = 1573376, Version:10.1.13T235
  Code Flash Free Space = 2802184192
SPX unit 21: 
  Compressed Pri Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Sec Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Boot-Monitor Image size = 786432, Version:10.1.13T215
  Code Flash Free Space = 1697820672
SPX unit 22: 
  Compressed Pri Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Sec Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Boot-Monitor Image size = 786432, Version:10.1.13T215
  Code Flash Free Space = 1697820672
SPX unit 23: 
  Compressed Pri Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Sec Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Boot-Monitor Image size = 786432, Version:10.1.13T215
  Code Flash Free Space = 1697816576
SPX unit 24: 
  Compressed Pri Code size = -1523736068, Version:08.0.80T213 (SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0�-��08.0.80T213)
  Compressed Sec Code size = -1523736068, Version:08.0.80T213 (SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0SPR0�-��10.1.12T215)
  Compressed Boot-Monitor Image size = 786944, Version:10.1.12T215
  Code Flash Free Space = 1760612352
SPX unit 25: 
  Compressed Pri Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Sec Code size = 29815996, Version:08.0.80bT213 (SPR08080b.bin)
  Compressed Boot-Monitor Image size = 786432, Version:10.1.13T215
  Code Flash Free Space = 1697816576



Suggestions?
Clayton Tavernier

  • 77 Posts
  • 5 Reply Likes

Posted 11 months ago

Hashim Bharoocha, Employee

  • 59 Posts
  • 35 Reply Likes

Hey Clayton,

- Check whether the images on the missing units are intact by consoling into each one.  Make sure they run the same image version as the rest of the stack.  Also try recopying the image files that were corrupted on unit 24.
- Check whether the SPX stack ports are up (show spx connections).
- Check the log files for any information about why the units fell off the stack.

I think it would be a good idea to open up a ticket as it might be a bit more complex.
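
For checking image integrity across many units, a short script can scan saved "show flash" output instead of reading it by eye. The following is a minimal sketch in Python, assuming the output has the same line layout as the "show flash" paste in the question; the function name find_suspect_units and the two checks (non-positive size, image name that does not look like a .bin file name) are illustrative assumptions, not a Ruckus tool.

```python
import re

def find_suspect_units(show_flash_text):
    """Return {unit label: [reasons]} for units whose image entries look damaged.

    Assumes the line layout seen in the pasted "show flash" output.
    """
    suspects = {}
    unit = None
    for line in show_flash_text.splitlines():
        # Section headers look like "Stack unit 1:" or "SPX unit 24: "
        header = re.match(r"\s*((?:Stack|SPX) unit \d+)\s*:", line)
        if header:
            unit = header.group(1)
            continue
        entry = re.match(
            r"\s*Compressed (Pri|Sec) Code size = (-?\d+), Version:(\S+) \((.*)\)\s*$",
            line,
        )
        if entry and unit:
            slot, size, _version, filename = entry.groups()
            # A zero or negative size cannot describe a real image
            if int(size) <= 0:
                suspects.setdefault(unit, []).append(f"{slot}: size {size}")
            # A healthy entry shows a plain file name such as SPR08080b.bin
            if not re.fullmatch(r"[A-Za-z0-9_.]+\.bin", filename):
                suspects.setdefault(unit, []).append(f"{slot}: bad image name")
    return suspects
```

Run against the output in the question, this would be expected to single out SPX unit 24, whose primary and secondary entries both show a negative size and a garbled image name.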



Clayton Tavernier

Thanks.  When I console to the #24 switch, I get an "ICX7450-Boot>" prompt and I couldn't see which commands to use to access the flash memory.

Also, I thought one of the selling points of a stack configuration was that you could add a new switch, cable it up and issue one command to add it to the stack.  Don't I just need that command to re-add these two switches to the stack?

[email protected]#show spx connections
Probing the topology. Please wait ...
Discovered following spx connections...
Link 1: # of ports in lag = 1
       1: 1/3/2 <--> 21/3/1
Link 2: # of ports in lag = 1
       1: 21/4/1 <--> 22/3/1
Link 3: # of ports in lag = 1
       1: 22/4/1 <--> 23/3/1
Discovery complete
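
The discovery output above stops at unit 23, which is how you can tell units 24 and 25 are not being seen at all. As a rough sketch (Python, assuming the "unit/slot/port <--> unit/slot/port" link format shown above; the function names and the expected-unit list are illustrative assumptions), the same comparison can be scripted:

```python
import re

def units_seen(connections_text):
    """Collect the unit IDs that appear on either end of a discovered link."""
    seen = set()
    pattern = r"(\d+)/\d+/\d+\s*<-->\s*(\d+)/\d+/\d+"
    for left, right in re.findall(pattern, connections_text):
        seen.update((int(left), int(right)))
    return seen

def missing_units(connections_text, expected):
    """Return the expected unit IDs that no discovered link mentions."""
    return sorted(set(expected) - units_seen(connections_text))
```

With the three links above and an expected list of 1, 21, 22, 23, 24, 25, missing_units returns [24, 25].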


Hashim Bharoocha, Employee

Hey Clayton,

We need to debug this further; the switch is stuck in the boot monitor.

Best thing is to open a case and work with TAC.
Clayton Tavernier

Thanks for your help.
NETWizz

  • 182 Posts
  • 58 Reply Likes
Were you able to get it resolved?  If not, the next step is to fix each of the switches stuck in the boot monitor.  I would probably remove the stacking cables from them, provided you don't have the stack ring broken in two places.

Then I would recover the firmware until the unit is bootable.  At that point I would do a "stack unconfigure me".

Finally, I would place back into the stack, ensure proper speed settings for the Stacking Cables... then enable stacking and hitless failover.  Most likely the unit would reboot and re-join.

At any rate have the order and all the mac addresses.

You can re-run the stack secure-setup if necessary and re-number as well.

Ensure you have a stack backup of the configuration.

Regardless, if worse comes to worst, you should be able to rebuild it within an hour or two from scratch.
Clayton Tavernier

Thanks.  The problem switch is stuck in boot mode.  We directly connected a TFTP server and were able to upload the proper firmware but when we rebooted, we got this error. 

"Wrong Image Format for bootm command"
"ERROR: can't get kernel image!"

I'm still waiting for Ruckus Support to get back to me about this one.


Hashim Bharoocha, Employee

Hi Clayton,
Happy New Year.

This will require an RMA.

I pinged the case owner for 00881158.

They will RMA it with Failure Analysis, so we can figure out what went wrong.

Thanks
Hashim



 


NETWizz

Looking through my console logs, I did a recovery on or about August 2nd, 2018 on a 7150.  It should be nearly identical... only you need to send the appropriate firmware for a 7450 and not the firmware I sent for a 64XX...



NOTE:  I used Tftpd64 by Ph. Jounin.  This doesn't mean you need to use the same TFTP server, but I can verify this one works flawlessly.  Make sure it lists the Server Interface for the Network adapter you test ping later in this exercise, so we know it is reachable.  Also make certain the "Current Directory" is appropriate for your TFTP-Root with your image in the root directory.


Remove the switch to your desk and try the following (but with a compatible image for your model) AND the current boot monitor:


ICX64XX-boot>> printenv

You will get something like this only different:

baudrate=9600
ipaddr=10.115.142.82
serverip=10.115.1.80
netmask=255.255.255.0
gatewayip=10.115.128.1
uboot=kxz10105
image_name=/foundry/FGS/os/ICX64S08010h.bin
ver=10.1.05T310 (Mar 19 2015 - 16:39:59)

Now try this.  In my example, uboot is set to my TFTP server IP and ipaddr is the address the switch itself uses.  I do not think serverip is relevant; I don't know what it does.  Set these to work with your environment....


ICX64XX-boot>> setenv ipaddr 10.1.2.24
ICX64XX-boot>> setenv netmask 255.255.248.0
ICX64XX-boot>> setenv gatewayip 10.1.2.1
ICX64XX-boot>> set uboot 10.1.2.4
ICX64XX-boot>> set image_name ICX64S08030s.bin

ICX64XX-boot>> printenv
baudrate=9600
ipaddr=10.1.2.24
serverip=10.115.1.80
netmask=255.255.248.0
gatewayip=10.1.2.1
uboot=10.1.2.4
image_name=ICX64S08030s.bin





Save your environment

ICX64XX-boot>> saveenv


Connect the management interface to somewhere it can reach your TFTP server... and verify it can reach by pinging the server.  NOTE:  This is the ONLY interface on the switch that will work for a recovery!  On many switches this is near the console port!

ICX64XX-boot>> ping 10.1.2.4
Using egiga0 device
host 10.1.2.4 is alive


Get It! Pull the Image from the TFTP server:


ICX64XX-boot>> update_secondary
Using egiga0 device
TFTP from server 10.1.2.4; our IP address is 10.1.2.24
Download Filename 'ICX64S08030s.bin'.
Load address: 0x3000000
Download to address: 0x3000000
Loading: *%#################################################################
#################################################################
#################################################################
#################################################################
#################################################################
#################################################################
#################################################################
#################################################################
################################################################
done
Bytes transferred = 8558924 (82994c hex)
prot off f9080000 f9ffffff
........................................................................................................................................................................................................................................................
Un-Protected 248 sectors
erase f9080000 f9ffffff

.............................................................
.................................................................
.................................................................
.........................................................
Erased 248 sectors
copying image to flash, it will take sometime...
sflash write 3000000 1080000 f80000
TFTP to Flash Done.
ICX64XX-boot>> <
ICX64XX-boot>>
ICX64XX-boot>> 
ICX64XX-boot>> 


BOOT the FastIron Image:


ICX64XX-boot>> boot_secondary
Booting image from Secondary
## Booting image at 00007fc0 ...
   Created:      2018-05-30  14:30:33 UTC
   Data Size:    8558348 Bytes =  8.2 MB
   Load Address: 00008000
   Entry Point:  00008000
   Verifying Checksum ... OK
OK

Starting kernel in BE mode ...
Uncompressing Image...................................................................................................................................................................................................................................................................................................................................................................... done, booting the kernel.

Config partition mounted.
Creating TUN device
Starting the FastIron.
FIPS Disabled:PORT NOT DISABLED
platform type 49
OS>Unable to set the kernel wall time 
Starting Main Task .INFO: startup config data is not available, try to read from backup
INFO: startup config data in the backup area is not available
CPSS DxCh Version: cpss3.4p1 release
Pre Parsing Config Data ...
INFO: empty config data in the primary area, try to read from backup
INFO: empty config data in the backup area also

Parsing Config Data ...
INFO: empty config data in the primary area, try to read from backup
INFO: empty config data in the backup area also

System initialization completed...console going online.

  Copyright (c) 1996-2016 Brocade Communications Systems, Inc. All rights reserved.
    UNIT 1: compiled on May 30 2018 at 07:29:29 labeled as ICX64S08030s
(8558924 bytes) from Secondary ICX64S08030s.bin
        SW: Version 08.0.30sT311 
  Boot-Monitor Image size = 786944, Version:10.1.05T310 (10)
  HW: Stackable ICX6450-C12-PD
==========================================================================
UNIT 1: SL 1: ICX6450C 12-port-PD Management Module
  Serial  #: REDACTED
  License: BASE_SOFT_PACKAGE   (LID: eviHKJMnFNi)
  P-ENGINE  0: type DEF0, rev 01
==========================================================================
UNIT 1: SL 2: ICX6450C-Copper 2port 2G Module
==========================================================================
UNIT 1: SL 3: ICX6450C-Fiber 2port 2G Module
==========================================================================
  800 MHz ARM processor ARMv5TE, 400 MHz bus
65536 KB flash memory
  512 MB DRAM
STACKID 1  system uptime is 6 second(s) 
The system : started=warm start reloaded=by "reload"

ICX6450-C12PD Switch>
Stack unit 1 PS 1, Internal Power supply detected and up.

Stack unit 1 PS 1, Internal Power supply detected and up.
PoE: Stack unit 1 PS 1, Internal Power supply  with 68000 mwatts capacity is up
PoE Info: Adding new 54V capacity of 68000 mW, total capacity is 68000, total free capacity is 68000
PoE Info: PoE module 1 of Unit 1 on ports 1/1/1 to 1/1/4 detected. Initializing....
PoE Event Trace Log Buffer for 2000 log entries allocated
PoE Event Trace Logging enabled...
PoE Info: PoE module 1 of Unit 1 initialization is done. 
^C

ICX6450-C12PD Switch>

NETWizz

Or you can RMA it, lol.  They posted that when I was writing the above.   Honestly, I don't know what happened.  I never had a situation where both the primary and secondary were corrupted at the same time, but I did a test recovery on an old device just as an exercise once.

It just seems unlikely both flash slots would be corrupt at the same time.  Regardless, it is my opinion that if the boot monitor is undamaged, the switch should be recoverable.  If one or the other firmware image is good, you might merely try:

ICX64XX-boot>> boot_primary

or

ICX64XX-boot>> boot_secondary


That said, if they offer an RMA, that may be the easier path.

Clayton Tavernier

Thanks NETWizz.  I used the tftpd32 but I assume it's just as reliable.

So you think I need to do it again?  I'm pretty sure I followed all your steps and I think I did both the boot_primary and boot_secondary but I'm happy to give it another try.  I'll make sure I log the putty session this time.

Do you know if there's a way to look at the flash slots while in boot mode to verify the firmware was uploaded correctly?


NETWizz

I don't know of a way to check it from within the boot monitor.  That said, if it sends successfully, I see no reason it wouldn't boot provided the device doesn't have a hardware problem.

TFTPD32 should be fine; it's the same program compiled 32-bit as far as I know.

What software version are you sending?

What is your boot monitor version?
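
On verifying the upload: one partial check that needs nothing from the boot monitor is to compare the "Bytes transferred" line in the logged console session with the size of the image file on the TFTP server. The sketch below is Python and assumes the log line format shown in the recovery transcript earlier in this thread; the function names are illustrative. It only confirms that the whole file crossed the wire, not that the write to flash succeeded.

```python
import re

def transferred_bytes(console_log):
    """Extract the count from a 'Bytes transferred = N (hex)' line, or None."""
    match = re.search(
        r"Bytes transferred = (\d+) \(([0-9a-fA-F]+) hex\)", console_log
    )
    if not match:
        return None
    count = int(match.group(1))
    # The boot monitor prints the same number twice; they should agree
    if count != int(match.group(2), 16):
        raise ValueError("decimal and hex byte counts disagree")
    return count

def transfer_matches_file(console_log, image_size):
    """True if the logged transfer size equals the image size in bytes."""
    return transferred_bytes(console_log) == image_size
```

On the TFTP server, image_size could come from os.path.getsize() on the image file, with console_log read from the saved PuTTY session.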
Clayton Tavernier

SPR08080b.bin

I think it's the 113 bootrom.
NETWizz

Okay, so you are running:

10.1.13?

If yes, that is the same Boot-Monitor I am running.  Did you verify with printenv???  Or does it say it on boot?

If yes, that should be perfectly fine along with the above flash image.  I am currently running 08.0.80ca on all our Ruckus 7000 series stuff.  Not saying this is the best only it is what we deployed before I went on vacation for the holidays.

I am not saying there is any problem with SPR08080b, though, in that it should certainly at least boot.  We used to run that for a while, and it worked fine on the 7450's.


Michael Brado, Official Rep

  • 3069 Posts
  • 442 Reply Likes
Hi Clayton/NETWizz,

We have seen cases like this with your error message, and if you cannot copy, then we suggest an RMA for the switch.  Thanks!
Clayton Tavernier

NETWizz, when we reboot the switch, it shows a message that the 10.1.13 (spz10113) is installed but it's expecting spz10112.  This confuses me.

All the other switches in the stack have spz10113 installed.
Clayton Tavernier

Thanks Michael but the switch is booting okay now with the proper firmware installed and no configuration.  Now I just need to get it back into the stack.
NETWizz

Awesome; I am glad it helped.

TAKE A BACKUP OF YOUR STACK CONFIG FIRST!!!



First I would do a "stack unconfigure me" or perhaps a "stack unconfigure all"... and a reload.  Then I would do an "erase startup-config" followed by a reload.


If it says it is expecting spz10112, I would probably send that over and reload it to see if that message goes away. Then you can put back the bootrom of your choice, and it should go back through the upgrade procedure to upgrade the bootmonitor.

At any rate, it is a bit different being you are running the router firmware SPR08080B I think you said.

On a blank switch, we need to put an IP on an interface that is within the same subnet as your TFTP server or if that TFTP server is on a different subnet, you need to also put a next hop to point to your LAN's default-gateway.

Basically choose an interface like Ethernet 1/1/48 or whatever and put an IP on it within your network on a Layer-3 Interface.

switch> enable
switch# conf t
switch(config)# int e 1/1/48
switch(config-if-e1000-1/1/48)# ip add 10.1.2.3/24
switch(config-if-e1000-1/1/48)# enable

The default route below is not needed if the TFTP server is on the SAME LAN.  The /24 above is for the subnet mask 255.255.255.0 (adjust to match your LAN subnet).

switch(config-if-e1000-1/1/48)# exit
switch(config)# ip route 0.0.0.0/0 10.1.2.1
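
Whether that default route is needed comes down to whether the TFTP server falls inside the subnet of the interface address. A minimal sketch in Python using the standard ipaddress module (the function name needs_default_route is an illustrative assumption; the addresses are the ones from the examples in this thread):

```python
import ipaddress

def needs_default_route(interface_cidr, tftp_server):
    """True if the TFTP server lies outside the interface's own subnet."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(tftp_server) not in iface.network

# Same LAN as in the example above: no route needed
print(needs_default_route("10.1.2.3/24", "10.1.2.4"))
# Server on another subnet: add the ip route line
print(needs_default_route("10.1.2.3/24", "10.115.1.80"))
```

The interface argument also accepts the dotted mask form, for example "10.1.2.24/255.255.248.0" as used in the boot monitor settings earlier.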


To downgrade to the 10.1.12 boot monitor that it is expecting:

switch# copy tftp flash 10.1.2.4 spz10112.bin bootrom



Please do a "show flash" to verify it is present before reloading.  Sometimes it takes a minute after the copy completes before it is really ready!



****************

Once everything is looking great, you can re-add the device to your stack.

Basically, you would put it back in and reconnect the cables.  It is best if it has a matching bootrom and firmware!

Once you connect the stacking cables, if you do not get messages on the terminal that they changed state to UP, you would need to adjust the speed.  I do not think this was an issue on the 7450's I dealt with though.


At this point, the entire stack should probably run:


stack(config)# hitless-failover enable

On the switch you add:
switch(config)# stack enable


Please do this in a maintenance window.  If you have hitless-failover, only the new device should reboot.  Otherwise the entire stack will likely reboot.


Next, you probably want the MAC addresses of each switch, top to bottom, since you want them numbered in order...

Do a "sh stack"

If all is well you will see a ring topology shown.
You are also looking for:  "Standby u2 - protocols ready, can failover or manually switch over"

You really do not want to make any changes until everything comes into convergence within the stack.  At that point, I would do a "wr mem"


Next, to renumber, rerun your "stack secure-setup"

If you have your list of MAC addresses, it can walk you through renumbering all the switches.


Once complete, you want to validate communications


[email protected]#sh stack connection
Probing the topology. Please wait ...
[email protected]#
    active       standby
     +---+        +---+
 -4/1| 1 |3/1--4/1| 2 |3/1-
 |   +---+        +---+   |
 |                        |
 |------------------------|

trunk probe results: 2 links
Link 1: u1 -- u2, num=1
  1: 1/3/1 (P0) <---> 2/4/1 (P1)
Link 2: u1 -- u2, num=1
  1: 1/4/1 (P1) <---> 2/3/1 (P0)
CPU to CPU packets are fine between 2 units.



Basically, you are checking that the CPU to CPU packets are okay all around the stack.


If all is great, "wr mem"


At any rate, after a renumber, you may have to change some VLAN memberships etc.  Ultimately, though this should be readily repairable.
Clayton Tavernier

Sorry, I forgot to post this update yesterday:

I have no idea how it happened but when I went to console into the problem switch earlier today, I didn't get the boot-mode prompt, I got the new switch prompt (empty running config).  I ran the "show version" and both the primary and secondary flashes had the correct firmware.  I guess whatever I did on Friday with the firmware and the bootrom took a while to process.

I had backed up the running-config last week and I managed to download it into the startup-config (initially there was no startup-config.txt file to download into until I made a small change to the empty running-config and wrote it to memory), but after a reload I still had an empty running-config.  Maybe the startup-config for the stack wasn't going to work as the startup-config for a single switch.




Clayton Tavernier

I think we're going to RMA the problem switch.  We went through the procedures for having the switch "discovered" by the stack and it never worked.  It seems that whatever the zero-input process or the spx configuration saves in the switch's flash memory to configure it as a Port Extender is lost every time it reboots.

Thanks everyone for all your input.