The case came into the routing protocols queue, even though it was simply a line card crash.  The RP queue in HTTS was the dumping ground for anything that did not fit into one of the few other specialized queues we had.  A large US service provider had a Packet over SONET (PoS) line card on a GSR 12000-series router crashing over and over again.

A previous engineer had the case, and he did what a lot of TAC engineers do when faced with an inexplicable problem:  he RMA’d the line card.  As I have said before, RMA is the default option for many TAC engineers, and it’s not a bad one.  Hardware errors are frequent and replacing hardware often is a quick route to solving the problem.  Unfortunately the RMA did not fix the problem, the case got requeued to another engineer, and he…RMA’d the line card.  Again.  When that didn’t work, he had them try the card in a different slot, but it continued to generate errors and crash.

The case bounced through two other engineers before getting to me.  Too bad the RMA option was out.  But the simple line card crash and error got even weirder.  The customer had two GSR routers in two different cities that were crashing with the same error.  Even stranger:  the crash was happening at precisely the same time in both cities, down to the second.  It couldn’t be a coincidence, because each crash on the first router was mirrored by a crash at exactly the same time on the second.

The conversation with my fellow engineers ranged from plausible to ludicrous.  There was a legend in TAC, true or not, that solar flares cause parity errors in memory and hence crashes.  Could a solar flare be triggering the same error on both line cards at the same time?  Some of my colleagues thought it was likely, but I thought it was silly.

Meanwhile, internal emails were going back and forth with the business unit to figure out what the errors meant.  Even for experienced network engineers, Cisco internal emails can read like a foreign language.  “The ALPHA errors are side-effects the GULF errors,” one development engineer commented, not so helpfully.  “Engine is feeding invalid packets to GULF and that causes the bad header error being detected on GULF,” another replied, only slightly more helpfully.

The customer, meanwhile, had identified a faulty fabric card on a Juniper router in their core.  Apparently the router was sending malformed packets to multiple provider edge (PE) routers all at once, which explained the simultaneous crashing.  Because all the PEs were in the US, forwarding was a matter of milliseconds, and thus there was very little variation in the timing.  How did the packets manage to traverse the several hops of the provider network without crashing any GSRs in between?  Well, the customer was using MPLS, and the corruption was in the IP header of the packets.  The intermediate hops forwarded the packets, without ever looking at the IP header, to the edge of the network, where the MPLS labels get stripped, and IP forwarding kicks in.  It was at that point that the line card crashed due to the faulty IP headers.  That said, when a line card receives a bad packet, it should drop it, not crash.  We had a bug.

The development engineers could not determine why the line card was crashing based on log info.  By this time, the customer had already replaced the faulty Juniper module and the network was stable.  The DEs wanted us to re-introduce the faulty line card into the core, and load up an engineering special debug image on the GSRs to capture the faulty packet.  This is often where we have a gulf, pun intended, between engineering and TAC.  No major service provider or customer wants to let Cisco engineering experiment on their network.  The customer decided to let it go.  If it came back, at least we could try to blame the issue on sunspots.