Author: ccie14023

TAC Tales #12: SACK of trouble

When I first started at Cisco TAC, I was assigned to a team that handled only enterprise customers.  One of the first things my boss said to me when I started there was “At Cisco, if you don’t like your boss or your cubicle, wait three months.”  Three months later, they broke the team up and I had a new boss and a new cubicle.  My new team handled routing protocols for both enterprise and service provider customers, and I had a steep learning curve, having just barely settled down in the first job. A P1 case came into my queue for a huge cable provider.  Often P1’s are easy, requiring just an RMA, but this one was a mess.  It was a coast-to-coast BGP meltdown for one of the largest service provider networks in the country.  Ugh.  I was on the queue at the wrong time and took the wrong case. The cable company was seeing BGP adjacencies reset across their entire network.  The errors looked like this:

Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP Notification sent
Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF
Jun 16 13:48:00.313 EST: %BGP-5-ADJCHANGE: neighbor 172.17.249.17 Down BGP Notification sent
Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor 172.17.249.17 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF

The cause seemed to be malformed BGP packets,...
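The “3/1” in those NOTIFICATION messages is the BGP error code and subcode defined in RFC 4271: code 3 is “UPDATE Message Error” and subcode 1 is “Malformed Attribute List,” which is why the router tears the session down. As a quick illustration (mine, not from the post), here is a minimal Python sketch that maps that code/subcode pair from a log line to its RFC names; the regex and the helper name are assumptions made for the example:

```python
# Illustrative helper: decode the "3/1" code/subcode pair from a
# %BGP-3-NOTIFICATION syslog line using the RFC 4271 error code tables.
# The log parsing regex and function name are assumptions for this sketch.
import re

ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}

UPDATE_SUBCODES = {
    1: "Malformed Attribute List",
    2: "Unrecognized Well-known Attribute",
    3: "Missing Well-known Attribute",
    4: "Attribute Flags Error",
    5: "Attribute Length Error",
    6: "Invalid ORIGIN Attribute",
    8: "Invalid NEXT_HOP Attribute",
    9: "Optional Attribute Error",
    10: "Invalid Network Field",
    11: "Malformed AS_PATH",
}

def decode_notification(logline: str) -> str:
    """Pull the neighbor and code/subcode out of a NOTIFICATION log line."""
    m = re.search(r"neighbor (\S+) (\d+)/(\d+)", logline)
    if not m:
        return "no NOTIFICATION code found"
    neighbor, code, subcode = m.group(1), int(m.group(2)), int(m.group(3))
    name = ERROR_CODES.get(code, "Unknown")
    subname = UPDATE_SUBCODES.get(subcode, "Unknown") if code == 3 else str(subcode)
    return f"{neighbor}: {name} / {subname}"

line = ("Jun 16 13:48:00.313 EST: %BGP-3-NOTIFICATION: sent to neighbor "
        "172.17.249.17 3/1 (update malformed) 8 bytes 41A41FFF FFFFFFFF")
print(decode_notification(line))
# -> 172.17.249.17: UPDATE Message Error / Malformed Attribute List
```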

Read More

In Praise of Vendor Lock-In

There is one really nice thing about having a blog whose readership consists mainly of car insurance spambots:  I don’t have to feel guilty when I don’t post anything for a while.  I had started a series on programmability, but I managed to get sidetracked by the inevitable runup to Cisco Live that consumes Cisco TME’s, and so that thread got a bit neglected. Meanwhile, an old article by the great Ivan Pepelnjak got me out of post-CL recuperation and back onto the blog.  Ivan’s article talks about how vendor lock-in is inevitable.  Thank you, Ivan.  Allow me to go further, and write a paean in praise of vendor lock-in.  Now this might seem predictable given that I work at Cisco, and previously worked at Juniper.  Of course, lock-in is very good for the vendor who gets the lock.  However, I also spent many years in IT, and also worked at a partner, and I can say from experience that I prefer to manage single-vendor networks.  At least, as single-vendor as is possible in a modern network.  Two stories will help to illustrate this. In my first full-fledged network engineer job, I managed the network for a large metropolitan newspaper (back when such a thing existed).  The previous network team had installed a bunch of Foundry gear.  They also had a fair amount of Cisco.  It was...

Read More

TAC Tales #11: Full up

No customer is happy if they have to reboot one of their Internet-facing routers periodically, and this was one of our biggest customers.  (At HTTS, they were all big customers.)  This customer had a GSR connecting to the Internet, with partial BGP routes, and he kept getting this error:

%RP-3-ENCAP: Failure to allocate encap table entry, exceeded max number of entries, slot 2

Eventually the router would stop passing traffic and when this happened, he had to reload it.  Needless to say, he wasn’t happy. The error came with a traceback, which shows what functions the code was executing when the error was generated.  The last function was this:

arp_background(0x5053d290)+0x140

Well, this was obviously some sort of ARP issue.  But why was ARP causing the router to stop forwarding traffic? Looking up the error, I found that it meant the route processor was unable to allocate a rewrite entry for the slot 2 line card.  As a packet leaves the fabric of a large router like the GSR, the headers are re-written with the destination layer 2 info.  The rewrite table used for this was full.  I had the customer run a hidden command a few times, and we could see the table entries incrementing quickly:

Adjacency Table has 3167 adjacencies
Adjacency Table...
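Since the troubleshooting here hinges on watching the adjacency count climb across repeated runs of that command, here is a small, purely illustrative Python sketch of how you might track that growth from saved copies of the output. The file-based workflow, function names, and growth threshold are all assumptions; the only detail taken from the post is the “Adjacency Table has N adjacencies” line format:

```python
# Illustrative only: parse repeated captures of the adjacency summary output
# and flag rapid growth in the table (a possible leak).
# The capture-file workflow and growth threshold are assumptions.
import re
import sys

PATTERN = re.compile(r"Adjacency Table has (\d+) adjacencies")

def count_from_capture(path: str) -> int:
    """Return the adjacency count found in one saved command output."""
    with open(path) as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                return int(m.group(1))
    raise ValueError(f"no adjacency count found in {path}")

def main(paths: list[str], growth_threshold: int = 500) -> None:
    counts = [count_from_capture(p) for p in paths]
    for prev, curr, path in zip(counts, counts[1:], paths[1:]):
        delta = curr - prev
        flag = "  <-- growing fast" if delta > growth_threshold else ""
        print(f"{path}: {curr} adjacencies (+{delta}){flag}")

if __name__ == "__main__":
    # Usage: python adjacency_watch.py capture1.txt capture2.txt capture3.txt
    main(sys.argv[1:])
```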

Read More

Programmability for Network Engineers

Since I finished my series of articles on the CCIE, I thought I would kick off a new series on my current area of focus:  network programmability.  The past year at Cisco, programmability and automation have been my focus, first on Nexus and now on Catalyst switches.  I did do a two-part post on DCNM, a product which I am no longer covering, but it’s well worth a read if you are interested in learning the value of automation. One thing I’ve noticed about this topic is that many of the people working on and explaining programmability have a background in software engineering.  I, on the other hand, approach the subject from the perspective of a network engineer.  I did do some programming when I was younger, in Pascal (showing my age here) and C.  I also did a tiny bit of C++ but not enough to really get comfortable with object-oriented programming.  Regardless, I left programming (now known as “coding”) behind for a long time, and the field has advanced in the meantime.  Because of this, when I explain these concepts I don’t bring the assumptions of a professional software engineer, but assume you, the reader, know nothing either. Thus, it seems logical that in starting out this series, I need to explain what exactly programmability means in the context of network engineering, and what it means to do something programmatically....
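To make “programmatically” a little more concrete before the series gets going, here is a small sketch of my own (not something from the post): instead of logging into a box and reading show-command output meant for humans, a script asks the device for structured data over a programmatic interface such as RESTCONF (RFC 8040) and gets JSON it can act on directly. The device address and credentials below are placeholders:

```python
# Minimal sketch of "doing something programmatically": retrieve interface
# data as structured JSON over RESTCONF (RFC 8040) instead of screen-scraping
# CLI output. The device address and credentials are placeholders.
import requests

DEVICE = "192.0.2.1"          # placeholder management address
URL = f"https://{DEVICE}/restconf/data/ietf-interfaces:interfaces"
HEADERS = {"Accept": "application/yang-data+json"}

resp = requests.get(URL, headers=HEADERS,
                    auth=("admin", "admin"),   # placeholder credentials
                    verify=False)              # lab-only: skip cert checks
resp.raise_for_status()

# The reply is structured data, so a program can reason about it directly
# rather than parsing text formatted for humans.
for intf in resp.json()["ietf-interfaces:interfaces"]["interface"]:
    print(intf["name"], intf.get("enabled"))
```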

Read More

TAC Tales #10: Out to Lunch

When you work at TAC, you are required to be “on-shift” for 4 hours each day.  This doesn’t mean that you work four hours a day, just that you are actively taking cases only four hours per day.  The other four (or more) hours you work on your existing backlog, calling customers, chasing down engineering for bug fixes, doing recreates, and, if you’re lucky, doing some training on the side.  While you are on shift, you still work on the other stuff, but you are responsible for monitoring your “queue” and taking cases as they come in.  On our queue we generally liked to have four customer support engineers (CSE’s) on shift at any time.  Occasionally we had more or fewer, but never fewer than two.  We didn’t like to run with two engineers for very long;  if a P1 comes in, a CSE can be tied up for hours, unable to deal with the other cases that come in, and the odds are not low that more than one P1 comes in.  With all on-shift CSE’s tied up, it was up to the duty manager to start paging off-shift engineers as cases came in, never a good thing.  If you were ever on hold for a long time with a P1, there is a good chance the call center agent was simply unable to find a CSE because...

Read More