Archives

All posts by ccie14023

The case came in P1, and I knew it would be a bad one. One thing you learn as a TAC engineer is that P1 cases are often the easiest. A router is down, send an RMA. But I knew this P1 would be tough because it had been requeued three times. The last engineer who had it was good, very good. And it wasn’t solved. Our hotline gave me a bridge number and I dialed in.

The customer explained to me that he had a 7513 and a 7206, and they had a multilink PPP bundle between them with 8 T1 lines. The MLPPP interface had mysteriously gone down/down and they couldn’t get it back. The member links were all up/down. Why they were connecting them this way was not a question an HTTS engineer was allowed to ask. We were just there to troubleshoot. As I was on the bridge, they were systematically taking each T1 out of the bundle and putting HDLC encapsulation on it, pinging across, and then putting it back into the MLPPP bundle. This bought me time to look over the case notes.

There were multiple RMA’s in the notes. They had RMA’d the line cards and the entire chassis. The 7513 they were shipped had problems and so they RMA’d it a second time. RMA’ing an entire 7513 chassis is a real pain. I perused the configs to see if authentication was configured on the PPP interface, but it wasn’t. It looked like a PPP problem (up/down state) but the interface config was plain MLPPP vanilla.

They finished testing all of the T1’s individually. One of the engineers said “I think we need another RMA.” I told them to hang on. “Take all of the links out of the bundle and give me an MLPPP bundle with one T1,” I said. “But we tested them all individually!” they replied. “Yes, but you tested them with HDLC. I want to test one link with multilink PPP on it.” They agreed. And with a single link it was still down/down. Now we were getting somewhere. I had them switch which link was the active one. Same problem. Now disable multilink and just run straight PPP on a single link. Same thing.

“Can you turn on debug ppp with all options?” I asked. They were worried about doing it on the 7513, but I convinced them to do it on the 7206. They sent me the logs, and this stood out:

AAA/AUTHOR/LCP: Denied

Authorization failed. But why? Nothing was configured under the interface, but I looked at the top of the config, where the AAA commands are, and saw this:

aaa authorization network default

And there it was. “Guys, could you remove this one line from the config?” I asked. They did. The single PPP link came up. “Let’s do this slowly. Add the single link back into multilink mode.” Up/up. “Now add all the links back.” It was working.
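The check that finally cracked this case, looking at the global AAA section rather than the interface config, is easy to script. Below is a minimal sketch in Python, assuming an IOS-style text config where global commands start in the first column. The function name and the sample config are hypothetical and made up for illustration; they are not taken from the customer's routers.

```python
import re

def global_aaa_network_lines(config_text):
    """Return global 'aaa authorization network' lines from an IOS-style config."""
    hits = []
    for line in config_text.splitlines():
        # Global commands start in column 0; interface sub-commands are indented,
        # so re.match (anchored at the start of the line) only sees global ones.
        if re.match(r"aaa authorization network\b", line):
            hits.append(line.strip())
    return hits

# Hypothetical sample config, for illustration only.
sample_config = """\
aaa new-model
aaa authorization network default
!
interface Multilink1
 ppp multilink
"""

print(global_aaa_network_lines(sample_config))
```

A script like this would not replace the debug output, but it flags a global line that can affect PPP even when the interface config itself looks clean.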

It turns out they had a project to standardize their configs across all their routers and accidentally added that line. They had RMA’d an entire 7513 chassis–twice!–for a single line of config. Replacing a 7513 is a lot of work. I still can’t believe it got that far.

Some lessons from this story: first, RMAs don’t always fix the problem. Second, even good engineers make stupid mistakes. Third, when troubleshooting, always limit the scope of the problem. Troubleshoot as little as you can. And finally, even hard P1’s can turn out easy.

This article continues to be the most popular one on this blog.  However, I published it back in 2014 while I was working on my JNCIE-SP, and that was a long time ago.  I now work at Cisco and do not have access to Junos, and my memory of Junos is getting spotty.  I am happy if the article helps you, and feel free to leave a comment, but unfortunately I will not be able to help you with specific questions on this or other Juniper topics.

 

Continuing on the subject of confusing Junos features, I’d like to talk about RIB groups. When I started here at Juniper, I remember being utterly baffled by this feature and its use. RIB groups are confusing both because the official documentation is confusing, and because many people, trying to be helpful, say things that are entirely wrong. I do think there would have been an easier way to design this feature, but RIB groups are what we have, so that’s what I’ll talk about.

Before I worked at TAC, I was pretty careless about how I filled in a TAC case online. For example, when I had to select the technology I was dealing with in the drop-down menu, if I didn’t see exactly what I had then I would go ahead and pick something at random and figure TAC would sort it out. And then I would get frustrated when I didn’t get an answer on my case for hours. Working in TAC showed me why.

When you open a TAC case, and you pick a particular technology, your choice determines into which queue the case is routed. For example, if you pick Catalyst 6500, the case ends up in a queue which is being monitored by engineers who are experts on that platform. Under TAC rules (assuming it is a priority 3 case) the engineers have 20 minutes to pick up the case. If they don’t, it turns blue in their display and their duty manager starts asking questions. (In high touch TAC where I worked, we didn’t have too many blue cases, but in backbone TAC it wasn’t uncommon to see a ton of blue and even black (> 1hr) cases sitting in a busy queue.)

If the customer categorized his case wrong, it sat in the wrong queue. An engineer then had to notice the case, review it, determine where it should go, and “punt” it to the appropriate queue, at which point the counters were reset and the case sat waiting all over again.
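To see why a mis-queued case hurts, here is a toy model of the timers in Python. It is a sketch only: the 20-minute and one-hour thresholds come from the description above, while the class, the method names, and the queue names are hypothetical and have nothing to do with the real TAC tooling.

```python
from dataclasses import dataclass

BLUE_AFTER_MIN = 20   # pickup window for a priority 3 case, per the text
BLACK_AFTER_MIN = 60  # "> 1hr", per the text

@dataclass
class Case:
    queue: str
    queued_at: int  # minutes since the customer opened the case

    def color(self, now):
        """Color of the case in the queue display at minute `now`."""
        waited = now - self.queued_at
        if waited > BLACK_AFTER_MIN:
            return "black"
        if waited > BLUE_AFTER_MIN:
            return "blue"
        return "normal"

    def punt(self, new_queue, now):
        """Move the case to another queue; the wait counter starts over."""
        self.queue = new_queue
        self.queued_at = now

case = Case(queue="wrong-queue", queued_at=0)
print(case.color(25))           # already past the 20-minute window
case.punt("right-queue", now=25)
print(case.color(40))           # only 15 minutes in the new queue
```

In this model the case looks fresh again after the punt, even though the customer has been waiting 40 minutes in total.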

Imagine for a moment that you are an overworked TAC engineer with 30 minutes left to go on your shift. You are supposed to clear out your queue and take any cases before the next crew comes on (at least we were in HTTS). You don’t want to take any more cases, however. There is a case sitting in your queue which has turned blue and your colleagues may not be happy to see it sitting there when they come on shift. Well, you’re an experienced TAC engineer and you know what to do: punt the case to another queue, even if it’s the wrong one. If you pick a busy queue, it will take at least 30 minutes for the engineers on that queue to see the “mis-queue” and punt the case back to your queue, at which point you are off shift and it becomes the problem of your colleagues on the next shift.

My recommendation is to be very careful to select the right menu options when you open a case online with any tech support organization. Make sure you route the case to the right place the first time so you don’t have to wait for engineers and managers to look at it and re-categorize it.

When I first started configuring MPLS on Juniper routers, I came across the strange and mysterious inet.3 table.  What could it possibly be?  When I worked in Cisco TAC I handled hundreds of MPLS VPN cases, but I never had encountered anything quite like inet.3 in IOS land.  As I researched inet.3 I found the documentation was sparse and confusing, so when I finally came to understand its purpose I decided to create a clear explanation for those who are searching in vain.  I will focus on the basics of how inet.3 works, leaving details of its use for later posts.

In this post, we’ll be looking at IS-IS inter-area concepts, and hopefully clearing up some of the confusion ISIS areas create in the minds of engineers who are used to OSPF.  ISIS handles areas quite differently from OSPF, and if you think about ISIS areas in OSPF terms you are likely to be confused by some of this behavior.  The good news is that if you configure area numbers and enable ISIS, it should just work, but if you want to do anything more complex you will need a deeper understanding of how ISIS areas work.  I’ll assume you know the basics of ISIS, for example that it is not an IP native protocol, and just focus on the areas for now.  My intention here is not to go into all of the details of ISIS inter-area operation, but to help you sort out the basics in your mind so you can dive deeper in your studies. All output will be from Juniper routers, but should be self-explanatory enough for those of you using a different platform.

This is my first post on this blog, which I created some time ago and have left dormant.  Given that there are about twice as many blogs as people, it would seem best to start out with a statement of my purpose and intent.

Before that, a little background:  I currently work for Juniper Networks, although I don’t claim to speak for them.  I am responsible for network architecture within IT, which gives me a unique perspective since I am both a customer and a vendor.  I’ve been working in this industry for over 15 years, although my history with computers goes back farther than that.  I hold dual CCIE certifications in Routing/Switching and Security, and an M.S. in Telecommunications Management.  Non-technical credentials:  I hold an FAA Private Pilot’s certificate, and I have studied and taught Ancient Greek and Latin.

My hope is to provide several types of articles here.   I specialize in communicating technical concepts in simple and direct language, so I will be breaking down difficult technical subjects for my readers, focusing particularly on subjects that frustrate me.  (Don’t get me started on MSTP).  I will also provide frank commentary on the industry, its trends, and on training and certification.  I will also augment this with stories from my years as a network engineer that hopefully will keep things entertaining.  Finally I hope my language and humanities experience will lend some interesting color to this site so that it is not just another tech blog.