
I’ve wanted to kick off a series on technical interviewing for a while now. Let me begin with a story.

My first job interview for a full network engineering role was at the San Francisco Chronicle in 2000. I had been working for five years in IT, mostly doing desktop and end-user support. I then decided to get a master’s degree in telecommunications management, which didn’t help at all, followed by a CCNA certification, which got me the interview.

My first interview was with the man who would be my boss. Henry was a manager who had almost no technical knowledge about networking, but I didn’t know that at the time. “Do you know Foundry switches at all?” Henry asked.

“No.” I was already worried.

“I doubted you would. That’s ok because we want to replace them all with Cisco and you know Cisco.” He pulled out a network diagram and handed it to me. “If you look at this, do you see a problem?” he asked.

I had never worked on a network larger than a couple switches, and now I was staring at a convoluted diagram depicting the network of the largest newspaper in Northern California. I was looking at subnet masks, link speeds, and hostnames, trying to find something wrong.

“I’m not sure,” I had to reply meekly.

He pointed at the main core switch for the network. There was only one, with no redundancy.  “There’s a huge single point of failure,” he said. I felt stupid missing the forest for the trees.

Henry brought me upstairs to interview with Tom, who was an on-site project management contractor from Lucent. I was extremely nervous–Lucent (later Avaya) was a big name in the industry and this guy worked for them! Henry left me with Tom. Tom pulled out a copy of the same diagram Henry had shown me earlier.

“Do you notice anything wrong with this?” he asked.

“Wow, that’s a huge single point of failure,” I replied.

He nodded his head in approval. “That’s right–very good.” He asked me a technical question about supernetting. I answered nervously, although it quickly became clear I knew more than he did.

The door flew open and another guy named Vincent walked in. He was the desktop support contractor, but again I didn’t know that. “Ask Jeff a few technical questions,” Tom said.

“Question number one,” said Vincent. “If you were running a network this size, would you subnet it?”

Now the answer seemed obviously to be “yes”, but I was trying to figure out if this was a trick. “Yes,” I answered, deciding to play it safe.

“Good! Next question: Can you route NetBIOS?” My desktop years were almost exclusively dedicated to Macs and I didn’t even know what NetBIOS was. I figured it was a 50/50 shot, and the way he asked it seemed to suggest the answer.

“No,” I said, trying to sound confident.

“He’s good,” said Vincent.

Next, the door flew open again, and in walked Bing. Bing was carrying some sort of network device with her. She handed it to me. “Is this a switch or a hub?” she asked. There was no obvious labeling on it, and as I turned the device over and over again in my hands, I had a sinking feeling.

“I don’t know,” I replied.

“Look at this,” Bing said. She pointed to a collision light. “Since there is one on each port, you can tell this is a switch.”

We don’t have collision lights on switches anymore, but at the time we did and she had a valid point. I explained to her that since a hub has a single collision domain, it would only have one collision light, and walked her through the concept of a collision domain and how a switch works versus a hub. It turns out she was a project manager for desktop support and she didn’t know any of that. Someone had just shown her the collision light thing and she thought it would be a good question.

“He’s good,” said Bing.

Tom had told me my next interview would be with a CCIE from Lucent. Now that was definitely intimidating. I knew of the reputation of CCIEs, and I didn’t expect to do well. The CCIE guy never showed up. As Tom was walking me to the elevator, however, we ran into him in the hallway. It turns out that Mike, who is still a friend of mine, and who later got three CCIE’s, had not passed the exam yet. We ended up talking about his home lab for a few minutes.

“He’s good,” said Mike. And a good thing too, as I’ve been in a couple of interviews with Mike and I’ve seen him grill people mercilessly.

I got a call with a job offer a few days later, and ended up working there five years.

For a while now I’ve had several posts in my drafts folder on the subject of technical interviewing. As you can see from the above story, interviews are often chaotic, disorganized, and conducted by unqualified people who have no plan. In the case of the San Francisco Chronicle, they made the right decision on me, and I don’t think anybody there would dispute that. I was thankful to begin my career in network engineering.

That said, I’ve had other interviews that didn’t go so well. Over the next few articles, I’d like to cover technical interviewing. Why do we interview people? How can we separate good candidates from bad ones? How worthwhile are the typical technical questions? Are gotchas worth throwing out “just to see how the candidate reacts”? Are interviews purely subjective, or can we make them data-driven and objective?

I’ll throw out a few more anecdotes from my own experience to illustrate my points–feel free to comment with some of your own!

I worked for two years at a Cisco Gold Partner.  The first year was great.  We were trying to start up a Cisco practice in San Francisco (they were primarily a Citrix partner before), so my buddy and I wined and dined Cisco channel account managers, trying to impress them with our CCIEs and get them to steer business our way.  Eventually, the 2009 financial crisis hit and business started to dry up.  The jobs became fewer and less interesting.  I had two CCIEs, and at one point I drove out to Mare Island near San Francisco to install a single switch for a customer whose entire network consisted of–a single switch.  I always recommend that people not stay in jobs like this too long, as it hurts your prospects for future employment.

Potential Employer:  “So what kind of jobs have you done lately?”

You:  “Uh, I installed one switch at a customer.”

Anyhow, we had one other customer that managed to keep me surprisingly busy, considering their network was quite small as well.  They were a local builder, with three small offices connected together with ASAs and VPN tunnels.  The owner was filthy rich and also paranoid about security, which meant I was out there a lot changing passwords, tightening up ACLs, and cleaning up the mess the last network engineer had left.

The owner had a ranch near Willits, CA, which was reputed to be the size of the city of Concord, CA.  He also had two jets to take him to his private landing strip at the ranch.  Being a pilot myself, the prospect of a trip in a small jet to his ranch made me wish for some sort of network problem up there.  However, there wasn’t much up there for me to work on.  He had a single ASA 5505 connected to a satellite uplink, which he primarily used to connect to the cameras (which he had everywhere) at the ranch.

One day, my contact at the builder told me the cameras weren’t reachable.  Yes!  Finally a trip in the jet.  We set a date and I spent my time wondering whether I’d get the Lear or the Citation.

Unfortunately, when the day rolled around, the weather was hideous.  A Lear jet can handle most any weather, but the little airstrip had no instrument approaches.  Instead, my contact gave me an alternative:  I was to drive up there with her in-house cabling contractor (I’ll call him “Tim”) to do the job.  (I never understood why a business this small had an in-house cabling contractor.  As far as I knew he didn’t work on the actual construction projects associated with the company.)  Now from San Francisco, the drive to Willits is about 2.5 hours.  However, the ranch was not in Willits, only near it.  After driving 2.5 hours to Willits, we had another hour’s drive over dirt roads to the middle of nowhere.

The cabling contractor was exactly the sort of person with whom I have nothing in common, and spending 3.5 hours in a car with him, in the era before smartphones offered a handy distraction, was painful.  Tim loved fishtailing his truck as we drove on dirt roads on the side of a mountain.  I think he also liked just scaring the white-collar guy.  It worked.

We arrived at the ranch and Tim opened up the back of his pickup.  “Can you give me a hand here?” he asked.  In the bed of his truck were several large carpet rolls and piles of dry cleaning.  I grabbed one end of a carpet roll and began the backbreaking work.  My company was billing me out at $250/hour to haul some lady’s dry-cleaning into her ranch.

The ASA itself was mounted in a metal box on a pole in the middle of the property, with a satellite dish on top.  I was amazed the ASA 5505 even functioned out there, given that the external temperature could reach over 100 degrees Fahrenheit.  The metal box housing the ASA was like an oven.  I consoled into it and immediately saw a problem.  Latency on the link was over one second round-trip.  There was no way he was going to get real-time video streaming over this slow satellite uplink.  I reported my findings to Tim and, after eating lunch with the ranch hands, we hopped back in the truck.  Tim put on a song called “You piss me off, f*cking jerk” while we drove.  I guess he didn’t like me.

When I mentor people, I often tell them you have to know the right time to quit a job.  There were several signs in this story that it was time for a change.  With two CCIEs, installing a single switch or working on a single ASA 5505 was not really a good use of my skills.  Neither was moving in carpet rolls and dresses for $250/hour.  Luckily I had enough big jobs at the partner that I managed to get through my interviews at Juniper without trouble.

Meanwhile, a few years later I read about the FBI raiding the builder who was my customer.  I guess he had good reasons for cameras.


I recently replied to a comment that I think warrants a full blog post.

I’ve been here at Cisco working on programmability for a few years.  Brian Turner wrote in to say, essentially:  Hang on!  I became a network engineer precisely because I don’t want to be a coder!  I tried programming and hated it!  Now you’re telling me to become a programmer!

As I said in my reply, I have a lot of sympathy for him.  It reminds me of a story.

Back when I was at Juniper, I met with the IT department’s head of automation to discuss using some of his tools for network automation.  Jeremy was an expert in all things Puppet and Ansible, and a rather enthusiastic promoter of these tools on the server/app side of the house.  He had also managed to get Puppet running on a Junos device.  I was meeting with him because, frankly, the wind seemed to be blowing in his direction.  That said, I did not share his enthusiasm.  He told me about a server guy he had worked with, Stephane.  When Jeremy proposed to Stephane that he should use automation tools to make his life easier, Stephane vehemently rejected the idea, and the meeting ended with Stephane banging his fists on the table and shouting “I am not a coder!”

Flash forward a couple years and Stephane ended up the head of automation for a major company.  Apparently he finally bought into the idea.

Frankly I had no desire to become a coder either.  When I interviewed at Cisco, most of my discussions were around the controllers I was working with at the time, data center fabrics, etc.  When I arrived, my new boss assigned me as his Principal TME for programmability.  I never claimed to be an expert in this area.  Two months later I was presenting at Tech Field Day, in front of experienced automation guys like Jason Edelman and Matt Oswalt, on how to run Puppet on a Nexus switch.  Three years later and I’m known as a NETCONF/YANG guy.  I’d barely heard of either when I started.

As I replied to Brian, Cisco doesn’t need him, or anyone else, to learn Python or YANG or whatever.  Think about it from my perspective in product management.  Implementing YANG models for all of IOS XE is a massive undertaking.  Engineering devoted a huge amount of effort to pull this off.  Huge.  Mandating YANG models for ongoing development burns cycles.  Product marketing and engineering would never prioritize this unless we thought there was a high probability someone would use it.  In other words, it’s not that we want people to use it so much as that customers want us to develop it.  We have demand for programmable interfaces on network devices, and hence we’ve delivered them.  My job as a TME is not to push NETCONF/YANG on anyone, but to provide the enablement that makes it easier for people to use this technology if they themselves want to.

As I often say in my presentations, the why is important.  Why do some customers demand these interfaces?  Well, because they know Notepad is a horrible automation tool, and it’s what 90% of network engineers use.  If you want to configure 50 switches, you’re going to configure one, paste the config into Notepad, tweak a few values, and then paste it into the next switch.  Do this 48 more times and tell me if this is the best use of your time as a highly skilled network engineer.  You can write a script to do this and save yourself a lot of trouble.  Or use Ansible to do it.  Or Cisco DNAC.  Whatever you want.  But if you want any of these tools to work efficiently, you need a machine interface, which CLI is not.  If you don’t believe me, try writing a script to do regular expression-based parsing of CLI outputs.  It’s a lot easier with YANG.
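
For comparison, here’s roughly what the YANG side looks like.  This is a minimal sketch using the ncclient library, assuming an IOS XE device with netconf-yang enabled; the address and credentials are placeholders:

# Minimal sketch: pull structured interface data over NETCONF instead of
# regex-scraping CLI.  Assumes an IOS XE device with "netconf-yang"
# enabled; host and credentials below are placeholders.
# Requires: pip install ncclient
from ncclient import manager

# Subtree filter for the standard ietf-interfaces YANG model
FILTER = '<interfaces xmlns="urn:ietf:params:xml:ns:yang:ietf-interfaces"/>'

with manager.connect(host="192.0.2.1", port=830, username="admin",
                     password="admin", hostkey_verify=False) as m:
    reply = m.get_config(source="running", filter=("subtree", FILTER))
    print(reply.data_xml)  # structured XML keyed to the YANG model (no regexes)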

The point is not for network engineers to become programmers.  The point is to add some tools to your toolbox to help you focus on what you do well.  One weekend spent with a Python course and one more weekend with a DevNet course on YANG will give you a tool you can use to make your life easier.  That’s it.  Some customers may take it a lot further, of course, and go way into CI/CD workflows and that’s fine.  If you want to do 95% of your work in CLI and write a few scripts to do the other 5%, that’s fine.  If you want to use Cisco DNAC to do almost everything, knock yourself out.  It’s about what works best for you, as a network engineer.

I often point out how lousy my code quality is.  I’m sometimes ashamed to show the code for some of the scripts I’ve written.  I’m not a coder!  That’s a point I often make.  I don’t want to be a full-time software developer.  I’m a network engineer.  So for Brian and all the other CCIE’s out there, keep doing what you do best, but don’t close yourself off to some additional tools that will make your life easier.

I’ve mentioned before that EIGRP SIA was my nightmare case at TAC, but there was one other type of case I hated:  QoS problems.  Routing protocol problems tend to be binary.  Either the route is there or it isn’t;  either the pings go through or they don’t.  Even a flapping route is just an extreme version of the binary problem.  QoS is different.  QoS cases often involved traffic that passed sometimes, or in certain amounts, but ran into trouble when traffic of different sizes went through, or dropped at a certain rate.  The routes could be perfectly fine, pings could pass, and yet QoS was behaving incorrectly.

In TAC, we would regularly get cases where the customer claimed traffic was dropping on a QoS policy below the configured rate.  For example, if they configured a policing profile of 1000 Mbps, sometimes the customer would claim the policer was dropping traffic at, say, 800 Mbps.  The standard response for a TAC agent struggling to figure out a QoS policy issue like this was to say that the link was experiencing “microbursting.”  If a link is showing an 800 Mbps traffic rate, this is actually an average rate, meaning the link could be experiencing short bursts above this rate that exceed the policing rate, but are averaged out in the interface counters.  “Microbursting” was a standard response to this problem for two reasons:  first, it was most often the problem;  second, it was an easy way to close the case without an extensive investigation.  The second reason is not as lazy as it may sound, as microbursts are common and are usually the cause of these symptoms.
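
To see how an average can hide a microburst, here’s a quick back-of-the-envelope illustration (the numbers are invented):

# Back-of-the-envelope illustration of microbursting (numbers invented).
# A counter that averages over one second can report 800 Mbps while every
# burst within that second exceeds a 1000 Mbps policer.
policer_mbps = 1000
burst_mbps = 1600      # instantaneous rate during each burst
duty_cycle = 0.5       # bursting half the time, idle the other half

average_mbps = burst_mbps * duty_cycle
print(average_mbps)               # 800.0: what the interface counter shows
print(burst_mbps > policer_mbps)  # True: why the policer drops anyway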

Thus, when one of our large service provider customers opened a case stating that their LLQ policy was dropping packets before the configured threshold, I was quick to suspect microbursts.  However, working in high-touch TAC, you learn that your customers aren’t pushovers and don’t always accept the easy answer.  In this case, the customer started pushing back, claiming that the call center which was connected to this circuit generated a constant stream of traffic and that he was not experiencing microbursts.  So much for that.

This being the 2000’s, the customer had four T1’s connected in a single multi-link PPP (MLPPP) bundle.  The LLQ policy was dropping traffic at one quarter of the threshold it was configured for.  Knowing I wouldn’t get much out of a live production network, I reluctantly opened a lab case for the recreate, asking for two routers connected back-to-back with the same line cards, a four-link T1 interconnection, and a traffic generator.  As always, I made sure my lab had exactly the same IOS release as the customer.

Once the lab was set up I started the traffic flowing, and much to my surprise, I saw traffic dropping at one quarter of the configured LLQ rate.  Eureka!  Anyone who has worked in TAC will tell you that more often than not, lab recreates fail to recreate the customer problem.  I removed and re-applied the service policy, and the problem went away.  Uh oh.  The only thing worse than not recreating a problem is recreating it and then losing it again before developers get a chance to look at it.

I spent some time playing with the setup, trying to get the problem back.  Finally, I reloaded the router to start over and, sure enough, I got the traffic loss again.  So, the problem occurred at start-up, but when the policy was removed and re-applied, it corrected itself.  I filed a bug and sent it to engineering.

Because it was so easy to recreate, it didn’t take long to find the answer.  The customer was configuring their QoS policy using bandwidth percentages instead of absolute bandwidth numbers.  This meant that the policy bandwidth would be determined dynamically by the router, based on the links the policy was applied to.  It turned out that IOS was calculating the bandwidth numbers before the MLPPP bundle was fully up, and hence was using only a single T1 as the reference for the calculation instead of all four.  The fix was to change the order of operations in IOS so that the MLPPP bundle came up before the QoS policy was applied.
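
The configuration was shaped something like this hypothetical sketch (the class and policy names are mine, not the customer’s):

! Hypothetical sketch of a percent-based LLQ policy on an MLPPP bundle.
! "priority percent" is resolved against the interface bandwidth, which
! an MLPPP bundle derives from its member links (four T1s here).  The bug
! resolved it while only one T1 was up: a quarter of the intended rate.
class-map match-all VOICE
 match ip dscp ef
!
policy-map LLQ-POLICY
 class VOICE
  priority percent 25
!
interface Multilink1
 service-policy output LLQ-POLICY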

So much for microbursts.  The moral(s) of the story?  First, the most obvious cause is often not the cause at all.  Second, determined customers are often right.  And third:  even intimidating QoS cases can have an easy fix.

I was doing well on the blog for a few months but lately fell behind.  With (now) 12 people reporting to me, and three major areas of responsibility (SD-Access, Assurance, and Programmability), it’s not easy to find time to write up a blog post.  I have about five drafts needing work but I cannot seem to find the will to finish them.  Sometimes, however, it just takes a spark to get me going.  That spark came in my inbox from Ivan Pepelnjak.  I like Ivan’s blog posts, which, while often not favorable to Cisco, are nonetheless fair and balanced and raise some very important points.

“Why Is Every SDN Vendor Bashing Networking Engineers?” asks Ivan in the form email I received.  “[T]he vendors know they wouldn’t be able to sell their latest concoctions to people who actually understand how networking works and why some architectures have no chance of ever working in real life,” answers Ivan.  “The only way to sell the warez is to try to convince everyone else how to get rid of the pesky ossified CLI jockeys.”

Now I work for a vendor, and since I deal with the aforementioned products, I guess I am an SDN vendor.  That would seem to qualify me to speak on this subject.  (With, of course, the usual disclaimer that the opinions here are my own and do not represent Cisco officially.)

Selling Concoctions

I must admit, I do want to sell our products.  Everyone at Cisco should want our products to sell.  Just about all of us have a personal, financial stake in the matter, whether through stock grants or the ESPP.  We would be insane not to want people to buy our products.  I, and most of my co-workers, are driven by far more than money, however.  We all want to know that our work means something, and that we are coming up with innovative solutions to problems.  Otherwise, why show up at the office every day?

We operate in a highly competitive environment, which means if we are not constantly innovating and coming up with better ways to do things, we will all suffer.  You can complain about the macroeconomic system, and believe me, I’m not a Randian, objectivist believer in unbridled capitalism.  But, at the end of the day, a public company needs to create the perception of future value in the eyes of the stock market, and that’s a motivating factor for all of us.

These things being said, I’ve been in product management for a few years now and I have never heard anyone, ever, talk about trying to put one over on our customers.  I’m not saying that’s what Ivan means here, but it’s an accusation I’ve heard before.  In the first place, our customers are network engineers who are quite smart.  Whenever I present to a customer and am not crystal clear about what I’m talking about and what advantage it brings them, they let me know it.  We’re constantly trying to find ways to do things better and make our customers’ lives easier.  As somebody who worked in IT for many more years than I’ve worked in product management, I’m very interested in this subject.  A lot of things frustrated me back then, and I want to fix the things that used to annoy me.  You can argue about whether we’ve come up with the right ideas, but I hope nobody questions our motivations.

CLI Jockeys

Do I bash CLI jockeys in order to sell my products?  I should hope not, given that most of my customers are CLI jockeys, as I am myself!  I have two CCIEs and a JNCIE.  I spent a couple years in routing protocols TAC and many years in IT.  I spent a long time learning my trade and I have a lot of respect for those who have put the time and effort into learning it as well.  It’s not easy.

However, I don’t operate under the delusion that network engineers do a good job of configuring and managing CLI.  When I was at Juniper, I had designed a new NGMVPN system for our WAN.  I handed it off to the implementation team with some sample configs and asked them to come back to me with their plan.  I think we were touching about 20 devices the first go-around.  The engineer came back with 20 Word documents.  He had taken my sample config, copied and pasted it into Word, and then modified the config in a separate Word doc for each CE/PE he was touching.  CLI itself isn’t the problem;  the way we manage it is.  This is where programmability and automation tools come in.  At the very least, Ansible-style templating would have made this easier.  Software-Defined Networking (a very loose term, for what it’s worth) is not about replacing ossified CLI jockeys but about getting them to focus on what they should be doing (network engineering) and avoiding what they should not (pasting stuff into Word docs).
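
For instance, with a templating engine like Jinja2 (the same engine Ansible uses under the hood), those 20 Word documents collapse into one template plus a short list of per-device values.  A minimal sketch with made-up names and addresses:

# Minimal templating sketch: one template plus one row of variables per
# device replaces 20 hand-edited Word documents.  Hostnames and addresses
# are made up.  Requires: pip install jinja2
from jinja2 import Template

template = Template("""\
hostname {{ hostname }}
interface Loopback0
 ip address {{ loopback }} 255.255.255.255
""")

devices = [
    {"hostname": "pe-01", "loopback": "10.255.0.1"},
    {"hostname": "pe-02", "loopback": "10.255.0.2"},
]

for dev in devices:
    print(template.render(**dev))  # a per-device config, no copy/paste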

SD-Access takes this quite a bit further than Ansible, NETCONF, and other device-level tools.  Rather than saying “I want this device to be a LISP MS/MR” and so forth, you just say “I want this device to be a control plane node” and the system figures out what you need.  Theoretically we could change from LISP to some other protocol and the end user shouldn’t even notice.  The idea here is somewhat like a fly-by-wire system.  An airplane’s controls used to be directly coupled to the control surfaces via hydraulics.  Now, the pilot operates what is essentially a joystick, providing control inputs to a computer, which computes the best way to move the control surfaces given the conditions and relays that to servo motors in the wings, tail, etc.  The complexity of a fly-by-wire system is much higher than an old hydraulic system, but that complexity is hidden from the pilot in order to provide a better experience.  Likewise, with SD-Access, we’ve made the details more complex in order to deliver a better experience (TrustSec, layer 3 routed backbone, etc.) while hiding the complexity from the user.  It’s a different approach, for sure, but the idea is to allow engineers to focus on the right problems, like how to design their network, and not worry so much about configuration.

A New Era?

I’ve written extensively (see, for example, here, and here) about the role for CLI-jockey network engineers in the future.  When airplanes switched from the old dials and gauges to sleek, modern computerized (glass) cockpits, I’m sure some old-timers threw up their hands, retired, and got their old Piper Super Cubs out of the hangar to do some “real” flying.  But most adapted, and in the end, saw how the new automation systems helped them do their jobs better.  That’s an era I’m looking forward to.  And as I always, always say, the pilots who fly the new cockpits still need to understand weather systems, engines, navigation, etc.  We still need network engineers who know how networks operate.

Meanwhile, I won’t bash any CLI jockeys and I hope nobody else here does either.

My first full-time networking job was at the San Francisco Chronicle.  Now there isn’t much to the Chronicle anymore, but in the early 2000’s the newspaper was still going strong.  It was the beginning of the decline, but most people still took their local newspaper as their primary source of news.  Being a network engineer at a major metropolitan newspaper was fascinating.  It is a massive operation to print and distribute a newspaper every single day, and you can never, ever, miss.  There is no slippage of production deadlines.  It has to be out every day, and every day you start all over, with a blank page.

As the lead network engineer, I touched everything from editorial (the news and photography content of the paper) to advertising, pre-press, production systems, and circulation.  Every one of these was critical.  If editorial content didn’t make it through, there was nothing to go into the paper.  If advertising didn’t make it in, we didn’t earn revenue.  If pre-press or production had problems, the paper wasn’t printed.  If circulation wasn’t working, nobody could get their paper.

The Chronicle owned and operated three printing plants in the Bay Area.  One was on Army Street in San Francisco, while the other two were in Union City and Richmond in the East Bay.  The main office was on Fifth and Mission in downtown SF, so the paper was prepared in San Francisco and then sent to the plants via microwave.  That’s where I came in.

Our microwave system used a dish on the clock tower of our building.  From 5th and Mission we sent a signal up to Roundtop Mountain in the East Bay hills. At Roundtop we leased space in a little concrete bunker that was used for various kinds of radio communication including cellular.  From Roundtop we bounced the signal back to the three printing plants.

Chronicle building with the microwave visible on the clock tower

The microwave presented itself to us as T1 lines.  I had the T1 lines connected to dual routers at the main site and each of the plants.  In addition to the microwave, we had two backup T1’s to each plant, which were landlines from different carriers with diverse paths into the buildings.  We kept the microwave and the first backup T1 plugged into the routers, with the second backup on manual standby in case we needed it.  You don’t take chances with production in a newspaper, and we had triple redundancy on everything.  I used OSPF for failover between the microwave and the #1 backup circuit on the routers, and HSRP for gateway redundancy.  With only four sites it was a simple enough topology and it never gave me much trouble.
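
On each router, the setup looked roughly like this (a sketch with invented addressing and interface numbers, not the Chronicle’s actual config):

! Sketch of the redundancy scheme (addresses and interfaces invented).
! The microwave T1 has the lower OSPF cost, so it carries traffic; OSPF
! reroutes to the landline backup T1 if the microwave path fails.
interface Serial0/0
 description Microwave T1 (primary)
 ip address 192.0.2.1 255.255.255.252
 ip ospf cost 10
!
interface Serial0/1
 description Landline backup T1
 ip address 192.0.2.5 255.255.255.252
 ip ospf cost 100
!
router ospf 1
 network 192.0.2.0 0.0.0.255 area 0
!
! HSRP on the LAN side gives the hosts a single virtual gateway:
interface FastEthernet0/0
 ip address 192.0.2.66 255.255.255.0
 standby 1 ip 192.0.2.65
 standby 1 priority 110
 standby 1 preempt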

Until, that is, the day when I got a call from our operations center that the primary circuits were all down.  We were running on backups.  I immediately called up the production systems engineer who managed the microwave and told him his circuits were down.  “Impossible!” he said, “that microwave is five-nines reliable.  Check your router!”  I tried a few of the usual tricks:  shut/no shut on the interfaces, changing the line encoding, etc.  No go.  He wanted me to start swapping hardware, which was a big deal in a live newspaper environment, and seemed pointless.  If it was hardware, why would all of the circuits be down?

We bickered a bit before I moved to have the tertiary backup circuits swapped in so we had automatic failover while we worked on the microwave.  I got out our old T-berd tester to see if I could find any indication of the problem.  Then the systems engineer called:  “We need to meet at the clock tower, I’ve found the problem,” he said.  It’s always a relief to hear that when finger pointing is going around.

T-berd T1 Tester

I showed up at the entrance to the tower and followed the systems guy up a rusty ladder mounted to the wall.  Up in the tower there were bird droppings everywhere, and as I climbed higher I fought the urge to look down.  I’ve never much liked heights, and being out of shape and relying on my own strength to keep from falling several stories onto concrete was not promising.  Once I got to the top there was a large gap between the ladder and the floor, and I had to fight down panic as I flung my leg over to climb onto the concrete flooring.  From there we went outside and I saw the problem right away.

If you’ve ever been to a convention in San Francisco, chances are it took place in the Moscone Center.  In the early 2000’s, the city decided to expand Moscone by building a new Moscone Center West on 4th and Howard streets.  And from up on the clock tower it was plain as day:  they had built a cooling tower on the roof right in the path of our microwave beam.  I looked at the systems guy and said, “Well, I guess you could make popcorn in that cooling tower.  Anyways, there goes your five nines.”

We hastily called meetings together to decide what to do.  Sue the city?  Call the FCC?  Find another building to bounce the microwave off of?  Those were long term solutions but we had an immediate problem.  Two circuits might seem like enough, but they were telco circuits and not as reliable as the microwave was, at least when its path wasn’t blocked.

Getting the city to cut the cooling tower off Moscone West was a non-starter, especially when it was the newspaper asking, a newspaper that made its money being critical of city officials.  So, we decided to lease roof space from another building and add an additional repeater.  However, this was a long process.  We needed to negotiate with the landlord, replan the radio deployment, license it and obtain permits, add the new repeater, and re-point the old dish to the new building.  That last item was not as simple as it sounded, since this wasn’t a DirecTV dish.  It was welded to the tower, so we needed to hire ironworkers to cut it off and re-position it.

Meantime, we ordered T1’s from downtown SF up to Roundtop to bypass the segment that wasn’t working.  We’d go hardwire to Roundtop, then microwave the rest of the way.  This was not, by any means, an ideal solution, nor was it an overnight one, but we could at least get some redundancy back faster than it would take to add the repeater.  I’m glad we did, because shortly after the microwave went down we started having terrible problems with the landlines and needed the triple redundancy.

If you drive by Fifth and Mission now, the microwave dish is gone from the clock tower.  The Chronicle, a shadow of its former self, no longer operates its own printing plants, and has a circulation far smaller than it did in 2004, when I left.  As I said in my last post, it’s great to have a sense of purpose when you work in IT.  It wasn’t about fixing a microwave but about getting that paper in the hands of our readers.  I’m thankful I got to be a part of that for a few years, even if it cost me some vertigo and sleepless nights.

I’ve been in this industry a while now, and I’ve done a lot of jobs.  Certainly not every job, but a lot.  My first full time network engineering job came in 2000, but I was doing some networking for a few years before that.

I often see younger network engineers posting in public forums asking about the pros and cons of different job roles.  I’ve learned over the years that you have to take people’s advice with a grain of salt.  Jobs in one category may have some common characteristics, but a huge amount is dependent on the company, your manager, and the people you work with.  However, we all have a natural tendency to try to figure out the situation at a potential job in advance, and the experience of others can be quite helpful in sizing up a role.  Therefore, I’ve decided to post a summary of the jobs I’ve done, and the advantages/disadvantages of each.

IT Network Engineer

Summary:
This is an in-house engineer at a corporation, government agency, educational institution, or really any entity that runs a network.  The typical job tasks vary depending on level, but they usually involve a large amount of day-to-day network management.  This can be responding to complaints about network performance, patching in network connectivity (less so these days because of wireless), upgrading and maintaining devices, working with carriers, etc.  Larger scale projects could be turning up new buildings and sites, planning for adding new functionality (e.g. multicast), etc.

Pros:

  • Stable and predictable work environment.  You show up at the same place and know the people, unlike consulting.
  • You know the network.  You’re not showing up a new place trying to figure out what’s going on.
  • It can be a great chance to learn if the company is growing and funding new projects.

Cons:

  • You only get to see one network and one way of doing things.
  • IT is a cost center, so there is a constant desire to cut personnel/expenses.
  • Automation is reducing the type of on-site work that was once a staple for these engineers.
  • Your fellow employees often hate IT and blame you for everything.
  • Occasionally uncomfortable hours due to maintenance windows.

Key Takeaway:
I often tell people that if you want to do an in-house IT job, try to find an interesting place to work.  Being an IT guy at a law firm can be kind of boring.  Being an IT guy at the Pentagon could be quite interesting.  I worked for a major metropolitan newspaper for five years (when there was such a thing) and it was fascinating to see how newspapers work.  Smaller companies can be better in that you often get to touch more technologies, but the work can be less interesting.  Larger companies can pigeonhole you into a particular area.  You might work only on the WAN and never touch the campus or data center side of things, for example.

Technical Support Engineer

Summary:
Work at a vendor like Cisco or Juniper taking cases when things go wrong.  Troubleshoot problems, recreate them in the lab, file bugs, find solutions for customers.  Help resolve outages.  See my TAC Tales for the gory details.

Pros:

  • Great way to get a vast amount of experience by taking lots of tough cases
  • Huge support organization to help you through trouble
  • Short-term work for the most part–when you close a case you’re done with it and move on to something new
  • Usually works on a shift schedule, with predictable hours.  Maintenance windows can often be handed off.

Cons:

  • Nearly every call involves someone who is unhappy.
  • Complex and annoying technical problems.  Your job is 100% troubleshooting and it gets old.
  • Usually a high case volume which means a mountain of work.

Key Takeaway:
Technical Support is a tough and demanding environment, but a great way to get exposure to a constant stream of major technical issues.  Some people actually like tech support and make a career out of it, but most I’ve known burn out after a while.  I wouldn’t trade my TAC years for anything despite the difficulties, as it was an incredible learning experience for me.

Sales Engineer

Summary:
I’ve only filled this role at a partner, so I cannot speak directly to the experience inside a company like Cisco (although I constantly work with Cisco SE’s).  This is a pre-sales technical role, generally partnered with a less-technical account manager.  SE’s ultimately are responsible for generating sales, but act as a consultant or adviser to the customer to ensure they are selling something that fits.  SE’s do initial architecture of a given solution, work on developing the Bill of Materials (BoM), and in the case of partners, help to write the Statement of Work (SoW) for deployment.  SE’s are often involved in deployment of the solutions they sell but it is not their primary job.

Pros:

  • Architectural work is often very rewarding;  great chance to partner with customer and build networks.
  • Often allows working on a broad range of technologies and customers.
  • Because it involves sales, usually good training on the latest technologies.
  • Unlike pure sales (account managers in Cisco lingo), a large amount of compensation is salary so better financial stability.
  • Often very lucrative.

Cons:

  • Like any account-based job, success/enjoyability is highly dependent on the account(s) you are assigned to.
  • Compensation tied to sales, so while there are good opportunities to make money, there is also a chance to lose a lot of discretionary income.
  • Often take the hit for poor product quality from the company whose products you are selling.
  • Because it is a pre-sales role, often don’t get as much hands-on as post-sales engineers.
  • For some products, building BoM’s can be tedious.
  • Sales pressure.  Your superiors have numbers to make and if you’re not seen to be helping, you could be in trouble.

Key Takeaway:
Pre-sales at a partner or vendor can be a well-paying and enjoyable job.  Working on architecture is rewarding and interesting, and a great chance to stay current on the latest technologies.  However, like any sales/account-based job, the financial and career success of SE’s is highly dependent on the customers they are assigned to and the quality of the sales team they are working with.  Generally SE’s don’t do technical support, but often can get pulled into late-night calls if a solution they sell doesn’t work.  SEs are often the face of the company and can take a lot of hits for things that they sell which don’t work as expected.  Overall I enjoyed being a partner SE for the most part, although the partner I worked for had some problems.

Post-Sales/Advanced Services

Summary:
I’m including both partner post-sales, which I have done, and advanced services at a vendor like Cisco, which are similar.  A post-sales engineer is responsible for deploying a solution once the customer has purchased it, and oftentimes the AS/deployment piece is a part of the sale.  Occasionally these engineers are used for non-project-based work, more so at partners.  In this case, the engineer might be called to be on site to do some regular maintenance, fill in for a vacationing engineer, etc.

Pros:

  • Hands-on network engineering.  This is what we all signed up for, right?  Getting into real networks, setting stuff up, and making things happen.
  • Unlike IT network engineers, this job is more deployment focused so you don’t have to spend as much time on day-to-day administrative tasks.
  • Unlike sales, the designs you work on are lower-level and more detailed, so again, this is a great nuts-and-bolts engineering role.

Cons:

  • As with sales, the quality and enjoyability is highly dependent on the customers you end up with.
  • You can get into some nasty deployment scenarios with very unhappy customers.
  • Often these engagements are short-term, so less of a chance to learn a customer/network.  Often it is get in, do the deployment, and move on to the next one.
  • Can involve a lot of travel.
  • Frequently end up assisting technical support with deployments you have done.
  • Can have odd hours.
  • Often left scrambling when sales messed up the BoM and didn’t order the right gear or parts.

Key Takeaway:
I definitely enjoyed many of my post-sales deployments at the VAR.  Being on-site and doing a live deployment with a customer is great.  I remember one time when I did a total network refresh and VoIP deployment up at St. Helena Unified School District in Napa, CA.  It was a small school district, but over a week in the summer we went building-by-building, replacing the switches and routers and setting up the new system.  The customer was totally easygoing, gave us 100% free rein to do it how we wanted, was understanding of complications, and was satisfied with the result.  Plus, I enjoyed spending a week up in Napa, eating well and loving the peace.  However, I also had some nightmare customers who micromanaged me or where things just went south.  It’s definitely a great job to gain experience on a variety of live customer networks.

Technical Marketing Engineer

Summary:
I’m currently a Principal TME and a manager of TMEs.  This is definitely my favorite job in the industry.  I give more details in my post on what a TME does, but generally we work in a business unit of a vendor, on a specific product or product family, both guiding the requirements for the product and doing outbound work to explain the product to others, via white papers, videos, presentations, etc.

Pros:

  • Working on the product side allows a network engineer to actually guide products and see the results.  It’s exciting to see a new feature, CLI, etc., added to a product because you drove it.
  • Get to attend several trade shows a year.  Everyone likes getting a free pass to a conference like Cisco Live, but being part of making it happen is exhilarating.
  • Great career visibility.  Because the nature of the job requires producing content related to your product, you have an excellent body of work to showcase when you decide to move on.
  • Revenue side.  I didn’t mention this in the sales write-up, but it’s true there too.  Being close to revenue is generally more fun than being in a cost center like IT, because you can usually spend more money.  This means getting new stuff for your lab, etc.
  • Working with products before they are ever released to the public is a lot of fun too.
  • Mostly you don’t work on production networks so not as many maintenance windows and late nights as IT or AS.

Cons:

  • Relentless pace of work.  New software releases are constantly coming;  as soon as one trade show wraps up it’s time to prepare for the next one.  I often say TME work is as relentless as TAC.
  • Can be heavy on the travel.  That could, of course, be a good thing but it gets old.
  • Difficulty of influencing engineering without them reporting to you.  Often it’s a fight to get your ideas implemented when people don’t agree.
  • If you don’t like getting up in front of an audience, or writing documents, this job may not be for you.
  • For new products, often the TMEs are the only resources who are customer facing with a knowledge of the product, so you can end up working IT/AS-type hours anyways.  Less an issue with established/stable products.

Key Takeaway:
As I said, I love this job but it is a frenetic pace.  Most of the posts I manage to squeeze in on the blog are done in five minute intervals over a course of weeks.  But I have to say, I like being a TME more than anything else I’ve done.  Being on the product side is fascinating, especially if you have been on the consumer side.  Going to shows is a lot of fun.  If you like to teach and explain, and mess around with new things in your lab, this is for you.

It’s not a comprehensive list of the jobs you can do as a network engineer, but it covers some of the main ones.  I’m certainly interested in your thoughts on other jobs you’ve done, or if you’ve done one of the above, whether you agree with my assessment.  Please drop a comment–I don’t require registration but do require an email address just to keep spam down.

I was hoping to do a few technical posts but my lab is currently being moved, so I decided to kick off another series of posts I call “NetStalgia”.  The TAC tales continue to be popular, but I only spent two years in TAC and most cases are pretty mundane and not worthy of a blog post.  What about all those other stories I have from various times and places working on networks?  I think there is some value in those stories, not the least because they show where we’ve come from, but also I think there are some universal themes.  So, allow me to take you back to 1995, to a now-defunct company where I first ventured to work on a computer network.

I graduated college with a liberal arts degree, and like most liberal arts majors, I ended up working as an administrative assistant.  I was hired on at a company that both designed and built museum exhibits.  It was a small company, with around 60 people, half of whom worked as fabricators, building the exhibits, while the other half worked as designers and office personnel.  The fabricators consisted of carpenters, muralists, large and small model builders, and a number of support staff.  The designers were architects, graphic designers, and museum design specialists.  Only the office workers/designers had their own computers, so it was a quite small network of 30 machines, all Macs.

When the lead designer started spending too much time maintaining the computer network, the VP of ops called me in and asked me to take over, since I seemed to be pretty good with computers and technical stuff, like fixing the fax machine.

Back then, believe it or not, PCs did not come with networking capabilities built in.  You had to install a NIC if you wanted to connect to a network.  Macs actually did come with an Apple-proprietary interface called LocalTalk.  The LocalTalk interface consisted of a round serial port, and with the appropriate connectors and cables, you could connect your Macs in a daisy-chain topology.  Networking an office with thick serial cables of limited length was a big limitation, so an enterprising company named Farallon came up with a better solution, called PhoneNet.  PhoneNet plugged into the rear LocalTalk port, but instead of using serial cables it converted the LocalTalk signal so that it ran on a single twisted pair of wires.  The brilliance of this was that most offices had phone jacks at every desk, and PhoneNet could use the spare wires in the jacks to carry its signal.  In our case, we had a digital phone system that consumed two pairs of our four-pair Cat 3 cables, so we could dedicate one pair to PhoneNet/LocalTalk and call it good.

PhoneNet connector with resistor

We used an internal email system called SnapMail from Cassidy and Greene.  SnapMail was great for small companies because it could run in a peer-to-peer mode, without the need for an expensive server.  In this mode, an email you sent to a colleague went directly to their machine.  The obvious problem with this is that if I work the day shift, and you work the night shift, our computers will never be on at the same time and you won’t get my email.  Thankfully, C&G also offered a server option for store-and-forward messaging, but even with the server enabled it would still attempt a peer-to-peer delivery if both sender and receiver were online.
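
Conceptually, the delivery logic worked something like this little sketch (my own illustration of the general store-and-forward idea, not SnapMail’s actual code):

# Toy sketch of hybrid peer-to-peer / store-and-forward delivery, as I
# understood SnapMail's behavior.  Not its actual implementation.
server_queue = []                          # store-and-forward mailbox
online = {"alice": True, "bob": False}     # who the sender thinks is up

def deliver(msg, recipient):
    if online.get(recipient):
        print(f"direct to {recipient}: {msg}")    # peer-to-peer delivery
    else:
        server_queue.append((recipient, msg))     # queued until they log in

deliver("galley proofs ready", "alice")   # delivered immediately
deliver("galley proofs ready", "bob")     # held on the server

# The failure mode we hit: the online check said True while direct
# delivery silently failed, so messages were neither delivered nor queued.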

One day I started getting complaints about the reliability of the email system.  Messages were being sent but not delivered.  Looking at some of the troubled machines, I could see that they were only partially communicating with each other, and the failed messages were not being queued on the server.  Each peer seemed to think the other was online, when in fact there was some communication breakdown.

Determining a cause for the problem was tough.  Our network used the AppleTalk protocol suite and not IP.  There was no ping to test connectivity.  I had little idea what to do.

As I mentioned, PhoneNet used a single pair of phone wiring, and as we expanded, the way I added new users was as follows:  when a new hire came on board, I would connect a new phone jack for him, and then go to the 66 punch-down block in a closet in the cafeteria and tie the wires into another live jack.  Then I would plug a little RJ11 with a resistor on it into the empty port of the LocalTalk dongle, because the dongle had a second port for daisy-chaining, and this is what we were supposed to do if it was not in use.  This was a supported configuration known in PhoneNet terminology as a “passive star”:  passive, because there was nothing in between the stations.  This being before Google, I didn’t know that Farallon only supported 4 branches on a passive star.  I had 30.  Not only did we have too many stations and too much cable length, but the combined load on this giant circuit was huge because of all the resistors.

I had a walkthrough with our incredulous “systems integrator”, who refused to believe we had connected so many devices without a hub, which was called a “Star Controller” in Farallon terminology.  When he figured out what I had done, we came up with a plan to remove some of the resistors and migrate the designers off of the LocalTalk network.

Some differences between now and then:

  • Networking capability wasn’t built in on PCs, but it was on Macs.
  • I was directly wiring together computers on a punch-down block.
  • There was no Google to figure out why things weren’t working.
  • We used peer-to-peer email systems.

Some lessons that stay the same:

  • Understand thoroughly the limitations of your system.
  • Call an expert when you need help.
  • And of course:  don’t put resistors on your network unless you really need to!


In a previous post I had mentioned I co-authored a book on IOS XE Programmability with some colleagues of mine.  For those who are interested, the book is available here.

The book is not a comprehensive how-to, but a summary of the IOS XE features along with a few samples.  It should provide a good overview of the capabilities of IOS XE.  For those who were on my CCIE webinar, it should be more than adequate to get you up to speed on CCIE written programmability topics.

As with any technical book, there could be some errata, so please feel free to pass them along and I can get them corrected in the next edition.

I’ve mentioned in previous TAC Tales that I started on a TAC team dedicated to enterprise, which made sense given my background.  Shortly after I came to Cisco the enterprise team was broken up and its staff distributed among the routing protocols team and LAN switch team.  The RP team at that time consisted of service provider experts with little understanding of LAN switching issues, but deep understanding of technologies like BGP and MPLS.  This was back before the Ethernet-everywhere era, and SP experts had never really spent a lot of time with LAN switches.

This created a big problem with case routing.  Anyone who has worked more than 5 minutes in TAC knows that when you have a routing protocol problem, usually it’s not the protocol itself but some underlying layer 2 issue.  This is particularly the case when adjacencies are resetting.  The call center would see “OSPF adjacencies resetting” and immediately send the case to the protocols team, when in fact the issue was with STP or perhaps a faulty link.  With all enterprise RP issues suddenly coming into the same queue as SP cases, our SP-centric staff were constantly getting into stuff they didn’t understand.

One such case came in to us, priority 1, from a service provider that ran “cell sites”, which are concrete bunkers with radio equipment for cellular transmissions.  “Now wait,” you’re saying, “I thought you just said enterprise RP cases were a problem, but this was a service provider!”  Well, it was a service provider but they ran LAN switches at the cell site, so naturally when OSPF started going haywire it came in to the RP team despite obviously being a switching problem!

A quick look at the logs confirmed this:

Jun 13 01:52:36 LSW38-0 3858130: Jun 13 01:52:32.347 CDT:
%C4K_EBM-4-HOSTFLAPPING: Host 00:AB:DA:EE:0A:FF in vlan 74 is flapping
between port Fa2/37 and port Po1

Here we could see a host MAC address moving between a front-panel port on the switch and a core-facing port channel.  Something’s not right there.  There were tons of messages like these in the logs.

Digging a little further I determined that Spanning Tree was disabled.  Ugh.

Spanning Tree Protocol (STP) is not popular, and it’s definitely flawed.  With all due respect to the (truly) great Radia Perlman, the inventor of STP, choosing the lowest bridge identifier (usually the MAC address of the switch) as the root, when priorities are set to the default, is a bad idea.  It means that if customers deploy STP with default values, the oldest switch in the network becomes root.  Bad idea, as I said.  However, STP also gets a bad reputation undeservedly.  I cannot tell you how many times there was a layer 2 loop in a customer network, where STP was disabled, and the customer referred to it as a “Spanning Tree loop”.  STP stops layer 2 loops, it does not create them.  And a layer 2 loop out of control is much worse than a 50-second spanning tree outage, which is what you got with the original protocol spec.  When there is no loop in the network, STP doesn’t do anything at all except send out BPDUs.
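
The root-election problem, at least, is easy to avoid:  set the priority yourself instead of letting the lowest MAC address win.  A hypothetical sketch for the switch you intend to be root:

! On the intended root switch, don't leave the priority at its default of
! 32768, or the election falls back to the lowest MAC address (usually
! the oldest switch).  Priority must be a multiple of 4096.
spanning-tree mode rapid-pvst
spanning-tree vlan 1-4094 priority 8192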

As I suspected, the customer had disabled spanning tree due to concerns about the speed of failover.  They had also managed to patch a layer 2 loop into their network during a minor change, causing an unchecked loop to circulate frames out of control, bringing down their entire cell site.

I explained to them the value of STP, and why any outage caused by it would be better than the out of control loop they had.  I was told to mind my own business.  They didn’t want to enable spanning tree because it was slow.  Yes, I said, but only when there is a loop!  And in that case, a short outage is better than a meltdown.  Then I realized the customer and I were in a loop, which I could break by closing the case.

Newer technologies (such as SD-Access) obviate the need for STP, but if you’re doing classic Layer 2, please, use it.