Monday, January 12, 2015

Jabbered to Death: Counterproductivity Software, or TPS Reports in the Age of the Interwebs

" I'm going to need those TPS reports... ASAP..."  - Bill Lumbergh in Office Space.

* Side note: occasionally I apply a literary device called sarcasm, as well as hyperbole and stating the truth in a humorous or reframed way. A good example of that is found in the comment about SLAs. Aside from blogs (especially technology blogs - hopefully this one doesn't induce as much sleep as some others) being more readable if they have some entertainment value, there is often helpful truth hidden in humor. The degree of that depends on the reader, and the topic. Any sore toes with what is written here are unintentional, and we all know as professionals that even the best tools and methods get stretched and misused from time to time. We also know that there are no absolutes, and that any resemblance to companies/departments living, or soon to be scooped up by the Google-Facebook Complex, is purely coincidental, except for sanitized exemplars. So please enjoy, and if you receive a laugh, so much the better. If you receive good inspiration from this, then you owe me a penny. To understand that last comment, you may want to consider furthering your education in the Center for Information and Communication Sciences at Ball State University. Knowledge is a "Force" multiplier.

First let's clear the air, lest I offend any fellow Jedis - I'm not picking on Jabber specifically, it just happens to have the snappiest name, though given that jabbering is sometimes used to describe the sound a band of monkeys makes, it could be apropos.  It's all about the user of the tool, not just the tool.  So due respect to the Cisco Kids...

Now to the point.  How do you handle production incidents in your company?  I'm not asking about your SLAs (Service Level Agreements - basically agreements ahead of time which determine the period of time a service provider can say "I'm working on it, be patient" versus saying "I'm sorry, reality and promises made to get the business have unfortunately collided"), or your Help desk staffing, I'm referring to the response/communication mechanisms that are used to try to arrive at a solution.

We have worked very hard to provide a good way to attempt to improve the flow of information for all business processes, incident responses especially.  There was a day when we had the telephone,  and that was it.  The passage of information was very much serial, rather than parallel. 

I'm referring to the two basic ways we have of wiring up an electrical circuit - in series, which means the power flows to one place, then to the next, along a chain of connections, and in parallel, which means the power flows to all the places at the same time. For a frame of reference, if you've ever had one light in a strand of Christmas lights go out and take all the others with it, that's because the series was interrupted. The strands where that doesn't happen are wired in parallel. They are also more costly [implementation of foreshadowing complete].

The two people connected up could pass information in real time, but anyone else wanting in on the feed of information would receive a busy signal.  So the solution to that problem was the speakerphone/conference call.  That is properly understood as the "5 minute delay to find the number and figure the equipment out/sound like someone speaking backwards through a box fan" solution.  Aside from some impracticality involving gathering and finding equipment during a response, it is an okay solution.

However, distance and time are increasingly becoming inexcusable barriers to doing what we do, because the inverse of Moore's law also applies to user tolerance for delays, namely that the tolerable time for waiting is cut in half every 18 months.  So we saw the use of email in response to production issues.  It works efficiently and offers the parallel passage of information so that a person being away doesn't halt anything.  There are distribution lists and other ways of getting things out to everyone at the same time.
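To put a number on that tongue-in-cheek inverse law, here is a minimal sketch. The 18-month halving period comes from the paragraph above; the 60-second starting tolerance and the function name are my own assumptions for illustration, not measurements of anything.

```python
def tolerable_wait(initial_seconds: float, months_elapsed: float,
                   halving_period_months: float = 18.0) -> float:
    """Tolerable wait after months_elapsed, halving once per halving period."""
    return initial_seconds * 0.5 ** (months_elapsed / halving_period_months)


# Hypothetical starting point: users put up with a 60-second wait today.
for months in (0, 18, 36, 54):
    print(months, "months ->", tolerable_wait(60.0, months), "seconds")
```

Under those assumptions the tolerated wait drops from 60 seconds to 7.5 seconds in four and a half years, which is roughly how it feels on the receiving end.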

The issue with that was that such a solution was asynchronous. Not only could you not know when someone acted on something, you had no idea if they even saw it, and there was the risk of a fix being overwritten by something happening out of order. In terms of command and control structures, email is the UDP of human interaction (aw, shoot, sometimes you just have to Google some things - suffice it to say TCP is more reliable, UDP is faster). Information is sent out and you don't know if it was received unless someone sends you something in reply.
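For anyone who would rather not Google it, the email-as-UDP comparison can be sketched with Python's standard socket module. The two function names, the messages, and the port number are hypothetical choices of mine; the point is only that a UDP send "succeeds" whether or not anyone is there, and the only proof of receipt is a reply.

```python
import socket


def fire_and_forget(message: bytes, address: tuple) -> int:
    """Send one UDP datagram; the return value is only the byte count handed to the OS."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        return sender.sendto(message, address)


def send_and_wait_for_reply(message: bytes, address: tuple,
                            timeout: float = 0.5) -> bool:
    """Send one datagram, then wait briefly; True only if some reply arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        sender.settimeout(timeout)
        sender.sendto(message, address)
        try:
            sender.recvfrom(1024)
            return True
        except OSError:  # a timeout, or the OS reporting that nobody is listening
            return False


if __name__ == "__main__":
    # Hypothetical address: a local port where (probably) nobody is listening.
    nobody_home = ("127.0.0.1", 50555)
    print(fire_and_forget(b"prod data is bad", nobody_home))
    print(send_and_wait_for_reply(b"did you see my email?", nobody_home))
```

The first call reports bytes sent even into the void. The second one is the human habit of following up, which is exactly where the extra traffic starts.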

Enter the instant messaging client.  It allows the direct sharing of real time information like the phone call, the use of groups in different places like the conference call, and the broadcasting ability of email, adding in the ability to tell if someone is online, or offline when something comes up.  It seems to be the perfect vehicle for efficiently addressing problems, connecting the managerial side to the technical side with the real time, blow-by-blow progress toward a solution. The technology is nailed down.

The TECHNOLOGY is nailed down.  The USE of the technology and carbon-based error resolution procedures are not.  One problem is that, unless there is a well-followed process, there is a tendency to use all of them, simultaneously.  The end result of that is an inefficient resolution.  The most efficient tools we can come up with, when used in conjunction, counteract their own efficiencies?  Here's how.  The following is a somewhat sanitized version of a real situation.  See if you can spot the efficiency losses.

☆ Tech_0 receives an email from Programmer_0: There's a problem with some data loaded a couple days ago, I have the correct data ready to go in to replace it.  We need to verify it will work in test before adding it into production.

☆ T_0 sends an email to P_0: Will request the appropriate group do the delete, then when it is complete I'll load...

☆ [popping up in mid-email] IM from Manager_0: hearing that there is some bad data out there, can you check into that to see what is happening?

☆ T_0 IM to M_0: just received msg. from P_0, starting the process.

☆ [T_0 returning to email]: ...the new data into test, validate it, then load...

☆ IM from M_0: Okay let me know when it is done.

☆ [T_0 finishing email]: ...it into prod.  I'll let you know when all is done.

☆ When message is sent, open waiting email from Helpdesk Manager_0: I understand there is a data issue.  Can you please look into it?

☆ T_0 initiates process before answering by using in/out indicator to locate proper systems tech to delete bad data, locates and sends IM to Systems_0: Production issue.  Need to...

☆ IM from HM_0: I sent you an email.  There's a production data problem that needs to be taken care of.

☆ IM from T_0 to HM_0: I received the email and am initiating the process.  I will send an update when it is done.

☆ T_0 finishing IM to S_0: ...delete load 1 and load 2.  Are you available to do that?

☆ IM from Helpdesk Tech_0: I hear there is a production data issue.  What is the change order for it?

☆ IM from S_0: I can take care of that.  Give me a couple minutes and I'll let you know when it is done.

☆ IM from Manager_1: Just got out of a meeting and heard there may be a data problem in production.  I asked the Helpdesk to open a change order for it.

☆ IM to M_1: That is a change to the way that we normally handle the process, since those are normally for configurations and such.  This is replacing bad data...

☆ IM from Helpdesk Tech_1: I opened up a change order for the data issue that's happening.  What should I put in for the task steps?

☆ T_0 checks email to get the change order number for further use.   While there sees an email from M_0 asking for an update.

☆ T_0 IM to HT_1: X, Y and Z are the steps we would do.  We've never done a change order for this before, so there may be additional components.  The major pieces are what is listed though.

☆ T_0 email to M_0: Starting process, will let you know when done.

☆ IM to HT_0: The change order is 123.

☆ T_0 opens email with correct data to extract data.

☆ IM from HT_0:  So you opened the change order already?

☆ IM to HT_0: No, HT_1 opened one up and sent me the number.

☆ Email from HS_0: where is the fix for the data?  M_1 is asking for an update.

☆ Email from M_1: Where are we on the data problem?

☆ IM from HS_0: Is there an ETA on the fix?

☆ T_0 attempts to pull threads into a group chat for updating purposes.  More than half are shown as in meetings/do not disturb, but are already in existing chat sessions.  Half the requested personnel keep going in the existing individual chats rather than joining the chat.

For brevity's sake I'll summarize the rest. There were several more email exchanges with various individuals, and the questions and requests for updates kept flying in. The whole thing was taken care of in 40 minutes. That includes IMs (Instant Messages) asking me to mark various tasks done, some of which couldn't be done yet. There were a total of 9 different IM chats, and email exchanges with 4 of those individuals, who also had chat windows open, plus three unique email recipients. I work remotely, otherwise I imagine the phone would have been busy as well.

The actual working time involved to resolve the issue was less than 10 minutes total.  The communication and other portions consumed three times as long.  It took the efforts of two people to directly fix it, and the other dozen-plus were just icing on the cake.
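For the spreadsheet-minded, the arithmetic behind that claim is short. The 40 and 10 come straight from the account above; since the working time was "less than 10 minutes," I treat 10 as an upper bound, so the real overhead was if anything a bit worse than this sketch shows.

```python
total_minutes = 40      # whole incident, from the summary above
working_minutes = 10    # "less than 10", so this is an upper bound

communication_minutes = total_minutes - working_minutes
overhead_ratio = communication_minutes / working_minutes
working_share = working_minutes / total_minutes

print(communication_minutes)  # 30 minutes spent talking about the fix
print(overhead_ratio)         # 3.0 times the actual working time
print(working_share)          # 0.25 of the incident went to the fix itself
```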

Examining this, it appears to be a conflict between the need to provide a solution and the need to drive a solution.  The basis of that can be found in the fundamental shift that has happened in the professional world.  

Imagine a brownout happened 50 years ago: "The coil winder got knocked out when the brownout happened.  How long to rebuild the motor?"

"Assuming we have the parts in the crib," (and they always did, unlike a lot of the bare bones equipment stores we have to deal with today), "it'll be two hours."

Imagine the same scenario within our context today: "The brownout took down the network.  When will it be back up?"

"If we just need to do a reboot, 15 minutes should be it.  If we lost a switch then it will be an hour or two, depending on which one and..."  The difference is that the structures and mechanisms we use today aren't so simple. 

The pacing of equipment and processes is so much different now than it was.  What's worse, asking for an estimate to bring the network back up is like asking your dentist how long it takes to repair a piston ring - a very vague guess is the best that can be done, given that the issue could reside in one of so many places.  Worse, the issue could be caused by two components simultaneously, and NOBODY gives an estimate for that scenario.  Actually there's one person who starts to, but they get stifled and transferred to another office far, far away.

While I'm a firm believer in never bringing a problem without a solution, this is a little different.  This is a problem generated by multiple, individually-good solutions, which step on each other.  So, with that in mind, I submit to the ether the following three suggested maxims for responding to production issues:

1) Let your one be a one, and your zero be a zero. 

As highlighted above, an estimate to fix an unknown problem is worthless.  What's worse, once an estimate is given, even if there is something totally new and frightening found, the estimate comes across as law, and turns into a bludgeon if it is exceeded.  So, since we trust that our IT staff is dedicated to quick, efficient resolution of issues, the statement that they are actively working on it needs to suffice.  Demanding a 1 where there is a 0, or a 4 where there is a 1, is pointless.

Let me illustrate with an example from a few years ago.  A computer room operator performed a scheduled maintenance task every Friday night at midnight.  There was a certain user who inevitably (I mean every Friday night - there's passive-aggressive, then there was this individual) would try to log in, find it was down, and call the operator.  Hearing it would be 90 minutes (this WAS a few years ago) they would call about every 10 minutes asking for an updated estimate.  This even though in this case all was known - a maintenance process is a maintenance process.  

One night he had enough and said it was lying in pieces and he had just finished hosing it out, but would get it back together as soon as it was dried out, lest someone get electrocuted.  The calls stopped, but the true question was why they started to begin with.
 
Then again, in a network outage, in which recovery requires dedicated efforts of all hands on board, why does there seem to be such an emphasis on getting updated estimates, rather than a functional network?  If IT knew exactly what was wrong they could give a highly accurate estimate, in which case updates would be unnecessary, and the outage could probably have been foreseen and even avoided, and if they had no idea, then the original estimate and subsequent updates would be meaningless.  And, lest we forget, an estimate never fixed anything.  If it's wrong it's not like you can obtain a solution by turning it in like a warranty: "You said 5 hours, and it's been 6 hours, so I demand my time back and a working interwebafacebookagoogle now."

2) Circumventing the system when an issue arises is like sticking a penny into a fuse holder: a bad idea, waiting to destroy everything.
 
In the example detailed above, did you catch that there were a couple instances where someone went around the regular pathway because they "knew a guy" in IT?  Did you also see where that added additional circuits to the communications chain? 

In a response to an issue, there may indeed be a need for some parallel communication to take place.  However, that need should be initiated by IT in an effort to get some highly-specialized, or at least scoped information to directly resolve the issue.  It should never be initiated by someone trying to get an update. 

If this sounds like an avoidance tactic, it isn't.   It is an efficiency tool.  Each person processes a maximum amount of communications.  I would prefer to maximize problem resolution energy, rather than assess where we are every half an hour and report that officially and via the many chains opened up during the issue.  Remember that, unless there is a formal structure, each interaction could be "the official" one, and failing to communicate back to the correct person can cause serious trouble.   So we communicate to all who have opened one up so we don't miss the one we need.

3) 186,000 miles per second, minus resistance, is all we have, and even that is faster than a hard drive.

That's all we have to work with.   In the above example of a coil winder, one electrician could be winding the wire around the commutator while the other one pop rivets in the assembly that engages the brushes, effectively multiplying the efforts in the same amount of time. Though sometimes that is possible with computing technology, oftentimes there are tasks which are tied to physical limitations of equipment, and some things must be done solo, and at the speed of the process/machine, and no quicker.

For a restoration of data from a tape, for example, there is a hard limit on the speed the information comes back with.  You can use different algorithms and techniques to make the new versions of backups written to tape more efficient, but generally when restoring, your speed is part mechanical and part logical, and you are stuck restoring with the best logic that existed when the backup was created. 
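Here is a rough sketch of that hard limit. The function name, the 500 GB backup size, and the 100 MB/s drive speed are all hypothetical numbers I picked for illustration, and the model ignores tape positioning and any decompression work. Notice what is missing from the parameter list: the number of people asking for an update.

```python
def restore_minutes(backup_gb: float, drive_mb_per_s: float) -> float:
    """Minimum minutes to stream a backup off one tape drive at a fixed rate."""
    seconds = (backup_gb * 1000) / drive_mb_per_s  # 1 GB taken as 1000 MB
    return seconds / 60


# Hypothetical restore: 500 GB coming back at a sustained 100 MB/s.
print(round(restore_minutes(500, 100)), "minutes, give or take")
```

With those made-up figures the restore takes well over an hour no matter how many chat windows are open.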

So while no offense was meant here, if you find that the scenario above, or one like it, has played out in your shop, you're doing nothing wrong.  You're using the tools as needed.   The problem is that by doing nothing wrong, you could be failing to do it right as well.  Such is the paradox of the bit twiddler. 
