[Copyright 1981,1984,1987,1996,1998-2012 Frank Durda IV, All Rights Reserved. Mirroring of any material on this site in any form is expressly prohibited. The official web site for this material is: http://nemesis.lonestar.org Contact this address for use clearances: clearance at nemesis.lonestar.org Comments and queries to this address: web_stories at nemesis.lonestar.org]
If you have always wondered about the various stages of support action your computer maker provides when your system self-destructs, here is a handy guide that will tell you everything you ever wanted to know about the Six Stages of Field Service Support and how to identify the symptoms of being at each stage.
Much of this knowledge is based on twenty-five years of careful observations at sites with DECsystem-20s, DEC Alphas, IBM 370/155, VAX 11/780, IBM 4341, and other systems with many different service organizations. Some machines needed more help than others. For example, the DEC-20 once managed to get over 300 hours of down-time in a single month, so it needed lots of support. And that wasn't one of the months when the computer room filled with sewage.
For those of you that think I didn't like the DEC-20, don't get me wrong. It was one of the best computer architectures I have ever dealt with, and TWENEX and TOPS-20 were two of the best operating systems around at the time, certainly better than OS/MVT, VM, HASP, MS-DOS and the other junk that the competition were pushing. And the TOPS-20 clock won't go boom in a few years like most other systems I know. DEC learned their lesson in 1978 when all the PDP-11s fell over.
Regardless of who made your computer or how large or small it is, these events may seem hauntingly familiar.
"May the road rise to meet you, and may you never go beyond Stage Two."
- Ancient greeting, date unknown.
Stage Zero is where your journey begins when all is well and then suddenly all the terminals around the university or office go dead and the "French-fryer" beepers on the DECwriter terminals all simultaneously go off. This is the way that the TOPS-20 operating system told everybody that it had crashed and that it had also lost what everybody was doing. Alerted by the beeping sounds or the cries of anguish from the users (or both), the keepers of the system rush to the machine room. On arriving in the machine room, you may smell the problem or just see the flames coming out of the processor.
This type of event invariably happens four minutes after the daily service contract period ends, which means it will cost big bucks to get the Field Service Engineers (FSEs) to come out to fix the problem right now. Your management has also disappeared for the day, leaving you with no authority to spend money to get help.
Meanwhile, the users are already starting to press their noses and cheeks against the computer room windows, as if they think that their concerned stares will somehow make things better. Believe me, it does not. To avoid gazing in that direction, everybody in the computer room will avoid looking towards the windows, even to the extreme of walking backwards and feeling behind themselves for the manuals in the bookcase that was placed too close to the visitor viewing windows. More often than not, this results in more accidents, such as knocking over the one gallon jar of jalapeno peppers that formed the complete daily evening meal for the console operator. Seriously.
Left with all the options that don't involve spending new money, you go and call DDC, the Digital Diagnostic Center (saying this is always accompanied by a jarring chord of music, such as that heard in the film "Monty Python and the Holy Grail" when they say "A SHRUBBERY!").
For those of you who don't deal with DEC, DDC is this neat service that you call when your computer system starts doing strange things. DDC can run diagnostics on your computer from where they live (used to be Colorado Springs, but its latest incarnation seems to be in a facility in Atlanta which seems to be named after a space alien who used to appear in badly-written sitcoms - ALF) or study core dumps your machine may have emitted, and these tests help isolate the problem before the local service office even knows that there is a problem. Well, that was the idea anyway.
If the support structure for your computer doesn't have something equivalent to DDC, proceed to Stage One.
Assuming you do have a DDC-a-like, you give them a call, and they will take your name, phone number and serial number. (The serial number of the computer, not yours.) They may even ask what the trouble is. Now they will tell you that they will call you back as soon as they have a service representative available. Actually, this delay is deliberate and gives DDC time to check their records to make sure that the serial number you gave them really resides at the phone number you gave them when you called. They learned this precaution from the pizza delivery industry, either that or else DEC had a lot of problems with the wild guys over at the Delta-I-Q fraternity house at MIT calling in prank trouble tickets on the campus computers (or perhaps on computers belonging to other schools). I hear that "wild" stuff like that still happens all the time up there.
Just substitute the name of your remote support service company where it says "DDC".
Anyway, sometime later, someone from DDC will call you over some trans-Atlantic phone line. Going to a quiet room to talk to the service representative won't help as they always have you go back into the machine room and load the field-service pack and set switches on the front-end processor and "boot from SW" or perhaps they have you just stand in front of the computer and see if the floppy drive light comes on. For machines with more than one switch, the DDC reps always seem to insult your intelligence by giving the switch settings like: "Set the two right-most switches down and then skip two switches and push the next one down." People who have called DDC more than once learn to ignore these instructions since DDC always asks for the same switch settings and so you just set the proper octal value on the front-end processor control panel and say "uh huh" a lot to the rep.
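For anyone who has not yet called often enough to have the number memorized, the rep's switch-by-switch directions can be turned into the octal value with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes that a switch pushed "down" counts as a 1 bit and that switches are numbered from the right starting at zero. A real front-end panel may use the opposite sense.

```python
def switches_to_octal(down_positions):
    """Return the octal string for a set of switches pushed down.

    Positions count from the right-most switch, starting at 0.
    Assumption: a switch in the 'down' position contributes a 1 bit.
    """
    value = 0
    for position in down_positions:
        value |= 1 << position  # set the bit for this switch
    return format(value, "o")


# "Set the two right-most switches down and then skip two switches
# and push the next one down."  That is switches 0, 1 and 4.
print(switches_to_octal([0, 1, 4]))
```

Under those assumptions the rep's instructions come out as octal 23, which is quicker to set than it is to listen to.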
DDC can now take over your system via the front-end processor and run diagnostics which test the various parts of the machine as well as testing the amount of paper you have left in the console DECwriter terminal.
You must not leave the area while these tests are being run, because the DDC person will probably contact you next, not by calling you on the phone, but by typing messages to you on the console. If you aren't sitting right there so you can respond, the DDC rep may go away and you get to start the problem reporting process all over. You need to hang around anyway because someone needs to be standing by to un-jam the console printer.
If the tests eventually find something wrong, DDC will contact the local field service office who will come out to your location with "all the parts necessary" to work on the diagnosed problem whenever the on-site service period resumes. By having an FSE arrive at your site, you proceed to Stage One.
If DDC is unable to run any diagnostics because the front-end processor is dead or the smoke that is pouring out of the system is too thick for you to see if the floppy drive light is coming on, proceed to Stage One.
If DDC doesn't find anything wrong, count the number of times that the system has crashed in the last week from unexplained causes or problems that cannot be diagnosed that you have reported to DDC. If the number of crashes is greater than a secret quantity which you do not know and will not be told, proceed to Stage One. If you haven't reached the magic number, reboot the system and remain in Stage Zero, although after each crash, DDC might give you a slight change to make to the system configuration that will help cloud the issue later when the troops do arrive.
If the system crashes again, call DDC again and repeat Stage Zero.
An added complexity is that these days for certain systems, they seem to bring one and only one card out at a time, meaning you have to wait if any associated cable, screw or other part is discovered to be the real culprit.
If the FSE was on site when the problem initially occurred, or if you have fallen back to Stage Zero from a higher level, the phrase "I'm going to get some other kits out of the car" or "we [Royal We] are going out for some lunch" indicates that you have not had sufficient reproducible failures to warrant a gutting of the system. The hope is that the machine will either repair itself before they get back, or it will completely melt-down, allowing them to skip to Support Stage Two, where your service call escalates and the problem becomes Someone Else's Problem (SEP).
You arrive at Stage One in one of five ways:
You are not in Stage One if the FSE was performing PM on your system and when you returned from lunch, you found your entire VAX 11/780 tilting at a 45-degree angle, with the FSE desperately trying to get the system back upright or at least trying to keep it from tilting any further. This actually happened once in my presence - something about not extending those stabilizer legs before opening all of the cabinets. Now, if the system does tip completely over, then you get to go to Stage One, right after the FSE goes to the hospital.
If the Field Service Engineer wasn't already on the scene in Stage Zero, most FSEs must go through a period of disbelief about the severity or existence of the problem that you are reporting before serious work on the problem begins. This hesitant behavior is usually characterized by the FSE walking into the machine room, observing the flames coming out of the system cabinets and saying: "AH, HA! This looks like a software problem."
Even if the FSE was on-site and the machine worked perfectly before they started doing routine maintenance on it, the FSE still may accuse you of running an operating system that has been "patched" or "customized". Anything beyond setting the local time-zone may be considered to be "customized", even if you replaced broken application executables with ones from earlier vendor-provided versions that do work. You are generally doomed if you are running NIDEC ("Not Invented by DEC") software. In the 1980s, you would expect questions such as, "Are you trying to run UNIX or something?" In the late 1990s, it is "Are you trying to run FreeBSD/NetBSD or something?"
If you are talking about problems with a photocopier, the question would be something like "You aren't trying to make double-sided copies, are you?"
Although the FSE is now at your site, he/she may leave at any moment, causing you to return to Stage Zero. You will advance firmly into Stage One if any of the following occur:
Stage One typically directs the FSE to change any boards that the diagnostics indicate are causing the problem. If the diagnostics won't even run, this step is either skipped or the FSE swaps whatever boards he happens to have in the processor (aka "KL") spare case.
If the diagnostics then run without incident and the operating system will get as far as asking if it is okay to run CHECKD (an incredibly slow fsck), the FSE may consider the problem solved and may leave. Depending on the number of times they have been called to look at the same problem in recent days, the FSE may hang around until the system gets as far as asking for the current date or starts the network interfaces before leaving. The goal here is apparently to be out of the area prior to the "login" prompt appearing, or more likely, not appearing.
If the diagnostics now find a problem, the indicated board is replaced with one from "the spares kit". Hopefully the FSE brought the right spares kit with him. If not, you will experience a delay in getting a replacement component, which he may have to get, or it will be delivered by the tag-team FSE. (Some creative FSEs will, in this case, replace some other unrelated cards that they do have, just in case the diagnostic is mistaken. This kills time - and possibly your machine - nicely.)
After inserting the replacement part, make sure that the FSE re-runs the diagnostics before he leaves to make sure that:
- The replacement card cured the problem or made the problem move elsewhere (symptom changes).
- The replacement card isn't worse than the one that was in the machine in the first place.
A spooky thing that happens here is that most FS organizations seem to have a policy that a board will be tagged as "bad" only if it has a solid failure which "follows" the board. If a board can be moved to a different machine or different slot in the machine and the problem goes away, or the problem is intermittent, the board will be replaced, but may not be tagged as being defective. This board now ends up in the spares kit. Remember this board; it may come back, or worse, you might get someone else's headache-board.
Once the problem appears to go away, you fall back to Stage Zero, but the failure counter is incremented.
If, when you arrived at Stage One, the FSE ran diagnostics and they ran successfully and did not complain about anything, the FSE usually pulls and re-seats all the cards in the system. Some FSEs then haul out the pencil eraser and clean the connectors when practical, which always seems to be extremely fatal to the cards that get erased. This type of activity always seems to help advance you to Stage Two.
The problem for you at this point in Stage One is figuring out what is going on. The FSEs usually won't tell you what they are up to, so you have to watch for tell-tale signs.
You are quickly advancing toward Stage Two if:
WARNING: ONLY DIAGNOSTIC TESTS 1 THROUGH 64 CAN BE RUN WITH MAIN MEMORY DISCONNECTED!
This is always a dead giveaway, particularly if you only called the FSE in to fix a tape drive.
TEST #1, KL LOOP TEST #KL LOOP FAILURz @@@@@@@@@@@@@@@@@@@
...and the '@' characters continue to print for several pages of paper until someone stops it.
Experienced Data Processing and Information Systems facility personnel know that the real reason for all of this security is to keep the FSEs from managing to leave the site before the systems can be completely restarted, at which point you might notice that only 512K of your multi-Megabytes of main memory are still visible, and only one CPU is still responding.
The longer it takes to bring the computer systems up to the point where work can be done on them, the more security measures the facility that houses the computers will have. Think about it.
Run CHECKD? No
%XYZZY Warning - Replying 'No' is equivalent to slitting your wrists with a tape leader trimming tool.
Run CHECKD? Yes
So everybody always ran CHECKD.
Stage Two occurs when the first FSE has been unable to fix the problem after a given amount of time, usually about 6 hours, or (more likely) the problem has grown considerably in scope.
Here additional forces arrive from the local office. Sometimes it is simply one other FSE bringing more spare boards, or he is sent to relieve the first FSE. More often than not, this arrival allows one of the FSEs to keep you and your staff distracted while the other FSE tries to retrieve his wrist-watch from the system backplane without being spotted by you.
The second FSE reviews the situation as briefed by the first FSE, looking for anything obvious that would correct the problem, such as turning *all* of the power circuit breakers back on, putting the right cards in the right slots, using Beta instead of VHS, adding toner, etc. This process frequently catches really stupid errors but you will always be given an extremely complex (and bogus) explanation of what the problem was that has now been fixed. Note that there is no guarantee that the second FSE is more senior or knowledgeable than the first, but sometimes that doesn't matter.
If both FSEs are unable to make headway, the local office may also send the FSE who has the most experience with this particular system or this type of problem, assuming he isn't at your site already. You can usually tell when this FSE appears, as he has a complete set of spare parts in his car, and possibly his own microfiche viewer or laptop computer. (Unlike all others, this FSE will replace burnt-out indicator light bulbs without you having to open a ticket.)
This "senior" FSE will usually get the other FSEs on site to go do something else (like buy food) and while they are gone, he will then try to assess the situation, both by looking at the diagnostics, and by talking to you. This allows him to determine how many of the current problems were there when work began. (Note that in some organizations, the arrival of the Senior FSE is the start of Stage Three.)
Depending on the maintenance agreement you have with the service organization, in Stage Two the FSEs may hang around until whatever time your shop normally closes for the night, or stay on until you pass out, at which point they will sneak-out anyway and may even come back in the morning.
Most problems are resolved in the latter phases of Stage Two, so there aren't a lot of other interesting things to say here. Getting to Stage Three is mainly a function of time, although a really spectacular event, such as any of these headlines in the campus newspaper, will get you to Stage Three faster:
"FSE taken over by stranger-than-usual aliens! Biology department impressed!",
"Walls bleed in campus computer center - OS Upgrade identified as cause", or
"Trouble ticket open for 28 months causes vendor's bug tracking system to form black hole! Reports say damage worse than what was predicted for Y2K problem!"
If the system starts working while at Stage Two, you return to Stage Zero, although the secret failure counter doesn't return to zero. It does go down a little for every day that the system keeps working, and reaches zero after about a week.
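Nobody outside the vendor knows how the secret failure counter really works, but the behavior described so far (one tick per crash, escalation past a hidden threshold, a slow decline back to zero over roughly a week of good behavior) can be sketched in a few lines of Python. Everything here is guesswork for illustration: the class name, the threshold of 3, and the straight-line seven-day decay are assumptions, not anything a service organization has ever admitted to.

```python
class FailureCounter:
    """Toy model of the secret failure counter (all values are guesses)."""

    def __init__(self, threshold, decay_days=7):
        self.threshold = threshold    # hypothetical; the real number is secret
        self.decay_days = decay_days  # "reaches zero after about a week"
        self.peak = 0.0               # counter value right after the last crash
        self.quiet_days = 0           # crash-free days since the last crash

    @property
    def count(self):
        # Assumed straight-line decay from the last peak down to zero.
        remaining = max(0, self.decay_days - self.quiet_days)
        return self.peak * remaining / self.decay_days

    def crash(self):
        """Record a crash and report which stage you are in now."""
        self.peak = self.count + 1
        self.quiet_days = 0
        return "Stage One" if self.peak > self.threshold else "Stage Zero"

    def quiet_day(self):
        """Record one full day without a crash."""
        self.quiet_days += 1


counter = FailureCounter(threshold=3)
for _ in range(4):
    print(counter.crash())
```

With the made-up threshold of 3, the first three crashes leave you in Stage Zero and the fourth sends the troops. A week of quiet days afterwards brings the count back to zero.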
If after two more days you still have FSEs (possibly four of them lurking around the room by now), you go to Stage Three.
Well, like most universities, someone in the hierarchy decided that we just had to have signs explaining what the various cabinets in the computer room were ("AIR CONDITIONER", "DISK DRIVE", "BOX THAT BUZZES LOUDLY AND DOESN'T LIKE WATER", etc), so that when a tour occurred, the signs would be there and the visitors could try to read the signs as they flapped wildly around in the forced-air of the computer room. They also helped the director of the facility correctly identify the object he was pointing at while giving tours, since he doesn't go in the computer room very often. So on a typical tour, you might hear an exchange like this:
Director: "And this is the IBM 370/155, which cost the school over two million dollars, money that could have been spent on the football program. Yes, a question in back?"
Guest: "Yes, are you sure that is the '370? It really looks more like a swivel chair with a green colored stain."
Second Guest: "Perhaps the '370 is the large blue box with all the flashing lights behind the chair?"
Director: "No, these signs are very accurate and thousands of dollars were spent to make and install them, money that could have been spent on the new stadium."
In our case, some signs would flap around so wildly that they would come loose from the ceiling and fall to the floor. Eventually, someone would pick these lost signs up and simply lay them on top of the appropriate box. Months later, probably just minutes before the next start-of-semester tours, some knowledgeable soul would get a ladder and reinstall the signs in the appropriate locations, more or less. In the meantime, the signs quietly rested on top of the computers, waiting for their chance.
Which brings us to our FSE, the victim of this story. Called to correct a minor problem with eight of the 64 serial ports on a DECsystem-20, he has the PDP-11 cabinet on the DEC-20 extended out of the machine to measure some voltages and reaches on top of the computer for a screwdriver he placed there earlier. Instead of picking the screwdriver up, he drags it across the top, dragging the chain and the "LARGE ORANGE BOX WITH NO FLASHING LIGHTS WHICH MAKES A BORING TOUR STOP" sign along with it, and the chain and sign fall neatly into the card cage, in full view of our staff.
The FSE, thinking it would be better to get that metal chain out of there, particularly since the computer was on, pulls the chain out and puts it back on top and continues with the measurements. After a few minutes he returns to the console area, puzzled that he can't find the signal he was looking for on ANY board. And for some reason, now the console doesn't work any more. This person had clearly lost the gift of cause and effect analysis.
We nearly went to Stage Three to resolve this one, despite witnesses repeatedly telling the additional FSEs exactly what happened. Eventually, our service vendor did get a nice letter from our management saying that this particular FSE would not be let in the building ever again.
Congratulations, you have reached Stage Three. You did this by keeping FSEs at your site over three or more days (two days if they were there more than 14 hours a day), or whining about the situation in public forums on the Internet with samples and embarrassing photos. Your management can speed or slow the arrival of Stage Three support depending on how many calls are made to the computer vendor and how threatening he/she/it can sound:
Educational Non-Threatening: "We have 25,000 students that are unable to complete their projects and will get failing grades and then beat on our cars or use the skills they learned in chemistry classes against us if you don't fix this computer."
Educational Threatening: "I own a rocket launcher and am coming to your house NOW if you don't fix this computer."
Business Non-Threatening: "We have over 50,000 customers who can't download cooking recipes off the Internet because of this problem."
Business Threatening: "We are going to tell 50,000 of our customers - all of whom own rocket launchers - where you are if you don't fix this computer, as you are preventing them from downloading porn off the Internet fast enough."
To be at Stage Three, things are really messed up now and parts of your system that never bothered you before are probably malfunctioning. You might even be getting error messages regarding peripherals your system doesn't even have.
The failing system is also starting to look more like it did prior to being originally assembled, as more and more loose parts litter the area.
The Regional FSEs, usually out of a major city like Chicago, Creede, Houston, or Twin Peaks arrive to at least stabilize the situation, and hopefully, get the system back to the level of functionality you had back at Stage Zero. If they fix it completely, that is a plus, but no longer the main goal.
The Stage Three patrol does a lot of rediscovering. This means that they ignore most or all of the information obtained in the earlier stages and must experience it for themselves. If your system will run, you have to bring it up and let the users on, knowing that at any moment it will go down again, erasing the unsaved work of hundreds of students, co-workers or customers, who know where you are and possibly which car you drive.
You are told to not warn the users about what is going on because they would not use the system in the same way they normally do if they knew it might crash at any second, and this might cause the problem to not occur.
Note it is almost impossible to descend from Stage Three back to Stage Two. Even if it takes a day or more to fail, these guys usually hang around, along with your growing collection of FSEs from the earlier stages that have taken-up residence.
Finally, the system crashes. Hurrah! Now the suit jackets come off and the neckties get caught in the printer drum. No, wait, that only happened once.
Stage Three personnel bring their own 'scopes and other strange test equipment, most of which appears to have been obtained from the set of the film "Frankenstein", which they may have wired into your machine before the demonstration crash. If they didn't do this earlier, they will wire it all up now and ask you to make the system crash again in the same way, as they really will be watching this time.
You may go to Stage Four if the system will not run with the test instruments attached, but this isn't a sure thing. You will definitely go to Stage Four if they disconnect the test equipment and now the system won't do anything at all. "They've taken Spock's brain!"
After a crash occurs in front of the instruments, the FSEs will take action. Usually it's a phone call to someone at the corporate headquarters, where you may overhear them say stuff like "If it was human, it's dead. No wait, the 'scope probe came loose." (Now speaking to you) "Uh, can you make it crash again?"
The people from Stage Three usually have special diagnostics that the local office never gets, or that they didn't know about, or they left them at another local site, or that are at this moment stuffed in the dollar-changer in the Coke machine back at their office with a "Thieving @*%!! Machine" note written on it. Anyway, these newly-used diagnostics provide a new wealth of information to examine, but invariably result in more phone calls back to Mr. Peabody, who is the only person in the world who knows what the diagnostics are trying to report.
The chant of Stage Three becomes "Okay, we have a recorded failure. We can rebuild the system. We have the technology. We have the spares kits. We've got lots of YOUR spare time. Let's do it!"
Stage Three also starts a more methodical replacement of components on a scale not attempted at earlier stages. Let us say that we have what appears to be a hard disk problem. Here is a typical replacement checklist: (You will be asked to reboot and let users use the system between each change to see if the problem is really fixed or see if it will crash again)
No matter what the problem is, there are just over a dozen steps worth of work in Stage Three before the Stage Three timer expires. If the problem isn't getting any worse, you get about a week at Stage Three. If the system is degenerating and less of it works by the hour, Stage Four can arrive in as few as four days.
You really can't accelerate movement to Stage Four on your own. Reporting the problems and misdeeds of the FSEs for all to see on the Internet only works once and doesn't get you beyond Stage Three no matter when you use that ploy.
Okay, so Stage Three didn't work out so well. Don't worry, it's probably something obvious, like the planetary alignment of your computer room.
Stage Four personnel usually come from the corporate headquarters and review the things that Stage Three replaced, and will probably replace a few of them again, but in a different order.
Stage Four specializes in replacing things that appear to have (and usually have) nothing to do with the problem whatsoever and don't seem to have the same goal of any of the previous stages. Changing things for the sake of change seems to be part of the art of problem resolution in Stage Four. Science and logical process were killed and swapped-out in Stage Three. So for your hard disk problem, expect the ribbon on your printers to be replaced, tape drive heads to be cleaned and calibrated, and to have most of the false floor ripped up for days. Part by part, the system will be replaced. You need plenty of room for all the replaced parts and cables that will start to accumulate around the area.
Tip: You might want to write down what your system's configuration was when you started all of this so that you might be able to get back to that arrangement, or at least so that you can get all your parts back. Systems have been known to have been "fixed" by disconnecting the offending hardware or turning off the alarms from the hardware that are trying to warn you of data corruption because that hardware really is broken.
Stage Four also has a special squad that deals in "blame assignment", looking for anything at all that might be external to the equipment listed in the service agreement that might be the cause of the problem, at least in some envisioned parallel dimension where our physical laws of nature do not apply. They will look for current leakage from the false floor to the computer cabinet, ignoring comments that the rubber wheels on the computer probably take care of insulating the system against any millivolt of differential they happen to locate across a 100 foot computer room floor.
Then come the temperature and humidity monitors, and the stern recommendations to change both settings in ways that end up making it rain in your computer room. Of course once that happens, the Stage Four blame squad can point out that it may have rained in there previously and that this might have been the cause of the previous failures. This is your opportunity to respond by pointing out that, until now, your computer room has never been visited by beings from the planet Cretin, so prior rain-making activity was unlikely. Despite your assurances of no prior moronic or paranormal activity, they may hang doggedly onto this discovery even if the rain causes a completely different problem, like shocking the stuffing out of one of the lower-tier FSEs, still hiding behind the system.
Finally, you will see them bring up the Power Disturbance Monitor(TM), and gradually you will be unplugging things all over the building to prove that they are not causing the problem. Oh, you can leave the coffee maker plugged-in, as that is a priority piece of equipment for the FSEs.
Another thing that may happen at Stage Four is the recovery of old parts. You remember all the boards swapped-out by all the previous FSEs? Stage Four has been known to conclude that this system only works well with a particular vintage of cards, and will try to retrieve them and put them all back in your system, even if some have since made it into machines across town. Due to the lack of tagging boards with solid failures, this ends up being an opportunity for more random cards from the ten or twelve spares kits lining the room to find their way into your machine. As you might guess, the chances of things not getting even worse after doing this are very small.
Stage Four will last as long as the service company can possibly stand it. Only when the FSEs are down to two flat-head screws and the Emergency-Off switch as the only things that haven't been replaced or fiddled-with, will you possibly move on to Stage Five on the FSE timetable. I say possibly, because they may decide the problem is "hard" and toss the problem back to the software group for a couple of months at this juncture.
The threat of legal action, or long-range nuclear weapons directed at high levels of the company can cause Stage Five to appear before the end of time, but you might have to start walking towards the court house with the lawsuit papers as well as the launch codes in hand before anything happens.
At the end of the service visits, any survivors in the Stage Zero or Stage One personnel will be given the task to cart six carloads of stuff back to the local depot. All the other FSEs will escape back into whatever dimension that they came from rather than participate in this task.
Based on that information, what do you as the computer vendor do? Why, you recall Patch Kit C and have the customer remove it, replace the customers revision F hardware with revision F (yes, same version) hardware, and guess what? The system still crashes. Now the vendor starts talking about replacing memory cards, CPU cards and the system backplane and other random hardware, and eventually replaces all of these things (some repeatedly) with no improvement. Eventually, they replaced the backplane and entire metal frame, thinking that they had swapped swapped everything else. This didn't fix the problem.
Now, you might think that maybe, just maybe, the problem might have something to do with the software that was changed at the point when the system started crashing (installing upgrade B), but no, that was one of the last things tried. Meanwhile, the original goal of trying to apply the fix to problem A was on hold for months. (An FDDI network card completely unrelated to what the error messages were reporting was eventually found to be the true culprit.) This is an unusual case, since the hardware people are usually keen to lob this type of problem back over the net into the OS group's hands.
Stage Five. Wow. It almost never happens. Until recently, I thought it wasn't possible at DEC now that Ken has retired (who has the key to the box?). Being at Stage Five, you have arrived at the pinnacle of Field Support. There are only two choices left to the equipment maker: One, to roll in a complete replacement system and let the existing system be "accidentally" exported to a forbidden country, or Two, let one of the gurus out of the box, who might be able to identify the real problem and get it fixed.
Neither choice is popular with the vendor. The first costs a lot of money, and sometimes those pesky border patrol or customs people catch the faulty system being exported to North Korea or wherever. The second option might allow the guru to be exposed to real life, something that could be far worse. He might find out that Nixon resigned, that nobody makes slide rules any more, or that IBM finally built a computer with a stack pointer. Plus, the guru might still recommend replacing the system and the vendor will still have to try to smuggle your broken system out of the country in order to dispose of it.
In one of the cases where I witnessed a Stage Five escalation, the system had been putting out a BUGHLT: SWPUPT message for six weeks and crashing each time. The message claims that the hard disk driver or paging code (or both), which should never be swapped to disk, had been swapped. This is bad. However, at least there was code in the operating system that said "I did a bad thing" when it happened.
In response to this crash that was increasing in frequency daily, the FSEs had gradually replaced nearly every component of the mainframe, all the cables to the hard disk drives, most internal cabling, all boards in the drives, realigned the heads on the drives several times, reformatted the packs and had us reinstall the operating system more than a dozen times. They examined the raised flooring for electrical current loops, looked for uneven cooling, and at various times blamed the failures on the telephone lines the modems were connected to, the console DECwriter, the printers and tape drives, even the proximity of an IBM 370/155 (hmm, maybe...), but one by one these were eliminated, except for the IBM. I think we erected a curtain or something as a joke so the DEC-20 could not "see" the IBM. It didn't help.
After six weeks, there were only a few sheets of metal casing and the wheels that hadn't been replaced on the entire system and the problem persisted. We had also gone through periods of having a completely useless system as working parts were replaced with broken parts in such large quantities that no one could recall what had been changed last.
So, just when they were about to break down and replace the entire system, they decided to let a guru out of the box in Maynard.
For the arrival, nearly all of the non-Stage 4 FSEs were asked to leave the area. The guru walks in and looks at the OS logs of the crashes, which had about as much debug information as your typical MS-DOS "Null-pointer assignment" error message. He completely ignores the reams of output from the diagnostics run over the previous six weeks. After ten minutes or so, he proclaims that there is some dirt in a glide track in the head assembly of a particular hard disk drive. Then he leaves for the airport and goes back to Maynard. Maybe he asked the taxi to wait.
The entire FSE garrison echoes "WHAT?" This was impossible. Sure, they hadn't taken this particular assembly apart (it wasn't easy), but why would it cause that error message instead of a disk read or write error message? And how could a U-shaped piece of metal cause all this? And why didn't the diagnostics ever fail or report a problem in this area?
They were almost ready to ignore this advice and roll the replacement system in when someone decided to spend the hour cleaning this track. It worked. The system, now with several hundred stripped bolts and worn connectors, ran fine, even when we started four copies of HAUNT at the same time, the ultimate system stress test.
So we didn't get a new computer, but the problem was solved, and a considerable percentage of the computer science college got a grade of Incomplete for the semester. The system behaved itself just fine until it rained in the computer room (real rain, not the sprinkler or air conditioning system), but that's another story.
So there you have it, the Six Stages of Field Service Support.
My company immediately responded to the letter by announcing to this vendor that we would never buy anything from them ever again and would dispose of all our existing equipment at the first opportunity, possibly utilizing our 30th floor perch in some spectacle that would be televised: "Tonight on Jerry Springer! When computers go bad!" or "The Late Show Computer Toss!"
The computer vendor (who had been saying that they were working on the problem on and off for the previous 18 months, had come up with a patch a year earlier that wouldn't even boot, but mostly used a series of conference calls to keep us informed on the lack of progress) now suddenly decides that we might be serious about having this problem fixed, and starts listening to why we would not abandon the software configuration we preferred for one that they supported "better". (I point out here that most positive values are greater than zero.) We finally forced this issue by asking fellow users of this type of system on the Internet for their opinion of the two filesystem methods in question, the one with the bug and the vendor's favorite. With one exception, everybody who responded was using the filesystem with the bug that the computer vendor was effectively abandoning. Some sites had tried the vendor's sweetheart system for a few days and then switched back, bugs and all.
Well, apparently the computer vendor saw this discussion on the Internet and ran some of their own tests and found out that their customers might actually have a point about the performance difference of the two systems and that their little darling was in fact a performance pig. Suddenly, the orphaned filesystem software was being looked at and bugs and stupid inefficiencies were being found all over the place in the operating system, at least according to the now-resumed weekly conference calls.
At one point, to show positive activity, the vendor let a software guru out of the box to visit us, along with someone who can be best described as part diplomat and part damage-control officer. Of course, to make sure we wouldn't actually solve the problem while they were at the customer's site or honor our previous requests for source code to let us fix it ourselves, they brought paper listings of the source code in question, which we never got to see.
Of course, having been in a similar position in a former job, I know the reality is that the software guru can do little apart from observe during the visit, since doing code development/correction while surrounded by a bunch of pacing and chattering people is tough, and you probably don't have any of your normal development tools (plus nothing but a paper listing, but that was self-inflicted). Therefore, the secret goal is to make sure that you know how to make the failure occur predictably and get back to the lab where you can actually work on the problem and not have to work in some corridor that the customer sticks you in and calls "a work space".
As you might expect, letting the software guru out of the box in a software-related Stage Five doesn't mean the problem gets fixed right away, or even near-term. That is only the first phase of the Software Stage Five, which can drag on for months. In my case, after the open problem report passed its 55th month mark and 26 months since the most recent "get serious about fixing it" event, we just stopped using that computer and moved to an operating system and hardware that didn't have that issue.
"Hi there!! I just did something stupid and committed suicide! I will crash in a few moments and lose everybody's work. In the meantime, here is some music to listen to!" BEEP BEEP BEEP ... (DEC-20s would send out nine BEEPs in a pattern that sounded almost exactly like the Morse code S-O-S.)
Also, why do you document this type of event, which will cause your customers to ask these types of pointed questions?
Diagnostics only find known problems.
As always, the FSE prime-directive is that:
If you can't fix it right away, make sure whatever it is rapidly becomes Someone Else's Problem (SEP).
Note that Douglas Adams, the late science fiction writer, based an entire novel on the subject of what he called "SEP fields". He suggested that such fields could be used to "cloak" star-ships, planets and other objects from prying eyes, simply by making the object completely uninteresting and unimportant to any viewer, and thus essentially invisible. Such a scheme would accomplish "practical" invisibility at a fraction of the cost of all those systems that try to bend light and such. What Mr. Adams failed to discover in his research is that for years before he came along, FSEs have been employing small SEP fields to distract computer system owners from the real problems, both to save their respective firms money, and to make sure the problems that can't be concealed escalate rapidly into someone else's jurisdiction, effectively making the problem invisible, at least to the original FSE.