Root cause analysis: finding out what happened when a failure occurred, and why it happened in the first place. There are many reasons to perform root cause analysis. The “system” design can be improved to prevent some failures from happening again, and those who troubleshoot can work more quickly if a failure recurs. Both improvements can have a significant impact on downtime as well as safety.
My home has a fiber optic digital service for internet, television, and telephone. On Friday, the internet and television both failed. Because there is no effective diagnostic and annunciation system, we did not notice the failure until our guests tried to watch a hockey game on the television. This was clearly a “failure on demand,” and it prompted me to begin an emergency troubleshooting process. The telephone still worked, so I called the remote troubleshooting service, and an automated system walked me through its script:
- Check power on the display and set top box, press 1, or say “next step”
- Check cables on the display and set top box, press 1, or say “next step”
- Reboot the set top box (unplug it for five seconds, then plug it back in), press 1, or say “next step”
- Reboot the router, press 1, or say “next step”
It was frustrating, but after about 20 minutes I finally spoke to a person.
The person asked, “Any construction going on?” I said no, but the outside of the house had been power washed that day. That should have had no impact, though: the fiber optic cable was fine, and all the electronics are indoors. After rebooting different set top boxes and the router about six times, I ran out of time and gave up for the evening. Those who needed to see the game switched to a “diverse redundant system”: they used a cell phone to program the DVR at home to record the game.
The following morning I had more time to think about the failure. I concluded that it had to be a problem in the demarcation interface box. The “system status” light was solid green, which seemed fine. The battery status light was blinking green, which also seemed fine. Then I plugged a test box into the outlet that powers the demarcation box: no 110 VAC! The outlet was a ground fault outlet, and it had tripped. An outside outlet on the same circuit must have gotten wet. I reset the ground fault outlet, and the battery light turned solid green. So flashing green must mean the box is running on battery power, and battery power only covers the phone. Now I understand how the system works.
When this all began, it was hard to imagine that outside water could be the cause of the system failure, but it was. A system upgrade project is now planned: a separate outlet with generator backup will be installed. After all, TV service is classified as an essential service.
It is good to have done the root cause analysis. Not only will troubleshooting be much easier next time, but this particular failure will also be prevented. Perhaps I should get out my copy of SILStat and enter the event information. But the Quality Manager at this site (me) still does not require full documentation. I know the system upgrade will happen, since those who watch sporting events on TV will track it to completion. One cannot ignore these essential life services.
Tagged as: SILStat, Root Cause Analysis, failure, Dr. William Goble