Flav wrote on Jun 30th, 2014 at 2:01pm:
You really have no clue...
Just to give you an idea...
$TELCO calls at 3 AM: a 4*10 Gigabit card went down on a backbone router... at 1 AM. It needs to be replaced. ( Notice that some customers have already been down for two hours. )
I ask for all the relevant information ( site location, how to get into the site, and so on ) and then call $SUB-CONTRACTOR ( called $SUB from here on ) for an on-site tech to go there and replace the card. $SUB calls 30 minutes later asking me for more details ( site access and location... dummy, I sent them to you when I opened the ticket ), so I repeat the relevant information... 15 minutes later $SUB gives me the name and mobile number of the tech who will change the card. ( Yeah, another 45 minutes lost. )
The site is in the backyard of nowhere in Froggy Land. The tech from $SUB will take one hour to reach his office and grab cards ( notice the plural ), and then he will have to drive for 2 hours... ( So basically, when he reaches the site [ note that this doesn't mean when he replaces the card, just when he parks his car near the site ], the card has been down for 5H45... )
Once there, site access was good ( he got lucky, it's not uncommon that they can't get inside ), so he replaces the card ( add 15 to 30 more minutes ). The new card is DOA ( Dead on Arrival, about 10% are like that... in some cases it's up to 20% ), so he needs to go back to his car to fetch the other card ( he got lucky that he could bring two cards ) [ 15 more minutes ]... The second card is good...
So in the end, for the people connected to that router ( bad luck, the card was carrying the backbone ports, so lots of people were cut off ), the downtime was... almost 6 hours...
And that's a lucky case...
I can give you one recent example where the time between being called ( 1 hour after the failure ) and the time the new cards were working was more than 12 hours.
So since the Monday Maintenance is usually datacenter stuff, it can mean so many things ( moving servers to another rack cabinet, replacing all the switches, upgrading the SAN, new firmware on stuff... and Windows patches ) that Massive Maintenance Windows are not that massive.
I could also tell you about a system update for a French $TELCO that spanned from Friday 6 PM to Monday 7 AM... and on Sunday at 5 PM we noticed that we had forgotten half the data in the migration... It was a happy night after that ( Happy Coding, Happy data extracting and Happy keep-the-customer-busy so that he won't notice that we have a Problem ).
Yes, we managed to avoid having to extend the maintenance window, but it had a cost: nobody with any knowledge of that system was at work on Monday, to the despair of the Project Mangler.
That is unexpected downtime. Planned downtime not completing in the window is piss poor planning.