Microsoft announced through its Outlook.com blog a recent issue affecting users of Outlook.com and Hotmail. After a seemingly exhaustive effort to migrate customers from Hotmail to the new Outlook suite, Microsoft hit a hiccup during a routine firmware update. Even with the frank admission on the Outlook.com blog, many questions remain unanswered.
Here's Microsoft's "recap" of the entire event, quoting directly from their blog.
"At 13:35 PM PDT on March 12th, 2013 there was a service interruption that affected some people's access to a small part of the SkyDrive service, but primarily Hotmail.com and Outlook.com. Availability was restored over the course of the afternoon and evening, and fully restored by 5:43 AM PDT on March 13th, 2013."
One obvious question: why did an outage that began in the early afternoon take until the next morning to fully resolve? The Outlook.com blog goes further into the issue, identifying the root cause as a substantial rise in temperature triggered by the firmware update. The resulting "waves" of updates and reboots took many hours to complete as the team brought the datacenter back to full strength.
Microsoft continues with a detailed explanation:
"This failure resulted in a rapid and substantial temperature spike in the datacenter. This spike was significant enough before it was mitigated that it caused our safeguards to come in to place for a large number of servers in this part of the datacenter...Once the safeguards kicked in on these systems, the team was instantly alerted and they immediately began to get to work to restore access."
On its face, it sounds like a sound strategy: lock out user access in response to rising temperatures to prevent melted hardware or data loss. However, it appears these safeguards were tightly coupled to other systems and responded so rigidly to the temperature change that they prevented a standard failover to a redundant system.
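To make the coupling problem concrete, here is a minimal sketch of a safeguard wired the way the blog's account suggests. Microsoft has not published its actual thresholds or mechanisms; the trip point, class names, and lockout behavior below are all hypothetical.

```python
# Hypothetical sketch of a tightly coupled safeguard: the temperature
# reading feeds straight into a lockout action, so a single spike cuts
# user access before any failover can be attempted.

LOCKOUT_THRESHOLD_C = 40.0  # assumed trip point; the real value is unpublished

class Server:
    def __init__(self, name):
        self.name = name
        self.locked = False

def rigid_safeguard(servers, temperature_c):
    """Trip every server in the affected zone the instant the threshold
    is crossed -- no operator review, no intermediate failover step."""
    if temperature_c > LOCKOUT_THRESHOLD_C:
        for server in servers:
            server.locked = True  # users lose access immediately
    return [s.name for s in servers if s.locked]

rack = [Server("mail-01"), Server("mail-02"), Server("mail-03")]
print(rigid_safeguard(rack, 46.2))  # one hot reading locks the whole rack
```

The rigidity is the point: because the only automated response is a blanket lockout, there is no path by which traffic could have shifted to healthy hardware first.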
In datacenters and IT generally, the goal of temperature monitoring and alerting is to give operators a direct line of sight into datacenter temperatures. Temperature monitoring devices are most effective when used as an unbiased indicator of temperature change; when integrated with other systems, they should be configured to send instantaneous alerts without compromising those systems. Holistic integration (automated systems that chain together many moving parts to streamline processes) is an admirable goal for datacenters at scale, but individual monitors and devices should still operate in a closed loop with limited "next-step" automation. Microsoft and Outlook.com may find that an alternative solution is to separate their datacenter temperature monitoring devices from their automated disaster planning, using the devices as a primary indicator of trouble and enacting safeguards afterward based on the situation. That way, engineers or system administrators could investigate a temperature rise immediately, and only then make active decisions about automation and safeguards based on their findings. The instantaneous alerts clearly helped Microsoft, but the safeguard logic that followed them appears to have inflated the problem.
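The alert-first separation described above can be sketched in a few lines. This is an illustration of the idea, not anyone's production system: the two thresholds and the `monitor_step` interface are invented for the example.

```python
ALERT_THRESHOLD_C = 35.0  # hypothetical warning level: page a human
TRIP_THRESHOLD_C = 45.0   # hypothetical hard ceiling: last-resort shutdown

def monitor_step(temperature_c, notify, trip):
    """Decoupled monitoring: the sensor's only routine automatic action
    is an alert. The hard trip fires only at a much higher ceiling,
    leaving operators room to fail over or shed load first."""
    if temperature_c >= TRIP_THRESHOLD_C:
        trip()                 # last-resort hardware protection
    elif temperature_c >= ALERT_THRESHOLD_C:
        notify(temperature_c)  # alert the on-call team; take no other action

alerts, trips = [], []
monitor_step(38.5, notify=alerts.append, trip=lambda: trips.append(True))
print(alerts, trips)  # prints [38.5] [] -- operators alerted, nothing locked out
```

The gap between the two thresholds is where human judgment lives: the monitor still reacts instantly, but the drastic action is reserved for genuinely dangerous temperatures.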
This also looks like a redundancy problem, and many bloggers and commenters have expressed disbelief at the simplicity of Microsoft's datacenters. Some may argue that the Outlook.com servers should have been designed for maximum redundancy, especially as the service is touted as Outlook's big step into SaaS. Redundancy is a hot topic among SaaS veterans and cloud enthusiasts, and a complex, vital part of disaster planning: continued access to services, business operations, and other assets is the main benefit of a redundant system, apart from averting data loss. Still, even though redundancy was a relevant problem by Microsoft's own admission, the safeguards were the true issue. Outlook.com was fractured not by a faulty temperature monitoring device, a missed monitoring report, or an unopened email or text alert, but by the very logic of the safeguard's response to the rising temperature. The temperature monitoring device was clearly effective at flagging the temperature change (though we can't truly confirm that the team was "instantly alerted"); sadly, it was the next step in the monitoring logic that locked out countless users. We do recommend integrating such devices into management systems and scaled automation, but check the sensitivity of your safeguard and logic systems to prevent an overreaction (and a costly outage). Don't have an "Outlook.com" moment!
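One simple sensitivity check worth borrowing from control systems is debouncing: require several consecutive readings over the threshold before escalating, so a single transient spike cannot trigger a fleet-wide lockout. The class below is a hypothetical sketch of that idea; the threshold and window size are invented for illustration.

```python
from collections import deque

class DebouncedSafeguard:
    """Hypothetical sensitivity check: escalate only after N consecutive
    readings above the threshold, so one transient spike is ignored."""
    def __init__(self, threshold_c=45.0, required_consecutive=3):
        self.threshold_c = threshold_c
        self.window = deque(maxlen=required_consecutive)

    def reading(self, temperature_c):
        """Record one sensor reading; return True only when every reading
        in the window exceeds the threshold (sustained heat)."""
        self.window.append(temperature_c > self.threshold_c)
        return len(self.window) == self.window.maxlen and all(self.window)

guard = DebouncedSafeguard()
print([guard.reading(t) for t in (50.0, 41.0, 50.0, 50.0, 50.0)])
# prints [False, False, False, False, True]: only sustained heat escalates
```

A lone 50 °C blip is absorbed, while three hot readings in a row still escalate promptly, which is the balance between responsiveness and overreaction the post argues for.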