[M5Hosting] Post Event Report - Network Issues Following Power Event
Michael J McCafferty
mike at m5computersecurity.com
Tue Mar 24 03:50:07 PDT 2009
Dear M5Hosting Customer,
On Wednesday, March 18th there was a period of high packet loss and
inconsistent network connectivity. These network issues followed the
power outage, but did not begin until about 2 hours after the power
event. This is a post event report regarding the network issues. A post
event report regarding the power issues has already been released.
Approximately 90% of our hosted dedicated servers came back up
immediately after the power event. The remaining 10% required manual intervention
from our tech staff. This was a significant effort. The number of
support requests following the event was very high. About 1 hour after the
power event, the connectivity between M5Hosting and American Internet
Services (AIS) became intermittent. The number of support incidents and
monitoring alerts as a result of the power event did slow our discovery
of the network specific issue. However, since our entire technical staff
was already on site, the resolution came more swiftly once discovery was
made. The root cause was that systems designed to be redundant and to
provide continued service in the event of a hardware failure performed
in an unexpected manner. The AIS routers directly upstream from us are
Cisco 6509s using HSRP (Hot Standby Router Protocol) to determine which of
the two redundant routers is the active router. Both of AIS's routers
intermittently determined that they were the master. This caused the
redundant pair of routers to flip-flop repeatedly. Each time the master
flip-flops, it causes a few seconds of packet loss. Once it was
determined what was happening, the fix was to revert to a non-redundant
single connection to AIS. This stopped the flip-flopping.
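To illustrate how a redundant pair can end up with both routers claiming the active role, here is a minimal Python sketch. It is a toy model, not the HSRP state machine, and it is not a statement of what happened on the AIS routers, since the cause there was not determined. It assumes one common mechanism: the standby stops hearing hellos from its peer, waits out the hold time, and promotes itself. The hello and hold values are the HSRP defaults (3 and 10 seconds); the Router class, the priorities, and the simulate function are hypothetical names used only for this example.

```python
from dataclasses import dataclass

HELLO_INTERVAL = 3   # seconds between hellos (HSRP default)
HOLD_TIME = 10       # seconds of silence before the peer is presumed dead (HSRP default)


@dataclass
class Router:
    name: str
    priority: int
    active: bool = False
    last_heard: int = 0  # time at which the last peer hello arrived


def simulate(link_up, duration):
    """Step a two-router pair through `duration` seconds.

    link_up(t) returns True if hellos can cross the shared segment at second t.
    Returns a list of (t, a_active, b_active) tuples, one per state change.
    """
    a = Router("A", priority=110, active=True)   # assumed higher priority
    b = Router("B", priority=100)
    history = [(0, a.active, b.active)]
    for t in range(1, duration + 1):
        delivered = t % HELLO_INTERVAL == 0 and link_up(t)
        if delivered:
            # Both routers hear each other; the higher priority one keeps
            # the active role and the other stands down.
            a.last_heard = b.last_heard = t
            a.active, b.active = a.priority >= b.priority, a.priority < b.priority
        else:
            # A standby that has heard nothing for HOLD_TIME promotes itself.
            for r in (a, b):
                if not r.active and t - r.last_heard >= HOLD_TIME:
                    r.active = True
        state = (t, a.active, b.active)
        if state[1:] != history[-1][1:]:
            history.append(state)
    return history


if __name__ == "__main__":
    # Hellos are lost between t=20 and t=39, then delivery resumes.
    for t, a_active, b_active in simulate(lambda t: not (20 <= t < 40), 60):
        print(f"t={t:3d}s  A active={a_active}  B active={b_active}")
```

In this model, each period of lost hellos produces one interval in which both routers are active, followed by a stand-down when hellos resume. If hello delivery is intermittent, that cycle repeats, which is one way a flip-flop pattern like the one described above can arise.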
We also have our own pair of redundant routers facing AIS. Our
redundant routers also use a redundancy protocol. We use VRRP (Virtual
Router Redundancy Protocol). The proper functioning of VRRP on our
routers depends on the same pair of layer 2 switches that HSRP depends on
for the AIS routers. Our systems did not flip-flop during this event. This
indicates that the layer 2 switches were functional and reduces the
likelihood that they were the cause of the AIS routers flip-flopping.
During an announced maintenance window between 02:00 and 04:00 Friday,
we tested all relevant systems in our network and attempted to recreate
the situation that caused the intermittent network connectivity. All
systems performed exactly as designed and expected. There was no
flip-flopping. We performed these tests with AIS on hand and while they
were logged in to their equipment. Given the successful tests, we
returned the connections between M5Hosting and AIS to the normal fully
redundant configuration.
We maintain a strong relationship with AIS. We requested that AIS
carefully review their logs and configurations to identify any possible
causes of the flip-flopping. We also
requested a copy of the relevant log entries. Given the response and
partial log data provided by AIS, we cannot conclusively determine what
caused the flip-flopping on their equipment. However, with the
successful test of all M5Hosting systems and our connectivity to AIS and
the function of HSRP on the AIS routers, we can only conclude that the
issues which caused this network anomaly on Wednesday were unique, and
likely related to the power event or the flurry of activity surrounding
it.
As with any challenging event, we are taking this opportunity to
improve our systems, both technical and procedural. After reviewing our
response to the events on Wednesday and our technical configurations, we
will be making some improvements to our internal monitoring systems,
procedures and configurations. These improvements will speed diagnosis
of complicated issues involving the interaction between systems and improve
remote access to more network devices in the event that the network is
down. Additionally, we will improve our off-site monitoring of our
infrastructure and create alerts that are unique and alarm differently
from those of other systems, so that they stand out when there is a
large volume of simultaneous alerts.
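As a rough illustration of the alerting idea, the following Python sketch separates infrastructure alerts from a flood of per-server alerts so the former are not buried. It is only a sketch under assumptions: the Alert type, the "network" and "server" kinds, and the triage function are hypothetical and do not describe our actual monitoring system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str  # hypothetical device or host name
    kind: str    # hypothetical category: "network" or "server"


def triage(alerts):
    """Return network alerts individually and everything else as counts.

    Network alerts are listed one by one so they stand out; the remaining
    alerts are collapsed into a per-kind count so a flood stays readable.
    """
    critical = [a for a in alerts if a.kind == "network"]
    flood = Counter(a.kind for a in alerts if a.kind != "network")
    return critical, dict(flood)


if __name__ == "__main__":
    alerts = [Alert(f"srv{i}", "server") for i in range(50)]
    alerts.append(Alert("upstream-link", "network"))
    critical, flood = triage(alerts)
    print("NETWORK ALERTS:", [a.source for a in critical])
    print("other alerts:", flood)
```

The design choice here is simply to treat one class of alert as never summarized, while everything else is aggregated during a burst.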
Part of improving our systems includes taking your feedback. If you
have thoughts about this email, or the details it describes, please take
a moment to share them with us and help us improve.
Sincerely,
Mike
--
************************************************************
Michael J. McCafferty
Principal, Engineer
M5 Hosting
http://www.m5hosting.com
You can have your own custom Dedicated Server up and running today !
RedHat Enterprise, CentOS, Ubuntu, Debian, OpenBSD, FreeBSD, and more
************************************************************