LabVIEW Web UI Builder Outage April 21 2011 Postmortem

ernestm · ‎04-26-2011

Here is the postmortem from the WLV server outage we suffered last week. We posted ongoing updates on the NI Hosted Services status page at http://status.niwsc.com. If you think there's something wrong with the NI cloud, that's your first place to check; if there's a known problem, we update it as soon as we can.

What Happened And Why

From 10:53 AM to 4:05 PM CDT on Thursday, April 21, the LabVIEW Web UI Builder cloud systems were offline. This prevented downloads of the editor, loading or saving to the cloud, and build/deploy. If you were installed locally and using local project storage, you may not have been affected. We apologize for this unplanned outage. We got caught up in the massive downtime in the Amazon AWS US East region, which is where we host the service.

Detailed Timeline

2:51am – LabVIEW FPGA Compile Cloud Beta (FCC) and LabVIEW Web UI Builder (WLV) experience problems with the load-balanced web servers on the front end; the LV Cloud Ops team is notified by monitoring. We have 24x5 operations staff on shift, so there was no on-call delay.

3:30am – LV Cloud Ops team is able to stabilize WLV, but FCC is offline (both redundant front end servers are down).

4:00am – Ticket opened with Amazon support.

4:00am – Attempted to create new servers to fail over to. Not able to create new servers in Amazon US East Region.

7:17am – Posted on status.niwsc.com that FCC was no longer functioning.

10:16am – One of the two WLV web servers crashes; we are still not able to spin up new instances in US East. It becomes clear from Twitter that the problem is widespread and affecting many Amazon customers hosted in US East.

10:20am – We alert customers on status.niwsc.com that we might be losing WLV.

10:30am – We begin planning moving web servers to US West, but also hoping second webserver stays up. Our tests show AWS is generally jacked up.

10:53am – Second web server for WLV goes down, WLV offline.

11:00am – Team works on moving assets to the US West Amazon region and continues to try to get new instances in US East.

12:35pm – Amazon provides their first explanation:
A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

1:00pm – Reset DNS for uibuilder.niwsc.com to point to “Page not found” on our core services.

2:30pm – Able to provision new instances for the first time. We grab 4 real quick.

2:40pm – Deploy software, apps, and config on all 4 new servers.

3:30pm – Encounter errors with the Elastic Load Balancers (ELBs).

3:50pm – Swap in new ELBs and move DNS.

4:05pm – Services restored to customers. We give it burn-in time before communicating on the status page and forums.

4:55pm – Communicate issue resolution.

What We Can Do Better

We were not already spread across multiple Amazon regions, instead relying on multiple availability zones for uptime. Furthermore, our processes for migrating systems to a new region require the old region to be working. We plan to set up a process to move assets over to US-West on a regular basis so that we can bring up systems there as needed. We have had consistently good uptime with Amazon up to this point and do not plan on moving providers, but of course if the issue recurs we would address it at our supplier level. We do intend to raise with Amazon the slow or missing communication to us and other customers surrounding the issue.
It took us a while to get the downtime notification page updated. We will do better. It is hard to prioritize communication when people are frantically working issues, but we train to treat it as a top priority. We have an automation solution nearly done that will update the notification page when our monitoring detects a problem, though of course explanations would still have to be entered by operations staff.
We were not able to quickly email affected customers, as operations staff does not have a direct channel to send email to customers. We are looking at other communication options such as Twitter and opt-in email from the notification page.
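
To make the automated notification idea concrete, here is a minimal sketch of how monitoring results for a set of redundant front-end servers could be mapped to a status-page state. This is not our actual automation; the names (`Status`, `service_status`, `status_line`) and the three-state model are assumptions made for illustration only. It mirrors what happened in the timeline: one of two web servers down means degraded, both down means offline.

```python
from enum import Enum
from typing import Mapping


class Status(Enum):
    OPERATIONAL = "operational"
    DEGRADED = "degraded"
    OUTAGE = "outage"


def service_status(server_health: Mapping[str, bool]) -> Status:
    """Collapse per-server health checks into one status-page state.

    server_health maps a (hypothetical) server name to True if its
    health check passed and False if it failed.
    """
    if not server_health:
        raise ValueError("no servers to evaluate")
    healthy = sum(1 for ok in server_health.values() if ok)
    if healthy == len(server_health):
        return Status.OPERATIONAL
    if healthy == 0:
        return Status.OUTAGE
    return Status.DEGRADED


def status_line(service: str, status: Status) -> str:
    """Build the automatic one-line notice; staff add the explanation later."""
    return f"{service}: {status.value} (automated notice, details to follow)"


if __name__ == "__main__":
    # Hypothetical snapshot: one of the two web servers has crashed.
    snapshot = {"wlv-web-1": False, "wlv-web-2": True}
    print(status_line("WLV", service_status(snapshot)))
```

The automated line only reports the state. As noted above, the explanation of what is going on would still be entered by operations staff.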

We'd Like To Hear From You

Were you affected by the outage? What could we have done better in your eyes?

How would you like to be notified of outages? We are reluctant to email all hosted services customers about outages in case you'd consider that annoying or spammy. Should we always push outage info to you, or should that be via an opt-in mechanism? Do you like email, Twitter, forum posts, all of the above?

Again, we apologize for the service disruption and will continue to work to keep the NI Cloud up and rock solid 100% of the time.
