Here is the postmortem for the WLV (LabVIEW Web UI Builder) server outage we suffered last week. We posted ongoing updates on the NI Hosted Services status page at http://status.niwsc.com. If you think there's something wrong with the NI cloud, that's the first place to check; when there's a known problem, we update it as soon as we can.
What Happened And Why
From 10:53 AM to 4:05 PM CDT on Thursday, April 21, the LabVIEW Web UI Builder cloud systems were offline. This prevented downloads of the editor, loading from or saving to the cloud, and build/deploy. If you had the editor installed locally and were using local project storage, you may not have been affected. We apologize for this unplanned outage. We got caught up in the massive downtime in the Amazon AWS US East region, which is where we host the service.
2:51am – LabVIEW FPGA Compile Cloud Beta (FCC) and LabVIEW Web UI Builder (WLV) experience problems with the load-balanced web servers on the front end. The LV Cloud Ops team is notified by monitoring. We have 24x5 operations staff on shift, so there was no on-call delay.
3:30am – LV Cloud Ops team is able to stabilize WLV, but FCC is offline (both redundant front end servers are down).
4:00am – Ticket opened with Amazon support.
4:00am – Attempted to create new servers to fail over to. Not able to create new servers in Amazon US East Region.
7:17am – Posted on status.niwsc.com that FCC was no longer functioning.
10:16am – One of the two WLV web servers crashes, and we are still not able to spin up new AWS instances in US East. It becomes clear from Twitter that the problem is widespread and affecting many Amazon customers hosted in US East.
10:20am – We alert customers on status.niwsc.com that we might be losing WLV.
10:30am – We begin planning to move the web servers to US West, while hoping the second web server stays up. Our tests show AWS is generally jacked up.
10:53am – Second web server for WLV goes down, WLV offline.
11:00am – Team works on moving assets to the US West Amazon region and continues trying to get new instances in US East.
12:35pm – Amazon provides their first explanation:
A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
1:00pm – Reset DNS for uibuilder.niwsc.com to point to “Page not found” on our core services.
2:30pm – Able to provision new instances for the first time. We grab 4 real quick.
2:40pm – Deploy software, apps, and config on all 4 new servers.
3:30pm – Encounter errors with the Elastic Load Balancers (ELBs).
3:50pm – Swap in new ELBs and move DNS.
4:05pm – Services restored to customers. We give it burn-in time before communicating on the status page and forums.
4:55pm – Communicate issue resolution
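For anyone who wants the intervals rather than the raw timestamps, here is a minimal sketch that works them out from the timeline above. The helper names (to_minutes, elapsed, fmt) are made up for this illustration, and it assumes every timestamp falls on the same day, which holds for this outage.

```python
def to_minutes(stamp: str) -> int:
    """Convert a timeline stamp such as '10:53am' to minutes after midnight."""
    clock, meridiem = stamp[:-2], stamp[-2:].lower()
    if meridiem not in ("am", "pm"):
        raise ValueError(f"expected am/pm suffix in {stamp!r}")
    hour, minute = (int(part) for part in clock.split(":"))
    # 12am is hour 0 and 12pm is hour 12, so reduce modulo 12 first.
    hour = hour % 12 + (12 if meridiem == "pm" else 0)
    return hour * 60 + minute


def elapsed(start: str, end: str) -> int:
    """Minutes between two same-day timeline stamps."""
    return to_minutes(end) - to_minutes(start)


def fmt(minutes: int) -> str:
    """Render a minute count as hours and minutes."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Timestamps copied from the timeline above (all CDT, same day).
intervals = {
    "first alert to WLV restored": ("2:51am", "4:05pm"),
    "WLV offline to WLV restored": ("10:53am", "4:05pm"),
    "WLV offline to first new instances": ("10:53am", "2:30pm"),
    "WLV restored to resolution posted": ("4:05pm", "4:55pm"),
}

for label, (start, end) in intervals.items():
    print(f"{label}: {fmt(elapsed(start, end))}")
```

The second interval is the customer-facing WLV downtime quoted at the top of this post.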
What We Can Do Better
We'd Like To Hear From You
Were you affected by the outage? What could we have done better in your eyes?
How would you like to be notified of outages? We are reluctant to email all hosted services customers about outages in case you'd consider that annoying or spammy. Should we always push outage info to you or should that be via an opt in mechanism? Do you like email, twitter, forum posts, all of the above?
Again, we apologize for the service disruption and will continue working to keep the NI Cloud up and rock solid 100% of the time.