Scheduling and End-User Transactional Pages

Intermittent service unavailability

Postmortem

Summary: AWS suffered a major service outage in us-east-1, degrading the reliability of the zingfit platform. public-boot in particular was inoperable for an extended period. Remediation efforts restored the service, but occasional drops in service still occurred over the lifetime of the incident.

On 2020-11-25, AWS suffered a cascading failure of services stemming from a disruption of AWS Kinesis in the us-east-1 region. At 15:39 UTC zingfit’s public front-end began experiencing unexplained service timeouts; AWS had not yet updated its status page, so it took some time to confirm that AWS itself was experiencing an outage. By roughly 16:38 UTC the entire back-end for zingfit’s public scheduling application was inoperable. The cause was AWS ECS losing access to Kinesis, which ECS depends on for integrated service discovery. The ECS disruption affected only one of zingfit’s many ECS services; all other ECS services remained operable.

zingfit application logs were inaccessible because CloudWatch Logs was inoperative through both the AWS console and the SDK (both returned 500 errors). Because the zingfit public scheduling back-end also uses Kinesis to ship large volumes of API request logs to storage, API request logging was impacted for the duration of the incident.
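For readers reconstructing the failure mode, the sketch below shows roughly how the affected upstream dependencies could be probed from the SDK during such an outage. It is illustrative only and assumes boto3; the probe calls and region are assumptions, not zingfit's actual tooling.

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGION = "us-east-1"  # the affected region during this incident

    def reachable(probe):
        """Return True if the API call completes without a server-side (5xx) error."""
        try:
            probe()
            return True
        except EndpointConnectionError:
            return False
        except ClientError as err:
            return err.response["ResponseMetadata"]["HTTPStatusCode"] < 500

    logs = boto3.client("logs", region_name=REGION)       # CloudWatch Logs
    kinesis = boto3.client("kinesis", region_name=REGION)

    # During the incident both of these checks would have failed with 500s,
    # matching the console and SDK behaviour described above.
    print("CloudWatch Logs reachable:", reachable(lambda: logs.describe_log_groups(limit=1)))
    print("Kinesis reachable:        ", reachable(lambda: kinesis.list_streams(Limit=1)))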

These issues originated inside Amazon’s data center; zingfit could do little but weather the outage.

A zingfit engineer worked around the ECS failure by manually creating a new ECS service identical to the previous one. API request logging was also disabled as a precaution. Deploying these changes restored service to clients at 17:38 UTC. AWS reported that full restoration occurred at 18:23 UTC.
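A rough sketch of the kind of manual workaround described above, assuming boto3; the cluster and service names here are placeholders, not zingfit's real identifiers:

    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    CLUSTER = "public-scheduling"   # hypothetical cluster name
    BROKEN_SERVICE = "public-boot"  # hypothetical name for the affected service

    # Read the configuration of the broken service so the replacement is identical.
    old = ecs.describe_services(cluster=CLUSTER, services=[BROKEN_SERVICE])["services"][0]

    # Copy only the settings that are actually present on the old service.
    kwargs = {
        "cluster": CLUSTER,
        "serviceName": BROKEN_SERVICE + "-manual",
        "taskDefinition": old["taskDefinition"],
        "desiredCount": old["desiredCount"],
    }
    for key in ("launchType", "networkConfiguration", "loadBalancers", "placementStrategy"):
        if old.get(key):
            kwargs[key] = old[key]

    replacement = ecs.create_service(**kwargs)
    print("Created replacement service:", replacement["service"]["serviceArn"])

Traffic would then be pointed at the replacement service, bypassing the service-discovery path that Kinesis had taken down.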

For more technical information on the AWS outage, visit https://aws.amazon.com/message/11201/

