On Thursday, February 26, we had an outage that prevented cookbook downloads from Supermarket via Berkshelf. The next day, February 27, we held a public post mortem.
If you’d like to see the video of the post mortem, you can view it on YouTube here.
Description
A deploy to production Supermarket that was intended to allow http access for downloads switched the download links served to Berkshelf and ChefDK to http, but did not fix the broken http downloads themselves. As a result, cookbook downloads through Berkshelf and ChefDK failed.
Timeline
Time to Detect – 47 minutes
Time to Resolution – 103 minutes
All times are in UTC on February 25, 2016
- 20:55: Deploy of supermarket 2.4.0, which caused the issue, is performed by Robb Kidd (robb); at this time https is still functional
- 21:31: First user report of issue comes in via #chef on Freenode (irc)
- 21:35: Issue is reported in Hangops Slack
- 21:37: Noah Kantrowitz (coderanger) notifies Paul Mooring (pwm) via Chef Success Slack
- 21:42: Nell Shamrell-Harrington (nell), pwm, and robb begin investigating the issue in Chef’s internal Slack
- 21:46: Incorrect protocol in universe endpoint is discovered by robb
- 21:53: Config option to disable ssl is pointed out by robb
- 21:55: The ssl config option is set to true by nell
- 22:03: All nodes have ssl set to true
- 22:03: Due to a self-signed cert, all download URLs are unreachable
- 22:04: All instances get removed from service by ELB (due to cert issues)
- 22:05: Eric Alwais (eric) updates Chef status page (status.chef.io)
- 22:10: pwm, robb and nell meet to discuss problem
- 22:22: robb begins reverting and pinning package version to 2.3.3
- 22:28: nell directs robb to revert the config changes
- 22:34: Changes complete, nell verifies problem is clear
- 22:37: Josh Glass posts all clear to status page
- 22:37: pwm calls incident resolved
Impact
Users were unable to download cookbooks using Berkshelf or ChefDK for approximately 2 hours.
- Direct downloads (via web interface, curl, etc.) were functional using https
- Automated systems (berkshelf, chefdk, etc.) were returning http links based on universe endpoint
- After the ssl setting was enabled, a total outage occurred (30 minutes)
Contributing Factor(s)
- Insufficient monitoring on supermarket (api including /universe and web app)
- Lack of comprehensive testing on deploys
- Overly complicated code in omnibus package
- Lack of production system understanding
Stabilization Step
Changes made in the initial deploy were reverted:
- Production supermarket was dropped back to version 2.3.3
- Supermarket version 2.3.3 was locked on frontends
- Config changes were reverted to the pre-deploy state and supermarket-ctl reconfigure was run
- Unsecured (http over port 80) access to cookbook downloads was turned back off (backed out code change)
Corrective Actions
Long Term
- Document various ssl deployments for supermarket
- Get Supermarket deployed through automatic provisioning with tests
Immediate
- Package a 2.4.1 without code changes for http downloads – robb
- Add an attribute for supermarket version to deploy cookbook – nell
- Monitor /universe including protocol version returned – nell and pwm
- Update deployment checklist for explicit test steps – robb
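As an illustration of the /universe monitoring item, here is a minimal Python sketch of a protocol check. It is not the check we deployed. It assumes the universe document is JSON that maps cookbook name to version to a metadata object, and that the metadata carries URLs in `download_url` and `location_path`. The `UNIVERSE_URL` constant and the function names are hypothetical and chosen only for this example.

```python
import json
from urllib.parse import urlsplit
from urllib.request import urlopen

# Assumed URL of the universe endpoint; adjust for the deployment being monitored.
UNIVERSE_URL = "https://supermarket.chef.io/universe"


def non_https_entries(universe, required_scheme="https"):
    """Return (cookbook, version, field, url) tuples whose URL scheme is wrong.

    Assumes `universe` maps cookbook name -> version -> metadata dict, and that
    the metadata carries URLs in "download_url" and "location_path".
    """
    problems = []
    for cookbook, versions in universe.items():
        for version, meta in versions.items():
            for field in ("download_url", "location_path"):
                url = meta.get(field)
                if url is not None and urlsplit(url).scheme != required_scheme:
                    problems.append((cookbook, version, field, url))
    return problems


def check_universe(url=UNIVERSE_URL, timeout=10):
    """Fetch the universe document and return any entries that fail the check."""
    with urlopen(url, timeout=timeout) as response:
        universe = json.load(response)
    return non_https_entries(universe)
```

A monitor could call `check_universe()` on a schedule and alert whenever the returned list is not empty. Keeping the scheme check in `non_https_entries` separate from the network fetch means the same function can be run against a saved universe document as an explicit test step in a deploy checklist.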