Bunny.net is a CDN provider based in Slovenia with customers and PoPs all over the globe. On June 22nd, an automated code update led to a cascade of failures that crashed the company’s DNS servers, wiped out most of its CDN capacity, and affected 750,000 Web sites.
Despite multiple redundancies built into the system, hidden dependencies thwarted attempts to roll back, reboot, and redeploy. Over two frantic hours, the team struggled to identify the problem, get fixes in place, and get the service up and running. In an effort to be transparent with customers, Bunny.net wrote a detailed postmortem that it shared on the company blog.
On today’s Day Two Cloud, we talk with Bunny.net to understand what happened, what the company learned, and what other infrastructure operators can take away from this experience. Our guest is Dejan Grofelnik Pelzel, founder of Bunny.net.
We discuss:
* Automation and dependencies
* Leveraging testing to reveal problems that hadn’t been considered
* How to recognize single points of failure
* Considerations around fate sharing
Zesty provides an autonomous cloud experience by leveraging advanced AI technology to manage the cloud for you.
Our AI reacts in real time to capacity changes and enables companies to maximize cloud efficiency and reduce their AWS bill by more than 50%, completely hands-free. It’s cloud on autopilot. Find out how to spend less and do more at zesty.co.
The stack overflow of death. How we lost DNS and what we’re doing to prevent this in the future. – Bunny.net blog
@dejanpelzel – Dejan Pelzel on Twitter
Dejan Pelzel on LinkedIn
[00:00:00.980] – Ned
Zesty provides an autonomous cloud experience by leveraging advanced A.I. technology to manage the cloud for you. Their A.I. reacts in real time to capacity changes and enables companies to maximize cloud efficiency and reduce their AWS bill by more than 50 percent, completely hands-free. Cloud on autopilot. With Zesty, companies can spend less and do more. Check them out at Zesty.co.
[00:00:33.720] – Ethan
Welcome to Day Two Cloud. And today we have the story of an outage. Yeah, the story of an outage. We love these stories because there are lessons learned for the rest of us. And the story is brought to us by the folks at Bunny.net, not that this is a sponsored show. It isn’t. I just spotted this very transparent recounting of a two-hour outage that Bunny.net had, and it was up on Hacker News. There was some discussion about it.
[00:00:57.870] – Ethan
And I reached out to them and said, hey, you want to come on the podcast and tell the story? I just thought it was fascinating, this set of cascading failures that were tied in with automation and DNS. And it was one of those failures that everybody saw. And what did you get out of this conversation, Ned?
[00:01:15.420] – Ned
You know, I got a couple of things out of it. One was that the ultimate test is production, because no matter how much testing you do ahead of time, once that code rolls out into production,