Databases go offline. Services fail to scale up. Deployment errors can cause an application backend to get DDoS’d.
An incident is any event that prevents your company from operating as expected. Software teams respond to an incident by issuing a fix. Sometimes that fix returns the software to its ideal state. Other times the software remains in a degraded state, and more fixes are needed before it returns to where it should be.
One way that a software team can learn from an incident is through incident reproduction. When an incident is turned into a reproducible system, it becomes a predictable training exercise rather than a surprising and painful outage.
Tammy Butow is an engineer with Gremlin, a company that makes chaos engineering software. Chaos engineering is the process of creating controlled experiments that simulate outages. Tammy joins the show to discuss common incident types, and how those can be made reproducible for training exercises.
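To make the idea of a controlled experiment concrete, here is a minimal Python sketch of reproducing a "database goes offline" incident. It does not use Gremlin's software or any real chaos engineering tool. `FakeDatabase`, `lookup_user`, and `run_experiment` are hypothetical names invented for illustration, and the cache fallback is an assumed design, not something described in the episode.

```python
class DatabaseOffline(Exception):
    """Raised by the fake database while the simulated outage is active."""


class FakeDatabase:
    """Hypothetical stand-in for a real database client."""

    def __init__(self):
        self.offline = False
        self.rows = {"user:1": "Ada"}

    def get(self, key):
        if self.offline:
            raise DatabaseOffline("simulated outage")
        return self.rows[key]


def lookup_user(db, cache, key):
    """Service code under test: read from the database, fall back to a cache."""
    try:
        value = db.get(key)
        cache[key] = value
        return value, "database"
    except DatabaseOffline:
        if key in cache:
            return cache[key], "cache"
        return None, "unavailable"


def run_experiment():
    """Observe steady state, inject the outage, then confirm recovery."""
    db, cache = FakeDatabase(), {}
    results = [lookup_user(db, cache, "user:1")]      # steady state
    db.offline = True                                 # inject the fault
    try:
        results.append(lookup_user(db, cache, "user:1"))
    finally:
        db.offline = False                            # always roll the fault back
    results.append(lookup_user(db, cache, "user:1"))  # recovery
    return results


if __name__ == "__main__":
    for value, source in run_experiment():
        print(value, source)
```

Because the fault is switched on and off deliberately, the same outage can be replayed as often as a team needs for training. The `finally` block matters here: an experiment that fails partway through should still leave the system in its normal state.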
Sponsorship inquiries: sponsor@softwareengineeringdaily.com
The post Incident Reproduction with Tammy Butow appeared first on Software Engineering Daily.