Software Engineering Daily

Technical interviews about software topics.

https://softwareengineeringdaily.com/

Incident Reproduction with Tammy Butow


Databases go offline. Services fail to scale up. Deployment errors can cause an application backend to get DDoS’d.

An incident is an event that prevents your company from operating as expected. Software teams respond to an incident by issuing a fix. Sometimes that fix returns the software to its ideal state. Other times the software remains in a degraded state, and it takes further fixes to restore it fully.

One way that a software team can learn from an incident is through incident reproduction. When an incident is turned into a reproducible system, it becomes a predictable training exercise rather than a surprising and painful outage.

Tammy Butow is an engineer with Gremlin, a company that makes chaos engineering software. Chaos engineering is the process of creating controlled experiments that simulate outages. Tammy joins the show to discuss common incident types and how they can be made reproducible for training exercises.

Sponsorship inquiries: sponsor@softwareengineeringdaily.com

Check out our active projects:
  • We are hiring a head of growth. If you like Software Engineering Daily and consider yourself competent in sales, marketing, and strategy, send me an email: jeff@softwareengineeringdaily.com
  • FindCollabs is a place to build open source software.
  • The SEDaily app for iOS and Android includes all 1000 of our old episodes, as well as related links, greatest hits, and topics. Subscribe for ad-free episodes.



 October 16, 2019  1h1m