Failing up: Former Amazon engineers land $18M to help companies find weak spots in their apps

(L to R): Gremlin co-founders Kolton Andrus, CEO, and Matt Fornaciari, CTO. (Gremlin Photo)

“Chaos engineering” is not the planning process for a Senate committee hearing, but a growing discipline within software engineering built on the idea that by deliberately making your systems and applications fail, you’ll learn the best ways to protect those apps from real-world failures. Gremlin just raised an $18 million Series B round to prove that companies will pay to have someone crash their apps.

Founded by former Amazon and Netflix engineers Kolton Andrus, CEO, and Matthew Fornaciari, CTO, Gremlin has now raised $23.75 million in funding for its concept of “failure-as-a-service.” The round was led by Tomasz Tunguz of Redpoint Ventures, with participation from existing investors Amplify Ventures and Index Ventures.

Andrus and Fornaciari worked together at Amazon in Seattle for four years (“and one week,” Andrus cracked in an interview, a probable reference to vesting schedules). The principles of chaos engineering came together there and were later refined at Netflix, Andrus’ next stop after Amazon. The idea is to throw predictable problems (slow connections, storage outages, an issue in code your application depends on) at different parts of your infrastructure to understand how your application fails in response to those obstacles, while limiting the “chaos” to a small group of users or even an individual engineer’s account.
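To make the idea concrete, here is a minimal Python sketch of a fault-injection experiment whose effects are limited to a chosen set of accounts. It is an illustration only, not Gremlin's product or API: the names `Experiment`, `call_with_chaos`, `fetch_profile` and `InjectedFault` are all hypothetical, and the "dependency" is a stand-in for a real storage or network call.

```python
import time


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


class Experiment:
    """Hypothetical experiment: one fault type, limited to a set of user IDs."""

    def __init__(self, fault, targeted_users, delay_seconds=0.0):
        self.fault = fault                        # "latency" or "error"
        self.targeted_users = set(targeted_users)  # who is affected by the fault
        self.delay_seconds = delay_seconds

    def applies_to(self, user_id):
        return user_id in self.targeted_users


def call_with_chaos(experiment, user_id, dependency, sleep=time.sleep):
    """Call dependency(), injecting the experiment's fault for targeted users only."""
    if experiment is not None and experiment.applies_to(user_id):
        if experiment.fault == "latency":
            sleep(experiment.delay_seconds)       # simulate a slow connection
        elif experiment.fault == "error":
            raise InjectedFault("simulated dependency outage")
    return dependency()


def fetch_profile(experiment, user_id):
    """Application code under test: fall back to a default if the dependency fails."""
    try:
        return call_with_chaos(
            experiment, user_id, lambda: {"user": user_id, "source": "storage"}
        )
    except InjectedFault:
        return {"user": user_id, "source": "fallback"}
```

In this sketch, running `fetch_profile` with an `"error"` experiment that targets a single engineer's account shows whether the fallback path works, while every other user still gets the normal response.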

The approach worked very well, Andrus said, and substantially improved uptime at both companies.

“After seeing those two patterns apply (at Amazon and Netflix), we felt like there was an opportunity to build this commercially for everyone in the industry. But anyone that really wanted to do this successfully would need good tooling, and some good help and guidance,” he said.

In its early days, Gremlin focused on helping customers diagnose weak spots in their infrastructure, but it now wants to move up the stack and help developers understand application-level failures. Alongside the funding announcement, Gremlin plans to unveil a new service called Application Level Fault Injection at Chaos Conf in San Francisco tomorrow.

Application developers who subscribe to the new service will need to add a bit of code to their applications, Andrus said. He thinks developers writing serverless computing applications will be particularly interested, because while the whole point of serverless computing is that you’ve passed management of the servers involved to somebody else, serverless applications still break in weird ways like anything else.

On Monday, Gremlin’s 40th employee will start working for the San Francisco-headquartered startup. But the company has embraced a “remote by default” approach toward building its business and, given its Seattle heritage, has several employees in the region, Andrus said. Gremlin has a co-working space in Seattle but will likely look for dedicated office space in the city over the next year or so, he said.