On Thursday, February 3rd, 2022, some of our customers encountered a service disruption in the Kisi system. The outage, which mostly affected our European customers, lasted about 1 hour and 40 minutes. Any interruption to our service is unacceptable, and we wholeheartedly apologize.
It is our responsibility to keep our customers informed about what happened and to share our strategy for preventing similar events in the future. In this post we explain exactly what went wrong and what we're doing to help prevent it from happening again.
We will use this event as an opportunity to learn and make Kisi even more resilient.
Here’s what happened
On Thursday, February 3rd, 2022, we experienced an outage starting at 11:15 CET. Our distributed engineering team immediately began to investigate. Everyone involved—from Sweden to the United States to Argentina—was on a call within minutes to identify the root cause and solve the problem. Shortly before 13:00 CET our API became available again.
The root cause was a deadlock in a proxy that forwards a portion of our traffic to a new firewall. We introduced the firewall to detect and prevent DDoS and similar attacks more efficiently, as part of our continuous efforts to increase the resilience of the Kisi API. Even though the proxy had been running in production for a few weeks, an unlikely contention scenario occurred and created a deadlocking feedback loop. Once we had identified what was happening, we were able to resolve the issue swiftly.
Here’s how we’re using this lesson to make Kisi more resilient
We started to analyze the root cause of the outage by asking “five whys,” and investigating every layer to understand how the incident unfolded. From there, we brainstormed a list of ideas we could implement to reduce or eliminate outages in the long term. To prevent such incidents in the future and make Kisi more resilient, we will:
- Update our incident response protocol and revise it regularly
- Organize frequent workshops for incident response with outage drills
- Strengthen our monitoring and alerting processes
- Expedite our efforts towards deploying our edge cache feature. This means that recently used mobile and card credentials will be stored on the readers, which will allow access attempts to work regardless of API status.
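For readers curious how an edge cache of this kind can work in principle, here is a minimal sketch in Python. It is an illustration only and not Kisi's implementation: the names `EdgeCredentialCache`, `authorize`, `ApiUnavailable`, and `check_with_api`, as well as the cache size and 24-hour lifetime, are all assumptions made for this example. The idea is that a reader asks the API first, remembers credentials the API recently approved, and falls back to that local memory only when the API cannot be reached.

```python
import time
from collections import OrderedDict
from typing import Callable


class ApiUnavailable(Exception):
    """Raised by the (hypothetical) API client when the backend cannot be reached."""


class EdgeCredentialCache:
    """Remembers the most recently approved credentials on the reader (illustrative)."""

    def __init__(
        self,
        max_entries: int = 1000,          # assumed limit, not a Kisi value
        ttl_seconds: float = 86400.0,     # assumed 24-hour lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._entries: "OrderedDict[str, float]" = OrderedDict()
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock

    def remember(self, credential_id: str) -> None:
        # Store the approval time and mark the credential as most recently used.
        self._entries[credential_id] = self._clock()
        self._entries.move_to_end(credential_id)
        # Evict the least recently used entries once the cache is full.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def forget(self, credential_id: str) -> None:
        self._entries.pop(credential_id, None)

    def is_fresh(self, credential_id: str) -> bool:
        stored_at = self._entries.get(credential_id)
        if stored_at is None:
            return False
        if self._clock() - stored_at > self._ttl:
            # Expired approvals are dropped and no longer grant access.
            del self._entries[credential_id]
            return False
        return True


def authorize(
    credential_id: str,
    check_with_api: Callable[[str], bool],
    cache: EdgeCredentialCache,
) -> bool:
    """Ask the API first; use the local cache only if the API is unreachable."""
    try:
        allowed = check_with_api(credential_id)
    except ApiUnavailable:
        return cache.is_fresh(credential_id)
    if allowed:
        cache.remember(credential_id)
    else:
        # A live denial removes any earlier cached approval.
        cache.forget(credential_id)
    return allowed
```

In this sketch a credential that was never approved while the API was reachable is not granted access during an outage, and cached approvals expire after the configured lifetime, which limits how long a revoked credential could keep working offline.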
Closing thoughts
People and businesses around the world rely on Kisi every day to simplify access control. In other words, they expect Kisi to work. Anything that stops Kisi from working is a top priority for the entire Kisi team.
We must live up to the expectations of our customers to deliver a system with an extreme degree of reliability. On February 3rd, those efforts (ironically) caused an outage, but in the future we expect an even more reliable experience with less downtime than ever.
We’d like to extend our thanks to all of our affected customers for your patience and again apologize for any inconvenience caused. Should you encounter any issues or have any questions, please reach out to us at support@getkisi.com.