At 09:58 AEST on 22 August, we deployed an update to add API scopes to the Kinde Management API. Following this, a few customers reported errors during token exchange with the API, although the errors appeared on only about 50% of requests. We began investigating as soon as the first report came in.
The team initially treated this as an isolated issue affecting a few customers and only about 50% of their API calls. Hours later, as more customers started getting similar errors, the team realized that the problem was much wider.
The root cause turned out to be an aggressive timeout setting in the production infrastructure, which had been set for security reasons. The issue was hard to detect because the API behaved as expected whenever a request completed within the timeout. When a request exceeded the timeout, however, an incorrect token containing no scopes was returned, so when customers tried to use that token, the API returned a 403 error. Customers using a cached token did not receive any errors, because the API scope change was backwards compatible.
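The failure mode described above, where a timeout silently produces a token with no scopes, can be illustrated with a minimal Python sketch. This is not Kinde's actual code or the fix that was deployed. The names `issue_token`, `load_scopes`, `Token`, and `ScopeLookupTimeout` are hypothetical, and the sketch assumes the scope lookup signals a timeout by raising `TimeoutError`. It shows one general way to fail closed: if scopes cannot be loaded in time, raise an error instead of issuing a token that will later be rejected.

```python
from dataclasses import dataclass
from typing import Callable, List


class ScopeLookupTimeout(Exception):
    """Raised instead of issuing a token when scopes could not be loaded in time."""


@dataclass(frozen=True)
class Token:
    # Hypothetical token shape, for illustration only.
    subject: str
    scopes: List[str]


def issue_token(
    subject: str,
    load_scopes: Callable[[str, float], List[str]],
    timeout_s: float,
) -> Token:
    """Issue a token, failing closed if the scope lookup times out.

    `load_scopes` is an assumed callable that returns the scopes for
    `subject`, or raises TimeoutError if it cannot finish within `timeout_s`.
    """
    try:
        scopes = load_scopes(subject, timeout_s)
    except TimeoutError as exc:
        # Do not fall back to an empty scope list: such a token looks valid
        # at exchange time but is rejected later when it is used.
        raise ScopeLookupTimeout(
            f"scope lookup for {subject!r} exceeded {timeout_s}s"
        ) from exc
    return Token(subject=subject, scopes=list(scopes))
```

With this shape, a slow lookup surfaces as an explicit error at token exchange, where it can be retried and monitored, rather than as a 403 later when the token is used.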
We deployed a fix that resolved the issue at 17:21 AEST. Total incident time was 7 hours and 23 minutes.
We communicated with impacted customers throughout the incident.