Summary
The root cause of the incident was a faulty configuration update shipped as part of a deployment.
This led to a period during which the new deployment partially served traffic before being classified as unhealthy.
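The window in which the new instances served traffic before being marked unhealthy suggests the readiness check did not exercise the faulty configuration. As an illustration only (the file path, port, and endpoint below are hypothetical, not the service's actual code), a readiness handler that refuses to report healthy until the instance's configuration loads and parses might look like this:

```go
package main

import (
	"encoding/json"
	"net/http"
	"os"
)

// Hypothetical readiness check: the instance only reports healthy once its
// configuration file can be read and parsed. Path and port are illustrative.
func readyHandler(w http.ResponseWriter, _ *http.Request) {
	raw, err := os.ReadFile("/etc/payment-sessions/config.json")
	if err != nil {
		http.Error(w, "config unreadable", http.StatusServiceUnavailable)
		return
	}
	var cfg map[string]any
	if err := json.Unmarshal(raw, &cfg); err != nil {
		// A faulty configuration keeps the node out of the load balancer.
		http.Error(w, "config invalid", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/ready", readyHandler)
	http.ListenAndServe(":8080", nil)
}
```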
The impacted API resources were as follows:
v1/payment-sessions

Timeline
The erroneous deployment went live at 5:21pm UTC. Live traffic was switched to the new instances at 5:23pm.
On-site developers noticed elevated errors originating from the new nodes at 5:25pm and initiated a rollback at 5:30pm.
The rollback completed at 5:49pm and produced an immediate reduction in the errors introduced by the faulty deployment.
The total impact window, from the traffic switch at 5:23pm to the completion of the rollback at 5:49pm, was approximately 26 minutes.
What are we doing about it?