Incident started: Oct 04 at 03:14 am UTC
Resolution started: Oct 04 at 03:14 am UTC
Incident mitigated: Oct 04 at 03:24 am UTC
Customer impacted time: varying between 0 and 5 minutes
Summary: The Volterra dataplane service on all POPs was restarted on Oct 04, leading to service disruption.
Root cause: Our control plane service evicted the dataplane microservice on all our POPs because the microservice was requesting far more resources than it had been allocated, which caused the dataplane service to restart. The dataplane service came back up within 5 minutes, and services came back online immediately.
Incident flow: At around 03:14 am UTC on Oct 04 2021, our monitoring system alerted us that customer services had been degraded. We traced the issue to the dataplane being restarted by our control plane. Because the dataplane service was restarted, customer applications served by Volterra POPs in all regions were impacted during the incident. Some users would have seen a 503 or no response when requesting the origin.
Corrective measures: Our engineering team is still investigating the root cause of this issue.