Incident started: Oct 06 at 01:09 am UTC
Resolution started: Oct 06 at 01:11 am UTC
Incident mitigated: Oct 06 at 01:25 am UTC
Incident fully resolved: Oct 06 at 02:49 am UTC
Customer impacted time: varying between 19 and 27 minutes
Summary: The Volterra dataplane service on all POPs was restarted on Oct 04 and Oct 06, leading to service disruption.
Root cause: Our control plane service evicted the dataplane microservice on all our POPs because it was requesting far more resources than it had been allocated, which caused the dataplane service to restart. The dataplane service came back up in about 15 minutes, and customer services came back online immediately afterward.
Incident flow: At around 01:09 am UTC on Oct 06, 2021, our monitoring system alerted us that customer services were degraded. We traced the issue to the dataplane being restarted by our control plane. Because the dataplane service was restarted, customer applications served by Volterra POPs in all regions were impacted for the duration of the incident. Some users would have seen a 503 error or no response when requesting the origin.
Corrective measures: Engineering has traced the issue to the WAF engine requesting more memory than it was allowed. This was caused by the increase in the number of customers on the platform; unfortunately, this component was not being monitored by our internal SRE (site reliability engineering) system.
Our engineering team has fixed this issue by increasing the memory allocation of the WAF engine and has deployed the change as a hotfix in production across our global network.
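To illustrate the monitoring gap described above, the following is a minimal sketch of a check that flags a component whose memory use is approaching its allocation, before an eviction can occur. It is not our actual SRE tooling: the `MemorySample` type, the `components_near_limit` function, the component names, and the 80% threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class MemorySample:
    """One memory reading for a component (all names are hypothetical)."""
    component: str
    used_bytes: int
    limit_bytes: int


def components_near_limit(samples, threshold=0.8):
    """Return (component, utilization) pairs at or above the threshold.

    Results are sorted with the highest utilization first. Samples with a
    non-positive limit are skipped because utilization is undefined for them.
    """
    flagged = []
    for sample in samples:
        if sample.limit_bytes <= 0:
            continue
        utilization = sample.used_bytes / sample.limit_bytes
        if utilization >= threshold:
            flagged.append((sample.component, utilization))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only; not measurements from the incident.
    readings = [
        MemorySample("waf-engine", 950, 1000),
        MemorySample("proxy", 400, 1000),
    ]
    for name, utilization in components_near_limit(readings):
        print(f"{name} is at {utilization:.0%} of its memory allocation")
```

A check of this kind would alert on the ratio of used memory to allocated memory rather than on absolute use, so that growth driven by new customers becomes visible as headroom shrinks.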