Intermittent outages of Customer Metrics API
Incident Report for Volterra
Resolved
Summary:
Volterra Metrics APIs were not responding to intermittent requests for 1-5 minutes intervals due increased load and cluster rebalancing. Customer services/applications were not impacted but it affected availability of metrics on the customer portal.

Root cause:
Our Metrics system experienced significant load because of rebalancing when one of the members crashed. Our system detected the issue and automatically removed members to mitigate the issue. Total impact was between 1-5 minutes in Monitoring dashboards on customer portal during the mitigation window.

Incident flow:
At around 1:06 pm UTC on Oct 12 2021, we received alerts on our monitoring system that our Metrics ingester had been degraded. At 0:09 the system started facing random crashes due cluster rebalancing due to crashed members. We root caused the issue with the overloaded members and initiated scaling of underlying infrastructure to mitigate the issue quickly. This resulted in red or error views in monitoring pages inside of VoltConsole. Service was fully restored at 1:49 pm.

Corrective measures:
Engineering has root caused the issue and increased the scale with more spare members for the metric cluster.
Posted Oct 12, 2021 - 14:09 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 12, 2021 - 13:28 UTC
This incident affected: Volterra Public APIs (Customer Dashboard, Metrics APIs).