Management of Customer Edge devices experiencing issues
Incident Report for F5 Distributed Cloud
Postmortem

F5® Distributed Cloud Services – Customer Edges

Root Cause Analysis for Customer Edge management experiencing 503 errors

Report Date: 2024-05-01

Incident Date(s): 2024-04-18

EVENT SUMMARY

On 2024-04-18 at 18:30 UTC, the F5 Distributed Cloud support team detected HTTP 503 (Service Unavailable) errors in Customer Edge (CE) management through internal monitoring. The ensuing investigation revealed that the issue was confined to the registration of new Customer Edges and software upgrades on existing ones. Traffic processing was not impacted.

Further examination traced the 503 errors to an incorrect CE configuration that caused the service responsible for CE configuration management to malfunction.

To address the issue, the F5 Distributed Cloud team manually removed the incorrect configuration and restarted the CE configuration management service. Post-fix validation confirmed that the CE management issue was resolved by 21:20 UTC on 2024-04-18, which marked the end of the service event.

The service event lasted 2 hours and 50 minutes.

WHAT HAPPENED?

INCIDENT DETAILS
Start time of Service Event: 2024-04-18 18:30 UTC
Conclusion of Service Event: 2024-04-18 21:20 UTC
Event duration: 2 hours 50 minutes
Impact: Distributed Cloud customers may have experienced issues upgrading software on existing Customer Edges or registering new ones.
Root cause: A new Customer Edge site was created with an incorrect configuration, which caused Customer Edge management to fail.
TIMELINE OF EVENTS
DATE | TIME (UTC) | ACTION
2024-04-18 | 18:30 | The F5 Distributed Cloud support team identified 503 errors in Customer Edge (CE) management via proactive monitoring.
2024-04-18 | 18:30 – 20:24 | The F5 Distributed Cloud team investigated and determined that registration of new CE sites and software upgrades on existing ones were affected. There was no impact to traffic processing.
2024-04-18 | 20:30 – 21:00 | An incorrect configuration on a newly created CE was identified as the cause of the 503 errors.
2024-04-18 | 21:20 | The F5 Distributed Cloud team manually rectified the incorrect configuration and restarted the CE configuration management service, after which no further errors were observed with CE management. This is the end of the service event.

IS THE SERVICE EVENT FULLY RESOLVED?

Yes, the service event is resolved, and Customer Edge services are fully operational.

ROOT CAUSE

A dedicated service manages configuration whenever a new CE is registered or an existing one undergoes a software upgrade. Due to an under-optimized code base, this service could not adequately handle an incorrect CE configuration and failed. Consequently, users experienced 503 errors when attempting to create new CE sites or upgrade existing ones.

RESOLUTION AND NEXT STEPS

RESOLUTION

The F5 Distributed Cloud team rectified the incorrect configuration manually and subsequently restarted the affected service.

NEXT STEPS: FUTURE EVENT PREVENTION

We will take the following measure to prevent this service event from recurring and to ensure that we can react to and recover from similar scenarios more quickly.

  • The F5 Distributed Cloud team will deliver a hotfix in the upcoming Distributed Cloud platform release so that the CE configuration management service continues to operate even when a new CE site is registered with a misconfiguration.
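
As an illustration only, the kind of behavior this measure describes can be sketched as follows. This is a minimal sketch, not F5's implementation: the `ConfigManager` class, the `InvalidSiteConfig` exception, the `software_version` field, and the site names are all hypothetical, since this report does not describe the actual service or its configuration schema. The idea shown is that an invalid site configuration is rejected and set aside on its own, so that handling of other sites continues.

```python
from dataclasses import dataclass, field


class InvalidSiteConfig(ValueError):
    """Raised when a (hypothetical) CE site configuration fails validation."""


@dataclass
class ConfigManager:
    """Toy stand-in for a CE configuration management service."""

    applied: dict = field(default_factory=dict)
    quarantined: dict = field(default_factory=dict)

    def validate(self, site: str, config: dict) -> None:
        # Hypothetical rule: the real schema is not described in this report.
        if not isinstance(config.get("software_version"), str):
            raise InvalidSiteConfig(f"{site}: missing or non-string software_version")

    def register(self, site: str, config: dict) -> bool:
        """Apply one site's config; reject a bad one without affecting others."""
        try:
            self.validate(site, config)
        except InvalidSiteConfig as err:
            # Record the failure and keep serving instead of crashing.
            self.quarantined[site] = str(err)
            return False
        self.applied[site] = config
        return True


manager = ConfigManager()
results = [
    manager.register("site-a", {"software_version": "1.0"}),
    manager.register("site-bad", {"software_version": None}),
    manager.register("site-c", {"software_version": "1.1"}),
]
```

In this sketch the misconfigured site ends up in `quarantined` with an error message, while the registrations before and after it still succeed.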

CLOSING

F5® understands how important the reliability of the Distributed Cloud Platform is for customers. F5 will ensure the recommended changes in this document are incorporated into our operational Methods of Procedure (MoP) moving forward. We are grateful you have chosen to partner with F5® for critical service delivery and are committed to evolving our platform and tooling to better anticipate and mitigate disruptions to Distributed Cloud Platform services.

APPENDICES

F5 Glossary

https://www.f5.com/services/resources/glossary

Posted May 01, 2024 - 23:29 UTC

Resolved
The F5 Distributed Cloud team validated and confirmed that the identified issue with Customer Edge management has been resolved and that no further issues are being observed. All other services remain fully operational. This incident has been resolved.
Posted Apr 18, 2024 - 21:52 UTC
Monitoring
The issue with management of Customer Edges has been identified and resolved. We are continuing to monitor the system.
Posted Apr 18, 2024 - 21:27 UTC
Investigating
The F5 Distributed Cloud team noticed an issue with our platform where software upgrades on Customer Edges and registration of new Customer Edges may experience issues. Traffic processing is not impacted. The team is working on identifying the issue.
Posted Apr 18, 2024 - 21:02 UTC
This incident affected: Customer Support, Docs and WebSite (Software Distribution).