Rack outage

Incident Report for Corpex

Resolved

Incident Report: Rack Hardware & Switch Failure
Start: February 21, 2026, ~04:30 CET
End: February 21, 2026, ~07:50 CET

Executive Summary
The rack failure on February 21 was caused by a defective switch. Because the switch's status continuously fluctuated between online and offline, the automatic failover functionality of the switch stack failed to trigger. As a result, several hardware servers within the affected rack temporarily went offline.
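To illustrate why a fluctuating status is harder to act on than a clean failure, here is a minimal sketch of flap detection: counting state changes inside a sliding time window. It is an illustration under stated assumptions, not the logic used by the affected switch stack or by our monitoring. The names `FlapDetector` and `is_flapping` and the thresholds (60 seconds, 4 transitions) are hypothetical.

```python
from collections import deque


class FlapDetector:
    """Flags a link as flapping when its state changes too often in a time window.

    Hypothetical helper for illustration; thresholds are assumed values.
    """

    def __init__(self, window_seconds=60.0, max_transitions=4):
        self.window_seconds = window_seconds
        self.max_transitions = max_transitions
        self._last_state = None
        self._transitions = deque()  # timestamps of observed state changes

    def observe(self, timestamp, is_up):
        """Record one status sample; return True if the link counts as flapping."""
        if self._last_state is not None and is_up != self._last_state:
            self._transitions.append(timestamp)
        self._last_state = is_up
        # Drop transitions that have fallen out of the sliding window.
        while self._transitions and timestamp - self._transitions[0] > self.window_seconds:
            self._transitions.popleft()
        return len(self._transitions) >= self.max_transitions


def is_flapping(samples, window_seconds=60.0, max_transitions=4):
    """Return True if any point in the (timestamp, is_up) sequence looks like flapping."""
    detector = FlapDetector(window_seconds, max_transitions)
    return any(detector.observe(t, up) for t, up in samples)
```

A check of this kind could treat a device that keeps switching between online and offline as failed, instead of waiting for a steady offline state that never arrives.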

Timeline of Events
- ~04:30: Our monitoring systems alerted us to the failure of a switch and multiple connected hardware servers.
- Initial Triage: We immediately began remote troubleshooting. We quickly ruled out high power load and broader network issues as root causes. However, the switch stack of the affected rack remained unresponsive.
- On-Site Investigation: A physical inspection revealed that the switch stack was caught in a boot loop, preventing any network traffic from passing through.
- Mitigation: A review of our network capacity confirmed that the switches had sufficient buffer to remain operational during this period. Nevertheless, to restore stability quickly, we proactively migrated the connected servers to neighboring racks.
- Resolution: We performed a targeted reboot of the stack, which successfully cleared the boot loop.
- ~07:50: Once the switch stack was fully accessible again, we started or restarted the remaining services on the servers, bringing all systems back to normal operation.

Next Steps
- Hardware Replacement: We will replace the affected switch stack to prevent this issue from recurring.
- Communication: A schedule and announcement for the hardware replacement will follow.

Posted Feb 23, 2026 - 14:14 CET

Monitoring

We have identified a failed switch stack within the affected rack as the root cause of this incident. The faulty hardware has since been replaced with a spare unit and connectivity has been restored. We are continuing to monitor the situation closely to ensure full stability.

Posted Feb 21, 2026 - 07:49 CET

Investigating

We are currently investigating a complete loss of connectivity to one of our racks. Our engineering team has been engaged and is working to restore full service as quickly as possible. We will provide updates as the situation develops.

Posted Feb 21, 2026 - 04:49 CET

This incident affected: Core Network (Internal Routing), Virtual Machines (Computing), Mailservers (POP3, Webmail), and Services (Icinga, Helpdesk).