During the week of June 24, 2025, two separate incidents of intermittent disruption of service were observed due to issues related to database replication and synchronization. These incidents occurred:
The service disruption was caused by problems arising from unexpected growth of database logs causing interference in the normal operation of replication and synchronization processes. Immediate remediation efforts were required to restore system functionality.
During the disruption, users experienced intermittent failures across most functions of the payment gateway, including transaction processing, balance checks, and merchant API responses.
The root cause of the incident was traced to abnormal growth in database logs, which overwhelmed the replication mechanisms and resulted in synchronization failures. The underlying log management processes failed to flag the anomaly early, allowing the issue to escalate into a full-service disruption.
To resolve the issue:
To prevent recurrence:
· Conduct a post-mortem analysis to assess incident response and escalation effectiveness, and to identify process improvements.
· Implement additional monitors, checks, and notifications to track log sizes and database free space in order to mitigate similar occurrences in the future.