Analyzing Systemic Failures in IT Incident Management: Insights from Post-Mortem Analysis
DOI:
https://doi.org/10.59188/eduvest.v5i5.51192
Keywords:
IT Incident Management, Incident Prevention, Incident Detection, Fintech, Root Cause Analysis, System Reliability
Abstract
The reliability of IT systems is crucial for technology-driven businesses, as service disruptions can lead to financial losses, operational inefficiencies, and customer dissatisfaction. Even with an incident management framework in place, organizations still experience recurring IT incidents, which indicates systemic weaknesses in incident prevention, detection, and response. This study aims to identify the systemic root causes of major IT incidents and to assess the challenges in incident detection and resolution. By identifying recurring failure patterns, the research seeks to provide insights into improving IT incident management processes. The study takes a qualitative approach, applying thematic analysis to post-mortem reports of 26 major IT incidents that occurred at PT INUSA, a fintech company in Indonesia, between August 2023 and August 2024. Tags were assigned to categorize systemic failure points, and patterns were extracted to highlight deficiencies in software operations and incident management processes. Findings show that 80% of incidents were triggered by internal changes, with recurring issues such as insufficient testing, ineffective deployment and change control processes, and missing or misconfigured production settings. Additionally, 69% of incidents lacked proactive alerts, particularly on transaction success rates, CPU utilization, and system health metrics, which delayed detection. Inefficiencies in incident response, including delayed incident reporting and slow debugging, further prolonged recovery times. The study highlights critical weaknesses in IT incident management and recommends improvements such as enhanced automated testing, stricter deployment validation, and standardized monitoring mechanisms. These insights provide guidance for fintech and technology companies seeking to reduce incident frequency, improve detection capabilities, and optimize response efficiency.
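The tagging step described in the abstract can be illustrated with a minimal Python sketch: each post-mortem report carries a set of tags, and the share of reports carrying each tag is computed. The record layout, the tag names, and the `tag_shares` helper are all hypothetical, and the four records are toy data. None of it is taken from the study's actual taxonomy or results.

```python
from collections import Counter

# Hypothetical toy records. Tag names are illustrative only and are
# NOT the study's taxonomy or data.
reports = [
    {"id": "INC-1", "tags": {"internal-change", "insufficient-testing"}},
    {"id": "INC-2", "tags": {"internal-change", "no-proactive-alert"}},
    {"id": "INC-3", "tags": {"external-dependency", "no-proactive-alert"}},
    {"id": "INC-4", "tags": {"internal-change", "misconfigured-production"}},
]


def tag_shares(reports):
    """Return the fraction of reports that carry each tag."""
    # Tags are stored as sets, so each report counts at most once per tag.
    counts = Counter(tag for report in reports for tag in report["tags"])
    total = len(reports)
    return {tag: n / total for tag, n in counts.items()}


shares = tag_shares(reports)

# Print tags from most to least frequent (ties broken alphabetically).
for tag, share in sorted(shares.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{tag}: {share:.0%}")
```

Storing tags as a set per report keeps the shares interpretable as "fraction of incidents with this failure point", which is the form in which the abstract reports its percentages.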
License
Copyright (c) 2025 Faris Arifiansyah

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.