Online behavior identification in distributed systems
Date
2015
Authors
Álvarez Cid-Fuentes, J.
Szabo, C.
Falkner, K.
Type
Conference paper
Citation
Proceedings of the Symposium on Reliable Distributed Systems, 2015, vol. 2016-January, pp. 202-211
Statement of Responsibility
Javier Álvarez Cid-Fuentes, Claudia Szabo, and Katrina Falkner
Conference Name
34th Symposium on Reliable Distributed Systems (SRDS) (28 Sep 2015 - 1 Oct 2015 : Montreal, Canada)
Abstract
The diagnosis, prediction, and understanding of unexpected behavior are crucial for long-running, large-scale distributed systems. However, existing work focuses on identifying faults at specific points in time that are preceded by significantly abnormal metric readings, or requires prior analysis of historical failure data. In this work, we propose an online behavior classification system that identifies a wide range of undesired behaviors, which may appear even in healthy systems, and their evolution over time. We apply a two-step process, involving two online classifiers, to periodically collected system metrics in order to identify normal and anomalous behaviors such as deadlock, starvation, and livelock at runtime, without any prior analysis of historical failure data. Our approach achieves over 80% accuracy in detecting unexpected behaviors and over 90% accuracy in identifying their type, with a short delay after the anomalies appear and with minimal expert intervention. Our experimental analysis uses system execution traces obtained from a Google cluster and from our in-house distributed system with varied behaviors, and shows the benefits of our approach as well as future research challenges.
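The two-step idea in the abstract can be illustrated with a minimal sketch. The abstract does not name the classifiers used, so everything below is an assumption made for illustration: step one is a running z-score detector (Welford's online mean and variance), step two is a nearest-centroid classifier trained on a few expert-labelled anomalous samples. The class names, the 3.0 threshold, and the two-metric sample layout (CPU utilisation, throughput) are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online mean/variance for a single metric."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        if self.n < 2:
            return 0.0  # not enough history to judge
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else abs(x - self.mean) / std


class AnomalyDetector:
    """Step 1 (assumed): flag a sample if any metric deviates strongly."""

    def __init__(self, n_metrics, threshold=3.0):
        self.stats = [RunningStats() for _ in range(n_metrics)]
        self.threshold = threshold

    def observe(self, sample):
        anomalous = any(
            s.zscore(x) > self.threshold for s, x in zip(self.stats, sample)
        )
        if not anomalous:
            # Only normal samples update the baseline.
            for s, x in zip(self.stats, sample):
                s.update(x)
        return anomalous


class TypeClassifier:
    """Step 2 (assumed): nearest centroid over labelled anomalous samples."""

    def __init__(self):
        self.sums = {}
        self.counts = defaultdict(int)

    def learn(self, sample, label):
        if label not in self.sums:
            self.sums[label] = [0.0] * len(sample)
        self.sums[label] = [a + b for a, b in zip(self.sums[label], sample)]
        self.counts[label] += 1

    def predict(self, sample):
        if not self.sums:
            return None  # no labelled anomalies seen yet

        def distance(label):
            centroid = [s / self.counts[label] for s in self.sums[label]]
            return math.dist(centroid, sample)

        return min(self.sums, key=distance)


def classify(sample, detector, typer):
    """Run one periodically collected sample through both steps."""
    if not detector.observe(sample):
        return "normal"
    return typer.predict(sample) or "unknown anomaly"
```

In this sketch an expert would call `typer.learn(sample, "deadlock")` the first time an unknown anomaly is confirmed, after which similar samples are typed automatically. That is one way to read "minimal expert intervention", not a description of the authors' system.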
Rights
© 2015 IEEE