Online behavior identification in distributed systems

Date

2015

Authors

Álvarez Cid-Fuentes, J.
Szabo, C.
Falkner, K.

Type

Conference paper

Citation

Proceedings of the Symposium on Reliable Distributed Systems, 2015, vol. 2016-January, pp. 202-211

Statement of Responsibility

Javier Álvarez Cid-Fuentes, Claudia Szabo, and Katrina Falkner

Conference Name

34th Symposium on Reliable Distributed Systems (SRDS) (28 Sep 2015 - 1 Oct 2015 : Montreal, Canada)

Abstract

The diagnosis, prediction, and understanding of unexpected behavior are crucial for long-running, large-scale distributed systems. However, existing work either focuses on identifying faults at specific points in time that are preceded by significantly abnormal metric readings, or requires prior analysis of historical failure data. In this work, we propose an online behavior classification system that identifies a wide range of undesired behaviors, which may appear even in healthy systems, and tracks their evolution over time. We employ a two-step process that applies two online classifiers to periodically collected system metrics to identify, at runtime, normal behavior and anomalous behaviors such as deadlock, starvation, and livelock, without any prior analysis of historical failure data. Our approach achieves over 80% accuracy in detecting unexpected behaviors and over 90% accuracy in identifying their type, with a short delay after the anomalies appear and with minimal expert intervention. Our experimental analysis uses system execution traces obtained from a Google cluster and from our in-house distributed system with varied behaviors, and shows the benefits of our approach as well as future research challenges.

Rights

© 2015 IEEE
