Online behavior identification in distributed systems

dc.contributor.author: Álvarez Cid-Fuentes, J.
dc.contributor.author: Szabo, C.
dc.contributor.author: Falkner, K.
dc.contributor.conference: 34th Symposium on Reliable Distributed Systems (SRDS) (28 Sep 2015 - 1 Oct 2015 : Montreal, Canada)
dc.date.issued: 2015
dc.description.abstract: The diagnosis, prediction, and understanding of unexpected behavior is crucial for long running, large scale distributed systems. However, existing works focus on the identification of faults in specific time moments preceded by significantly abnormal metric readings, or require a previous analysis of historical failure data. In this work, we propose an online behavior classification system to identify a wide range of undesired behaviors, which may appear even in healthy systems, and their evolution over time. We employ a two-step process involving two online classifiers on periodically collected system metrics to identify at runtime normal and anomalous behaviors such as deadlock, starvation and livelock, without any previous analysis of historical failure data. Our approach achieves over 80% accuracy in detecting unexpected behaviors and over 90% accuracy in identifying their type with a short delay after the anomalies appear, and with minimal expert intervention. Our experimental analysis uses system execution traces obtained from a Google cluster and from our in-house distributed system with varied behaviors, and shows the benefits of our approach as well as future research challenges.
dc.description.statementofresponsibility: Javier Álvarez Cid-Fuentes, Claudia Szabo, and Katrina Falkner
dc.identifier.citation: Proceedings of the Symposium on Reliable Distributed Systems, 2015, vol.2016-January, pp.202-211
dc.identifier.doi: 10.1109/SRDS.2015.16
dc.identifier.isbn: 9781467393027
dc.identifier.issn: 1060-9857
dc.identifier.orcid: Szabo, C. [0000-0003-2501-1155]
dc.identifier.orcid: Falkner, K. [0000-0003-0309-4332]
dc.identifier.uri: http://hdl.handle.net/2440/107821
dc.language.iso: en
dc.publisher: IEEE
dc.relation.ispartofseries: Symposium on Reliable Distributed Systems Proceedings
dc.rights: © 2015 IEEE
dc.source.uri: https://doi.org/10.1109/srds.2015.16
dc.title: Online behavior identification in distributed systems
dc.type: Conference paper
pubs.publication-status: Published

Files

Original bundle
Name: RA_hdl_107821.pdf
Size: 379.34 KB
Format: Adobe Portable Document Format
Description: Restricted Access