DieHard: reliable scheduling to survive correlated failures in cloud data centers
Umeå University, Faculty of Science and Technology, Department of Computing Science.
Umeå University, Faculty of Science and Technology, Department of Computing Science.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2016 (English). In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016, p. 52-59. Conference paper (Refereed)
Abstract [en]

In large-scale data centers, a single fault can lead to simultaneous, correlated failures of several physical machines and the tasks running on them. Such correlated failures can severely damage the reliability of a service or a job. This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, and on jobs that run multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job's reliability in the presence of correlated failures. In addition, we address the problem of scheduling a job with reliability constraints. We formulate scheduling as an optimization problem whose aim is to achieve the desired reliability with the minimum number of extra tasks. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement that achieves a desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job's reliability.

Place, publisher, year, edition, pages
2016. 52-59 p.
IEEE-ACM International Symposium on Cluster Cloud and Grid Computing, ISSN 2376-4414
National Category
Computer Science
URN: urn:nbn:se:umu:diva-126537
DOI: 10.1109/CCGrid.2016.11
ISI: 000382529800006
ISBN: 978-1-5090-2453-7 (print)
OAI: diva2:1040751
16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 16-19, 2016, Cartagena, Colombia
Available from: 2016-10-28. Created: 2016-10-10. Last updated: 2016-10-28. Bibliographically approved.

By author/editor
Sedaghat, Mina; Wadbro, Eddie; De Luna, Sara; Seleznjev, Oleg; Elmroth, Erik