Application-aware resource management for datacenters
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Distributed Systems Group)
2018 (English). Licentiate thesis, comprehensive summary (Other academic)
Alternative title: Applikationsmedveten resurshantering för datacenter (Swedish)
Abstract [en]

High Performance Computing (HPC) and Cloud Computing datacenters are extensively used to steer and solve complex problems in science, engineering, and business, such as calculating correlations and making predictions. Already in a single datacenter server, there are thousands of hardware and software metrics – Key Performance Indicators (KPIs) – that individually and aggregated can give insight into the performance, robustness, and efficiency of the datacenter and the provisioned applications. At the datacenter level, the number of KPIs is even higher. The fast-growing interest in datacenter management from both the public and industry, together with the rapid expansion in the scale and complexity of datacenter resources and the services provided on them, has made monitoring, profiling, controlling, and provisioning compute resources dynamically at runtime a challenging and complex task. Commonly, correlations of application KPIs, like response time and throughput, with resource capacities show that the runtime systems (e.g., containers or virtual machines) used to provision these applications do not utilize available resources efficiently. This reduces datacenter efficiency, which in turn results in higher operational costs and longer waiting times for results.

The goal of this thesis is to develop tools and autonomic techniques for improving datacenter operations, management, and utilization, while improving application performance or minimizing the impact on it. To this end, we make use of application resource descriptors to create a library that dynamically adjusts the amount of resources used, enabling elasticity for scientific workflows in HPC datacenters. For mission-critical applications, high availability is of great concern since these services must be kept running even in the event of system failures. By modeling and correlating specific resource counters, like CPU, memory, and network utilization, with the number of runtime synchronizations, we present adaptive mechanisms that dynamically select which fault tolerance mechanism to use. Likewise, for scientific applications we propose a hybrid extensible architecture for dual-level scheduling of data-intensive jobs in HPC infrastructures, allowing operational simplification and on-boarding of new types of applications, and achieving greater job throughput with higher overall datacenter efficiency.

Place, publisher, year, edition, pages
Umeå: Department of Computing Science, Umeå University, 2018. p. 28
Series
Report / UMINF, ISSN 0348-0542 ; 18.14
Keywords [en]
Resource Management, High Performance Computing, Cloud Computing
National subject category
Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-155620
ISBN: 978-91-7601-971-9 (print)
OAI: oai:DiVA.org:umu-155620
DiVA, id: diva2:1282472
Presentation
2018-12-12, MA121, MIT-Huset, Umeå, 20:31 (English)
Available from: 2019-01-25. Created: 2019-01-24. Last updated: 2020-09-14. Bibliographically approved
List of papers
1. E-HPC: A Library for Elastic Resource Management in HPC Environments
2017 (English). In: 12th Workshop on Workflows in Support of Large-Scale Science (WORKS), New York, NY, USA: Association for Computing Machinery (ACM), 2017, article id 1. Conference paper, Published paper (Refereed)
Abstract [en]

Next-generation data-intensive scientific workflows need to support streaming and real-time applications with dynamic resource needs on high performance computing (HPC) platforms. The static resource allocation model on current HPC systems, which was designed for monolithic MPI applications, is insufficient to support the elastic resource needs of current and future workflows. In this paper, we discuss the design, implementation, and evaluation of Elastic-HPC (E-HPC), an elastic framework for managing resources for scientific workflows on current HPC systems. E-HPC considers a resource slot for a workflow as an elastic window that might map to different physical resources over the duration of a workflow. Our framework uses checkpoint-restart as the underlying mechanism to migrate workflow execution across the dynamic window of resources. E-HPC provides the foundation necessary to enable the dynamic allocation of HPC resources needed for streaming and real-time workflows. E-HPC has negligible overhead beyond the cost of checkpointing. Additionally, E-HPC results in decreased turnaround time of workflows compared to the traditional model of resource allocation for workflows, where resources are allocated per stage of the workflow. Our evaluation shows that E-HPC improves core hour utilization for common workflow resource use patterns and provides an effective framework for elastic expansion of resources for applications with dynamic resource needs.
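
To make the core hour argument concrete, the following is a minimal sketch and not the E-HPC implementation. It assumes a hypothetical workflow described as a list of stages, each with a core count and a duration, and compares a static allocation sized for the peak stage against an elastic window that is resized between stages at the cost of one checkpoint per resize. The names `Stage`, `static_core_hours`, and `elastic_core_hours`, and all numbers, are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    cores: int    # cores the stage can actually use (assumed known up front)
    hours: float  # wall-clock duration of the stage


def static_core_hours(stages):
    """Core hours charged when the peak core count is held for the whole workflow."""
    peak = max(stage.cores for stage in stages)
    return peak * sum(stage.hours for stage in stages)


def elastic_core_hours(stages, checkpoint_hours):
    """Core hours charged when the window is resized between stages.

    Assumption: every change in core count costs one checkpoint, charged
    at the core count of the stage that is being left.
    """
    total = 0.0
    for i, stage in enumerate(stages):
        total += stage.cores * stage.hours
        is_last = i == len(stages) - 1
        if not is_last and stages[i + 1].cores != stage.cores:
            total += stage.cores * checkpoint_hours
    return total


# Hypothetical three-stage workflow: narrow, wide, narrow.
workflow = [Stage(4, 1.0), Stage(16, 0.5), Stage(4, 2.0)]
static_cost = static_core_hours(workflow)
elastic_cost = elastic_core_hours(workflow, checkpoint_hours=0.1)
```

Under these assumptions the elastic window is charged fewer core hours than the static allocation as long as the checkpoint cost stays small relative to the stage durations, which is the regime the abstract describes.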

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2017
Keywords
high performance computing, scientific workflows, resource management
National subject category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-142624 (URN)
10.1145/3150994.3150996 (DOI)
2-s2.0-85054761132 (Scopus ID)
978-1-4503-5129-4 (ISBN)
Conference
The International Conference for High Performance Computing, Networking, Storage and Analysis
Available from: 2017-12-06. Created: 2017-12-06. Last updated: 2023-03-24. Bibliographically approved
2. Hybrid Adaptive Checkpointing for Virtual Machine Fault Tolerance
2018 (English). In: Proceedings - 2018 IEEE International Conference on Cloud Engineering, IC2E 2018 / [ed] Li J., Chandra A., Guo T., Cai Y., Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 12-22. Conference paper, Published paper (Refereed)
Abstract [en]

Active Virtual Machine (VM) replication is an application-independent and cost-efficient mechanism for high availability and fault tolerance, with several recently proposed implementations based on checkpointing. However, these methods may suffer from large impacts on application latency, excessive resource usage overheads, and/or unpredictable behavior for varying workloads. To address these problems, we propose a hybrid approach that uses a Proportional-Integral (PI) controller to dynamically switch between periodic and on-demand checkpointing. Our mechanism automatically selects the method that minimizes application downtime by adapting itself to changes in workload characteristics. The implementation is based on modifications to QEMU, LibVirt, and OpenStack, to seamlessly provide fault-tolerant VM provisioning and to enable the controller to dynamically select the best checkpointing mode. Our evaluation is based on experiments with a video streaming application, an e-commerce benchmark, and a software development tool. The experiments demonstrate that our adaptive hybrid approach improves both application availability and resource usage compared to static selection of a checkpointing method, with application performance gains and negligible overheads.
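
The idea of a PI controller choosing a checkpointing mode can be sketched as follows. This is an illustration under stated assumptions, not the paper's controller: the abstract does not specify the controlled variable, so the sketch assumes the input is a measured synchronization rate (checkpoints per second), compared against a hypothetical target rate. The class name `PIModeSelector`, the gains, and the sign convention (a positive control signal selects periodic checkpointing) are all assumptions.

```python
class PIModeSelector:
    """Selects a checkpointing mode from a proportional-integral control signal."""

    def __init__(self, kp, ki, setpoint):
        self.kp = kp              # proportional gain (assumed value)
        self.ki = ki              # integral gain (assumed value)
        self.setpoint = setpoint  # target synchronization rate (assumed metric)
        self.integral = 0.0       # accumulated error over time

    def update(self, measured, dt=1.0):
        """Feed one measurement and return the mode to use for the next interval."""
        error = measured - self.setpoint
        self.integral += error * dt
        signal = self.kp * error + self.ki * self.integral
        # Assumption: a sync rate persistently above the target makes
        # periodic checkpointing the cheaper choice; otherwise stay on-demand.
        return "periodic" if signal > 0 else "on-demand"


selector = PIModeSelector(kp=0.5, ki=0.1, setpoint=10.0)
modes = [selector.update(rate) for rate in (4.0, 20.0, 9.0)]
```

The integral term is what keeps the selector from flipping modes on every short spike: a single measurement slightly off the target changes the mode only if the accumulated error agrees with it.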

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Fault Tolerance, Resource Management, Checkpoint, COLO, Control Theory
National subject category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-152033 (URN)
10.1109/IC2E.2018.00023 (DOI)
000759774400002 ()
2-s2.0-85048315473 (Scopus ID)
978-1-5386-5009-7 (ISBN)
978-1-5386-5008-0 (ISBN)
Conference
2018 IEEE International Conference on Cloud Engineering (IC2E 2018), 17–20 April 2018, Orlando, Florida, USA
Available from: 2018-09-24. Created: 2018-09-24. Last updated: 2023-09-05. Bibliographically approved
3. Hybrid Resource Management for HPC and Data Intensive Workloads
2019 (English). In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Los Alamitos: IEEE Computer Society, 2019, p. 399-409. Conference paper, Published paper (Refereed)
Abstract [en]

Traditionally, High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate hardware using different tools for resource and application management. With increasing convergence of these paradigms, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead and the need for a common computation platform for both application areas increases. Executing both application classes on the same hardware not only enables hybrid workflows, but can also increase the usage efficiency of the system, as often not all available hardware is fully utilized by an application. While HPC systems are typically managed in a coarse-grained fashion, allocating a fixed set of resources exclusively to an application, DI systems employ a finer-grained regime, enabling dynamic resource allocation and control based on application needs. On the path to full convergence, a useful and less intrusive step is a hybrid resource management system that allows the execution of DI applications on top of standard HPC scheduling systems.

In this paper we present the architecture of a hybrid system enabling dual-level scheduling for DI jobs in HPC infrastructures. Our system takes advantage of real-time resource utilization monitoring to efficiently co-schedule HPC and DI applications. The architecture is easily adaptable and extensible to current and new types of distributed workloads, allowing efficient combination of hybrid workloads on HPC resources with increased job throughput and higher overall resource utilization. The architecture is implemented based on the Slurm and Mesos resource managers for HPC and DI jobs. Our experimental evaluation in a real cluster, based on a set of representative HPC and DI applications, demonstrates that our hybrid architecture improves resource utilization by 20%, with a 12% decrease in queue makespan, while still meeting all deadlines for HPC jobs.
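
The second scheduling level, placing DI tasks on capacity that HPC jobs leave idle, might look roughly like the sketch below. It is a simplified illustration, not the paper's architecture and not the Slurm or Mesos API: node utilization is assumed to arrive as a plain dictionary from some monitoring source, the best-fit placement policy and the `headroom` reserve kept free for the HPC job are assumptions, and `place_di_tasks` is a hypothetical name.

```python
import math


def place_di_tasks(nodes, tasks, headroom=0.1):
    """Place DI tasks on spare cores of nodes already allocated to HPC jobs.

    nodes: dict of node name -> (total_cores, cores_in_use), from monitoring.
    tasks: list of (task_id, cores_requested).
    headroom: fraction of each node kept free for the HPC job (assumption).
    Returns (placement dict of task_id -> node, list of pending task_ids).
    """
    spare = {
        name: total - used - math.ceil(headroom * total)
        for name, (total, used) in nodes.items()
    }
    placement, pending = {}, []
    # Largest tasks first, each on the tightest node that still fits (best fit).
    for task_id, cores in sorted(tasks, key=lambda task: -task[1]):
        candidates = [name for name, free in spare.items() if free >= cores]
        if not candidates:
            pending.append(task_id)
            continue
        node = min(candidates, key=lambda name: (spare[name], name))
        spare[node] -= cores
        placement[task_id] = node
    return placement, pending


# Hypothetical monitoring snapshot and DI task queue.
snapshot = {"n1": (32, 20), "n2": (32, 30), "n3": (32, 8)}
queue = [("spark-a", 6), ("spark-b", 16), ("spark-c", 8)]
placement, pending = place_di_tasks(snapshot, queue)
```

Tasks that do not fit stay pending until a later monitoring snapshot shows more spare capacity, which is one simple way to keep DI work from interfering with the HPC jobs that own the nodes.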

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2019
Keywords
Resource Management, High Performance Computing, Data Intensive Computing, Mesos, Slurm, Bootstrapping
National subject category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-155619 (URN)
10.1109/CCGRID.2019.00054 (DOI)
000483058700045 ()
2-s2.0-85069469164 (Scopus ID)
978-1-7281-0913-8 (ISBN)
978-1-7281-0912-1 (ISBN)
Conference
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019, May 14-17, 2019, Larnaca, Cyprus
Note

Originally included in thesis in manuscript form

Available from: 2019-01-24. Created: 2019-01-24. Last updated: 2020-06-26. Bibliographically approved

Open Access in DiVA

fulltext (760 kB), 257 downloads
File information
File name: FULLTEXT01.pdf
File size: 760 kB
Checksum (SHA-512):
480515e6bcec87b3a64dfb0d03ca5ac4b379015526dce6693a7e24643b01ad6e00551e297a0b459f4e17397c83d8dbd1b99e158e871497ec6e81de6349d77e45
Type: fulltext
Mimetype: application/pdf

Person

Souza, Abel Pinto Coelho de
