Hybrid Resource Management for HPC and Data Intensive Workloads
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Distributed Systems Group)
PDC Center for High Performance Computing, KTH Royal Institute of Technology, Sweden.
PDC Center for High Performance Computing, KTH Royal Institute of Technology, Sweden.
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Distributed Systems Group)
2019 (English). In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Los Alamitos: IEEE Computer Society, 2019, p. 399-409. Conference paper, Published paper (Refereed)
Abstract [en]

Traditionally, High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate hardware, using different tools for resource and application management. With the increasing convergence of these paradigms, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead, and the need for a common computation platform for both application areas increases. Executing both application classes on the same hardware not only enables hybrid workflows, but can also increase the usage efficiency of the system, as often not all available hardware is fully utilized by an application. While HPC systems are typically managed in a coarse-grained fashion, allocating a fixed set of resources exclusively to an application, DI systems employ a finer-grained regime, enabling dynamic resource allocation and control based on application needs. On the path to full convergence, a useful and less intrusive step is a hybrid resource management system that allows the execution of DI applications on top of standard HPC scheduling systems.

In this paper we present the architecture of a hybrid system enabling dual-level scheduling for DI jobs in HPC infrastructures. Our system takes advantage of real-time resource utilization monitoring to efficiently co-schedule HPC and DI applications. The architecture is easily adaptable and extensible to current and new types of distributed workloads, allowing efficient combination of hybrid workloads on HPC resources with increased job throughput and higher overall resource utilization. The architecture is implemented on top of the Slurm and Mesos resource managers for HPC and DI jobs, respectively. Our experimental evaluation in a real cluster, based on a set of representative HPC and DI applications, demonstrates that our hybrid architecture improves resource utilization by 20%, with a 12% decrease in queue makespan, while still meeting all deadlines for HPC jobs.
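
As a concrete illustration of the dual-level scheduling idea, the following is a minimal, hypothetical Python sketch, not the paper's implementation: capacity that monitoring reports as idle on HPC nodes is offered to a second-level DI scheduler, and reclaimed when the first-level HPC scheduler needs it back. The class, the helper names, and the safety margin are illustrative assumptions.

    # Hypothetical sketch of dual-level co-scheduling accounting; names,
    # the margin, and the helpers are illustrative, not the paper's API.
    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        cores: int
        hpc_used: int  # cores held by the first-level HPC scheduler (e.g., Slurm)
        di_used: int   # cores lent to the second-level DI framework (e.g., Mesos)

        @property
        def idle(self) -> int:
            return self.cores - self.hpc_used - self.di_used

    def offer_idle_capacity(nodes, min_spare=2):
        """Offer cores beyond a safety margin to DI tasks."""
        offers = {}
        for n in nodes:
            spare = n.idle - min_spare  # keep headroom so HPC jobs meet deadlines
            if spare > 0:
                offers[n.name] = spare
        return offers

    def reclaim_for_hpc(node, cores_needed):
        """Preempt lent cores when the HPC scheduler needs them back."""
        reclaimed = min(node.di_used, cores_needed)
        node.di_used -= reclaimed
        return reclaimed

    nodes = [Node("n1", 32, hpc_used=20, di_used=0),
             Node("n2", 32, hpc_used=30, di_used=0)]
    print(offer_idle_capacity(nodes))  # -> {'n1': 10}

In the paper's setting, the first level corresponds to Slurm and the second to Mesos; the sketch only captures the accounting that keeps HPC allocations untouched while idle cores serve DI tasks.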

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2019. p. 399-409
Keywords [en]
Resource Management, High Performance Computing, Data Intensive Computing, Mesos, Slurm, Bootstrapping
National Category
Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-155619
DOI: 10.1109/CCGRID.2019.00054
ISI: 000483058700045
Scopus ID: 2-s2.0-85069469164
ISBN: 978-1-7281-0913-8 (print)
ISBN: 978-1-7281-0912-1 (electronic)
OAI: oai:DiVA.org:umu-155619
DiVA id: diva2:1282469
Conference
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019, May 14-17, 2019, Larnaca, Cyprus
Note

Originally included in thesis in manuscript form

Available from: 2019-01-24. Created: 2019-01-24. Last updated: 2020-06-26. Bibliographically approved.
In thesis
1. Application-aware resource management for datacenters
2018 (English). Licentiate thesis, comprehensive summary (Other academic)
Alternative title [sv]
Applikationsmedveten resurshantering för datacenter
Abstract [en]

High Performance Computing (HPC) and Cloud Computing datacenters are extensively used to steer and solve complex problems in science, engineering, and business, such as calculating correlations and making predictions. Already in a single datacenter server, there are thousands of hardware and software metrics – Key Performance Indicators (KPIs) – that, individually and aggregated, can give insight into the performance, robustness, and efficiency of the datacenter and the provisioned applications. At the datacenter level, the number of KPIs is even higher. The fast-growing interest in datacenter management from both the public and industry, together with the rapid expansion in scale and complexity of datacenter resources and the services provided on them, has made monitoring, profiling, controlling, and provisioning compute resources dynamically at runtime a challenging and complex task. Commonly, correlations of application KPIs, like response time and throughput, with resource capacities show that the runtime systems (e.g., containers or virtual machines) used to provision these applications do not utilize available resources efficiently. This reduces datacenter efficiency, which in turn results in higher operational costs and longer waiting times for results.

The goal of this thesis is to develop tools and autonomic techniques for improving datacenter operations, management, and utilization, while improving or at least minimizing the impact on application performance. To this end, we make use of application resource descriptors to create a library that dynamically adjusts the amount of resources used, enabling elasticity for scientific workflows in HPC datacenters. For mission-critical applications, high availability is of great concern, since these services must be kept running even in the event of system failures. By modeling and correlating specific resource counters, like CPU, memory, and network utilization, with the number of runtime synchronizations, we present adaptive mechanisms to dynamically select which fault-tolerance mechanism to use. Likewise, for scientific applications we propose a hybrid extensible architecture for dual-level scheduling of data intensive jobs in HPC infrastructures, simplifying operations, enabling on-boarding of new types of applications, and achieving greater job throughput with higher overall datacenter efficiency.
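
To make the adaptive selection concrete, here is a small hypothetical sketch in the same spirit; the thresholds, counter names, and the two candidate mechanisms are assumptions for illustration, not the thesis's controller.

    # Illustrative selection of a fault-tolerance mechanism from observed
    # resource counters; thresholds are assumed, not taken from the thesis.
    def select_ft_mechanism(net_util, syncs_per_sec,
                            net_budget=0.7, sync_budget=50.0):
        """Pick the mechanism for the next control interval.

        Active replication keeps a hot standby in sync, so it pays off only
        while the network can absorb the synchronization traffic; otherwise
        fall back to periodic checkpointing.
        """
        if syncs_per_sec > sync_budget or net_util > net_budget:
            return "checkpoint-restart"  # cheaper when sync traffic is heavy
        return "active-replication"      # faster recovery when sync is cheap

    # A low sync rate with spare network capacity favors active replication.
    print(select_ft_mechanism(net_util=0.2, syncs_per_sec=10))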

Place, publisher, year, edition, pages
Umeå: Department of Computing Science, Umeå University, 2018. p. 28
Series
Report / UMINF, ISSN 0348-0542 ; 18.14
Keywords
Resource Management, High Performance Computing, Cloud Computing
National Category
Computer Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-155620
ISBN: 978-91-7601-971-9
Presentation
2018-12-12, MA121, MIT-Huset, Umeå, 20:31 (English)
Opponent
Supervisors
Available from: 2019-01-25. Created: 2019-01-24. Last updated: 2020-09-14. Bibliographically approved.
2. Autonomous resource management for high performance datacenters
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Autonom resurshantering för högpresterande datacenter
Abstract [en]

Over the last decade, new applications such as data intensive workflows have hit an inflection point in widespread use and influenced the compute paradigm of most scientific and industrial endeavours. Data intensive workflows are highly dynamic, adapting to resource changes and system faults, and can also admit approximate solutions into their models. On the one hand, these dynamic characteristics require processing power and capabilities that originated in cloud computing environments and are not well supported by large High Performance Computing (HPC) infrastructures. On the other hand, cloud computing datacenters favor low latency over throughput, deeply contrasting with HPC, which enforces a centralized environment and prioritizes the total computation accomplished over time, largely ignoring latency. Although data handling needs are predicted to increase by as much as a thousand times over the next decade, the processing power of future datacenters will not increase as much.

To tackle these long-term developments, this thesis proposes autonomic methods combined with novel scheduling strategies to optimize datacenter utilization, while guaranteeing user-defined constraints and seamlessly supporting a wide range of applications under various real operational scenarios. Leveraging the characteristics of data intensive applications, a library is developed to dynamically adjust the amount of resources used throughout the lifespan of a workflow, enabling elasticity for such applications in HPC datacenters. For mission-critical environments where services must run even in the event of system failures, we define an adaptive controller that dynamically selects the best method to perform runtime state synchronizations. We develop different hybrid extensible architectures and reinforcement learning scheduling algorithms that smoothly bring dynamic applications into HPC environments. An overall theme in this thesis is extensive experimentation in real datacenter environments. Our results show improvements in datacenter utilization and performance, achieving higher overall efficiency. Our methods also simplify operations and allow the onboarding of novel types of applications previously not supported.
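
To indicate what reinforcement learning scheduling can look like in this setting, below is a minimal, hypothetical tabular Q-learning sketch; the thesis does not necessarily use this formulation. The state is a coarse cluster-utilization bucket, the action picks what to dispatch next, and the reward favors higher utilization. The discretization, the action set, and the toy environment are illustrative assumptions.

    # Minimal tabular Q-learning sketch for dispatch decisions; purely
    # illustrative, not the thesis's algorithm. State: utilization bucket
    # (0-9). Actions: dispatch an HPC job, dispatch a DI job, or wait.
    import random
    from collections import defaultdict

    ACTIONS = ["dispatch_hpc", "dispatch_di", "wait"]
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

    def bucket(utilization):              # discretize utilization in [0, 1]
        return min(int(utilization * 10), 9)

    def choose_action(state):
        if random.random() < EPSILON:     # explore occasionally
            return random.choice(ACTIONS)
        return max(Q[state], key=Q[state].get)

    def update(state, action, reward, next_state):
        """Standard Q-learning update toward the bootstrapped target."""
        target = reward + GAMMA * max(Q[next_state].values())
        Q[state][action] += ALPHA * (target - Q[state][action])

    # Toy environment: dispatching raises utilization, waiting decays it;
    # the reward is the resulting utilization.
    def step(util, action):
        if action == "dispatch_hpc":
            util = min(util + 0.20, 1.0)
        elif action == "dispatch_di":
            util = min(util + 0.05, 1.0)
        else:
            util = max(util - 0.10, 0.0)
        return util, util

    util = 0.3
    for _ in range(1000):
        s = bucket(util)
        a = choose_action(s)
        util, r = step(util, a)
        update(s, a, r, bucket(util))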

Abstract [sv]

Data intensive workflows are a new class of applications that have become increasingly common over the last decade and strongly influence how computations are performed in most research fields and in industry. These data intensive workflows can dynamically adapt to changes in resource allocation and to system failures, and can sometimes even approximate solutions when resources are scarce. They require high computational power, as well as functionality found only in cloud environments, and are therefore a poor fit for today's high performance computing (HPC) systems. Cloud datacenters prioritize starting newly arrived applications quickly, which differs drastically from HPC environments, where high throughput over time is the primary goal. Although data handling needs are estimated to increase more than a thousandfold over the coming decade, the computational capacity of future datacenters will not develop correspondingly.

This thesis meets these challenges through a combination of autonomic systems and new scheduling strategies that optimize datacenter utilization. This is done without compromising users' performance requirements, and with the goal of supporting a broad spectrum of applications and scenarios. A library is developed to dynamically adapt the resource allocation of workflows at runtime, which means that HPC systems can also support elastic applications that previously could only be executed in clouds. For environments with high availability requirements, a controller is defined that dynamically adapts how applications synchronize state, enabling more resource-efficient active replication. The thesis also develops several resource management systems based on scheduling with reinforcement learning, with the aim of improving support for dynamic applications in HPC systems. An overarching theme of the thesis is extensive evaluation of the developed methods and systems through large-scale experiments in real datacenters. The results show overall improvements in datacenter resource utilization and performance. The developed systems also simplify operations and enable new types of applications that previously could not be executed in HPC environments.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2020. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 20.03
Keywords
Datacenters, high performance computing, scheduling, hybrid
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-169633
ISBN: 978-91-7855-286-3
ISBN: 978-91-7855-287-0
Public defence
2020-05-08, MIT Place Seminarierummet, MIT-byggnaden (Plan 2), Umeå, 10:00 (English)
Opponent
Supervisors
Note

New venue for the public defence (the posting sheet lists the wrong place).

Available from: 2020-04-17. Created: 2020-04-13. Last updated: 2020-05-25. Bibliographically approved.

Open Access in DiVA

fulltext (988 kB), 594 downloads
File information
File name: FULLTEXT02.pdf
File size: 988 kB
Checksum (SHA-512): 22f3f10644282f5f4b5ba506d39beed75e2e319f6f820895270ede499123ebf1725e298469d413f75f4ce8fa6846c56d1b8856c53eca7555ceef7a6179a7067c
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text
Scopus

Authority records

Souza, Abel; Tordsson, Johan

Search in DiVA

By author/editor
Souza, Abel; Tordsson, Johan
By organisation
Department of Computing Science
Computer Systems

Search outside of DiVA

Google
Google Scholar
Total: 972 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
