Publications (10 of 11)
Villarroel, B., Pelckmans, K., Solano, E., Laaksoharju, M., Souza, A., Dom, O. N., . . . Ward, M. J. (2022). Launching the VASCO Citizen Science Project. Universe, 8(11), Article ID 561.
2022 (English). In: Universe, E-ISSN 2218-1997, Vol. 8, no. 11, article id 561. Article in journal (Refereed), Published
Abstract [en]

The Vanishing & Appearing Sources during a Century of Observations (VASCO) project investigates astronomical surveys spanning a time interval of 70 years, searching for unusual and exotic transients. We present herein the VASCO Citizen Science Project, which can identify unusual candidates driven by three different approaches: hypothesis, exploratory, and machine learning, which is particularly useful for SETI searches. To address the big data challenge, VASCO combines three methods: the Virtual Observatory, user-aided machine learning, and visual inspection through citizen science. Here we demonstrate the citizen science project and its improved candidate selection process, and we give a progress report. We also present the VASCO citizen science network led by amateur astronomy associations mainly located in Algeria, Cameroon, and Nigeria. At the moment of writing, the citizen science project has carefully examined 15,593 candidate image pairs in the data (ca. 10% of the candidates), and has so far identified 798 objects classified as "vanished". The most interesting candidates will be followed up with optical and infrared imaging, together with the observations by the most potent radio telescopes.

Place, publisher, year, edition, pages
MDPI, 2022
Keywords
citizen science, SETI, surveys, transients
National Category
Astronomy, Astrophysics and Cosmology
Identifiers
urn:nbn:se:umu:diva-201224 (URN); 10.3390/universe8110561 (DOI); 000881392200001 (); 2-s2.0-85141787535 (Scopus ID)
Funder
Swedish Research Council, 2017-06372; NordForsk
Available from: 2022-12-12; Created: 2022-12-12; Last updated: 2022-12-12; Bibliographically approved
Souza, A., Pelckmans, K. & Tordsson, J. (2021). A HPC Co-scheduler with Reinforcement Learning. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Paper presented at JSSPP 2021, 24th International Workshop on Job Scheduling Strategies for Parallel Processing, Virtual, May 21, 2021 (pp. 126-148). Springer
2021 (English). In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer, 2021, p. 126-148. Conference paper, Published paper (Refereed)
Abstract [en]

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on an adaptive reinforcement learning algorithm, where application profiling is combined with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., operating system level). As opposed to nominal allocations, we apply decision trees to model applications’ actual resource usage, which are used to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions and adapts to changing environment conditions, and evaluates when such changes cause resource contention that impacts quality of service metrics such as job slowdowns. We integrate our algorithm in an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation, performed in a dedicated cluster executing a mix of four different real scientific workflows, demonstrates improvements in cluster utilization of up to 51% even in high load scenarios, with 55% average queue makespan reductions under low loads.
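
The co-allocation idea in this abstract (lending the gap between a nominal allocation and the modelled actual usage to another application) can be illustrated with a small sketch. This is not the authors' algorithm: the function name `co_allocatable`, the `safety_margin` parameter, and the resource names are assumptions made for illustration, and the predicted usage is simply passed in rather than produced by a decision tree.

```python
def co_allocatable(allocated, predicted_usage, safety_margin=0.1):
    """Estimate per-resource capacity that could be lent to a co-scheduled job.

    allocated       -- nominal allocation, e.g. {"cpus": 16, "mem_gb": 64}
    predicted_usage -- modelled actual usage for the same resources
                       (hypothetical input; any usage model could supply it)
    safety_margin   -- assumed fraction of the allocation kept in reserve
    """
    spare = {}
    for resource, capacity in allocated.items():
        # If no prediction exists, assume the resource is fully used.
        used = predicted_usage.get(resource, capacity)
        reserve = safety_margin * capacity
        # Never report negative headroom.
        spare[resource] = max(0.0, capacity - used - reserve)
    return spare


if __name__ == "__main__":
    print(co_allocatable({"cpus": 16, "mem_gb": 64},
                         {"cpus": 6, "mem_gb": 60}))
```

Keeping a reserve is one plausible way to limit the contention the abstract mentions; the actual system instead learns from incorrect co-scheduling decisions.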

Place, publisher, year, edition, pages
Springer, 2021
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 12985
Keywords
Adaptive reinforcement learning, Co-scheduling, Datacenters, High performance computing
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-188961 (URN); 10.1007/978-3-030-88224-2_7 (DOI); 000869960400007 (); 2-s2.0-85117482466 (Scopus ID); 978-3-030-88223-5 (ISBN); 978-3-030-88224-2 (ISBN)
Conference
JSSPP 2021, 24th International Workshop on Job Scheduling Strategies for Parallel Processing, Virtual, May 21, 2021
Note

Also part of the Theoretical Computer Science and General Issues book sub series (LNTCS, volume 12985)

Available from: 2021-10-28; Created: 2021-10-28; Last updated: 2023-09-05; Bibliographically approved
Gutierrez, F., Beedkar, K., Souza, A. & Markl, V. (2021). AdCom: Adaptive combiner for streaming aggregations. In: Velegrakis Y.; Zeinalipour D.; Chrysanthis P.K.; Guerra F. (Ed.), Advances in Database Technology - EDBT. Paper presented at Advances in Database Technology - 24th International Conference on Extending Database Technology, EDBT 2021, Virtual, March 23-26, 2021 (pp. 403-414). OpenProceedings, 2021-March
2021 (English). In: Advances in Database Technology - EDBT / [ed] Velegrakis Y.; Zeinalipour D.; Chrysanthis P.K.; Guerra F., OpenProceedings, 2021, Vol. 2021-March, p. 403-414. Conference paper, Published paper (Refereed)
Abstract [en]

Continuous applications such as device monitoring and anomaly detection often require real-time aggregated statistics over unbounded data streams. While existing stream processing systems such as Flink, Spark, and Storm support processing of streaming aggregations, their optimizations are limited with respect to the dynamic nature of the data, and therefore are suboptimal when the workload changes and/or when there is data skew. In this paper we present AdCom, which is an adaptive combiner for stream processing engines. The use of AdCom in aggregation queries enables pre-aggregating tuples upstream (i.e., before data shuffling) followed by global aggregation downstream. In contrast to existing approaches, AdCom can automatically adjust the number of tuples to pre-aggregate depending on the data rate and available network. Our experimental study using real-world streaming workloads shows that using AdCom leads to 2.5-9× higher sustainable throughput without compromising latency.
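
The two-phase pattern described in this abstract (pre-aggregate upstream, then aggregate globally downstream, with a batch size that adapts at runtime) can be sketched as follows. This is a minimal illustration and not AdCom's implementation: the class `AdaptiveCombiner`, the doubling/halving policy in `adapt`, and the `input_rate` and `network_capacity` parameters are all assumptions chosen for the example.

```python
from collections import defaultdict


class AdaptiveCombiner:
    """Pre-aggregates (key, value) tuples and emits partial sums in batches."""

    def __init__(self, min_batch=1, max_batch=1024):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.batch_size = min_batch      # tuples combined before a flush
        self.partials = defaultdict(int)
        self.buffered = 0

    def adapt(self, input_rate, network_capacity):
        """Hypothetical policy: combine more when input outpaces the network."""
        if input_rate > network_capacity:
            self.batch_size = min(self.batch_size * 2, self.max_batch)
        else:
            self.batch_size = max(self.batch_size // 2, self.min_batch)

    def add(self, key, value):
        """Combine one tuple; return partial sums once the batch is full."""
        self.partials[key] += value
        self.buffered += 1
        if self.buffered >= self.batch_size:
            return self.flush()
        return None

    def flush(self):
        """Emit the current partial sums and reset the buffer."""
        out = dict(self.partials)
        self.partials.clear()
        self.buffered = 0
        return out


def global_aggregate(partial_batches):
    """Downstream step: merge partial sums from all upstream combiners."""
    totals = defaultdict(int)
    for batch in partial_batches:
        for key, value in batch.items():
            totals[key] += value
    return dict(totals)
```

Because summation is associative and commutative, the global result does not depend on how tuples were batched upstream; only the number of records sent over the network changes.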

Place, publisher, year, edition, pages
OpenProceedings, 2021
Series
Advances in Database Technology, E-ISSN 2367-2005
National Category
Computer Sciences
Research subject
computer and systems sciences
Identifiers
urn:nbn:se:umu:diva-187179 (URN); 10.5441/002/edbt.2021.43 (DOI); 2-s2.0-85113710327 (Scopus ID); 9783893180844 (ISBN)
Conference
Advances in Database Technology - 24th International Conference on Extending Database Technology, EDBT 2021, Virtual, March 23-26, 2021.
Funder
EU, Horizon 2020, 765452
Available from: 2021-09-06; Created: 2021-09-06; Last updated: 2021-09-06; Bibliographically approved
Souza, A., Pelckmans, K., Ghoshal, D., Ramakrishnan, L. & Tordsson, J. (2020). ASA - The Adaptive Scheduling Architecture. In: HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing. Paper presented at HPDC '20: The 29th International Symposium on High-Performance Parallel and Distributed Computing, Stockholm, Sweden, June 23-26, 2020 (pp. 161-165). ACM Digital Library
2020 (English). In: HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, ACM Digital Library, 2020, p. 161-165. Conference paper, Published paper (Refereed)
Abstract [en]

In High Performance Computing (HPC), resources are controlled by batch systems and may not be available due to long queue waiting times, negatively impacting application deadlines. This is noticeable in low latency scientific workflows where resource planning and timely allocation are key for efficient processing. On the one hand, peak allocations guarantee the fastest possible workflow execution time, at the cost of extended queue waiting times and costly resource usage. On the other hand, dynamic allocations following specific workflow stage requirements optimize resource usage, though they increase the total workflow makespan. To enable new scheduling strategies and features in workflows, we propose ASA: the Adaptive Scheduling Architecture, a novel scheduling method to reduce perceived queue waiting times as well as to optimize workflow resource usage. Reinforcement learning is used to estimate queue waiting times, and based on these estimates ASA proactively submits resource change requests, minimizing total workflow inter-stage waiting times, idle resources, and makespan. Experiments with three scientific workflows at two HPC centers show that ASA combines the best of the two aforementioned approaches, with average queue waiting time and makespan reductions of up to 10% and 2% respectively, with up to 100% prediction accuracy, while obtaining near optimal resource utilization.

Place, publisher, year, edition, pages
ACM Digital Library, 2020
Keywords
Scheduling, HPC, reinforcement learning
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-169631 (URN); 10.1145/3369583.3392693 (DOI); 2-s2.0-85088391721 (Scopus ID); 978-1-4503-7052-3 (ISBN)
Conference
HPDC '20: The 29th International Symposium on High-Performance Parallel and Distributed Computing, Stockholm, Sweden, June 23-26, 2020
Note

Originally included in thesis in manuscript form. 

Available from: 2020-04-13; Created: 2020-04-13; Last updated: 2023-03-24; Bibliographically approved
Souza, A. (2020). Autonomous resource management for high performance datacenters. (Doctoral dissertation). Umeå: Umeå University
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Autonom resurshantering för högpresterande datacenter
Abstract [en]

Over the last decade, new applications such as data intensive workflows have hit an inflection point in widespread use and influenced the compute paradigm of most scientific and industrial endeavours. Data intensive workflows are highly dynamic and adaptable to resource changes and system faults, and they also allow approximate solutions in their models. On the one hand, these dynamic characteristics require processing power and capabilities that originated in cloud computing environments, and are not well supported by large High Performance Computing (HPC) infrastructures. On the other hand, cloud computing datacenters favor low latency over throughput, deeply contrasting with HPC, which enforces a centralized environment and prioritizes total computation accomplished over time, ignoring latency entirely. Although data handling needs are predicted to increase by as much as a thousand times over the next decade, the processing power of future datacenters will not increase as much.

To tackle these long-term developments, this thesis proposes autonomic methods combined with novel scheduling strategies to optimize datacenter utilization while guaranteeing user defined constraints and seamlessly supporting a wide range of applications under various real operational scenarios. Leveraging data intensive characteristics, a library is developed to dynamically adjust the amount of resources used throughout the lifespan of a workflow, enabling elasticity for such applications in HPC datacenters. For mission critical environments where services must run even in the event of system failures, we define an adaptive controller to dynamically select the best method to perform runtime state synchronizations. We develop different hybrid extensible architectures and reinforcement learning scheduling algorithms that smoothly enable dynamic applications in HPC environments. An overall theme in this thesis is extensive experimentation in real datacenter environments. Our results show improvements in datacenter utilization and performance, achieving higher overall efficiency. Our methods also simplify operations and allow the onboarding of novel types of applications previously not supported.

Abstract [sv]

Dataintensiva workflows är en ny klass av applikationer som blivit alltmer vanliga under senaste årtiondet och har stor påverkan på hur beräkningar utförs inom flertalet forskningsområden och i industrin. Dessa dataintensiva workflows kan dynamiskt anpassa sig till ändringar i resursallokering, systemfel och kan ibland även approximera lösningar vid resursbrist. De kräver hög beräkningskraft och därtill funktionalitet som endast återfinns i datormoln och de passar därmed dåligt i dagens högpresterande datorsystem (HPC-system). Datacenter i molnet prioriterar att snabbt starta nyinkomna applikationer, vilket drastiskt skiljer sig från HPC-miljöer där hög genomströmning över tid är det främsta målet. Trots att behovet av datahantering uppskattas öka mer än tusenfallt under kommande årtionde kommer framtidens datacenter inte att ha motsvarande utveckling av beräkningskapacitet.

Denna avhandling möter dessa utmaningar genom en kombination av autonoma system och nya strategier för schedulering för att optimera utnyttjandegraden i datacenter. Detta sker utan att göra avkall på användares prestandakrav och därtill med målet att stödja ett brett spektrum av applikationer och scenarios. Ett bibliotek utvecklas för att dynamiskt anpassa resursallokering för workflows under körning, vilket innebär att även HPC-system kan stödja elastiska applikationer som tidigare bara kunde exekveras i datormoln. För miljöer med höga krav på tillgänglighet definieras en regulator för att dynamiskt anpassa hur applikationer synkroniserar tillstånd, för mer resurseffektiv aktiv replikering. Avhandlingen utvecklar även flera resurshanteringssystem baserat på schedulering med förstärkningsinlärning i syftet att förbättra stödet för dynamiska applikationer i HPC-system. Ett övergripande tema i avhandlingen är omfattande utvärderingar av de framtagna metoderna och systemen genom storskaliga experiment i verkliga datacenter. Resultaten visar förbättringar överlag av resursutnyttjande och prestanda i datacenter. De utvecklade systemen förenklar även drift och möjliggör nya typer av applikationer som tidigare ej kunnat exekveras i HPC-miljöer.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2020. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 20.03
Keywords
Datacenters, high performance computing, scheduling, hybrid
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-169633 (URN); 978-91-7855-286-3 (ISBN); 978-91-7855-287-0 (ISBN)
Public defence
2020-05-08, MIT Place Seminarierummet, MIT-byggnaden (Plan 2), Umeå, 10:00 (English)
Note

New place for the public defence (wrong place in the posting sheet). 

Available from: 2020-04-17; Created: 2020-04-13; Last updated: 2020-05-25; Bibliographically approved
Villarroel, B., Pelckmans, K., Solano, E., Laaksoharju, M., Souza, A., Dom, O. N., . . . Ward, M. J. (2020). The VASCO project: 100 red transients and their follow up. In: Proceedings of the International Astronautical Congress, IAC: 71st International Astronautical Congress, IAC 2020. Paper presented at 71st International Astronautical Congress, IAC 2020; Virtual, October 12-14, 2020. International Astronautical Federation, IAF, Article ID 166680.
2020 (English). In: Proceedings of the International Astronautical Congress, IAC: 71st International Astronautical Congress, IAC 2020, International Astronautical Federation, IAF, 2020, article id 166680. Conference paper, Published paper (Refereed)
Abstract [en]

The Vanishing & Appearing Sources during a Century of Observations (VASCO) project investigates astronomical surveys spanning a 70-year time interval, searching for unusual and exotic transients. We present herein the VASCO Citizen Science Project, which uses three different approaches to the identification of unusual transients in a given set of candidates: hypothesis-driven, exploratory-driven and machine learning-driven (which is of particular benefit for SETI searches). To address the big data challenge, VASCO combines methods from the Virtual Observatory, user-aided machine learning and visual inspection through citizen science. In this article, we demonstrate the citizen science project and the new and improved candidate selection process, and give a progress report. We also present the VASCO citizen science network led by amateur astronomy associations mainly located in Algeria, Cameroon and Nigeria. At the moment of writing, the citizen science project has carefully examined 12,000 candidate image pairs in the data, and has so far identified 713 objects classified as “vanished”. The most interesting candidates will be followed up with optical and infrared imaging, together with the observations by the most potent radio telescopes.

Place, publisher, year, edition, pages
International Astronautical Federation, IAF, 2020
Series
Proceedings of the International Astronautical Congress, IAC, ISSN 0074-1795
Keywords
Catalogs, Extraterrestrial intelligence, Miscellaneous, Surveys, Transient
National Category
Astronomy, Astrophysics and Cosmology
Identifiers
urn:nbn:se:umu:diva-181050 (URN); 2-s2.0-85100949904 (Scopus ID)
Conference
71st International Astronautical Congress, IAC 2020; Virtual, October 12-14, 2020
Available from: 2021-03-05; Created: 2021-03-05; Last updated: 2021-04-16; Bibliographically approved
Souza, A., Rezaei, M., Laure, E. & Tordsson, J. (2019). Hybrid Resource Management for HPC and Data Intensive Workloads. In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID). Paper presented at 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019, May 14-17, 2019, Larnaca, Cyprus (pp. 399-409). Los Alamitos: IEEE Computer Society
2019 (English). In: 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Los Alamitos: IEEE Computer Society, 2019, p. 399-409. Conference paper, Published paper (Refereed)
Abstract [en]

Traditionally, High Performance Computing (HPC) and Data Intensive (DI) workloads have been executed on separate hardware using different tools for resource and application management. With increasing convergence of these paradigms, where modern applications are composed of both types of jobs in complex workflows, this separation becomes a growing overhead and the need for a common computation platform for both application areas increases. Executing both application classes on the same hardware not only enables hybrid workflows, but can also increase the usage efficiency of the system, as often not all available hardware is fully utilized by an application. While HPC systems are typically managed in a coarse grained fashion, allocating a fixed set of resources exclusively to an application, DI systems employ a finer grained regime, enabling dynamic resource allocation and control based on application needs. On the path to full convergence, a useful and less intrusive step is a hybrid resource management system that allows the execution of DI applications on top of standard HPC scheduling systems. In this paper we present the architecture of a hybrid system enabling dual-level scheduling for DI jobs in HPC infrastructures. Our system takes advantage of real-time resource utilization monitoring to efficiently co-schedule HPC and DI applications. The architecture is easily adaptable and extensible to current and new types of distributed workloads, allowing efficient combination of hybrid workloads on HPC resources with increased job throughput and higher overall resource utilization. The architecture is implemented based on the Slurm and Mesos resource managers for HPC and DI jobs. Our experimental evaluation in a real cluster, based on a set of representative HPC and DI applications, demonstrates that our hybrid architecture improves resource utilization by 20%, with a 12% decrease in queue makespan, while still meeting all deadlines for HPC jobs.

Place, publisher, year, edition, pages
Los Alamitos: IEEE Computer Society, 2019
Keywords
Resource Management, High Performance Computing, Data Intensive Computing, Mesos, Slurm, Bootstrapping
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-155619 (URN); 10.1109/CCGRID.2019.00054 (DOI); 000483058700045 (); 2-s2.0-85069469164 (Scopus ID); 978-1-7281-0913-8 (ISBN); 978-1-7281-0912-1 (ISBN)
Conference
19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019, May 14-17, 2019, Larnaca, Cyprus
Note

Originally included in thesis in manuscript form

Available from: 2019-01-24; Created: 2019-01-24; Last updated: 2020-06-26; Bibliographically approved
Souza, A. P. (2018). Application-aware resource management for datacenters. (Licentiate dissertation). Umeå: Department of computing science, Umeå university
2018 (English). Licentiate thesis, comprehensive summary (Other academic)
Alternative title [sv]
Applikationsmedveten resurshantering för datacenter
Abstract [en]

High Performance Computing (HPC) and Cloud Computing datacenters are extensively used to steer and solve complex problems in science, engineering, and business, such as calculating correlations and making predictions. Already in a single datacenter server, there are thousands of hardware and software metrics – Key Performance Indicators (KPIs) – that individually and aggregated can give insight into the performance, robustness, and efficiency of the datacenter and the provisioned applications. At the datacenter level, the number of KPIs is even higher. The fast growing interest in datacenter management from both public and industry, together with the rapid expansion in scale and complexity of datacenter resources and the services being provided on them, has made monitoring, profiling, controlling, and provisioning compute resources dynamically at runtime a challenging and complex task. Commonly, correlations of application KPIs, like response time and throughput, with resource capacities show that runtime systems (e.g., containers or virtual machines) that are used to provision these applications do not utilize available resources efficiently. This reduces datacenter efficiency, which in turn results in higher operational costs and longer waiting times for results.

The goal of this thesis is to develop tools and autonomic techniques for improving datacenter operations, management and utilization, while improving and/or minimizing impacts on application performance. To this end, we make use of application resource descriptors to create a library that dynamically adjusts the amount of resources used, enabling elasticity for scientific workflows in HPC datacenters. For mission critical applications, high availability is of great concern since these services must be kept running even in the event of system failures. By modeling and correlating specific resource counters, like CPU, memory and network utilization, with the number of runtime synchronizations, we present adaptive mechanisms to dynamically select which fault tolerance mechanism to use. Likewise, for scientific applications we propose a hybrid extensible architecture for dual-level scheduling of data intensive jobs in HPC infrastructures, allowing operational simplification, on-boarding of new types of applications and achieving greater job throughput with higher overall datacenter efficiency.

Place, publisher, year, edition, pages
Umeå: Department of computing science, Umeå university, 2018. p. 28
Series
Report / UMINF, ISSN 0348-0542 ; 18.14
Keywords
Resource Management, High Performance Computing, Cloud Computing
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-155620 (URN); 978-91-7601-971-9 (ISBN)
Presentation
2018-12-12, MA121, MIT-Huset, Umeå, 20:31 (English)
Available from: 2019-01-25; Created: 2019-01-24; Last updated: 2020-09-14; Bibliographically approved
Souza, A., Papadopoulos, A. V., Tomás Bolivar, L., Gilbert, D. & Tordsson, J. (2018). Hybrid Adaptive Checkpointing for Virtual Machine Fault Tolerance. In: Li J., Chandra A., Guo T., Cai Y. (Ed.), Proceedings - 2018 IEEE International Conference on Cloud Engineering, IC2E 2018. Paper presented at 2018 IEEE International Conference on Cloud Engineering (IC2E 2018), 17–20 April 2018, Orlando, Florida, USA (pp. 12-22). Institute of Electrical and Electronics Engineers (IEEE)
2018 (English). In: Proceedings - 2018 IEEE International Conference on Cloud Engineering, IC2E 2018 / [ed] Li J., Chandra A., Guo T., Cai Y., Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 12-22. Conference paper, Published paper (Refereed)
Abstract [en]

Active Virtual Machine (VM) replication is an application independent and cost-efficient mechanism for high availability and fault tolerance, with several recently proposed implementations based on checkpointing. However, these methods may suffer from large impacts on application latency, excessive resource usage overheads, and/or unpredictable behavior for varying workloads. To address these problems, we propose a hybrid approach through a Proportional-Integral (PI) controller to dynamically switch between periodic and on-demand checkpointing. Our mechanism automatically selects the method that minimizes application downtime by adapting itself to changes in workload characteristics. The implementation is based on modifications to QEMU, LibVirt, and OpenStack, to seamlessly provide fault tolerant VM provisioning and to enable the controller to dynamically select the best checkpointing mode. Our evaluation is based on experiments with a video streaming application, an e-commerce benchmark, and a software development tool. The experiments demonstrate that our adaptive hybrid approach improves both application availability and resource usage compared to static selection of a checkpointing method, with application performance gains and negligible overheads.
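
To make the control idea concrete, a discrete-time PI controller driving a two-way mode switch might look like the sketch below. It is an illustration under stated assumptions, not the paper's controller: the abstract does not say which signal is measured or how the output maps to a mode, so the setpoint, the gains `kp` and `ki`, and the sign-based rule in `select_mode` are hypothetical.

```python
class PIController:
    """Textbook discrete PI controller: u = kp * e + ki * integral(e)."""

    def __init__(self, kp, ki, setpoint):
        self.kp = kp
        self.ki = ki
        self.setpoint = setpoint   # assumed target for the monitored metric
        self.integral = 0.0

    def update(self, measurement, dt=1.0):
        """Return the control signal for one new measurement."""
        error = measurement - self.setpoint
        self.integral += error * dt
        return self.kp * error + self.ki * self.integral


def select_mode(control_signal):
    """Assumed mapping: a positive signal selects on-demand checkpointing."""
    return "on-demand" if control_signal > 0 else "periodic"


if __name__ == "__main__":
    pi = PIController(kp=0.5, ki=0.1, setpoint=10.0)
    for sample in (14.0, 6.0, 9.0):
        print(sample, select_mode(pi.update(sample)))
```

The integral term is what lets such a controller react to a sustained drift in the workload rather than to a single noisy sample.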

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Fault Tolerance, Resource Management, Checkpoint, COLO, Control Theory
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-152033 (URN); 10.1109/IC2E.2018.00023 (DOI); 000759774400002 (); 2-s2.0-85048315473 (Scopus ID); 978-1-5386-5009-7 (ISBN); 978-1-5386-5008-0 (ISBN)
Conference
2018 IEEE International Conference on Cloud Engineering (IC2E 2018), 17–20 April 2018, Orlando, Florida, USA
Available from: 2018-09-24; Created: 2018-09-24; Last updated: 2023-09-05; Bibliographically approved
Fox, W., Ghoshal, D., Souza, A., P. Rodrigo, G. & Ramakrishnan, L. (2017). E-HPC: A Library for Elastic Resource Management in HPC Environments. In: 12th Workshop on Workflows in Support of Large-Scale Science (WORKS). Paper presented at The International Conference for High Performance Computing, Networking, Storage and Analysis. New York, NY, USA: Association for Computing Machinery (ACM), Article ID 1.
2017 (English). In: 12th Workshop on Workflows in Support of Large-Scale Science (WORKS), New York, NY, USA: Association for Computing Machinery (ACM), 2017, article id 1. Conference paper, Published paper (Refereed)
Abstract [en]

Next-generation data-intensive scientific workflows need to support streaming and real-time applications with dynamic resource needs on high performance computing (HPC) platforms. The static resource allocation model on current HPC systems, which was designed for monolithic MPI applications, is insufficient to support the elastic resource needs of current and future workflows. In this paper, we discuss the design, implementation and evaluation of Elastic-HPC (E-HPC), an elastic framework for managing resources for scientific workflows on current HPC systems. E-HPC considers a resource slot for a workflow as an elastic window that might map to different physical resources over the duration of a workflow. Our framework uses checkpoint-restart as the underlying mechanism to migrate workflow execution across the dynamic window of resources. E-HPC provides the foundation necessary to enable dynamic resource allocation of HPC resources that are needed for streaming and real-time workflows. E-HPC has negligible overhead beyond the cost of checkpointing. Additionally, E-HPC results in decreased turnaround time of workflows compared to the traditional model of resource allocation for workflows, where resources are allocated per stage of the workflow. Our evaluation shows that E-HPC improves core hour utilization for common workflow resource use patterns and provides an effective framework for elastic expansion of resources for applications with dynamic resource needs.

Place, publisher, year, edition, pages
New York, NY, USA: Association for Computing Machinery (ACM), 2017
Keywords
high performance computing, scientific workflows, resource management
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-142624 (URN); 10.1145/3150994.3150996 (DOI); 2-s2.0-85054761132 (Scopus ID); 978-1-4503-5129-4 (ISBN)
Conference
The International Conference for High Performance Computing, Networking, Storage and Analysis
Available from: 2017-12-06; Created: 2017-12-06; Last updated: 2023-03-24; Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-6952-1195