Cluster Scheduling and Management for Large-Scale Compute Clouds
Umeå University, Faculty of Science and Technology, Department of Computing Science (Distributed Systems Group). ORCID iD: 0000-0001-6894-1438
2015 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud computing has become a powerful enabler for many IT services and new technologies. It provides access to an unprecedented amount of resources in a fine-grained and on-demand manner. To deliver such a service, cloud providers should be able to efficiently and reliably manage their available resources. This becomes a challenge for the management system, as it should handle a large number of heterogeneous resources under diverse and fluctuating workloads. In addition, it should also satisfy multiple operational requirements and management objectives in large-scale data centers.

Autonomic computing techniques can be used to tackle cloud resource management problems. An autonomic system comprises a number of autonomic elements, which are capable of automatically organizing and managing themselves rather than being managed by external controllers. They are therefore well suited for decentralized control, as they do not rely on a centrally managed state. A decentralized autonomic system benefits from parallelization of control, faster decisions, and better scalability. It is also more reliable, as the failure of one element does not affect the operation of the others, and there is a lower risk of all elements exhibiting faulty behavior at once. All these features are essential requirements of effective cloud resource management.

This thesis investigates algorithms, models, and techniques to autonomously manage jobs, services, and virtual resources in a cloud data center. We introduce a decentralized resource management framework that automates resource allocation optimization and service consolidation, reliably schedules jobs considering probabilistic failures, and dynamically scales and repacks services to achieve cost efficiency.

As part of the framework, we introduce a decentralized scheduler that provides and maintains durable allocations with low maintenance costs for data centers with dynamic workloads. The scheduler assigns resources in response to virtual machine requests and maintains packing efficiency while taking into account migration costs, topological constraints, and the risk of resource contention, as well as fluctuations of the background load.

We also introduce a scheduling algorithm that considers probabilistic failures as part of the scheduling plan. The aim of the algorithm is to achieve an overall job reliability in the presence of correlated failures in a data center. To do so, we study the impact of stochastic and correlated failures on job reliability in a virtual data center. We specifically focus on correlated failures caused by power outages or failures of network components, and their effect on jobs running large numbers of replicas of identical tasks.

Additionally, we investigate the trade-offs between vertical and horizontal scaling. The result of this investigation is used to introduce a repacking technique that automatically manages the capacity required by an elastic service. The repacking technique combines the benefits of both scaling strategies to improve cost efficiency.
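The abstract's notion of self-managing autonomic elements can be pictured as a small per-element control loop. The following Python sketch is purely illustrative: the class, metric, and policy below are assumptions for exposition, not the thesis' actual implementation.

    import random
    import time

    class AutonomicElement:
        """Illustrative self-managing element: monitors local state, plans, and acts
        without a central controller (a generic monitor/analyze/plan/execute loop,
        assumed here for illustration)."""

        def __init__(self, name, neighbors):
            self.name = name
            self.neighbors = neighbors      # peer elements it may coordinate with
            self.cpu_load = 0.0             # locally monitored metric

        def monitor(self):
            # A real element would read host metrics; here we simulate them.
            self.cpu_load = random.uniform(0.0, 1.0)

        def analyze_and_plan(self):
            # Local policy: decide whether to shed load to a neighbor.
            if self.cpu_load > 0.8 and self.neighbors:
                return ("migrate_some_load", random.choice(self.neighbors))
            return ("no_action", None)

        def execute(self, action):
            kind, target = action
            if kind == "migrate_some_load":
                print(f"{self.name}: overloaded ({self.cpu_load:.2f}), "
                      f"coordinating with {target}")

        def run_once(self):
            self.monitor()
            self.execute(self.analyze_and_plan())

    if __name__ == "__main__":
        elements = [AutonomicElement(f"agent-{i}",
                                     [f"agent-{j}" for j in range(3) if j != i])
                    for i in range(3)]
        for _ in range(2):
            for e in elements:
                e.run_once()
            time.sleep(0.1)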

Abstract [sv]

Cloud computing has become a powerful enabler for many new IT services. It provides access to very large-scale computing resources in a fine-grained and on-demand manner. Providing such resources requires that the underlying data centers can manage their resources reliably and efficiently. How to design their resource management systems is a major challenge, as they must handle very large amounts of heterogeneous resources, which in turn must cope with widely differing types of load, often with very large variations over time. In addition, they typically have to meet a range of requirements and objectives for how the resources are to be used. Autonomic systems can be used to advantage to realize such systems. An autonomic system contains a number of autonomic elements that can organize and manage themselves automatically, without support from external controllers. The ability to manage themselves makes them very suitable as components in distributed systems, which in turn can contribute to faster decision making, better scalability, and higher fault tolerance. This thesis focuses on algorithms, models, and techniques for the autonomous management of jobs and virtual resources in data centers. We introduce a decentralized resource management system that automates resource allocation and consolidation, schedules jobs reliably with regard to correlated failures, and scales resources dynamically to achieve cost efficiency. As part of this framework, we introduce a decentralized scheduler that allocates resources so that the decisions made remain good over a long time and yield low resource management costs for data centers with dynamic loads. The scheduler allocates virtual machines based on the current load and maintains efficient utilization of the underlying servers by taking into account migration costs, topological constraints, and the risk of over-utilization. We also introduce a resource allocation algorithm that takes correlated failures into account as part of the planning. The intention is to meet specified availability requirements for individual services despite the occurrence of correlated failures. We focus primarily on correlated failures originating from power supply problems and from failing network components, and their impact on jobs consisting of many identical sub-jobs. Finally, we also study how horizontal and vertical scaling of resources are best combined. The result is a process that increases cost efficiency by combining the two methods and, in addition, occasionally changing the distribution of virtual machine sizes.

Place, publisher, year, edition, pages
Umeå University, 2015. p. 24
Series
UMINF, ISSN 0348-0542 ; 15.19
Keywords [en]
Cloud computing, Scheduling, Resource Management
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-112467, ISBN: 978-91-7601-389-2 (print), OAI: oai:DiVA.org:umu-112467, DiVA id: diva2:878117
Public defence
2015-01-21, Hörsal A, Samhällsvetarhuset, Umeå, 10:15 (English)
Available from: 2015-12-16. Created: 2015-12-08. Last updated: 2021-03-18. Bibliographically approved.
List of papers
1. Unifying cloud management: towards overall governance of business level objectives
2011 (English). In: Cluster, Cloud and Grid Computing (CCGrid), 2011 11th IEEE/ACM International Symposium on, IEEE Computer Society, 2011, p. -597. Conference paper, Published paper (Refereed)
Abstract [en]

We address the challenge of providing unified cloud resource management towards an overall business-level objective, given the multitude of managerial tasks to be performed and the complexity of any architecture to support them. Resource-level management tasks include elasticity control, virtual machine and data placement, autonomous fault management, etc., which are intrinsically difficult problems, since services normally have unknown lifetimes and capacity demands that vary widely over time. Unifying the management of these problems for optimization with respect to some higher-level business objective, such as optimizing revenue while breaking no more than a certain percentage of service level agreements, becomes even more challenging, as the resource-level managerial challenges are far from independent. After providing the general problem formulation, we review recent approaches taken by the research community, including mainly general autonomic computing technology for large-scale environments and resource-level management tools equipped with some business-oriented or otherwise qualitative features. We propose and illustrate a policy-driven approach where a high-level management system monitors overall system and service behavior and adjusts lower-level policies (e.g., thresholds for admission control, elasticity control, server consolidation level, etc.) for optimization towards the measurable business-level objectives.
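The policy-driven idea, a high-level controller nudging low-level thresholds toward a business-level objective, can be illustrated with a small feedback loop. The metric names, the 5% target, and the update rule below are assumptions for illustration, not the paper's actual controller.

    def adjust_admission_threshold(threshold, sla_breach_rate,
                                   max_breach_rate=0.05, step=0.02):
        """Toy business-level feedback: tighten admission when too many SLAs break,
        relax it (to admit more revenue-generating load) when breaches are rare.
        All parameters and the 5% target are illustrative assumptions."""
        if sla_breach_rate > max_breach_rate:
            threshold = min(1.0, threshold + step)   # admit less aggressively
        else:
            threshold = max(0.0, threshold - step)   # admit more aggressively
        return threshold

    # Example: observed breach rates over successive control periods
    t = 0.5
    for breach_rate in [0.02, 0.08, 0.07, 0.03]:
        t = adjust_admission_threshold(t, breach_rate)
        print(f"breach={breach_rate:.2f} -> admission threshold={t:.2f}")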

Place, publisher, year, edition, pages
IEEE Computer Society, 2011
Keywords
Cloud governance, autonomic computing, policy-driven management
National Category
Computer Sciences
Research subject
business data processing
Identifiers
urn:nbn:se:umu:diva-40343 (URN), 10.1109/CCGrid.2011.65 (DOI), 2-s2.0-79961142850 (Scopus ID), 978-1-4577-0129-0 (ISBN)
Conference
The 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Newport Beach, CA, USA, May 23 - 26, 2011
Available from: 2011-02-22. Created: 2011-02-22. Last updated: 2023-03-23. Bibliographically approved.
2. Autonomic resource allocation for cloud data centers: a peer to peer approach
2014 (English). In: 2014 International Conference on Cloud and Autonomic Computing (ICCAC 2014), IEEE Computer Society, 2014, p. 131-140. Conference paper, Published paper (Refereed)
Abstract [en]

We address the problem of resource management for large-scale cloud data centers. We propose a Peer-to-Peer (P2P) resource management framework, comprised of a number of agents overlaid as a scale-free network. The structural properties of the overlay, along with dividing the management responsibilities among the agents, enable the management framework to scale in terms of both the number of physical servers and incoming Virtual Machine (VM) requests, while remaining computationally feasible. While our framework is intended for use in different cloud management functionalities, e.g. admission control or fault tolerance, we focus on the problem of resource allocation in clouds. We evaluate our approach by simulating a data center with 2500 servers, striving to allocate resources to 20000 incoming VM placement requests. The simulation results indicate that by maintaining efficient request propagation, we can achieve promising levels of performance and scalability when dealing with large numbers of servers and placement requests.
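The request-propagation idea, where an agent tries to place a VM locally and otherwise forwards the request to overlay neighbors, might be sketched as follows. This is a toy model under assumed names and a fixed TTL; the paper's actual protocol and scale-free overlay construction are more involved.

    # Toy propagation of a VM placement request through a P2P overlay of agents.
    class Agent:
        def __init__(self, name, free_cpu, neighbors=None):
            self.name = name
            self.free_cpu = free_cpu
            self.neighbors = neighbors or []

        def place(self, vm_cpu, ttl=3):
            if self.free_cpu >= vm_cpu:           # enough local capacity: accept
                self.free_cpu -= vm_cpu
                return self.name
            if ttl == 0:                          # stop propagating
                return None
            for n in self.neighbors:              # otherwise forward to neighbors
                host = n.place(vm_cpu, ttl - 1)
                if host is not None:
                    return host
            return None

    a, b, c = Agent("pm-a", 1), Agent("pm-b", 2), Agent("pm-c", 8)
    a.neighbors, b.neighbors = [b], [c]
    print(a.place(vm_cpu=4))   # request propagates a -> b -> c and lands on pm-c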

Place, publisher, year, edition, pages
IEEE Computer Society, 2014. p. 10
Keywords
cloud computing, peer to peer, resource management
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-93273 (URN), 10.1109/ICCAC.2014.16 (DOI), 000370731000018 (), 2-s2.0-84923951011 (Scopus ID), 978-1-4799-5841-2 (ISBN)
Conference
International Conference on Cloud and Autonomic Computing (ICCAC), September 8-12, 2014, Imperial College, London, England
Available from: 2014-09-15. Created: 2014-09-15. Last updated: 2023-03-24. Bibliographically approved.
3. Divide the task, multiply the outcome: cooperative VM consolidation
2014 (English). In: Proceedings of the 6th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2014), IEEE, 2014, p. 300-305. Conference paper, Published paper (Refereed)
Abstract [en]

Efficient resource utilization is one of the main concerns of cloud providers, as it has a direct impact on energy costs and thus their revenue. Virtual machine (VM) consolidation is one of the common techniques used by infrastructure providers to efficiently utilize their resources. However, when it comes to large-scale infrastructures, consolidation decisions become computationally complex, since VMs are multi-dimensional entities with changing demand and unknown lifetime, and users often overestimate their actual demand. These uncertainties urge the system to make consolidation decisions continuously, in real time.

In this work, we investigate a decentralized approach to VM consolidation using Peer-to-Peer (P2P) principles. We investigate the opportunities offered by P2P systems, as scalable and robust management structures, to address VM consolidation concerns. We present a P2P consolidation protocol that considers the dimensionality of resources and the dynamicity of the environment. The protocol benefits from concurrency and decentralization of control, and it uses a dimension-aware decision function for efficient consolidation. We evaluate the protocol through simulation of 100,000 physical machines and 200,000 VM requests. The results demonstrate the potential and advantages of using a P2P structure to make resource management decisions in large-scale data centers. They show that the P2P approach is feasible and scalable and produces a resource utilization of 75% when the consolidation aim is 90%.
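A dimension-aware decision function typically scores how well a VM's multi-dimensional demand (CPU, memory, ...) complements the residual capacity of a candidate host. The sketch below uses a simple cosine-similarity score as an assumed stand-in; the paper's actual decision function may differ.

    import math

    def dimension_aware_score(vm_demand, host_free):
        """Toy dimension-aware fit score: cosine similarity between the VM's demand
        vector and the host's free-capacity vector (1.0 = shapes match perfectly)."""
        dot = sum(d * f for d, f in zip(vm_demand, host_free))
        norm = (math.sqrt(sum(d * d for d in vm_demand))
                * math.sqrt(sum(f * f for f in host_free)))
        return dot / norm if norm else 0.0

    vm = (4, 2)                       # (CPU cores, memory GB) demanded
    hosts = {"pm-1": (8, 4), "pm-2": (2, 16)}
    best = max(hosts, key=lambda h: dimension_aware_score(vm, hosts[h]))
    print(best)                       # pm-1: its free capacity matches the VM's shape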

Place, publisher, year, edition, pages
IEEE, 2014. p. 6
Series
International Conference on Cloud Computing Technology and Science, ISSN 2330-2194
Keywords
Cloud computing, Peer to Peer, Gossip protocols, VM consolidation, Resource management
National Category
Engineering and Technology
Research subject
business data processing
Identifiers
urn:nbn:se:umu:diva-98898 (URN), 10.1109/CloudCom.2014.81 (DOI), 000392947000041 (), 2-s2.0-84937903421 (Scopus ID), 978-1-4799-4093-6 (ISBN)
Conference
The 6th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2014), 15-18 December 2014, Singapore
Available from: 2015-01-28. Created: 2015-01-28. Last updated: 2023-03-24. Bibliographically approved.
4. Decentralized cloud datacenter reconsolidation through emergent and topology-aware behavior
2016 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 56, p. 51-63. Article in journal (Refereed). Published
Abstract [en]

Consolidation of multiple applications on a single Physical Machine (PM) within a cloud data center can increase utilization, minimize energy consumption, and reduce operational costs. However, these benefits come at the cost of increasing the complexity of the scheduling problem.

In this paper, we present a topology-aware resource management framework. As part of this framework, we introduce a Reconsolidating PlaceMent scheduler (RPM) that provides and maintains durable allocations with low maintenance costs for data centers with dynamic workloads. We focus on workloads featuring both short-lived batch jobs and latency-sensitive services such as interactive web applications. The scheduler assigns resources to Virtual Machines (VMs) and maintains packing efficiency while taking into account migration costs, topological constraints, and the risk of resource contention, as well as the variability of the background load and its complementarity to the new VM.

We evaluate the model by simulating a data center with over 65000 PMs, structured as a three-level multi-rooted tree topology. We investigate trade-offs between factors that affect the durability and operational cost of maintaining a near-optimal packing. The results show that the proposed scheduler can scale to the number of PMs in the simulation and maintain efficient utilization with low migration costs.
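The reconsolidation trade-off described here, migrating a VM only when the packing gain outweighs the migration cost, can be illustrated by a small decision rule. Everything below (the gain model, the cost model, the coefficients) is an assumed simplification, not RPM itself.

    def should_migrate(vm_mem_gb, freed_host_can_sleep, energy_saving_per_hour,
                       expected_remaining_hours, migration_cost_per_gb=0.5):
        """Toy reconsolidation rule: migrate only if the expected saving from a
        denser packing (e.g. powering down the freed host) exceeds the one-off
        migration cost. All coefficients are illustrative assumptions."""
        benefit = (energy_saving_per_hour * expected_remaining_hours
                   if freed_host_can_sleep else 0.0)
        cost = migration_cost_per_gb * vm_mem_gb
        return benefit > cost

    # A long-lived 8 GB VM whose move empties a host is worth migrating;
    # a short-lived one usually is not.
    print(should_migrate(8, True, energy_saving_per_hour=0.2,
                         expected_remaining_hours=100))   # True
    print(should_migrate(8, True, energy_saving_per_hour=0.2,
                         expected_remaining_hours=10))    # False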

Place, publisher, year, edition, pages
Elsevier, 2016
Keywords
Cloud computing, Peer to Peer, VM consolidation, Resource management
National Category
Computer and Information Sciences
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-109131 (URN), 10.1016/j.future.2015.09.023 (DOI), 000368652500004 (), 2-s2.0-84945533575 (Scopus ID)
Available from: 2015-09-19. Created: 2015-09-19. Last updated: 2023-03-23. Bibliographically approved.
5. A virtual machine re-packing approach to the horizontal vs. vertical elasticity trade-off for cloud autoscaling
2013 (English). In: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, ACM Press, 2013, Article no. 6. Conference paper, Published paper (Refereed)
Abstract [en]

An automated solution to the horizontal vs. vertical elasticity problem is central to making cloud autoscalers truly autonomous. Today's cloud autoscalers typically vary the allocated capacity by increasing and decreasing the number of virtual machines (VMs) of a predefined size (horizontal elasticity), not taking into account that, as load varies, it may be advantageous to vary not only the number but also the size of VMs (vertical elasticity). We analyze the price/performance effects achieved by different strategies for selecting VM sizes to handle increasing load, and we propose a cost-benefit-based approach to determine when to (partly) replace a current set of VMs with a different set. We evaluate our repacking approach in combination with different auto-scaling strategies. Our results show cost savings ranging from 7% up to 60% in the total resource utilization cost of our sample applications and workloads.
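The cost-benefit test for repacking, replacing the current VM set only when the cheaper packing pays back the transition cost, can be sketched as below. The pricing and transition-cost model here are assumptions for illustration, not the paper's evaluation setup.

    def worth_repacking(current_hourly_cost, repacked_hourly_cost,
                        transition_cost, expected_hours_at_this_load):
        """Toy repacking decision: accept the new VM set if the accumulated saving
        over the period the load is expected to persist exceeds the one-off cost of
        launching/terminating VMs. Parameters are illustrative assumptions."""
        saving = ((current_hourly_cost - repacked_hourly_cost)
                  * expected_hours_at_this_load)
        return saving > transition_cost

    # Current packing: several small VMs; candidate: fewer large VMs that are
    # cheaper per unit of delivered capacity at the present load level.
    print(worth_repacking(current_hourly_cost=0.80, repacked_hourly_cost=0.60,
                          transition_cost=1.0, expected_hours_at_this_load=10))  # True
    print(worth_repacking(0.80, 0.60, transition_cost=1.0,
                          expected_hours_at_this_load=2))                        # False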

Place, publisher, year, edition, pages
ACM Press, 2013
Keywords
Cloud computing, Autoscaling, Autonomous computing, Vertical elasticity, Horizontal elasticity
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-79781 (URN), 10.1145/2494621.2494628 (DOI), 2-s2.0-84883721654 (Scopus ID), 978-1-4503-2172-3 (ISBN)
Conference
2013 ACM International Conference on Cloud and Autonomic Computing, CAC 2013, Miami, FL, United States, 5 August 2013 through 9 August 2013
Available from: 2013-09-02. Created: 2013-09-02. Last updated: 2023-03-23. Bibliographically approved.
6. DieHard: Reliable Scheduling to Survive Correlated Failures in Cloud Data Centers
2016 (English). In: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), IEEE, 2016. Conference paper, Published paper (Refereed)
Abstract [en]

In large-scale data centers, a single fault can lead to correlated failures of several physical machines and the tasks running on them, simultaneously. Such correlated failures can severely damage the reliability of a service or a job running on the failed hardware. This paper models the impact of stochastic and correlated failures on job reliability in a data center. We focus on correlated failures caused by power outages or failures of network components, and their effect on jobs running multiple replicas of identical tasks. We present a statistical reliability model and an approximation technique for computing a job's reliability in the presence of correlated failures. In addition, we address the problem of scheduling a job with reliability constraints. We formulate the scheduling problem as an optimization problem, with the aim of maintaining the desired reliability with the minimum number of extra tasks to resist failures. We present a scheduling algorithm that approximates the minimum number of required tasks and a placement to achieve a desired job reliability. We study the efficiency of our algorithm using an analytical approach and by simulating a cluster with different failure sources and reliabilities. The results show that the algorithm can effectively approximate the minimum number of extra tasks required to achieve the job's reliability.
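One way to picture the reliability question, how placement across failure domains affects the chance that enough replicas survive when whole domains (racks, power feeds) can fail together, is a small Monte Carlo estimate. The failure probabilities, domain layout, and independence assumptions below are illustrative, not the paper's statistical model or approximation technique.

    import random

    def estimate_job_reliability(placement, k_required, p_domain_fail=0.01,
                                 p_task_fail=0.02, trials=100_000):
        """Toy Monte Carlo: `placement` maps failure domain -> number of replicas
        placed there. A domain failure kills all its replicas at once (the
        correlated part); surviving replicas also fail independently.
        Returns the estimated probability that >= k_required replicas survive."""
        ok = 0
        for _ in range(trials):
            survivors = 0
            for domain, replicas in placement.items():
                if random.random() < p_domain_fail:
                    continue                              # whole domain down
                survivors += sum(random.random() >= p_task_fail
                                 for _ in range(replicas))
            ok += survivors >= k_required
        return ok / trials

    # Spreading 6 replicas over 3 racks vs. packing them into 1 rack,
    # when the job needs at least 4 replicas alive:
    print(estimate_job_reliability({"rack-1": 2, "rack-2": 2, "rack-3": 2}, k_required=4))
    print(estimate_job_reliability({"rack-1": 6}, k_required=4))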

Place, publisher, year, edition, pages
IEEE, 2016
Keywords
Cloud computing, Scheduling, Reliability, Fault tolerance, Correlated failures
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-116791 (URN), 10.1109/CCGrid.2016.11 (DOI), 2-s2.0-84979776469 (Scopus ID)
Conference
16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Cartagena, Colombia, May 16-19, 2016
Note

Originally published in manuscript form.

Available from: 2016-02-11. Created: 2016-02-11. Last updated: 2023-03-23. Bibliographically approved.

Open Access in DiVA

fulltext (2104 kB), 3048 downloads
File information
File name: FULLTEXT01.pdf. File size: 2104 kB. Checksum: SHA-512
8504a1ca6a3ce3bad40e54690f3a4fcd92ec75dd2edcae6edd184167f1b1ba48e66b66cbf8e989bf408cc7ee6d4d0114e227da0ef3a0aa6c0aa104336e55e8e0
Type: fulltext. Mimetype: application/pdf
spikblad (139 kB), 83 downloads
File information
File name: SPIKBLAD01.pdf. File size: 139 kB. Checksum: SHA-512
fd2efbddb43ea18ab86b40ca24aa75d260f8c7b7c1db7c4b249dc1602fc6f8e8c564ee8605ee1df5444fe97ef3f1c58f1a56a51926b209915e86103cd0186439
Type: spikblad. Mimetype: application/pdf

Authority records

Sedaghat, Mina
