Performance anomaly detection and resolution for autonomous clouds
Ibidunmoye, Olumuyiwa
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Distributed Systems) ORCID iD: 0000-0002-3308-834X
2017 (English) Doctoral thesis with articles (Other academic)
Abstract [en]

Fundamental properties of cloud computing such as resource sharing and on-demand self-servicing are driving a growing adoption of the cloud for hosting both legacy and new application services. A consequence of this growth is that the increasing scale and complexity of the underlying cloud infrastructure, as well as fluctuating service workloads, are inducing performance incidents at a higher frequency than ever before, with far-reaching impact on revenue, reliability, and reputation. Effectively managing performance incidents, with emphasis on timely detection, diagnosis and resolution, has thus become a necessity rather than a luxury. While other aspects of cloud management such as monitoring and resource management are experiencing greater automation, automated management of performance incidents remains a major concern.

Given the volume of operational data produced by cloud datacenters and services, this thesis focuses on how data analytics techniques can be used for cloud performance management. In particular, this work investigates techniques and models for automated performance anomaly detection and prevention in cloud environments. To establish the state of the research area, we present the outcome of an extensive survey of existing research contributions addressing various aspects of performance problem management in diverse systems domains. We discuss the design and evaluation of analytics models and algorithms for detecting performance anomalies in the real-time behaviour of cloud datacenter resources and hosted services at different resolutions. We also discuss the design of a semi-supervised machine learning approach for mitigating performance degradation by actively driving quality of service from undesirable states to a desired target state via incremental capacity optimization. The research methods used in this thesis include experiments on real virtualized testbeds to evaluate some aspects of the proposed techniques, while other aspects are evaluated using performance traces from real-world datacenters.

Insights and outcomes from this thesis can be used by both cloud and service operators to enhance the automation of performance problem detection, diagnosis and resolution. They also have the potential to spur further research in the area while being applicable in related domains such as Internet of Things (IoT), industrial sensors as well as in edge and mobile clouds.

Abstract [sv] (English translation)

Fundamental properties of cloud computing such as resource sharing and self-service are driving a growing use of the cloud for Internet services. A consequence of this growth is that the increasing size and complexity of the underlying cloud infrastructure, together with fluctuating workloads, cause performance incidents at a higher frequency than ever before. This in turn has a far-reaching impact on revenue, reliability and reputation for those who own the services. It has therefore become important to handle performance incidents quickly and effectively with respect to detection, diagnosis and correction. Although other aspects of cloud resource management, such as monitoring and resource allocation, have recently been automated to an increasing degree, automated handling of performance incidents remains a major problem.

This thesis focuses on how performance management in cloud datacenters can be improved by applying data analytics techniques to the large volumes of data produced by the systems that monitor the performance of computing resources and services. In particular, it investigates techniques and models for automatic detection and prevention of performance anomalies in clouds. To map developments in the research area, we present the results of an extensive survey of existing research contributions that address various aspects of performance problem management in relevant application domains. We discuss the design and evaluation of analytics models and algorithms for detecting performance anomalies in real time in resources and services. We also discuss the design of a machine-learning-based approach for mitigating performance degradation by actively driving service quality from undesirable states to a desired target state through incremental capacity optimization. The research methods used in this thesis include experiments on real virtualized testbeds to evaluate aspects of the proposed techniques, while other aspects are evaluated using workload patterns from real datacenters.

Insights and results from this thesis can be used by both cloud and service operators to better automate the detection of performance problems, including their diagnosis and correction. The results also have the potential to encourage further research in the area while being applicable in related domains such as the Internet of Things, industrial sensors, and large-scale distributed clouds or telecom networks.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2017. p. 60
Series
Report / UMINF, ISSN 0348-0542 ; 17.18
Keywords [en]
Cloud Computing, Distributed Systems, Performance Management, Anomaly Detection, Quality of Service, Performance Analytics, Machine Learning
Research subject
computer engineering; administrative data processing; computer science
Identifiers
URN: urn:nbn:se:umu:diva-142033
ISBN: 978-91-7601-800-2 (print)
OAI: oai:DiVA.org:umu-142033
DiVA, id: diva2:1157924
Public defence
2017-12-14, MA121, MIT-huset, Umeå University, Umeå, 13:15 (English)
Projects
Cloud Control; eSSENCE
Funder
Swedish Research Council, C0590801
Available from: 2017-11-21 Created: 2017-11-17 Last updated: 2021-03-18 Bibliographically approved
List of papers
1. Performance Anomaly Detection and Bottleneck Identification
2015 (English) In: ACM Computing Surveys, ISSN 0360-0300, E-ISSN 1557-7341, Vol. 48, no. 1, article id 4. Article in journal (Refereed). Published
Abstract [en]

In order to meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify potential root causes and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. In order to assess progress and research trends and to identify open challenges, we have reviewed major contributions in the area and present our findings in this survey. Our approach provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying fundamental elements of the problem, we are able to categorize existing solutions based on multiple factors such as the detection goals, nature of applications and systems, system observability, and detection methods.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015
Keywords
Systems performance, performance anomaly detection, bottleneck detection, performance problem identification
Research subject
computer engineering
Identifiers
urn:nbn:se:umu:diva-105991 (URN)
10.1145/2791120 (DOI)
000363733200004 ()
2-s2.0-84938363675 (Scopus ID)
Funder
Swedish Research Council, C0590801
Available from: 2015-07-03 Created: 2015-07-03 Last updated: 2021-03-18 Bibliographically approved
2. Apex lake: a framework for enabling smart orchestration
2015 (English) In: Proceedings of the Industry Track of the 16th ACM/IFIP/USENIX Middleware Conference, New York, USA: Association for Computing Machinery (ACM), 2015, p. 1-7, article id 1. Conference paper, Published paper (Refereed)
Abstract [en]

The introduction of software-defined infrastructures brings additional challenges to the management of cloud infrastructure. With the impending convergence of telecommunications and cloud infrastructures, datacenters become an essential part of an overall integrated environment. The potential scale of such environments has significant implications, as traditional orchestration approaches cannot scale appropriately. However, the combination of infrastructure topology, fine-grained operational data and advanced analytics has the potential to deliver a scalable approach to facilitate orchestration and resource management. In this paper we introduce Apex Lake, a framework designed to address the question of "how to efficiently define and maintain a physical and logical resource and service landscape enriched by operational data, to support orchestration for optimized service delivery?" We also present a use case illustrating how functionalities provided by Apex Lake can be used to deal with performance anomalies.

Place, publisher, year, edition, pages
New York, USA: Association for Computing Machinery (ACM), 2015
Keywords
Cloud monitoring and orchestration, Resource Management, Datacenter Management, Software-defined Infrastructure
Research subject
computer engineering
Identifiers
urn:nbn:se:umu:diva-114696 (URN)
10.1145/2830013.2830016 (DOI)
2-s2.0-84981340935 (Scopus ID)
978-1-4503-3727-4 (ISBN)
Conference
16th ACM/IFIP/USENIX Middleware Conference, Middleware Industry 2015, Vancouver, Canada, 7 December 2015 through 11 December 2015
Available from: 2016-01-26 Created: 2016-01-26 Last updated: 2021-03-18 Bibliographically approved
3. Performance Anomaly Detection using Datacenter Landscape Graphs
2017 (English) Conference paper, Oral presentation only (Other academic)
Abstract [en]

The migration of mission-critical workloads to the cloud and the automation of various aspects of datacenter management are contributing to the evolution of software-defined infrastructures. One implication of this evolution is that the composition (both physical and virtual) and logical topology of datacenters are becoming even more dynamic. Identification of performance problems (e.g., bottlenecks) in such environments needs to be done with awareness of this dynamic topology to understand the impact of dependencies among components. A technique is introduced that a) employs expert knowledge to identify bottleneck components using associated performance metrics, and b) utilizes dynamic dependencies to rank problem components in order to facilitate diagnosis efforts. The technique is demonstrated experimentally on an OpenStack testbed with realistic fault injection. Results of experimental case studies show that the technique is able to correctly detect and rank problem nodes.
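
Illustrative sketch (not taken from the paper): one simple way to use dependencies for ranking is to order the flagged components by how many other components transitively depend on them. The abstract does not describe the actual ranking rule, so the scoring below, the function name `rank_bottlenecks`, and the example topology are all assumptions made for illustration.

```python
from collections import deque


def rank_bottlenecks(depends_on, flagged):
    """Rank flagged components by how many components transitively depend on them.

    depends_on maps each component to the components it depends on directly.
    This impact count is a hypothetical ranking rule, not the paper's method.
    """
    # Invert the edges: for each component, who depends on it directly?
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    def impact(start):
        # Breadth-first walk over everything that is upstream of `start`.
        seen, queue = set(), deque([start])
        while queue:
            for parent in dependents.get(queue.popleft(), ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        seen.discard(start)  # ignore self-reachability through cycles
        return len(seen)

    # Highest impact first; ties are broken alphabetically for determinism.
    return sorted(flagged, key=lambda n: (-impact(n), n))


# Hypothetical topology: web -> app -> {db, cache}, db -> disk.
topology = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["disk"],
    "cache": [],
    "disk": [],
}
ranking = rank_bottlenecks(topology, {"cache", "disk", "web"})
```

In this toy topology the disk is ranked first because three components sit above it, which matches the intuition that a problem low in the dependency chain deserves attention before its symptoms higher up.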

Research subject
administrative data processing
Identifiers
urn:nbn:se:umu:diva-142023 (URN)
Conference
2nd IEEE International Conference on Big Data, Cloud Computing, and Data Science (BCD 2017), Hamamatsu, Japan, July 9-13, 2017
Projects
Cloud Control (C0590801)
Funder
Swedish Research Council, C0590801
Available from: 2017-11-17 Created: 2017-11-17 Last updated: 2021-03-18 Bibliographically approved
4. Adaptive Anomaly Detection in Performance Metric Streams
2018 (English) In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 15, no. 1, p. 217-231. Article in journal (Refereed). Published
Abstract [en]

Continuous detection of performance anomalies such as service degradations has become critical in cloud and Internet services due to their impact on quality of service and end-user experience. However, the volume and fast-changing behaviour of metric streams have rendered it a challenging task. Many diagnosis frameworks rely on thresholding with stationarity or normality assumptions, or on complex models requiring extensive offline training. Such techniques are known to be prone to spurious false alarms in online settings as metric streams undergo rapid contextual changes from known baselines. Hence, we propose two unsupervised incremental techniques following a two-step strategy. First, we estimate an underlying temporal property of the stream via adaptive learning; then, we apply statistically robust control charts to recognize deviations. We evaluated our techniques by replaying over 40 time-series streams from the Yahoo! Webscope S5 datasets as well as 4 other traces of real web service QoS and ISP traffic measurements. Our methods achieve high detection accuracy and few false alarms, and better performance in general compared to an open-source package for time-series anomaly detection.
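
Illustrative sketch (not the paper's algorithm): the two-step strategy can be pictured as an adaptive estimate of the stream followed by a robust control chart on the forecast errors. The abstract names neither the estimator nor the chart, so the choices below are assumptions: an exponentially weighted moving average as the adaptive estimate, and median plus or minus k scaled median absolute deviations (MAD) as the control limits. The function name and all parameter defaults are hypothetical.

```python
from collections import deque
from statistics import median


def detect_anomalies(stream, alpha=0.3, window=20, k=3.0, warmup=5):
    """Return indices of points whose forecast residual breaks the control limits."""
    level = None                      # adaptive estimate of the current level
    residuals = deque(maxlen=window)  # recent forecast errors of normal points
    flagged = []
    for i, x in enumerate(stream):
        if level is None:
            level = x                 # initialise the estimate with the first point
            continue
        residual = x - level          # one-step-ahead forecast error
        is_anomaly = False
        if len(residuals) >= warmup:
            centre = median(residuals)
            mad = median(abs(r - centre) for r in residuals)
            # 1.4826 scales the MAD to a normal standard deviation;
            # the tiny floor avoids a zero-width band on flat streams.
            scale = 1.4826 * mad or 1e-9
            is_anomaly = abs(residual - centre) > k * scale
        if is_anomaly:
            flagged.append(i)         # do not learn from the anomalous point
        else:
            residuals.append(residual)
            level = alpha * x + (1 - alpha) * level
    return flagged


# Hypothetical response-time samples with one obvious spike at index 10.
samples = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 10.0, 10.2, 9.9, 25.0, 10.1, 10.0]
spikes = detect_anomalies(samples)
```

Using the median and MAD instead of the mean and standard deviation keeps a single spike from widening the limits, and skipping the update on flagged points keeps the estimate from chasing the anomaly.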

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Performance Monitoring and Measurement, Computer Network Management, Quality of Service, Time Series Analysis, Anomaly Detection, Unsupervised Learning
Research subject
computer science; administrative data processing; computer engineering
Identifiers
urn:nbn:se:umu:diva-142030 (URN)
10.1109/TNSM.2017.2750906 (DOI)
000427420100016 ()
2-s2.0-85043695414 (Scopus ID)
Projects
Cloud Control
Funder
Swedish Research Council, C0590801
Available from: 2017-11-17 Created: 2017-11-17 Last updated: 2024-07-04 Bibliographically approved
5. A Black-box Approach for Detecting Systems Anomalies in Virtualized Environments
2017 (English) In: 2017 IEEE International Conference on Cloud and Autonomic Computing (ICCAC 2017), IEEE, 2017, p. 22-33. Conference paper, Published paper (Refereed)
Abstract [en]

Virtualization technologies allow cloud providers to optimize server utilization and cost by co-locating services on as few servers as possible. Studies have shown how applications in multi-tenant environments are susceptible to systems anomalies such as abnormal resource usage due to performance interference. Effective detection of such anomalies requires techniques that can adapt autonomously to dynamic service workloads, require limited instrumentation to cope with diverse application services, and infer relationships between anomalies non-intrusively to avoid "alarm fatigue" due to scale. We propose a black-box framework that includes an unsupervised prediction-based mechanism for automated anomaly detection in multi-dimensional resource behaviour of datacenter nodes and a graph-theoretic technique for ranking anomalous nodes across the datacenter. The proposed framework is evaluated using resource traces of over 100 virtual machines obtained from a production cluster as well as traces obtained from an experimental testbed under realistic service composition. The technique achieves average normalized root mean squared forecast error and R^2 of (0.92, 0.07) across host servers and (0.70, 0.39) across virtual machines. Also, the average detection rate is 88% while explaining 62% of SLA violations with an average lead-time of 6 time-points when the testbed is actively perturbed under three contention scenarios.
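
Illustrative sketch (not taken from the paper): the two forecast-quality measures quoted above can be computed as follows. The abstract does not state how the root mean squared error is normalized; dividing by the range of the observed values is an assumption made here, and `forecast_scores` is a hypothetical helper name.

```python
import math


def forecast_scores(actual, predicted):
    """Return (NRMSE, R^2) for a forecast against observed values.

    NRMSE is RMSE divided by the observed range (an assumed normalization).
    R^2 is the usual coefficient of determination, 1 - SSE / SST.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    n = len(actual)
    mean = sum(actual) / n
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    sst = sum((a - mean) ** 2 for a in actual)                  # total sum of squares
    value_range = max(actual) - min(actual)
    if value_range == 0 or sst == 0:
        raise ValueError("observed values must not be constant")
    nrmse = math.sqrt(sse / n) / value_range
    r_squared = 1.0 - sse / sst
    return nrmse, r_squared


# Hypothetical CPU-utilisation readings and a forecast that is off by one at both ends.
nrmse, r_squared = forecast_scores([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0])
```

A lower NRMSE and an R^2 closer to 1 both indicate a forecast that tracks the observed resource behaviour more closely.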

Place, publisher, year, edition, pages
IEEE, 2017
Keywords
Anomaly Detection, Performance Anomaly Detection, Performance Diagnosis, Cloud Computing, Virtualized Services, Unsupervised Learning, Time Series Analysis, Quality of Service
Research subject
computer engineering; administrative data processing
Identifiers
urn:nbn:se:umu:diva-142031 (URN)
10.1109/ICCAC.2017.10 (DOI)
2-s2.0-85035314261 (Scopus ID)
978-1-5386-1939-1 (ISBN)
Conference
2017 IEEE International Conference on Cloud and Autonomic Computing (ICCAC 2017), Tucson, Arizona, USA, 18–22 September 2017
Projects
Cloud Control
Funder
Swedish Research Council, C0590801
Available from: 2017-11-17 Created: 2017-11-17 Last updated: 2023-03-24 Bibliographically approved
6. Adaptive Service Performance Control using Cooperative Fuzzy Reinforcement Learning in Virtualized Environments
2017 (English) In: UCC '17 Proceedings of the 10th International Conference on Utility and Cloud Computing, IEEE/ACM, 2017, p. 19-28. Conference paper, Published paper (Refereed)
Abstract [en]

Designing efficient control mechanisms to meet strict performance requirements with respect to changing workload demands without sacrificing resource efficiency remains a challenge in cloud infrastructures. A popular approach is fine-grained resource provisioning via auto-scaling mechanisms that rely on either threshold-based adaptation rules or sophisticated queuing/control-theoretic models. While it is difficult at design time to specify optimal threshold rules, it is even more challenging to infer precise performance models for the multitude of services. Recently, reinforcement learning has been applied to address this challenge. However, such approaches require many learning trials to stabilize at the beginning and when operational conditions vary, thereby limiting their application under dynamic workloads. To this end, we extend the standard reinforcement learning approach in two ways: a) we formulate the system state as a fuzzy space and b) exploit a set of cooperative agents to explore multiple fuzzy states in parallel to speed up learning. Through multiple experiments on a real virtualized testbed, we demonstrate that our approach converges quickly and meets performance targets at high efficiency without explicit service models.
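
Illustrative sketch (not the paper's algorithm): formulating the state as a fuzzy space means a measurement belongs to several overlapping states at once, and a learning update is shared among them in proportion to membership. The code below is a generic textbook-style fuzzy Q-learning step under stated assumptions: a single normalized input in [0, 1], three triangular fuzzy sets, and two actions. The function names, set centres, and learning parameters are hypothetical, and the cooperative multi-agent part of the paper is not modelled.

```python
def memberships(x, centres=(0.0, 0.5, 1.0)):
    """Triangular membership degrees of x over ordered fuzzy-set centres."""
    x = min(max(x, centres[0]), centres[-1])  # clamp to the covered range
    degrees = [0.0] * len(centres)
    for i in range(len(centres) - 1):
        lo, hi = centres[i], centres[i + 1]
        if lo <= x <= hi:
            w = (x - lo) / (hi - lo)
            degrees[i], degrees[i + 1] = 1.0 - w, w  # degrees sum to 1
            break
    return degrees


def q_value(q, degrees, action):
    """Membership-weighted value of an action in a fuzzy state."""
    return sum(d * q[s][action] for s, d in enumerate(degrees))


def update(q, degrees, action, reward, next_degrees, lr=0.1, gamma=0.9):
    """One Q-learning step whose correction is spread over the active fuzzy sets."""
    best_next = max(q_value(q, next_degrees, a) for a in range(len(q[0])))
    td_error = reward + gamma * best_next - q_value(q, degrees, action)
    for s, d in enumerate(degrees):
        q[s][action] += lr * d * td_error


# Hypothetical setup: 3 fuzzy sets (low, ok, high load) x 2 actions (hold, scale up).
q_table = [[0.0, 0.0] for _ in range(3)]
state = memberships(0.25)  # halfway between "low" and "ok"
update(q_table, state, action=1, reward=1.0, next_degrees=memberships(0.5))
```

Because one observation updates every fuzzy set it partially belongs to, nearby states learn from each other, which is one reason a fuzzy state formulation can need fewer trials than a finely discretized table.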

Place, publisher, year, edition, pages
IEEE/ACM, 2017
Keywords
Performance control, Resource allocation, Quality of service, Reinforcement learning, Autoscaling, Autonomic computing
Research subject
computer engineering; administrative data processing
Identifiers
urn:nbn:se:umu:diva-142032 (URN)
10.1145/3147213.3147225 (DOI)
2-s2.0-85050972967 (Scopus ID)
978-1-4503-5149-2 (ISBN)
Conference
10th IEEE/ACM International Conference on Utility and Cloud Computing, Austin, Texas, USA, December 5-8, 2017
Projects
Cloud Control
Funder
Swedish Research Council, C0590801
Available from: 2017-11-17 Created: 2017-11-17 Last updated: 2023-03-24 Bibliographically approved

Open Access in DiVA

fulltext (2123 kB), 3636 downloads
File information
File name: FULLTEXT03.pdf, File size: 2123 kB, Checksum: SHA-512
b178f80fdd6d80a06365f54a510c10e87738c8920f78961571b40eaf58e2933a9d028455c9014d71ac0fe8ae5c5186385f459d389e46bcf65db615e54f0d9d61
Type: fulltext, Mimetype: application/pdf
spikblad (123 kB), 193 downloads
File information
File name: FULLTEXT02.pdf, File size: 123 kB, Checksum: SHA-512
5b219307a3f086e8795ab9e0e1bf71857f377b917343dac61af9d483d28e1d2658db20265c0eebc795a5d2ceef8821304ab37c5b2b6f1ef11d2596cb117d156a
Type: spikblad, Mimetype: application/pdf
cover (3032 kB), 165 downloads
File information
File name: COVER01.pdf, File size: 3032 kB, Checksum: SHA-512
cd5c5d911253f5c0a57f9a9d0c0c9e0cd9e659be72cdbb99f430c6f2a41832d402f10d0c221a932df17cd883b8c813c58d2ac736b92845d3d32177f73da34c21
Type: cover, Mimetype: application/pdf

Person

Ibidunmoye, Olumuyiwa

Total: 3830 downloads
The number of downloads is the sum of all downloads of all full texts. It may, for example, include earlier versions that are no longer available.
Total: 4858 hits