Performance Anomaly Detection and Bottleneck Identification
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Distributed Systems)
2015 (English)
In: ACM Computing Surveys, ISSN 0360-0300, E-ISSN 1557-7341, Vol. 48, no 1, 4
Article in journal (Refereed) Published
Abstract [en]

To meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify potential root causes, and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. To assess progress and research trends and to identify open challenges, we have reviewed major contributions in the area and present our findings in this survey. Our approach provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying fundamental elements of the problem, we are able to categorize existing solutions based on multiple factors such as the detection goals, nature of applications and systems, system observability, and detection methods.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2015. Vol. 48, no 1, 4
Keyword [en]
Systems performance, performance anomaly detection, bottleneck detection, performance problem identification
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
URN: urn:nbn:se:umu:diva-105991
DOI: 10.1145/2791120
ISI: 000363733200004
Scopus ID: 2-s2.0-84938363675
OAI: oai:DiVA.org:umu-105991
DiVA: diva2:839537
Funder
Swedish Research Council, C0590801
Available from: 2015-07-03 Created: 2015-07-03 Last updated: 2017-12-04 Bibliographically approved
In thesis
1. Performance problem diagnosis in cloud infrastructures
2016 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud datacenters comprise hundreds or thousands of disparate application services, each having stringent performance and availability requirements, sharing a finite set of heterogeneous hardware and software resources. The implication of such a complex environment is that performance problems, such as slow application response and unplanned downtime, have become the norm rather than the exception, resulting in decreased revenue, damaged reputation, and considerable human effort in diagnosis. Though causes can be as varied as application issues (e.g. bugs), machine-level failures (e.g. faulty servers), and operator errors (e.g. misconfigurations), recent studies have identified capacity-related issues, such as resource shortage and contention, as the cause of most performance problems on the Internet today. As cloud datacenters become increasingly autonomous, there is a need for automated performance diagnosis systems that can adapt their operation to reflect the changing workload and topology in the infrastructure. In particular, such systems should be able to detect anomalous performance events, uncover manifestations of capacity bottlenecks, localize actual root cause(s), and possibly suggest or actuate corrections.

This thesis investigates approaches for diagnosing performance problems in cloud infrastructures. We present the outcome of an extensive survey of existing research contributions addressing performance diagnosis in diverse systems domains. We also present models and algorithms for detecting anomalies in real-time application performance and for identifying anomalous datacenter resources based on operational metrics and spatial dependency across datacenter components. Empirical evaluations of our approaches show how they can be used to improve end-user experience and service assurance and to support root-cause analysis.

Place, publisher, year, edition, pages
Umeå: Department of Computing Science, Umeå University, 2016. 28 p.
Series
Report / UMINF, ISSN 0348-0542 ; 16.14
Keyword
Systems Performance, Performance anomalies, Performance bottlenecks, Cloud infrastructures, Cloud Computing, Cloud Services, Cloud Computing Performance, Performance problems, Performance anomaly detection, Performance bottleneck identification, Performance Root-cause Analysis
National Category
Computer Systems
Research subject
Computer Systems; Computer Science
Identifiers
urn:nbn:se:umu:diva-120287 (URN)
978-91-7601-500-1 (ISBN)
Presentation
2016-05-24, N430, Naturvetarhuset, Umeå University, Umeå, 10:00 (English)
Opponent
Supervisors
Projects
Cloud Control (C0590801)
Funder
Swedish Research Council, C0590801
Available from: 2016-05-23 Created: 2016-05-13 Last updated: 2016-08-23 Bibliographically approved
2. Performance anomaly detection and resolution for autonomous clouds
2017 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Fundamental properties of cloud computing such as resource sharing and on-demand self-servicing are driving a growing adoption of the cloud for hosting both legacy and new application services. A consequence of this growth is that the increasing scale and complexity of the underlying cloud infrastructure, as well as the fluctuating service workloads, are inducing performance incidents at a higher frequency than ever before, with far-reaching impact on revenue, reliability, and reputation. Hence, effectively managing performance incidents with emphasis on timely detection, diagnosis and resolution has become a necessity rather than a luxury. While other aspects of cloud management such as monitoring and resource management are experiencing greater automation, automated management of performance incidents remains a major concern.

Given the volume of operational data produced by cloud datacenters and services, this thesis focuses on how data analytics techniques can be used for cloud performance management. In particular, this work investigates techniques and models for automated performance anomaly detection and prevention in cloud environments. To survey developments in the research area, we present the outcome of an extensive survey of existing research contributions addressing various aspects of performance problem management in diverse systems domains. We discuss the design and evaluation of analytics models and algorithms for detecting performance anomalies in the real-time behaviour of cloud datacenter resources and hosted services at different resolutions. We also discuss the design of a semi-supervised machine learning approach for mitigating performance degradation by actively driving quality of service from undesirable states to a desired target state via incremental capacity optimization. The research methods used in this thesis include experiments on real virtualized testbeds to evaluate aspects of the proposed techniques, while other aspects are evaluated using performance traces from real-world datacenters.

Insights and outcomes from this thesis can be used by both cloud and service operators to enhance the automation of performance problem detection, diagnosis and resolution. They also have the potential to spur further research in the area while being applicable in related domains such as Internet of Things (IoT), industrial sensors as well as in edge and mobile clouds.

Abstract [sv] (English translation)

Fundamental properties of cloud computing, such as resource sharing and self-service, are driving growing use of the cloud for Internet services. One consequence of this growth is that the increasing size and complexity of the underlying cloud infrastructure, together with fluctuating workloads, cause performance incidents at a higher frequency than ever before. This in turn has a far-reaching impact on revenue, reliability, and reputation for those who own the services. It has therefore become important to handle performance incidents quickly and effectively with respect to detection, diagnosis, and correction. Although other aspects of resource management for clouds, such as monitoring and resource allocation, have recently been automated to an ever greater degree, automated handling of performance incidents remains a major problem.

This thesis focuses on how performance management in cloud datacenters can be improved by applying data analytics techniques to the large volumes of data produced by the systems that monitor the performance of computing resources and services. In particular, it investigates techniques and models for automatic detection and prevention of performance anomalies in clouds. To map developments in the research area, we present the results of an extensive survey of existing research contributions that address various aspects of performance problem management in relevant application domains. We discuss the design and evaluation of analytics models and algorithms for detecting performance anomalies in real time in resources and services. We also discuss the design of a machine learning-based approach for mitigating performance degradation by actively driving the quality of services from undesirable states to a desired target state through incremental capacity optimization. The research methods used in this thesis include experiments on real virtualized testbeds to evaluate aspects of the proposed techniques, while other aspects are evaluated using workload patterns from real datacenters.

Insights and results from this thesis can be used by both cloud and service operators to better automate the detection of performance problems, including their diagnosis and correction. The results also have the potential to encourage further research in the area while being applicable in related domains such as the Internet of Things, industrial sensors, and large-scale distributed clouds or telecom networks.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2017. 60 p.
Series
Report / UMINF, ISSN 0348-0542 ; 17.18
Keyword
Cloud Computing, Distributed Systems, Performance Management, Anomaly Detection, Quality of Service, Performance Analytics, Machine Learning
National Category
Computer Systems
Research subject
Computer Systems; Computing Science; Computer Science
Identifiers
urn:nbn:se:umu:diva-142033 (URN)
978-91-7601-800-2 (ISBN)
Public defence
2017-12-14, MA121, MIT-huset, Umeå University, Umeå, 13:15 (English)
Opponent
Supervisors
Projects
Cloud Control; eSSENCE
Funder
Swedish Research Council, C0590801
Available from: 2017-11-21 Created: 2017-11-17 Last updated: 2017-11-29 Bibliographically approved

Authority records

Ibidunmoye, Olumuyiwa; Francisco, Hernandez-Rodriguez; Elmroth, Erik
