Tordsson, Johan
Publications (10 of 98)
Saleh Sedghpour, M. R., Garlan, D., Schmerl, B., Klein, C. & Tordsson, J. (2023). Breaking the vicious circle: self-adaptive microservice circuit breaking and retry. In: Lisa O’Conner (Ed.), 2023 IEEE international conference on cloud engineering: proceedings. Paper presented at 2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, Massachusetts, 25–28 September 2023. (pp. 32-42). IEEE Computer Society, Article ID 24126172.
Breaking the vicious circle: self-adaptive microservice circuit breaking and retry
2023 (English) In: 2023 IEEE international conference on cloud engineering: proceedings / [ed] Lisa O’Conner, IEEE Computer Society, 2023, p. 32-42, article id 24126172. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures consist of numerous, loosely coupled services with multiple instances. Service meshes aim to simplify traffic management and prevent microservice overload through circuit breaking and request retry mechanisms. Previous studies have demonstrated that static configuration of these mechanisms is unfit for the dynamic environment of microservices. We conduct a sensitivity analysis to understand the impact of retrying across a wide range of scenarios. Based on the findings, we propose a retry controller that can also work with dynamically configured circuit breakers. We have empirically assessed our proposed controller in various scenarios, including transient overload and noisy neighbors, while enforcing adaptive circuit breaking. The results show that our controller does not deviate from a well-tuned configuration, maintaining the carried response time while adapting to changes. Compared to the default static retry configuration that is mostly used in practice, our approach improves carried throughput by up to 12x and 32x in the cases of transient overload and noisy neighbors, respectively.
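For readers who want to experiment with the idea, the sketch below shows a generic adaptive retry budget in Python: it shrinks the permitted retry ratio multiplicatively when upstream errors indicate overload and grows it back additively otherwise. This is only an illustration of the retry-adaptation concept; the class names, constants, and AIMD rule are assumptions, not the controller proposed in the paper.

```python
# Illustrative sketch only: a generic adaptive retry budget, not the
# controller proposed in the paper. All names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class RetryBudget:
    ratio: float = 0.2           # extra retry load allowed, as a fraction of requests
    min_retries_per_sec: float = 10.0

    def allowed_retries(self, request_rate: float) -> float:
        # Retries permitted per second given the observed request rate.
        return max(self.min_retries_per_sec, self.ratio * request_rate)

class AdaptiveRetryController:
    """Shrinks the retry budget when upstream failures rise (a sign of
    overload), and grows it back slowly when the error rate recovers."""

    def __init__(self, budget: RetryBudget, error_target: float = 0.05):
        self.budget = budget
        self.error_target = error_target

    def update(self, observed_error_rate: float) -> None:
        if observed_error_rate > self.error_target:
            # Multiplicative decrease: back off quickly under overload.
            self.budget.ratio = max(0.0, self.budget.ratio * 0.5)
        else:
            # Additive increase: probe for more headroom carefully.
            self.budget.ratio = min(1.0, self.budget.ratio + 0.01)

if __name__ == "__main__":
    ctrl = AdaptiveRetryController(RetryBudget())
    for err in [0.01, 0.02, 0.20, 0.30, 0.04, 0.03]:
        ctrl.update(err)
        print(f"error={err:.2f} -> retry ratio={ctrl.budget.ratio:.3f}, "
              f"allowed retries at 1000 rps={ctrl.budget.allowed_retries(1000):.0f}")
```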

Place, publisher, year, edition, pages
IEEE Computer Society, 2023
Series
Proceedings of the ... IEEE International Symposium on Requirements Engineering, E-ISSN 2332-6441
Keywords
reliability, retry mechanism, circuit breaker pattern, service mesh, microservices
National Category
Computer Engineering Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206989 (URN), 10.1109/IC2E59103.2023.00012 (DOI), 2-s2.0-85179510941 (Scopus ID), 979-8-3503-4394-6 (ISBN)
Conference
2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, Massachusetts, 25–28 September 2023.
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2023-12-22. Bibliographically approved
Saleh Sedghpour, M. R., Obeso Duque, A., Cai, X., Skubic, B., Elmroth, E., Klein, C. & Tordsson, J. (2023). Hydragen: a microservice benchmark generator. In: C. Ardagna; N. Atukorala; P. Beckman; C.K. Chang; R.N. Chang; C. Evangelinos; J. Fan; G.C. Fox; J. Fox; C. Hagleitner; Z. Jin; T. Kosar; M. Parashar (Ed.), 2023 IEEE 16th international conference on cloud computing (CLOUD): . Paper presented at 16th IEEE International Conference on Cloud Computing, CLOUD 2023, Hybrid/Chicago, July 2-8, 2023 (pp. 189-200). IEEE, 2023-July
Hydragen: a microservice benchmark generator
2023 (English) In: 2023 IEEE 16th international conference on cloud computing (CLOUD) / [ed] C. Ardagna; N. Atukorala; P. Beckman; C.K. Chang; R.N. Chang; C. Evangelinos; J. Fan; G.C. Fox; J. Fox; C. Hagleitner; Z. Jin; T. Kosar; M. Parashar, IEEE, 2023, Vol. 2023-July, p. 189-200. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures have become ubiquitous in large-scale software systems. Experimental cloud researchers constantly propose enhanced resource management mechanisms for such systems. These mechanisms need to be evaluated using both realistic and flexible microservice benchmarks to study in which ways diverse application characteristics can affect their performance and scalability. However, current microservice benchmarks have limitations including static computational complexity, limited architectural scale, and fixed topology (i.e., number of tiers, fan-in, and fan-out characteristics).

We therefore propose HydraGen, a tool that enables researchers to systematically generate benchmarks with different computational complexities and topologies, to tackle experimental evaluation of performance at scale for web-serving applications, with a focus on inter-service communication. To illustrate the potential of our open-source tool, we demonstrate how it can reproduce an existing microservice benchmark with preserved architectural properties. We also demonstrate how HydraGen can enrich the evaluation of cloud management systems based on a case study related to traffic engineering.
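As a rough illustration of topology generation (not HydraGen's actual implementation; the function and parameter names are invented), the following Python sketch builds a tree-shaped service call graph from a tier count and per-service fan-out:

```python
# Rough sketch of layered call-graph generation in the spirit of a benchmark
# generator; it is not HydraGen's implementation, and all names are assumptions.
from itertools import count

def generate_topology(tiers: int, fan_out: int):
    """Return {service_name: [downstream service names]} for a tree-like
    topology with the given number of tiers and per-service fan-out."""
    ids = count()
    topology = {}
    frontier = [f"svc-{next(ids)}"]        # tier 0: the entry service
    topology[frontier[0]] = []
    for _ in range(tiers - 1):
        next_frontier = []
        for parent in frontier:
            children = [f"svc-{next(ids)}" for _ in range(fan_out)]
            topology[parent] = children
            for child in children:
                topology[child] = []       # leaves unless extended in the next tier
            next_frontier.extend(children)
        frontier = next_frontier
    return topology

if __name__ == "__main__":
    topo = generate_topology(tiers=3, fan_out=2)
    for service, downstream in topo.items():
        print(service, "->", downstream or "leaf")
```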

Place, publisher, year, edition, pages
IEEE, 2023
Series
IEEE International Conference on Cloud Computing, CLOUD, ISSN 2159-6182, E-ISSN 2159-6190
Keywords
microservices, benchmark generator, performance analysis, emulation, validation, cloud systems
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206987 (URN), 10.1109/CLOUD60044.2023.00030 (DOI), 2-s2.0-85174317366 (Scopus ID), 9798350304817 (ISBN), 9798350304824 (ISBN)
Conference
16th IEEE International Conference on Cloud Computing, CLOUD 2023, Hybrid/Chicago, July 2-8, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Google
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2023-11-06. Bibliographically approved
Saleh Sedghpour, M. R., Klein, C. & Tordsson, J. (2022). An Empirical Study of Service Mesh Traffic Management Policies for Microservices. In: ICPE '22: Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering: . Paper presented at ICPE '22: ACM/SPEC International Conference on Performance Engineering, Beijing, China, April 9 - 13, 2022 (pp. 17-27). New York: ACM Digital Library
An Empirical Study of Service Mesh Traffic Management Policies for Microservices
2022 (English) In: ICPE '22: Proceedings of the 2022 ACM/SPEC International Conference on Performance Engineering, New York: ACM Digital Library, 2022, p. 17-27. Conference paper, Published paper (Refereed)
Abstract [en]

A microservice architecture features hundreds or even thousands of small, loosely coupled services with multiple instances. Because microservice performance depends on many factors including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and increase the robustness of communication between microservices. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of various tuning parameters for circuit breaking and retries is poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically and also open up new areas of research for academics in the area of service meshes for (autonomic) microservice resource management.
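The study sweeps circuit-breaker and retry settings across many scenarios. The sketch below illustrates such a parameter sweep in Python; run_experiment() is a hypothetical placeholder, while the nested field names mirror Istio's DestinationRule and VirtualService traffic policies.

```python
# Illustrative parameter sweep in the spirit of the study; run_experiment()
# is a hypothetical placeholder, but the nested field names mirror Istio's
# DestinationRule / VirtualService traffic policies.
import itertools, json

def circuit_breaker_policy(max_pending: int, max_requests: int) -> dict:
    return {"trafficPolicy": {"connectionPool": {"http": {
        "http1MaxPendingRequests": max_pending,
        "http2MaxRequests": max_requests}}}}

def retry_policy(attempts: int, per_try_timeout: str) -> dict:
    return {"retries": {"attempts": attempts,
                        "perTryTimeout": per_try_timeout,
                        "retryOn": "5xx,reset"}}

def run_experiment(cb: dict, retry: dict) -> None:
    # Placeholder: in an actual testbed this would apply the policies to the
    # mesh, drive load, and record throughput and response-time percentiles.
    print(json.dumps({"circuit_breaker": cb, "retry": retry}))

if __name__ == "__main__":
    for max_pending, attempts in itertools.product([16, 64, 1024], [0, 1, 3]):
        run_experiment(circuit_breaker_policy(max_pending, max_requests=1024),
                       retry_policy(attempts, per_try_timeout="200ms"))
```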

Place, publisher, year, edition, pages
New York: ACM Digital Library, 2022
Keywords
microservices, service mesh, traffic management, circuit breaking, retry, microservice resiliency
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-193341 (URN), 10.1145/3489525.3511686 (DOI), 000883411400004 (), 2-s2.0-85128651087 (Scopus ID)
Conference
ICPE '22: ACM/SPEC International Conference on Performance Engineering, Beijing, China, April 9 - 13, 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2022-03-28 Created: 2022-03-28 Last updated: 2023-09-05. Bibliographically approved
Souza, A., Pelckmans, K. & Tordsson, J. (2021). A HPC Co-scheduler with Reinforcement Learning. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics): . Paper presented at JSSPP 2021, 24th International Workshop on Job Scheduling Strategies for Parallel Processing, Virtual, May 21, 2021 (pp. 126-148). Springer
A HPC Co-scheduler with Reinforcement Learning
2021 (English) In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer, 2021, p. 126-148. Conference paper, Published paper (Refereed)
Abstract [en]

Although High Performance Computing (HPC) users understand basic resource requirements such as the number of CPUs and memory limits, internal infrastructural utilization data is exclusively leveraged by cluster operators, who use it to configure batch schedulers. This task is challenging and increasingly complex due to ever larger cluster scales and the heterogeneity of modern scientific workflows. As a result, HPC systems achieve low utilization with long job completion times (makespans). To tackle these challenges, we propose a co-scheduling algorithm based on an adaptive reinforcement learning algorithm, where application profiling is combined with cluster monitoring. The resulting cluster scheduler matches resource utilization to application performance in a fine-grained manner (i.e., at the operating system level). As opposed to nominal allocations, we apply decision trees to model applications’ actual resource usage, which are used to estimate how much resource capacity from one allocation can be co-allocated to additional applications. Our algorithm learns from incorrect co-scheduling decisions, adapts to changing environment conditions, and evaluates when such changes cause resource contention that impacts quality of service metrics such as job slowdowns. We integrate our algorithm in an HPC resource manager that combines Slurm and Mesos for job scheduling and co-allocation, respectively. Our experimental evaluation, performed in a dedicated cluster executing a mix of four different real scientific workflows, demonstrates improvements in cluster utilization of up to 51% even in high load scenarios, with 55% average queue makespan reductions under low loads.
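A minimal sketch of the learning idea, assuming a simple epsilon-greedy bandit that chooses how many extra jobs to co-locate on an allocation and is penalized when the co-located load exceeds the estimated spare capacity (the reward function and constants are invented; this is not the paper's co-scheduler):

```python
# Minimal epsilon-greedy sketch of learning how aggressively to co-schedule
# jobs on an allocation; illustration only, not the paper's algorithm.
import random

class CoScheduleAgent:
    def __init__(self, actions=(0, 1, 2), epsilon=0.1):
        self.actions = actions                    # extra jobs to co-locate
        self.epsilon = epsilon
        self.q = {a: 0.0 for a in actions}        # estimated reward per action
        self.n = {a: 0 for a in actions}

    def choose(self) -> int:
        if random.random() < self.epsilon:
            return random.choice(self.actions)    # explore
        return max(self.q, key=self.q.get)        # exploit best-known action

    def learn(self, action: int, reward: float) -> None:
        self.n[action] += 1
        # Incremental mean update of the action-value estimate.
        self.q[action] += (reward - self.q[action]) / self.n[action]

def simulated_reward(extra_jobs: int, spare_capacity: float) -> float:
    # Hypothetical reward: utilization gain minus a penalty when co-located
    # jobs exceed the (e.g. decision-tree estimated) spare capacity.
    contention = max(0.0, extra_jobs * 0.3 - spare_capacity)
    return extra_jobs * 0.3 - 2.0 * contention

if __name__ == "__main__":
    agent = CoScheduleAgent()
    for _ in range(500):
        spare = random.uniform(0.2, 0.6)          # stand-in for a usage model
        a = agent.choose()
        agent.learn(a, simulated_reward(a, spare))
    print("learned action values:", agent.q)
```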

Place, publisher, year, edition, pages
Springer, 2021
Series
Lecture Notes in Computer Science (LNCS), ISSN 0302-9743, E-ISSN 1611-3349 ; 12985
Keywords
Adaptive reinforcement learning, Co-scheduling, Datacenters, High performance computing
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-188961 (URN), 10.1007/978-3-030-88224-2_7 (DOI), 000869960400007 (), 2-s2.0-85117482466 (Scopus ID), 978-3-030-88223-5 (ISBN), 978-3-030-88224-2 (ISBN)
Conference
JSSPP 2021, 24th International Workshop on Job Scheduling Strategies for Parallel Processing, Virtual, May 21, 2021
Note

Also part of the Theoretical Computer Science and General Issues book sub series (LNTCS, volume 12985)

Available from: 2021-10-28 Created: 2021-10-28 Last updated: 2023-09-05. Bibliographically approved
Wu, L., Tordsson, J., Elmroth, E. & Kao, O. (2021). Causal Inference Techniques for Microservice Performance Diagnosis: Evaluation and Guiding Recommendations. In: Esam El-Araby; Vana Kalogeraki; Danilo Pianini; Frédéric Lassabe; Barry Porter; Sona Ghahremani; Ingrid Nunes; Mohamed Bakhouya; Sven Tomforde (Ed.), 2021 IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS 2021: Proceedings. Paper presented at 2nd IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS 2021, 27 September–1 October 2021, Virtual Conference. (pp. 21-30). IEEE
Causal Inference Techniques for Microservice Performance Diagnosis: Evaluation and Guiding Recommendations
2021 (English) In: 2021 IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS 2021: Proceedings / [ed] Esam El-Araby; Vana Kalogeraki; Danilo Pianini; Frédéric Lassabe; Barry Porter; Sona Ghahremani; Ingrid Nunes; Mohamed Bakhouya; Sven Tomforde, IEEE, 2021, p. 21-30. Conference paper, Published paper (Refereed)
Abstract [en]

Causal inference (CI) is a popular class of performance diagnosis methods that infers anomaly propagation from observed data to locate root causes. Although some specific CI methods have been employed in the literature, the overall performance of this class of methods on microservice performance diagnosis is not well understood. To this end, we select six representative CI methods from three categories and evaluate their performance against the challenges of microservice operations, including large-scale observability data, heterogeneous anomaly symptoms, and a wide range of root causes. Our experimental results show that 1) CI techniques must be integrated with anomaly detection or anomaly scores to differentiate the causality in normal and abnormal data; 2) CI techniques are more robust to false positives in anomaly detection than knowledge-based non-CI methods; 3) to obtain fine-grained root causes, an effective way to apply CI techniques is to identify the faulty service first and then infer the detailed explanation of the service abnormality. Overall, this work broadens the understanding of how CI methods perform on microservice performance diagnosis and provides recommendations for an efficient application of CI methods.
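The first recommendation, integrating CI with anomaly detection, can be illustrated with a small Python sketch that keeps only the metric samples whose robust z-score marks them as abnormal before any causal analysis is run; the median/MAD scoring rule and the 3.5 threshold are assumptions for illustration.

```python
# Sketch of gating causal analysis with an anomaly score so that normal and
# abnormal behaviour are not conflated; the scoring rule is an assumption.
import numpy as np

def robust_zscores(series: np.ndarray) -> np.ndarray:
    median = np.median(series)
    mad = np.median(np.abs(series - median)) or 1e-9
    return 0.6745 * (series - median) / mad

def abnormal_window(series: np.ndarray, threshold: float = 3.5) -> np.ndarray:
    """Return only the samples whose robust z-score exceeds the threshold;
    a causal-inference method would then be run on this abnormal slice."""
    return series[np.abs(robust_zscores(series)) > threshold]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latency_ms = np.concatenate([rng.normal(20, 2, 500),   # normal operation
                                 rng.normal(80, 10, 50)])  # injected anomaly
    print("abnormal samples:", abnormal_window(latency_ms).size)
```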

Place, publisher, year, edition, pages
IEEE, 2021
Keywords
Causal inference, Experimental evaluation, Microservices, Performance diagnosis, Self-healing
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-192796 (URN), 10.1109/ACSOS52086.2021.00029 (DOI), 000802156400003 (), 2-s2.0-85124790415 (Scopus ID), 9781665429405 (ISBN), 9781665412612 (ISBN)
Conference
2nd IEEE International Conference on Autonomic Computing and Self-Organizing Systems, ACSOS 2021, 27 September–1 October 2021, Virtual Conference.
Funder
EU, Horizon 2020
Available from: 2022-03-01 Created: 2022-03-01 Last updated: 2023-09-05. Bibliographically approved
Wu, L., Tordsson, J., Bogatinovski, J., Elmroth, E. & Kao, O. (2021). MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems. In: 2021 IEEE/ACM International Workshop on Cloud Intelligence (CloudIntelligence): . Paper presented at ICSE21 Workshop on Cloud Intelligence In conjunction with the 43rd International Conference on Software Engineering, Virtual (Originally Madrid, Spain), May 29, 2021 (pp. 31-36). IEEE
MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems
2021 (English) In: 2021 IEEE/ACM International Workshop on Cloud Intelligence (CloudIntelligence), IEEE, 2021, p. 31-36. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice architecture has emerged as a popular pattern for developing large-scale applications because of its benefits in flexibility, scalability, and agility. However, the large number of services and complex dependencies make it difficult and time-consuming to diagnose performance issues. We propose MicroDiag, an automated system that localizes root causes of performance issues in microservice systems at a fine granularity, not only locating the faulty component but also discovering detailed information about its abnormality. MicroDiag constructs a component dependency graph and performs causal inference on diverse anomaly symptoms to derive a metrics causality graph, which is used to infer root causes. Our experimental evaluation on a microservice benchmark running in a Kubernetes cluster shows that MicroDiag localizes root causes well, with 97% precision for the top 3 most likely root causes, outperforming state-of-the-art methods by at least 31.1%.
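For intuition, the sketch below shows a generic graph-based ranking: a random walker starts at the symptomatic service and moves towards upstream dependencies in proportion to their anomaly scores, so frequently visited services are ranked as likelier root causes. The graph, scores, and walk parameters are made up, and this is not MicroDiag's actual algorithm.

```python
# Simplified, generic graph-based root-cause ranking in the spirit of
# dependency-graph diagnosis; not MicroDiag's algorithm, all data invented.
import random

# Reversed call graph: an edge u -> v means "an anomaly in v can explain u".
reversed_deps = {
    "frontend": ["cart", "catalogue"],
    "cart": ["redis"],
    "catalogue": ["db"],
    "redis": [], "db": [],
}
# Per-service anomaly scores (e.g. metric deviation during the incident).
anomaly = {"frontend": 0.9, "cart": 0.8, "catalogue": 0.1, "redis": 0.7, "db": 0.05}

def random_walk_scores(start="frontend", walks=5000, steps=6):
    """Walkers start at the symptomatic service and move towards upstream
    dependencies with probability proportional to their anomaly score."""
    visits = {s: 0 for s in reversed_deps}
    for _ in range(walks):
        node = start
        for _ in range(steps):
            visits[node] += 1
            nxt = reversed_deps[node]
            if not nxt:
                break
            weights = [anomaly[n] + 1e-6 for n in nxt]
            node = random.choices(nxt, weights=weights)[0]
    return sorted(visits.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for service, score in random_walk_scores():
        print(f"{service}: {score}")
```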

Place, publisher, year, edition, pages
IEEE, 2021
Keywords
Performance diagnosis, Microservice system, Causal inference, Data-driven, Fine-grained root cause
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-183328 (URN), 10.1109/CLOUDINTELLIGENCE52565.2021.00015 (DOI), 000706518900006 (), 2-s2.0-85115886239 (Scopus ID), 978-1-6654-4563-4 (ISBN), 978-1-6654-4564-1 (ISBN)
Conference
ICSE21 Workshop on Cloud Intelligence In conjunction with the 43rd International Conference on Software Engineering, Virtual (Originally Madrid, Spain), May 29, 2021
Funder
EU, Horizon 2020, 765452
Available from: 2021-05-23 Created: 2021-05-23 Last updated: 2023-09-05. Bibliographically approved
Arkian, H., Pierre, G., Tordsson, J. & Elmroth, E. (2021). Model-based Stream Processing Auto-scaling in Geo-Distributed Environments. In: 2021 International Conference on Computer Communications and Networks (ICCCN): . Paper presented at ICCCN 2021, 30th International Conference on Computer Communications and Networks, Athens, Greece, July 19-22, 2021. IEEE
Model-based Stream Processing Auto-scaling in Geo-Distributed Environments
2021 (English) In: 2021 International Conference on Computer Communications and Networks (ICCCN), IEEE, 2021. Conference paper, Published paper (Refereed)
Abstract [en]

Data stream processing is an attractive paradigm for analyzing IoT data at the edge of the Internet before transmitting processed results to a cloud. However, the relative scarcity of fog computing resources combined with the workloads' nonstationary properties makes it impossible to allocate a static set of resources for each application. We propose Gesscale, a resource auto-scaler which guarantees that a stream processing application maintains a sufficient Maximum Sustainable Throughput to process its incoming data with no undue delay, while not using more resources than strictly necessary. Gesscale bases its decisions about when to rescale and which geo-distributed resource(s) to add or remove on a performance model that gives precise predictions about the future maximum sustainable throughput after reconfiguration. We show that this auto-scaler uses 17% less resources, generates 52% fewer reconfigurations, and processes more input data than baseline auto-scalers based on threshold triggers or a simpler performance model.
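A toy version of model-based scaling decisions is sketched below: a deliberately simplistic, invented capacity model predicts the maximum sustainable throughput for a given node count, and the autoscaler picks the smallest count that covers the input rate with a safety margin. Gesscale's actual performance model is considerably more detailed.

```python
# Toy model-based scaling decision: the capacity model below is invented for
# illustration and is much simpler than Gesscale's performance model.
def predicted_mst(nodes: int, per_node_mst: float = 1000.0,
                  coordination_penalty: float = 0.05) -> float:
    """Predicted maximum sustainable throughput (msg/s) for a given number of
    geo-distributed nodes; scaling is assumed slightly sub-linear."""
    return nodes * per_node_mst * (1.0 - coordination_penalty) ** (nodes - 1)

def required_nodes(input_rate: float, safety_margin: float = 1.2,
                   max_nodes: int = 64) -> int:
    """Smallest node count whose predicted MST covers the input rate with a
    safety margin, so no more resources are used than strictly necessary."""
    for n in range(1, max_nodes + 1):
        if predicted_mst(n) >= input_rate * safety_margin:
            return n
    return max_nodes

if __name__ == "__main__":
    for rate in [500, 2500, 7000]:
        n = required_nodes(rate)
        print(f"input {rate} msg/s -> {n} node(s), predicted MST {predicted_mst(n):.0f}")
```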

Place, publisher, year, edition, pages
IEEE, 2021
Series
Proceedings (International Conference on Computer Communications and Networks), E-ISSN 2637-9430
Keywords
Stream processing, autoscaling, fog computing
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-183326 (URN), 10.1109/ICCCN52240.2021.9522236 (DOI), 000701532600054 (), 2-s2.0-85113709912 (Scopus ID)
Conference
ICCCN 2021, 30th International Conference on Computer Communications and Networks, Athens, Greece, July 19-22, 2021
Available from: 2021-05-23 Created: 2021-05-23 Last updated: 2022-01-21. Bibliographically approved
Wu, L., Bogatinovski, J., Nedelkoski, S., Tordsson, J. & Kao, O. (2021). Performance Diagnosis in Cloud Microservices Using Deep Learning. In: Hakim Hacid; Fatma Outay; Hye-young Paik; Amira Alloum; Marinella Petrocchi; Mohamed Reda Bouadjenek; Amin Beheshti; Xumin Liu; Abderrahmane Maaradji (Ed.), Service-Oriented Computing – ICSOC 2020 Workshops: AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events Dubai, United Arab Emirates, December 14–17, 2020 Proceedings. Paper presented at AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events held in conjunction with 18th International Conference on Service-Oriented Computing, ICSOC 2020, Virtual, Online, December 14-17, 2020 (pp. 85-96). Cham: Springer
Performance Diagnosis in Cloud Microservices Using Deep Learning
2021 (English) In: Service-Oriented Computing – ICSOC 2020 Workshops: AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events Dubai, United Arab Emirates, December 14–17, 2020 Proceedings / [ed] Hakim Hacid; Fatma Outay; Hye-young Paik; Amira Alloum; Marinella Petrocchi; Mohamed Reda Bouadjenek; Amin Beheshti; Xumin Liu; Abderrahmane Maaradji, Cham: Springer, 2021, p. 85-96. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice architectures are increasingly adopted to design large-scale applications. However, the highly distributed nature and complex dependencies of microservices complicate automatic performance diagnosis and make it challenging to guarantee service level agreements (SLAs). In particular, identifying the culprits of a microservice performance issue is extremely difficult, as the set of potential root causes is large and issues can manifest themselves in complex ways. This paper presents an application-agnostic system to locate the culprits of microservice performance degradation with fine granularity, including not only the anomalous service from which the performance issue originates but also the culprit metrics that correlate to the service abnormality. Our method first finds potential culprit services by constructing a service dependency graph and next applies an autoencoder to identify abnormal service metrics based on a ranked list of reconstruction errors. Our experimental evaluation, based on injection of performance anomalies into a microservice benchmark deployed in the cloud, shows that our system achieves good diagnosis results, with 92% precision in locating the culprit service and 85.5% precision in locating the culprit metrics.
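The culprit-metric ranking step can be illustrated as follows; PCA stands in for the autoencoder purely to keep the example short, and the metric names and values are invented.

```python
# Sketch of ranking culprit metrics by reconstruction error; PCA stands in
# for the autoencoder used in the paper to keep the example small.
import numpy as np
from sklearn.decomposition import PCA

metrics = ["cpu", "memory", "latency", "net_in", "net_out"]
rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(500, len(metrics)))      # normal-period samples

model = PCA(n_components=2).fit(normal)

# An anomalous sample where "latency" deviates strongly from normal behaviour.
anomalous = np.array([[0.1, -0.2, 6.0, 0.3, 0.0]])
reconstruction = model.inverse_transform(model.transform(anomalous))
errors = np.abs(anomalous - reconstruction).ravel()

# Metrics with the largest reconstruction error are reported as likely culprits.
for name, err in sorted(zip(metrics, errors), key=lambda x: -x[1]):
    print(f"{name}: reconstruction error {err:.2f}")
```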

Place, publisher, year, edition, pages
Cham: Springer, 2021
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 12632
Keywords
Autoencoder, Cloud computing, Microservices, Performance diagnosis, Root cause analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-191348 (URN), 10.1007/978-3-030-76352-7_13 (DOI), 000719568200013 (), 2-s2.0-85109866300 (Scopus ID), 978-3-030-76351-0 (ISBN), 978-3-030-76352-7 (ISBN)
Conference
AIOps, CFTIC, STRAPS, AI-PA, AI-IOTS, and Satellite Events held in conjunction with 18th International Conference on Service-Oriented Computing, ICSOC 2020, Virtual, Online, December 14-17, 2020
Funder
EU, Horizon 2020, 765452
Note

Also part of the Programming and Software Engineering book sub series (LNPSE, volume 12632).

Available from: 2022-01-14 Created: 2022-01-14 Last updated: 2022-01-14. Bibliographically approved
Saleh Sedghpour, M. R., Klein, C. & Tordsson, J. (2021). Service mesh circuit breaker: From panic button to performance management tool. In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems: . Paper presented at EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April, 2021 (pp. 4-10). Association for Computing Machinery (ACM)
Service mesh circuit breaker: From panic button to performance management tool
2021 (English) In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems, Association for Computing Machinery (ACM), 2021, p. 4-10. Conference paper, Published paper (Refereed)
Abstract [en]

Site Reliability Engineers are at the center of two tensions: on the one hand, they need to respond to alerts within a short time to restore a non-functional system. On the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieves neither, due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments.

We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failures, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average, with an availability of 70% and 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features 63% availability, 30% circuit-broken requests, and more than 5% of requests timing out.
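A minimal sketch of such a controller, assuming an AIMD rule on the circuit breaker's pending-request limit driven by the observed tail latency (the constants and update rule are illustrative, not the controller evaluated in the paper):

```python
# Minimal feedback-controller sketch, assuming an AIMD rule on the circuit
# breaker's pending-request limit; constants are illustrative only.
class AdaptiveCircuitBreaker:
    def __init__(self, latency_slo_ms: float, limit: int = 64,
                 min_limit: int = 1, max_limit: int = 4096):
        self.slo = latency_slo_ms
        self.limit = limit            # max pending/concurrent requests allowed
        self.min_limit, self.max_limit = min_limit, max_limit

    def update(self, observed_p99_ms: float) -> int:
        if observed_p99_ms > self.slo:
            # Tail latency violated: cut the limit multiplicatively so excess
            # requests are rejected (circuit broken) instead of queued.
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # Within the SLO: increase additively to regain throughput.
            self.limit = min(self.max_limit, self.limit + 4)
        return self.limit

if __name__ == "__main__":
    cb = AdaptiveCircuitBreaker(latency_slo_ms=250)
    for p99 in [120, 180, 400, 600, 230, 200, 190]:
        print(f"p99={p99}ms -> pending-request limit={cb.update(p99)}")
```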

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
micro-services, circuit breaker, performance management, control theory
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-182614 (URN), 10.1145/3447851.3458740 (DOI), 2-s2.0-85106002930 (Scopus ID), 978-1-4503-8336-3 (ISBN)
Conference
EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April, 2021
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-04-27 Created: 2021-04-27 Last updated: 2023-04-24. Bibliographically approved
Tamiru, M., Tordsson, J., Elmroth, E. & Pierre, G. (2020). An Experimental Evaluation of the Kubernetes Cluster Autoscaler in the Cloud. In: 2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom): . Paper presented at The 12th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2020), Publication only, 2020 (pp. 17-24). IEEE
An Experimental Evaluation of the Kubernetes Cluster Autoscaler in the Cloud
2020 (English) In: 2020 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), IEEE, 2020, p. 17-24. Conference paper, Published paper (Refereed)
Abstract [en]

Despite the abundant research in cloud autoscaling, autoscaling in Kubernetes, arguably the most popular cloud platform today, is largely unexplored. Kubernetes' Cluster Autoscaler can be configured to select nodes either from a single node pool (CA) or from multiple node pools (CA-NAP). We evaluate and compare these configurations using two representative applications and workloads on Google Kubernetes Engine (GKE). We report our results using monetary cost and standard autoscaling performance metrics (under- and over-provisioning accuracy, under- and over-provisioning timeshare, instability of elasticity, and deviation from the theoretical optimal autoscaler) endorsed by the SPEC Cloud Group. We show that, overall, CA-NAP outperforms CA and that autoscaling performance depends mainly on the composition of the workload. We compare our results with those of the related work and point out further configuration tuning opportunities to improve performance and cost-saving.
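The SPEC-endorsed metrics mentioned above can be computed from demand and supply traces as sketched below; the formulas follow the commonly cited SPEC Cloud Group definitions, and the example traces are invented.

```python
# Sketch of the under-/over-provisioning accuracy and timeshare metrics,
# following commonly cited SPEC Cloud Group definitions; example data invented.
import numpy as np

def autoscaling_metrics(demand: np.ndarray, supply: np.ndarray) -> dict:
    under = np.maximum(demand - supply, 0)          # missing resources
    over = np.maximum(supply - demand, 0)           # wasted resources
    t = len(demand)
    return {
        "under_provisioning_accuracy_%": 100 * np.sum(under / demand) / t,
        "over_provisioning_accuracy_%": 100 * np.sum(over / demand) / t,
        "under_provisioning_timeshare_%": 100 * np.count_nonzero(under) / t,
        "over_provisioning_timeshare_%": 100 * np.count_nonzero(over) / t,
    }

if __name__ == "__main__":
    demand = np.array([4, 6, 8, 10, 8, 6, 4], dtype=float)   # required nodes
    supply = np.array([4, 4, 8, 12, 10, 6, 4], dtype=float)  # provisioned nodes
    for name, value in autoscaling_metrics(demand, supply).items():
        print(f"{name}: {value:.1f}")
```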

Place, publisher, year, edition, pages
IEEE, 2020
Keywords
Cloud computing, autoscaling, Kubernetes
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-183330 (URN), 10.1109/CloudCom49646.2020.00002 (DOI)
Conference
The 12th IEEE International Conference on Cloud Computing Technology and Science (CloudCom 2020), Publication only, 2020
Available from: 2021-05-23 Created: 2021-05-23 Last updated: 2021-05-24. Bibliographically approved