Towards self-driving microservices
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Autonomous Distributed Systems Lab). ORCID iD: 0000-0002-0751-9695
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Mot självkörande mikrotjänster (Swedish)
Abstract [en]

In recent years, microservice architecture has become a popular method for software system design and development. This involves creating applications with multiple small services, each with multiple instances, operating as independent processes. Due to the distributed nature of microservices, communication between services presents a challenging task that becomes increasingly complex as the number of services grows. This complexity can even lead to short-term failures that can degrade application performance. Therefore, the auto-tuning of inter-service communication is necessary to prevent such failures. Service meshes were introduced to offer the necessary technical capabilities that can be employed in such scenarios. In essence, a service mesh is an infrastructure layer that includes a set of configurable proxies integrated into microservices. This enables the provision of traffic management policies such as circuit breaking and retry mechanisms to enhance microservice resilience against transient failures. However, static configuration or misconfiguration of these mechanisms is unsuitable for the dynamic environment of microservices and can lead to serious issues and performance problems, such as retry storms.
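To make the retry-storm risk mentioned above concrete, the following is a minimal sketch (not taken from the thesis) of worst-case load amplification. It assumes a linear call chain in which every caller retries a failing downstream call independently and every attempt fails. The function name `worst_case_amplification` and its parameters are illustrative assumptions.

```python
def worst_case_amplification(retries_per_hop: int, chain_depth: int) -> int:
    """Worst-case number of requests reaching the last service in a chain.

    Assumes (for illustration only) that each of the `chain_depth` callers
    issues 1 initial attempt plus up to `retries_per_hop` retries, and that
    every attempt fails, so attempts multiply at each hop.
    """
    if retries_per_hop < 0 or chain_depth < 0:
        raise ValueError("retries_per_hop and chain_depth must be non-negative")
    attempts_per_hop = 1 + retries_per_hop
    return attempts_per_hop ** chain_depth


if __name__ == "__main__":
    # One user request, 3 hops, 2 retries per hop.
    print(worst_case_amplification(retries_per_hop=2, chain_depth=3))  # 27
```

Under these assumptions a single user request can turn into 27 requests at the deepest service, which is why a statically configured retry count can aggravate an overload rather than mask it.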

The goal of this thesis is three-fold. First, it aims to investigate the impact and effectiveness of service traffic management on application reliability and availability in the presence of transient failures. Second, it focuses on auto-tuning of service traffic management to increase carried throughput and maintain carried response time. Third, this research aims to propose measures that can improve research reproducibility in the area of distributed systems ensuring that the findings can be independently verified by others. In this thesis, we aim to offer detailed guidelines on best practices for implementing research software.

To achieve these goals, this thesis delves into the current state-of-the-art in service meshes and eBPF-powered microservices, identifying current challenges and potential future directions. It analyzes the effects of circuit breaker and retry mechanisms on microservice performance and proposes adaptive controllers for both. The results show the need for such controllers that increase throughput while maintaining the tail response time of the application. Additionally, it proposes a microservice benchmark generator to enable systematic microservice benchmark generation and improve reproducibility. It also provides recommendations for improving artifact evaluation in distributed systems research by compiling all existing recommendations.

Abstract [sv] (English translation)

In recent years, microservices have become a popular architectural model for software. The model involves building applications from multiple small services, each with multiple instances that run as independent processes. The distributed nature of microservices makes communication between services more challenging. This complexity can also give rise to transient failures due to load imbalance or overload, and can degrade application performance. For this reason, dynamic configuration of the communication between microservices is necessary. A service mesh is a technology platform for managing how microservices communicate, with features for easily encrypting communication between services, measuring performance, and controlling communication flows at a fine-grained level. A service mesh is often implemented as a set of configurable proxies. This enables traffic management policies based on mechanisms such as circuit breaking and retries. However, static configuration of these mechanisms can cause serious performance problems such as low throughput and/or excessive retries.

This thesis has three goals. First, it investigates how traffic management for microservices affects the reliability and availability of applications during transient disturbances. Second, it focuses on adaptive control of microservice traffic management to increase throughput while maintaining acceptable response times. Third, it aims to improve reproducibility in distributed systems research and to ensure that research results can be more easily verified by independent parties.

To achieve these goals, the thesis examines the state of the art in service meshes and eBPF-powered microservices. It further analyzes how the use of circuit breakers and retry mechanisms affects microservice performance. Adaptive control systems for managing the configuration of both mechanisms are proposed and evaluated in extensive experiments. The results show that such controllers are necessary to increase throughput while maintaining (higher percentiles of) application response times. Adaptation is particularly important because factors such as total traffic volume, application performance, and transient failures can change rapidly.

The thesis also introduces a tool for generating arbitrary test applications, enabling more comprehensive evaluations of different kinds of research software that manage microservices. The thesis also contributes to reproducibility by studying how software artifacts should best be evaluated in distributed systems research. This is done by compiling, and extending, existing recommendations in the field.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2023, p. 61
Series
Report / UMINF, ISSN 0348-0542 ; 23:04
Keywords [en]
Microservices, Autonomic Computing, Service Mesh, Reliability, Circuit Breaker, Retry, Microservice Resiliency, Microservice Benchmarking, Reproducibility, Repeatability.
National Category
Computer Sciences; Software Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-206990; ISBN: 978-91-8070-022-1 (print); ISBN: 978-91-8070-023-8 (electronic); OAI: oai:DiVA.org:umu-206990; DiVA, id: diva2:1752772
Public defence
2023-05-26, Aula Anatomica (BIO.A.206), Biologihuset, Umeå, 13:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-05-03 Created: 2023-04-24 Last updated: 2023-04-24. Bibliographically approved
List of papers
1. Service mesh and eBPF-powered microservices: a survey and future directions
2022 (English). In: 2022 IEEE International Conference on Service-Oriented System Engineering (SOSE), IEEE, 2022, p. 176-184. Conference paper, Published paper (Refereed)
Abstract [en]

Modern software development practice has seen a profound shift in architectural design, moving from monolithic approaches to distributed, microservice-based architectures. This allows for much simpler and faster application orchestration and management, especially in cloud-based systems, with the result being that orchestration systems themselves are becoming a key focus of computing research.

Orchestration system research addresses many different subject areas, including scheduling, automation, and security. However, the key characteristic that is common throughout is the complex and dynamic nature of distributed, multi-tenant cloud-based microservice systems that must be orchestrated. This complexity has led to many challenges in areas such as inter-service communication, observability, reliability, single cluster to multi-cluster, hybrid environments, and multi-tenancy.

The concept of service meshes has been introduced to handle this complexity. In essence, a service mesh is an infrastructure layer built directly into the microservices - or the nodes of orchestrators - as a set of configurable proxies that are responsible for the management, observability, and security of microservices.

Service meshes aim to be a full networking solution for microservices; however, they also introduce overhead into a system - this can be significant for low-powered edge devices, as service mesh proxies work in user space and are responsible for processing the incoming and outgoing traffic of each service. To mitigate performance issues caused by these proxies, the industry is pushing the boundaries of monitoring and security to kernel space by employing eBPF for faster and more efficient responses. 

We propose that the movement towards the use of service meshes as a networking solution for most of the required features by industry - combined with their integration with eBPF - is the next key trend in the evolution of microservices. This paper highlights the challenges of this movement, explores its current state, and discusses future opportunities in the context of microservices. 

Place, publisher, year, edition, pages
IEEE, 2022
Series
Proceedings (IEEE International Symposium on Service-Oriented System Engineering), E-ISSN 2642-6587
Keywords
Service-Oriented Computing, Microservice, Service Mesh, eBPF
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-200303 (URN); 10.1109/SOSE55356.2022.00027 (DOI); 000942754700021 (); 2-s2.0-85141439805 (Scopus ID); 978-1-6654-7534-1 (ISBN); 978-1-6654-7535-8 (ISBN)
Conference
IEEE SOSE 2022, 16th International Conference on Service-Oriented System Engineering (SOSE), San Francisco, USA, August 15-18, 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2022-10-14 Created: 2022-10-14 Last updated: 2023-09-05. Bibliographically approved
2. An Empirical Study of Service Mesh Traffic Management Policies for Microservices
2022 (English). In: ICPE '22: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering, New York: ACM Digital Library, 2022, p. 17-27. Conference paper, Published paper (Refereed)
Abstract [en]

A microservice architecture features hundreds or even thousands of small loosely coupled services with multiple instances. Because microservice performance depends on many factors including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and increase the robustness of communication between microservices. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of various tuning parameters for circuit breaking and retries is poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically and also open up new areas of research for academics in the area of service meshes for (autonomic) microservice resource management.
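As an illustration of what a statically configured circuit breaker does, here is a toy sketch that rejects requests once a fixed number are already in flight. It is a simplified model written for this summary, not Istio's implementation and not code from the paper; the class name `ConcurrencyCircuitBreaker` and the `max_in_flight` parameter are assumptions.

```python
class ConcurrencyCircuitBreaker:
    """Toy circuit breaker that sheds load above a fixed in-flight limit."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self.max_in_flight = max_in_flight
        self.in_flight = 0  # requests currently admitted and not yet finished
        self.rejected = 0   # requests turned away so far

    def try_acquire(self) -> bool:
        """Admit a request if capacity remains; otherwise reject it."""
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight == 0:
            raise RuntimeError("release() called with no request in flight")
        self.in_flight -= 1


if __name__ == "__main__":
    breaker = ConcurrencyCircuitBreaker(max_in_flight=2)
    print([breaker.try_acquire() for _ in range(3)])  # [True, True, False]
    breaker.release()
    print(breaker.try_acquire())  # True
```

The single tuning parameter here stands in for the kind of limit whose value the experiments vary: set too low, it rejects requests the service could have handled; set too high, it never trips.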

Place, publisher, year, edition, pages
New York: ACM Digital Library, 2022
Keywords
microservices, service mesh, traffic management, circuit breaking, retry, microservice resiliency
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-193341 (URN); 10.1145/3489525.3511686 (DOI); 000883411400004 (); 2-s2.0-85128651087 (Scopus ID)
Conference
ICPE '22: ACM/SPEC International Conference on Performance Engineering, Beijing, China, April 9 - 13, 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2022-03-28 Created: 2022-03-28 Last updated: 2023-09-05. Bibliographically approved
3. Service mesh circuit breaker: From panic button to performance management tool
2021 (English). In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems, Association for Computing Machinery (ACM), 2021, p. 4-10. Conference paper, Published paper (Refereed)
Abstract [en]

Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieves neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments.

We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failures, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.
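The abstract does not describe the controller's internals, so the sketch below substitutes a generic additive-increase, multiplicative-decrease (AIMD) rule to show the general shape of a feedback loop that trades admitted load against a tail-latency target. It is not the controller proposed in the paper; `adjust_limit`, its parameters, and the sample latencies are all assumptions made for illustration.

```python
def adjust_limit(current_limit: int, tail_latency_ms: float,
                 target_ms: float, min_limit: int = 1,
                 max_limit: int = 1024) -> int:
    """Return the next concurrency limit using an AIMD rule.

    A stand-in for illustration only, not the paper's controller.
    """
    if tail_latency_ms > target_ms:
        # Tail latency violates the target: shed more load by halving the limit.
        proposed = current_limit // 2
    else:
        # Target met: probe for more throughput by admitting one more request.
        proposed = current_limit + 1
    # Keep the limit inside the configured bounds.
    return max(min_limit, min(max_limit, proposed))


if __name__ == "__main__":
    limit = 10
    # Hypothetical tail-latency samples (ms), one per control interval.
    for observed in [80.0, 90.0, 140.0, 95.0]:
        limit = adjust_limit(limit, observed, target_ms=100.0)
    print(limit)  # 7
```

In this toy run the limit grows while the target is met (10, 11, 12), is halved when the 140 ms sample exceeds the 100 ms target (6), and then resumes probing upward (7).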

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
micro-services, circuit breaker, performance management, control theory
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-182614 (URN); 10.1145/3447851.3458740 (DOI); 2-s2.0-85106002930 (Scopus ID); 978-1-4503-8336-3 (ISBN)
Conference
EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April, 2021
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-04-27 Created: 2021-04-27 Last updated: 2023-04-24. Bibliographically approved
4. Breaking the vicious circle: self-adaptive microservice circuit breaking and retry
2023 (English). In: 2023 IEEE international conference on cloud engineering: proceedings / [ed] Lisa O’Conner, IEEE Computer Society, 2023, p. 32-42, article id 24126172. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures consist of numerous, loosely coupled services with multiple instances. Service meshes aim to simplify traffic management and prevent microservice overload through circuit breaking and request retry mechanisms. Previous studies have demonstrated that the static configuration of these mechanisms is unfit for the dynamic environment of microservices. We conduct a sensitivity analysis to understand the impact of retrying across a wide range of scenarios. Based on the findings, we propose a retry controller that can also work with dynamically configured circuit breakers. We have empirically assessed our proposed controller in various scenarios, including transient overload and noisy neighbors, while enforcing adaptive circuit breaking. The results show that our proposed controller does not deviate from a well-tuned configuration while maintaining carried response time and adapting to changes. In comparison to the default static retry configuration that is mostly used in practice, our approach improves the carried throughput by up to 12x and 32x in the cases of transient overload and noisy neighbors, respectively.
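One commonly used way to bound retries dynamically is a retry budget, which permits retries only while they remain below a fraction of observed requests. The sketch below uses that technique purely as an illustration of non-static retry behavior; it is not the retry controller proposed in the paper, and the class `RetryBudget` and its `max_retry_ratio` parameter are assumptions.

```python
class RetryBudget:
    """Allow retries only while they stay below a fraction of all requests."""

    def __init__(self, max_retry_ratio: float) -> None:
        if not 0.0 <= max_retry_ratio <= 1.0:
            raise ValueError("max_retry_ratio must be between 0 and 1")
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0  # first attempts observed so far
        self.retries = 0   # retries granted so far

    def record_request(self) -> None:
        """Count one first attempt."""
        self.requests += 1

    def can_retry(self) -> bool:
        """Grant a retry if doing so keeps retries within the budget."""
        if (self.retries + 1) <= self.max_retry_ratio * self.requests:
            self.retries += 1
            return True
        return False


if __name__ == "__main__":
    budget = RetryBudget(max_retry_ratio=0.5)
    for _ in range(4):
        budget.record_request()
    # With 4 requests and a 50% budget, only 2 retries are granted.
    print([budget.can_retry() for _ in range(3)])  # [True, True, False]
```

Unlike a fixed per-request retry count, the number of retries granted here scales with traffic, so a burst of failures cannot multiply load without limit.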

Place, publisher, year, edition, pages
IEEE Computer Society, 2023
Series
Proceedings of the ... IEEE International Symposium on Requirements Engineering, E-ISSN 2332-6441
Keywords
reliability, retry mechanism, circuit breaker pattern, service mesh, microservices
National Category
Computer Engineering; Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206989 (URN); 10.1109/IC2E59103.2023.00012 (DOI); 001103216300004 (); 2-s2.0-85179510941 (Scopus ID); 979-8-3503-4394-6 (ISBN)
Conference
2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, Massachusetts, 25–28 September 2023.
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2025-04-24. Bibliographically approved
5. Hydragen: a microservice benchmark generator
2023 (English). In: 2023 IEEE 16th international conference on cloud computing (CLOUD) / [ed] C. Ardagna; N. Atukorala; P. Beckman; C.K. Chang; R.N. Chang; C. Evangelinos; J. Fan; G.C. Fox; J. Fox; C. Hagleitner; Z. Jin; T. Kosar; M. Parashar, IEEE, 2023, Vol. 2023-July, p. 189-200. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures have become ubiquitous in large-scale software systems. Experimental cloud researchers constantly propose enhanced resource management mechanisms for such systems. These mechanisms need to be evaluated using both realistic and flexible microservice benchmarks to study in which ways diverse application characteristics can affect their performance and scalability. However, current microservice benchmarks have limitations including static computational complexity, limited architectural scale, and fixed topology (i.e., number of tiers, fan-in, and fan-out characteristics).

We therefore propose HydraGen, a tool that enables researchers to systematically generate benchmarks with different computational complexities and topologies, to tackle experimental evaluation of performance at scale for web-serving applications, with a focus on inter-service communication. To illustrate the potential of our open-source tool, we demonstrate how it can reproduce an existing microservice benchmark with preserved architectural properties. We also demonstrate how HydraGen can enrich the evaluation of cloud management systems based on a case study related to traffic engineering.
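To give a feel for what parameterizing a benchmark topology can mean, the sketch below builds a layered call graph from a tier count and a fan-out. It is a hypothetical illustration only: it does not use or reproduce HydraGen's interface, it ignores fan-in and computational complexity, and the function `generate_topology` and the `svc-<tier>-<index>` naming scheme are assumptions.

```python
from __future__ import annotations


def generate_topology(tiers: int, fan_out: int) -> dict[str, list[str]]:
    """Build a layered call graph as {service: [downstream services]}.

    Each service in tier t calls `fan_out` distinct services in tier t + 1.
    """
    if tiers < 1 or fan_out < 1:
        raise ValueError("tiers and fan_out must be at least 1")
    topology: dict[str, list[str]] = {}
    current_tier = ["svc-0-0"]  # single entry-point service
    for tier in range(1, tiers):
        next_tier: list[str] = []
        for parent in current_tier:
            # Name children by tier and by their running index within it.
            children = [f"svc-{tier}-{len(next_tier) + i}" for i in range(fan_out)]
            topology[parent] = children
            next_tier.extend(children)
        current_tier = next_tier
    for leaf in current_tier:
        topology[leaf] = []  # the last tier calls nothing
    return topology


if __name__ == "__main__":
    graph = generate_topology(tiers=3, fan_out=2)
    print(len(graph))        # 7
    print(graph["svc-0-0"])  # ['svc-1-0', 'svc-1-1']
```

Varying the two arguments changes the architectural scale of the generated graph, which is the kind of knob a fixed benchmark does not offer.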

Place, publisher, year, edition, pages
IEEE, 2023
Series
IEEE International Conference on Cloud Computing, CLOUD, ISSN 2159-6182, E-ISSN 2159-6190
Keywords
microservices, benchmark generator, performance analysis, emulation, validation, cloud systems
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206987 (URN); 10.1109/CLOUD60044.2023.00030 (DOI); 001085065100020 (); 2-s2.0-85174317366 (Scopus ID); 9798350304817 (ISBN); 9798350304824 (ISBN)
Conference
16th IEEE International Conference on Cloud Computing, CLOUD 2023, Hybrid/Chicago, July 2-8, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Google
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2025-04-24. Bibliographically approved
6. Artifact evaluation for distributed systems: current practices and beyond
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields have led some scientists to argue that there is a reproducibility crisis. In response, several high-profile venues within computing have established artifact evaluation tracks, a systematic procedure for evaluating and badging research artifacts, and an increasing number of artifacts are being submitted.

This study compiles recent artifact evaluation procedures and guidelines to show how artifact evaluation in distributed systems research lags behind other computing disciplines, and/or is less unified and more complex. We further argue that current artifact assessment criteria are uncoordinated and insufficient for the unique challenges of distributed systems research. We examine the current state of the practice for artifacts and their evaluation to provide recommendations to assist artifact authors, reviewers, and track chairs. Although our recommendations alone will not resolve the repeatability and reproducibility crisis, we want to start a discussion in our community to increase both the number of submitted artifacts and their quality over time.

The ambition of this paper is to provide both artifact authors and reviewers with a one-stop shop for all required knowledge to make this successful.

Keywords
Artifact evaluation, reproducibility, repeatability, distributed systems
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206988 (URN)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Research Council, PSI
Available from: 2023-04-24 Created: 2023-04-24 Last updated: 2023-04-24

Open Access in DiVA

fulltext (4088 kB), 580 downloads
File information
File name: FULLTEXT01.pdf; File size: 4088 kB; Checksum (SHA-512):
b2c226154840c325c9a579f5eb26966fe56a134eb6b9922c1dde07a0de011bc18784355c9593eebcce365b552b56401ed8bc783a5661f4ecefd8db8f936827d4
Type: fulltext; Mimetype: application/pdf
spikblad (127 kB), 86 downloads
File information
File name: SPIKBLAD01.pdf; File size: 127 kB; Checksum (SHA-512):
a163549d30d60010a30f7b09fc3fd140cd5621b987076843781576f21cbd5f20beaefbe5947d22c228ba6e20073c29611590ab1af3ac3dcd3fac6ea3b9652893
Type: spikblad; Mimetype: application/pdf

Authority records

Saleh Sedghpour, Mohammad Reza
