Umeå University's logo

umu.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Service mesh circuit breaker: From panic button to performance management tool
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Autonomous Distributed Systems Lab)ORCID iD: 0000-0002-0751-9695
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Autonomous Distributed Systems Lab)ORCID iD: 0000-0003-0106-3049
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Autonomous Distributed Systems Lab)ORCID iD: 0000-0002-5970-7276
2021 (English)In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems, Association for Computing Machinery (ACM), 2021, p. 4-10Conference paper, Published paper (Refereed)
Abstract [en]

Site Reliability Engineers are at the center of two tensions: On one hand, they need to respond to alerts within a short time, to restore a non-functional system. On the other hand, short response times is disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms are proposed handle overload and mitigate the faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios achieve neither due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments.

We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigate failure, but keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that our controller maintains tail response time below the given threshold 98% of the time (including cold starts) on average with an availability of 70% with 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features a 63% availability, 30% circuit broken requests, and more than 5% of requests timing out.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021. p. 4-10
Keywords [en]
micro-services, circuit breaker, performance management, control theory
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:umu:diva-182614DOI: 10.1145/3447851.3458740Scopus ID: 2-s2.0-85106002930ISBN: 978-1-4503-8336-3 (electronic)OAI: oai:DiVA.org:umu-182614DiVA, id: diva2:1547603
Conference
EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April, 2021
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)Available from: 2021-04-27 Created: 2021-04-27 Last updated: 2023-04-24Bibliographically approved
In thesis
1. Towards self-driving microservices
Open this publication in new window or tab >>Towards self-driving microservices
2023 (English)Doctoral thesis, comprehensive summary (Other academic)
Alternative title[sv]
Mot självkörande mikrotjänster
Abstract [en]

In recent years, microservice architecture has become a popular method for software system design and development. This involves creating applications with multiple small services, each with multiple instances, operating as independent processes. Due to the distributed nature of microservices, communication between services presents a challenging task that becomes increasingly complex as the number of services grows. This complexity can even lead to short-term failures that can degrade application performance. Therefore, the auto-tuning of inter-service communication is necessary to prevent such failures. Service meshes were introduced to offer the necessary technical capabilities that can be employed in such scenarios. In essence, a service mesh is an infrastructure layer that includes a set of configurable proxies integrated into microservices. This enables the provision of traffic management policies such as circuit breaking and retry mechanisms to enhance microservice resilience against transient failures. However, static configuration or misconfiguration of these mechanisms is unsuitable for the dynamic environment of microservices and can lead to serious issues and performance problems, such as retry storms.

The goal of this thesis is three-fold. First, it aims to investigate the impact and effectiveness of service traffic management on application reliability and availability in the presence of transient failures. Second, it focuses on auto-tuning of service traffic management to increase carried throughput and maintain carried response time. Third, this research aims to propose measures that can improve research reproducibility in the area of distributed systems ensuring that the findings can be independently verified by others. In this thesis, we aim to offer detailed guidelines on best practices for implementing research software.

To achieve these goals, this thesis delves into the current state-of-the-art in service meshes and eBPF-powered microservices, identifying current challenges and potential future directions. It analyzes the effects of circuit breaker and retry mechanisms on microservice performance and proposes adaptive controllers for both. The results show the need for such controllers that increase throughput while maintaining the tail response time of the application. Additionally, it proposes a microservice benchmark generator to enable systematic microservice benchmark generation and improve reproducibility. It also provides recommendations for improving artifact evaluation in distributed systems research by compiling all existing recommendations.

Abstract [sv]

Mikrotjänster har de senaste åren blivit en populär arkitekturmodell för programvara. Modellen innebär att man skapar applikationer med flera små tjänster, var och en med flera instanser som fungerar som oberoende processer. Mikrotjänsters distribuerade natur gör kommunikationen mellan tjänster mer utmanande. Denna komplexitet kan även ge upphov till tillfälliga fel på grund av lastobalans eller överbelastning och även försämra applikationers prestanda. Av denna anledning är dynamisk konfigurationav kommunikationen mellan mikrotjänster nödvändig. En service mesh är en teknikplattform för att hantera hur mikrotjänster kommunicerar, med funktioner för att enkelt kryptera kommunikation mellan tjänster, mäta prestanda och finkorningt styra kommunikationsflöden.En service mesh implementeras ofta som en uppsättning konfigurerbara proxies. Detta möjliggör trafikhanteringspolicies baserade på mekanismer som kretsbrytning och omsändningar. Statisk konfiguration av dessa mekanismer kan dock ge allvarliga prestandaproblem såsom  låg genomströmning och/eller omfattande omsändningar.

Denna avhandling har tre mål. För det första undersöker den hur trafikhantering för mikrotjänster påverkar tillförlitligheten och tillgängligheten för applikationer vid tillfälliga störningar. För det andra fokuserar den på adaptiv reglering av trafikhantering av mikrotjänster för att öka genomströmningen och samtidigt bibehålla acceptabla svarstider. För det tredje syftar den till att förbättra reproducerbarheten i forskning inom distribuerade system och se till att forskningsresultat enklare kan verifieras av oberoende. 

För att uppnå dessa mål undersöker avhandlingen den tekniska frontlinjen inom service mesh och mikrotjänster driva av eBPF-tekniken. Avhandlingen analyserar vidare hur användandet av kretsbrytare och omsändningsmekanismer påverkar mikrotjänsters prestanda. Adaptiva reglersystem för att hantera konfiguration av båda dessa mekanismer föreslås och utvärderas i omfattande experimentent. Resultaten visar att sådana regulatorer är nödvändiga för att öka genomströmningen och samtidigt bibehålla (högre percentiler av) applikationers svarstider. Adaption är särskilt viktigt då faktorer som totala trafikmängden, applikationers prestanda, tillfälliga fel, etc. kan förändras snabbt. 

Avhandlingen introducerar även ett verktyg för att generera godtyckliga testapplikationer för att kunna genomföra mer heltäckande utvärderingar av olika typer av forskningsprogramvara som hanterar mikrotjänster. Avhandlingen bidrar även till reproducerbarhet genom att studera hur programvaru-artefakter bäst bör utvärderas inom forskningsområdet distribuerade system. Detta sker genom att sammanställa, och utöka, befintliga rekommendationer inom området.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2023. p. 61
Series
Report / UMINF, ISSN 0348-0542 ; 23:04
Keywords
Microservices, Autonomic Computing, Service Mesh, Reliability, Circuit Breaker, Retry, Microservice Resiliency, Microservice Benchmarking, Reproducibility, Repeatability.
National Category
Computer Sciences Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206990 (URN)978-91-8070-022-1 (ISBN)978-91-8070-023-8 (ISBN)
Public defence
2023-05-26, Aula Anatomica (BIO.A.206), Biologihuset, Umeå, 13:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-05-03 Created: 2023-04-24 Last updated: 2023-04-24Bibliographically approved

Open Access in DiVA

fulltext(3654 kB)1096 downloads
File information
File name FULLTEXT01.pdfFile size 3654 kBChecksum SHA-512
07679025cd03b00aa2c663ddbe71d216e88ad59b8f8d20a5170548b7f4933de6e1be2d4a81797d5a3b61bdd5fa6708812f2197d1c408baf85a5a4b57a2640574
Type fulltextMimetype application/pdf

Other links

Publisher's full textScopus

Authority records

Saleh Sedghpour, Mohammad RezaKlein, CristianTordsson, Johan

Search in DiVA

By author/editor
Saleh Sedghpour, Mohammad RezaKlein, CristianTordsson, Johan
By organisation
Department of Computing Science
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar
Total: 1098 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

doi
isbn
urn-nbn

Altmetric score

doi
isbn
urn-nbn
Total: 849 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf