Saleh Sedghpour, Mohammad Reza (ORCID iD: orcid.org/0000-0002-0751-9695)
Publications (10 of 10)
Meyers, C., Saleh Sedghpour, M. R., Löfstedt, T. & Elmroth, E. (2025). A training rate and survival heuristic for inference and robustness evaluation (Trashfire). In: Proceedings of 2024 International Conference on Machine Learning and Cybernetics. Paper presented at 2024 International Conference on Machine Learning and Cybernetics (ICMLC), Miyazaki, Japan, September 20-23, 2024 (pp. 613-623). IEEE
A training rate and survival heuristic for inference and robustness evaluation (Trashfire)
2025 (English). In: Proceedings of 2024 International Conference on Machine Learning and Cybernetics, IEEE, 2025, p. 613-623. Conference paper, Published paper (Refereed)
Abstract [en]

Machine learning models, deep neural networks in particular, have performed remarkably well on benchmark datasets across a wide variety of domains. However, the ease of finding adversarial counter-examples remains a persistent problem: training times are measured in hours or days, while the time needed to find a successful adversarial counter-example is measured in seconds. Much work has gone into generating and defending against these adversarial counter-examples; however, the relative costs of attacks and defences are rarely discussed. Additionally, machine learning research is almost entirely guided by test/train metrics, but using these metrics to meet industry standards would require billions of samples. The present work addresses the problem of understanding and predicting how particular model hyper-parameters influence the performance of a model in the presence of an adversary. The proposed approach uses survival models, worst-case examples, and a cost-aware analysis to precisely and accurately reject a particular model change during routine model training procedures, rather than relying on real-world deployment, expensive formal verification methods, or accurate simulations of very complicated systems (e.g., digitally recreating every part of a car or a plane). Through an evaluation of many pre-processing techniques, adversarial counter-examples, and neural network configurations, we conclude that deeper models do offer marginal gains in survival times compared to shallower counterparts. However, we show that those gains are driven more by the model inference time than by inherent robustness properties. Using the proposed methodology, we show that ResNet is hopelessly insecure against even the simplest of white-box attacks.

Place, publisher, year, edition, pages
IEEE, 2025
Series
Proceedings (International Conference on Machine Learning and Cybernetics), ISSN 2160-133X, E-ISSN 2160-1348
Keywords
Machine Learning, Computer Vision, Neural Networks, Adversarial AI, Trustworthy AI
National Category
Artificial Intelligence; Security, Privacy and Cryptography; Computer Sciences
Identifiers
urn:nbn:se:umu:diva-237109 (URN); 10.1109/ICMLC63072.2024.10935101 (DOI); 2-s2.0-105002274020 (Scopus ID); 9798331528041 (ISBN); 9798331528058 (ISBN)
Conference
2024 International Conference on Machine Learning and Cybernetics (ICMLC), Miyazaki, Japan, September 20-23, 2024
Funder
Knut and Alice Wallenberg Foundation, 2019.0352; eSSENCE - An eScience Collaboration
Available from: 2025-04-02. Created: 2025-04-02. Last updated: 2025-05-19. Bibliographically approved
Saleh Sedghpour, M. R., Garlan, D., Schmerl, B., Klein, C. & Tordsson, J. (2023). Breaking the vicious circle: self-adaptive microservice circuit breaking and retry. In: Lisa O’Conner (Ed.), 2023 IEEE international conference on cloud engineering: proceedings. Paper presented at 2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, Massachusetts, 25–28 September 2023. (pp. 32-42). IEEE Computer Society, Article ID 24126172.
Breaking the vicious circle: self-adaptive microservice circuit breaking and retry
2023 (English). In: 2023 IEEE international conference on cloud engineering: proceedings / [ed] Lisa O’Conner, IEEE Computer Society, 2023, p. 32-42, article id 24126172. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures consist of numerous, loosely coupled services with multiple instances. Service meshes aim to simplify traffic management and prevent microservice overload through circuit breaking and request retry mechanisms. Previous studies have demonstrated that static configuration of these mechanisms is unfit for the dynamic environment of microservices. We conduct a sensitivity analysis to understand the impact of retrying across a wide range of scenarios. Based on the findings, we propose a retry controller that can also work with dynamically configured circuit breakers. We empirically assess the proposed controller in various scenarios, including transient overload and noisy neighbors, while enforcing adaptive circuit breaking. The results show that the proposed controller does not deviate from a well-tuned configuration while maintaining carried response time and adapting to changes. Compared to the default static retry configuration that is mostly used in practice, our approach improves carried throughput by up to 12x and 32x in the transient overload and noisy neighbor scenarios, respectively.

Place, publisher, year, edition, pages
IEEE Computer Society, 2023
Series
Proceedings of the ... IEEE International Symposium on Requirements Engineering, E-ISSN 2332-6441
Keywords
reliability, retry mechanism, circuit breaker pattern, service mesh, microservices
National Category
Computer Engineering; Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206989 (URN); 10.1109/IC2E59103.2023.00012 (DOI); 001103216300004 (ISI); 2-s2.0-85179510941 (Scopus ID); 979-8-3503-4394-6 (ISBN)
Conference
2023 IEEE International Conference on Cloud Engineering (IC2E), Boston, Massachusetts, 25–28 September 2023.
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24. Created: 2023-04-24. Last updated: 2025-04-24. Bibliographically approved
Saleh Sedghpour, M. R., Obeso Duque, A., Cai, X., Skubic, B., Elmroth, E., Klein, C. & Tordsson, J. (2023). Hydragen: a microservice benchmark generator. In: C. Ardagna; N. Atukorala; P. Beckman; C.K. Chang; R.N. Chang; C. Evangelinos; J. Fan; G.C. Fox; J. Fox; C. Hagleitner; Z. Jin; T. Kosar; M. Parashar (Eds.), 2023 IEEE 16th international conference on cloud computing (CLOUD). Paper presented at 16th IEEE International Conference on Cloud Computing, CLOUD 2023, Hybrid/Chicago, July 2-8, 2023 (pp. 189-200). IEEE, 2023-July
Hydragen: a microservice benchmark generator
2023 (English). In: 2023 IEEE 16th international conference on cloud computing (CLOUD) / [ed] C. Ardagna; N. Atukorala; P. Beckman; C.K. Chang; R.N. Chang; C. Evangelinos; J. Fan; G.C. Fox; J. Fox; C. Hagleitner; Z. Jin; T. Kosar; M. Parashar, IEEE, 2023, Vol. 2023-July, p. 189-200. Conference paper, Published paper (Refereed)
Abstract [en]

Microservice-based architectures have become ubiquitous in large-scale software systems. Experimental cloud researchers constantly propose enhanced resource management mechanisms for such systems. These mechanisms need to be evaluated using both realistic and flexible microservice benchmarks to study in which ways diverse application characteristics can affect their performance and scalability. However, current microservice benchmarks have limitations, including static computational complexity, limited architectural scale, and fixed topology (i.e., number of tiers, fan-in, and fan-out characteristics).

We therefore propose HydraGen, a tool that enables researchers to systematically generate benchmarks with different computational complexities and topologies, to tackle experimental evaluation of performance at scale for web-serving applications, with a focus on inter-service communication. To illustrate the potential of our open-source tool, we demonstrate how it can reproduce an existing microservice benchmark with preserved architectural properties. We also demonstrate how HydraGen can enrich the evaluation of cloud management systems based on a case study related to traffic engineering.

Place, publisher, year, edition, pages
IEEE, 2023
Series
IEEE International Conference on Cloud Computing, CLOUD, ISSN 2159-6182, E-ISSN 2159-6190
Keywords
microservices, benchmark generator, performance analysis, emulation, validation, cloud systems
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206987 (URN); 10.1109/CLOUD60044.2023.00030 (DOI); 001085065100020 (ISI); 2-s2.0-85174317366 (Scopus ID); 9798350304817 (ISBN); 9798350304824 (ISBN)
Conference
16th IEEE International Conference on Cloud Computing, CLOUD 2023, Hybrid/Chicago, July 2-8, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Google
Note

Originally included in thesis in manuscript form. 

Available from: 2023-04-24. Created: 2023-04-24. Last updated: 2025-04-24. Bibliographically approved
Saleh Sedghpour, M. R. (2023). Towards self-driving microservices. (Doctoral dissertation). Umeå: Umeå University
Towards self-driving microservices
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Mot självkörande mikrotjänster
Abstract [en]

In recent years, microservice architecture has become a popular method for software system design and development. This involves creating applications with multiple small services, each with multiple instances, operating as independent processes. Due to the distributed nature of microservices, communication between services presents a challenging task that becomes increasingly complex as the number of services grows. This complexity can even lead to short-term failures that can degrade application performance. Therefore, the auto-tuning of inter-service communication is necessary to prevent such failures. Service meshes were introduced to offer the necessary technical capabilities that can be employed in such scenarios. In essence, a service mesh is an infrastructure layer that includes a set of configurable proxies integrated into microservices. This enables the provision of traffic management policies such as circuit breaking and retry mechanisms to enhance microservice resilience against transient failures. However, static configuration or misconfiguration of these mechanisms is unsuitable for the dynamic environment of microservices and can lead to serious issues and performance problems, such as retry storms.

The goal of this thesis is three-fold. First, it aims to investigate the impact and effectiveness of service traffic management on application reliability and availability in the presence of transient failures. Second, it focuses on auto-tuning of service traffic management to increase carried throughput and maintain carried response time. Third, this research aims to propose measures that can improve research reproducibility in the area of distributed systems, ensuring that the findings can be independently verified by others. In this thesis, we aim to offer detailed guidelines on best practices for implementing research software.

To achieve these goals, this thesis delves into the current state of the art in service meshes and eBPF-powered microservices, identifying current challenges and potential future directions. It analyzes the effects of circuit breaker and retry mechanisms on microservice performance and proposes adaptive controllers for both. The results show the need for such controllers, which increase throughput while maintaining the tail response time of the application. Additionally, it proposes a microservice benchmark generator to enable systematic microservice benchmark generation and improve reproducibility. It also provides recommendations for improving artifact evaluation in distributed systems research by compiling and extending existing recommendations.

Abstract [sv] (English translation)

In recent years, microservices have become a popular architectural model for software. The model involves building applications from several small services, each with multiple instances that run as independent processes. The distributed nature of microservices makes communication between services more challenging. This complexity can also give rise to transient failures due to load imbalance or overload, and can degrade application performance. For this reason, dynamic configuration of the communication between microservices is necessary. A service mesh is a technology platform for managing how microservices communicate, with features for easily encrypting communication between services, measuring performance, and controlling communication flows at a fine-grained level. A service mesh is often implemented as a set of configurable proxies. This enables traffic management policies based on mechanisms such as circuit breaking and retries. However, static configuration of these mechanisms can cause serious performance problems, such as low throughput and/or excessive retries.

This thesis has three goals. First, it investigates how microservice traffic management affects application reliability and availability during transient disturbances. Second, it focuses on adaptive control of microservice traffic management to increase throughput while maintaining acceptable response times. Third, it aims to improve reproducibility in distributed systems research and to ensure that research results can more easily be verified independently.

To achieve these goals, the thesis examines the technical state of the art in service meshes and eBPF-powered microservices. The thesis further analyzes how the use of circuit breakers and retry mechanisms affects microservice performance. Adaptive control systems for managing the configuration of both mechanisms are proposed and evaluated in extensive experiments. The results show that such controllers are necessary to increase throughput while maintaining application response times (at higher percentiles). Adaptation is particularly important because factors such as total traffic volume, application performance, and transient failures can change rapidly.

The thesis also introduces a tool for generating arbitrary test applications, enabling more comprehensive evaluations of various kinds of research software that manage microservices. The thesis also contributes to reproducibility by studying how software artifacts are best evaluated in distributed systems research. This is done by compiling, and extending, existing recommendations in the field.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2023. p. 61
Series
Report / UMINF, ISSN 0348-0542 ; 23:04
Keywords
Microservices, Autonomic Computing, Service Mesh, Reliability, Circuit Breaker, Retry, Microservice Resiliency, Microservice Benchmarking, Reproducibility, Repeatability.
National Category
Computer Sciences; Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206990 (URN); 978-91-8070-022-1 (ISBN); 978-91-8070-023-8 (ISBN)
Public defence
2023-05-26, Aula Anatomica (BIO.A.206), Biologihuset, Umeå, 13:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2023-05-03. Created: 2023-04-24. Last updated: 2023-04-24. Bibliographically approved
Saleh Sedghpour, M. R., Klein, C. & Tordsson, J. (2022). An Empirical Study of Service Mesh Traffic Management Policies for Microservices. In: ICPE '22: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering. Paper presented at ICPE '22: ACM/SPEC International Conference on Performance Engineering, Beijing, China, April 9-13, 2022 (pp. 17-27). New York: ACM Digital Library
An Empirical Study of Service Mesh Traffic Management Policies for Microservices
2022 (English). In: ICPE '22: Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering, New York: ACM Digital Library, 2022, p. 17-27. Conference paper, Published paper (Refereed)
Abstract [en]

A microservice architecture features hundreds or even thousands of small, loosely coupled services with multiple instances. Because microservice performance depends on many factors, including the workload, inter-service traffic management is complex in such dynamic environments. Service meshes aim to handle this complexity and to facilitate management, observability, and communication between microservices. Service meshes provide various traffic management policies, such as circuit breaking and retry mechanisms, which are claimed to protect microservices against overload and increase the robustness of communication between microservices. However, there have been no systematic studies on the effects of these mechanisms on microservice performance and robustness. Furthermore, the exact impact of the various tuning parameters for circuit breaking and retries is poorly understood. This work presents a large set of experiments conducted to investigate these issues using a representative microservice benchmark in a Kubernetes testbed with the widely used Istio service mesh. Our experiments reveal effective configurations of circuit breakers and retries. The findings presented will be useful to engineers seeking to configure service meshes more systematically and also open up new areas of research for academics in the area of service meshes for (autonomic) microservice resource management.

Place, publisher, year, edition, pages
New York: ACM Digital Library, 2022
Keywords
microservices, service mesh, traffic management, circuit breaking, retry, microservice resiliency
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-193341 (URN); 10.1145/3489525.3511686 (DOI); 000883411400004 (ISI); 2-s2.0-85128651087 (Scopus ID)
Conference
ICPE '22: ACM/SPEC International Conference on Performance Engineering, Beijing, China, April 9-13, 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2022-03-28. Created: 2022-03-28. Last updated: 2023-09-05. Bibliographically approved
Saleh Sedghpour, M. R. & Ulfsparre, S. I. (2022). Integration of research software into the EOSC infrastructure: lessons learned from computer science. Paper presented at EOSC Symposium 2022, Prague, Czech Republic, November 14-17, 2022.
Integration of research software into the EOSC infrastructure: lessons learned from computer science
2022 (English). Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

ABOUT THE SESSION: FAIR research software is essential to the quality assurance and reusability of research. As EOSC evolves, it is crucial to integrate infrastructures to share, collaborate, evaluate, reproduce, and preserve research software for use in the academic landscape and beyond. With the publication of the FAIR Principles for Research Software (FAIR4RS Principles), we expect that there will be an increase in demand for FAIR software, and related platforms, in all fields of research.

During this session, we will explore how computer science practices can inform further EOSC infrastructure development. Such practices can advise the development of similar methods and platforms in a broad range of other academic domains. In research fields with similar characteristics, they should also be directly transferable. Besides practical implementations, the session may also inspire policy development for open science research software practices.

The session will take place as an interactive lecture.

RELEVANCE FOR EOSC: FAIR research software is essential to research quality assurance and reusability. As EOSC evolves, it is crucial to integrate infrastructures to share, collaborate, evaluate, reproduce, and preserve research software for use in the academic landscape and beyond. During this session, we will explore how computer science practices can inform further EOSC infrastructure development.

Keywords
FAIR Software, Open Science, Reproducibility, Repeatability, Transparency, Reusability, Metadata, Metadata Standards, Computer Science, Computing Science, Artifact Evaluation, Artefact Evaluation, Best Practice, Badging System, Recommendations, Policy, Training, Reviewers, Researchers, Research Communities, EOSC, European Open Science Cloud, Policy Makers, Publishers, Funding Agencies
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-204900 (URN); 10.5281/zenodo.7643323 (DOI)
Conference
EOSC Symposium 2022, Prague, Czech Republic, November 14-17, 2022
Note

The DOI links to the slides on Zenodo. A video recording is also available on YouTube.

Available from: 2023-02-15. Created: 2023-02-15. Last updated: 2024-08-27. Bibliographically approved
Saleh Sedghpour, M. R. & Townend, P. (2022). Service mesh and eBPF-powered microservices: a survey and future directions. In: 2022 IEEE International Conference on Service-Oriented System Engineering (SOSE). Paper presented at IEEE SOSE 2022, 16th International Conference on Service-Oriented System Engineering (SOSE), San Francisco, USA, August 15-18, 2022 (pp. 176-184). IEEE
Service mesh and eBPF-powered microservices: a survey and future directions
2022 (English). In: 2022 IEEE International Conference on Service-Oriented System Engineering (SOSE), IEEE, 2022, p. 176-184. Conference paper, Published paper (Refereed)
Abstract [en]

Modern software development practice has seen a profound shift in architectural design, moving from monolithic approaches to distributed, microservice-based architectures. This allows for much simpler and faster application orchestration and management, especially in cloud-based systems, with the result being that orchestration systems themselves are becoming a key focus of computing research.

Orchestration system research addresses many different subject areas, including scheduling, automation, and security. However, the key characteristic that is common throughout is the complex and dynamic nature of the distributed, multi-tenant, cloud-based microservice systems that must be orchestrated. This complexity has led to many challenges in areas such as inter-service communication, observability, reliability, the move from single-cluster to multi-cluster deployments, hybrid environments, and multi-tenancy.

The concept of service meshes has been introduced to handle this complexity. In essence, a service mesh is an infrastructure layer built directly into the microservices - or the nodes of orchestrators - as a set of configurable proxies that are responsible for the management, observability, and security of microservices.

Service meshes aim to be a full networking solution for microservices; however, they also introduce overhead into a system - this can be significant for low-powered edge devices, as service mesh proxies work in user space and are responsible for processing the incoming and outgoing traffic of each service. To mitigate performance issues caused by these proxies, the industry is pushing the boundaries of monitoring and security to kernel space by employing eBPF for faster and more efficient responses. 

We propose that the movement towards using service meshes as the networking solution for most of the features that industry requires, combined with their integration with eBPF, is the next key trend in the evolution of microservices. This paper highlights the challenges of this movement, explores its current state, and discusses future opportunities in the context of microservices.

Place, publisher, year, edition, pages
IEEE, 2022
Series
Proceedings (IEEE International Symposium on Service-Oriented System Engineering), E-ISSN 2642-6587
Keywords
Service-Oriented Computing, Microservice, Service Mesh, eBPF
National Category
Computer Systems
Research subject
Computer Systems
Identifiers
urn:nbn:se:umu:diva-200303 (URN); 10.1109/SOSE55356.2022.00027 (DOI); 000942754700021 (ISI); 2-s2.0-85141439805 (Scopus ID); 978-1-6654-7534-1 (ISBN); 978-1-6654-7535-8 (ISBN)
Conference
IEEE SOSE 2022, 16th International Conference on Service-Oriented System Engineering (SOSE), San Francisco, USA, August 15-18, 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2022-10-14. Created: 2022-10-14. Last updated: 2023-09-05. Bibliographically approved
Saleh Sedghpour, M. R., Klein, C. & Tordsson, J. (2021). Service mesh circuit breaker: From panic button to performance management tool. In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems. Paper presented at EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April 2021 (pp. 4-10). Association for Computing Machinery (ACM)
Service mesh circuit breaker: From panic button to performance management tool
2021 (English). In: HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems, Association for Computing Machinery (ACM), 2021, p. 4-10. Conference paper, Published paper (Refereed)
Abstract [en]

Site Reliability Engineers are at the center of two tensions: on one hand, they need to respond to alerts within a short time to restore a non-functional system; on the other hand, short response times are disruptive to everyday life and lead to alert fatigue. To alleviate this tension, many resource management mechanisms have been proposed to handle overload and mitigate faults. One recent such mechanism is circuit breaking in service meshes. Circuit breaking rejects incoming requests to protect latency at the expense of availability (successfully answered requests), but in many scenarios it achieves neither, due to the difficulty of knowing when to trigger circuit breaking in highly dynamic microservice environments.

We propose an adaptive circuit breaking mechanism, implemented through an adaptive controller, that not only avoids overload and mitigates failures, but also keeps the tail response time below a given threshold while maximizing service throughput. Our proposed controller is experimentally compared with a static circuit breaker across a wide set of overload scenarios in a testbed based on Istio and Kubernetes. The results show that, on average, our controller keeps tail response time below the given threshold 98% of the time (including cold starts), with an availability of 70% and 29% of requests circuit broken. This compares favorably to a static circuit breaker configuration, which features 63% availability, 30% circuit-broken requests, and more than 5% of requests timing out.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
micro-services, circuit breaker, performance management, control theory
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-182614 (URN); 10.1145/3447851.3458740 (DOI); 2-s2.0-85106002930 (Scopus ID); 978-1-4503-8336-3 (ISBN)
Conference
EuroSys '21: Sixteenth European Conference on Computer Systems, Online, UK, April, 2021
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2021-04-27. Created: 2021-04-27. Last updated: 2023-04-24. Bibliographically approved
Meyers, C., Saleh Sedghpour, M. R., Elmroth, E. & Löfstedt, T. A cost-aware approach to adversarial robustness in neural networks.
A cost-aware approach to adversarial robustness in neural networks
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Considering the growing prominence of production-level AI and the threat of adversarial attacks that can evade a model at run-time, evaluating the robustness of models to these evasion attacks is of critical importance. Additionally, testing model changes likely means deploying the models to a physical system (e.g., a car, a medical imaging device, or a drone) to see how it affects performance, making untested changes a public problem that reduces development speed, increases the cost of development, and makes it difficult (if not impossible) to separate cause from effect. In this work, we use survival analysis as a cloud-native, time-efficient, and precise method for predicting model performance in the presence of adversarial noise. For neural networks in particular, the relationships between the learning rate, batch size, training time, convergence time, and deployment cost are highly complex, so researchers generally rely on benchmark datasets to assess the ability of a model to generalize beyond the training data. However, in practice, this means that each model configuration needs to be evaluated against real-world deployment samples, which can be prohibitively expensive or time-consuming to collect, especially when other parts of the software or hardware stack are developed in parallel. To address this, we propose using accelerated failure time models to measure the effect of hardware choice, batch size, number of epochs, and test-set accuracy, using adversarial attacks to induce failures on a reference model architecture before deploying the model to the real world. We evaluate several GPU types and use the Tree Parzen Estimator to maximize model robustness and minimize model run-time simultaneously. This provides a way to evaluate the model and optimise it in a single step, while also allowing us to model the effect of model parameters on training time, prediction time, and accuracy.
Using this technique, we demonstrate that newer, more-powerful hardware does decrease the training time, but with a monetary and power cost that far outpaces the marginal gains in accuracy.

Keywords
artificial intelligence, machine learning, adversarial AI, optimisation, compliance
National Category
Computer Sciences
Research subject
Computer Science; Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-238922 (URN)
Funder
Knut and Alice Wallenberg Foundation, 2019.0352
Available from: 2025-05-16. Created: 2025-05-16. Last updated: 2025-05-19. Bibliographically approved
Saleh Sedghpour, M. R., Papadopoulos, A. V., Klein, C. & Tordsson, J. Artifact evaluation for distributed systems: current practices and beyond.
Artifact evaluation for distributed systems: current practices and beyond
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Although repeatability and reproducibility are essential in science, failed attempts to replicate results across diverse fields have led some scientists to argue that there is a reproducibility crisis. In response, several high-profile venues within computing have established artifact evaluation tracks, systematic procedures for evaluating and badging research artifacts, which receive an increasing number of submitted artifacts.

This study compiles recent artifact evaluation procedures and guidelines to show how artifact evaluation in distributed systems research lags behind other computing disciplines, and/or is less unified and more complex. We further argue that current artifact assessment criteria are uncoordinated and insufficient for the unique challenges of distributed systems research. We examine the current state of the practice for artifacts and their evaluation to provide recommendations to assist artifact authors, reviewers, and track chairs. Although our recommendations alone will not resolve the repeatability and reproducibility crisis, we want to start a discussion in our community to increase both the number of submitted artifacts and their quality over time.

The ambition of this paper is to provide both artifact authors and reviewers with a one-stop shop for all required knowledge to make this successful.

Keywords
Artifact evaluation, reproducibility, repeatability, distributed systems
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-206988 (URN)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Swedish Research Council, PSI
Available from: 2023-04-24. Created: 2023-04-24. Last updated: 2023-04-24