CloudResilienceML: ensuring robustness of machine learning models in dynamic cloud systems
Umeå University, Faculty of Science and Technology, Department of Computing Science (ADSlab). ORCID iD: 0000-0002-9156-3364
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-8097-1143
RIKEN, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan.
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-2633-6798
2024 (English). In: The 17th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2024), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Machine Learning (ML) models play a crucial role in enabling intelligent decision-making across diverse cloud system management tasks. However, as cloud operational data evolves, shifts in data distributions can occur, leading to a gradual degradation of deployed ML models and, consequently, a reduction in the overall efficiency of cloud systems.

We introduce CloudResilienceML, a framework designed to maintain the resilience of ML models in dynamic cloud environments. CloudResilienceML includes: (1) a performance degradation detection mechanism, using dynamic programming change point detection to identify when a model needs retraining, and (2) a data valuation method to select a minimal, effective training set for retraining, reducing unnecessary overhead.
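As a rough illustration of the first component, the sketch below applies dynamic programming change point detection to a series of model prediction errors. It is not the authors' implementation: the function names (`segment_cost`, `detect_change_points`), the squared-deviation cost, the fixed number of change points, and the `min_size` parameter are all assumptions made for this example, since the abstract does not specify them.

```python
from functools import lru_cache


def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the mean over errors[start:end]."""
    n = end - start
    s = prefix[end] - prefix[start]
    sq = prefix_sq[end] - prefix_sq[start]
    return sq - s * s / n


def detect_change_points(errors, n_breakpoints, min_size=2):
    """Return the indices that split `errors` into n_breakpoints + 1
    segments with minimal total within-segment squared deviation.

    Assumption: the number of change points is known in advance.
    """
    n = len(errors)
    if min_size < 1 or n < (n_breakpoints + 1) * min_size:
        raise ValueError("series too short for the requested segmentation")

    # Prefix sums make every segment cost an O(1) lookup.
    prefix, prefix_sq = [0.0], [0.0]
    for e in errors:
        prefix.append(prefix[-1] + e)
        prefix_sq.append(prefix_sq[-1] + e * e)

    @lru_cache(maxsize=None)
    def best(end, k):
        """Best (cost, breakpoints) for errors[:end] using k breakpoints."""
        if k == 0:
            return segment_cost(prefix, prefix_sq, 0, end), ()
        best_cost, best_bkps = float("inf"), ()
        # t is the start of the last segment; earlier segments need room too.
        for t in range(k * min_size, end - min_size + 1):
            left_cost, left_bkps = best(t, k - 1)
            cost = left_cost + segment_cost(prefix, prefix_sq, t, end)
            if cost < best_cost:
                best_cost, best_bkps = cost, left_bkps + (t,)
        return best_cost, best_bkps

    return list(best(n, n_breakpoints)[1])


# Hypothetical error series: the model degrades from index 5 onwards.
errors = [0.10, 0.12, 0.09, 0.11, 0.10, 0.50, 0.52, 0.49, 0.51, 0.50]
print(detect_change_points(errors, n_breakpoints=1))
```

In this toy setting, a detected change point in the error series would be the signal that the deployed model may need retraining; how the framework itself maps detections to retraining decisions is not described in this record.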

Evaluated with two ML models on real cloud operational data, CloudResilienceML significantly boosts model resilience and reduces retraining costs compared to incremental learning and data drift-based retraining. In high-drift scenarios (e.g., Wikipedia trace), it reduces overhead by 50% compared to concept drift retraining and by 91% compared to incremental retraining. In stable environments (e.g., Microsoft Azure trace), CloudResilienceML maintains high accuracy with retraining costs 96% lower than concept drift methods and 86% lower than incremental retraining.

Place, publisher, year, edition, pages
2024.
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-231946
OAI: oai:DiVA.org:umu-231946
DiVA, id: diva2:1923287
Conference
The 17th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2024), University of Sharjah, Sharjah, United Arab Emirates, 16-19 December, 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
eSSENCE - An eScience Collaboration
Available from: 2024-12-21 Created: 2024-12-21 Last updated: 2025-01-02

Open Access in DiVA

No full text in DiVA

Authority records

Nguyen, Chanh Le Tan; Kidane, Lidia; Elmroth, Erik
