2024 (English). In: Proceedings - 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing, UCC 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 73-81. Conference paper, Published paper (Refereed)
Abstract [en]
Machine Learning (ML) models play a crucial role in enabling intelligent decision-making across diverse cloud system management tasks. However, as cloud operational data evolves, shifts in data distributions can occur, leading to a gradual degradation of deployed ML models and, consequently, a reduction in the overall efficiency of cloud systems.
We introduce CloudResilienceML, a framework designed to maintain the resilience of ML models in dynamic cloud environments. CloudResilienceML includes: (1) a performance degradation detection mechanism, using dynamic programming change point detection to identify when a model needs retraining, and (2) a data valuation method to select a minimal, effective training set for retraining, reducing unnecessary overhead.
Evaluated with two ML models on real cloud operational data, CloudResilienceML significantly boosts model resilience and reduces retraining costs compared to incremental learning and data drift-based retraining. In high-drift scenarios (e.g., Wikipedia trace), it reduces overhead by 50% compared to concept drift retraining and by 91% compared to incremental retraining. In stable environments (e.g., Microsoft Azure trace), CloudResilienceML maintains high accuracy with retraining costs 96% lower than concept drift methods and 86% lower than incremental retraining.
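To illustrate the degradation detection step named in the abstract, the following is a minimal sketch of dynamic programming change point detection applied to a series of model prediction errors. It is not the authors' implementation: the abstract does not specify the cost function or parameters, so the squared-error (mean-shift) cost, the `penalty` and `min_size` parameters, and the function names `detect_change_points` and `needs_retraining` are all assumptions made for illustration.

```python
import numpy as np


def detect_change_points(errors, penalty, min_size=5):
    """Penalized optimal partitioning via dynamic programming.

    Assumption: a squared-error (mean-shift) segment cost; the paper's
    actual cost function is not given in the abstract.
    """
    x = np.asarray(errors, dtype=float)
    n = len(x)
    # Prefix sums let us compute the cost of any segment in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(x)))
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def cost(i, j):
        # Sum of squared deviations from the mean of x[i:j].
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[j] is the minimal penalized cost of segmenting x[:j].
    best = np.full(n + 1, np.inf)
    best[0] = -penalty  # so a single segment pays no net penalty
    prev = np.zeros(n + 1, dtype=int)
    for j in range(min_size, n + 1):
        for i in range(0, j - min_size + 1):
            if not np.isfinite(best[i]):
                continue  # x[:i] cannot be segmented with min_size
            candidate = best[i] + cost(i, j) + penalty
            if candidate < best[j]:
                best[j], prev[j] = candidate, i

    # Backtrack from the end to recover the segment boundaries.
    change_points = []
    j = n
    while j > 0:
        j = prev[j]
        if j > 0:
            change_points.append(int(j))
    return sorted(change_points)


def needs_retraining(errors, penalty, min_size=5):
    """Hypothetical trigger: retrain if the error level rose after the
    most recent detected change point."""
    x = np.asarray(errors, dtype=float)
    cps = detect_change_points(x, penalty, min_size)
    if not cps:
        return False
    last = cps[-1]
    start = cps[-2] if len(cps) > 1 else 0
    return bool(x[last:].mean() > x[start:last].mean())


# Synthetic error series: the error level jumps from 0.1 to 0.5 at index 50.
errors = [0.1] * 50 + [0.5] * 50
print(detect_change_points(errors, penalty=0.5))
print(needs_retraining(errors, penalty=0.5))
```

In this sketch the penalty controls sensitivity: a larger value requires a bigger shift in error level before a change point is reported, which would mean fewer retraining triggers. The quadratic-time loop is acceptable for short monitoring windows; a pruned variant would be preferable for long traces.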
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Autonomous System, Change Point Detection, Cloud Operational Data, Data Drift, Machine Learning, Resource Management, Time series
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-231946 (URN); 10.1109/UCC63386.2024.00019 (DOI); 2-s2.0-105004740726 (Scopus ID); 9798350367201 (ISBN); 9798350367218 (ISBN)
Conference
The 17th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2024), University of Sharjah, Sharjah, United Arab Emirates, 16-19 December, 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); eSSENCE - An eScience Collaboration
2024-12-21; 2024-12-21; 2025-06-17; Bibliographically approved