Publications (8 of 8)
Kidane, L. (2025). Accurate and low-overhead workload prediction for cloud management. (Doctoral dissertation). Umeå: Umeå University
Accurate and low-overhead workload prediction for cloud management
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Noggrann och effektiv prediktering av last för resurshantering i datormoln
Abstract [en]

Cloud computing has transformed the IT landscape by offering users and organizations on-demand access to computing power, storage, data processing, and machine learning resources. Despite these benefits, cloud resource management faces challenges due to the heterogeneous and dynamic nature of workloads. Inefficient provisioning manifests in two critical forms: underprovisioning leads to degraded Quality of Service (QoS) and unmet Service-Level Agreements (SLAs), while overprovisioning results in unnecessary energy consumption and high operational costs. With the current rise of AI and machine learning innovations, machine learning-based workload prediction for resource provisioning plays a vital role in forecasting future scenarios and identifying new occurrences, enabling service providers to prepare ahead of time. However, machine learning-based workload prediction comes with challenges of its own.

This thesis addresses the challenges of machine learning-based workload prediction in cloud environments, including data drift due to dynamic workloads, high computational overhead, and storage overhead. Firstly, cloud workloads are dynamic, and models trained on old historical data can become obsolete over time. We address the challenge of accurate prediction under data drift by combining machine learning with streaming data processing algorithms to support adaptive prediction. Secondly, constantly training and updating deep learning models adds significant computational overhead to the cloud infrastructure. We address this problem by proposing a solution that combines a knowledge-base repository with transfer learning-based adaptation, and we explore the tradeoff between model accuracy and computational overhead. Finally, we propose a data compression mechanism that leverages an autoencoder to reduce the storage overhead resulting from the continuous generation of monitoring data in cloud management systems.

Our findings show that the proposed methods significantly improve machine learning-based cloud management. Extensive evaluation on real-world datasets shows that the methods produce accurate predictions even in the face of ever-changing cloud workload patterns, and that they reduce computational overhead by leveraging existing knowledge while highlighting the tradeoff required to balance prediction accuracy against computational cost.
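The adaptive-prediction theme above hinges on noticing when a deployed model's errors begin to drift. As a minimal illustrative sketch (not the thesis's actual method; the `delta` and `threshold` values are hypothetical), a Page-Hinkley test over a stream of prediction errors flags drift once the cumulative deviation from the running error mean grows past a threshold:

```python
# Illustrative sketch only: Page-Hinkley drift detection over a stream of
# prediction errors. Parameter values are made up for the example.

def page_hinkley(errors, delta=0.005, threshold=1.0):
    """Return the index at which drift is flagged, or None if no drift."""
    mean = 0.0
    cum = 0.0       # cumulative deviation of errors above their running mean
    min_cum = 0.0   # smallest cumulative deviation seen so far
    for t, e in enumerate(errors, start=1):
        mean += (e - mean) / t          # incremental running mean
        cum += e - mean - delta         # deviation, minus a small tolerance
        min_cum = min(min_cum, cum)
        if cum - min_cum > threshold:   # sustained error increase -> drift
            return t - 1
    return None

# Stable errors followed by a sudden jump, e.g. after a workload pattern change.
stream = [0.05] * 50 + [0.5] * 20
print(page_hinkley(stream))
```

On the stable prefix the statistic stays flat, so the detector only fires a few steps after the error level jumps, which is the kind of signal an adaptive predictor can use to trigger model adaptation.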

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 38
Series
Report / UMINF, ISSN 0348-0542 ; 25.09
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-238533 (URN), 978-91-8070-713-8 (ISBN), 978-91-8070-712-1 (ISBN)
Public defence
2025-05-30, MIT.A.121, MIT-huset, Umeå, 13:15 (English)
Opponent
Supervisors
Available from: 2025-05-09 Created: 2025-05-07 Last updated: 2025-05-08. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. (2025). Balancing compression and prediction: a hybrid autoencoder-LSTM framework for cloud workloads. In: BDCAT 2025 - IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, Co-Located Conference UCC 2025. Paper presented at 12th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT 2025, Nantes, France, 1-4 December, 2025. Association for Computing Machinery (ACM), Article ID 10.
Balancing compression and prediction: a hybrid autoencoder-LSTM framework for cloud workloads
2025 (English) In: BDCAT 2025 - IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, Co-Located Conference UCC 2025, Association for Computing Machinery (ACM), 2025, article id 10. Conference paper, Published paper (Refereed)
Abstract [en]

Accurate future workload prediction is an essential step for proactive resource allocation and efficient provisioning in cloud computing environments. Deep learning strategies have proven successful for this task, but they face challenges due to the high dimensionality of monitoring data, extensive preprocessing requirements, and computational overhead. In this paper, we propose a hybrid framework that integrates autoencoders for workload compression with Long Short-Term Memory (LSTM) networks for time-series forecasting. Unlike prior studies, our approach systematically analyzes the trade-off between compression ratio and predictive accuracy, demonstrating how dimensionality reduction can improve both scalability and robustness, thereby reducing the computational burden associated with processing massive-scale monitoring data. Experiments conducted on both synthetic and real-world datasets demonstrate that the proposed method achieves up to 60% data compression with minimal reconstruction loss, while also improving prediction accuracy compared to baseline LSTM models. We evaluate the overall performance of the framework using various metrics, including data reduction ratio, prediction accuracy, and the effects of different compression stages on predictive performance. Additionally, we quantify the computational savings in terms of CPU usage, memory footprint, and training/inference times, confirming the framework's feasibility for real-world deployment. These results underscore the potential of integrating compression and prediction to achieve scalable, accurate, and resource-efficient management of cloud workloads.
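The compress-then-predict idea can be sketched in a few lines. This is an illustration only: a truncated SVD stands in for the paper's autoencoder, a naive last-value forecast stands in for the LSTM, and all dimensions and data are made up.

```python
# Sketch of compress -> forecast -> decode on synthetic monitoring data.
# A truncated SVD plays the role of a *linear* autoencoder here.
import numpy as np

rng = np.random.default_rng(0)
T, d, k = 200, 16, 4            # time steps, metrics per step, latent size

# Synthetic monitoring matrix with low-rank structure plus small noise.
base = rng.normal(size=(T, k)) @ rng.normal(size=(k, d))
X = base + 0.01 * rng.normal(size=(T, d))

# "Encoder": project each time step onto the top-k right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:k].T                # compressed representation, shape (T, k)

# "Decoder": reconstruct the full metric space from the latent space.
X_hat = Z @ Vt[:k]
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)

# Naive forecast in the compact space, decoded back to all d metrics.
x_pred = Z[-1] @ Vt[:k]

compression_ratio = 1 - Z.size / X.size   # fraction of stored values saved
print(round(compression_ratio, 2), rel_err < 0.05, x_pred.shape)
```

With a rank-4 signal, three quarters of the values can be dropped while the reconstruction error stays tiny; the paper's nonlinear autoencoder targets the same tradeoff on real monitoring data.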

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
Autoencoders, Cloud computing, Data compression, Information extraction, Workload prediction
National Category
Computer Systems, Computer Sciences
Identifiers
urn:nbn:se:umu:diva-248586 (URN), 10.1145/3773276.3774300 (DOI), 2-s2.0-105026855587 (Scopus ID), 9798400722868 (ISBN)
Conference
12th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, BDCAT 2025, Nantes, France, 1-4 December, 2025.
Funder
Knut and Alice Wallenberg Foundation, KAW 2019.0352
eSSENCE - An eScience Collaboration
Available from: 2026-01-23 Created: 2026-01-23 Last updated: 2026-01-23. Bibliographically approved
Nguyen, C. L., Kidane, L., Vo Nguyen Le, D. & Elmroth, E. (2024). CloudResilienceML: ensuring robustness of machine learning models in dynamic cloud systems. In: Proceedings - 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing, UCC 2024. Paper presented at The 17th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2024), University of Sharjah, Sharjah, United Arab Emirates, 16-19 December, 2024 (pp. 73-81). Institute of Electrical and Electronics Engineers (IEEE)
CloudResilienceML: ensuring robustness of machine learning models in dynamic cloud systems
2024 (English) In: Proceedings - 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing, UCC 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 73-81. Conference paper, Published paper (Refereed)
Abstract [en]

Machine Learning (ML) models play a crucial role in enabling intelligent decision-making across diverse cloud system management tasks. However, as cloud operational data evolves, shifts in data distributions can occur, leading to a gradual degradation of deployed ML models and, consequently, a reduction in the overall efficiency of cloud systems.

We introduce CloudResilienceML, a framework designed to maintain the resilience of ML models in dynamic cloud environments. CloudResilienceML includes: (1) a performance degradation detection mechanism, using dynamic programming change point detection to identify when a model needs retraining, and (2) a data valuation method to select a minimal, effective training set for retraining, reducing unnecessary overhead.

Evaluated with two ML models on real cloud operational data, CloudResilienceML significantly boosts model resilience and reduces retraining costs compared to incremental learning and data drift-based retraining. In high-drift scenarios (e.g., Wikipedia trace), it reduces overhead by 50% compared to concept drift retraining and by 91% compared to incremental retraining. In stable environments (e.g., Microsoft Azure trace), CloudResilienceML maintains high accuracy with retraining costs 96% lower than concept drift methods and 86% lower than incremental retraining.
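To illustrate the retraining trigger, the sketch below searches exhaustively for a single mean-shift change point in a model's error signal. This is a toy stand-in for the dynamic-programming detector the paper describes, and the gain threshold is hypothetical.

```python
# Toy single-change-point detector: find the split of an error signal that
# minimizes total within-segment squared deviation (mean-shift model).
import numpy as np

def best_split(signal):
    """Return (index, gain): the split minimizing total squared deviation."""
    signal = np.asarray(signal, dtype=float)
    total = ((signal - signal.mean()) ** 2).sum()
    best_i, best_cost = None, np.inf
    for i in range(1, len(signal)):
        left, right = signal[:i], signal[i:]
        cost = ((left - left.mean()) ** 2).sum() + \
               ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i, total - best_cost   # gain = deviation explained by the split

# Prediction error jumps from ~0.1 to ~0.4 halfway through; retrain when the
# split gain exceeds a (hypothetical) threshold.
errors = [0.1] * 30 + [0.4] * 30
idx, gain = best_split(errors)
print(idx, gain > 0.5)
```

A production detector would handle multiple change points and penalize over-segmentation, but the decision logic — retrain only when a statistically meaningful shift is found — is the same.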

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Autonomous System, Change Point Detection, Cloud Operational Data, Data Drift, Machine Learning, Resource Management, Time series
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-231946 (URN), 10.1109/UCC63386.2024.00019 (DOI), 2-s2.0-105004740726 (Scopus ID), 9798350367201 (ISBN), 9798350367218 (ISBN)
Conference
The 17th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2024), University of Sharjah, Sharjah, United Arab Emirates, 16-19 December, 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
eSSENCE - An eScience Collaboration
Available from: 2024-12-21 Created: 2024-12-21 Last updated: 2025-06-17. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. (2023). Automated hyperparameter tuning for adaptive cloud workload prediction. In: UCC '23: Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing. Paper presented at UCC '23: IEEE/ACM 16th International Conference on Utility and Cloud Computing, Taormina (Messina), Italy, December 4-7, 2023. New York: Association for Computing Machinery (ACM)
Automated hyperparameter tuning for adaptive cloud workload prediction
2023 (English) In: UCC '23: Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing, New York: Association for Computing Machinery (ACM), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Efficient workload prediction is essential for enabling timely resource provisioning in cloud computing environments. However, achieving accurate predictions, ensuring adaptability to changing conditions, and minimizing computation overhead pose significant challenges for workload prediction models. Furthermore, the continuous streaming nature of workload metrics requires careful consideration when applying machine learning and data mining algorithms, as manual hyperparameter optimization can be time-consuming and suboptimal. We propose an automated parameter tuning and adaptation approach for workload prediction models and the concept drift detection algorithms used in predicting future workload. Our method leverages a pre-built knowledge base of statistical features of historical data, enabling automatic adjustment of model weights and concept drift detection parameters. Additionally, model adaptation is facilitated through a transfer learning approach. We evaluate the effectiveness of our automated approach by comparing it with static approaches on synthetic and real-world datasets. By automating the parameter tuning process and integrating concept drift detection, the proposed method improved the accuracy and efficiency of workload prediction models by 50% in our experiments.
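The knowledge-base idea can be sketched as a nearest-neighbor lookup from workload statistics to previously tuned hyperparameters. Everything below — the chosen features, the stored entries, and the hyperparameter values — is hypothetical, not taken from the paper.

```python
# Sketch: reuse hyperparameters from the past workload whose statistical
# features are closest to the new workload's. Feature choices and stored
# values are illustrative assumptions.
import math

def features(trace):
    """Summarize a workload trace as (mean, standard deviation)."""
    mean = sum(trace) / len(trace)
    var = sum((x - mean) ** 2 for x in trace) / len(trace)
    return (mean, math.sqrt(var))

# Knowledge base: workload features -> tuned hyperparameters (made-up values).
knowledge_base = {
    (10.0, 1.0):  {"lr": 0.01,  "window": 16},   # low, steady workload
    (50.0, 20.0): {"lr": 0.001, "window": 64},   # high, bursty workload
}

def lookup(trace):
    """Return the hyperparameters of the nearest stored workload profile."""
    f = features(trace)
    key = min(knowledge_base, key=lambda k: math.dist(k, f))
    return knowledge_base[key]

# A new bursty trace should inherit the bursty profile's hyperparameters.
bursty = [40, 80, 30, 70, 45, 75] * 10
print(lookup(bursty))
```

In the paper's setting the retrieved configuration would then seed a transfer-learning step rather than being used as-is, which is what keeps the adaptation cheap.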

Place, publisher, year, edition, pages
New York: Association for Computing Machinery (ACM), 2023
Keywords
Cloud computing, Hyperparameter optimization, Workload prediction, Concept drift, Data mining
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-223451 (URN), 10.1145/3603166.3632244 (DOI), 001211822800044 (), 2-s2.0-85191659681 (Scopus ID), 979-8-4007-0234-1 (ISBN)
Conference
UCC '23: IEEE/ACM 16th International Conference on Utility and Cloud Computing, Taormina (Messina), Italy, December 4-7, 2023
Funder
Knut and Alice Wallenberg Foundation, 2019.0352
eSSENCE - An eScience Collaboration
Available from: 2024-04-16 Created: 2024-04-16 Last updated: 2025-05-07. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. (2022). When and How to Retrain Machine Learning-based Cloud Management Systems. In: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Paper presented at 2022 IEEE International Parallel and Distributed Processing Symposium, 30 May - 03 June 2022, Lyon, France (pp. 688-698). IEEE
When and How to Retrain Machine Learning-based Cloud Management Systems
2022 (English) In: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), IEEE, 2022, p. 688-698. Conference paper, Published paper (Refereed)
Abstract [en]

Cloud management systems increasingly rely on machine learning (ML) models to predict incoming workload rates, load, and other system behaviors for efficient dynamic resource management. Current state-of-the-art prediction models demonstrate high accuracy, but assume that data patterns remain stable. However, in production use, systems may face hardware upgrades, changes in user behavior, and similar shifts that lead to concept drifts - significant changes in the characteristics of data streams over time. To mitigate prediction deterioration, ML models need to be updated - but the questions of when and how to best retrain these models remain unsolved in the context of cloud management. We present a pilot study that addresses these questions for one of the most common models for adaptive prediction - Long Short-Term Memory (LSTM) - using synthetic and real-world workload data. Our analysis of when to retrain explores approaches for detecting when retraining is required, using both concept drift detection and prediction error thresholds, and at what point retraining should actually take place. Our analysis of how to retrain focuses on the data required for retraining, and what proportion should be taken from before and after the need for retraining is detected. We present initial results indicating that retraining existing models can achieve prediction accuracy close to that of newly trained models but at much lower cost, and offer initial advice on providing cloud management systems with support for automatic retraining of ML-based methods.
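The "how to retrain" question - what proportion of data to take from before and after the drift is detected - can be sketched as follows. The 25% pre-drift fraction is a hypothetical choice for illustration, not the paper's recommendation.

```python
# Sketch: once drift is detected at drift_idx, assemble a retraining set from
# the most recent slice of pre-drift history plus all post-drift data.

def retraining_set(history, drift_idx, pre_fraction=0.25):
    """Keep the newest pre_fraction of pre-drift samples plus all post-drift samples."""
    pre = history[:drift_idx]
    post = history[drift_idx:]
    keep = int(len(pre) * pre_fraction)
    return pre[len(pre) - keep:] + post

history = list(range(100))        # toy workload samples, oldest first
subset = retraining_set(history, drift_idx=80)
print(len(subset), subset[0])
```

Keeping some pre-drift data preserves the patterns that survived the drift, while discarding the oldest history keeps retraining cheap - the tradeoff the paper's experiments explore.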

Place, publisher, year, edition, pages
IEEE, 2022
Keywords
cloud computing, cloud workload prediction, concept drift, machine learning, time series prediction
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-198541 (URN), 10.1109/IPDPSW55747.2022.00120 (DOI), 000855041000086 (), 2-s2.0-85136190866 (Scopus ID), 9781665497473 (ISBN), 9781665497480 (ISBN)
Conference
2022 IEEE International Parallel and Distributed Processing Symposium, 30 May 2022-03 June 2022, Lyon, France
Funder
Knut and Alice Wallenberg Foundation
eSSENCE - An eScience Collaboration
Available from: 2022-08-09 Created: 2022-08-09 Last updated: 2025-05-07. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. A data-driven framework for efficient and automated workload prediction in cloud computing.
A data-driven framework for efficient and automated workload prediction in cloud computing
(English) Manuscript (preprint) (Other academic)
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:umu:diva-238557 (URN)
Available from: 2025-05-08 Created: 2025-05-08 Last updated: 2025-05-08. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. A hybrid autoencoder-LSTM framework for efficient workload prediction.
A hybrid autoencoder-LSTM framework for efficient workload prediction
(English) Manuscript (preprint) (Other academic)
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:umu:diva-238556 (URN)
Available from: 2025-05-08 Created: 2025-05-08 Last updated: 2025-05-08. Bibliographically approved
Kidane, L., Townend, P., Metsch, T. & Elmroth, E. Efficient retraining of machine learning algorithms in cloud management systems.
Efficient retraining of machine learning algorithms in cloud management systems
(English) Manuscript (preprint) (Other academic)
National Category
Computer Systems
Identifiers
urn:nbn:se:umu:diva-238555 (URN)
Available from: 2025-05-08 Created: 2025-05-08 Last updated: 2025-05-08. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-8097-1143
