Machine Learning (ML) models play a crucial role in enabling intelligent decision-making across diverse cloud system management tasks. However, as cloud operational data evolves, shifts in data distributions can occur, leading to a gradual degradation of deployed ML models and, consequently, a reduction in the overall efficiency of cloud systems.
We introduce CloudResilienceML, a framework designed to maintain the resilience of ML models in dynamic cloud environments. CloudResilienceML includes: (1) a performance degradation detection mechanism that uses dynamic-programming change point detection to identify when a model needs retraining, and (2) a data valuation method that selects a minimal yet effective training set for retraining, reducing unnecessary overhead.
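To make the detection idea concrete, the following is a minimal sketch of change point detection on a model-performance time series. It is not the paper's implementation: it handles only a single breakpoint, scores segments by within-segment sum of squared errors, and uses an exhaustive search (the one-breakpoint special case of the dynamic-programming formulation); the function names and the `min_size` parameter are illustrative assumptions.

```python
def sse(seg):
    """Sum of squared errors of a segment around its own mean."""
    m = sum(seg) / len(seg)
    return sum((v - m) ** 2 for v in seg)

def best_changepoint(x, min_size=2):
    """Find the split index k minimizing SSE(x[:k]) + SSE(x[k:]).

    This is the single-breakpoint special case of dynamic-programming
    change point detection; a large drop in segment cost at the best k
    signals a shift in the monitored performance metric.
    """
    n = len(x)
    best_k, best_cost = None, float("inf")
    for k in range(min_size, n - min_size + 1):
        cost = sse(x[:k]) + sse(x[k:])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Hypothetical accuracy trace: stable, then a sustained drop after step 8.
accuracy = [0.90] * 8 + [0.60] * 8
k, cost = best_changepoint(accuracy)
# k == 8: the detected change point, i.e. when retraining would be triggered.
```

Libraries such as `ruptures` implement the full multi-breakpoint dynamic-programming search; the sketch above only conveys the core cost-minimization idea behind triggering retraining.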
Evaluated with two ML models on real cloud operational data, CloudResilienceML significantly boosts model resilience and reduces retraining costs compared to incremental learning and data drift-based retraining. In high-drift scenarios (e.g., the Wikipedia trace), it reduces overhead by 50% compared to drift-based retraining and by 91% compared to incremental retraining. In stable environments (e.g., the Microsoft Azure trace), CloudResilienceML maintains high accuracy with retraining costs 96% lower than drift-based retraining and 86% lower than incremental retraining.