Dimensionality Reduction and Machine Learning for Genome-Wide Association Studies in Norway Spruce: A Comparative Study of XGBoost and CatBoost
2025 (English). Independent thesis, basic level (Bachelor's degree), 10 credits / 15 HE credits
Student thesis (Degree project)
Abstract [en]
Genome-wide association studies (GWAS) are genetic studies continuously conducted on genomic data to find correlations between particular gene markers and diseases or other genetic variation. The specific genetic variants most commonly examined are single nucleotide polymorphisms (SNPs). Machine learning models, which have gained considerable popularity in genetics over the last decade, have shown potential for identifying and understanding these genomic correlations. However, obtaining accurate predictions from machine learning models trained on high-dimensional genomic data remains a significant challenge in supervised machine learning, necessitating the evaluation of various regression models and approaches to identify the most effective one. This thesis utilizes two prominent models, Extreme Gradient Boosting (XGBoost) regression and Categorical Boosting (CatBoost) regression, and compares and analyzes their results to determine how dimensionality reduction via Principal Component Analysis (PCA) influences their performance and interpretability. The research focuses on Norway spruce genomic data, employing cross-validation and hyperparameter tuning to assess prediction accuracy and computational efficiency, and uses the Shapley Additive Explanations (SHAP) package to explain the output of the models.
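The PCA-then-regression pipeline with cross-validation described above can be sketched as follows. This is a minimal illustration using scikit-learn on a synthetic genotype matrix, not the thesis's actual code: `GradientBoostingRegressor` stands in for the boosted models, since the sklearn-compatible `xgboost.XGBRegressor` and `catboost.CatBoostRegressor` estimators would slot into the same pipeline; the data shapes and component count are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Toy stand-in for a genotype matrix: n samples x p SNPs coded 0/1/2,
# with a continuous phenotype driven by two causal SNPs plus noise.
X = rng.integers(0, 3, size=(200, 500)).astype(float)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Reduce the SNP matrix to 20 principal components before the boosted
# regressor, then score the whole pipeline by 5-fold CV RMSE.
model = make_pipeline(PCA(n_components=20),
                      GradientBoostingRegressor(random_state=0))
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(round(-scores.mean(), 2))  # mean CV RMSE across folds
```

Because PCA sits inside the pipeline, it is refit on each training fold, so no information leaks from validation samples into the projection.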
The findings reveal a critical trade-off between computational efficiency, predictive accuracy, and model interpretability. PCA drastically reduced training times by over 98% for both models. XGBoost proved highly robust to dimensionality reduction, maintaining stable predictive accuracy (RMSE) across all dataset configurations. In contrast, CatBoost's accuracy declined more noticeably on PCA-reduced datasets but achieved the best absolute performance on the original data. A key difference was found in model interpretability: SHAP-based feature elimination showed that CatBoost attributed predictive power to a minimal set of SNPs (as few as 4-5 for some traits), suggesting a simple genetic architecture. On the other hand, XGBoost identified over a hundred significant features per trait, indicating a more complex basis. The thesis concludes that the choice of algorithm and preprocessing technique depends on the research goal, with PCA-enabled XGBoost ideal for computational efficiency, and CatBoost on original data superior for generating specific and interpretable biological hypotheses.
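An attribution-guided feature-elimination loop of the kind described here can be sketched as below. This is a hypothetical illustration, not the thesis's procedure: the model's built-in `feature_importances_` stands in for mean absolute SHAP values (the `shap` package would supply those in practice), and the halving schedule and 5% degradation tolerance are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Synthetic genotype matrix with two causal SNPs (columns 3 and 7).
X = rng.integers(0, 3, size=(150, 60)).astype(float)
y = 2.0 * X[:, 3] - X[:, 7] + rng.normal(scale=0.3, size=150)

features = list(range(X.shape[1]))
best_rmse = np.inf
# Repeatedly drop the half of the surviving SNPs with the lowest
# attribution, stopping once CV accuracy degrades noticeably.
while len(features) > 2:
    model = GradientBoostingRegressor(random_state=0).fit(X[:, features], y)
    rmse = -cross_val_score(model, X[:, features], y, cv=3,
                            scoring="neg_root_mean_squared_error").mean()
    if rmse > best_rmse * 1.05:  # accuracy got >5% worse: stop eliminating
        break
    best_rmse = rmse
    order = np.argsort(model.feature_importances_)  # ascending importance
    features = [features[i] for i in order[len(order) // 2:]]

print(len(features), sorted(features))
```

The number of SNPs that survive such a loop before accuracy degrades is what distinguishes a sparse attribution pattern (a handful of markers, as reported for CatBoost) from a diffuse one (over a hundred, as for XGBoost).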
Place, publisher, year, edition, pages
2025, p. 36
Series
UMNAD ; 1585
Keywords [en]
XGBoost, CatBoost, machine learning, dimensionality reduction
National subject category
Artificial intelligence
Identifiers
URN: urn:nbn:se:umu:diva-244636
OAI: oai:DiVA.org:umu-244636
DiVA, id: diva2:2001270
Educational programme
Bachelor's Programme in Computing Science
Examiners
Available from: 2025-09-26 Created: 2025-09-25 Last updated: 2025-09-26 Bibliographically approved