Dimensionality Reduction and Machine Learning for Genome-Wide Association Studies in Norway Spruce: A Comparative Study of XGBoost and CatBoost
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2025 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Genome-wide association studies (GWAS) are studies in the field of genetics, continuously conducted on genetic data to find correlations between particular gene markers and diseases or other genetic variation. The genetic variants most commonly examined are single nucleotide polymorphisms (SNPs). Machine learning models, which have gained considerable popularity in genetics over the last decade, have shown potential for identifying and understanding these genomic correlations. However, training machine learning models to predict accurately on high-dimensional genomic data remains a significant challenge in supervised machine learning, necessitating the evaluation of various regression models and approaches to identify the most effective one. This thesis utilizes two prominent models, Extreme Gradient Boosting (XGBoost) regression and Categorical Boosting (CatBoost) regression, aiming to compare and analyze their results to determine how dimensionality reduction via Principal Component Analysis (PCA) influences their performance and interpretability. The research focuses on Norway spruce genomic data, employing cross-validation and hyperparameter tuning to assess prediction accuracy and computational efficiency, and uses the Shapley Additive Explanations (SHAP) package to explain the output of the models.
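The preprocessing step described above — applying PCA to a high-dimensional SNP matrix before fitting a boosted regressor — can be sketched in miniature. This is an illustrative sketch, not the thesis code: it uses a random toy genotype matrix and implements PCA directly with numpy's SVD rather than calling the actual XGBoost/CatBoost libraries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a SNP genotype matrix: n samples x p markers,
# coded 0/1/2 (copies of the minor allele), with p >> n as is
# typical in GWAS data.
n, p, k = 50, 500, 10
X = rng.integers(0, 3, size=(n, p)).astype(float)

# PCA via SVD: center the columns, take the top-k right singular
# vectors as principal axes, and project the samples onto them.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T          # n x k matrix of PC scores

# Fraction of total variance retained by the first k components.
explained = (S[:k] ** 2).sum() / (S ** 2).sum()

# X_reduced (50 x 10 instead of 50 x 500) would then be fed to the
# regressor, which is where the large training-time savings come from.
print(X_reduced.shape, round(float(explained), 3))
```

The dimensionality drop from `p` columns to `k` principal-component scores is what drives the training-time reductions reported below.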

The findings reveal a critical trade-off between computational efficiency, predictive accuracy, and model interpretability. PCA drastically reduced training times, by over 98% for both models. XGBoost proved highly robust to dimensionality reduction, maintaining stable predictive accuracy (RMSE) across all dataset configurations. In contrast, CatBoost's accuracy declined more noticeably on the PCA-reduced datasets, but it achieved the best absolute performance on the original data. A key difference was found in model interpretability: SHAP-based feature elimination showed that CatBoost attributed predictive power to a minimal set of SNPs (as few as 4-5 for some traits), suggesting a simple genetic architecture, whereas XGBoost identified over a hundred significant features per trait, indicating a more complex genetic basis. The thesis concludes that the choice of algorithm and preprocessing technique depends on the research goal, with PCA-enabled XGBoost ideal for computational efficiency, and CatBoost on the original data superior for generating specific and interpretable biological hypotheses.
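The SHAP-based feature elimination described above iteratively discards SNPs that contribute little to the model's predictions. A minimal numpy-only sketch of the same idea follows; it uses permutation importance as a stand-in for SHAP attributions and ordinary least squares as a stand-in for a boosted regressor, so the names and thresholds here are illustrative assumptions, not the thesis method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 100 samples, 20 candidate "SNPs", but only columns 0 and 1
# actually influence the trait value y.
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

def fit_predict(X_tr, y_tr, X_te):
    """Ordinary least squares as a minimal stand-in for a boosted regressor."""
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_te @ beta

def permutation_importance(X, y, n_repeats=5):
    """Increase in MSE when each column is shuffled.

    Stand-in for per-feature SHAP attributions: a feature whose
    permutation barely changes the error contributes little.
    """
    base_err = np.mean((fit_predict(X, y, X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            err = np.mean((fit_predict(X, y, Xp) - y) ** 2)
            scores[j] += (err - base_err) / n_repeats
    return scores

scores = permutation_importance(X, y)
# One elimination step: keep only features whose importance is a
# non-negligible fraction of the strongest feature's score.
kept = np.where(scores > 0.1 * scores.max())[0]
print(kept)  # should recover the informative columns
```

Repeating the fit-score-drop loop until the kept set stabilizes gives the kind of minimal SNP sets (a handful for CatBoost, over a hundred for XGBoost) that the thesis reports per trait.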

Place, publisher, year, edition, pages
2025, p. 36
Series
UMNAD ; 1585
Keywords [en]
XGBoost, CatBoost, machine learning, dimensionality reduction
National Category
Artificial Intelligence
Identifiers
URN: urn:nbn:se:umu:diva-244636
OAI: oai:DiVA.org:umu-244636
DiVA, id: diva2:2001270
Educational program
Bachelor of Science Programme in Computing Science
Examiners
Available from: 2025-09-26. Created: 2025-09-25. Last updated: 2025-09-26. Bibliographically approved.

Open Access in DiVA

fulltext (4958 kB), 98 downloads
File information
File name: FULLTEXT01.pdf. File size: 4958 kB. Checksum: SHA-512
ee492180fe517a196c009008f1d08cbf54531f3f0d15338f93b1f9648a82bfa6c85d0e0b44e009fa2b1d6cc809c0e3444b50fa7394276f4c039dd4797416e6bc
Type: fulltext. Mimetype: application/pdf
