Dimensionality Reduction and Machine Learning for Genome-Wide Association Studies in Norway Spruce: A Comparative Study of XGBoost and CatBoost
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2025 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Genome-wide association studies (GWAS) are continuously conducted on genetic data to find correlations between particular gene markers and diseases or genetic variations. The genetic variants most commonly examined are single nucleotide polymorphisms (SNPs). Machine learning models, which have gained considerable popularity in genetics over the last decade, have shown potential for identifying and understanding these genomic correlations. However, accurate prediction with machine learning models trained on high-dimensional genomic data remains a significant challenge in supervised machine learning, necessitating the evaluation of various regression models to identify the most effective approach. This thesis employs two prominent models, Extreme Gradient Boosting (XGBoost) regression and Categorical Boosting (CatBoost) regression, and compares their results to determine how dimensionality reduction via Principal Component Analysis (PCA) influences their performance and interpretability. The research focuses on Norway spruce genomic data, employing cross-validation and hyperparameter tuning to assess prediction accuracy and computational efficiency, and uses the Shapley Additive Explanations (SHAP) package to explain the models' output.
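The pipeline the abstract describes (PCA for dimensionality reduction, a gradient-boosting regressor, cross-validation with hyperparameter tuning) can be sketched as follows. This is an illustrative sketch only: the synthetic 0/1/2 genotype matrix stands in for the Norway spruce SNP data, and scikit-learn's `GradientBoostingRegressor` stands in for XGBoost/CatBoost, whose exact configuration in the thesis is not reproduced here.

```python
# Hedged sketch of the GWAS regression pipeline: PCA + boosted-tree
# regression, tuned and scored with cross-validation. Synthetic data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n_trees, n_snps = 200, 500
X = rng.integers(0, 3, size=(n_trees, n_snps)).astype(float)  # genotype codes 0/1/2
beta = np.zeros(n_snps)
beta[:5] = 1.0                                                # 5 causal SNPs
y = X @ beta + rng.normal(scale=0.5, size=n_trees)            # simulated phenotype

pipe = Pipeline([
    ("pca", PCA(n_components=50)),               # dimensionality reduction step
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
grid = GridSearchCV(
    pipe,
    {"gbr__n_estimators": [100, 200], "gbr__max_depth": [2, 3]},
    cv=3,
    scoring="neg_root_mean_squared_error",       # RMSE, as in the thesis
)
grid.fit(X, y)
rmse = -grid.best_score_                          # best cross-validated RMSE
print(f"best CV RMSE: {rmse:.3f}")
```

Swapping `PCA(n_components=50)` out of the pipeline gives the "original data" configuration, so the same grid search can time and score both variants for a direct comparison.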

The findings reveal a critical trade-off between computational efficiency, predictive accuracy, and model interpretability. PCA drastically reduced training times, by over 98% for both models. XGBoost proved highly robust to dimensionality reduction, maintaining stable predictive accuracy (RMSE) across all dataset configurations. In contrast, CatBoost's accuracy declined more noticeably on PCA-reduced datasets, but it achieved the best absolute performance on the original data. A key difference was found in model interpretability: SHAP-based feature elimination showed that CatBoost attributed predictive power to a minimal set of SNPs (as few as 4-5 for some traits), suggesting a simple genetic architecture, whereas XGBoost identified over a hundred significant features per trait, indicating a more complex basis. The thesis concludes that the choice of algorithm and preprocessing technique depends on the research goal: PCA-enabled XGBoost is ideal for computational efficiency, while CatBoost on the original data is superior for generating specific and interpretable biological hypotheses.
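The SHAP-based feature elimination mentioned above can be illustrated with a simple backward-elimination loop: fit the model, attribute importance to each SNP, drop the near-zero ones, and repeat until the feature set stabilises. The sketch below uses permutation importance as a stand-in for SHAP values and synthetic data; the threshold, model, and data are assumptions, not the thesis's actual procedure.

```python
# Illustrative importance-based backward feature elimination, in the
# spirit of the SHAP-driven procedure described in the abstract.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 40)).astype(float)   # toy SNP matrix
# Phenotype driven by SNPs 0 and 1 only; the rest are noise features.
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.3, size=300)

features = list(range(X.shape[1]))
while True:
    Xtr, Xte, ytr, yte = train_test_split(X[:, features], y, random_state=0)
    model = GradientBoostingRegressor(random_state=0).fit(Xtr, ytr)
    imp = permutation_importance(model, Xte, yte, random_state=0).importances_mean
    keep = [f for f, v in zip(features, imp) if v > 1e-3]  # drop ~zero-impact SNPs
    if len(keep) == len(features) or not keep:
        break                                              # feature set stabilised
    features = keep
print("retained SNPs:", sorted(features))
```

A model that concentrates its attributions (as the thesis reports for CatBoost) shrinks to a handful of SNPs under such a loop, while a model that spreads importance across many features (as reported for XGBoost) retains a much larger set.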

Place, publisher, year, edition, pages
2025. p. 36
Series
UMNAD ; 1585
Keywords [en]
XGBoost, CatBoost, machine learning, dimensionality reduction
HSV category
Identifiers
URN: urn:nbn:se:umu:diva-244636
OAI: oai:DiVA.org:umu-244636
DiVA, id: diva2:2001270
Educational programme
Bachelor of Science Programme in Computing Science
Examiner
Available from: 2025-09-26 Created: 2025-09-25 Last updated: 2025-09-26 Bibliographically approved

Open Access in DiVA

fulltext (4958 kB), 98 downloads
File information
File: FULLTEXT01.pdf
File size: 4958 kB
Checksum (SHA-512):
ee492180fe517a196c009008f1d08cbf54531f3f0d15338f93b1f9648a82bfa6c85d0e0b44e009fa2b1d6cc809c0e3444b50fa7394276f4c039dd4797416e6bc
Type: fulltext
Mimetype: application/pdf

By organisation

Search outside of DiVA

Google, Google Scholar
The number of downloads is the sum of all downloads of all full texts. It may, for example, include earlier versions that are no longer available.
