Comparative Analysis of Tree-Based Methods for Predictive Modeling
2024 (English)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Trädbaserade metoder för regression och klassificering (Swedish)
Abstract [en]
This paper examines the application and predictive ability of tree-based methods in machine learning, focusing on decision trees, Random Forests, C-Trees, and boosting methods for regression and classification tasks. With the increasing importance of data-driven decision-making, these methodologies have emerged as powerful tools for extracting insights from complex datasets in fields such as finance, healthcare, and marketing. Supervised learning, encompassing both regression and classification, forms the backbone of these methods. The aim of this paper is to determine which method has the best predictive performance.
Tree-based methods date back to 1963 and have evolved through significant contributions such as the Classification and Regression Trees (CART) framework, bagging, and Random Forests. These advancements have enhanced the robustness and accuracy of predictive models. Modern boosting techniques, like AdaBoost and Gradient Boosting, iteratively improve model performance by concentrating each new model on the observations that earlier models predicted poorly.
Methods like linear and logistic regression are explored and used as baseline models for their simplicity and interpretability. The paper also investigates more advanced techniques such as Support Vector Machines and Artificial Neural Networks. Additionally, the Conditional Inference Trees (C-Tree) approach offers a hypothesis-testing framework for optimizing splits, enhancing model accuracy.
The analysis includes fitting various models to training data, cross-validation for optimal parameter selection, and performance evaluation on validation datasets. Results from the analysis indicate that Random Forest and XGBoost were strong performers in both the classification and regression problems.
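The workflow described above (fit on training data, choose parameters by cross-validation, score on held-out validation data) can be sketched roughly as follows. This is a minimal illustration in Python with scikit-learn, not the thesis's own code: the synthetic dataset, the parameter grid, the 75/25 split, and the choice of accuracy as the metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): the thesis datasets are not reproduced here.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Hold out a validation set that plays no part in fitting or tuning.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline model: logistic regression fitted on the training data only.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Tree-based model: 5-fold cross-validation over a small, illustrative grid.
param_grid = {"max_depth": [2, 4, None], "max_features": ["sqrt", 0.5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # refits the best configuration on all training data

# Compare both models on the untouched validation set.
baseline_acc = accuracy_score(y_valid, baseline.predict(X_valid))
forest_acc = accuracy_score(y_valid, search.predict(X_valid))
print("selected parameters:", search.best_params_)
print("baseline validation accuracy:", round(baseline_acc, 3))
print("random forest validation accuracy:", round(forest_acc, 3))
```

Keeping the validation set out of the cross-validation loop means the reported accuracy is not influenced by the parameter search. For a regression problem, the same pattern would apply with a regressor and an error metric such as mean squared error.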
Place, publisher, year, edition, pages
2024.
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:umu:diva-226745
OAI: oai:DiVA.org:umu-226745
DiVA, id: diva2:1874423
Subject / course
Statistics C: Bachelor's Thesis
Educational program
Programme in Statistics and Data Science
2024-06-20 2024-06-20 2024-06-20 Bibliographically approved