DISCOLEAF: Personalized DIScretization of COntinuous Attributes for LEArning with Federated Decision Trees
Kwatra, Saloni. Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-4896-7849
Torra, Vicenç. Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-0368-8037
2024 (English). In: Privacy in Statistical Databases: International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings, Springer Nature, 2024, p. 344-357. Conference paper, Published paper (Refereed)
Abstract [en]

Federated learning is a distributed machine learning framework in which each client participating in the federation trains a machine learning model on its own data and shares the trained model information with a central server; the server aggregates this information and sends the aggregate back to the distributed clients. We choose to work with decision trees because of their simplicity and interpretability. On that note, we propose a full-fledged federated pipeline that includes both discretization and decision tree learning for horizontally partitioned data. Our federated discretization approach can be plugged in as a preprocessing step before any other federated learning algorithm. During discretization, each client creates a number of discrete bins chosen according to its own data or preference; the approach is therefore both federated and personalized. After discretization, we apply the post-randomization method (PRAM) to protect the discretized data with differential privacy guarantees. Each client then trains a decision tree classifier locally on its protected database and shares its nodes, containing the split attribute and split value, with the central server. The central server selects the most frequently occurring split attribute and combines the corresponding split values; this process continues until all nodes to be merged are leaf nodes. The central server then shares the merged tree with the distributed clients. In summary, our proposed framework performs personalized, privacy-preserving federated learning with decision trees by discretizing continuous attributes and masking them prior to the training stage. We call this framework DISCOLEAF.
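To make the pipeline described above concrete, the following minimal sketch (Python with scikit-learn) illustrates one client round and the server-side merge as we read the abstract. It is not the authors' implementation: the helper names, the quantile-based binning, the randomized-response form of the post-randomization step, and the averaging of split values are illustrative assumptions.

# Illustrative sketch of one DISCOLEAF-style round; helper names and the
# randomized-response form of PRAM are assumptions, not the authors' code.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def personalized_discretize(X, n_bins):
    """Each client bins its continuous columns with its own choice of n_bins."""
    edges = [np.quantile(col, np.linspace(0, 1, n_bins + 1)) for col in X.T]
    return np.column_stack(
        [np.digitize(col, e[1:-1]) for col, e in zip(X.T, edges)]
    )

def pram_protect(X_binned, n_bins, epsilon):
    """Randomized-response style PRAM: keep a bin with prob p, else flip to
    a uniformly chosen different bin (k-ary randomized response)."""
    p = np.exp(epsilon) / (np.exp(epsilon) + n_bins - 1)
    other = (X_binned + np.random.randint(1, n_bins, size=X_binned.shape)) % n_bins
    keep = np.random.rand(*X_binned.shape) < p
    return np.where(keep, X_binned, other)

def client_update(X, y, n_bins, epsilon):
    """Discretize, protect, train locally, and report the root split."""
    Xp = pram_protect(personalized_discretize(X, n_bins), n_bins, epsilon)
    tree = DecisionTreeClassifier(max_depth=3).fit(Xp, y)
    return tree.tree_.feature[0], tree.tree_.threshold[0]

def server_merge(root_splits):
    """Majority vote on the split attribute; combine the matching split values."""
    feats = [f for f, _ in root_splits]
    best = Counter(feats).most_common(1)[0][0]
    vals = [v for f, v in root_splits if f == best]
    return best, float(np.mean(vals))

In the paper this merging proceeds node by node until only leaf nodes remain; the sketch shows a single level and uses a simple mean to combine split values for readability.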

Place, publisher, year, edition, pages
Springer Nature, 2024. p. 344-357
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14915
Keywords [en]
Federated Learning, Discretization, Decision Trees, Differential Privacy, Decision Tree Aggregation
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-230552
DOI: 10.1007/978-3-031-69651-0_23
ISI: 001343416300023
Scopus ID: 2-s2.0-85205134092
ISBN: 978-3-031-69650-3 (print)
ISBN: 978-3-031-69651-0 (electronic)
OAI: oai:DiVA.org:umu-230552
DiVA, id: diva2:1903600
Conference
International Conference on Privacy in Statistical Databases (PSD) 2024, Antibes Juan-les-Pins, France, September 25-27, 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Swedish Research Council, 2023-05531
Available from: 2024-10-04 Created: 2024-10-04 Last updated: 2025-04-24 Bibliographically approved
In thesis
1. Navigating data privacy and utility: a strategic perspective
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Navigera i datasekretess och verktyg : ett strategiskt perspektiv
Abstract [en]

Privacy in machine learning should not merely be an afterthought; it must be the foundation on which machine learning systems are designed. In this thesis, alongside centralized machine learning, we also consider distributed environments for training machine learning models, particularly federated learning. Federated learning lets multiple clients or organizations train a machine learning model collaboratively without moving their data: each client participating in the federation shares the model parameters learnt by training a machine learning model on its own data. Even though this setup keeps the data local, there is still a risk that sensitive information leaks through the model updates. For instance, attackers could use the updates of the model parameters to infer details about the data held by clients. So, while federated learning is designed to protect privacy, it still faces challenges in ensuring that the data remains secure throughout the training process.
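As an illustration of the parameter-sharing loop sketched above, the following toy round assumes the common federated-averaging rule and a linear model; neither is prescribed by the thesis, and the sketch only shows why the exchanged updates are themselves a privacy-relevant signal.

# Minimal sketch of one federated round, assuming federated averaging and a
# linear model purely for illustration.
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One gradient step of a linear model on the client's private data."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_w, clients):
    """Each client trains locally; only the updated weights leave the client."""
    updates = [local_step(global_w, X, y) for X, y in clients]
    # The server never sees raw data, but these updates are exactly the
    # signal that inference attacks on federated learning try to exploit.
    return np.mean(updates, axis=0)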

Originally, federated learning was introduced in the context of deep learning models. This thesis, however, focuses on federated learning for decision trees. Decision trees are intuitive and interpretable models, making them popular in a wide range of applications, especially where explainability of the model's decisions is important. However, decision trees are vulnerable to inference attacks, particularly when the structure of the tree is exposed. To mitigate these vulnerabilities, a key contribution of this thesis is the development of novel federated learning algorithms that incorporate privacy-preserving techniques, such as k-anonymity and differential privacy, into the construction of decision trees. By doing so, we seek to ensure user privacy without significantly compromising model performance. Machine learning models learn patterns from data, and in doing so they may leak sensitive information; each step of the machine learning pipeline presents its own vulnerabilities, making it essential to assess and quantify the privacy risks involved. One focus of this thesis is the quantification of privacy by devising a data reconstruction attack tailored to Principal Component Analysis (PCA), a widely used dimensionality reduction technique. Various protection mechanisms are then evaluated in terms of how well they preserve privacy against such reconstruction attacks while maintaining the utility of the model. In addition to federated learning, this thesis also addresses the privacy concerns associated with synthetic datasets generated by models such as generative networks. Specifically, we perform an attribute inference attack on synthetic datasets and quantify privacy by calculating the Inference Accuracy, a metric that reflects how successfully an attacker estimates the sensitive attributes of target individuals.
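As a rough sketch of how such an Inference Accuracy score can be computed, the snippet below assumes a simple k-nearest-neighbours attacker trained on the synthetic records; the actual attacker model and evaluation protocol used in the thesis may differ.

# Sketch of scoring an attribute inference attack with an "Inference Accuracy"
# metric, assuming a k-NN attacker for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inference_accuracy(synthetic_X, synthetic_s, target_X, target_s):
    """Train the attacker on synthetic records, then try to recover the
    sensitive attribute s of the real target individuals."""
    attacker = KNeighborsClassifier(n_neighbors=5).fit(synthetic_X, synthetic_s)
    guesses = attacker.predict(target_X)
    return float(np.mean(guesses == target_s))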

Overall, this thesis contributes privacy-preserving algorithms for decision trees in federated learning and introduces methods to quantify privacy in machine learning systems. The findings also lay the ground for further research at the intersection of privacy and machine learning.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2024. p. 103
Series
UMINF, ISSN 0348-0542 ; 24.10
Keywords
Privacy, Data Reconstruction Attacks, k-anonymity, Differential Privacy, Federated Learning, Decision Trees, Principal Component Analysis
National Category
Computer Sciences; Other Engineering and Technologies
Identifiers
urn:nbn:se:umu:diva-230616 (URN)
978-91-8070-481-6 (ISBN)
978-91-8070-482-3 (ISBN)
Public defence
2024-11-04, BIO.A.206 Aula Anatomica, Biologihuset, 09:15 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2024-10-14 Created: 2024-10-08 Last updated: 2024-10-24 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text; Scopus

Authority records

Kwatra, Saloni; Torra, Vicenç

