Navigating data privacy and utility: a strategic perspective
Umeå University, Faculty of Science and Technology, Department of Computing Science (NAUSICA). ORCID iD: 0000-0002-4896-7849
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Navigera i datasekretess och verktyg : ett strategiskt perspektiv (Swedish)
Abstract [en]

Privacy in machine learning should not merely be an afterthought; rather, it must be the foundation on which machine learning systems are designed. In this thesis, alongside centralized machine learning, we also consider distributed environments for training machine learning models, particularly federated learning. Federated learning lets multiple clients or organizations train a machine learning model collaboratively without moving their data: each client participating in the federation shares the model parameters learned by training a model on its own data. Even though this setup keeps the data local, there is still a risk of sensitive information leaking through the model updates; for instance, attackers could use the updates of the model parameters to infer details about the data held by clients. So, while federated learning is designed to protect privacy, it still faces challenges in ensuring that the data remains secure throughout the training process.
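To make this setup concrete, the following is a minimal sketch of one round of parameter averaging in a federated system; the logistic-regression client model, the toy data, and the size-weighted averaging are illustrative assumptions, not the systems studied in the thesis.

    import numpy as np

    def local_update(weights, X, y, lr=0.1, epochs=5):
        # One client's local training: logistic regression by gradient descent.
        w = weights.copy()
        for _ in range(epochs):
            preds = 1.0 / (1.0 + np.exp(-(X @ w)))    # sigmoid predictions
            w -= lr * (X.T @ (preds - y)) / len(y)    # gradient of the log-loss
        return w

    def federated_round(global_w, clients):
        # Each client trains locally; the server averages the returned
        # parameters, weighted by client data size. Only parameters travel,
        # never raw data -- yet these updates are exactly the signal an
        # attacker could try to invert.
        updates = [local_update(global_w, X, y) for X, y in clients]
        sizes = [len(y) for _, y in clients]
        return np.average(updates, axis=0, weights=sizes)

    # Toy demonstration with synthetic client data (illustrative only).
    rng = np.random.default_rng(0)
    clients = [(rng.normal(size=(50, 3)), rng.integers(0, 2, 50)) for _ in range(4)]
    w = np.zeros(3)
    for _ in range(10):
        w = federated_round(w, clients)
    print("global model parameters:", w)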

Originally, federated learning was introduced in the context of deep learning models. This thesis, however, focuses on federated learning for decision trees. Decision trees are intuitive, interpretable models, making them popular in a wide range of applications, especially where explainability of the model's decisions is important. However, decision trees are vulnerable to inference attacks, particularly when the structure of the tree is exposed. To mitigate these vulnerabilities, a key contribution of this thesis is the development of novel federated learning algorithms that incorporate privacy-preserving techniques, such as k-anonymity and differential privacy, into the construction of decision trees. By doing so, we seek to ensure user privacy without significantly compromising the performance of the model.

Machine learning models learn patterns from data, and during this process they might leak sensitive information. Each step of the machine learning pipeline presents unique vulnerabilities, making it essential to assess and quantify the privacy risks involved. One focus of this thesis is the quantification of privacy by devising a data reconstruction attack tailored to Principal Component Analysis (PCA), a widely used dimensionality reduction technique. Various protection mechanisms are then evaluated in terms of their effectiveness in preserving privacy against such reconstruction attacks while maintaining the utility of the model. In addition to federated learning, this thesis also addresses the privacy concerns associated with synthetic datasets generated by models such as generative networks. Specifically, we perform an attribute inference attack on synthetic datasets and quantify privacy by calculating the inference accuracy, a metric that reflects the attacker's success in estimating sensitive attributes of target individuals.
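As one concrete illustration of how differential privacy can enter decision tree construction, the sketch below perturbs the class counts a tree node would use when evaluating a split, using the standard Laplace mechanism; the sensitivity of one and the choice of what to perturb are textbook defaults, not the exact mechanisms developed in the thesis.

    import numpy as np

    def laplace_noisy_counts(counts, epsilon, sensitivity=1.0, rng=None):
        # Laplace mechanism: adding or removing one record changes each count
        # by at most `sensitivity`, so noise with scale sensitivity/epsilon
        # makes the released counts epsilon-differentially private.
        rng = rng or np.random.default_rng()
        return counts + rng.laplace(0.0, sensitivity / epsilon, size=len(counts))

    # Class counts a decision tree node would use to score a candidate split.
    true_counts = np.array([120.0, 30.0])
    for eps in (0.1, 1.0, 10.0):
        noisy = laplace_noisy_counts(true_counts, eps, rng=np.random.default_rng(42))
        print(f"epsilon={eps}: noisy counts = {np.round(noisy, 1)}")

Smaller epsilon means more noise: stronger privacy but noisier split decisions, which is precisely the privacy-utility trade-off the thesis navigates.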

Overall, this thesis contributes privacy-preserving algorithms for decision trees in federated learning and introduces methods to quantify privacy in machine learning systems. The findings also lay the groundwork for further research at the intersection of privacy and machine learning.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2024, p. 103
Series
UMINF, ISSN 0348-0542 ; 24.10
Keywords [en]
Privacy, Data Reconstruction Attacks, k-anonymity, Differential Privacy, Federated Learning, Decision Trees, Principal Component Analysis
National Category
Computer Sciences; Other Engineering and Technologies
Identifiers
URN: urn:nbn:se:umu:diva-230616
ISBN: 978-91-8070-481-6 (print)
ISBN: 978-91-8070-482-3 (electronic)
OAI: oai:DiVA.org:umu-230616
DiVA, id: diva2:1904274
Public defence
2024-11-04, BIO.A.206 Aula Anatomica, Biologihuset, 09:15 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2024-10-14. Created: 2024-10-08. Last updated: 2024-10-24. Bibliographically approved.
List of papers
1. A Survey on Tree Aggregation
2021 (English). In: 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), IEEE, 2021, Vol. 2021-July. Conference paper, Published paper (Refereed)
Abstract [en]

Research on the aggregation of classification trees and general trees (hierarchical structures of objects) has made enormous progress in the past decade. The problem statement is as follows: given k classification or general trees for a set of objects, we aim to build a consensus tree, that is, a single tree representative of the given trees. In this paper, we explore the different motivations researchers have given for constructing a single tree from multiple trees. The survey presents aggregation approaches for both classification trees and general trees, which we divide into two categories: selecting a single tree from multiple trees, and merging trees. We discuss these categories and the approaches within them comprehensively. We also discuss the privacy aspects of tree aggregation approaches and possible directions for new research, such as applying decision tree aggregation in the rapidly growing field of Federated Learning.

Place, publisher, year, edition, pages
IEEE, 2021
Series
IEEE International Fuzzy Systems conference proceedings, ISSN 1544-5615, E-ISSN 1558-4739
Keywords
Federated Learning, Privacy, Tree Aggregation
National Category
Computer Sciences
Research subject
computer and systems sciences
Identifiers
urn:nbn:se:umu:diva-187671 (URN)
10.1109/FUZZ45933.2021.9494546 (DOI)
000698710800127 ()
2-s2.0-85114680117 (Scopus ID)
978-1-6654-4407-1 (ISBN)
978-1-6654-4408-8 (ISBN)
Conference
2021 IEEE CIS International Conference on Fuzzy Systems, FUZZ 2021, Online (Luxembourg), July 11-14, 2021.
Note

Part of the series IEEE CIS International Conference on Fuzzy Systems, ISSN 1098-7584.

Available from: 2021-09-20. Created: 2021-09-20. Last updated: 2024-10-09. Bibliographically approved.
2. A k-Anonymised Federated Learning Framework with Decision Trees
2022 (English). In: Data Privacy Management, Cryptocurrencies and Blockchain Technology / [ed] Garcia-Alfaro J.; Muñoz-Tapia J.L.; Navarro-Arribas G.; Soriano M., Springer Science+Business Media B.V., 2022, Vol. 13140, p. 106-120. Conference paper, Published paper (Refereed)
Abstract [en]

We propose a privacy-preserving framework using Mondrian k-anonymity with decision trees in a Federated Learning (FL) setting for horizontally partitioned data. Data heterogeneity in FL makes the data non-IID (non-independent and identically distributed); we use a novel approach to create non-IID partitions of data by solving an optimization problem. In this work, each device trains a decision tree classifier and shares the root node of its tree with the aggregator. The aggregator merges the trees by choosing the most common split attribute and grows the branches based on the split values of the chosen attribute. This recursive process stops when all the nodes to be merged are leaf nodes. After the merging operation, the aggregator sends the merged decision tree back to the distributed devices. In this way, we build a joint machine learning model from the data of multiple devices while offering k-anonymity to the participants.
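A minimal sketch of this merging step follows; the node representation and the use of the mean to combine split values are simplifying assumptions for illustration, not necessarily the paper's exact procedure.

    from collections import Counter
    from dataclasses import dataclass
    from statistics import mean

    @dataclass
    class Node:
        attribute: str = None      # split attribute; None for a leaf
        value: float = None        # split threshold
        label: int = None          # class label at a leaf
        left: "Node" = None
        right: "Node" = None

    def merge(nodes):
        # Merge same-depth nodes of several client trees into one node.
        # Recursion stops when every node to be merged is a leaf.
        if all(n.attribute is None for n in nodes):
            # All leaves: majority vote over the clients' labels.
            return Node(label=Counter(n.label for n in nodes).most_common(1)[0][0])
        internal = [n for n in nodes if n.attribute is not None]
        # Pick the most common split attribute across the clients' nodes.
        attr = Counter(n.attribute for n in internal).most_common(1)[0][0]
        chosen = [n for n in internal if n.attribute == attr]
        value = mean(n.value for n in chosen)   # one way to combine split values
        # Recurse on the children of the chosen nodes.
        left = merge([n.left for n in chosen])
        right = merge([n.right for n in chosen])
        return Node(attribute=attr, value=value, left=left, right=right)

    # Two toy client trees sharing the same structure.
    t1 = Node("age", 30.0, left=Node(label=0), right=Node(label=1))
    t2 = Node("age", 40.0, left=Node(label=0), right=Node(label=1))
    merged = merge([t1, t2])
    print(merged.attribute, merged.value)   # age 35.0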

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2022
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349
Keywords
Aggregation, Decision tree, Federated learning, Privacy
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:umu:diva-192745 (URN)
10.1007/978-3-030-93944-1_7 (DOI)
000754558600007 ()
2-s2.0-85124654159 (Scopus ID)
9783030939434 (ISBN)
Conference
16th International Workshop on Data Privacy Management, DPM 2021, and 5th International Workshop on Cryptocurrencies and Blockchain Technology, CBT 2021 held in conjunction with ESORICS 2021, Online, 8 October, 2021.
Available from: 2022-02-24. Created: 2022-02-24. Last updated: 2024-10-09. Bibliographically approved.
3. Data reconstruction attack against principal component analysis
2023 (English). In: Security and Privacy in Social Networks and Big Data: 9th International Symposium, SocialSec 2023, Canterbury, UK, August 14–16, 2023 / [ed] Budi Arief; Anna Monreale; Michael Sirivianos; Shujun Li, Springer Science+Business Media B.V., 2023, p. 79-92. Conference paper, Published paper (Refereed)
Abstract [en]

Attacking machine learning models is one of the many ways to measure their privacy. Studying the performance of attacks against machine learning techniques is therefore essential to know whether information about machine learning models can be shared and, if so, how much. In this work, we investigate one of the most widely used dimensionality reduction techniques, Principal Component Analysis (PCA). We build on a recent paper that shows how to attack PCA using a Membership Inference Attack (MIA): the adversary gets access to some of the principal components and wants to determine whether a particular record was used to compute them. We assume that the adversary knows the distribution of the training data, a reasonable and useful assumption for a membership inference attack. Under this assumption, we show that the adversary can mount a data reconstruction attack, which is more severe than a membership attack. As a protection mechanism, we propose that the data guardian first generate synthetic data and then compute the principal components. We also compare our proposed approach with Differentially Private Principal Component Analysis (DPPCA). The experimental findings show the degree to which the adversary can recover the users' original data, and we obtain results comparable to DPPCA. The number of principal components the attacker intercepts affects the attack's outcome. Our work thus aims to answer how much information about machine learning models is safe to disclose while protecting users' privacy.
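The attack itself combines the intercepted components with knowledge of the data distribution via generative models; the simpler sketch below only illustrates why the number of intercepted components matters, by measuring how faithfully centered data can be rebuilt from its projection onto k principal components.

    import numpy as np

    rng = np.random.default_rng(7)
    # Toy "training data": 200 records with 5 correlated features.
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    X -= X.mean(axis=0)                      # PCA assumes centered data

    # Principal components: right singular vectors of the centered data.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    for k in range(1, 6):
        Vk = Vt[:k]                          # the k components an attacker holds
        X_hat = (X @ Vk.T) @ Vk              # project onto and back from them
        err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
        print(f"components: {k}, relative reconstruction error: {err:.3f}")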

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14097
Keywords
Data reconstruction attack, Generative Adversarial Networks, Membership Inference Attack, Principal Component Analysis, Privacy
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-214976 (URN)
10.1007/978-981-99-5177-2_5 (DOI)
2-s2.0-85172275690 (Scopus ID)
9789819951765 (ISBN)
978-981-99-5177-2 (ISBN)
Conference
9th International Symposium, SocialSec 2023, Canterbury, UK, August 14–16, 2023.
Available from: 2023-10-16. Created: 2023-10-16. Last updated: 2024-10-09. Bibliographically approved.
4. Empirical evaluation of synthetic data created by generative models via attribute inference attack
2024 (English). In: Privacy and identity management: sharing in a digital world / [ed] Felix Bieker; Silvia de Conca; Nils Gruschka; Meiko Jensen; Ina Schiering, Springer, 2024, p. 282-291. Conference paper, Published paper (Refereed)
Abstract [en]

The disclosure risk of synthetic/artificial data is still being determined. Studies show that synthetic data generation techniques produce data similar to the original data, and sometimes even exact original records. Publishing synthetic datasets can therefore endanger the privacy of users. In our work, we study synthetic data generated by different synthetic data generation techniques, including the most recent diffusion models. We perform a disclosure risk assessment of synthetic datasets via an attribute inference attack, in which an attacker has access to a subset of publicly available features and at least one synthesized dataset, and aims to infer the sensitive features unknown to them. We also compute the predictive accuracy and F1 score of a random forest classifier trained on several synthetic datasets. For sensitive categorical features, we show that the attribute inference attack is not highly feasible or successful. In contrast, for continuous attributes, an approximate inference is possible. This holds for the synthetic datasets derived from diffusion models, GANs, and DPGANs, which shows that only approximate attribute inference, not exact attribute inference, is achievable.
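A minimal sketch of such an attribute inference attack follows; simulating the synthetic dataset as a fresh draw from the same process, and the choice of a random forest as the attack model, are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(1)

    def make_records(n):
        # Public features plus one sensitive attribute correlated with them.
        public = rng.normal(size=(n, 3))
        sensitive = public[:, 0] + 0.5 * public[:, 1] + rng.normal(0.0, 0.5, n) > 0
        return public, sensitive.astype(int)

    # The attacker sees only a synthetic dataset (simulated here as a fresh
    # draw from the same process) and the targets' public features.
    synth_public, synth_sensitive = make_records(1000)
    target_public, target_sensitive = make_records(200)

    attack = RandomForestClassifier(random_state=0)
    attack.fit(synth_public, synth_sensitive)        # learn public -> sensitive
    guesses = attack.predict(target_public)
    # Inference accuracy: the attacker's success rate on the targets.
    print("inference accuracy:", (guesses == target_sensitive).mean())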

Place, publisher, year, edition, pages
Springer, 2024
Series
IFIP Advances in Information and Communication Technology (IFIPAICT), ISSN 1868-4238, E-ISSN 1868-422X ; 695
Keywords
Attribute Inference Attack, Differentially Private Generative Adversarial Networks, Diffusion Models, Generative Adversarial Networks, Privacy
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:umu:diva-224381 (URN)
10.1007/978-3-031-57978-3_18 (DOI)
2-s2.0-85192354341 (Scopus ID)
9783031579776 (ISBN)
9783031579783 (ISBN)
Conference
18th IFIP WG 9.2, 9.6/11.7, 11.6/SIG 9.2.2 International Summer School on Privacy and Identity Management, Privacy and Identity 2023. Oslo, Norway, August 8–11, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2024-05-24. Created: 2024-05-24. Last updated: 2024-10-09. Bibliographically approved.
5. Balancing act: navigating the privacy-utility spectrum in principal component analysis
2024 (English). In: Proceedings of the International Conference on Security and Cryptography / [ed] Sabrina De Capitani Di Vimercati; Pierangela Samarati, Science and Technology Publications, Lda, 2024, p. 850-857. Conference paper, Published paper (Refereed)
Abstract [en]

A lot of research on federated learning has been ongoing ever since it was proposed. Federated learning allows collaborative learning among distributed clients without sharing their raw data with a central aggregator (if one is present) or with other clients in a peer-to-peer architecture. However, each client participating in the federation shares the model information learned from its data with the other clients in the FL process, or with the central aggregator. This sharing of information makes the approach vulnerable to various attacks, including data reconstruction attacks. Our research focuses specifically on Principal Component Analysis (PCA), as it is a widely used dimensionality reduction technique. To perform PCA in a federated setting, distributed clients share local eigenvectors computed from their respective data with the aggregator, which then combines them and returns global eigenvectors. Previous studies on attacks against PCA have demonstrated that revealing eigenvectors can lead to membership inference and, when coupled with knowledge of the data distribution, to data reconstruction attacks. Our objective in this work is therefore to strengthen the privacy of the eigenvectors while sustaining their utility. To obtain protected eigenvectors, we use k-anonymity and generative networks. Through our experimentation, we performed a complete privacy and utility analysis of the original and protected eigenvectors. For the utility analysis, we applied hierarchical clustering, a random forest regressor, and a random forest classifier to the protected and original eigenvectors. Applying hierarchical clustering to the original and protected datasets and eigenvectors gave interesting results: the height at which the clusters merge declined from 250 for the original California Housing data to 150 for its synthetic version, while for the k-anonymous version the height lies between 150 and 250. To evaluate the privacy risks of the federated PCA system, we act as an attacker and conduct a data reconstruction attack.
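The federated PCA step described above can be sketched as follows; scaling each client's local eigenvectors by their singular values and re-factorizing the stacked result is one common aggregation choice, not necessarily the paper's.

    import numpy as np

    rng = np.random.default_rng(3)

    def local_components(X, k):
        # Each client computes its own top-k eigenvectors; raw data stays put.
        Xc = X - X.mean(axis=0)
        _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        return s[:k, None] * Vt[:k]          # components scaled by importance

    def aggregate(shared, k):
        # The server stacks the clients' scaled components and re-factorizes
        # them to obtain global eigenvectors.
        _, _, Vt = np.linalg.svd(np.vstack(shared), full_matrices=False)
        return Vt[:k]

    k = 2
    clients = [rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4)) for _ in range(3)]
    global_V = aggregate([local_components(X, k) for X in clients], k)
    print("global eigenvectors:", np.round(global_V, 3))

It is exactly these shared eigenvectors that the protection mechanisms studied in the paper (k-anonymity, generative networks) would replace with protected versions.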

Place, publisher, year, edition, pages
Science and Technology Publications, Lda, 2024
Series
International Conference on Security and Cryptography, ISSN 2184-7711
Keywords
Data Reconstruction Attack, Federated Learning, Generative Networks, k-anonymity, Membership Inference Attack, Principal Component Analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-229365 (URN)
10.5220/0012855000003767 (DOI)
2-s2.0-85202804524 (Scopus ID)
9789897587092 (ISBN)
Conference
21st International Conference on Security and Cryptography, SECRYPT 2024, Dijon, 8 July 2024 to 10 July 2024
Funder
EU, Horizon 2020, NFRAIA-01-2018-2019
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2024-09-23. Created: 2024-09-23. Last updated: 2024-10-09. Bibliographically approved.
6. DISCOLEAF: Personalized DIScretization of COntinuous Attributes for LEArning with Federated Decision Trees
2024 (English). In: Privacy in statistical databases: International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings, Springer Nature, 2024, p. 344-357. Conference paper, Published paper (Refereed)
Abstract [en]

Federated learning is a distributed machine learning framework in which each client participating in the federation trains a machine learning model on its data and shares the trained model information with a central server, which aggregates it and sends the aggregated information back to the distributed clients. The machine learning model we choose to work with is the decision tree, due to its simplicity and interpretability. We propose a full-fledged federated pipeline comprising discretization and learning with decision trees for horizontally partitioned data. Our federated discretization approach can be plugged in as a preprocessing step before any other federated learning algorithm. During discretization, each client creates a number of discrete bins according to its own data and choice; hence, our approach is both federated and personalized. After discretization, we apply the post-randomization method to protect the discretized data with differential privacy guarantees. Each client then trains a decision tree classifier on its protected database locally and shares the nodes, containing the split attribute and the split value, with the central server. The central server selects the most frequently occurring split attribute and combines the split values; this process continues until all the nodes to be merged are leaf nodes, after which the central server shares the merged tree with the distributed clients. Our proposed framework thus performs personalized, privacy-preserving federated learning with decision trees by discretizing continuous attributes and then masking them prior to the training stage. We call our framework DISCOLEAF.
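A minimal sketch of the local discretize-then-randomize step follows; equal-width bins and plain randomized response over bin labels are illustrative stand-ins for the paper's personalized discretization and post-randomization method.

    import numpy as np

    rng = np.random.default_rng(5)

    def discretize(column, n_bins):
        # Personalized discretization: each client picks its own number of
        # equal-width bins for a continuous attribute.
        edges = np.linspace(column.min(), column.max(), n_bins + 1)
        return np.digitize(column, edges[1:-1])      # bin labels 0 .. n_bins-1

    def randomize(bins, n_bins, epsilon):
        # Randomized response over bin labels: keep a label with probability
        # e^eps / (e^eps + n_bins - 1), otherwise replace it with a uniformly
        # chosen *different* bin; this satisfies epsilon-DP per value.
        p_keep = np.exp(epsilon) / (np.exp(epsilon) + n_bins - 1)
        keep = rng.random(len(bins)) < p_keep
        other = (bins + rng.integers(1, n_bins, size=len(bins))) % n_bins
        return np.where(keep, bins, other)

    # One client's continuous attribute, discretized and then masked locally
    # before any decision tree training takes place.
    age = rng.uniform(18, 90, size=1000)
    bins = discretize(age, n_bins=5)                 # this client chose 5 bins
    protected = randomize(bins, n_bins=5, epsilon=1.0)
    print("fraction of bin labels changed:", (bins != protected).mean())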

Place, publisher, year, edition, pages
Springer Nature, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14915
Keywords
Federated Learning, Discretization, Decision Trees, Differential Privacy, Decision Tree Aggregation
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-230552 (URN)
10.1007/978-3-031-69651-0_23 (DOI)
001343416300023 ()
2-s2.0-85205134092 (Scopus ID)
978-3-031-69650-3 (ISBN)
978-3-031-69651-0 (ISBN)
Conference
International Conference on Privacy in Statistical Databases (PSD) 2024, Antibes Juan-les-Pins, France, September 25-27, 2024
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Swedish Research Council, 2023-05531
Available from: 2024-10-04. Created: 2024-10-04. Last updated: 2025-04-24. Bibliographically approved.
7. Rakshit: attacking and defending federated learning with decision trees
(English). Manuscript (preprint) (Other academic)
Abstract [en]

In Federated Learning (FL), two or more devices collaboratively learn a model without sharing their data. However, it has been shown that sharing model parameters can lead to substantial privacy leakage. We study an FL framework for training Gradient Boosting Decision Trees (GBDT) called Weighted Gradient Boosting Decision Trees (WGBDT), in which an aggregated gradient over all similar records from different devices is used to update the gradients when building trees for each device. One approach for determining similar records among different devices is Locality Sensitive Hashing (LSH), which places similar records in the same bucket and dissimilar records in different buckets. WGBDT thus uses the aggregated gradient of similar samples while training a GBDT model for each distributed device, and it assumes that each device knows the hash values of the other devices' records. Using two data reconstruction attacks, we show that this assumption compromises individual privacy. When the number of LSH functions (L) is larger than the data dimension (d), we can reconstruct more than 90% of the training data, and we show that even when L < d, significant data reconstruction is still possible. We propose Rakshit, which adds another layer of anonymization using Mondrian. We show that the reconstruction accuracy is lower under Rakshit, although some reconstruction remains possible. Our findings provide useful insights for research in privacy-preserving FL for training GBDT.
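The sketch below illustrates the underlying risk rather than the paper's two attacks: with random-hyperplane LSH, every published hash bit is a half-space constraint on a record, and even a simple one-bit estimator recovers the record's direction increasingly well as the number of hash functions L grows relative to the dimension d.

    import numpy as np

    rng = np.random.default_rng(11)
    d = 10                                   # data dimension
    x = rng.normal(size=d)                   # one device's private record

    def lsh_bits(x, planes):
        # Random-hyperplane LSH: each bit records which side of a hyperplane
        # the record lies on.
        return np.sign(planes @ x)

    for L in (5, 20, 200, 2000):             # number of hash functions
        planes = rng.normal(size=(L, d))
        bits = lsh_bits(x, planes)           # what the other devices observe
        # Averaging the signed hyperplane normals is a basic one-bit
        # reconstruction of the record's direction.
        x_hat = (bits[:, None] * planes).mean(axis=0)
        cos = x @ x_hat / (np.linalg.norm(x) * np.linalg.norm(x_hat))
        print(f"L={L:>4} hash bits: cosine similarity to true record = {cos:.2f}")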

National Category
Engineering and Technology
Identifiers
urn:nbn:se:umu:diva-230672 (URN)
Available from: 2024-10-09. Created: 2024-10-09. Last updated: 2024-10-09.

Open Access in DiVA

fulltext (4514 kB), 363 downloads
File name: FULLTEXT01.pdf. File size: 4514 kB. Checksum (SHA-512):
21b3e03af8ff58427a265b6667bd97d569b65ad873c0dd6906895ed17ad2360e1a2a55bd88c029b5f9aa49681dba44348283dfe7489abf18259833a00dcc2f77
Type: fulltext. Mimetype: application/pdf
spikblad (180 kB), 51 downloads
File name: SPIKBLAD01.pdf. File size: 180 kB. Checksum (SHA-512):
7a80c02b299627b19300260977f97ea5e2d9b536a8dbb96151699ca1215b7203cfa4baece0ba6d600498d71914e98d29b2d756384480bbad70f3f220ea06798e
Type: spikblad. Mimetype: application/pdf

Authority records

Kwatra, Saloni
