Unlearning clients, features and samples in vertical federated learning
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-8073-6784
Ericsson Research, Ericsson Stockholm, Sweden.
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-0368-8037
2025 (English). In: Proceedings on Privacy Enhancing Technologies, E-ISSN 2299-0984, Vol. 2025, no. 2, p. 39-53. Article in journal (Refereed). Published.
Abstract [en]

Federated Learning (FL) has emerged as a prominent distributed learning paradigm that allows multiple users to collaboratively train a model without sharing their data, thus preserving privacy. Within the scope of privacy preservation, information privacy regulations such as the GDPR entitle users to request the removal (or unlearning) of their contribution from a service that hosts the model. For this purpose, a server hosting an ML model must be able to unlearn certain information, for example in cases of copyright infringement or security issues that make the model vulnerable or degrade the performance of a service built on it. While most unlearning approaches in FL focus on Horizontal Federated Learning (HFL), where clients share the feature space and the global model, Vertical Federated Learning (VFL) has received less attention from the research community. In VFL, clients (passive parties) share the sample space among them while not having access to the labels. In this paper, we explore unlearning in VFL from three perspectives: unlearning passive parties, unlearning features, and unlearning samples. To unlearn passive parties and features, we introduce VFU-KD, which is based on knowledge distillation (KD); to unlearn samples, we introduce VFU-GA, which is based on gradient ascent (GA). To provide evidence of approximate unlearning, we use a Membership Inference Attack (MIA) to audit the effectiveness of our unlearning approach. Our experiments across six tabular datasets and two image datasets demonstrate that VFU-KD and VFU-GA achieve performance comparable to or better than both retraining from scratch and the benchmark R2S method in many cases, with improvements of 0-2%. In the remaining cases, utility scores remain comparable, with a modest utility loss of 1-5%. Unlike existing methods, VFU-KD and VFU-GA require no communication between active and passive parties during unlearning; however, they do require the active party to store the previously communicated embeddings.
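
The abstract only names the techniques; as a rough illustration of the gradient-ascent idea behind sample unlearning (in the spirit of VFU-GA, not the authors' implementation), the sketch below assumes the active party keeps the passive parties' previously communicated embeddings, as stated above, and takes ascent steps on the loss of the samples to be forgotten. All names and hyperparameters are hypothetical.

# Hypothetical sketch (not the paper's code): sample unlearning via gradient
# ascent on the active party's top model, using stored passive-party embeddings.
import torch
import torch.nn.functional as F

def unlearn_samples_gradient_ascent(top_model, stored_embeddings, labels,
                                    forget_idx, lr=1e-3, steps=10):
    """Degrade the top model's fit on the samples to be forgotten.

    top_model         -- active party's model over concatenated embeddings (assumed)
    stored_embeddings -- previously communicated embeddings, shape (n, d)
    labels            -- labels held by the active party
    forget_idx        -- indices of the samples whose influence should be removed
    """
    opt = torch.optim.SGD(top_model.parameters(), lr=lr)
    x_forget = stored_embeddings[forget_idx]
    y_forget = labels[forget_idx]
    for _ in range(steps):
        opt.zero_grad()
        # Minimising the negated loss is gradient ascent on the loss over the
        # forget set; no communication with passive parties is needed here.
        loss = -F.cross_entropy(top_model(x_forget), y_forget)
        loss.backward()
        opt.step()
    return top_model

In practice the ascent would be bounded (e.g., early stopping against a retained validation set) and the result audited with a membership inference attack, as the paper does; that bookkeeping is omitted here.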

Place, publisher, year, edition, pages
Privacy Enhancing Technologies Symposium Advisory Board, 2025. Vol. 2025, no. 2, p. 39-53
Keywords [en]
Federated learning; Unlearning; Vertical federated learning; Auditing; Membership inference attack (MIA)
National Category
Security, Privacy and Cryptography
Identifiers
URN: urn:nbn:se:umu:diva-237429
DOI: 10.56553/popets-2025-0048
OAI: oai:DiVA.org:umu-237429
DiVA id: diva2:1950809
Conference
The 25th Privacy Enhancing Technologies Symposium, Washington, USA and online, July 14-19, 2025
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-04-09. Created: 2025-04-09. Last updated: 2025-04-28. Bibliographically approved.
In thesis
1. Navigating model anonymity and adaptability
2025 (English). Doctoral thesis, monograph (Other academic).
Abstract [en]

In the rapidly evolving era of Artificial Intelligence (AI), privacy-preserving techniques have become increasingly important, particularly in areas such as deep learning (DL). Deep learning models are inherently data-intensive and often rely on large volumes of sensitive personal information to achieve high performance. The increasing usage of data has raised critical concerns around data privacy, user control, and regulatory compliance. As data-driven systems grow in complexity and scale, safeguarding individual privacy while maintaining model utility has become a central challenge in the field of deep learning. Traditional privacy-preserving frameworks focus either on protecting privacy at the database level (e.g., k-anonymity) or during the inference/output from the model (e.g., differential privacy), each introducing trade-offs between privacy and utility. While these approaches have contributed significantly to mitigating privacy risks, they also face practical limitations, such as vulnerability to inference attacks or degradation in model performance due to the added noise. 

In this thesis, we take a different approach by focusing on anonymous models (i.e., models that can be generated from a set of different datasets) in model space with integral privacy. Anonymous models create ambiguity for an intruder by ensuring that a trained model could plausibly have originated from various datasets, while also giving the flexibility to choose models that do not cause much utility loss. Since exhaustively exploring the model space to find recurring models is computationally intensive, we introduce a relaxed variant called ∆-Integral Privacy, where two models are considered recurring if they are within a bounded ∆ distance. Using this notion, we present practical frameworks for generating integrally private models for both machine learning and deep learning settings. We also provide probabilistic guarantees, demonstrating that under similar training environments, models tend to recur with high probability when optimized using mean samplers (e.g., SGD, Adam). These recurring models can further be used as an ensemble of private models to estimate prediction uncertainty, which in turn enables privacy-preserving concept drift detection in streaming data. We further extend our investigation to distributed settings, particularly Federated Learning (FL), where a central server only aggregates client-side model updates without accessing raw data. With strong empirical evidence supported by theoretical guarantees, we argue that our frameworks with integral privacy are robust alternatives to conventional privacy-preserving methods.
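
As a purely illustrative reading of the relaxed notion above (the thesis defines the formal details), two trained models are treated as recurring under ∆-Integral Privacy when their parameters lie within a bounded distance ∆; the norm and threshold in this sketch are assumptions, not the thesis's definition.

# Hypothetical sketch: flag two trained models as "recurring" when their
# flattened parameter vectors are within a bounded distance Delta.
import numpy as np

def flatten_params(params):
    """Concatenate a sequence of parameter arrays into one vector."""
    return np.concatenate([np.ravel(p) for p in params])

def is_recurring(params_a, params_b, delta, ord=2):
    """Return True if the two models are within Delta of each other.

    The L2 norm used here is an illustrative choice; the thesis fixes its
    own distance measure and how Delta is chosen.
    """
    diff = flatten_params(params_a) - flatten_params(params_b)
    return bool(np.linalg.norm(diff, ord=ord) <= delta)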

In addition to generating anonymous models, this thesis also focuses on developing approaches that enable users to remove their data, and its influence, from a machine learning model (machine unlearning) under the right to be forgotten. As the demand for data removal and compliance with right-to-be-forgotten regulations intensifies, the need for efficient, auditable, and realistic unlearning mechanisms becomes increasingly urgent. Beyond regulatory compliance, machine unlearning plays a vital role in removing copyrighted content as well as mitigating security risks. In federated learning, when clients share the same feature space and participate in large numbers, we argue that if the server employs an integrally private aggregation mechanism, it can plausibly deny a client's participation in training the global model to a certain extent, thereby reducing the computational requirements for frequent unlearning requests. However, when there is a limited number of clients with complementary features, the server must employ unlearning mechanisms to deal with model compression and high communication cost. This thesis shows that machine unlearning can be made efficient, effective, and auditable, even in complex, distributed, and generative environments. Our work spans multiple dimensions, including privacy, auditability, computational efficiency, and adaptability across diverse data modalities and model architectures.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 180
Series
Report / UMINF, ISSN 0348-0542 ; 25.04
Keywords
Machine learning, Integral privacy, Federated Learning, Machine Unlearning, Federated Unlearning, Generative Machine Unlearning, Multimodal Unlearning.
National Category
Artificial Intelligence
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-238196
ISBN: 978-91-8070-671-1
ISBN: 978-91-8070-672-8
Public defence
2025-05-26, BIO.A.206 - Aula Anatomica, Biologihuset, plan 2, Umeå, 10:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2025-05-05. Created: 2025-04-27. Last updated: 2025-05-05. Bibliographically approved.

Open Access in DiVA

fulltext (3530 kB), 41 downloads
File name: FULLTEXT01.pdf
File size: 3530 kB
Checksum (SHA-512): a9c4c26494d43e40c0a865137890ce5a9c3c31007146b33d6edd362573bc508b12056ce157a941f130b0327f0af34698d69e18be9b01f4d92613d9a467a3aa21
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text

Authority records

Varshney, Ayush K.; Torra, Vicenç
