dpUGC: learn differentially private representation for user generated contents
Umeå University, Faculty of Science and Technology, Department of Computing Science (Database and Data Mining Group). ORCID iD: 0000-0001-8820-2405
ICT Discipline, University of Tasmania, Hobart, Australia.
Umeå University, Faculty of Science and Technology, Department of Computing Science (Database and Data Mining Group). ORCID iD: 0000-0002-7788-3986
2023 (English). In: Computational linguistics and intelligent text processing: 20th international conference, CICLing 2019, La Rochelle, France, April 7–13, 2019, revised selected papers, part I / [ed] Alexander Gelbukh, Springer, 2023, Vol. 13451, p. 316-331. Conference paper, Published paper (Refereed)
Abstract [en]

This paper first proposes a simple yet efficient generalized approach for applying differential privacy to text representations (i.e., word embeddings). Based on it, we propose a user-level approach to learn a personalized differentially private word embedding model on user generated content (UGC). To the best of our knowledge, this is the first work on learning a user-level differentially private word embedding model from text for sharing. The proposed approaches protect individuals from re-identification and, in particular, provide a better trade-off between privacy and data utility on UGC data for sharing. The experimental results show that the trained embedding models remain applicable to classic text analysis tasks (e.g., regression). Moreover, the proposed approaches for learning differentially private embedding models are both framework- and data-independent, which facilitates deployment and sharing. The source code is available at https://github.com/sonvx/dpText.
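The building block the abstract refers to, applying differential privacy to a word embedding, can be illustrated with the classic Laplace mechanism: each coordinate of the embedding matrix is perturbed with noise whose scale is calibrated to a sensitivity bound divided by the privacy budget ε. This is a minimal sketch only; the paper's actual user-level mechanism is more involved, and the function name and parameters below are hypothetical, not taken from the dpText code.

```python
import numpy as np

def privatize_embedding(embedding, epsilon, sensitivity=1.0, seed=0):
    """Release an embedding matrix under the Laplace mechanism:
    add i.i.d. Laplace(0, sensitivity / epsilon) noise to every coordinate."""
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon,
                        size=embedding.shape)
    return embedding + noise

# Toy 3-word vocabulary with 4-dimensional vectors.
E = np.zeros((3, 4))
E_private = privatize_embedding(E, epsilon=1.0)
```

A smaller ε means larger noise and stronger privacy, which is exactly the privacy/utility trade-off the abstract discusses.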

Place, publisher, year, edition, pages
Springer, 2023. Vol. 13451, p. 316-331
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13451
Keywords [en]
Private word embedding, Differential privacy, UGC
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:umu:diva-160887
DOI: 10.1007/978-3-031-24337-0_23
Scopus ID: 2-s2.0-85149907226
ISBN: 978-3-031-24336-3 (print)
ISBN: 978-3-031-24337-0 (electronic)
OAI: oai:DiVA.org:umu-160887
DiVA, id: diva2:1330435
Conference
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 7-13, 2019.
Note

Originally included in thesis in manuscript form. 

Available from: 2019-06-25. Created: 2019-06-25. Last updated: 2023-03-28. Bibliographically approved.
In thesis
1. Privacy-awareness in the era of Big Data and machine learning
2019 (English)Licentiate thesis, comprehensive summary (Other academic)
Alternative title [sv]
Integritetsmedvetenhet i eran av Big Data och maskininlärning
Abstract [en]

Social Network Sites (SNS) such as Facebook and Twitter have been playing a great role in our lives. On the one hand, they help connect people who would not otherwise be connected. Many recent breakthroughs in AI, such as facial recognition [49], were achieved thanks to the amount of data available on the Internet via SNS (hereafter Big Data). On the other hand, due to privacy concerns, many people have tried to avoid SNS to protect their privacy. Similar to the security issues of the Internet protocol, Machine Learning (ML), as the core of AI, was not designed with privacy in mind. For instance, Support Vector Machines (SVMs) solve a quadratic optimization problem by deciding which instances of the training dataset are support vectors. This means that the data of people involved in the training process is also published within the SVM models. Thus, privacy guarantees must be applied to the worst-case outliers, while data utility still has to be guaranteed.
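The SVM observation above can be checked directly: a fitted SVM stores rows of the training set verbatim as its support vectors, so releasing the model releases those individuals' records. A minimal sketch using scikit-learn (an illustration chosen here, not code from the thesis):

```python
import numpy as np
from sklearn.svm import SVC

# Four linearly separable 2-D training points, two per class.
X = np.array([[0.0, 0.0], [0.0, 1.0],
              [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Every support vector stored inside the model is literally a row of
# the training data, i.e., an individual's record leaks with the model.
leaked = [any(np.array_equal(sv, x) for x in X)
          for sv in clf.support_vectors_]
```

This is why the thesis argues privacy protection must be built into the learning algorithm itself rather than bolted on afterwards.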

For the above reasons, this thesis studies: (1) how to construct data federation infrastructure with privacy guarantees in the big data era; and (2) how to protect privacy while learning ML models with a good trade-off between data utility and privacy. For the first point, we proposed different frameworks empowered by privacy-aware algorithms that satisfy the definition of differential privacy, the state-of-the-art privacy guarantee. Regarding (2), we proposed different neural network architectures to capture the sensitivities of user data, from which the algorithm itself decides how much it should learn from user data to protect privacy while achieving good performance on a downstream task. The current outcomes of the thesis are: (1) a privacy-guaranteed data federation infrastructure for analysis of sensitive data; (2) privacy-guaranteed algorithms for data sharing; and (3) privacy-aware data analysis on social network data. The research methods used in this thesis include experiments on real-life social network datasets to evaluate aspects of the proposed approaches.

Insights and outcomes from this thesis can be used by both academia and industry to guarantee privacy in the analysis and sharing of personal data. They also have the potential to facilitate related research in privacy-aware representation learning and its evaluation methods.

Place, publisher, year, edition, pages
Umeå: Department of computing science, Umeå University, 2019. p. 42
Series
Report / UMINF, ISSN 0348-0542 ; 19.06
Keywords
Differential Privacy, Machine Learning, Deep Learning, Big Data
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-162182 (URN)
9789178551101 (ISBN)
Presentation
2019-09-09, 23:40 (English)
Supervisors
Available from: 2019-08-22. Created: 2019-08-15. Last updated: 2021-03-18. Bibliographically approved.
2. Privacy-guardian: the vital need in machine learning with big data
2020 (English)Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Integritetsväktaren: det vitala behovet i maskininlärning med big data
Abstract [en]

Social Network Sites (SNS) such as Facebook and Twitter play a great role in our lives. On the one hand, they help to connect people who would not otherwise be connected. Many recent breakthroughs in AI, such as facial recognition [Kow+18], were achieved thanks to the amount of data available on the Internet via SNS (hereafter Big Data). On the other hand, many people have tried to avoid SNS to protect their privacy [Sti+13]. However, Machine Learning (ML), as the core of AI, was not designed with privacy in mind. For instance, one of the most popular supervised machine learning algorithms, Support Vector Machines (SVMs), solves a quadratic optimization problem in which the data of people involved in the training process is also published within the SVM models. Similarly, many other ML applications (e.g., ClearView) compromise the privacy of individuals present in the data, especially as the big data era accelerates data federation. Thus, in the context of machine learning with big data, it is important to (1) protect sensitive information (privacy protection) while (2) preserving the quality of the algorithms' output (i.e., data utility).

Given the vital need for privacy in machine learning with big data, this thesis studies: (1) how to construct information infrastructures for data federation with privacy guarantees in the big data era; and (2) how to protect privacy while learning ML models with a good trade-off between data utility and privacy. For the first point, we proposed different frameworks empowered by privacy-aware algorithms. Regarding the second point, we proposed different neural architectures to capture the sensitivities of user data, from which the algorithms themselves decide how much they should learn from user data to protect privacy while achieving good performance on downstream tasks. The current outcomes of the thesis are: (a) a privacy-guaranteed data federation infrastructure for analysis of sensitive data; (b) privacy utilities for privacy-concerned analysis; and (c) privacy-aware algorithms for learning on personal data. For each outcome, extensive experimental studies were conducted on real-life social network datasets to evaluate aspects of the proposed approaches.

Insights and outcomes from this thesis can be used by both academia and industry to provide privacy-guaranteed data analysis and learning on big data containing personal information. They also have the potential to facilitate related research in privacy-aware learning and its evaluation methods.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2020. p. 63
Series
Report / UMINF, ISSN 0348-0542 ; 20.11
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-175292 (URN)
978-91-7855-377-8 (ISBN)
978-91-7855-376-1 (ISBN)
Public defence
2020-10-20, N460, Naturvetarhuset, Umeå, 14:00 (English)
Opponent
Supervisors
Available from: 2020-09-29. Created: 2020-09-23. Last updated: 2021-03-18. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text · Scopus · arXiv

Authority records

Vu, Xuan-Son; Jiang, Lili
