Privacy-awareness in the era of Big Data and machine learning
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Database and Data Mining Group) ORCID iD: 0000-0001-8820-2405
2019 (English). Licentiate thesis, comprehensive summary (Other academic)
Alternative title: Integritetsmedvetenhet i eran av Big Data och maskininlärning (Swedish)
Abstract [en]

Social Network Sites (SNS) such as Facebook and Twitter have come to play a major role in our lives. On the one hand, they help connect people who would not otherwise be connected. Many recent breakthroughs in AI, such as facial recognition [49], were achieved thanks to the amount of data available on the Internet via SNS (hereafter Big Data). On the other hand, many people avoid SNS because of privacy concerns. Much like the Internet protocols, which were not designed with security in mind, Machine Learning (ML), the core of AI, was not designed with privacy in mind. For instance, Support Vector Machines (SVMs) solve a quadratic optimization problem by deciding which instances of the training dataset are support vectors. This means that the data of people involved in the training process are also published within the SVM models. Thus, privacy guarantees must hold even for worst-case outliers, while data utility must still be preserved.
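
As a reminder of why support vectors leak training data, the standard kernel SVM decision function can be written as follows. The notation is the conventional textbook one and is not taken from the thesis: $SV$ is the set of support-vector indices, $\alpha_i$ are the learned dual coefficients, $y_i \in \{-1, +1\}$ are labels, $K$ is the kernel and $b$ is the bias.

```latex
f(x) = \operatorname{sign}\Big( \sum_{i \in SV} \alpha_i \, y_i \, K(x_i, x) + b \Big)
```

Evaluating $f$ requires the raw training instances $x_i$ for every $i \in SV$, so a released model carries those individuals' records verbatim.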

For these reasons, this thesis studies: (1) how to construct a data federation infrastructure with privacy guarantees in the Big Data era; and (2) how to protect privacy while learning ML models with a good trade-off between data utility and privacy. Regarding (1), we proposed different frameworks empowered by privacy-aware algorithms that satisfy the definition of differential privacy, the state-of-the-art formal privacy guarantee. Regarding (2), we proposed different neural network architectures that capture the sensitivity of user data, so that the algorithm itself decides how much it should learn from user data to protect privacy while still achieving good performance on a downstream task. The current outcomes of the thesis are: (1) a privacy-guaranteed data federation infrastructure for data analysis on sensitive data; (2) privacy-guaranteed algorithms for data sharing; and (3) privacy-concern data analysis on social network data. The research methods used in this thesis include experiments on real-life social network datasets to evaluate aspects of the proposed approaches.
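
For reference, the standard definition of $\varepsilon$-differential privacy is given below. The symbols are the conventional ones and are assumptions here, not notation from the thesis: $\mathcal{M}$ is a randomized mechanism, and $D$, $D'$ are neighbouring datasets that differ in one individual's record.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
\qquad \text{for all neighbouring } D, D' \text{ and all } S \subseteq \operatorname{Range}(\mathcal{M})
```

A smaller $\varepsilon$ (the privacy budget) forces the two output distributions closer together, which strengthens privacy but usually costs utility.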

Insights and outcomes from this thesis can be used by both academia and industry to guarantee privacy for data analysis and data sharing involving personal data. They also have the potential to facilitate relevant research in privacy-aware representation learning and related evaluation methods.

Place, publisher, year, edition, pages
Umeå: Department of Computing Science, Umeå University, 2019. p. 42
Series
Report / UMINF, ISSN 0348-0542 ; 19.06
Keywords [en]
Differential Privacy, Machine Learning, Deep Learning, Big Data
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-162182; ISBN: 9789178551101 (print); OAI: oai:DiVA.org:umu-162182; DiVA, id: diva2:1343260
Presentation
2019-09-09, 23:40 (English)
Supervisors
Available from: 2019-08-22 Created: 2019-08-15 Last updated: 2019-08-26. Bibliographically approved
List of papers
1. Personality-based Knowledge Extraction for Privacy-preserving Data Analysis
2017 (English) In: K-CAP 2017 - Proceedings of the Knowledge Capture Conference, Austin, TX, USA: ACM Digital Library, 2017, article id 45. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a differential-privacy-preserving approach that extracts personality-based knowledge to support privacy-guaranteed data analysis on sensitive personal data. Based on this approach, we further implement an end-to-end privacy-guaranteed system, KaPPA, which lets researchers perform iterative data analysis on sensitive data. The key challenge for differential privacy is determining a reasonable privacy budget that balances privacy preservation and data utility. Most previous work applies a uniform privacy budget to all individuals' data, which leads to insufficient privacy protection for some individuals while over-protecting others. In KaPPA, the proposed personality-based privacy-preserving approach automatically calculates a privacy budget for each individual. Our experimental evaluations show that the approach achieves a good trade-off between sufficient privacy protection and data utility.

Place, publisher, year, edition, pages
Austin, TX, USA: ACM Digital Library, 2017
Keywords
Differential Privacy, Privacy-preserving Data Analysis
National Category
Language Technology (Computational Linguistics)
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-143228 (URN); 10.1145/3148011.3154479 (DOI); 978-1-4503-5553-7 (ISBN)
Conference
K-CAP 2017: The 9th International Conference on Knowledge Capture, Austin, Texas, December 4-6, 2017
Project
Privacy-aware data federation
Available from: 2017-12-19 Created: 2017-12-19 Last updated: 2019-08-22. Bibliographically approved
2. Graph-based Interactive Data Federation System for Heterogeneous Data Retrieval and Analytics
2019 (English) In: Proceedings of The World Wide Web Conference WWW 2019, New York, NY, USA: ACM Digital Library, 2019, p. 3595-3599. Conference paper, Published paper (Refereed)
Abstract [en]

Given the increasing amount of heterogeneous data stored in relational databases, file systems, or cloud environments, such data needs to be easily accessed and semantically connected for further data analytics. The potential of data federation is largely untapped. This paper presents an interactive data federation system (https://vimeo.com/319473546) that applies large-scale techniques, including heterogeneous data federation, natural language processing, association rules, and the semantic web, to perform data retrieval and analytics on social network data. The system first creates a Virtual Database (VDB) to virtually integrate data from multiple data sources. Next, an RDF generator is built to unify the data and, together with SPARQL queries, to support semantic search over text data processed by natural language processing (NLP). Association rule analysis is used to discover patterns and identify the most important co-occurrences of variables from multiple data sources. The system demonstrates how it facilitates interactive data analytics for different application scenarios (e.g., sentiment analysis, privacy-concern analysis, community detection).
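
The association rule step can be illustrated with a minimal sketch of the two usual rule metrics, support and confidence. This is a generic illustration and not the system's implementation: the function names, the thresholds, and the toy `records` (with made-up attribute labels) are all assumptions.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from co-occurrence counts."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0


def pair_rules(transactions, min_support, min_confidence):
    """Enumerate single-item rules lhs -> rhs that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for lhs, rhs in ((a, b), (b, a)):
            s = support(transactions, {lhs, rhs})
            c = confidence(transactions, {lhs}, {rhs})
            if s >= min_support and c >= min_confidence:
                rules.append((lhs, rhs, s, c))
    return rules


# Hypothetical toy records: each one is the set of attributes observed
# for a user after merging several data sources.
records = [
    frozenset({"privacy_concern", "negative_sentiment"}),
    frozenset({"privacy_concern", "negative_sentiment", "young"}),
    frozenset({"young", "positive_sentiment"}),
    frozenset({"privacy_concern", "young"}),
]

for lhs, rhs, s, c in pair_rules(records, min_support=0.5, min_confidence=0.9):
    print(f"{lhs} -> {rhs} (support={s:.2f}, confidence={c:.2f})")
```

This brute-force pairwise enumeration is only practical for a handful of items. Larger item sets would normally call for a pruning algorithm such as Apriori.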

Place, publisher, year, edition, pages
New York, NY, USA: ACM Digital Library, 2019
Keywords
heterogeneous data federation, RDF, interactive data analysis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:umu:diva-160892 (URN); 10.1145/3308558.3314138 (DOI); 000483508403101 (); 2-s2.0-85066893934 (Scopus ID); 978-1-4503-6674-8 (ISBN)
Conference
WWW '19, The World Wide Web Conference, San Francisco, CA, USA, May 13–17, 2019
Available from: 2019-06-25 Created: 2019-06-25 Last updated: 2019-11-14. Bibliographically approved
3. Self-adaptive Privacy Concern Detection for User-generated Content
2018 (English) In: Proceedings of the 19th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2018, Cornell University Library, arXiv.org, 2018. Conference paper, Published paper (Other academic)
Abstract [en]

To protect user privacy in data analysis, a state-of-the-art strategy is differential privacy, in which scientific noise is injected into the real analysis output. The noise masks individuals' sensitive information contained in the dataset. However, determining the amount of noise is a key challenge, since too much noise destroys data utility while too little noise increases privacy risk. Although previous research has designed mechanisms to protect data privacy in different scenarios, most existing studies assume uniform privacy concerns for all individuals. Consequently, applying an equal amount of noise to all individuals leads to insufficient privacy protection for some users while over-protecting others. To address this issue, we propose a self-adaptive approach for privacy concern detection based on user personality. Our experimental studies demonstrate that the approach is effective at providing suitable personalized privacy protection for cold-start users (i.e., users whose privacy-concern information is absent from the training data).
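
The noise/utility tension described above can be sketched with the textbook Laplace mechanism applied to a counting query. This is a generic illustration, not the paper's method: the function names, the toy `ages` data, and the chosen `epsilon` values are assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a counting query under epsilon-differential privacy.

    A count changes by at most 1 when one person is added or removed,
    so its sensitivity is 1 and the Laplace scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    ages = [23, 35, 41, 29, 52, 61, 38, 47]  # hypothetical toy data
    true_count = sum(1 for a in ages if a >= 40)
    errors = [
        abs(private_count(ages, lambda a: a >= 40, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


# A smaller epsilon (stronger privacy) yields a larger expected error.
for eps in (0.1, 1.0, 10.0):
    print(f"epsilon={eps}: mean |error| = {mean_abs_error(eps):.2f}")
```

The expected absolute error equals the noise scale, 1 / epsilon. A single global epsilon therefore fixes one privacy/utility operating point for every individual, which is the limitation the personalized approach targets.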

Place, publisher, year, edition, pages
Cornell University Library, arXiv.org, 2018
Keywords
privacy-guaranteed data analysis, deep learning, multi-layer perceptron
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:umu:diva-146470 (URN)
Conference
19th International Conference on Computational Linguistics and Intelligent Text Processing, Hanoi, Vietnam, March 18-24, 2018
Project
Privacy-aware Data Federation
Available from: 2018-04-10 Created: 2018-04-10 Last updated: 2019-10-29. Bibliographically approved
4. dpUGC: Learn Differentially Private Representation for User Generated Contents
2019 (English) In: Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2019, Cornell University Library, arXiv.org, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper first proposes a simple yet efficient generalized approach for applying differential privacy to text representation (i.e., word embedding). Based on it, we propose a user-level approach to learn a personalized differentially private word embedding model on user-generated content (UGC). To the best of our knowledge, this is the first work on learning a user-level differentially private word embedding model from text for sharing. The proposed approaches protect individuals from re-identification and, in particular, provide a better trade-off between privacy and data utility on UGC data for sharing. The experimental results show that the trained embedding models are applicable to classic text analysis tasks (e.g., regression). Moreover, the proposed approaches for learning differentially private embedding models are both framework- and data-independent, which facilitates deployment and sharing. The source code is available at https://github.com/sonvx/dpText.

Place, publisher, year, edition, pages
Cornell University Library, arXiv.org, 2019
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:umu:diva-160887 (URN)
Conference
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 7-13, 2019
Available from: 2019-06-25 Created: 2019-06-25 Last updated: 2019-10-30. Bibliographically approved
5. Generic Multilayer Network Data Analysis with the Fusion of Content and Structure
2019 (English) In: Proceedings of the 20th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing), 2019, Cornell University Library, arXiv.org, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

Multi-feature data analysis (e.g., on Facebook or LinkedIn) is challenging, especially if one wants to do it efficiently while retaining the flexibility to choose features of interest for analysis. Features (e.g., age, gender, relationship, political view) can be given explicitly in datasets, but can also be derived from content (e.g., political view based on Facebook posts). Analysis from multiple perspectives is needed to understand the datasets (or subsets of them) and to infer meaningful knowledge. For example, the influence of age, location, and marital status on political views may need to be inferred separately (or in combination). In this paper, we adapt multilayer network (MLN) analysis, a nontraditional approach, to model the Facebook datasets, integrate content analysis, and conduct analysis driven by a list of desired application-based queries. Our experimental analysis shows the flexibility and efficiency of the proposed approach when modeling and analyzing datasets with multiple features.

Place, publisher, year, edition, pages
Cornell University Library, arXiv.org, 2019
Keywords
Social network analysis, Multilayer networks, Content analysis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:umu:diva-162572 (URN)
Conference
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 7-13, 2019
Available from: 2019-08-22 Created: 2019-08-22 Last updated: 2019-10-30. Bibliographically approved


Author

Vu, Xuan-Son
