Pseudonymization categories across domain boundaries
Språkbanken Text, SFS, University of Gothenburg, Sweden.
CLASP, FLoV, University of Gothenburg, Sweden.
Department of Finnish, Finno-Ugrian and Scandinavian Studies, University of Helsinki, Finland.
Språkbanken Text, SFS, University of Gothenburg, Sweden.
2024 (English). In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) / [ed] Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue, ELRA Language Resource Association, 2024, p. 13303-13314. Conference paper, Published paper (Refereed)
Abstract [en]

Linguistic data, a component critical not only for research in a variety of fields but also for the development of various Natural Language Processing (NLP) applications, can contain personal information. As a result, its accessibility is limited, both from a legal and an ethical standpoint. One of the solutions is the pseudonymization of the data. Key stages of this process include the identification of sensitive elements and the generation of suitable surrogates in a way that the data is still useful for the intended task. Within this paper, we conduct an analysis of tagsets that have previously been utilized in anonymization and pseudonymization. We also investigate what kinds of Personally Identifiable Information (PII) appear in various domains. These reveal that none of the analyzed tagsets account for all of the PII types present cross-domain at the level of detailedness seemingly required for pseudonymization. We advocate for a universal system of tags for categorizing PIIs leading up to their replacement. Such categorization could facilitate the generation of grammatically, semantically, and sociolinguistically appropriate surrogates for the kinds of information that are considered sensitive in a given domain, resulting in a system that would enable dynamic pseudonymization while keeping the texts readable and useful for future research in various fields.

Place, publisher, year, edition, pages
ELRA Language Resource Association, 2024. p. 13303-13314
Series
International conference on computational linguistics, ISSN 2951-2093
Keywords [en]
anonymization, deidentification, privacy, pseudonymization, universal tagset
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:umu:diva-226956
Scopus ID: 2-s2.0-85195988143
ISBN: 978-2-493814-10-4 (print)
OAI: oai:DiVA.org:umu-226956
DiVA, id: diva2:1876720
Conference
The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024
Note

Also part of series: LREC proceedings, ISSN: 2522-2686

Available from: 2024-06-25. Created: 2024-06-25. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

fulltext (379 kB)
File information
File name: FULLTEXT01.pdf
File size: 379 kB
Checksum (SHA-512): 2bf221f38834d3b3de6bdadca09fe8a4f9095aef636fa488783185dd00c1beea3cfe60deaf63698a413969705986a60b32e62228f4c8a50156d1e4bc259eddd9
Type: fulltext
Mimetype: application/pdf

Authority records

Vu, Xuan-Son
