Developing a multilingual corpus of Wikipedia biographies
Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-4366-7863
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0003-4466-1567
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0009-0004-0580-6270
2023 (English). In: International conference. Recent advances in natural language processing 2023, large language models for natural language processing: proceedings / [ed] Ruslan Mitkov; Maria Kunilovskaya; Galia Angelova, Shoumen, Bulgaria: Incoma ltd., 2023, article id 2023.ranlp-1.32
Conference paper, Published paper (Refereed)
Abstract [en]

For many languages, Wikipedia is the most accessible source of biographical information. Studying how Wikipedia describes the lives of people can provide insights into societal biases, as well as cultural differences more generally. We present a method for extracting datasets of Wikipedia biographies. The accompanying codebase is adapted to English, Swedish, Russian, Chinese, and Farsi, and is extendable to other languages. We present an exploratory analysis of biographical topics and gendered patterns in four languages using topic modelling and embedding clustering. We find similarities across languages in the types of categories present, with the distribution of biographies concentrated in the language’s core regions. Masculine terms are over-represented and spread out over a wide variety of topics. Feminine terms are less frequent and linked to more constrained topics. Non-binary terms are nearly non-represented.

Place, publisher, year, edition, pages
Shoumen, Bulgaria: Incoma ltd., 2023. Article id 2023.ranlp-1.32
Series
International conference Recent advances in natural language processing, ISSN 2603-2813 ; 2023
National Category
Language Technology (Computational Linguistics)
Research subject
computational linguistics
Identifiers
URN: urn:nbn:se:umu:diva-213781
DOI: 10.26615/978-954-452-092-2_032
Scopus ID: 2-s2.0-85179178058
ISBN: 978-954-452-092-2 (electronic)
OAI: oai:DiVA.org:umu-213781
DiVA, id: diva2:1811174
Conference
14th international conference on Recent Advances in Natural Language Processing 2023, Varna, Bulgaria, September 4-6, 2023.
Available from: 2023-11-10 Created: 2023-11-10 Last updated: 2023-12-22 Bibliographically approved

Open Access in DiVA

fulltext (1397 kB), 37 downloads
File information
File name: FULLTEXT01.pdf
File size: 1397 kB
Checksum (SHA-512): 976cf7064f67fc972a92ac2f7a2bbd81ecefd1d1cbc424bc557accd079549ef69f04380a791bc646035a6073e1d03836344358eeb3f5b60b0530c6b09d97adb6
Type: fulltext
Mimetype: application/pdf

Authority records

Devinney, Hannah; Eklund, Anton; Ryazanov, Igor; Cai, Jingwen
