Predicting User Competence from Linguistic Data
Umeå University, Faculty of Science and Technology, Department of Computing Science. (FLP)
Umeå University, Faculty of Science and Technology, Department of Computing Science.
Umeå University, Faculty of Science and Technology, Department of Computing Science. (FLP, LPCN) ORCID iD: 0000-0002-4696-9787
2017 (English). In: Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017) / [ed] Sivaji Bandyopadhyay, Jadavpur University, 2017, p. 476-484. Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the problem of predicting the competence of users of the crowd-sourcing platform Zooniverse by analyzing their chat texts. Zooniverse is an online platform where objects of different types are displayed to volunteer users for classification. Our research focuses on the Zooniverse Galaxy Zoo project, where users classify images of galaxies and discuss their classifications in text. We apply natural language processing methods to extract linguistic features, including syntactic categories, bag-of-words, and punctuation marks. We train three supervised machine-learning classifiers on the resulting dataset: k-nearest neighbors, decision trees (with gradient boosting) and naive Bayes. They are evaluated (in terms of accuracy and F-measure) on two different but related domain datasets. The performance of the classifiers varies across the feature set configurations designed during the training phase. A challenging part of this research is computing the competence of the users when no ground truth data is available. We implemented a tool that estimates the proficiency of users and annotates their text with the computed competence. Our evaluation results show that the trained classifier models perform significantly better than chance and can be deployed for other crowd-sourcing projects as well.
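The pipeline outlined in the abstract (feature extraction, a supervised classifier, evaluation by accuracy and F-measure) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the function names, the set of punctuation marks, the cosine-similarity k-nearest-neighbors rule, the "novice"/"expert" labels and the toy texts are all invented, and syntactic-category features are left out because they would need a part-of-speech tagger.

```python
import math
import re
from collections import Counter

PUNCTUATION_MARKS = "?!.,;:"  # assumed set; the abstract does not list the marks used

# Invented toy data, only to make the sketch runnable (not from Galaxy Zoo).
TOY_TRAIN = [
    ("Is this a spiral? I am not sure!!", "novice"),
    ("What is that blob? Help!", "novice"),
    ("The bar and the two arms indicate a barred spiral galaxy.", "expert"),
    ("Smooth profile, no disk; this is an elliptical galaxy.", "expert"),
]


def extract_features(text):
    """Return bag-of-words counts plus punctuation-mark counts for one text."""
    features = Counter("w=" + tok for tok in re.findall(r"[a-z']+", text.lower()))
    for mark in PUNCTUATION_MARKS:
        if mark in text:
            features["p=" + mark] = text.count(mark)
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def knn_predict(train, text, k=3):
    """Majority label among the k training texts most similar to `text`."""
    query = extract_features(text)
    ranked = sorted(train, key=lambda pair: cosine(extract_features(pair[0]), query), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f_measure(gold, predicted, positive):
    """Balanced F-measure (F1) for the label treated as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(knn_predict(TOY_TRAIN, "Is it a galaxy? Not sure!", k=1))
```

In practice the labels would come from a competence estimate such as the one produced by the tool the abstract mentions, and the classifier would be one of the three named there; the sketch only shows how the pieces fit together.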

Place, publisher, year, edition, pages
Jadavpur University, 2017. p. 476-484
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:umu:diva-146185
OAI: oai:DiVA.org:umu-146185
DiVA, id: diva2:1194456
Conference
14th International Conference on Natural Language Processing (ICON-2017)
Available from: 2018-04-03 Created: 2018-04-03 Last updated: 2025-02-07 Bibliographically approved
In thesis
2. NLP methods for improving user rating systems in crowdsourcing forums and speech recognition of less resourced languages
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
NLP-metoder för att förbättra användarbetygssystem i crowdsourcing-forum och taligenkänning av språk med mindre resurser
Abstract [en]

We develop NLP and automatic speech recognition (ASR) methods (e.g., algorithms and architectures) to address the following problems: biases induced by user rating, ranking, recommendation and search engine algorithms; computational inefficiencies of conventional syntactic-semantic parsing algorithms; the extensive linguistic resource requirements of traditional ASR methods; interoperability issues between NLP and ASR components in cross-media analysis; and the limited searchability of audio-video content on the Web.

User rating systems (URSs) in crowdsourcing forums (CSFs), such as question-answering (QA) sites, rely solely on voting schemes and fail to incorporate linguistic quality and user competence information. This potential failure undermines the trustworthiness of answers on the Web, as search engines are likely to be biased towards popular (highly voted) answers. It also degrades the quality of entire QA platforms, since other components depend on the accuracy of the underlying URSs. Conventional ASR methods, in turn, present two major challenges: acoustic models fail to work within collaborative environments, because these methods only yield models that operate in isolation, and a resource-related challenge.

The thesis makes significant contributions, published in prestigious AI, NLP and ASR venues; our 9 papers have received over 90 citations. The proposed approaches can potentially transform voting-based rating into linguistic-quality-based rating; answer quality prediction based on shallow linguistic (metadata) features into prediction based on deep syntactic-semantic and user competence features; sequential, single-machine syntactic parsing into parallel and distributed cloud-based parsing; and metadata-based querying of spoken documents into full-text querying and searching, as well as sentiment- and competence-based querying of textual content.

Theoretically, we advance the understanding of the relationship between authors' texts and their proficiency in performing certain tasks, through successive research works that seek the rules governing the conjectured link between them. We also propose new bag-of-words approaches (based on latent topic modeling, syntactic categories and dependency relations). These approaches yield significant accuracy gains over conventional TF-IDF (term frequency–inverse document frequency) based models, and reduce domain dependence, as they potentially capture structural and topical information.
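For orientation, the conventional TF-IDF baseline mentioned above can be sketched as follows. This is one common textbook variant (relative term frequency multiplied by the natural log of N over document frequency), chosen here as an assumption; the record does not say which variant the thesis compares against, and the function name and example tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight each term by relative frequency times log(N / document frequency).

    `documents` is a list of token lists; returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


if __name__ == "__main__":
    docs = [["spiral", "galaxy"], ["elliptical", "galaxy"]]
    for weights in tf_idf(docs):
        print(weights)
```

A term that occurs in every document, such as "galaxy" in the example, receives weight zero under this variant. Weights of this kind carry no structural information, which is the limitation the syntactic-category and dependency-based features are said to address.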

Place, publisher, year, edition, pages
Umeå: Umeå University, 2024. p. 55
Series
Report / UMINF, ISSN 0348-0542 ; 23.09
Keywords
NLP-algorithms, speech-recognition, transfer-learning, syntax-semantics, computational linguistic model, Amharic-NLP, cloud-NLP architecture, question-answering, crowdsourcing, user-rating, less-resourced languages
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-220298 (URN)
978-91-8070-261-4 (ISBN)
978-91-8070-260-7 (ISBN)
Public defence
2024-02-29, NAT.D.360, Umeå University, 13:00 (English)
Note

The bibliographic information for Paper III is incorrect; see the errata sheet.

Available from: 2024-02-08 Created: 2024-01-31 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

fulltext (282 kB), 41 downloads
File information
File name: FULLTEXT01.pdf
File size: 282 kB
Checksum (SHA-512):
90650631732109a58d0dc7744200a26a9e01aa88d9245e2f18a1740677ee075c70a646060aaf8407be37e5a9c1af052c0d85354bd7a106e1e995697c20f3cea2
Type: fulltext
Mimetype: application/pdf

Authority records

Woldemariam, Yonas Demeke
Bensch, Suna
Björklund, Henrik
