umu.se Publications
  • 1.
    Andersson, Eric
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Björklund, Johanna
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Jonsson, Anna
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Generating semantic graph corpora with graph expansion grammar. 2023. In: 13th International Workshop on Non-Classical Models of Automata and Applications (NCMA 2023) / [ed] Nagy B., Freund R., Open Publishing Association, 2023, Vol. 388, p. 3-15. Conference paper (Refereed)
    Abstract [en]

    We introduce LOVELACE, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as a representational language, thus allowing users to craft a grammar that describes a corpus with desired properties. When given such a grammar as input, the system generates a set of output graphs that are well-formed according to the grammar, i.e., a graph bank. The generation process can be controlled via a number of configurable parameters that allow the user to, for example, specify a range of desired output graph sizes. Central use cases are the creation of synthetic data to augment existing corpora, and use as a pedagogical tool for teaching formal language theory.

  • 2.
    Bensch, Suna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Hellström, Thomas
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Grammatical Inference of Graph Transformation Rules. 2015. In: Proceedings of the 7th Workshop on Non-Classical Models of Automata and Applications (NCMA 2015), Austrian Computer Society, 2015, p. 73-90. Conference paper (Refereed)
  • 3.
    Berglund, Martin
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Björklund, Henrik
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    On the Parameterized Complexity of Linear Context-Free Rewriting Systems. 2013. In: Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), Association for Computational Linguistics, 2013, p. 21-29. Conference paper (Other academic)
    Abstract [en]

    We study the complexity of uniform membership for Linear Context-Free Rewriting Systems, i.e., the problem where we are given a string w and a grammar G and are asked whether w ∈ L(G). In particular, we use parameterized complexity theory to investigate how the complexity depends on various parameters. While we focus primarily on rank and fan-out, derivation length is also considered.

  • 4.
    Björklund, Henrik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Devinney, Hannah
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Computer, enhence: POS-tagging improvements for nonbinary pronoun use in Swedish. 2023. In: Proceedings of the third workshop on language technology for equality, diversity, inclusion, The Association for Computational Linguistics, 2023, p. 54-61. Conference paper (Refereed)
    Abstract [en]

    Part of Speech (POS) taggers for Swedish routinely fail for the third person gender-neutral pronoun hen, despite the fact that it has been a well-established part of the Swedish language since at least 2014. In addition to simply being a form of gender bias, this failure can have negative effects on other tasks relying on POS information. We demonstrate the usefulness of semi-synthetic augmented datasets in a case study, retraining a POS tagger to correctly recognize hen as a personal pronoun. We evaluate our retrained models both for tag accuracy and on a downstream task (dependency parsing) in a classical NLP pipeline.

    Our results show that adding such data works to correct for the disparity in performance. The accuracy rate for identifying hen as a pronoun can be brought up to acceptable levels with only minor adjustments to the tagger’s vocabulary files. Performance parity to gendered pronouns can be reached after retraining with only a few hundred examples. This increase in POS tag accuracy also results in improvements for dependency parsing sentences containing hen.

  • 5.
    Björklund, Henrik
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Devinney, Hannah
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Improving Swedish part-of-speech tagging for hen. 2022. Conference paper (Refereed)
    Abstract [en]

    Despite the fact that the gender-neutral pronoun hen was officially added to the Swedish language in 2014, state-of-the-art part-of-speech taggers still routinely fail to identify it as a pronoun. We retrain both efselab and spaCy models with augmented (semi-synthetic) data, where instances of gendered pronouns are replaced by hen to correct for the lack of representation in the original training data. Our results show that adding such data works to correct for the disparity in performance.

  • 6.
    Björklund, Johanna
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Tree-to-Graph Transductions with Scope. 2018. In: Developments in Language Theory. DLT 2018, Springer, 2018, p. 133-144. Conference paper (Refereed)
    Abstract [en]

    High-level natural language processing requires formal languages to represent semantic information. A recent addition of this kind is abstract meaning representations. These are graphs in which nodes encode concepts and edges relations. Node-sharing is common, and cycles occur. We show that the required structures can be generated through the combination of (i) a regular tree grammar, (ii) a sequence of linear top-down tree transducers, and (iii) a fold operator that merges selected nodes. Delimiting the application of the fold operator to connected subgraphs gains expressive power, while keeping the complexity of the associated membership problem in polynomial time.

  • 7.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Cleophas, Loek
    Stellenbosch University, Republic of South Africa.
    Karlsson, My
    Codemill.
    An evaluation of structured language modeling for automatic speech recognition. 2017. In: Journal of universal computer science (Online), ISSN 0948-695X, E-ISSN 0948-6968, Vol. 23, no 11, p. 1019-1034. Article in journal (Refereed)
    Abstract [en]

    We evaluated probabilistic lexicalized tree-insertion grammars (PLTIGs) on a classification task relevant for automatic speech recognition. The baseline is a family of n-gram models tuned with Witten-Bell smoothing. The language models are trained on unannotated corpora, consisting of 10,000 to 50,000 sentences collected from the English section of Wikipedia. For the evaluation, an additional 150 random sentences were selected from the same source, and for each of these, approximately 3,200 variations were generated. Each variant sentence was obtained by replacing an arbitrary word by a similar word, chosen to be at most 2 character edits from the original. The evaluation task consisted of identifying the original sentence among the automatically constructed (and typically inferior) alternatives. In the experiments, the n-gram models outperformed the PLTIG model on the smaller data set, but as the size of data grew, the PLTIG model gave comparable results. While PLTIGs are more demanding to train, they have the advantage that they assign a parse structure to their input sentences. This is valuable for continued algorithmic processing, for example, for summarization or sentiment analysis.

  • 8.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Cohen, Shay B.
    University of Edinburgh.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Satta, Giorgio
    University of Padova.
    Bottom-up unranked tree-to-graph transducers for translation into semantic graphs. 2019. In: Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing / [ed] Heiko Vogler; Andreas Maletti, Association for Computational Linguistics, 2019, p. 7-17, article id W19-3104. Conference paper (Refereed)
    Abstract [en]

    We propose a formal model for translating unranked syntactic trees, such as dependency trees, into semantic graphs. These tree-to-graph transducers can serve as a formal basis of transition systems for semantic parsing which recently have been shown to perform very well, yet hitherto lack formalization. Our model features "extended" rules and an arc-factored normal form, comes with an efficient translation algorithm, and can be equipped with weights in a straightforward manner.

  • 9.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Jonsson, Anna
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Generation and polynomial parsing of graph languages with non-structural reentrancies. 2023. In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 49, no 4, p. 841-882. Article in journal (Refereed)
    Abstract [en]

    Graph-based semantic representations are popular in natural language processing (NLP), where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same time allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but which context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.

  • 10.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Mollevik, Iris
    Towards Semantic Representations with a Temporal Dimension. 2020. Conference paper (Refereed)
    Abstract [en]

    We outline the initial ideas for a representational framework for capturing temporal aspects in semantic parsing of multimodal data. As a starting point, we take the Abstract Meaning Representations of Banarescu et al. and propose a way of extending them to cover sequential progressions of events. The first modality to be considered is text, but the long-term goal is to also incorporate information from visual and audio modalities, as well as contextual information.

  • 11.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Fernau, Henning
    FB IV - Abteilung Informatikwissenschaften, Universität Trier, Trier, Germany.
    Learning tree languages. 2016. In: Topics in grammatical inference / [ed] Jeffrey Heinz; José M. Sempere, Springer Berlin/Heidelberg, 2016, p. 173-214. Chapter in book (Refereed)
    Abstract [en]

    Tree languages have proved to be a versatile and rewarding extension of the classical notion of string languages. Many nice applications have been established over the years, in areas such as Natural Language Processing, Information Extraction, and Computational Biology. Although some properties of string languages transfer easily to the tree case, in particular for regular languages, several computational aspects turn out to be harder. It is therefore both of theoretical and of practical interest to investigate how far and in what ways Grammatical Inference algorithms developed for the string case are applicable to trees. This chapter surveys known results in this direction. We begin by recalling the basics of tree language theory. Then, the most popular learning scenarios and algorithms are presented. Several applications of Grammatical Inference of tree languages are reviewed in some detail. We conclude by suggesting a number of directions for future research.

  • 12.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Johansson Falck, Marlene
    Umeå University, Faculty of Arts, Department of language studies.
    How Spatial Relations Structure Linguistic Meaning. 2019. In: Proceedings of the 15th SweCog Conference / [ed] Holm, Linus & Erik Billing, Skövde: University of Skövde, 2019, p. 29-31. Conference paper (Refereed)
  • 13.
    Björklund, Johanna
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Zechner, Niklas
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Syntactic methods for topic-independent authorship attribution. 2017. In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, Vol. 23, no 5, p. 789-806. Article in journal (Refereed)
    Abstract [en]

    The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are 'deep' in the sense that they are derived by parsing the subject texts, in contrast to 'shallow' syntactic features for which a part-of-speech analysis is enough. The experiments are made on two corpora of online texts and one corpus of novels written around the year 1900. The classification tasks include classical closed-world authorship attribution, identification of separate texts among the works of one author, and cross-topic authorship attribution. In the first tasks, the feature sets were fairly evenly matched, but for the last task, the syntax-based feature set outperformed the baseline feature set. These results suggest that, compared to lexical features, syntactic features are more robust to changes in topic.

  • 14.
    Brand, Dirk
    et al.
    Computer Science Division, Stellenbosch University, South Africa.
    Kroon, Steve
    Computer Science Division, Stellenbosch University, South Africa.
    Van Der Merwe, Brink
    Computer Science Division, Stellenbosch University, South Africa.
    Cleophas, Loek
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Dept. of Information Science, Stellenbosch University, Stellenbosch, South Africa.
    N-Gram Representations for Comment Filtering. 2015. In: SAICSIT '15: Proceedings of the 2015 Annual Research Conference on South African Institute of Computer Scientists and Information Technologists, ACM Digital Library, 2015, article id 6. Conference paper (Refereed)
    Abstract [en]

    Accurate classifiers for short texts are valuable assets in many applications. Especially in online communities, where users contribute to content in the form of posts and comments, an effective way of automatically categorising posts proves highly valuable. This paper investigates the use of N-grams as features for short text classification, and compares it to manual feature design techniques that have been popular in this domain. We find that the N-gram representations greatly outperform manual feature extraction techniques.

  • 15.
    Chen, Hung Chiao
    et al.
    Umeå University, Faculty of Social Sciences, Department of Psychology.
    Weck, Saskia
    Umeå University, Faculty of Social Sciences, Department of Psychology.
    Understanding Robots: The Effects of Conversational Strategies on the Understandability of Robot-Robot Interactions from a Human Standpoint. 2020. Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis
    Abstract [en]

    As the technology develops and robots are integrating into more and more facets of our lives, the future of human-robot interaction may take form in all kinds of arrangements and configurations. In this study, we examined the understandability of different conversational strategies in robot-robot communication from a human-bystander standpoint. Specifically, we examined the understandability of verbal explanations constructed under Grice's maxims of informativeness. A prediction task was employed to test the understandability of the proposed strategy among other strategies. Furthermore, participants' perception of the robots' interaction was assessed with a range of ratings and rankings. The results suggest that those robots using the proposed strategy and those using the other tested strategies were understood and perceived similarly.

  • 16.
    Chiang, David
    et al.
    University of Notre Dame.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gildea, Daniel
    University of Rochester.
    Lopez, Adam
    University of Edinburgh.
    Satta, Giorgio
    University of Padua.
    Weighted DAG automata for semantic graphs. 2018. In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 44, no 1, p. 119-186. Article in journal (Refereed)
    Abstract [en]

    Graphs have a variety of uses in natural language processing, particularly as representations of linguistic meaning. A deficit in this area of research is a formal framework for creating, combining, and using models involving graphs that parallels the frameworks of finite automata for strings and finite tree automata for trees. A possible starting point for such a framework is the formalism of directed acyclic graph (DAG) automata, defined by Kamimura and Slutzki and extended by Quernheim and Knight. In this article, we study the latter in depth, demonstrating several new results, including a practical recognition algorithm that can be used for inference and learning with models defined on DAG automata. We also propose an extension to graphs with unbounded node degree and show that our results carry over to the extended formalism.

  • 17.
    Coelho Mollo, Dimitri
    et al.
    Millière, Raphael
    Rathkopf, Charles
    Stinson, Catherine
    Conceptual Combinations - Benchmark Task for BIG-Bench. 2021. Other (Refereed)
    Abstract [en]

    This is a task accepted in July 2021 as part of Google’s “Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models”. It is published at https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/conceptual_combinations. Links to the collection of queries are below, followed by the ReadMe file that explains the task, its justification, and its performance with existing AI language models.

  • 18.
    Deutschmann, Mats
    et al.
    Umeå University, Faculty of Arts, Department of language studies.
    Molka-Danielsen, Judith
    Molde University, Norway.
    Future Directions for Learning in Virtual Worlds. 2009. In: Learning and Teaching in the Virtual World of Second Life / [ed] Molka-Danielsen, J & M. Deutschmann, Trondheim: Tapir Academic Press, 2009, 1, p. 185-190. Chapter in book (Refereed)
    Abstract [en]

    Some may claim that this book has been a showcase of case studies, without a common thread. However, the common goal that runs through each of these cases is the focus on learning and the roles of learners and educators in learning activities. Do virtual worlds assist learning and do they create new opportunities? The answer from these analyses is “Yes” and this book demonstrates “how” to make use of the affordances of the virtual world of Second Life as it exists today. Yet, many questions remain both for practitioners and researchers. To give some examples: On what principles should learners’ tasks be designed, who is doing research on education in virtual worlds, and what is the future of virtual worlds in a learning context? In this chapter we attempt to address some of these issues.

  • 19.
    Deutschmann, Mats
    et al.
    Umeå University, Faculty of Arts, Department of language studies.
    Panichi, Luisa
    Pisa University, Italy.
    Instructional Design: Teacher Practice and Learning Autonomy. 2009. In: Learning and Teaching in the Virtual World of Second Life / [ed] Judith Molka-Danielsen & Mats Deutschmann, Trondheim: Tapir Academic Press, 2009, 1, p. 24-44. Chapter in book (Refereed)
    Abstract [en]

    This chapter is based on the experiences from language proficiency courses given on Kamimo education island and addresses concerns related to teacher practice in Second Life. We examine preparatory issues, task design and the teacher’s role in fostering learner autonomy in Second Life. Although the chapter draws mainly on experiences from and reflections in the domain of language education, it has general pedagogical implications for teaching in SL.

  • 20.
    Deutschmann, Mats
    et al.
    Umeå University, Faculty of Arts, Department of language studies.
    Panichi, Luisa
    Pisa University, Italy.
    Talking into empty space?: signalling involvement in a virtual language classroom in Second Life. 2009. In: Language Awareness, ISSN 0965-8416, Vol. 18, no 3-4, p. 310-328. Article in journal (Refereed)
    Abstract [en]

    In this study, we compare the first and the last sessions from an online oral proficiency course aimed at doctoral students conducted in the virtual world Second Life. The study attempts to identify how supportive moves made by the teacher encourage learners to engage with language, and what type of linguistic behaviour in the learners leads to engagement in others. We compare overall differences in terms of floor space and turn-taking patterns, and also conduct a more in-depth discourse analysis of parts of the sessions focusing on supportive moves such as back-channelling and elicitors. There are indications that the supportive linguistic behaviour of teachers is important in increasing learner engagement. In our study we are also able to observe a change in student linguistic behaviour between the first and the last sessions with students becoming more active in signalling involvement as the course progresses. Finally, by illustrating some of the language awareness issues that arise in online environments, we hope to contribute to the understanding of the dynamics of online communication.

  • 21.
    Devinney, Hannah
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Gender and representation: investigations of bias in natural language processing. 2024. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Natural Language Processing (NLP) technologies are a part of our everyday realities. They come in forms we can easily see as ‘language technologies’ (auto-correct, translation services, search results) as well as those that fly under our radar (social media algorithms, 'suggested reading' recommendations on news sites, spam filters). NLP fuels many other tools under the Artificial Intelligence umbrella – such as algorithms approving loan applications – which can have major material effects on our lives. As large language models like ChatGPT have become popularized, we are also increasingly exposed to machine-generated texts.

    Machine Learning (ML) methods, which most modern NLP tools rely on, replicate patterns in their training data. Typically, these language data are generated by humans, and contain both overt and underlying patterns that we consider socially undesirable, comprising stereotypes and other reflections of human prejudice. Such patterns (often termed 'bias') are picked up and repeated, or even made more extreme, by ML systems. Thus, NLP technologies become a part of the linguistic landscapes in which we humans transmit stereotypes and act on our prejudices. They may participate in this transmission by, for example, translating nurses as women (and doctors as men) or systematically preferring to suggest promoting men over women. These technologies are tools in the construction of power asymmetries not only through the reinforcement of hegemony, but also through the distribution of material resources when they are included in decision-making processes such as screening job applications.

    This thesis explores gendered biases, trans and nonbinary inclusion, and queer representation within NLP through a feminist and intersectional lens. Three key areas are investigated: the ways in which “gender” is theorized and operationalized by researchers investigating gender bias in NLP; gendered associations within datasets used for training language technologies; and the representation of queer (particularly trans and nonbinary) identities in the output of both low-level NLP models and large language models (LLMs). 

    The findings indicate that nonbinary people/genders are erased by both bias in NLP tools/datasets, and by research/ers attempting to address gender biases. Men and women are also held to cisheteronormative standards (and stereotypes), which is particularly problematic when considering the intersection of gender and sexuality. Although it is possible to mitigate some of these issues in particular circumstances, such as addressing erasure by adding more examples of nonbinary language to training data, the complex nature of the socio-technical landscape which NLP technologies are a part of means that simple fixes may not always be sufficient. Additionally, it is important that ways of measuring and mitigating 'bias' remain flexible, as our understandings of social categories, stereotypes and other undesirable norms, and 'bias' itself will shift across contexts such as time and linguistic setting. 

  • 22.
    Devinney, Hannah
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Björklund, Jenny
    Uppsala University.
    Björklund, Henrik
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Crime and Relationship: Exploring Gender Bias in NLP Corpora. 2020. Conference paper (Refereed)
    Abstract [en]

    Gender bias in natural language processing (NLP) tools, deriving from implicit human bias embedded in language data, is an important and complicated problem on the road to fair algorithms. We leverage topic modeling to retrieve documents associated with particular gendered categories, and discuss how exploring these documents can inform our understanding of the corpora we may use to train NLP tools. This is a starting point for challenging the systemic power structures and producing a justice-focused approach to NLP.

  • 23.
    Devinney, Hannah
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Björklund, Jenny
    Centre for Gender Research, Uppsala University.
    Björklund, Henrik
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Semi-Supervised Topic Modeling for Gender Bias Discovery in English and Swedish. 2020. In: Proceedings of the Second Workshop on Gender Bias in Natural Language Processing / [ed] Marta R. Costa-jussà, Christian Hardmeier, Will Radford, Kellie Webster, Association for Computational Linguistics, 2020, p. 79-92. Conference paper (Refereed)
    Abstract [en]

    Gender bias has been identified in many models for Natural Language Processing, stemming from implicit biases in the text corpora used to train the models. Such corpora are too large to closely analyze for biased or stereotypical content. Thus, we argue for a combination of quantitative and qualitative methods, where the quantitative part produces a view of the data of a size suitable for qualitative analysis. We investigate the usefulness of semi-supervised topic modeling for the detection and analysis of gender bias in three corpora (mainstream news articles in English and Swedish, and LGBTQ+ web content in English). We compare differences in topic models for three gender categories (masculine, feminine, and nonbinary or neutral) in each corpus. We find that in all corpora, genders are treated differently and that these differences tend to correspond to hegemonic ideas of gender.

  • 24.
    Devinney, Hannah
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Björklund, Jenny
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Uppsala University, Sweden.
    Björklund, Henrik
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Theories of Gender in Natural Language Processing2022In: Proceedings of the fifth annual ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT'22), 2022Conference paper (Refereed)
    Abstract [en]

    The rise of concern around Natural Language Processing (NLP) technologies containing and perpetuating social biases has led to a rich and rapidly growing area of research. Gender bias is one of the central biases being analyzed, but to date there is no comprehensive analysis of how “gender” is theorized in the field. We survey nearly 200 articles concerning gender bias in NLP to discover how the field conceptualizes gender both explicitly (e.g. through definitions of terms) and implicitly (e.g. through how gender is operationalized in practice). In order to get a better idea of emerging trajectories of thought, we split these articles into two sections by time.

    We find that the majority of the articles do not make their theorization of gender explicit, even if they clearly define “bias.” Almost none use a model of gender that is intersectional or inclusive of nonbinary genders; and many conflate sex characteristics, social gender, and linguistic gender in ways that disregard the existence and experience of trans, nonbinary, and intersex people. There is an increase between the two time-sections in statements acknowledging that gender is a complicated reality; however, very few articles manage to put this acknowledgment into practice. In addition to analyzing these findings, we provide specific recommendations to facilitate interdisciplinary work, and to incorporate theory and methodology from Gender Studies. Our hope is that this will produce more inclusive gender bias research in NLP.

  • 25.
    Devinney, Hannah
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Eklund, Anton
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Ryazanov, Igor
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Cai, Jingwen
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Developing a multilingual corpus of wikipedia biographies2023In: International conference. Recent advances in natural language processing 2023, large language models for natural language processing: proceedings / [ed] Ruslan Mitkov; Maria Kunilovskaya; Galia Angelova, Shoumen, Bulgaria: Incoma ltd. , 2023, article id 2023.ranlp-1.32Conference paper (Refereed)
    Abstract [en]

    For many languages, Wikipedia is the most accessible source of biographical information. Studying how Wikipedia describes the lives of people can provide insights into societal biases, as well as cultural differences more generally. We present a method for extracting datasets of Wikipedia biographies. The accompanying codebase is adapted to English, Swedish, Russian, Chinese, and Farsi, and is extendable to other languages. We present an exploratory analysis of biographical topics and gendered patterns in four languages using topic modelling and embedding clustering. We find similarities across languages in the types of categories present, with the distribution of biographies concentrated in the language’s core regions. Masculine terms are over-represented and spread out over a wide variety of topics. Feminine terms are less frequent and linked to more constrained topics. Non-binary terms are nearly non-represented.

  • 26.
    Drewes, Frank
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Gebhardt, Kilian
    Technische Universität Dresden.
    Vogler, Heiko
    Technische Universität Dresden.
    EM-training for probabilistic aligned hypergraph bimorphisms2016In: Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata, Association for Computational Linguistics , 2016, p. 60-69Conference paper (Refereed)
    Abstract [en]

    We define the concept of probabilistic aligned hypergraph bimorphism. Each such bimorphism consists of a probabilistic regular tree grammar, two hypergraph algebras in which the generated trees are interpreted, and a family of alignments between the two interpretations. It generates a set of bihypergraphs each consisting of two hypergraphs and an alignment between them; for instance, discontinuous phrase structures and non-projective dependency structures are bihypergraphs. We show an EM-training algorithm which takes a corpus of bihypergraphs and an aligned hypergraph bimorphism as input and calculates a probability assignment to the rules of the regular tree grammar such that in the limit the maximum-likelihood of the corpus is approximated.

  • 27.
    Drewes, Frank
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Knight, Kevin
    University of Southern California.
    Kuhlmann, Marco
    Linköpings universitet.
    Formal Models of Graph Transformation in Natural Language Processing2015Report (Other academic)
    Abstract [en]

    In natural language processing (NLP) there is an increasing interest in formal models for processing graphs rather than more restricted structures such as strings or trees. Such models of graph transformation have previously been studied and applied in various other areas of computer science, including formal language theory, term rewriting, theory and implementation of programming languages, concurrent processes, and software engineering. However, few researchers from NLP are familiar with this work, and at the same time, few researchers from the theory of graph transformation are aware of the specific desiderata, possibilities and challenges that one faces when applying the theory of graph transformation to NLP problems. The Dagstuhl Seminar 15122 “Formal Models of Graph Transformation in Natural Language Processing” brought researchers from the two areas together. It initiated an interdisciplinary exchange about existing work, open problems, and interesting applications.

  • 28.
    Drewes, Frank
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Prorok, Kalle
    Umeå University, Faculty of Science and Technology, Department of Applied Physics and Electronics.
    AI för dokumentgenerering2021Report (Other academic)
    Abstract [sv]

    This report gives a brief introduction to the methods and some practical results of an AI text-processing project carried out jointly by Trafikverket, Umeå University, and Sweco. Tests were made to extract information (geographical locations, summaries, questions and answers) from documents. Document generation, the project's original focus, was also addressed. The goal there was to create texts automatically for selected purposes, which proved difficult at present because the existing methods are limited and at the same time very demanding in computing power. The report comes with some simplified code examples that readers can run themselves and hopefully learn from across a few different cases. The report is divided into four parts: a non-technical overview for the generally interested, a more detailed description of the results, a part on concepts and methods for the specially interested, and a part on implementation for programmers.

  • 29.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Forsman, Mona
    Adlede AB.
    Topic modeling by clustering language model embeddings: human validation on an industry dataset2022Conference paper (Refereed)
    Abstract [en]

    Topic models are powerful tools to get an overview of large collections of text data, a situation that is prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.

  • 30.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede AB, Umeå, Sweden.
    Forsman, Mona
    Adlede AB, Umeå, Sweden.
    Topic modeling by clustering language model embeddings: human validation on an industry dataset2022In: EMNLP 2022 Industry Track: Proceedings of the conference, Association for Computational Linguistics (ACL) , 2022, p. 645-653Conference paper (Refereed)
    Abstract [en]

    Topic models are powerful tools to get an overview of large collections of text data, a situation that is prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.

  • 31.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede, Umeå, Sweden.
    Forsman, Mona
    Adlede, Umeå, Sweden.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    An empirical configuration study of a common document clustering pipeline2023In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no 1Article in journal (Refereed)
    Abstract [en]

    Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.

  • 32.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede AB, Umeå, Sweden.
    Forsman, Mona
    Adlede AB, Umeå, Sweden.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Dynamic topic modeling by clustering embeddings from pretrained language models: a research proposal2022In: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop / [ed] Yan Hanqi; Yang Zonghan; Sebastian Ruder; Wan Xiaojun, Association for Computational Linguistics , 2022, p. 84-91Conference paper (Refereed)
    Abstract [en]

    A new trend in topic modeling research is to do Neural Topic Modeling by Clustering document Embeddings (NTM-CE) created with a pretrained language model. Studies have evaluated static NTM-CE models and found them performing comparably to, or even better than, other topic models. An important extension of static topic modeling is making the models dynamic, allowing the study of topic evolution over time, as well as detecting emerging and disappearing topics. In this research proposal, we present two research questions to understand dynamic topic modeling with NTM-CE theoretically and practically. To answer these, we propose four phases with the aim of establishing evaluation methods for dynamic topic modeling, finding NTM-CE-specific properties, and creating a framework for dynamic NTM-CE. For evaluation, we propose to use both quantitative measurements of coherence and human evaluation supported by our recently developed tool.

  • 33.
    Eriksson, Erik J.
    et al.
    Umeå University, Faculty of Arts, Philosophy and Linguistics.
    Rodman, Robert D.
    Dept. of Computer Science, NCSU, USA.
    Hubal, Robert C.
    Technology Assisted Learning Ctr., RTI International, USA.
    Emotions in speech: juristic implications2007In: Speaker Classification: Volume I, Berlin: Springer Verlag , 2007Chapter in book (Other academic)
    Abstract [en]

    This chapter focuses on the detection of emotion in speech and the impact that using technology to automate emotion detection would have within the legal system. The current states of the art for studies of perception and acoustics are described, and a number of implications for legal contexts are provided. We discuss, inter alia, assessment of emotion in others, witness credibility, forensic investigation, and training of law enforcement officers.

  • 34.
    Farahani, Mehrdad
    et al.
    Department of Computer Engineering, Islamic Azad University North Tehran Branch, Tehran, Iran.
    Gharachorloo, Mohammad
    Queensland University of Technology, School of Electrical Engineering and Robotics, Brisbane, Australia.
    Farahani, Marzieh
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Manthouri, Mohammad
    Department of Electrical and Electronic Engineering, Shahed University, Tehran, Iran.
    ParsBERT: Transformer-based Model for Persian Language Understanding2021In: Neural Processing Letters, ISSN 1370-4621, E-ISSN 1573-773X, Vol. 53, no 6, p. 3831-3847Article in journal (Refereed)
    Abstract [en]

    The surge of pre-trained language models has begun a new era in the field of Natural Language Processing (NLP) by allowing us to build powerful language models. Among these models, Transformer-based models such as BERT have become increasingly popular due to their state-of-the-art performance. However, these models are usually focused on English, leaving other languages to multilingual models with limited resources. This paper proposes a monolingual BERT for the Persian language (ParsBERT), which shows its state-of-the-art performance compared to other architectures and multilingual models. Also, since the amount of data available for NLP tasks in Persian is very restricted, a massive dataset for different NLP tasks as well as pre-training the model is composed. ParsBERT obtains higher scores in all datasets, including existing ones and gathered ones, and improves the state-of-the-art performance by outperforming both multilingual BERT and other prior works in Sentiment Analysis, Text Classification, and Named Entity Recognition tasks.

  • 35.
    Granberg, Johan
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Minock, Michael
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    A natural language interface over the MusicBrainz database2011In: Proceedings of the 1st workshop on Question Answering over Linked Data (QALD-1) / [ed] Christina Unger, Philipp Cimiano, Vanessa Lopez, Enrico Motta, 2011, p. 38-43Conference paper (Refereed)
    Abstract [en]

    This paper demonstrates a way to build a natural language interface (NLI) over semantically rich data. Specifically, we show this over the MusicBrainz domain, inspired by the second shared task of the QALD-1 workshop. Our approach uses the tool C-Phrase [4] to build an NLI over a set of views defined over the original MusicBrainz relational database. C-Phrase uses a limited variant of X-Bar theory [3] for syntax and tuple calculus for semantics. The C-Phrase authoring tool works over any domain and only the end configuration has to be redone for each new database covered – a task that does not require deep knowledge about linguistics and system internals. Working over the MusicBrainz domain was a challenge due to the size of the database – quite a lot of effort went into optimizing computation times and memory usage to manageable levels. This paper reports on this work and anticipates a live demonstration for querying by the public.

  • 36.
    Hansson, Britt
    Umeå University, Faculty of Social Sciences, Education.
    Större chans att klara det?: En specialpedagogisk studie av 10 ungdomars syn på hur datorstöd har påverkat deras språk, lärande och skolsituation.2008Independent thesis Basic level (professional degree), 10 credits / 15 HE creditsStudent thesis
    Abstract [sv]

    In this study, 10 young people were interviewed about their experiences of using a computer with speech synthesis and recorded books. They were asked in which situations the tools had been useful, or had been experienced as inhibiting, in their learning and school situation. Because of major difficulties at school, the young people had been lent a laptop by the school, which they used both at home and at school. Together with parents and teachers, they received guidance at the municipality's Skoldatatek. The starting point of the study, from a sociocultural perspective, is that language develops when it is used. Schools are to offer an up-to-date education, and pupils with difficulties at school have the right to support. How this support should be designed can create a dilemma for the individual school. Support aimed directly at the individual can be perceived as treating school difficulties as a problem carried by the pupil, which must not occur in “a school for all”. Given this dilemma, it was important to investigate the young people's experiences of support, development, and obstacles, in order to understand whether these cause singling out and exclusion. The results showed that the young people felt more motivated with their computer tools, which compensated for their difficulties and appealed to their different learning styles. The young people said they had become more confident writers and readers thanks to increased language use. Their accounts also show the necessity of support from teachers and parents. The results indicate that alternative tools in learning could contribute to greater goal attainment in a school for all, with pedagogical diversity.

  • 37.
    Hatefi, Arezoo
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Document Clustering Using Attentive Hierarchical Document Representation2020Conference paper (Refereed)
    Abstract [en]

    We propose a text clustering algorithm that applies an attention mechanism on both word and sentence level. This ongoing work is motivated by an application in contextual programmatic advertising, where the goal is to group online articles into clusters corresponding to a given set of marketing objectives. The main contribution is the use of attention to identify words and sentences that are of specific importance for the formation of the clusters.

  • 38.
    Hatefi, Arezoo
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Vu, Xuan-Son
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Bhuyan, Monowar
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    The efficiency of pre-training with objective masking in pseudo labeling for semi-supervised text classificationManuscript (preprint) (Other academic)
  • 39.
    Hatefi, Arezoo
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Vu, Xuan-Son
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Bhuyan, Monowar H.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Cformer: Semi-Supervised Text Clustering Based on Pseudo Labeling2021In: CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, ACM Digital Library, 2021, p. 3078-3082Conference paper (Refereed)
    Abstract [en]

    We propose a semi-supervised learning method called Cformer for automatic clustering of text documents in cases where clusters are described by a small number of labeled examples, while the majority of training examples are unlabeled. We motivate this setting with an application in contextual programmatic advertising, a type of content placement on news pages that does not exploit personal information about visitors but relies on the availability of a high-quality clustering computed on the basis of a small number of labeled samples.

    To enable text clustering with little training data, Cformer leverages the teacher-student architecture of Meta Pseudo Labels. In addition to unlabeled data, Cformer uses a small amount of labeled data to describe the clusters aimed at. Our experimental results confirm that the performance of the proposed model improves the state-of-the-art if a reasonable amount of labeled data is available. The models are comparatively small and suitable for deployment in constrained environments with limited computing resources. The source code is available at https://github.com/Aha6988/Cformer.

  • 40.
    Hellsten, Simon
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Incremental Re-tokenization in BPE-trained SentencePiece Models2024Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
    Abstract [en]

    This bachelor's thesis in Computer Science explores the efficiency of an incremental re-tokenization algorithm in the context of BPE-trained SentencePiece models used in natural language processing. The thesis begins by underscoring the critical role of tokenization in NLP, particularly highlighting the complexities introduced by modifications in tokenized text. It then presents an incremental re-tokenization algorithm, detailing its development and evaluating its performance against a full text re-tokenization. Experimental results demonstrate that this incremental approach is more time-efficient than full re-tokenization, especially evident in large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to text areas around modifications. The research concludes by suggesting that incremental re-tokenization could significantly enhance the responsiveness and resource efficiency of text-based applications, such as chatbots and virtual assistants. Future work may focus on predictive models to anticipate the impact of text changes on token stability and optimizing the algorithm for different text contexts.

  • 41.
    Hendrick, Stephanie
    Umeå University, Faculty of Arts, Humlab. Umeå University, Faculty of Arts, Modern Languages. Engelska.
    Following Conversational Traces: Part 1: Creating a corpus with the ICWSM dataset.2007Conference paper (Refereed)
    Abstract [en]

    This poster will present the methodology behind the creation of a linguistic corpus based on a subset of the 2007 International Conference on Weblogs and Social Media dataset. Posts from a small group of political bloggers were tagged for parts of speech and indexed into a corpus using the program Xairia. From this corpus, the political blogger subset will be investigated for register and referential information. Referential information, especially with regard to new and given information, will be compared against network placement both to identify network innovators as well as to compare network placement as a catalyst for innovation. The final section, Further Research, will outline the modifications necessary for the creation of a full-scale corpus based on the entire ICWSM 2006 dataset.

  • 42.
    Jarlbrink, Johan
    et al.
    Umeå University, Faculty of Arts, Department of culture and media studies.
    Snickars, Pelle
    Umeå University, Faculty of Arts, Department of culture and media studies.
    Cultural heritage as digital noise: nineteenth century newspapers in the digital archive2017In: Journal of Documentation, ISSN 0022-0418, E-ISSN 1758-7379, Vol. 73, no 6, p. 1228-1243Article in journal (Refereed)
    Abstract [en]

    Purpose

    The purpose of this paper is to explore and analyze the digitized newspaper collection at the National Library of Sweden, focusing on cultural heritage as digital noise. In what specific ways are newspapers transformed in the digitization process? If the digitized document is not the same as the source document – is it still a historical record, or is it transformed into something else?

    Design/methodology/approach

    The authors have analyzed the XML files from Aftonbladet 1830 to 1862. The most frequent newspaper words not matching a high-quality reference corpus were selected to zoom in on the noisiest part of the paper. The variety of the interpretations generated by optical character recognition (OCR) was examined, as well as texts generated by auto-segmentation. The authors have made a limited ethnographic study of the digitization process.

    Findings

    The research shows that the digital collection of Aftonbladet contains extreme amounts of noise: millions of misinterpreted words generated by OCR, and millions of texts re-edited by the auto-segmentation tool. How the tools work is mostly unknown to the staff involved in the digitization process. Sticking to any idea of a provenance chain is hence impossible, since many steps have been outsourced to unknown factors affecting the source document.

    Originality/value

    The detailed examination of digitally transformed newspapers is valuable to scholars depending on newspaper databases in their research. The paper also highlights the fact that libraries outsourcing digitization processes run the risk of losing control over the quality of their collections.

  • 43.
    Khairova, Nina
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. National Technical University ”Kharkiv Polytechnic Institute”, Ukraine.
    Hamon, Thierry
    Institut Galilée, Univ. Sorbonne Paris Nord, France.
    Grabar, Natalia
    University of Lille, France.
    Burov, Yevhen
    Lviv Polytechnic National University, Ukraine.
    Preface: Computational Linguistics Workshop2023In: CoLInS 2023, Computational Linguistics and Intelligent Systems 2023: Proceedings of the 7th International Conference on Computational Linguistics and Intelligent Systems. Volume II: Computational Linguistics Workshop, CEUR-WS , 2023Conference paper (Refereed)
  • 44.
    Kleyko, Denis
    et al.
    Osipov, Evgeny
    De Silva, Daswin
    Wiklund, Urban
    Umeå University, Faculty of Medicine, Department of Radiation Sciences, Radiation Physics.
    Vyatkin, Valeriy
    Alahakoon, Damminda
    Distributed representation of n-gram statistics for boosting self-organizing maps with hyperdimensional computing2019In: Perspectives of system informatics / [ed] Nikolaj Bjørner, Irina Virbitskaite, Andrei Voronkov, Cham: Springer, 2019, p. 64-79Conference paper (Refereed)
    Abstract [en]

    This paper presents an approach for substantial reduction of the training and operating phases of Self-Organizing Maps in tasks of 2-D projection of multi-dimensional symbolic data for natural language processing such as language classification, topic extraction, and ontology development. The conventional approach for this type of problem is to use n-gram statistics as a fixed size representation for input of Self-Organizing Maps. The performance bottleneck with n-gram statistics is that the size of representation and as a result the computation time of Self-Organizing Maps grows exponentially with the size of n-grams. The presented approach is based on distributed representations of structured data using principles of hyperdimensional computing. The experiments performed on the European languages recognition task demonstrate that Self-Organizing Maps trained with distributed representations require fewer computations than the conventional n-gram statistics while well preserving the overall performance of Self-Organizing Maps.

  • 45.
    Kucherenko, Taras
    et al.
    SEED, Electronic Arts (EA), Stockholm, Sweden.
    Nagy, Rajmund
    KTH Royal Institute of Technology, Stockholm, Sweden.
    Yoon, Youngwoo
    ETRI, Daejeon, South Korea.
    Woo, Jieyeon
    ISIR, Sorbonne University, Paris, France.
    Nikolov, Teodor
    Umeå University.
    Tsakov, Mihail
    Umeå University.
    Henter, Gustav Eje
    KTH Royal Institute of Technology, Stockholm, Sweden.
    The GENEA challenge 2023: a large-scale evaluation of gesture generation models in monadic and dyadic settings2023In: ICMI '23: proceedings of the 25th international conference on multimodal interaction / [ed] Elisabeth André; Mohamed Chetouani; Dominique Vaufreydaz; Gale Lucas; Tanja Schultz; Louis-Philippe Morency; Alessandro Vinciarelli, Association for Computing Machinery (ACM), 2023, p. 792-801Conference paper (Refereed)
    Abstract [en]

    This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year's challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent's own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent's own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor.

  • 46.
    Li, Yunyao
    et al.
    IBM Research - Almaden.
    Grandison, Tyrone
    The Data-Driven Institute.
    Silveyra, Patricia
    University of North Carolina - Chapel Hill.
    Douraghy, Ali
    The National Academies of Sciences, Engineering and Medicine.
    Guan, Xinyu
    Yale University.
    Kieselbach, Thomas
    Umeå University, Umeå University Library.
    Li, Chengka
    University of Texas - Arlington.
    Zhang, Haiqi
    University of Texas - Arlington.
    Jennifer for COVID-19: An NLP-Powered Chatbot Built for the People and by the People to Combat Misinformation. 2020. In: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020. Conference paper (Refereed)
    Abstract [en]

    Just as SARS-CoV-2 continues to infect a growing number of people around the world, harmful misinformation about the outbreak also continues to spread. We designed and built the Jennifer chatbot to provide easily accessible information from reliable resources to answer questions related to the current COVID-19 pandemic. It covers a wide variety of topics, from case statistics to best practices for disease prevention and management.

  • 47.
    Lindgren, Eva
    et al.
    Umeå University, Faculty of Arts, Department of language studies.
    Sullivan, Kirk
    Umeå University, Faculty of Arts, Department of language studies.
    Zhao, Huahui
    Umeå University, Faculty of Arts, Department of language studies.
    Deutschmann, Mats
    Umeå University, Faculty of Arts, Department of language studies.
    Steinvall, Anders
    Umeå University, Faculty of Arts, Department of language studies.
    Developing Peer-to-Peer Supported Reflection as a Life-Long Learning Skill: an Example from the Translation Classroom. 2011. In: Human Development and Global Advancements through Information Communication Technologies: New Initiatives / [ed] Susheel Chhabra & Hakikur Rahman, Hershey USA: IGI publishing, 2011, 1, p. 188-210. Chapter in book (Refereed)
    Abstract [en]

    Life-long learning skills have moved from being a side effect of a formal education to skills that are explicitly trained during a university degree. In a case study, a university class undertook a translation from Swedish to English in a keystroke logging environment and then replayed their translations in pairs while discussing their thought processes when undertaking the translations, and why they made particular choices and changes to their translations. Computer keystroke logging coupled with peer-based intervention assisted the students in discussing how they worked with their translations, enabled them to see how their ideas relating to the translation developed as they worked with the text, and helped them develop reflection skills and learn from their peers. The process showed that computer keystroke logging coupled with peer-based intervention has the potential to (1) support student reflection and discussion around their translation tasks, (2) enhance student motivation and enthusiasm for translation and (3) develop peer-to-peer supported reflection as a life-long learning skill.

  • 48.
    Lindgren, Helena
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Heintz, Fredrik
    Linköping University, Linköping, Sweden.
    The wasp-ed AI curriculum: A holistic curriculum for artificial intelligence. 2023. In: INTED2023 Proceedings: 17th International Technology, Education and Development Conference, 2023, p. 6496-6502. Conference paper (Refereed)
    Abstract [en]

    Efforts in lifelong learning and competence development in Artificial Intelligence (AI) have been on the rise for several years. These initiatives have mostly been applied to Science, Technology, Engineering and Mathematics (STEM) disciplines. Even though there has been significant development in Digital Humanities to incorporate AI methods and tools in higher education, the potential for such competences in Arts, Humanities and Social Sciences is far from being realised. Furthermore, there is an increasing awareness that the STEM disciplines need to include competences relating to AI in humanity and society. This is especially important considering the widening and deepening of the impact of AI on society at large and individuals. 

    The aim of the presented work is to provide a broad and inclusive AI Curriculum that covers the breadth of the topic as it is seen today, which is significantly different from only a decade ago. It is important to note that by curriculum we mean an overview of the subject itself, rather than a particular education program. The curriculum is intended to be used as a foundation for educational activities in AI, for example to harmonize terminology, compare different programs, and identify educational gaps to be filled. An important aspect of the curriculum is that it covers the ethical, legal, and societal aspects of AI and is not limited to the STEM subjects, instead extending to a holistic, human-centred AI perspective.

    The curriculum is developed as part of the national research program WASP-ED, the Wallenberg AI and transformative technologies education development program. 

  • 49.
    Lindgren, Simon
    Umeå University, Faculty of Social Sciences, Department of Sociology.
    Introducing Connected Concept Analysis: A network approach to big text datasets. 2016. In: Text & Talk, ISSN 1860-7330, E-ISSN 1860-7349, Vol. 36, no 3, p. 341-362. Article in journal (Refereed)
    Abstract [en]

    This paper introduces Connected Concept Analysis (CCA) as a framework for text analysis which ties qualitative and quantitative considerations together in one unified model. Even though CCA can be used to map and analyze any full text dataset, of any size, the method was created specifically for taking the sensibilities of qualitative discourse analysis into the age of the Internet and big data. Using open data from a large online survey on habits and views relating to intellectual property rights, piracy and file sharing, I introduce CCA as a mixed-method approach aiming to bring out knowledge about corpuses of text, the sizes of which make it unfeasible to make comprehensive close readings. CCA aims to do this without reducing the text to numbers, as often becomes the case in content analysis. Instead of simply counting words or phrases, I draw on constant comparative coding for building concepts and on network analysis for connecting them. The result - a network graph visualization of key connected concepts in the analyzed text dataset - meets the need for text visualization systems that can support discourse analysis.

  • 50.
    Martin, Benjamin G.
    et al.
    Department of History of Science and Ideas, Uppsala University, Uppsala, Sweden.
    Norén, Fredrik Mohammedi
    Department of Media and Communication Studies, Malmö University, Malmö, Sweden.
    Mähler, Roger
    Umeå University, Faculty of Arts, Humlab.
    Marklund, Andreas
    Umeå University, Faculty of Arts, Humlab.
    Martin, Oriane
    Department of Linguistics, University of Lausanne, Lausanne, Switzerland.
    The curated UNESCO Courier 1.0: annotated corpora for digital research in the global humanities. 2024. In: Journal of Open Humanities Data, E-ISSN 2059-481X, Vol. 10, article id 20. Article in journal (Refereed)
    Abstract [en]

    The monthly magazine of the United Nations Educational, Scientific and Cultural Organization, founded in 1948 as The UNESCO Courier, represents an extraordinary resource for research on global themes in the humanities. We present the Curated Courier 1.0, a package of digital text corpora, text analysis tools, and supplementary material that aims to make the complete archive of this publication from 1948 to 2020 machine-readable, accessible, and reusable for digital text analysis. One corpus compiles the text of all articles, which we carefully reconstructed and linked to a comprehensive curated metadata index while excluding additional text (masthead, photo captions, letters to the editor, and so on). A second corpus brings together the complete text of all issues. This article first presents the value of Courier as a source for digital research in the global humanities. Second, it outlines how we created the curated corpus and discusses some challenges we met. Third, it offers examples of tools researchers might use to explore and utilize the annotated corpus and discusses a few approaches that we have developed and tested.
