umu.se Publications

Publications (10 of 39)
Nakamura, T., Mishra, M., Tedeschi, S., Chai, Y., Stillerman, J. T., Friedrich, F., . . . Pyysalo, S. (2025). AURORA-M: open source continual pre-training for multilingual language and code. In: Owen Rambow; Leo Wanner; Marianna Apidianaki; Hend Al-Khalifa; Barbara Di Eugenio; Steven Schockaert (Ed.), Proceedings - International Conference on Computational Linguistics, COLING. Paper presented at 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates, 19-24 January 2025 (pp. 656-678). Association for Computational Linguistics
AURORA-M: open source continual pre-training for multilingual language and code
2025 (English) In: Proceedings - International Conference on Computational Linguistics, COLING / [ed] Owen Rambow; Leo Wanner; Marianna Apidianaki; Hend Al-Khalifa; Barbara Di Eugenio; Steven Schockaert, Association for Computational Linguistics, 2025, p. 656-678. Conference paper, Published paper (Refereed)
Abstract [en]

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and STARCODER aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents AURORA-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from STARCODERPLUS on 435B additional tokens, AURORA-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate AURORA-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2025
Series
International Conference on Computational Linguistics, ISSN 2951-2093
National Category
Natural Language Processing; Artificial Intelligence; Formal Methods
Identifiers
urn:nbn:se:umu:diva-238631 (URN); 2-s2.0-105000111106 (Scopus ID); 9798891761971 (ISBN)
Conference
31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates, 19 January to 24 January, 2025.
Available from: 2025-05-21. Created: 2025-05-21. Last updated: 2025-05-21. Bibliographically approved
Kristan, M., Matas, J., Tokmakov, P., Felsberg, M., Zajc, L. Č., Lukežič, A., . . . Zunin, V. (2025). The second visual object tracking segmentation VOTS2024 challenge results. In: Alessio Del Bue; Cristian Canton; Jordi Pont-Tuset; Tatiana Tommasi (Ed.), Computer Vision – ECCV 2024 Workshops: ECCV 2024. Paper presented at Workshops that were held in conjunction with the 18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, September 29 - October 4, 2024 (pp. 357-383). Cham: Springer
The second visual object tracking segmentation VOTS2024 challenge results
2025 (English) In: Computer Vision – ECCV 2024 Workshops: ECCV 2024 / [ed] Alessio Del Bue; Cristian Canton; Jordi Pont-Tuset; Tatiana Tommasi, Cham: Springer, 2025, p. 357-383. Conference paper, Published paper (Refereed)
Abstract [en]

The Visual Object Tracking Segmentation VOTS2024 challenge is the twelfth annual tracker benchmarking activity of the VOT initiative. This challenge consolidates the new tracking setup proposed in VOTS2023, which merges short-term and long-term as well as single-target and multiple-target tracking, with segmentation masks as the only target location specification. Two sub-challenges are considered: the VOTS2024 standard challenge, focusing on classical objects, and VOTSt2024, which considers objects undergoing a topological transformation. Both challenges use the same performance evaluation methodology. Results of 28 submissions are presented and analyzed. A leaderboard, participating trackers' details, the source code, the datasets, and the evaluation kit are publicly available on the website (https://www.votchallenge.net/vots2024/).

Place, publisher, year, edition, pages
Cham: Springer, 2025
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15629
Keywords
performance evaluation, tracking and segmentation, transformative object tracking, VOTS
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:umu:diva-240095 (URN); 10.1007/978-3-031-91767-7_24 (DOI); 2-s2.0-105007227161 (Scopus ID); 9783031917660 (ISBN)
Conference
Workshops that were held in conjunction with the 18th European Conference on Computer Vision, ECCV 2024, Milan, Italy, September 29 - October 4, 2024
Available from: 2025-06-12. Created: 2025-06-12. Last updated: 2025-06-12. Bibliographically approved
Volodina, E., Alfter, D., Dobnik, S., Lindström Tiedemann, T., Sánchez, R. M., Szawerna, M. I. & Vu, X.-S. (2024). Introduction. In: Elena Volodina; David Alfter; Simon Dobnik; Therese Lindström Tiedemann; Ricardo Muñoz Sánchez; Maria Irena Szawerna; Xuan-Son Vu (Ed.), Proceedings of the workshop on computational approaches to language data pseudonymization (CALD-pseudo 2024). Paper presented at 1st Workshop on Computational Approaches to Language Data Pseudonymization, CALD-pseudo 2024 (pp. ii-iii). Association for Computational Linguistics
Introduction
2024 (English) In: Proceedings of the workshop on computational approaches to language data pseudonymization (CALD-pseudo 2024) / [ed] Elena Volodina; David Alfter; Simon Dobnik; Therese Lindström Tiedemann; Ricardo Muñoz Sánchez; Maria Irena Szawerna; Xuan-Son Vu, Association for Computational Linguistics, 2024, p. ii-iii. Chapter in book (Other academic)
Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-223761 (URN); 2-s2.0-85190584439 (Scopus ID); 9798891760851 (ISBN)
Conference
1st Workshop on Computational Approaches to Language Data Pseudonymization, CALD-pseudo 2024
Note

Conference: CALD-pseudo 2024, 1st Workshop on Computational Approaches to Language Data Pseudonymization, St. Julian's, 21 March 2024.

Available from: 2024-05-06. Created: 2024-05-06. Last updated: 2024-05-06. Bibliographically approved
Tran, K.-T., Hy, T. S., Jiang, L. & Vu, X.-S. (2024). MGLEP: multimodal graph learning for modeling emerging pandemics with big data. Scientific Reports, 14(1), Article ID 16377.
MGLEP: multimodal graph learning for modeling emerging pandemics with big data
2024 (English) In: Scientific Reports, E-ISSN 2045-2322, Vol. 14, no 1, article id 16377. Article in journal (Refereed) Published
Abstract [en]

Accurate forecasting and analysis of emerging pandemics play a crucial role in effective public health management and decision-making. Traditional approaches primarily rely on epidemiological data, overlooking other valuable sources of information that could act as sensors or indicators of pandemic patterns. In this paper, we propose a novel framework, MGLEP, that integrates temporal graph neural networks and multimodal data for learning and forecasting. We incorporate big data sources, including social media content, by utilizing specific pre-trained language models and discovering the underlying graph structure among users. This integration provides rich indicators of pandemic dynamics through learning with temporal graph neural networks. Extensive experiments demonstrate the effectiveness of our framework in pandemic forecasting and analysis, outperforming baseline methods across different areas, pandemic situations, and prediction horizons. The fusion of temporal graph learning and multimodal data enables a comprehensive understanding of the pandemic landscape with less time lag, lower cost, and a richer set of potential indicators.

Place, publisher, year, edition, pages
Springer Nature, 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-227969 (URN); 10.1038/s41598-024-67146-y (DOI); 001337302400019 (); 39013976 (PubMedID); 2-s2.0-85198649048 (Scopus ID)
Funder
The Swedish Foundation for International Cooperation in Research and Higher Education (STINT), MG2020-8848; Knut and Alice Wallenberg Foundation
Available from: 2024-07-23. Created: 2024-07-23. Last updated: 2025-04-24. Bibliographically approved
Tran, K.-T., Vu, X.-S., Nguyen, K. & Nguyen, H. D. (2024). NeuProNet: neural profiling networks for sound classification. Neural Computing & Applications, 36(11), 5873-5887
NeuProNet: neural profiling networks for sound classification
2024 (English) In: Neural Computing & Applications, ISSN 0941-0643, E-ISSN 1433-3058, Vol. 36, no 11, p. 5873-5887. Article in journal (Refereed) Published
Abstract [en]

Real-world sound signals exhibit various aspects of grouping and profiling behaviors, such as being recorded from identical sources, having similar environmental settings, or encountering related background noises. In this work, we propose novel neural profiling networks (NeuProNet) capable of learning and extracting high-level unique profile representations from sounds. An end-to-end framework is developed so that any backbone architecture can be plugged in and trained, achieving better performance in any downstream sound classification task. We introduce an in-batch profile grouping mechanism based on profile awareness and attention pooling to produce reliable and robust features with contrastive learning. Furthermore, extensive experiments on multiple benchmark datasets and tasks show that neural computing models trained under the guidance of our framework achieve significant performance gains across all evaluation tasks. In particular, the integration of NeuProNet surpasses recent state-of-the-art (SoTA) approaches on the UrbanSound8K and VocalSound datasets, with statistically significant improvements in benchmarking metrics of up to 5.92% in accuracy over the previous SoTA method and up to 20.19% over baselines. Our work provides a strong foundation for utilizing neural profiling in machine learning tasks.
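The in-batch profile grouping idea described in the abstract can be sketched in a few lines: samples sharing a profile id get an attention-weighted group summary appended to their features. This is a toy illustration with a made-up scoring function, not the authors' NeuProNet implementation; all names and parameters here are hypothetical.

```python
import numpy as np

def profile_attention_pool(feats, profiles):
    """Softmax attention pooling within each profile group.

    feats:    (batch, dim) backbone features
    profiles: (batch,) integer profile/group ids
    Returns (batch, 2*dim): each feature concatenated with an
    attention-weighted summary of its profile group.
    """
    # Toy attention scores: similarity to the batch mean direction.
    scores = feats @ feats.mean(axis=0)
    pooled = np.empty_like(feats)
    for p in np.unique(profiles):
        idx = np.where(profiles == p)[0]
        s = scores[idx]
        w = np.exp(s - s.max())          # stable softmax over the group
        w /= w.sum()
        pooled[idx] = (w[:, None] * feats[idx]).sum(axis=0)
    return np.concatenate([feats, pooled], axis=1)
```

Every member of a group receives the same pooled summary, so a downstream classifier can exploit profile-level context alongside per-sample features.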

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Audio classification, Deep learning, Neural profiling network, Signal processing
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-220001 (URN); 10.1007/s00521-023-09361-8 (DOI); 001152242400004 (); 2-s2.0-85182479547 (Scopus ID)
Available from: 2024-01-31. Created: 2024-01-31. Last updated: 2024-05-07. Bibliographically approved
Szawerna, M. I., Dobnik, S., Lindström Tiedemann, T., Muñoz Sánchez, R., Vu, X.-S. & Volodina, E. (2024). Pseudonymization categories across domain boundaries. In: Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue (Ed.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Paper presented at The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024 (pp. 13303-13314). ELRA Language Resource Association
Pseudonymization categories across domain boundaries
2024 (English) In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) / [ed] Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue, ELRA Language Resource Association, 2024, p. 13303-13314. Conference paper, Published paper (Refereed)
Abstract [en]

Linguistic data, a component critical not only for research in a variety of fields but also for the development of various Natural Language Processing (NLP) applications, can contain personal information. As a result, its accessibility is limited, both from a legal and an ethical standpoint. One of the solutions is the pseudonymization of the data. Key stages of this process include the identification of sensitive elements and the generation of suitable surrogates in a way that keeps the data useful for the intended task. Within this paper, we conduct an analysis of tagsets that have previously been utilized in anonymization and pseudonymization. We also investigate what kinds of Personally Identifiable Information (PII) appear in various domains. These analyses reveal that none of the examined tagsets account for all of the PII types present across domains at the level of detail seemingly required for pseudonymization. We advocate for a universal system of tags for categorizing PII leading up to its replacement. Such categorization could facilitate the generation of grammatically, semantically, and sociolinguistically appropriate surrogates for the kinds of information that are considered sensitive in a given domain, resulting in a system that would enable dynamic pseudonymization while keeping the texts readable and useful for future research in various fields.

Place, publisher, year, edition, pages
ELRA Language Resource Association, 2024
Series
International Conference on Computational Linguistics, ISSN 2951-2093
Keywords
anonymization, deidentification, privacy, pseudonymization, universal tagset
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-226956 (URN); 2-s2.0-85195988143 (Scopus ID); 978-2-493814-10-4 (ISBN)
Conference
The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024
Note

Also part of series: LREC proceedings, ISSN: 2522-2686

Available from: 2024-06-25. Created: 2024-06-25. Last updated: 2025-02-07. Bibliographically approved
Hoang, V.-T., Tran, K.-T., Vu, X.-S., Nguyen, D.-K., Bhuyan, M. & Nguyen, H. D. (2024). Wave2Graph: integrating spectral features and correlations for graph-based learning in sound waves. AI Open, 5, 115-125
Wave2Graph: integrating spectral features and correlations for graph-based learning in sound waves
2024 (English) In: AI Open, ISSN 2666-6510, Vol. 5, p. 115-125. Article in journal (Refereed) Published
Abstract [en]

This paper investigates a novel graph-based representation of sound waves inspired by the physical phenomenon of correlated vibrations. We propose the Wave2Graph framework for integrating multiple acoustic representations, including the spectrum of frequencies and correlations, into various neural computing architectures to achieve new state-of-the-art performance in sound classification. The capability and reliability of our end-to-end framework are clearly demonstrated in voice pathology for low-cost and non-invasive mass-screening of medical conditions, including respiratory illnesses and Alzheimer's Dementia. We conduct extensive experiments on multiple public benchmark datasets (ICBHI and ADReSSo) and our real-world dataset (IJSound: respiratory disease detection using coughs and breaths). The Wave2Graph framework consistently outperforms previous state-of-the-art methods by a large margin, with up to a 7.65% improvement, demonstrating the usefulness of graph-based representation in signal processing and machine learning.
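The core construction in the abstract, spectral node features connected by a correlation graph, can be sketched roughly as follows. This is a simplified illustration under assumed parameters (frame length, number of bands, correlation threshold), not the paper's actual pipeline; the function and parameter names are hypothetical.

```python
import numpy as np

def wave_to_graph(wave, n_bands=8, frame=64, thresh=0.5):
    """Build a toy graph from a 1-D sound wave.

    Nodes are frequency bands; node features are per-frame band
    energies; edges connect bands whose energy time-series are
    strongly correlated (|corr| > thresh).
    """
    n_frames = len(wave) // frame
    frames = wave[: n_frames * frame].reshape(n_frames, frame)
    mags = np.abs(np.fft.rfft(frames, axis=1))            # (frames, bins)
    bins_per_band = mags.shape[1] // n_bands
    feats = mags[:, : bins_per_band * n_bands]
    feats = feats.reshape(n_frames, n_bands, bins_per_band).mean(-1)
    C = np.corrcoef(feats.T)                              # (bands, bands)
    A = (np.abs(C) > thresh).astype(float)                # adjacency
    np.fill_diagonal(A, 0.0)                              # no self-loops
    return feats, A
```

The returned node features and adjacency matrix could then be fed to any graph neural network; the real framework learns these representations end to end rather than thresholding a fixed correlation.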

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
Wave2Graph, Graph neural network, Correlation, Sound signal processing, Neural network architecture
National Category
Computer and Information Sciences; Electrical Engineering, Electronic Engineering, Information Engineering; Medical Engineering
Identifiers
urn:nbn:se:umu:diva-232448 (URN); 10.1016/j.aiopen.2024.08.004 (DOI); 001336039100001 (); 2-s2.0-85203410523 (Scopus ID)
Available from: 2024-11-29. Created: 2024-11-29. Last updated: 2024-11-29. Bibliographically approved
Hatefi, A., Vu, X.-S., Bhuyan, M. H. & Drewes, F. (2023). ADCluster: Adaptive Deep Clustering for unsupervised learning from unlabeled documents. In: Mourad Abbas; Abed Alhakim Freihat (Ed.), Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023). Paper presented at 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), Online, December 16-17, 2023 (pp. 68-77). Association for Computational Linguistics
ADCluster: Adaptive Deep Clustering for unsupervised learning from unlabeled documents
2023 (English) In: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023) / [ed] Mourad Abbas; Abed Alhakim Freihat, Association for Computational Linguistics, 2023, p. 68-77. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce ADCluster, a deep document clustering approach based on language models that is trained to adapt to the clustering task. This adaptability is achieved through an iterative process where K-Means clustering is applied to the dataset, followed by iteratively training a deep classifier with generated pseudo-labels – an approach referred to as inner adaptation. The model is also able to adapt to changes in the data as new documents are added to the document collection. The latter type of adaptation, outer adaptation, is obtained by resuming the inner adaptation when a new chunk of documents has arrived. We explore two outer adaptation strategies, namely accumulative adaptation (training is resumed on the accumulated set of all documents) and non-accumulative adaptation (training is resumed using only the new chunk of data). We show that ADCluster outperforms established document clustering techniques on medium and long-text documents by a large margin. Additionally, our approach outperforms well-established baseline methods under both the accumulative and non-accumulative outer adaptation scenarios.
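The inner-adaptation loop described above (cluster with K-Means, train a classifier on the pseudo-labels, re-label, repeat) can be sketched on toy vectors. This is a simplified stand-in that uses a linear softmax classifier instead of the paper's deep language-model-based classifier; all names and parameters are illustrative.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain Lloyd's K-Means with farthest-point initialization;
    returns cluster assignments used as pseudo-labels."""
    centers = np.empty((k, X.shape[1]))
    centers[0] = X[0]
    for j in range(1, k):
        d = ((X[:, None] - centers[None, :j]) ** 2).sum(-1).min(1)
        centers[j] = X[np.argmax(d)]
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels

def train_softmax(X, y, k, epochs=200, lr=0.5):
    """Tiny linear softmax classifier trained on pseudo-labels
    (stand-in for the deep classifier in the paper)."""
    W = np.zeros((X.shape[1], k))
    Y = np.eye(k)[y]
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        W -= lr * X.T @ (p - Y) / len(X)       # cross-entropy gradient step
    return W

def inner_adaptation(X, k, rounds=3):
    """Alternate K-Means pseudo-labelling and classifier training."""
    labels = kmeans(X, k)
    for _ in range(rounds):
        W = train_softmax(X, labels, k)
        labels = np.argmax(X @ W, axis=1)      # classifier refines labels
    return labels, W
```

Outer adaptation would simply resume `inner_adaptation` when a new chunk of documents arrives, either on the accumulated set or on the new chunk alone.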

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
Keywords
deep clustering, adaptive, deep learning, unsupervised, data stream
National Category
Computer Sciences
Research subject
Computer Science; Computational Linguistics
Identifiers
urn:nbn:se:umu:diva-220260 (URN)
Conference
6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), Online, December 16-17, 2023.
Available from: 2024-01-31 Created: 2024-01-31 Last updated: 2024-07-02Bibliographically approved
Vu, X.-S., Tran, S. N. & Jiang, L. (2023). dpUGC: learn differentially private representation for user generated contents. In: Alexander Gelbukh (Ed.), Computational linguistics and intelligent text processing: 20th international conference, CICLing 2019, La Rochelle, France, April 7–13, 2019, revised selected papers, part I. Paper presented at 20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 7-13, 2019. (pp. 316-331). Springer, 13451
dpUGC: learn differentially private representation for user generated contents
2023 (English) In: Computational linguistics and intelligent text processing: 20th international conference, CICLing 2019, La Rochelle, France, April 7–13, 2019, revised selected papers, part I / [ed] Alexander Gelbukh, Springer, 2023, Vol. 13451, p. 316-331. Conference paper, Published paper (Refereed)
Abstract [en]

This paper first proposes a simple yet efficient generalized approach to applying differential privacy to text representation (i.e., word embedding). Based on it, we propose a user-level approach to learning a personalized differentially private word embedding model on user generated contents (UGC). To the best of our knowledge, this is the first work on learning a user-level differentially private word embedding model from text for sharing. The proposed approaches protect the privacy of the individual from re-identification and, in particular, provide a better trade-off between privacy and data utility on UGC data for sharing. The experimental results show that the trained embedding models are applicable for classic text analysis tasks (e.g., regression). Moreover, the proposed approaches to learning differentially private embedding models are both framework- and data-independent, which facilitates deployment and sharing. The source code is available at https://github.com/sonvx/dpText.
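As a rough illustration of the general idea of sharing perturbed representations, here is a standard Laplace-mechanism sketch. This is a generic output-perturbation example, not the paper's actual training procedure (which builds privacy into embedding learning itself; see the linked dpText repository); the function name and parameters are hypothetical.

```python
import numpy as np

def privatize_embeddings(E, epsilon, sensitivity=1.0, seed=0):
    """Laplace mechanism sketch: add noise with scale
    sensitivity/epsilon to every coordinate of the embedding
    matrix before it is shared. Smaller epsilon means stronger
    privacy and noisier vectors."""
    rng = np.random.default_rng(seed)
    return E + rng.laplace(0.0, sensitivity / epsilon, size=E.shape)
```

The privacy/utility trade-off the abstract mentions shows up directly here: shrinking epsilon inflates the noise scale, degrading downstream task accuracy while strengthening the privacy guarantee.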

Place, publisher, year, edition, pages
Springer, 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13451
Keywords
Private word embedding, Differential privacy, UGC
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-160887 (URN); 10.1007/978-3-031-24337-0_23 (DOI); 2-s2.0-85149907226 (Scopus ID); 978-3-031-24336-3 (ISBN); 978-3-031-24337-0 (ISBN)
Conference
20th International Conference on Computational Linguistics and Intelligent Text Processing, La Rochelle, France, April 7-13, 2019.
Note

Originally included in thesis in manuscript form. 

Available from: 2019-06-25. Created: 2019-06-25. Last updated: 2025-02-07. Bibliographically approved
Volodina, E., Dobnik, S., Lindström Tiedemann, T. & Vu, X.-S. (2023). Grandma Karl is 27 years old - research agenda for pseudonymization of research data. In: Proceedings - IEEE 9th International Conference on Big Data Computing Service and Applications, BigDataService 2023. Paper presented at 9th IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2023, 17-20 July 2023, Athens, Greece (pp. 229-233). IEEE
Grandma Karl is 27 years old - research agenda for pseudonymization of research data
2023 (English) In: Proceedings - IEEE 9th International Conference on Big Data Computing Service and Applications, BigDataService 2023, IEEE, 2023, p. 229-233. Conference paper, Published paper (Refereed)
Abstract [en]

Accessibility of research data is critical for advances in many research fields, but textual data often cannot be shared due to the personal and sensitive information it contains, e.g. names or political opinions. The General Data Protection Regulation (GDPR) suggests pseudonymization as a solution to secure open access to research data, but we need to learn more about pseudonymization as an approach before adopting it for the manipulation of research data. This paper outlines a research agenda within pseudonymization, namely the need for studies into the effects of pseudonymization on unstructured data in relation to e.g. readability and language assessment, as well as the effectiveness of pseudonymization as a way of protecting writer identity, while also exploring different ways of developing context-sensitive algorithms for detection, labelling, and replacement of personal information in unstructured data. The recently granted project on pseudonymization, 'Grandma Karl is 27 years old' (https://spraakbanken.gu.se/en/projects/mormor-karl), addresses exactly those challenges.

Place, publisher, year, edition, pages
IEEE, 2023
Keywords
natural language processing, pseudonymization
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-215235 (URN); 10.1109/BigDataService58306.2023.00047 (DOI); 2-s2.0-85173010164 (Scopus ID); 9798350333794 (ISBN); 9798350335347 (ISBN)
Conference
9th IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2023, 17-20 July 2023, Athens, Greece
Funder
Swedish Research Council, 2022-02311; Swedish Research Council, 2023-2029
Available from: 2023-10-17. Created: 2023-10-17. Last updated: 2023-10-17. Bibliographically approved
Identifiers

ORCID iD: orcid.org/0000-0001-8820-2405