Deep learning for news topic identification in limited supervision and unsupervised settings
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-6791-8284
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Djup maskininlärning för identifiering av nyhetsämnen i inlärningssituationer med begränsad eller ingen övervakning (Swedish)
Abstract [en]

In today's world, following news is crucial for decision-making and staying informed. With the growing volume of daily news, automated processing is essential for obtaining timely insights and for aiding individuals and corporations in navigating the complexities of the information society. Another use of automated processing is contextual advertising, which addresses privacy concerns associated with cookie-based advertising by placing ads solely based on web page content, without tracking users or their online behavior. Accurately determining and categorizing page content is therefore crucial for effective ad placements. The news media, heavily reliant on advertising to sustain operations, represent a substantial market for contextual advertising strategies.

Inspired by these practical applications and the advancements in deep learning over the past decade, this thesis mainly focuses on using deep learning for categorizing news articles into topics of varying granularity. Considering the dynamic nature of these applications and the limited availability of relevant labeled datasets for training models, the thesis emphasizes developing methods that can be trained effectively using unlabeled or partially labeled data. It proposes semi-supervised text classification models for categorizing datasets into predefined coarse-grained topics, where only a few labeled examples exist for each topic, while the majority of the dataset remains unlabeled. Furthermore, to better explore coarse-grained topics within news archives and streams, and to overcome the limitations of predefined topics in text classification, the thesis suggests deep clustering approaches that can be trained in unsupervised settings.

Moreover, to address the identification of fine-grained topics, the thesis introduces a novel story discovery model for monitoring event-based topics in multi-source news streams. Given that online news reporting often incorporates diverse modalities like text, images, video, and audio to convey information, the thesis finally initiates an investigation into the synergy between textual and visual elements in news article analysis. To achieve this objective, a text-image dataset was annotated, and a baseline was established for event-topic discovery in multimodal news streams. While primarily intended for news monitoring and contextual advertising, the proposed models can, more generally, be regarded as novel approaches in semi-supervised text classification, deep clustering, and news story discovery. Comparison with state-of-the-art baseline models demonstrates their effectiveness in addressing the respective objectives.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2024, p. 62
Series
Report / UMINF, ISSN 0348-0542 ; 24.04
Keywords [en]
Topic Identification, Data Clustering, News Stream Clustering, Semi-Supervised Learning, Unsupervised Learning, Event Topics, News Stories, Multimodal News, Document Classification, Document Clustering, Deep Learning, Deep Clustering, Pre-trained Language Models
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-222534
ISBN: 9789180703420 (print)
ISBN: 9789180703437 (electronic)
OAI: oai:DiVA.org:umu-222534
DiVA, id: diva2:1845905
Public defence
2024-04-16, MIT.A.121, MIT-huset, Umeå, 13:15 (English)
Available from: 2024-03-26 Created: 2024-03-20 Last updated: 2024-03-22. Bibliographically approved
List of papers
1. Cformer: Semi-Supervised Text Clustering Based on Pseudo Labeling
2021 (English) In: CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, ACM Digital Library, 2021, p. 3078-3082. Conference paper, Published paper (Refereed)
Abstract [en]

We propose a semi-supervised learning method called Cformer for automatic clustering of text documents in cases where clusters are described by a small number of labeled examples, while the majority of training examples are unlabeled. We motivate this setting with an application in contextual programmatic advertising, a type of content placement on news pages that does not exploit personal information about visitors but relies on the availability of a high-quality clustering computed on the basis of a small number of labeled samples.

To enable text clustering with little training data, Cformer leverages the teacher-student architecture of Meta Pseudo Labels. In addition to unlabeled data, Cformer uses a small amount of labeled data to describe the intended clusters. Our experimental results confirm that the proposed model improves on the state of the art when a reasonable amount of labeled data is available. The models are comparatively small and suitable for deployment in constrained environments with limited computing resources. The source code is available at https://github.com/Aha6988/Cformer.
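The confidence-filtered teacher-student loop behind this kind of pseudo labeling can be sketched in a deliberately simplified form. The fragment below is not Cformer itself: it replaces the transformer teacher and student with a hypothetical nearest-centroid scorer over pre-computed document embeddings, keeping only the core mechanism of promoting confident pseudo-labels into the training pool round by round.

```python
import numpy as np

def pseudo_label_loop(X_lab, y_lab, X_unlab, n_classes,
                      n_rounds=5, threshold=0.8):
    """Toy teacher-student pseudo labeling: a centroid 'teacher' scores
    unlabeled points; confident pseudo-labels grow the training pool."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    for _ in range(n_rounds):
        # "Teacher": class centroids of the current labeled pool.
        centroids = np.stack([X_train[y_train == c].mean(axis=0)
                              for c in range(n_classes)])
        # Confidence via a softmax over negative distances to centroids.
        d = -np.linalg.norm(X_unlab[:, None, :] - centroids[None], axis=-1)
        d -= d.max(axis=1, keepdims=True)            # numerical stability
        p = np.exp(d) / np.exp(d).sum(axis=1, keepdims=True)
        conf, pseudo = p.max(axis=1), p.argmax(axis=1)
        keep = conf > threshold
        if not keep.any():
            break
        # "Student" update: only confident pseudo-labels are admitted.
        X_train = np.vstack([X_train, X_unlab[keep]])
        y_train = np.concatenate([y_train, pseudo[keep]])
        X_unlab = X_unlab[~keep]
    return X_train, y_train
```

In the actual model both teacher and student are neural text classifiers and the teacher is itself updated from student feedback; the sketch only illustrates the confidence-thresholded label propagation.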

Place, publisher, year, edition, pages
ACM Digital Library, 2021
Keywords
meta pseudo clustering, semi-supervised learning, pseudo labeling
National Category
Natural Language Processing; Computer Sciences
Research subject
computational linguistics; Computer Science
Identifiers
urn:nbn:se:umu:diva-186787 (URN)
10.1145/3459637.3482073 (DOI)
001054156203023 ()
2-s2.0-85119178919 (Scopus ID)
978-1-4503-8446-9 (ISBN)
Conference
CIKM2021, 30th ACM International Conference on Information and Knowledge Management, Online via Queensland, Australia, November 1-5, 2021
Available from: 2021-09-03 Created: 2021-09-03 Last updated: 2025-04-24. Bibliographically approved
2. The efficiency of pre-training with objective masking in pseudo labeling for semi-supervised text classification
(English) Manuscript (preprint) (Other academic)
Keywords
Semi-Supervised Classification, Masked Language Modelling, Objective Masking, Topic Modelling
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-222531 (URN)
Available from: 2024-03-20 Created: 2024-03-20 Last updated: 2025-02-07
3. ADCluster: Adaptive Deep Clustering for unsupervised learning from unlabeled documents
2023 (English) In: Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP 2023) / [ed] Mourad Abbas; Abed Alhakim Freihat, Association for Computational Linguistics, 2023, p. 68-77. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce ADCluster, a deep document clustering approach based on language models that is trained to adapt to the clustering task. This adaptability is achieved through an iterative process where K-Means clustering is applied to the dataset, followed by iteratively training a deep classifier with generated pseudo-labels – an approach referred to as inner adaptation. The model is also able to adapt to changes in the data as new documents are added to the document collection. The latter type of adaptation, outer adaptation, is obtained by resuming the inner adaptation when a new chunk of documents has arrived. We explore two outer adaptation strategies, namely accumulative adaptation (training is resumed on the accumulated set of all documents) and non-accumulative adaptation (training is resumed using only the new chunk of data). We show that ADCluster outperforms established document clustering techniques on medium and long-text documents by a large margin. Additionally, our approach outperforms well-established baseline methods under both the accumulative and non-accumulative outer adaptation scenarios.
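The inner-adaptation loop can be illustrated with a minimal sketch. The version below substitutes the paper's language-model classifier with a plain softmax regression over fixed document vectors (an assumed stand-in): K-Means produces pseudo-labels, a classifier is trained on them, and the classifier's predictions become the pseudo-labels for the next round.

```python
import numpy as np

def kmeans_labels(X, k, iters=20, seed=0):
    # Plain K-Means returning cluster assignments.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        lab = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(axis=1)
        C = np.stack([X[lab == c].mean(axis=0) if (lab == c).any() else C[c]
                      for c in range(k)])
    return lab

def adcluster_sketch(X, k, rounds=3, lr=0.5, epochs=200):
    """Inner adaptation: K-Means pseudo-labels -> train a classifier on
    them -> let the classifier relabel the data -> repeat."""
    y = kmeans_labels(X, k)                      # initial pseudo-labels
    W, b = np.zeros((X.shape[1], k)), np.zeros(k)
    for _ in range(rounds):
        Y = np.eye(k)[y]                         # one-hot pseudo-labels
        for _ in range(epochs):                  # cross-entropy gradient steps
            z = X @ W + b
            p = np.exp(z - z.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            g = (p - Y) / len(X)
            W -= lr * (X.T @ g)
            b -= lr * g.sum(axis=0)
        y = (X @ W + b).argmax(axis=1)           # classifier relabels the data
    return y
```

Outer adaptation would correspond to resuming this loop when a new chunk of documents arrives, either on the accumulated collection or on the new chunk alone.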

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2023
Keywords
deep clustering, adaptive, deep learning, unsupervised, data stream
National Category
Computer Sciences
Research subject
Computer Science; computational linguistics
Identifiers
urn:nbn:se:umu:diva-220260 (URN)
Conference
6th International Conference on Natural Language and Speech Processing (ICNLSP 2023), Online, December 16-17, 2023.
Available from: 2024-01-31 Created: 2024-01-31 Last updated: 2024-07-02. Bibliographically approved
4. PromptStream: self-supervised news story discovery using topic-aware article representations
2024 (English) In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) / [ed] Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue, ELRA Language Resource Association, 2024, p. 13222-13232. Conference paper, Published paper (Refereed)
Abstract [en]

Given the importance of identifying and monitoring news stories within the continuous flow of news articles, this paper presents PromptStream, a novel method for unsupervised news story discovery. In order to identify coherent and comprehensive stories across the stream, it is crucial to create article representations that incorporate as much topic-related information from the articles as possible. PromptStream constructs these article embeddings using cloze-style prompting. These representations continually adjust to the evolving context of the news stream through self-supervised learning, employing a contrastive loss and a memory of the most confident article-story assignments from the most recent days. Extensive experiments with real news datasets highlight the notable performance of our model, establishing a new state of the art. Additionally, we delve into selected news stories to reveal how the model’s structuring of the article stream aligns with story progression.
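The streaming story assignment with a recency memory can be approximated as follows. The sketch assumes article embeddings are already computed (in the paper they come from cloze-style prompting and are refined contrastively, both omitted here): each article joins the most similar story if the cosine similarity to that story's centroid, built from a memory of recent assignments, clears a threshold; otherwise it opens a new story.

```python
import numpy as np

def stream_stories(embeddings, days, sim_threshold=0.8, memory_days=7):
    """Greedy online story assignment: match each article to the closest
    story centroid built from a sliding memory of recent assignments."""
    memory = []                           # (day, unit-norm embedding, story id)
    labels = []
    for e, day in zip(embeddings, days):
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        # Forget assignments older than the memory window.
        memory = [m for m in memory if day - m[0] <= memory_days]
        story_ids = sorted({m[2] for m in memory})
        best, best_sim = None, -1.0
        for s in story_ids:
            c = np.mean([m[1] for m in memory if m[2] == s], axis=0)
            sim = float(e @ (c / np.linalg.norm(c)))   # cosine similarity
            if sim > best_sim:
                best, best_sim = s, sim
        if best is None or best_sim < sim_threshold:
            best = story_ids[-1] + 1 if story_ids else 0  # open a new story
        memory.append((day, e, best))
        labels.append(best)
    return labels
```

PromptStream additionally keeps only the most confident article-story assignments in memory and updates the representations with a contrastive loss; this sketch keeps every assignment and leaves embeddings fixed.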

Place, publisher, year, edition, pages
ELRA Language Resource Association, 2024
Series
International conference on computational linguistics, ISSN 2951-2093
Keywords
news story discovery, online clustering, data stream, contrastive learning, cloze-style prompting, article embedding
National Category
Computer and Information Sciences
Research subject
computational linguistics; Computer Science
Identifiers
urn:nbn:se:umu:diva-222533 (URN)
2-s2.0-85195905320 (Scopus ID)
978-2-493814-10-4 (ISBN)
Conference
The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024
Note

Also part of series: LREC proceedings, ISSN: 2522-2686

Available from: 2024-03-20 Created: 2024-03-20 Last updated: 2024-06-25. Bibliographically approved
5. METOD: a dataset and baseline for multimodal discovery of event-based news topics
(English)Manuscript (preprint) (Other academic)
Keywords
news stream, multimodal news, news dataset, event-topic discovery, online clustering
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:umu:diva-222532 (URN)
Available from: 2024-03-20 Created: 2024-03-20 Last updated: 2025-02-18

Open Access in DiVA

fulltext (1484 kB), 264 downloads
File information
File name: FULLTEXT01.pdf, File size: 1484 kB, Checksum (SHA-512):
6ec0043227f5cfc30f798d6ee56bf5520b50a2b0d46c6e6891b7266f73930202fc7028e3138b7aeb1f50bc087ecd15235b614c44b7584ce3dad508b274622a58
Type: fulltext, Mimetype: application/pdf
spikblad (143 kB), 73 downloads
File information
File name: SPIKBLAD01.pdf, File size: 143 kB, Checksum (SHA-512):
011730024603acc955785340bd5f76d4775f3961d915905ca44effc36837dd562b8e05a65347a574805cac9606c13f6daa49a969a1a7351c3efce5bac3385931
Type: spikblad, Mimetype: application/pdf

Authority records

Hatefi, Arezoo

