Text analysis techniques and determining the similarity of course contents
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2025 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Natural language processing (NLP) is a growing subfield of AI with many applications. A common and seemingly straightforward one is document similarity, which can be implemented with a variety of NLP algorithms. This versatility comes with drawbacks: each algorithm tends to focus on one or a few factors of similarity, so it can excel at one type of similarity assessment but struggle with another. This thesis investigates three NLP techniques and their ability to automate similarity assessments, specifically the similarity of course content for use in course eligibility or course crediting. At present, this comparison is done manually. To determine which factors are important when crediting courses, three algorithms were implemented and run on various course comparison tests. The algorithms and the factors they represent were TF-IDF for weighted term overlap, N-Grams for contextual matching, and NER with keyword extraction for topic detection. Judged on overall accuracy, NER with keyword extraction seemed the best choice, until it became apparent that it was giving wrong answers more consistently and confidently. It assigned high similarity scores to courses that had some similarities, such as being from the same university, but were not similar enough to be credited as each other. N-Grams were the most reliable technique on both similar and dissimilar courses. TF-IDF did not perform adequately with its current vocabulary. In summary, context-based similarity with N-Grams proved to be a reliable and useful factor for automatic crediting of courses, but it needs further work before actual use.
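To make two of the similarity factors named in the abstract concrete, the following is a minimal sketch, not the thesis's implementation. It assumes Jaccard overlap of word bigrams as the N-Gram measure and cosine similarity over smoothed TF-IDF vectors as the weighted term overlap measure; the function names and the three course descriptions are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(tokens, n=2):
    """Return the set of contiguous word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(text_a, text_b, n=2):
    """Jaccard overlap of word n-gram sets (assumed measure, for illustration)."""
    a = word_ngrams(tokenize(text_a), n)
    b = word_ngrams(tokenize(text_b), n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document, with smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical course descriptions, not taken from the thesis data.
course_a = "Introduction to algorithms and data structures"
course_b = "Data structures and algorithms, an introduction"
course_c = "Medieval European history and literature"

vec_a, vec_b, vec_c = tfidf_vectors([course_a, course_b, course_c])
print("bigram a-b:", ngram_similarity(course_a, course_b))
print("bigram a-c:", ngram_similarity(course_a, course_c))
print("tf-idf a-b:", cosine(vec_a, vec_b))
print("tf-idf a-c:", cosine(vec_a, vec_c))
```

In this sketch the two measures react differently to reordered wording: the TF-IDF score ignores word order, while the bigram score only rewards shared word sequences. That difference is one way to see why techniques that focus on different factors can disagree on the same pair of courses.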

Place, publisher, year, edition, pages
2025.
Series
UMNAD ; 1524
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:umu:diva-235480; OAI: oai:DiVA.org:umu-235480; DiVA, id: diva2:1937821
Educational program
Bachelor of Science Programme in Computing Science
Supervisors
Examiners
Available from: 2025-02-17; Created: 2025-02-14; Last updated: 2025-02-17. Bibliographically approved.

Open Access in DiVA

fulltext (2562 kB), 65 downloads
File information
File name: FULLTEXT01.pdf; File size: 2562 kB; Checksum: SHA-512
42f0afd2d397ff4a57859f3ceafa8601ae7386b289ced78bdc0924dfd150a38b0fa859ac3ddbf37497f9c4c91fb3a2d039bbde415f297c03fc4fe0e138b6be65
Type: fulltext; Mimetype: application/pdf
