Effectivisation of keywords extraction process: A supervised binary classification approach of scraped words from company websites
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2023 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

In today’s digital era, establishing an online presence and maintaining a well-structured website is vital for companies to remain competitive in their respective markets. A crucial aspect of online success lies in strategically selecting the right words to optimize customer engagement and search engine visibility. However, this process is often time-consuming, involving extensive analysis of a company’s website as well as its competitors’. This thesis focuses on developing an efficient binary classification approach to identify key words and phrases extracted from multiple company websites. The data set used for this solution consists of approximately 92,000 scraped samples, primarily comprising non-key samples. Various features were extracted, and a word embedding model was employed to assess each sample’s relevance to its specific industry and topic. The logistic regression, decision tree and random forest algorithms were all explored and implemented as different solutions to the classification problem. The results indicated that the logistic regression model excelled in retaining keywords but was less effective in eliminating non-keywords. Conversely, the tree-based methods demonstrated superior classification of non-keywords, albeit at the cost of misclassifying a few keywords. Overall, the random forest approach outperformed the others, achieving a result of 76 percent in recall and 20 percent in precision when predicting key samples. In summary, this thesis presents a solution for classifying words and phrases from company websites into key and non-key categories, and the developed methodology could offer valuable insights for companies seeking to enhance their website optimization strategies.
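To make the reported recall and precision concrete, the following is a minimal Python sketch of how the two metrics are computed for the positive ("key") class. The function name `precision_recall` and the counts are hypothetical: the counts are chosen only so that the ratios match the 76 percent recall and 20 percent precision stated in the abstract, and they are not taken from the thesis.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for the positive ("key") class.

    tp: key samples predicted as key
    fp: non-key samples predicted as key
    fn: key samples predicted as non-key
    """
    predicted_positive = tp + fp
    actual_positive = tp + fn
    # Guard against division by zero when a class is empty.
    precision = tp / predicted_positive if predicted_positive else 0.0
    recall = tp / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts (NOT from the thesis): 100 true key samples,
# of which 76 are found, plus 304 non-key samples flagged as key.
tp, fn, fp = 76, 24, 304
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.20 recall=0.76
```

Under these illustrative counts, a precision of 20 percent means that roughly four non-key samples are flagged for every true key sample retained, which is a common trade-off when the positive class is rare.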

Place, publisher, year, edition, pages
2023. p. 56
Keywords [en]
Machine learning, keyword classification, unbalanced data, word embedding
National Category
Mathematics
Identifiers
URN: urn:nbn:se:umu:diva-209805
OAI: oai:DiVA.org:umu-209805
DiVA, id: diva2:1767551
External cooperation
Annalect
Educational program
Master of Science in Engineering and Management
Supervisors
Examiners
Available from: 2023-06-14. Created: 2023-06-14. Last updated: 2023-06-14. Bibliographically approved.

Open Access in DiVA

fulltext (686 kB), 343 downloads
File information
File name: FULLTEXT01.pdf
File size: 686 kB
Checksum (SHA-512): 92fd7c1af71ecb4199e7ecb3de0b7b46d30f66ce3c477d1f72c64a9bed9e703ade8aad931a75b7f4de77e3428a76267524f80aea2a2a2cf335a143334bda3acb
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Andersson, Josef; Fremling, Max
By organisation
Department of Mathematics and Mathematical Statistics
Mathematics

Total: 343 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 771 hits