Towards a Unified Terminology: A study on Siemens Energy Technical Information Systems
Umeå University, Faculty of Science and Technology, Department of Physics.
2024 (English)
Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Abstract [en]

The lack of standardized terminology within the Siemens Energy Technical Information Systems department, particularly when translating technical documents, affects both internal operations and international collaboration. This project explores the application of machine learning to automate the standardization of terminology in Siemens Energy technical documentation, with the aim of removing inconsistencies that reduce clarity. Specifically, it focuses on developing a Retrieval Augmented Generation (RAG) model that takes a text as input and outputs the same text with standardized terminology, thereby simplifying translation and communication. The methods include prompt engineering, semantic search with embeddings, and text generation with a large language model (LLM), specifically OpenAI’s GPT-3.5. The research involved preprocessing a non-standardized dataset, crafting effective prompts, and testing the model’s performance on both a generated test dataset and real-world documents. The model reached 66% accuracy on the test dataset and 60% accuracy on real-world documents, indicating that it can correct terminology with moderate success but is limited by the lack of real-world examples and the absence of an optimal dataset. These results underscore the need for a company-wide standardized terminology database, both to improve communication within Siemens Energy and to better support language models. The research contributes to the field of technical documentation and machine learning by demonstrating a structured approach to improving language model performance for terminology standardization. Time constraints limited the exploration of different embedding models and prompts, so future work could investigate embedding techniques further and integrate a standardized terminology framework to improve model accuracy.
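The retrieval and prompt-building steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the glossary entries, the function names (`embed`, `retrieve`, `build_prompt`) and the prompt wording are all hypothetical, and a toy bag-of-words vector stands in for the learned embeddings the thesis uses. The call to the LLM itself is left out; the sketch stops at the prompt that would be sent to it.

```python
import math
import re
from collections import Counter

# Hypothetical glossary: preferred term -> non-standard variants (illustrative only).
GLOSSARY = {
    "gas turbine": ["combustion turbine", "gas-fired turbine"],
    "compressor blade": ["compressor vane", "compressor airfoil"],
    "lubricating oil": ["lube oil", "lubrication oil"],
}


def embed(text):
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(text, k=2):
    """Return up to k preferred terms whose glossary entries best match the text."""
    query = embed(text)
    scored = []
    for preferred, variants in GLOSSARY.items():
        entry_text = " ".join([preferred, *variants])
        scored.append((cosine(query, embed(entry_text)), preferred))
    scored.sort(reverse=True)
    return [preferred for score, preferred in scored[:k] if score > 0]


def build_prompt(text, k=2):
    """Assemble an LLM prompt from the retrieved glossary entries (wording is an assumption)."""
    rules = [
        f"- use '{preferred}' instead of: {', '.join(GLOSSARY[preferred])}"
        for preferred in retrieve(text, k)
    ]
    return (
        "Rewrite the text using only the standardized terms below.\n"
        + "\n".join(rules)
        + f"\n\nText: {text}"
    )


if __name__ == "__main__":
    print(build_prompt("Check the lube oil level before starting the combustion turbine."))
```

Retrieving only the best-matching glossary entries keeps the prompt short, which is the usual motivation for retrieval over pasting an entire terminology list into every request.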

Place, publisher, year, edition, pages
2024.
Keywords [en]
Terminology, Large Language Model, Retrieval Augmented Generation, Prompt Engineering
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:umu:diva-228144
OAI: oai:DiVA.org:umu-228144
DiVA, id: diva2:1886644
External cooperation
Siemens Energy
Subject / course
Examensarbete i teknisk fysik
Educational program
Master of Science Programme in Engineering Physics
Presentation
2024-06-03, 14:08 (English)
Supervisors
Examiners
Available from: 2024-08-12 Created: 2024-08-02 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

fulltext (465 kB), 204 downloads
File information
File name: FULLTEXT01.pdf
File size: 465 kB
Checksum SHA-512: 77a16b008779b393ef5d0f44065513e0d83365c0180b212fb78e93498dec5e2bae443d8d08ab3ba58dc8095c7f604c770965842620bb7d4563d105dc24b9d0b6
Type: fulltext
Mimetype: application/pdf
