Towards a Unified Terminology: A study on Siemens Energy Technical Information Systems
2024 (English)
Independent thesis Advanced level (professional degree), 20 credits / 30 HE credits
Student thesis
Abstract [en]
The lack of standardized terminology within the Siemens Energy Technical Information Systems department, particularly in the translation of technical documents, affects both internal operations and international collaboration. This project explores the application of machine learning to automate the standardization of terminology in Siemens Energy technical documentation, with the aim of resolving inconsistencies that hinder clarity and international collaboration. Specifically, it focuses on developing a Retrieval Augmented Generation (RAG) model that takes text as input and outputs text with standardized terminology, thereby simplifying translation and communication processes. The methods include prompt engineering, semantic search with embeddings, and text generation with a large language model (LLM), specifically OpenAI’s GPT-3.5. The research involved preprocessing a non-standardized dataset, crafting effective prompts, and testing the model’s performance on both a generated test dataset and real-world documents. The model achieved 66% accuracy on the test dataset and 60% accuracy on real-world documents, indicating that while it can correct terminology with moderate success, it is limited by a lack of real-world examples and by a suboptimal dataset. These results underscore the need for a company-wide standardized terminology database, both to improve communication within Siemens Energy and to facilitate better results from language models. The research contributes to the fields of technical documentation and machine learning by demonstrating a structured approach to improving language model performance for terminology standardization. Time constraints limited the exploration of different embedding models and prompts; future work could investigate embedding techniques further and integrate a standardized terminology framework to improve model accuracy.
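As an illustration of the retrieve-then-prompt pattern described in the abstract, the following is a minimal sketch and not the thesis implementation. The glossary entries, the function names (`embed`, `retrieve`, `build_prompt`) and the bag-of-words "embedding" are all assumptions made for this example; the thesis used learned semantic embeddings and GPT-3.5, neither of which is called here. The sketch stops at building the prompt that would be sent to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical glossary (invented for illustration, not Siemens Energy data):
# non-standard variant -> preferred standardized term.
GLOSSARY = {
    "gas generator": "gas turbine",
    "turning gear": "barring gear",
    "lube oil": "lubricating oil",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(text, k=2):
    """Return up to k glossary entries most similar to the input text."""
    query = embed(text)
    scored = [(cosine(query, embed(variant)), variant) for variant in GLOSSARY]
    scored.sort(reverse=True)
    # Keep only entries that share at least one token with the input.
    return [(variant, GLOSSARY[variant]) for score, variant in scored[:k] if score > 0]


def build_prompt(text, k=2):
    """Assemble the prompt that would be passed to a language model."""
    rules = "\n".join(
        f'- use "{standard}" instead of "{variant}"'
        for variant, standard in retrieve(text, k)
    )
    return (
        "Rewrite the text using the standardized terminology below.\n"
        f"{rules}\n\nText: {text}"
    )


if __name__ == "__main__":
    print(build_prompt("Check the lube oil level before start."))
```

In a real pipeline the `embed` function would be replaced by an embedding model and the returned prompt would be sent to the LLM; the retrieval step keeps the prompt limited to glossary entries relevant to the input text.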
Place, publisher, year, edition, pages
2024.
Keywords [en]
Terminology, Large Language Model, Retrieval Augmented Generation, Prompt Engineering
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:umu:diva-228144
OAI: oai:DiVA.org:umu-228144
DiVA, id: diva2:1886644
External cooperation
Siemens Energy
Subject / course
Examensarbete i teknisk fysik
Educational program
Master of Science Programme in Engineering Physics
Presentation
2024-06-03, 14:08 (English)
Supervisors
Examiners
2024-08-12, 2024-08-02, 2025-02-07
Bibliographically approved