Machine learning-based diagnostics and observability in mobile networks
Sundqvist, Tobias
Umeå University, Faculty of Science and Technology, Department of Computing Science. (ADS LAB) ORCID iD: 0000-0001-9013-6603
2023 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Maskininlärningsbaserad diagnostik och observerbarhet i mobila nätverk (Swedish)
Abstract [en]

To meet the high-performance and reliability demands of 5G, the Radio Access Network (RAN) is moving to a cloud-native architecture. The new microservice architecture promises increased operational efficiency and a shorter time-to-market, but it also comes with a price. The new distributed and virtualized architecture is far more complex than ever before, and with the increasing number of features it brings, troubleshooting becomes more difficult. So far, RAN troubleshooters have relied on their expertise to analyze systems manually, but the ever-growing data and increased complexity make it challenging to grasp system behavior.

This thesis makes three contributions: the proposed machine learning and statistical methods help RAN troubleshooters find deviations in system logs, identify the root cause of these deviations, and improve the system's observability. These methods learn the application's behavior from system log events and can identify behavior deviations from many different aspects. The thesis also demonstrates how observability can be improved by using a new software instrumentation guideline. The guideline enables the tracking of systemized procedures and enhances system understanding. The purpose of the guideline is to make RAN developers aware that machine learning can utilize debug information and help their troubleshooting process. To familiarize the reader with the research area, the thesis first discusses the challenges and the methods that can be used to detect anomalies, perform root cause analysis, and observe RAN system behavior. The proposed research methods are integrated and tested in an advanced 5G test bed to evaluate the methods' accuracy, speed, system impact, and implementation cost.

The results demonstrate the advantage of using machine learning and statistical methods when troubleshooting the behavior of RAN. Machine learning methods similar to those presented in this thesis may help those who troubleshoot RAN and accelerate the development of 5G. The thesis concludes by presenting potential research areas where this research could be further developed and applied, both in RAN and in other systems.

Abstract [sv] (translated to English)

To meet the high demands on performance and reliability in the new 5G mobile network, the Radio Access Network (RAN) is now moving to a cloud-based architecture. The new microservice architecture is intended to increase scalability and performance and to shorten lead times for product deliveries. The distributed and virtualized architecture is, however, more complex than before, which makes troubleshooting harder. Until now, those who troubleshoot RAN have relied on their expertise to analyze the system manually, but the constantly growing volume of data and the increased complexity make it difficult to understand the system's behavior.

This thesis contributes knowledge in three closely related areas: the proposed machine learning and statistical methods help RAN troubleshooters find deviations in system logs, help identify the root cause of these deviations, and improve the system's observability. These methods learn the behavior of RAN from events in system logs and can identify a number of behavioral deviations. The thesis also shows how observability can be improved by using a new guideline for software instrumentation. The guideline makes it possible to follow how RAN applications affect one another, which in turn improves system understanding. The purpose of the guideline is to make those who work with RAN aware of how machine learning can help in their troubleshooting process. To familiarize the reader with the research area, the thesis first discusses the challenges and the methods that can be used to detect deviations in RAN data, find the cause of the deviations, and improve the observability of the system. To evaluate the accuracy, speed, system impact, and implementation cost of the proposed methods, they are integrated and tested in an advanced 5G test bed.

The results show the major advantages of using machine learning and statistical methods when troubleshooting the behavior of RAN. Machine learning methods similar to those presented in this thesis may come to help those who troubleshoot RAN and accelerate the development of 5G. The thesis concludes with a presentation of potential research areas where its research could be further developed and applied, both in RAN and in other systems.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2023, p. 45
Series
Report / UMINF, ISSN 0348-0542 ; 23.02
Keywords [en]
Anomaly detection, Root cause analysis, Observability, Machine learning, Radio Access Network, 5G
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-206055; ISBN: 978-91-8070-053-5 (print); ISBN: 978-91-8070-054-2 (electronic); OAI: oai:DiVA.org:umu-206055; DiVA, id: diva2:1746128
Public defence
2023-04-21, Aula Biologica BIO.E.203, Umeå, 09:15 (English)
Available from: 2023-03-31; Created: 2023-03-27; Last updated: 2024-07-02; Bibliographically approved
List of papers
1. Boosted Ensemble Learning for Anomaly Detection in 5G RAN
2020 (English) In: Artificial Intelligence Applications and Innovations, AIAI 2020 / [ed] Maglogiannis I., Iliadis L., Pimenidis E., Springer, 2020, Vol. 583, p. 15-30. Conference paper, Published paper (Refereed)
Abstract [en]

The emerging 5G networks promise higher throughput and faster, more reliable services, but as network complexity and dynamics increase, the systems become more difficult to troubleshoot. Vendors spend a lot of time and effort on early anomaly detection in their development cycle, and the majority of that time is spent manually analyzing system logs. While most anomaly detection research uses performance metrics, anomaly detection based on functional behaviour still lacks in-depth analysis. In this paper, we show how a boosted ensemble of Long Short-Term Memory (LSTM) classifiers can detect anomalies in 5G Radio Access Network system logs. Acquiring system logs from a live 5G network is difficult because of confidentiality issues, disturbance of the live network, and the difficulty of repeating scenarios. Therefore, we perform our evaluation on logs from a 5G test bed that simulates realistic traffic in a city. Our ensemble learns the functional behaviour of an application by training on logs from normal execution. It can then detect deviations from normal behaviour and can also be retrained on false positives found during validation. Anomaly detection in RAN shows that our ensemble, called BoostLog, outperforms a single LSTM classifier, and further testing on HDFS logs confirms that BoostLog can also be used in other domains. Instead of relying on domain experts to manually analyse system logs, less experienced troubleshooters can use BoostLog to detect anomalies automatically, faster, and more reliably.

Place, publisher, year, edition, pages
Springer, 2020
Series
IFIP Advances in Information and Communication Technology, ISSN 1868-4238, E-ISSN 1868-422X
Keywords
Anomaly detection, AdaBoost, LSTM, Radio Access Network (RAN), 5G, System logs, Functional area
National Category
Other Computer and Information Science
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-175008 (URN); 10.1007/978-3-030-49161-1_2 (DOI); 001289292800002 (); 2-s2.0-85086252683 (Scopus ID); 978-3-030-49160-4 (ISBN); 978-3-030-49161-1 (ISBN)
Conference
16th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2020, 5-7 June 2020, Neos Marmaras, Greece
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Available from: 2020-09-15; Created: 2020-09-15; Last updated: 2025-04-24; Bibliographically approved
2. Uncovering latency anomalies in 5G RAN - A combination learner approach
2022 (English) In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), IEEE, 2022, p. 621-629. Conference paper, Published paper (Refereed)
Abstract [en]

The fifth generation (5G) RAN is a complex system consisting of virtualized and distributed parts with many concurrent threads interacting. Several services are time critical, and it is important for RAN vendors to understand where and why abnormal delays occur. Research within the latency anomaly detection area mainly focuses on utilizing performance metrics to analyze end-to-end latency or to detect delays at certain interfaces. One challenge is that many delays might occur due to application overload, scheduling, software bugs, or conflicts within the applications, and in these cases it is important to get a detailed view of the entire system. To improve the ability to analyze large complex systems, such as RAN, we present LogLatency, a novel combination learner approach that utilizes smart instrumented system logs. LogLatency employs statistical, probabilistic, and clustering methods to automatically learn the functional behavior of RAN and identify where latency anomalies occur. In our evaluation, which used an advanced 5G testbed, we conclude that LogLatency is superior to existing techniques and provides the fine-grained view needed to improve 5G RAN in the future.
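
As a rough illustration of the statistical part of such an approach, the sketch below learns a per-transition latency baseline (median and median absolute deviation) from normal runs and flags transitions in a new run that exceed it. This is not LogLatency itself: the event names, the millisecond timestamps, the function names, and the threshold factor `k` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def transition_latencies(events):
    """Yield ((prev_event, next_event), delta) for consecutive log events.

    `events` is a time-sorted list of (timestamp_ms, event_name) tuples
    (a hypothetical, simplified log format).
    """
    for (t0, e0), (t1, e1) in zip(events, events[1:]):
        yield (e0, e1), t1 - t0


def learn_baseline(normal_runs):
    """Learn the median and MAD of the latency for each event transition."""
    samples = defaultdict(list)
    for run in normal_runs:
        for key, delta in transition_latencies(run):
            samples[key].append(delta)
    baseline = {}
    for key, values in samples.items():
        med = median(values)
        mad = median([abs(v - med) for v in values])
        baseline[key] = (med, mad)
    return baseline


def find_latency_anomalies(run, baseline, k=5.0, min_spread=1.0):
    """Return transitions whose latency exceeds median + k * MAD.

    `min_spread` avoids a zero threshold margin when all normal samples
    are identical. Transitions never seen in training are skipped here.
    """
    anomalies = []
    for key, delta in transition_latencies(run):
        if key not in baseline:
            continue
        med, mad = baseline[key]
        if delta > med + k * max(mad, min_spread):
            anomalies.append((key, delta))
    return anomalies


normal_runs = [
    [(0, "A"), (10, "B"), (30, "C")],
    [(0, "A"), (11, "B"), (30, "C")],
    [(0, "A"), (9, "B"), (31, "C")],
]
baseline = learn_baseline(normal_runs)
slow_run = [(0, "A"), (10, "B"), (200, "C")]
anomalies = find_latency_anomalies(slow_run, baseline)
```

Using the median and MAD rather than the mean and standard deviation keeps the baseline from being skewed by a few slow samples in the training data; a real system would also need to handle interleaved threads, which this sketch ignores.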

Place, publisher, year, edition, pages
IEEE, 2022
Series
International Conference on Communication Systems and Networks, ISSN 2155-2487, E-ISSN 2155-2509
Keywords
anomaly detection, clustering-based detection, latency, probabilistic detection, Radio Access Network (RAN), statistical detection, system logs
National Category
Computer Sciences
Research subject
computer and systems sciences
Identifiers
urn:nbn:se:umu:diva-192887 (URN); 10.1109/COMSNETS53615.2022.9668431 (DOI); 2-s2.0-85125166747 (Scopus ID); 978-1-6654-2104-1 (ISBN); 978-1-6654-2105-8 (ISBN)
Conference
14th International Conference on COMmunication Systems and NETworkS, COMSNETS 2022, Bangalore, January 4-8, 2022.
Funder
Knut and Alice Wallenberg Foundation
Available from: 2022-03-11; Created: 2022-03-11; Last updated: 2024-07-02; Bibliographically approved
3. Unsupervised root-cause identification of software bugs in 5G RAN
2022 (English) In: 2022 IEEE 19th Annual Consumer Communications & Networking Conference (CCNC): Proceedings, IEEE, 2022, p. 624-630. Conference paper, Published paper (Refereed)
Abstract [en]

Developers of complex systems like 5G Radio Access Networks (RAN) need algorithms that can automatically locate the root causes of software bugs. Existing methods mainly use supervised learning to track down root causes, and only a few of these provide enough information to identify the function in which a software bug occurs. Supervised learning methods work well when scenarios can be repeated and the normal behavior is somewhat similar. In RAN, where thousands of different configurations are used, software is updated frequently, and each node has its own traffic intensity, unsupervised learning that does not require any pre-training can be more suitable. The few existing methods that use unsupervised learning to locate the root cause of software bugs can only detect delays or software hangs, and are not able to identify the many types of bugs that occur in RAN. We propose a multi-step method that uses unsupervised learning to analyze kernel and user space traces in system logs. The methods can guide developers by suggesting top-k candidate functions that are likely to contain a software bug. Our methods, MultiSpace and CallGraph, were evaluated using an advanced 5G testbed in which many different software bugs that are common in RAN were injected. The results show that MultiSpace and CallGraph can detect a wider range of software bugs than previous methods and add an average CPU load of only 1.3% on the testbed. Another important aspect is that our methods scale well with the large amounts of data produced by real-time systems like RAN, and can analyze the data much faster.

Place, publisher, year, edition, pages
IEEE, 2022
Series
IEEE Consumer Communications and Networking Conference, ISSN 2331-9852, E-ISSN 2331-9860
Keywords
kernel space, Radio Access Network, root cause, software bug, system log, user space
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:umu:diva-198735 (URN); 10.1109/CCNC49033.2022.9700501 (DOI); 2-s2.0-85135731161 (Scopus ID); 9781665431620 (ISBN); 9781665431613 (ISBN)
Conference
19th IEEE Annual Consumer Communications and Networking Conference (CCNC 2022), Las Vegas, NV, USA, 08-11 January 2022
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP); Knut and Alice Wallenberg Foundation
Available from: 2022-08-22; Created: 2022-08-22; Last updated: 2024-07-02; Bibliographically approved
4. Robust procedural learning for anomaly detection and observability in 5G RAN
2024 (English) In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 21, no 2, p. 1432-1445. Article in journal (Refereed), Published
Abstract [en]

Most existing large distributed systems have poor observability and cannot use the full potential of machine learning-based behavior analysis. The system logs, which are the primary source of information, are unstructured and lack the context needed to track procedures and learn the system’s behavior. This work presents a new trace guideline that enables a component- and procedure-based split of the system logs for the future 5G Radio Access Network (RAN). As the system can be broken into smaller pieces, models can more accurately learn the system’s behavior and use the context to improve anomaly detection and observability. The evaluation result is astonishing: where previous state-of-the-art methods struggle to learn the behavior, a fast, dictionary-based algorithm can detect all anomalies and keep false positives close to zero. Troubleshooters can also identify anomalies more quickly and gain useful insights into the component interaction in RAN.
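
To make the idea of a dictionary-based detector on procedure-split logs concrete, here is a minimal sketch. It is not the algorithm from the paper: the record layout `(component, procedure, instance_id, event)`, the function names, and the choice to store whole event sequences are assumptions made for illustration.

```python
from collections import defaultdict


def group_procedures(records):
    """Split a flat log into per-procedure-instance event lists.

    Each record is assumed to be a (component, procedure, instance_id, event)
    tuple, i.e. the log already carries the context a trace guideline provides.
    """
    groups = defaultdict(list)
    for component, procedure, instance_id, event in records:
        groups[(component, procedure, instance_id)].append(event)
    return groups


def learn_dictionary(normal_records):
    """Remember every event sequence seen per (component, procedure)."""
    known = defaultdict(set)
    for (component, procedure, _), events in group_procedures(normal_records).items():
        known[(component, procedure)].add(tuple(events))
    return known


def detect_anomalies(records, known):
    """Flag procedure instances whose event sequence was never seen in training."""
    anomalies = []
    for key, events in group_procedures(records).items():
        component, procedure, _ = key
        if tuple(events) not in known.get((component, procedure), set()):
            anomalies.append(key)
    return anomalies


normal_log = [
    ("du", "attach", 1, "request"),
    ("cu", "attach", 1, "request"),
    ("du", "attach", 1, "setup"),
    ("cu", "attach", 1, "ack"),
    ("du", "attach", 1, "done"),
]
known = learn_dictionary(normal_log)

new_log = [
    ("du", "attach", 2, "request"),
    ("cu", "attach", 2, "request"),
    ("du", "attach", 2, "done"),  # "setup" is missing
    ("cu", "attach", 2, "ack"),
]
flagged = detect_anomalies(new_log, known)
```

Because each procedure instance is checked on its own, interleaving between components in the raw log does not affect the result, which is the benefit the component- and procedure-based split is meant to provide.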

Place, publisher, year, edition, pages
IEEE, 2024
Keywords
observability, trace guidelines, anomaly detection, Radio Access Network, 5G
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-206058 (URN); 10.1109/TNSM.2023.3321401 (DOI); 001205268100057 (); 2-s2.0-85174831102 (Scopus ID)
Funder
Knut and Alice Wallenberg Foundation
Note

Originally included in thesis in manuscript form.

Available from: 2023-03-27; Created: 2023-03-27; Last updated: 2025-04-24; Bibliographically approved
5. Bottleneck identification and failure prevention with procedural learning in 5G RAN
2023 (English) In: 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid) / [ed] Simmhan Y., Altintas I., Varbanescu A.-L., Balaji P., Prasad A.S., Carnevale L., IEEE, 2023, p. 426-436. Conference paper, Published paper (Refereed)
Abstract [en]

To meet the low-latency requirements of 5G Radio Access Networks (RAN), it is essential to learn where performance bottlenecks occur. As parts are distributed and virtualized, it becomes difficult to identify where unwanted delays occur. Today, vendors spend considerable manual effort analyzing key performance indicators (KPIs) and system logs to detect these bottlenecks. The 5G architecture allows flexible scaling of microservices to handle the variation in traffic, but knowing how, when, and where to scale is difficult without a detailed latency analysis. In this article, we propose a novel method that combines procedural learning with latency analysis of system log events. The method, which we call LogGenie, learns the latency pattern of the system under different load scenarios and automatically identifies the parts with the most significant increase in latency. Our evaluation in an advanced 5G testbed shows that LogGenie can provide a more detailed analysis than previous research has achieved and help troubleshooters locate bottlenecks faster. Finally, through experiments, we show how a latency prediction model can dynamically fine-tune the behavior where bottlenecks occur. This lowers resource utilization, makes the architecture more flexible, and allows the system to fulfill its latency requirements.
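
The core idea of comparing latency patterns across load scenarios can be sketched as follows. This is an illustrative simplification, not LogGenie: the step names, the sample values, and the use of a mean-latency ratio as the ranking score are assumptions.

```python
from statistics import mean


def rank_latency_increase(low_load, high_load):
    """Rank steps by how much their mean latency grows under high load.

    Both arguments map a step name to a list of latency samples (ms),
    a hypothetical format derived from system log timestamps.
    Returns (step, ratio) pairs, largest increase first.
    """
    ranking = []
    for step in low_load.keys() & high_load.keys():
        base = mean(low_load[step])
        loaded = mean(high_load[step])
        if base > 0:
            ranking.append((step, loaded / base))
    # Sort by descending ratio; break ties by step name for stable output.
    ranking.sort(key=lambda item: (-item[1], item[0]))
    return ranking


low_load = {"decode": [2, 2], "schedule": [4, 4], "forward": [1, 1]}
high_load = {"decode": [2, 4], "schedule": [20, 20], "forward": [1, 1]}
ranking = rank_latency_increase(low_load, high_load)
```

In this toy data the `schedule` step grows fivefold and would be reported first as the likely bottleneck, while `forward` is unaffected by load.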

Place, publisher, year, edition, pages
IEEE, 2023
Keywords
bottleneck detection, latency, RAN, failure prevention, 5G
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-205960 (URN); 10.1109/CCGrid57682.2023.00047 (DOI); 2-s2.0-85166323115 (Scopus ID); 979-8-3503-0119-9 (ISBN); 979-8-3503-0120-5 (ISBN)
Conference
23rd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid), Bangalore, India, May 1-4, 2023
Funder
Knut and Alice Wallenberg Foundation
Available from: 2023-03-24; Created: 2023-03-24; Last updated: 2024-07-02; Bibliographically approved

Open Access in DiVA

fulltext (4338 kB), 1427 downloads
File information
File name: FULLTEXT02.pdf; File size: 4338 kB; Checksum (SHA-512):
ccc02f7e878ea8e24cf8c4b7093212b2f7f98a7c47fc565a013d151743f0dbfc60acc91d5478ce4bad148298dba3768c577d312e673f2f58158fbb4edab59086
Type: fulltext; Mimetype: application/pdf
spikblad (130 kB), 96 downloads
File information
File name: SPIKBLAD01.pdf; File size: 130 kB; Checksum (SHA-512):
b532c1ff04b352c52be7c051074fcea8437ef2170798e7c7defe536edf6ecd282cbaae68c3846b09aafe635bf68a37e6636a6f924224efb278175cc9d4263eaf
Type: spikblad; Mimetype: application/pdf

Authority records

Sundqvist, Tobias
