Synergies between Chemometrics and Machine Learning
Umeå University, Faculty of Science and Technology, Department of Chemistry.
ORCID iD: 0000-0001-7881-0968
2021 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Synergier mellan kemometri och maskininlärning (Swedish)
Abstract [en]

Thanks to digitization and automation, data in all shapes and forms are generated in ever-growing quantities throughout society, industry and science. Data-driven methods, such as machine learning algorithms, are already widely used to benefit from all these data in all kinds of applications, ranging from text suggestion in smartphones to process monitoring in industry. To ensure maximal benefit to society, we need workflows to generate, analyze and model data that are performant as well as robust and trustworthy.

There are several scientific disciplines aiming to develop data-driven methodologies, two of which are machine learning and chemometrics. Machine learning is part of artificial intelligence and develops algorithms that learn from data. Chemometrics, on the other hand, is a subfield of chemistry aiming to generate and analyze complex chemical data in an optimal manner. There is already a certain overlap between the two fields, where machine learning algorithms are used for predictive modelling within chemometrics. However, since both fields aim to increase the value of data and have disparate backgrounds, there are plenty of possible synergies to benefit both fields. Thanks to the wide applicability of machine learning, there are many tools and lessons learned within that field that go beyond the predictive models used within chemometrics today. On the other hand, chemometrics has always been application-oriented, and this pragmatism has made it widely used for quality assurance within regulated industries.

This thesis serves to nuance the relationship between the two fields and show that knowledge in either field can be used to benefit the other. We explore how tools widely used in applied machine learning can help chemometrics break new ground in a case study of text analysis of patents in Paper I. We then draw inspiration from chemometrics and show how principles of experimental design can help us optimize large-scale data processing pipelines in Paper II, and how a method common in chemometrics can be adapted to allow artificial neural networks to detect outlier observations in Paper III. We then show how experimental design principles can be used to ensure quality at the core of contemporary machine learning, namely generation of large-scale datasets, in Paper IV. Lastly, we outline directions for future research and how state-of-the-art research in machine learning can benefit chemometric method development.

Abstract [sv, translated to English]

Thanks to digitalization and automation, growing amounts of data in every conceivable form are generated throughout society, industry and academia. To make the best use of these data, so-called data-driven methods, such as machine learning, are already used in a multitude of applications, ranging from next-word suggestions in text messages on smartphones to process monitoring in industry. To maximize the societal benefit of the data generated, we need robust and reliable workflows to create, analyze and model data for all conceivable applications.

Many scientific fields develop and use data-driven methods, two of which are machine learning and chemometrics. Machine learning falls within what is called artificial intelligence and develops algorithms that learn from data. Chemometrics, by contrast, has its origins in chemistry and develops methods to generate, analyze and maximize the value of complex chemical data. There is some overlap between the fields, in that machine learning algorithms are used extensively for predictive modelling within chemometrics. Because both fields seek to increase the value of data and have widely different backgrounds, there are many potential synergies. Because machine learning is so widely used, it offers many tools and lessons beyond the predictive models already used in chemometrics. Chemometrics, on the other hand, has always been oriented towards practical application, and thanks to this pragmatism it is today well established in quality work within regulated industry.

This thesis aims to nuance the relationship between chemometrics and machine learning and to show that lessons from each field can benefit the other. We show how tools common in machine learning can help chemometrics break new ground in a case study on text analysis of patent collections in Paper I. We then borrow from chemometrics and show how experimental design can be used to optimize large-scale computational workflows in Paper II, and how a common chemometric method can be reformulated to detect outlying observations in artificial neural networks in Paper III. After that, we show how principles from experimental design can be used to ensure quality at the core of modern machine learning, namely the creation of large datasets, in Paper IV. Finally, we suggest directions for future research and how the latest methods in machine learning can benefit method development in chemometrics.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2021, p. 50
Keywords [en]
computational science, machine learning, chemometrics, multivariate data analysis, design of experiments, data science
Keywords [sv]
beräkningsvetenskap, maskininlärning, kemometri, multivariat dataanalys, experimentdesign
National Category
Other Chemistry Topics; Bioinformatics and Computational Biology; Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-182683
ISBN: 978-91-7855-558-1 (print)
ISBN: 978-91-7855-559-8 (electronic)
OAI: oai:DiVA.org:umu-182683
DiVA, id: diva2:1548580
Public defence
2021-05-28, KBC Glasburen, KBC - building, Umeå, 09:00 (English)
Opponent
Supervisors
Funder
eSSENCE - An eScience Collaboration; The Swedish Foundation for International Cooperation in Research and Higher Education (STINT); Swedish Research Council, 2016‐04376
Available from: 2021-05-07 Created: 2021-05-03 Last updated: 2025-02-05 Bibliographically approved
List of papers
1. Multivariate patent analysis: using chemometrics to analyze collections of chemical and pharmaceutical patents
2020 (English) In: Journal of Chemometrics, ISSN 0886-9383, E-ISSN 1099-128X, Vol. 34, no 1, article id e3041. Article in journal (Refereed) Published
Abstract [en]

Patents are an important source of technological knowledge, but the number of existing patents is vast and quickly growing. This makes the development of tools and methodologies for quickly revealing patterns in patent collections important. In this paper, we describe how structured chemometric principles of multivariate data analysis can be applied in the context of text analysis in a novel combination with common machine learning preprocessing methodologies. We demonstrate our methodology in two case studies. Using principal component analysis (PCA) on a collection of 12338 patent abstracts from 25 companies in big pharma revealed sub-fields in which the companies are active. Using PCA on a smaller collection of patents retrieved by searching for a specific term proved useful to quickly understand how patent classifications relate to the search term. By using orthogonal projections to latent structures (O-PLS) on patent classification schemes, we were able to separate patents on a more detailed level than using PCA. Lastly, we performed multi-block modeling using OnPLS on bag-of-words representations of abstracts, claims, and detailed descriptions, respectively, showing that semantic variation relating to patent classification is consistent across multiple text blocks, represented as globally joint variation. We conclude that using machine learning to transform unstructured data into structured data provides a good preprocessing tool for subsequent chemometric multivariate data analysis, and yields an easily interpretable and novel workflow to understand large collections of patents. We demonstrate this on collections of chemical and pharmaceutical patents.
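
The workflow described in this abstract (bag-of-words preprocessing followed by PCA) can be illustrated with a minimal sketch. This is not the authors' code: the four toy "abstracts", the whitespace tokenizer and the helper names `bag_of_words` and `pca` are assumptions made for illustration, and PCA is computed with a plain SVD in NumPy rather than with any chemometrics package.

```python
import numpy as np

# Toy stand-ins for patent abstracts (hypothetical text, not from the paper).
docs = [
    "kinase inhibitor compound for cancer treatment",
    "kinase inhibitor compound for tumor treatment",
    "polymer coating composition for tablet",
    "polymer coating composition for capsule",
]

def bag_of_words(texts):
    """Return (count matrix, vocabulary) for whitespace-tokenized texts."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            X[i, index[w]] += 1
    return X, vocab

def pca(X, n_components=2):
    """PCA via SVD of the column-centered matrix; returns scores and loadings."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    loadings = Vt[:n_components].T
    return scores, loadings

X, vocab = bag_of_words(docs)
scores, loadings = pca(X)
```

In this toy case the first score vector places the two kinase documents on one side and the two coating documents on the other, and the corresponding loadings indicate which words drive the separation. Real patent collections would need more careful tokenization and scaling than this sketch applies.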

Place, publisher, year, edition, pages
John Wiley & Sons, 2020
Keywords
text analytics, OnPLS, principal component analysis, orthogonal projections to latent structures, feature engineering
National Category
Other Chemistry Topics
Identifiers
urn:nbn:se:umu:diva-152511 (URN); 10.1002/cem.3041 (DOI); 000509318600011 (); 2-s2.0-85046797919 (Scopus ID)
Funder
Swedish Research Council, 2016‐04376; eSSENCE - An eScience Collaboration; The Swedish Foundation for International Cooperation in Research and Higher Education (STINT)
Available from: 2018-10-09 Created: 2018-10-09 Last updated: 2023-03-24 Bibliographically approved
2. doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows
2019 (English) In: BMC Bioinformatics, E-ISSN 1471-2105, Vol. 20, no 1, article id 498. Article in journal (Refereed) Published
Abstract [en]

Background: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial and error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed.

Results: We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and OLS modeling. Doepipeline was used to optimize parameters in four use cases: 1) de-novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome, with respect to the characteristic measured, than the default values. Our approach is implemented and available in the Python package doepipeline.

Conclusions: Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.
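
The optimization phase described above (a response surface design followed by OLS modeling) can be sketched in a few lines. This sketch does not use the doepipeline API and does not reproduce its subset-design screening phase. It assumes a hypothetical two-parameter pipeline, `run_pipeline`, with a made-up quadratic response, and it swaps in a three-level full factorial design as a simple response surface design.

```python
import itertools
import numpy as np

def run_pipeline(x1, x2):
    """Hypothetical stand-in for a pipeline run; returns a quality score.
    The peak is placed at (0.3, -0.2) in coded units purely for illustration."""
    return 10.0 - 2.0 * (x1 - 0.3) ** 2 - 1.0 * (x2 + 0.2) ** 2

# Three-level full factorial design in coded units (-1, 0, +1): 9 runs.
design = np.array(list(itertools.product([-1.0, 0.0, 1.0], repeat=2)))
y = np.array([run_pipeline(x1, x2) for x1, x2 in design])

def quadratic_terms(D):
    """Model matrix: intercept, main effects, interaction, squared terms."""
    x1, x2 = D[:, 0], D[:, 1]
    return np.column_stack([np.ones(len(D)), x1, x2, x1 * x2, x1**2, x2**2])

# Ordinary least squares fit of the quadratic response surface.
beta, *_ = np.linalg.lstsq(quadratic_terms(design), y, rcond=None)
b0, b1, b2, b12, b11, b22 = beta

# Stationary point of the fitted surface: set the gradient to zero and solve.
H = np.array([[2 * b11, b12], [b12, 2 * b22]])
optimum = np.linalg.solve(H, -np.array([b1, b2]))
```

Because the made-up response is itself quadratic, the fitted stationary point coincides with the true peak. With a real pipeline the fit is only a local approximation, so the predicted optimum would normally be confirmed with an additional run.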

Place, publisher, year, edition, pages
BioMed Central, 2019
Keywords
Design of Experiments, Optimization, Sequencing, Nanopore, MinION, Assembly, Classification, Scaffolding, Variant calling
National Category
Bioinformatics and Computational Biology; Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:umu:diva-164986 (URN); 10.1186/s12859-019-3091-z (DOI); 000490501600003 (); 31615395 (PubMedID); 2-s2.0-85073473946 (Scopus ID)
Funder
Knut and Alice Wallenberg Foundation, 2011.0042; Swedish Research Council, 2016-04376; eSSENCE - An eScience Collaboration; Swedish Armed Forces
Available from: 2019-11-11 Created: 2019-11-11 Last updated: 2025-02-05 Bibliographically approved
3. Out-of-Distribution Example Detection in Deep Neural Networks using Distance to Modelled Embedding
2021 (English) Manuscript (preprint) (Other academic)
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-182680 (URN)
Available from: 2021-05-03 Created: 2021-05-03 Last updated: 2021-05-04
4. LIVECell: a large-scale dataset for label-free live cell segmentation
2021 (English) In: Nature Methods, ISSN 1548-7091, E-ISSN 1548-7105, Vol. 18, no 9, p. 1038-1045. Article in journal (Other academic) Published
Abstract [en]

Light microscopy combined with well-established protocols of two-dimensional cell culture facilitates high-throughput quantitative imaging to study biological phenomena. Accurate segmentation of individual cells in images enables exploration of complex biological questions, but can require sophisticated image processing pipelines in cases of low contrast and high object density. Deep learning-based methods are considered state-of-the-art for image segmentation but typically require vast amounts of annotated data, for which there is no suitable resource available in the field of label-free cellular imaging. Here, we present LIVECell, a large, high-quality, manually annotated and expert-validated dataset of phase-contrast images, consisting of over 1.6 million cells from a diverse set of cell morphologies and culture densities. To further demonstrate its use, we train convolutional neural network-based models using LIVECell and evaluate model segmentation accuracy with a proposed suite of benchmarks.

Place, publisher, year, edition, pages
Nature Publishing Group, 2021
National Category
Medical Imaging
Identifiers
urn:nbn:se:umu:diva-182681 (URN); 10.1038/s41592-021-01249-6 (DOI); 000691220800001 (); 34462594 (PubMedID); 2-s2.0-85113983609 (Scopus ID)
Note

Previously included in thesis in manuscript form. 

Available from: 2021-05-03 Created: 2021-05-03 Last updated: 2025-02-09 Bibliographically approved

Open Access in DiVA

fulltext (940 kB), 2543 downloads
File information
File name: FULLTEXT01.pdf; File size: 940 kB; Checksum (SHA-512):
04239109cf3d780022c90107cd6dabc44de4078a90a3d22ce4b90c306231ee852b9fc357a37404b675c9eaa002caae52a5e4d210ad570342da25473e0a5da24e
Type: fulltext; Mimetype: application/pdf
spikblad (129 kB), 114 downloads
File information
File name: SPIKBLAD01.pdf; File size: 129 kB; Checksum (SHA-512):
c924d4f29bb635ccf6d983c111d6ae4ab9282a4938ca095e292ecd1c116d6ebcf473bff883354fb055c0671aa4e6207cd7c47e703515d10c6eb73c70775abfed
Type: spikblad; Mimetype: application/pdf

Authority records

Sjögren, Rickard
