Multivariate patent analysis: using chemometrics to analyze collections of chemical and pharmaceutical patents
Umeå University, Faculty of Science and Technology, Department of Chemistry.
Umeå University, Faculty of Science and Technology, Department of Chemistry.
Umeå University, Faculty of Science and Technology, Department of Chemistry.
Umeå University, Faculty of Science and Technology, Department of Chemistry. Sartorius Stedim Data Analytics, Umeå, Sweden. ORCID iD: 0000-0003-3799-6094
2018 (English) In: Journal of Chemometrics, ISSN 0886-9383, E-ISSN 1099-128X, article id e3041
Article in journal (Refereed) Published
Abstract [en]

Patents are an important source of technological knowledge, but the number of existing patents is vast and growing quickly. This makes it important to develop tools and methodologies for quickly revealing patterns in patent collections. In this paper, we describe how structured chemometric principles of multivariate data analysis can be applied to text analysis in a novel combination with common machine learning preprocessing methodologies. We demonstrate our methodology in two case studies. Using principal component analysis (PCA) on a collection of 12338 patent abstracts from 25 big pharma companies revealed the sub-fields in which the companies are active. Using PCA on a smaller collection of patents retrieved by searching for a specific term proved useful for quickly understanding how patent classifications relate to the search term. By using orthogonal projections to latent structures (O-PLS) on patent classification schemes, we were able to separate patents at a more detailed level than with PCA. Lastly, we performed multi-block modeling using OnPLS on bag-of-words representations of abstracts, claims, and detailed descriptions, respectively, showing that semantic variation relating to patent classification is consistent across multiple text blocks, represented as globally joint variation. We conclude that using machine learning to transform unstructured data into structured data provides a good preprocessing tool for subsequent chemometric multivariate data analysis and provides an easily interpretable and novel workflow for understanding large collections of patents. We demonstrate this on collections of chemical and pharmaceutical patents.
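
As a rough illustration of the bag-of-words plus PCA step mentioned in the abstract, the following minimal sketch uses only the Python standard library. It is not the authors' implementation: the four toy "abstracts", the function names (`bag_of_words`, `center_columns`, `first_principal_component`), and the use of power iteration to obtain the first component are all assumptions made for this example.

```python
from collections import Counter
import math

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[float(c[term]) for term in vocab] for c in counts]
    return vocab, matrix

def center_columns(matrix):
    """Subtract the column mean from every entry (mean-centering)."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def first_principal_component(matrix, n_iter=200):
    """Power iteration on X^T X; returns (loadings, scores) for PC1."""
    n_features = len(matrix[0])
    # Arbitrary deterministic starting vector (an assumption of this sketch).
    p = [1.0 + j for j in range(n_features)]
    for _ in range(n_iter):
        t = [sum(x * w for x, w in zip(row, p)) for row in matrix]  # t = X p
        p = [sum(row[j] * t_i for row, t_i in zip(matrix, t))
             for j in range(n_features)]                            # p = X^T t
        norm = math.sqrt(sum(w * w for w in p))
        p = [w / norm for w in p]
    scores = [sum(x * w for x, w in zip(row, p)) for row in matrix]
    return p, scores

# Hypothetical toy "abstracts": two biomedical, two materials-related.
docs = [
    "antibody binding protein cancer",
    "antibody protein tumor cancer",
    "polymer coating resin surface",
    "polymer resin coating film",
]

vocab, X = bag_of_words(docs)
loadings, scores = first_principal_component(center_columns(X))

for doc, score in zip(docs, scores):
    print(f"{score:+.3f}  {doc}")
```

In this toy setting, the first-component scores place the two biomedical texts on one side of zero and the two materials texts on the other; the sign of the component itself is arbitrary. Real patent collections would need proper tokenization, term weighting, and a numerical library rather than hand-written power iteration.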

Place, publisher, year, edition, pages
John Wiley & Sons, 2018. article id e3041
Keywords [en]
text analytics, OnPLS, principal component analysis, orthogonal projections to latent structures, feature engineering
National Category
Other Chemistry Topics
Identifiers
URN: urn:nbn:se:umu:diva-152511
DOI: 10.1002/cem.3041
OAI: oai:DiVA.org:umu-152511
DiVA, id: diva2:1254377
Conference
2018/10/09
Funder
Swedish Research Council, 2016‐04376
eSSENCE - An eScience Collaboration
The Swedish Foundation for International Cooperation in Research and Higher Education (STINT)
Available from: 2018-10-09 Created: 2018-10-09 Last updated: 2018-11-06 Bibliographically approved

Authority records
Sjögren, Rickard; Skotare, Tomas; Trygg, Johan
