Cultural heritage as digital noise: nineteenth century newspapers in the digital archive
Umeå University, Faculty of Arts, Department of Culture and Media Studies. ORCID iD: 0000-0002-1167-046X
Umeå University, Faculty of Arts, Department of Culture and Media Studies.
2017 (English). In: Journal of Documentation, ISSN 0022-0418, E-ISSN 1758-7379, Vol. 73, no. 6, pp. 1228-1243. Article in journal (Refereed). Published
Abstract [en]

Purpose

The purpose of this paper is to explore and analyze the digitized newspaper collection at the National Library of Sweden, focusing on cultural heritage as digital noise. In what specific ways are newspapers transformed in the digitization process? If the digitized document is not the same as the source document – is it still a historical record, or is it transformed into something else?

Design/methodology/approach

The authors have analyzed the XML files from Aftonbladet 1830 to 1862. The most frequent newspaper words not matching a high-quality reference corpus were selected to zoom in on the noisiest part of the paper. The variety of the interpretations generated by optical character recognition (OCR) was examined, as well as texts generated by auto-segmentation. The authors have also made a limited ethnographic study of the digitization process.
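
The word-selection step described above can be illustrated with a minimal sketch. It is not the authors' actual pipeline: it assumes the OCR text has already been extracted from the XML files into a plain string, and the function name `noisy_word_counts`, the tokenization rule, and the sample data are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of Unicode letters; digits and
# punctuation are treated as separators.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def noisy_word_counts(ocr_text, reference_words, top_n=10):
    """Return the most frequent OCR tokens that are absent from a reference word list.

    ocr_text: plain text produced by OCR (hypothetical input).
    reference_words: iterable of words from a trusted reference corpus.
    """
    reference = {word.lower() for word in reference_words}
    tokens = (match.group(0).lower() for match in TOKEN_RE.finditer(ocr_text))
    counts = Counter(token for token in tokens if token not in reference)
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Invented sample: two OCR misreadings of "Stockholm".
    sample = "Stockholm den 3 Maj. Stoekholm och Stockbolm, Stoekholm igen."
    lexicon = ["Stockholm", "den", "Maj", "och", "igen"]
    print(noisy_word_counts(sample, lexicon))
```

Ranking unmatched tokens by frequency surfaces systematic misreadings rather than one-off errors, which is the sense in which such a list points to the noisiest parts of a collection.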

Findings

The research shows that the digital collection of Aftonbladet contains extreme amounts of noise: millions of misinterpreted words generated by OCR, and millions of texts re-edited by the auto-segmentation tool. How the tools work is mostly unknown to the staff involved in the digitization process. Sticking to any idea of a provenance chain is hence impossible, since many steps have been outsourced to unknown factors affecting the source document.

Originality/value

The detailed examination of digitally transformed newspapers is valuable to scholars depending on newspaper databases in their research. The paper also highlights the fact that libraries outsourcing digitization processes run the risk of losing control over the quality of their collections.

Place, publisher, year, edition, pages
2017. Vol. 73, no. 6, pp. 1228-1243
Keywords [en]
Sweden, Archives, Documents, Accuracy, Print media, Auto-segmentation, Character recognition equipment, Large-scale digitization
National subject category
Cultural Studies; Language Technology (Computational Linguistics)
Research subject
media and communication studies
Identifiers
URN: urn:nbn:se:umu:diva-140302
DOI: 10.1108/JD-09-2016-0106
ISI: 000412873700008
OAI: oai:DiVA.org:umu-140302
DiVA, id: diva2:1146975
Available from: 2017-10-04 Created: 2017-10-04 Last updated: 2018-06-09 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authors

Jarlbrink, Johan; Snickars, Pelle
