A preliminary study on similarity-preserving digital book identifiers

Klemo Vladimir, Marin Silic, Nenad Romic, Goran Delac, Sinisa Srbljic

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without an accepted standard), and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in a certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.
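
The contrast the abstract draws between cryptographic hashes and similarity-preserving fingerprints can be illustrated with a small sketch. This is not the authors' method: it assumes a generic SimHash-style fingerprint over character 5-grams, and the function names (`shingles`, `simhash`, `hamming`), the 64-bit size, and the sample sentences are all hypothetical choices made for illustration.

```python
import hashlib


def shingles(text, n=5):
    """Overlapping character n-grams of the lowercased, whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return {norm[i:i + n] for i in range(max(1, len(norm) - n + 1))}


def simhash(text, bits=64):
    """SimHash-style fingerprint: each shingle votes +1/-1 on every bit position."""
    counts = [0] * bits
    for sh in shingles(text):
        digest = hashlib.blake2b(sh.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A fingerprint bit is set when the majority of shingles voted for it.
    return sum(1 << b for b in range(bits) if counts[b] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Hypothetical sample texts: a clean passage, the same passage with two
# OCR-like character errors, and an unrelated passage.
clean = ("It was the best of times, it was the worst of times, it was the age "
         "of wisdom, it was the age of foolishness, it was the epoch of belief, "
         "it was the epoch of incredulity, it was the season of Light.")
noisy = clean.replace("wisdom", "wisd0m").replace("belief", "beIief")
unrelated = ("Call me Ishmael. Some years ago, never mind how long precisely, "
             "having little or no money in my purse, and nothing particular to "
             "interest me on shore, I thought I would sail about a little.")

# A cryptographic hash gives no hint that the two versions are related.
sha_clean = hashlib.sha256(clean.encode("utf-8")).hexdigest()
sha_noisy = hashlib.sha256(noisy.encode("utf-8")).hexdigest()
print("SHA-256 equal:", sha_clean == sha_noisy)

# The similarity-preserving fingerprints are expected to stay close for the
# noisy copy and far apart for the unrelated text.
print("Hamming(clean, noisy):    ", hamming(simhash(clean), simhash(noisy)))
print("Hamming(clean, unrelated):", hamming(simhash(clean), simhash(unrelated)))
```

Under these assumptions a fingerprint is a single 64-bit integer, so it is cheap to share, and two books are compared by counting differing bits rather than by testing for exact equality. Any distance threshold used to declare two books similar would need to be tuned on labeled data, for example by measuring precision and recall as the abstract describes.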
Original language: English
Title of host publication: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
Publisher: Association for Computational Linguistics
Pages: 78-83
Number of pages: 6
DOIs: https://doi.org/10.18653/v1/W15-3712
Publication status: Published - Jul 2015
Event: 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - Beijing, China
Duration: 30 Jul 2015 → 30 Jul 2015
https://www.aclweb.org/portal/content/9th-sighum-workshop-language-technology-cultural-heritage-social-sciences-and-humanities

Conference

Conference: 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
Abbreviated title: LaTeCH 2015
Country: China
City: Beijing
Period: 30/07/15 → 30/07/15

Fingerprint

Metadata
Optical character recognition
Hash functions

Cite this

Vladimir, K., Silic, M., Romic, N., Delac, G., & Srbljic, S. (2015). A preliminary study on similarity-preserving digital book identifiers. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) (pp. 78-83). Association for Computational Linguistics. https://doi.org/10.18653/v1/W15-3712

A preliminary study on similarity-preserving digital book identifiers. / Vladimir, Klemo; Silic, Marin; Romic, Nenad; Delac, Goran; Srbljic, Sinisa.

Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Association for Computational Linguistics, 2015. p. 78-83.


Vladimir, K, Silic, M, Romic, N, Delac, G & Srbljic, S 2015, A preliminary study on similarity-preserving digital book identifiers. in Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Association for Computational Linguistics, pp. 78-83, 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Beijing, China, 30/07/15. https://doi.org/10.18653/v1/W15-3712
Vladimir K, Silic M, Romic N, Delac G, Srbljic S. A preliminary study on similarity-preserving digital book identifiers. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Association for Computational Linguistics. 2015. p. 78-83 https://doi.org/10.18653/v1/W15-3712
Vladimir, Klemo ; Silic, Marin ; Romic, Nenad ; Delac, Goran ; Srbljic, Sinisa. / A preliminary study on similarity-preserving digital book identifiers. Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Association for Computational Linguistics, 2015. pp. 78-83
@inbook{0236db8f3fb1489f827e5bbd329f2c59,
title = "A preliminary study on similarity-preserving digital book identifiers",
abstract = "Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without an accepted standard), and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in a certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.",
author = "Klemo Vladimir and Marin Silic and Nenad Romic and Goran Delac and Sinisa Srbljic",
year = "2015",
month = "7",
doi = "10.18653/v1/W15-3712",
language = "English",
pages = "78--83",
booktitle = "Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)",
publisher = "Association for Computational Linguistics",

}

TY - CHAP

T1 - A preliminary study on similarity-preserving digital book identifiers

AU - Vladimir, Klemo

AU - Silic, Marin

AU - Romic, Nenad

AU - Delac, Goran

AU - Srbljic, Sinisa

PY - 2015/7

Y1 - 2015/7

N2 - Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without an accepted standard), and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in a certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.

AB - Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without an accepted standard), and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in a certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.

U2 - 10.18653/v1/W15-3712

DO - 10.18653/v1/W15-3712

M3 - Chapter

SP - 78

EP - 83

BT - Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

PB - Association for Computational Linguistics

ER -