A preliminary study on similarity-preserving digital book identifiers

Klemo Vladimir, Marin Silic, Nenad Romic, Goran Delac, Sinisa Srbljic

    Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review

    Abstract

    The proliferation of digital publishing has made e-book catalogs abundant but noisy and unstructured. Tools for the digital librarian rely on ISBNs, metadata embedded in digital files (for which there is no accepted standard), and cryptographic hash functions to identify coderivative or near-duplicate content. However, metadata is unreliable and cryptographic hashes are sensitive to even the smallest changes, which prevents efficient detection of coderivative or similar digital books. The study focuses on books that exist in many versions, which differ in the number of OCR errors and contain a number of sentence-length variations. Similar books are identified using small fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy and report standard precision and recall measurements.
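
The contrast the abstract draws between cryptographic hashes and similarity-preserving fingerprints can be illustrated with a minimal sketch. It uses word-shingle MinHash, which is a stand-in chosen for illustration and not necessarily the fingerprinting scheme evaluated in the paper. The function names, the shingle size `k`, the fingerprint length `num_hashes`, and the sample texts are all assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased, whitespace-split text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_fingerprint(text, num_hashes=64, k=3):
    """Illustrative MinHash fingerprint (not the paper's method).

    For each seed, hash every shingle with a seed-prefixed SHA-256 and keep
    the smallest 64-bit value; the list of minima is the fingerprint.
    """
    shingle_set = shingles(text, k)
    fingerprint = []
    for seed in range(num_hashes):
        fingerprint.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set))
    return fingerprint


def similarity(fp_a, fp_b):
    """Fraction of matching positions; estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


# Hypothetical sample texts: an original, a copy with one OCR-style error, and an unrelated text.
original = ("due to proliferation of digital publishing e-book catalogs are abundant "
            "but noisy and unstructured and tools for the digital librarian rely on "
            "metadata and cryptographic hash functions for the identification of "
            "near-duplicate content")
ocr_variant = original.replace("abundant", "abundanl")
unrelated = ("call me ishmael some years ago never mind how long precisely having "
             "little or no money in my purse i thought i would sail about a little")

# A cryptographic hash changes completely after a one-character edit.
sha_original = hashlib.sha256(original.encode("utf-8")).hexdigest()
sha_variant = hashlib.sha256(ocr_variant.encode("utf-8")).hexdigest()

# The fingerprints stay comparable: high for the OCR variant, low for the unrelated text.
sim_variant = similarity(minhash_fingerprint(original), minhash_fingerprint(ocr_variant))
sim_unrelated = similarity(minhash_fingerprint(original), minhash_fingerprint(unrelated))
```

In this sketch the two SHA-256 digests differ even though the texts differ by a single character, while the fingerprint similarity remains high for the OCR variant and close to zero for the unrelated text. Thresholding such a similarity score is one way the precision and recall of near-duplicate detection could be measured on a labeled dataset.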
    Original language: English
    Title of host publication: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
    Publisher: Association for Computational Linguistics
    Pages: 78-83
    Number of pages: 6
    Publication status: Published - Jul 2015
    Event: 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - Beijing, China
    Duration: 30 Jul 2015 → 30 Jul 2015
    https://www.aclweb.org/portal/content/9th-sighum-workshop-language-technology-cultural-heritage-social-sciences-and-humanities

    Conference

    Conference: 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
    Abbreviated title: LaTeCH 2015
    Country/Territory: China
    City: Beijing
    Period: 30/07/15 → 30/07/15