Abstract
Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded in digital files (for which there is no accepted standard), and cryptographic hash functions to identify coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books with many versions that differ in the number of OCR errors and contain a number of sentence-length variations. Similar books are identified using small fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy and report standard precision and recall measurements.
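
The contrast the abstract draws between cryptographic hashing and similarity fingerprinting can be illustrated with a minimal sketch. It is not the fingerprinting method of the paper: the character 5-gram shingles, the Jaccard measure, the helper names (`sha256_hex`, `shingles`, `jaccard`) and the sample sentence with its OCR-style error are all assumptions chosen for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Cryptographic digest: any single-character change yields a different value."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 5) -> set:
    """Set of overlapping character k-grams (an assumed, simple stand-in for a fingerprint)."""
    normalized = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return {normalized[i:i + k] for i in range(max(1, len(normalized) - k + 1))}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets, a value in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "It was the best of times, it was the worst of times."
# Hypothetical OCR error: the letter "m" misread as "rn".
ocr_copy = "It was the best of tirnes, it was the worst of times."

hashes_match = sha256_hex(original) == sha256_hex(ocr_copy)
similarity = jaccard(shingles(original), shingles(ocr_copy))

print("SHA-256 digests match:", hashes_match)
print("Shingle Jaccard similarity:", round(similarity, 2))
```

Under these assumptions the two digests differ although the texts are almost identical, while the shingle overlap stays high, which is the behaviour a near-duplicate detector needs. A compact fingerprint could then be derived from the shingle set, for example by keeping only a small sample of hashed shingles, though that step is left out of the sketch.
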
Original language | English |
---|---|
Title of host publication | Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) |
Publisher | Association for Computational Linguistics |
Pages | 78-83 |
Number of pages | 6 |
Publication status | Published - Jul 2015 |
Event | 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Beijing, China; Duration: 30 Jul 2015 → 30 Jul 2015; https://www.aclweb.org/portal/content/9th-sighum-workshop-language-technology-cultural-heritage-social-sciences-and-humanities |
Conference
Conference | 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities |
---|---|
Abbreviated title | LaTeCH 2015 |
Country/Territory | China |
City | Beijing |
Period | 30/07/15 → 30/07/15 |