Employing effective plagiarism detection methods are seen to be essential in the next generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on using some statistical properties of the most common words, and the Latent Semantic Analysis that is applied to extract the most common words usage patterns. This method aims to generate a model of author’s “style” by revealing a set of certain features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author’s style in a definitive and clear-cut manner. The feature set of the intrinsic model were based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using the leave-one-out-cross-validation method on the CEN (Corpus of English Novel) data set. Results have indicated that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results have also shown that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, by accurately predicting the author classes with an overall accuracy of 97%.
Bibliographical noteNOTICE: this is the author’s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, , (2019) DOI: 10.1016/j.future.2017.11.023
© 2017, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
ASJC Scopus subject areas
- Hardware and Architecture
- Computer Networks and Communications
Alsallal, M., Iqbal, R., Palade, V., Amin, S., & Chang, V. (2019). An integrated approach for intrinsic plagiarism detection. Future Generation Computer Systems, 96, 700-712 . https://doi.org/10.1016/j.future.2017.11.023