An integrated approach for intrinsic plagiarism detection

Muna Alsallal, Rahat Iqbal, Vasile Palade, Saad Amin, Victor Chang

Research output: Contribution to journal, Article

Abstract

Effective plagiarism detection methods are seen as essential for the next-generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author’s “style” by revealing a set of features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author’s style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using leave-one-out cross-validation on the CEN (Corpus of English Novels) data set. Results indicate that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results also show that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, predicting the author classes with an overall accuracy of 97%.
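The stylometric part of the feature set described in the abstract (most common words, their relative frequencies per book, and the deviation of those frequencies across one author's books) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `top_k` cutoff, the tokenisation rule, and the use of the population standard deviation as the "deviation" are all assumptions made for illustration, and the Latent Semantic Analysis and classifier stages are not covered.

```python
import re
from collections import Counter
from statistics import pstdev


def tokenize(text):
    """Lowercase a text and split it into simple word tokens (assumed rule)."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(books, top_k=3):
    """Return (vocab, per-book relative frequencies, per-word deviation).

    books: list of strings, all assumed to be written by one author.
    top_k: hypothetical cutoff for what counts as a "most common word".
    """
    token_lists = [tokenize(book) for book in books]

    # Count every word across all books of the author.
    totals = Counter(tok for toks in token_lists for tok in toks)

    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:top_k]]

    # Relative frequency of each common word within each book.
    rel_freqs = []
    for toks in token_lists:
        counts = Counter(toks)
        n = len(toks) or 1  # avoid division by zero for an empty book
        rel_freqs.append([counts[word] / n for word in vocab])

    # Spread of each word's relative frequency across the books
    # (population standard deviation, chosen here as an assumption).
    deviation = [pstdev(column) for column in zip(*rel_freqs)]
    return vocab, rel_freqs, deviation


if __name__ == "__main__":
    sample_books = ["the cat and the dog", "the bird and a fish"]
    vocab, rel_freqs, deviation = style_features(sample_books, top_k=2)
    print(vocab)
    print(rel_freqs)
    print(deviation)
```

A word whose relative frequency varies little across an author's books yields a small deviation, which is the kind of stable signal a per-author style model could use. A text that departs strongly from these values would then be a candidate for closer inspection.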
Original language: English
Pages (from-to): 700-712
Number of pages: 13
Journal: Future Generation Computer Systems
Volume: 96
Early online date: 16 Dec 2017
DOI: 10.1016/j.future.2017.11.023
Publication status: Published - Jul 2019

Fingerprint

Semantics
Bayesian networks
Multilayer neural networks
Support vector machines

Bibliographical note

NOTICE: this is the author’s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, [96], (2019) DOI: 10.1016/j.future.2017.11.023

© 2017, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

An integrated approach for intrinsic plagiarism detection. / Alsallal, Muna; Iqbal, Rahat; Palade, Vasile; Amin, Saad; Chang, Victor. In: Future Generation Computer Systems, Vol. 96, 07.2019, p. 700-712.

Alsallal, Muna; Iqbal, Rahat; Palade, Vasile; Amin, Saad; Chang, Victor. / An integrated approach for intrinsic plagiarism detection. In: Future Generation Computer Systems. 2019; Vol. 96. pp. 700-712.

@article{efe2056ef0ef49e499fb141d2f1e0a92,
title = "An integrated approach for intrinsic plagiarism detection",
abstract = "Effective plagiarism detection methods are seen as essential for the next-generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author’s “style” by revealing a set of features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author’s style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using leave-one-out cross-validation on the CEN (Corpus of English Novels) data set. Results indicate that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results also show that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, predicting the author classes with an overall accuracy of 97{\%}.",
author = "Muna Alsallal and Rahat Iqbal and Vasile Palade and Saad Amin and Victor Chang",
note = "NOTICE: this is the author’s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, [96], (2019) DOI: 10.1016/j.future.2017.11.023 {\textcopyright} 2017, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/",
year = "2019",
month = jul,
doi = "10.1016/j.future.2017.11.023",
language = "English",
volume = "96",
pages = "700--712",
journal = "Future Generation Computer Systems",
issn = "0167-739X",
publisher = "Elsevier",

}

TY  - JOUR
T1  - An integrated approach for intrinsic plagiarism detection
AU  - Alsallal, Muna
AU  - Iqbal, Rahat
AU  - Palade, Vasile
AU  - Amin, Saad
AU  - Chang, Victor
N1  - NOTICE: this is the author’s version of a work that was accepted for publication in Future Generation Computer Systems. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Future Generation Computer Systems, [96], (2019) DOI: 10.1016/j.future.2017.11.023 © 2017, Elsevier. Licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/
PY  - 2019/7
Y1  - 2019/7

AB  - Effective plagiarism detection methods are seen as essential for the next-generation web. In this paper, we present a novel approach for plagiarism detection without reference collections. The proposed approach relies on statistical properties of the most common words and on Latent Semantic Analysis, which is applied to extract the usage patterns of the most common words. This method aims to generate a model of an author’s “style” by revealing a set of features of authorship. The model generation procedure focuses on just one author, as an attempt to summarise the aspects of an author’s style in a definitive and clear-cut manner. The feature set of the intrinsic model was based on the frequency of the most common words, their relative frequencies in the book series, and the deviation of these frequencies across all books for a particular author. The approach has been evaluated using leave-one-out cross-validation on the CEN (Corpus of English Novels) data set. Results indicate that, by integrating deep latent semantic and stylometric analyses, hidden changes can be identified when a reference collection does not exist. The results also show that our Multi-Layer Perceptron based approach statistically outperforms Bayesian Network, Support Vector Machine and Random Forest models, predicting the author classes with an overall accuracy of 97%.

UR  - http://www.scopus.com/inward/record.url?scp=85040009178&partnerID=8YFLogxK
U2  - 10.1016/j.future.2017.11.023
DO  - 10.1016/j.future.2017.11.023
M3  - Article
VL  - 96
SP  - 700
EP  - 712
JO  - Future Generation Computer Systems
JF  - Future Generation Computer Systems
SN  - 0167-739X
ER  - 