Text and data mining for information extraction for scientific documents

  • Bello Aliyu Muhammad

    Student thesis: Doctoral Thesis (Doctor of Philosophy)


    This research has produced a novel approach to information extraction from scientific research documents (SRDs) by means of text and data mining, including machine learning (ML) and automatic text summarisation (ATS). SRDs consist of unstructured data, which has no predefined data model and is not organised in any predefined manner. They are, however, organised in a hierarchical structure commonly known as IMRaD. Extracting the desired information from this structure is both challenging and time-consuming. Automated data extraction is useful for optimising certain administrative processes, reviews of scientific literature and similar tasks. The approach was developed using SRDs as the case study. It is therefore most suitable for automatic data (information) extraction from primary studies (SRDs) during a review process, but it is scalable to any structured document. A literature review is a scientific and rigorous process that aims to integrate empirical research evidence to answer a given research question. The goal of the review is therefore to find all the relevant information on that question, establish how it is covered in the literature, and bring it together into one body of evidence that answers the review question, using a well-defined set of procedures and guidelines. The most challenging stage in the review process is information (data) extraction from the SRDs. A review typically involves hundreds or thousands of SRDs, so extracting the desired information from such a vast volume of unstructured data is labour-intensive, error-prone and time-consuming. Better automated approaches can save reviewers between 30% and 70% of their time, and for this reason interest in automated information extraction research is growing.
Several methods and approaches have been developed to support the review process, but they all offer very little support for automated information extraction from the primary studies (SRDs). Information extraction from SRDs is a difficult process. It involves identifying and/or extracting the summary of findings (main result), the main topics covered and the methods proposed or used in the documents. The lack of a unified framework has been identified as the main obstacle to automating data extraction from SRDs. A framework is therefore needed to automate or semi-automate this process. This research has developed a novel framework (approach) for automated extraction of the relevant data from SRDs. The framework is based on the canonical structure of SRDs, the standard format for the body of scientific documents, commonly referred to as IMRAD or IMRaD (Introduction, Methods, Results, and Discussion and Conclusion). Text and data mining, including machine learning and natural language processing technologies, were used to achieve this goal. First, intelligent models were developed to enable machines to understand the canonical (IMRaD) structure, as machines do not have an implicit understanding of it. We analysed, experimented with and selected three (3) machine learning (ML) methods for this task: Support Vector Machine (SVM), Logistic Regression and Random Forest. A deep learning model, a Convolutional Neural Network (CNN), was also trained. On the dataset used in this research, all of these ML methods returned good accuracy, precision and recall, but the CNN outperformed them all. Performance was further enhanced by incorporating a hybrid approach into the machine learning process, which involves evaluation by human subject experts.
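As a rough illustration of the section-identification step, the sketch below trains a tiny one-vs-rest logistic regression over bag-of-words counts to label sentences with an IMRaD section. It is not the thesis's implementation: the training sentences, the feature representation, the function names and the hyperparameters (`epochs`, `lr`) are all assumptions chosen to keep the example self-contained, and a real system would use a labelled corpus and an established ML library.

```python
import math
from collections import Counter

# Hypothetical toy training data: (sentence, IMRaD section label).
TRAIN = [
    ("previous studies have not addressed this problem", "Introduction"),
    ("this paper aims to address the gap in previous studies", "Introduction"),
    ("we collected samples and trained the model using cross validation", "Methods"),
    ("the samples were collected and the procedure was applied", "Methods"),
    ("the model achieved accuracy of ninety percent as table shows", "Results"),
    ("table two shows the accuracy achieved increased significantly", "Results"),
    ("these findings suggest future work and we conclude", "Discussion"),
    ("we conclude that the findings suggest limitations", "Discussion"),
]

def tokenize(text):
    return text.lower().split()

def build_vocab(samples):
    """Map every training token to a feature index."""
    vocab = sorted({tok for text, _ in samples for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(text, vocab):
    """Sparse bag-of-words counts; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return {vocab[t]: float(c) for t, c in counts.items() if t in vocab}

def sigmoid(z):
    # Numerically stable form for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_one_vs_rest(samples, vocab, epochs=300, lr=0.1):
    """Train one binary logistic regression per section with plain SGD."""
    labels = sorted({label for _, label in samples})
    feats = [vectorize(text, vocab) for text, _ in samples]
    models = {}
    for target in labels:
        w, b = [0.0] * len(vocab), 0.0
        for _ in range(epochs):
            for x, (_, label) in zip(feats, samples):
                y = 1.0 if label == target else 0.0
                z = b + sum(w[i] * v for i, v in x.items())
                err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
                for i, v in x.items():
                    w[i] -= lr * err * v
                b -= lr * err
        models[target] = (w, b)
    return models

def predict(text, models, vocab):
    """Return the section whose binary classifier gives the highest score."""
    x = vectorize(text, vocab)
    scores = {
        label: b + sum(w[i] * v for i, v in x.items())
        for label, (w, b) in models.items()
    }
    return max(scores, key=scores.get)

vocab = build_vocab(TRAIN)
models = train_one_vs_rest(TRAIN, vocab)
print(predict("we collected the samples and applied cross validation", models, vocab))
```

Logistic regression is used here only because it is one of the methods named above and is short to write by hand; the same train-then-predict shape would apply to an SVM, a Random Forest or a CNN.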
Second, once the sections of the canonical (IMRaD) structure have been identified, the section containing the relevant data is delineated for further processing, for example the ‘Results’ section when extracting the findings from the document. To extract the relevant data, the text within that section is then automatically summarised. The research identified and evaluated appropriate automatic text summarisation (ATS) approaches and methods, and adopted the extractive summarisation approach. Four (4) extractive ATS methods were selected and used: a frequency-based method (TF-IDF score), two graph-based methods (LexRank and TextRank) and a cluster-based method. Evaluated with the ROUGE standard for ATS, the graph-based TextRank method achieved the best performance, with a recall of 81% and a precision of 65% on the dataset (text) used in this research.
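The summarisation and evaluation steps can be sketched in the same spirit. The code below is a minimal, assumption-laden illustration of a TextRank-style extractive summariser (sentences as graph nodes, word overlap as edge weight, PageRank-style iteration) together with a unigram ROUGE-1 recall and precision calculation. The sentence splitter, the similarity function, the damping factor, the iteration count and all function names are illustrative choices, not the configuration used in the thesis.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())

def similarity(a, b):
    """Shared-word count normalised by log sentence lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom == 0:
        return 0.0
    return len(set(a) & set(b)) / denom

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    toks = [tokens(s) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(toks[i], toks[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores

def summarise(text, k=2):
    """Keep the k highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))

def rouge1(candidate, reference):
    """Unigram ROUGE-1 as (recall, precision), using clipped counts."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    return recall, precision

# Hypothetical example text: three related sentences and one off-topic one.
TEXT = (
    "TextRank ranks sentences by graph centrality. "
    "Graph centrality favours sentences that share words with many sentences. "
    "Lunch was served at noon. "
    "Sentences that share many words receive higher TextRank scores."
)
print(summarise(TEXT, k=2))
print(rouge1("the cat sat", "the cat sat on the mat"))
```

In this sketch recall is the share of reference unigrams recovered by the summary and precision is the share of summary unigrams that appear in the reference, which is how figures such as the 81% recall and 65% precision reported above are conventionally read.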
    Date of Award: Feb 2021
    Original language: English
    Awarding Institution
    • Coventry University
    Sponsors: Petroleum Technology Development Fund
    Supervisor: Rahat Iqbal (Supervisor), Anne James (Supervisor) & Dianabasi Nkantah (Supervisor)
