Abstract
Long non-coding RNAs (lncRNAs) are known to play a significant role in several biological processes. These RNAs are longer than 200 base pairs (bp), and so are often misclassified as protein-coding genes. Most Coding Potential Computation (CPC) tools fail to accurately identify, classify and predict the biological functions of lncRNAs in plant genomes, because previous research has been limited to mammalian genomes.
In this thesis, various sequence and codon-bias features for the identification of lncRNA sequences are investigated and extracted in order to develop a new CPC framework. To identify the essential features, the framework implements regularisation-based selection. A novel classification algorithm is implemented, which removes the dependency on experimental datasets and provides a coordinate-based solution for sub-classification of lncRNAs. To impute lncRNA functions, lncRNA-protein interactions were first determined through co-expression of genes and then re-analysed with a sequence similarity-based approach to identify novel interactions and predict lncRNA functions in the genome. The framework also integrates a D3-based application for visualisation of lncRNA sequences and their associated functions in the genome.
Standard evaluation metrics such as accuracy, sensitivity, and specificity have been used to benchmark the performance of the framework against leading CPC tools. Case study analyses were conducted with plant RNA-seq datasets to evaluate the effectiveness of the framework using a cross-validation approach. The tests show that the framework can improve significantly on existing CPC models for plant genomes, with 20-40% greater accuracy. Function prediction analysis demonstrates that the results are consistent with published experimental findings.
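As a reminder of how the three benchmarking metrics named above are defined, the following is a minimal sketch that computes them from a binary confusion matrix. It is not taken from the thesis: the function name `evaluate`, the label convention (1 for lncRNA, 0 for protein-coding) and the toy labels are assumptions chosen purely for illustration.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels.

    Assumption (illustrative only): `positive` marks the lncRNA class and
    every other label is treated as protein-coding.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    total = tp + tn + fp + fn
    return {
        # Fraction of all transcripts classified correctly.
        "accuracy": (tp + tn) / total if total else float("nan"),
        # True positive rate: lncRNAs correctly identified as lncRNAs.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # True negative rate: coding transcripts correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels, invented for illustration (not data from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

In a cross-validation setting, a function like this would typically be applied to each held-out fold and the per-fold values averaged.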
|Date of Award||2018|
|Supervisor||James Shuttleworth (Supervisor), Jianhua Yang (Supervisor), Sandy Taramonli (Supervisor) & Matthew England (Supervisor)|