Full Text Available

Note: Clicking the button above will open the full text document at the original institutional repository in a new window.

Simulating recombinant sequence date to evaluate and improve computational methods of multiple sequence alignment and recombinant identification

Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurat...

Full description

Saved in:
Bibliographic Details
Main Author: Swanepoel, Phillip
Other Authors: Martin, Darrin
Format: Thesis
Language:English
Published: Computational Biology Division 2024
Subjects:
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Motivation. Recombination is a central evolutionary process that substantially changes the structure of genomes and shapes their evolutionary trajectory. Recombination detection is thus an important computational step in understanding the evolutionary history of nucleotide sequences, and the accurate identification of recombinant sequences is particularly important in the context of downstream phylogenetics-based sequence analyses. Evaluating recombination detection methods requires the simulation of sequence data, and the training of statistical learning models requires large, realistic datasets. The goal of this study was thus to (1) simulate large, realistic sequence datasets that have evolved in the presence of frequent recombination, and (2) to use these datasets to improve one of the computational steps used in the analysis of recombination by the computer program, recombination detection program 5 (RDP5), specifically: the identification of the recombinant from a recombinant/parent/parent triplet. Results. To improve the accuracy with which RDP5 identifies recombinant sequences, we simulated the evolution of recombining sequences to produce large datasets that could then be used to train a number of machine learning models to accurately differentiate recombinants from their parental sequences. The artificial intelligence systems created using these models showed a substantial improvement in recombinant identification accuracy over the method currently implemented in RDP5 - with an increase in accuracy of up to 26 percentage points. Availability and implementation. Our simulation software is a forked version of SANTA-SIM developed in Java. All source code is released and is available at: https://github.com/phillipswanepoel/santa-sim/tree/Recomb_and_align.