Full Text Available

Note: Clicking the button above will open the full text document at the original institutional repository in a new window.

Creating and analysing an African pan-genome

The human reference genome is currently a core resource for understanding the role of genetics in human health, disease, and variation, and has been invaluable in the development of clinical and computational tools for these purposes. However, the limited number of individual genomes used to create...

Full description

Saved in:
Bibliographic Details
Main Author: Bourn, Jessica Jean
Other Authors: Mulder, Nicola
Format: Thesis
Language:English
Published: Department of Integrative Biomedical Sciences (IBMS) 2023
Subjects:
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1867613328406216704
access_status_str Open Access
author Bourn, Jessica Jean
author2 Mulder, Nicola
author_browse Bourn, Jessica Jean
Mulder, Nicola
author_facet Mulder, Nicola
Bourn, Jessica Jean
author_sort Bourn, Jessica Jean
collection Thesis
description The human reference genome is currently a core resource for understanding the role of genetics in human health, disease, and variation, and has been invaluable in the development of clinical and computational tools for these purposes. However, the limited number of individual genomes used to create the reference has resulted in an underrepresentation of the extensive genetic diversity present in different human populations. Since an important use of the reference genome is to identify genetic variants that may be implicated in disease, this lack of diversity could limit the scientific utility of the reference for ethnic groups that are poorly represented in it. As a result, adaptations to the reference genome structure have been proposed. One such proposal has been the use of multiple reference genomes, each of which represent different human populations. A logical and highly practical method of achieving this is through the use of a pan-genome, which is a curated collection of all the DNA sequences that are found within a population under study. Despite the fact that African populations exhibit the greatest genetic diversity and variation in the world, the many and sometimes ancient ethnolinguistic groups from Africa are among those least represented within the reference genome. Consequently, this study aimed to explore the feasibility of creating and analysing an African pangenome, and to begin developing tools to achieve this. Several distinct African regional ancestral groups – namely east African Nilo-Saharan, east African Afro-Asiatic, far west Niger-Congo, central west Niger-Congo, Bantu-speaking Niger-Congo, central African rainforest hunter-gatherer, and the Khoe and San – have previously been identified, and this study included and analysed samples from each group in order to assemble a more inclusive and representative pan-genome. A software pipeline developed by Duan et al. (2019), termed the HUman Pan-genome ANalysis (HUPAN) pipeline, was used here to assemble the African pan-genome. As the HUPAN pipeline was originally designed to analyse only single populations, the inclusion of multiple populations required modifications and improvements, which were implemented following the testing and analysis of the pipeline using a smaller dataset of whole genome sequences. Subsequently, a final dataset of 168 African high- and medium-coverage whole genome sequences representing the seven separate regional ancestral groups was submitted to the adapted HUPAN pipeline. For each group, nucleotide sequences that were absent from the human reference genome were assembled and extracted, which resulted in the identification of 43.37 Mbp of non-redundant non-reference genomic sequence and 31 novel predicted protein-coding genes from African individuals. Alignment to other pan-genome sequences, whole genomes from different human populations, and the complete telomere-to-telomere human genome validated a large portion of the sequences as nonreference and confirmed that the dataset contained sequences specific to African populations. However, the gene presence-absence variation analysis of the pan-genome within all 168 samples revealed patterns of gene presence and absence that were strongly correlated to the sample dataset of origin, rather than to the ancestral group of origin. This hindered the identification of genuine genetic variation specific to the groups analysed. Further, it appears that previous pan-genomic research has not investigated the degree to which the genetic variation identified is dataset-specific or truly population-specific. Consequently, the failure to acknowledge and account for the effects of spurious inter-dataset variation in previous pan-genomic research indicates that those analyses may be incomplete or ambiguous. This, therefore, calls into question the methods currently used for pangenomic research, and highlights that robust, standardised methods for human pan-genome research must be agreed on to ensure that comprehensive population-specific pan-genomes are produced in the future. Despite this inherent weakness of pan-genomic research, this study successfully enabled the creation and analysis of a comprehensive and inclusive African pan-genome. Unique sets of non-reference sequences specific to African regional ancestral groups were identified and obtained, enabling the assembly of a non-redundant set of pan-African non-reference sequences. Furthermore, certain complex but previously unconsidered aspects of pan-genome research were identified and explored, and these observations may play a role in the advancement of pan-genome research in future.
format Thesis
id oai:open.uct.ac.za:11427/37209
institution University of Cape Town (South Africa)
language eng
last_indexed 2026-06-10T12:34:23.309Z
license_str Not specified — see source repository
provenance_str_mv Harvested via OAI-PMH from UCTD — University of Cape Town Open Access Repository
publishDate 2023
publishDateRange 2023
publishDateSort 2023
publisher Department of Integrative Biomedical Sciences (IBMS)
publisherStr Department of Integrative Biomedical Sciences (IBMS)
record_format dspace
source_str UCTD — University of Cape Town Open Access Repository
spelling oai:open.uct.ac.za:11427/37209 Creating and analysing an African pan-genome Bourn, Jessica Jean Mulder, Nicola Bioinformatics The human reference genome is currently a core resource for understanding the role of genetics in human health, disease, and variation, and has been invaluable in the development of clinical and computational tools for these purposes. However, the limited number of individual genomes used to create the reference has resulted in an underrepresentation of the extensive genetic diversity present in different human populations. Since an important use of the reference genome is to identify genetic variants that may be implicated in disease, this lack of diversity could limit the scientific utility of the reference for ethnic groups that are poorly represented in it. As a result, adaptations to the reference genome structure have been proposed. One such proposal has been the use of multiple reference genomes, each of which represent different human populations. A logical and highly practical method of achieving this is through the use of a pan-genome, which is a curated collection of all the DNA sequences that are found within a population under study. Despite the fact that African populations exhibit the greatest genetic diversity and variation in the world, the many and sometimes ancient ethnolinguistic groups from Africa are among those least represented within the reference genome. Consequently, this study aimed to explore the feasibility of creating and analysing an African pangenome, and to begin developing tools to achieve this. Several distinct African regional ancestral groups – namely east African Nilo-Saharan, east African Afro-Asiatic, far west Niger-Congo, central west Niger-Congo, Bantu-speaking Niger-Congo, central African rainforest hunter-gatherer, and the Khoe and San – have previously been identified, and this study included and analysed samples from each group in order to assemble a more inclusive and representative pan-genome. A software pipeline developed by Duan et al. (2019), termed the HUman Pan-genome ANalysis (HUPAN) pipeline, was used here to assemble the African pan-genome. As the HUPAN pipeline was originally designed to analyse only single populations, the inclusion of multiple populations required modifications and improvements, which were implemented following the testing and analysis of the pipeline using a smaller dataset of whole genome sequences. Subsequently, a final dataset of 168 African high- and medium-coverage whole genome sequences representing the seven separate regional ancestral groups was submitted to the adapted HUPAN pipeline. For each group, nucleotide sequences that were absent from the human reference genome were assembled and extracted, which resulted in the identification of 43.37 Mbp of non-redundant non-reference genomic sequence and 31 novel predicted protein-coding genes from African individuals. Alignment to other pan-genome sequences, whole genomes from different human populations, and the complete telomere-to-telomere human genome validated a large portion of the sequences as nonreference and confirmed that the dataset contained sequences specific to African populations. However, the gene presence-absence variation analysis of the pan-genome within all 168 samples revealed patterns of gene presence and absence that were strongly correlated to the sample dataset of origin, rather than to the ancestral group of origin. This hindered the identification of genuine genetic variation specific to the groups analysed. Further, it appears that previous pan-genomic research has not investigated the degree to which the genetic variation identified is dataset-specific or truly population-specific. Consequently, the failure to acknowledge and account for the effects of spurious inter-dataset variation in previous pan-genomic research indicates that those analyses may be incomplete or ambiguous. This, therefore, calls into question the methods currently used for pangenomic research, and highlights that robust, standardised methods for human pan-genome research must be agreed on to ensure that comprehensive population-specific pan-genomes are produced in the future. Despite this inherent weakness of pan-genomic research, this study successfully enabled the creation and analysis of a comprehensive and inclusive African pan-genome. Unique sets of non-reference sequences specific to African regional ancestral groups were identified and obtained, enabling the assembly of a non-redundant set of pan-African non-reference sequences. Furthermore, certain complex but previously unconsidered aspects of pan-genome research were identified and explored, and these observations may play a role in the advancement of pan-genome research in future. 2023-03-03T11:58:50Z 2023-03-03T11:58:50Z 2022 2023-02-20T12:19:09Z Master Thesis Masters MSc http://hdl.handle.net/11427/37209 eng application/pdf Department of Integrative Biomedical Sciences (IBMS) Faculty of Health Sciences
spellingShingle Bioinformatics
Bourn, Jessica Jean
Creating and analysing an African pan-genome
thesis_degree_str Master's
title Creating and analysing an African pan-genome
title_full Creating and analysing an African pan-genome
title_fullStr Creating and analysing an African pan-genome
title_full_unstemmed Creating and analysing an African pan-genome
title_short Creating and analysing an African pan-genome
title_sort creating and analysing an african pan genome
topic Bioinformatics
url http://hdl.handle.net/11427/37209
work_keys_str_mv AT bournjessicajean creatingandanalysinganafricanpangenome