Statistics, machine learning and deep learning for population genetic inference

  • Xinghu Qin

Student thesis: Doctoral Thesis (PhD)

Abstract

Deciphering evolutionary changes from raw DNA data effectively, without losing intrinsic information, is a fundamental task in population genetics. However, several statistical challenges still limit inferential performance: the undue emphasis that different statistics place on rare or common alleles, the ubiquitous multimodal genetic structure within populations, and complex genotype-by-environment associations. In this thesis, I propose integrating information-based statistics with machine learning approaches to address these challenges in population genetic inference. First, I evaluated the performance of information-based summary statistics for inferring spatial demography. I showed that summary statistics based on Shannon differentiation and the transformed diversity of order q=1 had higher power to discriminate among spatially structured scenarios than the traditional allelic richness and heterozygosity-based summary statistics. This provides guidelines for using summary statistics to infer spatial demography and for developing new statistical methods to detect signatures of evolutionary change. Second, I proposed Kernel Local Fisher Discriminant Analysis of Principal Components (KLFDAPC) for inferring population genetic structure, which accounts for the nonlinear and multimodal genetic information between individuals. KLFDAPC outperformed both PCA and DAPC in discriminatory power and in predicting individual geographic origin. KLFDAPC is useful for geographic ancestry inference and for correcting population stratification in GWAS. Finally, I proposed a deep learning-based approach (DeepGenomeScan) to detect signals of selection. DeepGenomeScan had higher power than commonly used machine learning approaches such as pcadapt and RDA in identifying signatures of selection. Furthermore, DeepGenomeScan can be extended to implement various genome-wide association studies (GWAS, TWAS, PWAS, and MWAS) by systematically scanning genome-wide variants to detect the genetic variation responsible for complex traits or involved in adaptation. In summary, this dissertation addresses several foundational questions in statistics-based and machine learning-based inference, contributing several state-of-the-art statistical tools for population genetic inference.
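
As an illustration of why the order q matters, the following is a minimal sketch of the standard diversity of order q (Hill number) computed from allele frequencies, where q=1 corresponds to the exponential of Shannon entropy. It is not taken from the thesis: the function name `hill_number` and the example frequency vectors are hypothetical, and the thesis's own summary statistics may be defined differently.

```python
import math

def hill_number(freqs, q):
    """Diversity of order q (effective number of alleles).

    Hypothetical helper for illustration only; uses the standard
    Hill-number definition, not code from the thesis.
    """
    # Drop absent alleles and renormalise so frequencies sum to 1.
    p = [f for f in freqs if f > 0]
    total = sum(p)
    p = [f / total for f in p]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1: exponential of Shannon entropy.
        shannon = -sum(f * math.log(f) for f in p)
        return math.exp(shannon)
    return sum(f ** q for f in p) ** (1.0 / (1.0 - q))

# Made-up allele frequencies at a single locus.
even = [0.25, 0.25, 0.25, 0.25]
skewed = [0.97, 0.01, 0.01, 0.01]

for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(skewed, q))
```

With q=0 every allele counts equally regardless of frequency (allelic richness), with q=2 common alleles dominate (the heterozygosity-related measure), and q=1 weights alleles in proportion to their frequency. For the even vector all three orders give an effective number of 4, while for the skewed vector the value falls as q increases.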
Date of Award: 30 Jun 2021
Original language: English
Awarding Institution
  • University of St Andrews
Supervisor: Oscar Eduardo Gaggiotti

Keywords

  • Machine learning
  • Deep learning
  • Population genetics
  • Population structure
  • Detection of natural selection
  • Information-based summary statistics
  • Genome scan
  • Genome wide association study

Access Status

  • Full text embargoed until 2nd June 2022 [Embargo applies only to all chapters and appendices]
