Statistical underpinning of mutational signature analyses of cancer sequencing data

Student thesis: Doctoral Thesis (PhD)


Cancer is a disease driven and characterised by mutations in the DNA. Massively parallel sequencing technologies now make it possible to obtain the sequence of a cancer genome, which has allowed researchers to study the mutations involved in tumour development. More recently, attention has been drawn to the 'passenger' mutations that are not involved in tumour development but bear fingerprints of the mutational processes that have been operative over a patient's lifetime.

Those fingerprints, termed mutational signatures, appear consistently across cancer genomes that have been exposed to the underlying mutational processes. Computational analyses have identified over a hundred such signatures, and it is now possible to estimate the relative prevalence of mutational signatures in a cancer genome. Both types of analyses are perhaps unique in the medical literature, in that no confidence intervals or other representations of uncertainty are demanded when reporting the results.

In this thesis, we address the problem of quantifying uncertainty around the reported mutational signatures and their relative prevalence in individual tumours. First, in Chapter 2, we review the available computational methods for mutational signature analyses, assessing the potential of existing approaches to characterise uncertainty. Then, in Chapter 3, we annotate ten statistical challenges. The remainder of the thesis is built on the aim of addressing some of those challenges.

When estimating the relative prevalence of mutational signatures in individual tumours, a method that quantifies the uncertainty around the estimated solution is lacking. Moreover, those analyses assume that the true values for the signatures are 'known', as they are propagated from previous analyses. In Chapter 4, we suggest a setting where the signatures are 'partially known'. We propose a novel approach for this problem, in a Bayesian setting, providing credible intervals around the estimated solution, propagating prior uncertainty regarding 'partially known' signatures, and updating prior beliefs about them.

Estimation of mutational signatures is often performed in a matrix factorisation setting that is not fully probabilistic. While an alternative fully probabilistic approach is available, a post-processing method is needed to characterise the uncertainty around the reported solution. In Chapter 5, we introduce a novel post-processing approach to quantify uncertainty around the mutational signatures estimated in a cohort of cancer patients, along with software that allows investigators to use the proposed method and visualise results.
Date of Award: 2024
Original language: English
Awarding Institution
  • University of St Andrews
Supervisors: Andy Lynch & Michail Papathomas


Keywords

  • Cancer genomics
  • Mutational signatures
  • Bioinformatics
  • Biostatistics
  • Bayesian statistics

Access Status

  • Full text embargoed until 8 May 2025
