Lindgreen, Stinus (4); Krogh, Anders (5); Pedersen, Jakob Skou (6)
1 Computational and RNA Biology, Department of Biology, Faculty of Science, Københavns Universitet
2 Department of Biology, Faculty of Science, Københavns Universitet
3 Graduate School of Health and Medical Sciences, Faculty of Health and Medical Sciences, Københavns Universitet
4 Department of Biology, Faculty of Science, Københavns Universitet
5 Computational and RNA Biology, Department of Biology, Faculty of Science, Københavns Universitet
6 Graduate School of Health and Medical Sciences, Faculty of Health and Medical Sciences, Københavns Universitet
SNPest: a probabilistic graphical model for estimating genotypes
BACKGROUND: As next-generation sequencing technologies become more widespread, the need for robust analysis software grows as well. A key challenge when analyzing sequencing data is predicting genotypes from the reads, i.e. correctly inferring the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variation between the individual and the population can then be analyzed by looking for single nucleotide polymorphisms (SNPs) in order to investigate diseases or phenotypic features. Robust, high-confidence genotyping and SNP calling require methods that take technology-specific limitations into account and can model different sources of error. Ancient DNA, for example, poses special challenges because the data is often shallow and subject to errors induced by post-mortem damage.
FINDINGS: We present SNPest, a novel approach to the genotyping problem in which a probabilistic framework describing the process from sampling to sequencing is implemented as a graphical model. This makes it possible to model technology-specific errors and other sources of variation that can affect the result. Each inferred genotype is assigned a posterior probability that signifies the confidence in the result. SNPest has already been used for genotyping in large-scale projects such as the first ancient human genome, published in 2010.
CONCLUSIONS: We compare the performance of SNPest to a number of other widely used genotypers on both real and simulated data, covering both haploid and diploid genomes. We investigate the effects of read depth, of removing adapters before mapping and genotyping, of using different mapping tools, and of using the correct model in the genotyping process. We show that the performance of SNPest is comparable to existing methods, and we also illustrate cases where SNPest has an advantage over other methods, e.g. when dealing with simulated ancient DNA.
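To make the idea of a posterior probability for a genotype concrete, the following is a minimal sketch of a generic Bayesian genotype caller for a single diploid site. It is not the SNPest model: the uniform prior, the assumption that each read samples one of the two alleles with probability 1/2, the error model in which a miscalled base is equally likely to be any of the other three bases, and all function names (`phred_to_error`, `read_likelihood`, `genotype_posteriors`) are illustrative assumptions. It does not model mapping errors or post-mortem damage.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def read_likelihood(base, error, genotype):
    """P(observed base | genotype), assuming each allele is sampled with
    probability 1/2 and errors are spread evenly over the other 3 bases."""
    total = 0.0
    for allele in genotype:
        total += 0.5 * ((1 - error) if base == allele else error / 3)
    return total


def genotype_posteriors(pileup, prior=None):
    """Posterior over the 10 unordered diploid genotypes at one site.

    pileup: list of (base, phred_quality) pairs observed at the site.
    prior:  optional dict mapping genotype tuples to prior probabilities;
            a uniform prior is assumed when omitted (illustrative only).
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if prior is None:
        prior = {g: 1 / len(genotypes) for g in genotypes}

    # Accumulate log prior + log likelihood for numerical stability.
    log_post = {}
    for g in genotypes:
        lp = math.log(prior[g])
        for base, q in pileup:
            lp += math.log(read_likelihood(base, phred_to_error(q), g))
        log_post[g] = lp

    # Normalize with the log-sum-exp trick so the posteriors sum to 1.
    m = max(log_post.values())
    unnorm = {g: math.exp(lp - m) for g, lp in log_post.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}
```

For example, calling `genotype_posteriors([("A", 30)] * 5 + [("G", 30)] * 5)` assigns nearly all of the posterior mass to the heterozygous genotype `("A", "G")`. With only one or two reads, as is common for shallow ancient DNA data, the posterior is spread over several genotypes, which is why reporting it alongside the call is informative.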