Have, Christian Theil3; Appel, Emil Vincent Rosenbaum3; Grarup, Niels3; Hansen, Torben3; Bork-Jensen, Jette4
1 Section for Metabolic Genetics, Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, Københavns Universitet
2 Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, Københavns Universitet
3 Section for Metabolic Genetics, Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, Københavns Universitet
4 Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, Københavns Universitet
Abstract: Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and when they are detected they are normally discarded because no reliable method exists to recover the correct labels. Here we describe a statistical method that, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program named Wunderbar and evaluate its reliability on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar identifies mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and in the presence of dependency between the individual barcode genotypes. To detect mislabeled samples, we calculate the probability that the discordance between the genotypes in the data and the independent barcode genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups, we calculate the probability of observing the set of identical genotypes shared by sample x and sample y by chance. Based on this, we calculate a mix-up confidence score that penalizes mismatches introduced by the proposed new label and adjusts for dependency among the genotypes. This confidence score is used to identify probable mix-ups.
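The two probabilities described in the abstract can be illustrated with a minimal sketch. This is not the Wunderbar implementation: the function names, the binomial error model, and the Hardy-Weinberg and marker-independence assumptions are all ours, chosen only to make the idea concrete. The penalized, dependency-adjusted confidence score is not reproduced here because the abstract does not give its form.

```python
from math import comb


def count_mismatches(genotypes_a, genotypes_b):
    """Number of markers at which two genotype vectors disagree."""
    return sum(a != b for a, b in zip(genotypes_a, genotypes_b))


def mismatch_tail_prob(n_markers, n_mismatches, error_rate):
    """P(X >= n_mismatches) for X ~ Binomial(n_markers, error_rate).

    Assumption: each marker is miscalled independently with the same
    probability. A small value suggests the discordance is unlikely to be
    explained by random genotyping error alone.
    """
    return sum(
        comb(n_markers, k) * error_rate**k * (1 - error_rate) ** (n_markers - k)
        for k in range(n_mismatches, n_markers + 1)
    )


def chance_match_prob(genotypes, alt_allele_freqs):
    """Probability that a random individual carries exactly these genotypes.

    Assumptions: genotypes are coded as alternate-allele counts (0, 1, 2),
    markers are independent, and Hardy-Weinberg proportions hold.
    """
    prob = 1.0
    for g, q in zip(genotypes, alt_allele_freqs):
        prob *= comb(2, g) * q**g * (1 - q) ** (2 - g)
    return prob


# Hypothetical example: five barcode markers for one labeled sample.
study_genotypes = [0, 1, 2, 1, 0]
barcode_genotypes = [2, 1, 0, 0, 0]
freqs = [0.3, 0.5, 0.2, 0.4, 0.1]

mismatches = count_mismatches(study_genotypes, barcode_genotypes)
p_error = mismatch_tail_prob(len(study_genotypes), mismatches, error_rate=0.01)
p_chance = chance_match_prob(barcode_genotypes, freqs)
print(mismatches, p_error, p_chance)
```

Under these assumptions, a sample whose discordance has a very small `p_error` would be a candidate mislabel, and a candidate partner whose full match has a very small `p_chance` would be a plausible source of the correct label.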
International Journal of Bioscience, Biochemistry and Bioinformatics, 2014, Vol 4, Issue 5, p. 355-360