1 Department of Chemistry, Technical University of Denmark
2 UCSC
We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
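As a rough illustration of how a Dirichlet mixture can be combined with observed counts, the sketch below computes posterior-mean probabilities for one column. It assumes the standard form of the estimate, in which each component's pseudocount estimate is weighted by the posterior probability of that component given the counts; it is not taken from the derivations in this paper. The function names, the three-letter toy alphabet, and all parameter values are hypothetical, chosen only to keep the example short.

```python
from math import exp, lgamma, log


def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def mixture_posterior_mean(counts, weights, alphas):
    """Estimate letter probabilities from counts under a Dirichlet mixture prior.

    counts  -- observed counts n_i for one column (non-negative)
    weights -- mixture coefficients q_j (positive, summing to 1)
    alphas  -- one Dirichlet parameter vector alpha_j per component (all > 0)
    """
    # Log of q_j * B(alpha_j + n) / B(alpha_j) for each component.
    # The multinomial coefficient is shared by all components and cancels.
    log_post = []
    for q, alpha in zip(weights, alphas):
        shifted = [a + n for a, n in zip(alpha, counts)]
        log_post.append(log(q) + log_beta(shifted) - log_beta(alpha))

    # Normalize in log space to avoid underflow with large counts.
    top = max(log_post)
    unnorm = [exp(lp - top) for lp in log_post]
    z = sum(unnorm)
    post = [u / z for u in unnorm]

    # Weighted average of the per-component pseudocount estimates.
    total = sum(counts)
    probs = []
    for i, n_i in enumerate(counts):
        p_i = sum(
            w * (n_i + alpha[i]) / (total + sum(alpha))
            for w, alpha in zip(post, alphas)
        )
        probs.append(p_i)
    return probs


# Toy usage: two components over a hypothetical three-letter alphabet.
toy_counts = [5, 0, 0]
toy_weights = [0.5, 0.5]
toy_alphas = [[2.0, 0.5, 0.5], [0.5, 2.0, 0.5]]
toy_probs = mixture_posterior_mean(toy_counts, toy_weights, toy_alphas)
```

With no observations the estimate reduces to the prior mean of the mixture, and as the counts grow it approaches the observed frequencies, which is the behaviour that lets a model generalize from few sequences. Working with `lgamma` rather than the Gamma function itself is one simple way to keep the computation numerically stable.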