Everyone working with bacterial genomics is familiar with the phrase ‘too much data’. In this Genome Update, we discuss two methods for helping to deal with this explosion of genomic information. First, we introduce the concept of calculating a quality score for each sequenced genome, and second, we describe a method to quickly sort through genomes for a particular set of protein families. We apply these two methods to all of the current Escherichia coli genomes available in the National Center for Biotechnology Information database. Of the 2074 E. coli/Shigella genomes listed (June 2013), fewer than half (983) are of sufficient quality to use in comparative genomic work. Unfortunately, even some of the ‘complete’ E. coli genomes are in pieces, and a few ‘draft’ genomes are of good quality. Six of the seven known sigma factors in E. coli strain K‐12 are extremely well conserved; the iron‐regulating sigma factor FecI (σ19) is missing in most genomes. Surprisingly, the E. coli strain CFT073 genome does not encode a functional RpoD (σ70), which is obviously essential, and this is likely due to poor genome assembly/annotation. We find a possible novel sigma factor present in more than a hundred E. coli genomes.
Environmental Microbiology, 2013, Vol 15, Issue 12, p. 3121-3129