Fritzemeier, Kai3; Kristensen, Jakob3; Larsen, Martin Røssel5; Ueckert, Torsten3; Delanghe, Bernard4; Færgeman, Nils J.5; Fredens, Julius5; Engholm-Keller, Kasper5; Ravnsborg, Christian3
1 Faculty of Science, SDU; 2 Department of Biochemistry and Molecular Biology, Faculty of Science, SDU; 3 Thermo Fisher Scientific; 4 Thermo Fisher Scientific; 5 Department of Biochemistry and Molecular Biology, Faculty of Science, SDU
Novel Aspect
All major protein repositories are integrated into a central repository for direct analysis and interpretation within standard proteomics data analysis software.
Introduction
Modern proteomics faces the challenge of performing bioinformatics analysis and comparison of large datasets. Distinguishing known proteins from novel proteins in these datasets is time-consuming, and at times nearly impossible, without proper annotation and comparison with literature sources. Tools are needed that can handle the complexity of these data, including redundancy (the same protein under different accession codes), accession codes that differ between protein databases, outdated accession codes, and protein annotation. To resolve these issues we have developed a consolidated proteomics database that provides annotations to Proteome Discoverer through directly integrated web service technology: a repository that enables efficient data mining and categorization of large datasets.
Methods
All samples were analyzed on an Orbitrap mass spectrometer coupled to a nano Easy LC. The proteomics repository database is built using Sun Java technology and MySQL database technology for optimal performance. Proteome Discoverer version 1.3 is used for database searching and is directly integrated with the proteomics repository.
Preliminary Data
Our proteomics database merges public sequence databases into a comprehensive and consistent superset of 13 million protein sequences derived from over 100 million protein records from GenBank, RefSeq, EMBL, FlyBase, UniProt, WormBase, Swiss-Prot, TrEMBL, PIR, IPI, PDB, Ensembl etc., including more than 10 million outdated accession numbers. Proteins are richly annotated by consolidating annotations from public databases with high-quality annotation from in-house computational enrichment of the sequence data. The integrated database is continuously updated as each source is updated, which enables tracking of outdated accession keys.
Preliminary results from a comparison of protein annotation coverage in UniProt, NCBI and our proteomics repository for frequently used model organisms show that collecting unique annotation information from multiple sources significantly increases protein annotation coverage in human, mouse, yeast, C. elegans and E. coli. A quantitative stable isotope labeling proteomics study comparing wild-type C. elegans with a nuclear hormone receptor 49 mutant is used as a case study to demonstrate the importance of using a consolidated proteomics repository.
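The consolidation idea described above (one entry per unique sequence, reachable through every current or outdated accession, with annotations pooled across sources) can be illustrated with a minimal Python sketch. This is not the repository's implementation, which the abstract describes only as Java and database based. The record layout, the `superseded_by` mapping, the names `ConsolidatedProtein` and `consolidate`, and the toy accessions and sequences are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ConsolidatedProtein:
    """One non-redundant entry: a unique sequence plus every accession pointing to it."""
    sequence: str
    accessions: set = field(default_factory=set)
    annotations: set = field(default_factory=set)


def consolidate(records, superseded_by):
    """Group source records by sequence and index the result by accession.

    records: iterable of (accession, sequence, annotations) tuples (hypothetical layout).
    superseded_by: dict mapping an outdated accession to its replacement (hypothetical).
    """
    by_sequence = {}
    by_accession = {}
    for accession, sequence, annotations in records:
        # Records sharing a sequence collapse into a single entry.
        entry = by_sequence.setdefault(sequence, ConsolidatedProtein(sequence))
        entry.accessions.add(accession)
        entry.annotations.update(annotations)
        by_accession[accession] = entry
    # Outdated accessions resolve to the entry of their replacement, if it is known.
    for old, new in superseded_by.items():
        if new in by_accession:
            entry = by_accession[new]
            entry.accessions.add(old)
            by_accession[old] = entry
    return by_accession


# Toy data only: accessions, sequences and annotations are invented for this example.
records = [
    ("SRC1:P0001", "MKTAYIAK", {"lipid metabolism"}),
    ("SRC2:X9", "MKTAYIAK", {"nuclear receptor"}),
    ("SRC1:P0002", "MSLLQERT", {"kinase"}),
]
index = consolidate(records, {"SRC1:OLD7": "SRC1:P0001"})
entry = index["SRC1:OLD7"]
print(sorted(entry.accessions))
print(sorted(entry.annotations))
```

In this sketch a lookup by the outdated accession returns the same entry as a lookup by either current accession, and that entry carries the annotations from both sources. Keying on the exact sequence string is the simplest possible redundancy criterion; a production system would likely need a more careful notion of sequence identity.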