1 Department of Systems Biology, Technical University of Denmark
2 Department of Bio and Health Informatics, Technical University of Denmark
Proteins are central to virtually all processes within the cell. The vast number of functions that proteins perform in biological processes is conferred by their ability to bind other molecules in a selective and specific manner. These interactions are, in general terms, three-dimensional, as binding sites normally consist of a pocket or a groove on the protein surface. In many cases, however, such interactions contain a linear component and can be more conveniently represented, or approximated, by a protein-peptide interaction. Whereas time-consuming structural studies are necessary in systems where the three-dimensional aspect of the interaction dominates, protein-peptide interactions can normally be represented simply by a linear binding motif. Phage display and peptide microarray technologies allow the generation of large libraries of peptide sequences and the parallel detection of thousands of interactions in a single experiment, with a virtually unlimited choice of potential targets and variants of these targets. However, the amount and complexity of the data produced by high-throughput techniques pose serious challenges to researchers with limited bioinformatics expertise who need to analyze and interpret such data. The first paper in this thesis presents a new, publicly available method based on artificial neural networks that allows custom analysis of quantitative peptide data. The online NNAlign web-server provides a simple yet powerful tool for the discovery of sequence motifs in large-scale peptide data sets. It was successfully applied to characterize the binding motifs of MHC class I and class II molecules, and to predict protease cleavage from data generated by a large-scale peptide microarray technology. In the second paper, NNAlign was applied to binding data for HLA-DP and HLA-DQ molecules, two classes of HLA molecules with recognized importance in the immune response but poorly characterized sequence motifs.
The sequence logos of five HLA-DP and six HLA-DQ molecules provide a characterization of their binding motifs at an unprecedented level of detail. The third paper in this thesis deals with the presence of multiple motifs in peptide data, arising either from the experimental setup or from the actual poly-specificity of the receptor. A new algorithm, based on Gibbs sampling, identifies multiple specificities by performing two tasks simultaneously: alignment and clustering of peptide data. The method, available online as a web-server, was applied to various data sets, including mixtures of MHC binding data and distinct classes of ligands to SH3 domains. Next, we investigated how string kernels could be used to identify patterns in peptide data, with particular focus on the MHC class I system. We suggest a strategy that, unlike most available methods, allows learning from peptides of multiple lengths to achieve improved predictive performance. This appeared particularly important for alleles and peptide lengths for which experimental data were limited. The last chapter presents a method to rationally guide the discovery of T-cell epitopes from ELISPOT and ICS assays based on peptide pool matrices. By predicting binding affinity, analyzing peptide pool intersections, and combining information from different donors, we show that the method can effectively rank potential epitope candidates and reduce the number of experimental tests needed to identify new epitopes. Taken as a whole, this thesis provides a valuable series of algorithms and tools for the analysis of peptide data, both for the characterization of sequence motifs and for the prediction of protein-peptide interactions.
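As an illustration of the peptide pool intersection idea mentioned for the last chapter, the following is a minimal sketch and not the method developed in the thesis. It assumes a simple grid layout in which every peptide belongs to exactly one row pool and one column pool, so that peptides lying at the intersection of a positive row pool and a positive column pool become epitope candidates. The function names, the peptide identifiers, and the score values are hypothetical placeholders; in particular, the scores stand in for predicted binding affinities and are not real predictions.

```python
from itertools import product


def candidate_peptides(grid, positive_rows, positive_cols):
    """Return peptides at the intersection of positive row and column pools.

    grid is a list of rows; grid[r][c] is the peptide tested in both
    row pool r and column pool c (assumed layout, for illustration only).
    """
    return [
        grid[r][c]
        for r, c in product(sorted(positive_rows), sorted(positive_cols))
    ]


def rank_candidates(candidates, predicted_score):
    """Sort candidates by a hypothetical predicted binding score, best first.

    Peptides without a score are given 0.0 and therefore rank last.
    """
    return sorted(
        candidates,
        key=lambda peptide: predicted_score.get(peptide, 0.0),
        reverse=True,
    )


# Toy 3 x 3 pool matrix with placeholder peptide identifiers.
grid = [
    ["PEP01", "PEP02", "PEP03"],
    ["PEP04", "PEP05", "PEP06"],
    ["PEP07", "PEP08", "PEP09"],
]

# Pools assumed to have given a positive assay response (made-up values).
positive_rows = {0, 2}
positive_cols = {1}

# Made-up scores standing in for predicted binding affinity.
scores = {"PEP02": 0.2, "PEP08": 0.9}

candidates = candidate_peptides(grid, positive_rows, positive_cols)
ranked = rank_candidates(candidates, scores)
print(ranked)
```

In this toy layout, two positive row pools and one positive column pool leave two candidate peptides out of nine, and the placeholder scores decide which of them would be tested first. Combining information from several donors, as described above, is not covered by this sketch.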