Computational Methods for Protein Sequence Comparison and Search





Abstract

Protein sequence comparison and search have become commonplace not only among bioinformatics researchers but also among many experimentalists. Because of the exponential growth in sequence data, sequence comparison in particular has become an increasingly important tool. Relating a new gene sequence to other known sequences often reveals its function, structure, and evolution. Many sequence comparison and search tools are available through public Web servers, and biologists can use them easily with little knowledge of computers or bioinformatics. This unit provides some theoretical background and describes popular tools for dot plots, sequence searches against a database, multiple sequence alignment, protein tree construction, and protein family and motif searches. Step-by-step examples illustrate how to use some of the best-known tools. Finally, some general advice is given on combining different sequence analysis tools for biological inference. Curr. Protoc. Protein Sci. 56:2.1.1-2.1.27. © 2009 by John Wiley & Sons, Inc.

Keywords: protein sequence comparison; dot plot; multiple sequence alignment; protein tree; protein family; motif search

Table of Contents

Introduction
Theoretical Background for Protein Sequence Analysis
Matrix Methods for Sequence Comparison: Dot Plots
Sequence Similarity Searching
Multiple Alignments
Protein Trees
Protein Family and Functional Site Identification
General Strategy for Sequence Analyses
Acknowledgement
Internet Resources
Literature Cited
Figures
Tables


Figures

Figure 2.1.1 Dot plot generated from comparison of peanut allergens Ara h 1 and Ara h 3 using PLALIGN. Regions of similarity between the two sequences appear as lines parallel and offset to the line of identity. The expectation values for the local alignments of these regions are shown in color. The horizontal axis indicates Ara h 1 and the vertical axis indicates Ara h 3.
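
The diagonal lines described in this caption can be illustrated with a minimal Python sketch. It is not PLALIGN's method (PLALIGN plots scored local alignments); it is the simpler classic windowed dot plot, assuming exact identity scoring. The function names `dot_plot` and `render`, the `window` and `min_matches` parameters, and the two toy sequences are all hypothetical and chosen only for illustration.

```python
def dot_plot(seq_a, seq_b, window=3, min_matches=3):
    """Return the set of (i, j) window start positions that match.

    A dot is placed at (i, j) when the windows seq_a[i:i+window] and
    seq_b[j:j+window] share at least min_matches identical residues.
    Identity scoring is an assumption; real tools use substitution matrices.
    """
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                hits.add((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the plot as text: seq_a on the horizontal axis, seq_b on the vertical."""
    rows = []
    for j in range(len(seq_b)):
        rows.append("".join("*" if (i, j) in hits else "." for i in range(len(seq_a))))
    return "\n".join(rows)


# Toy sequences (hypothetical): seq_b is seq_a shifted right by two residues.
seq_a = "MKTAYIAKQR"
seq_b = "GGMKTAYIAK"
hits = dot_plot(seq_a, seq_b)
print(render(seq_a, seq_b, hits))
```

Because the shared region starts two positions later in the second sequence, every dot satisfies j - i = 2, so the dots form a single line parallel to the main diagonal and offset from it by two rows.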


Figure 2.1.2 The best local sequence alignment for peanut allergens Ara h 1 and Ara h 3 using PLALIGN. In the alignment, the lower sequence is Ara h 1 and the upper one is Ara h 3.


Figure 2.1.3 FASTA histogram from a global‐alignment search of the SWISS‐PROT database for a lectin protein. Numbers of windows at each opt score are plotted. Note that there are seven highly significant alignments.


Figure 2.1.4 FASTA alignment table and the best scoring alignment for the same search illustrated in Figure . The table shows the best alignment scores sorted by the highest opt score.


Figure 2.1.5 Input sequence file to run TCoffee for multiple sequence alignment. The sequences are from the query protein (“test”) and top seven significant hits in Figure .


Figure 2.1.6 TCoffee output multiple sequence alignment results in the ClustalW format for the input sequences in Figure . The fully conserved residues are marked with “*”, while somewhat conserved residues are indicated with “:” or “.”, the latter of which is less conserved.
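
How such a conservation line can be derived from an alignment is sketched below in Python. This is an illustrative reconstruction, not the TCoffee or ClustalW source: the "strong" and "weak" residue groups are quoted from memory of the Clustal documentation and should be treated as an assumption, and the function name `conservation_line` and the three toy aligned sequences are hypothetical.

```python
# Residue groups assumed from the Clustal convention (an assumption, not verified here).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_line(aligned):
    """Return one mark per alignment column: '*', ':', '.', or a space."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            marks.append(" ")   # columns containing a gap get no mark
        elif len(residues) == 1:
            marks.append("*")   # fully conserved
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            marks.append(":")   # all residues fall within one strong group
        elif any(residues <= set(group) for group in WEAK_GROUPS):
            marks.append(".")   # all residues fall within one weak group
        else:
            marks.append(" ")
    return "".join(marks)


# Toy alignment (hypothetical sequences, already aligned).
aligned = ["MKSA-", "MRTGW", "MKSSY"]
print(repr(conservation_line(aligned)))
```

Checking strong groups before weak ones means a column is given the most conserved mark it qualifies for, which matches the ordering in the caption ("*", then ":", then ".").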


Figure 2.1.7 TreeView display for the phylogenetic tree produced using TCoffee based on the multiple sequence alignment in Figure .


Figure 2.1.8 Partial output from MotifScan for protein Sin1, indicating the bipartite localization signals.


