Protein sequence data

It is worth doing some initial analysis of your protein sequence before going any further. If a protein has come directly from a gene prediction, for example, it may consist of multiple domains. More seriously, it may contain regions that are unlikely to be globular or soluble. This flowchart assumes that your protein is soluble, likely comprises a single domain, and does not contain non-globular regions.

Things to consider are:

Is your protein a transmembrane protein, or does it contain transmembrane segments? There are many methods for predicting these segments, including:
- TMAP (EMBL)
- PredictProtein (EMBL/Columbia)
- TMHMM (CBS, Denmark)
- TMpred (Baylor College)
- DAS (Stockholm)
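
As a rough illustration of the idea behind such predictors, here is a minimal Python sketch of a sliding-window hydropathy scan using the Kyte-Doolittle scale. It is not the method used by any of the servers listed above, which are considerably more sophisticated. The function names are hypothetical, and the window of 19 residues and cutoff of 1.6 are assumptions taken from the commonly quoted Kyte-Doolittle rule of thumb.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Return the mean hydropathy of every full window along seq.

    Unknown characters (e.g. X) are scored as 0.0, which is an
    assumption made for this sketch.
    """
    scores = [KD.get(aa, 0.0) for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Return 0-based start positions of windows at or above threshold."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


if __name__ == "__main__":
    # Toy sequence: charged flanks around a 20-residue leucine stretch.
    toy = "DEDEDEDEDE" + "L" * 20 + "KRKRKRKRKR"
    print(candidate_tm_windows(toy))
```

Any window start reported here is only a hint that a hydrophobic stretch is present; treat it as a reason to run one of the dedicated predictors, not as a prediction in itself.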
Does your protein contain coiled coils? You can predict coiled coils at the COILS server, or you can download the COILS program (recently rewritten by me, of all people). A version of COILS is also included in the GCG suite of programs.
Does your protein contain regions of low complexity? Proteins frequently contain runs of a single residue, such as poly-glutamine or poly-serine, which prediction methods handle poorly. To check for this you can use the program SEG (a version of SEG is also included in the GCG suite of programs).
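
To make the notion of low complexity concrete, here is a minimal Python sketch that flags windows whose Shannon entropy is low. It is a simplified stand-in and not the SEG algorithm itself. The function names, the window length of 12, and the 2.2-bit cutoff are illustrative assumptions.

```python
from collections import Counter
from math import log2


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def low_complexity_regions(seq, window=12, max_entropy=2.2):
    """Return merged (start, end) regions, 0-based and end-exclusive,
    covered by at least one window with entropy <= max_entropy."""
    seq = seq.upper()
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= max_entropy:
            for i in range(start, start + window):
                flagged[i] = True

    # Merge consecutive flagged positions into intervals.
    regions = []
    run_start = None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and run_start is None:
            run_start = i
        elif not is_flagged and run_start is not None:
            regions.append((run_start, i))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(seq)))
    return regions


if __name__ == "__main__":
    # Toy sequence with a poly-glutamine run between two diverse flanks.
    toy = "MKTAYIAKQRDEGHLNWSVF" + "Q" * 15 + "ACDEFGHIKLMNPQRSTVWY"
    print(low_complexity_regions(toy))
```

Because whole windows are flagged, the reported region extends a few residues beyond the repeat on each side. For real work, use SEG rather than this sketch.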

If the answer to any of the above questions is yes, it is worth breaking your sequence into pieces, or ignoring the affected sections, before continuing. This is related to the problem of locating domains.
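
One simple way to break a sequence into pieces is to cut out the flagged regions and keep what lies between them. The Python sketch below assumes the regions are supplied as 0-based, end-exclusive (start, end) pairs collected from whichever predictors you ran. The function name and the minimum piece length of 30 residues are hypothetical choices for illustration.

```python
def split_around_regions(seq, regions, min_length=30):
    """Return (start, end, subsequence) for each stretch of seq that lies
    outside all regions and is at least min_length residues long.

    regions is a list of 0-based, end-exclusive (start, end) pairs;
    overlapping or unsorted regions are handled.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(regions):
        if start > cursor:
            pieces.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < len(seq):
        pieces.append((cursor, len(seq)))
    return [(s, e, seq[s:e]) for s, e in pieces if e - s >= min_length]


if __name__ == "__main__":
    # Toy sequence: a flagged 20-residue stretch between two longer pieces.
    toy = "A" * 40 + "Q" * 20 + "C" * 50
    for start, end, piece in split_around_regions(toy, [(40, 60)]):
        print(start, end, len(piece))
```

The minimum length filter drops short leftovers that are unlikely to be useful on their own. Cutting purely on predicted region boundaries can split a real domain, so check the resulting pieces against whatever domain information you have.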
