Using the ppi_analysis.py script

Getting and using the code

Download the script and the unit tests from here:

http://www.caporaso.us/intermolecular_coevolution/ppi_analysis.zip

This code is built on PyCogent, the Python bioinformatics toolkit, and requires a working PyCogent installation on your system. I recommend grabbing the latest version from the Sourceforge svn repository. (Notes on installing PyCogent are below.)

After installing PyCogent and unzipping the above file, you should be able to change to the ppi_analysis directory and successfully run the test code with the following command:

python test_ppi_analysis.py

To get usage information and examples for the ppi_analysis.py script, run:

python ppi_analysis.py -h

Data preparation

First, the sequence identifiers must contain information on how they should be paired between proteins. For example, if you have fasta files for your two different proteins, sequence identifier lines might look like:

>human+protein1
>chimp+protein1
>mouse+protein1

>human+protein2
>chimp+protein2
>mouse+protein2
>rat+protein2

These would be the identifiers in the fasta files, which are the lines beginning with '>'. The script then knows how to match the (e.g.) human sequences with one another. Note that the two sets of sequence identifiers don't need to overlap perfectly -- in this example there is an extra sequence (rat) in the protein2 collection.
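
To make the pairing idea concrete, here is a minimal Python sketch. It is not the code ppi_analysis.py itself uses. The function names (species_of, paired_species) are hypothetical, and it assumes, based only on the example identifiers above, that the part before the '+' is the label used for pairing.

```python
def species_of(identifier, delimiter="+"):
    """Return the part of a fasta identifier before the delimiter.

    Assumption: the pairing label comes first, as in '>human+protein1'.
    """
    return identifier.lstrip(">").split(delimiter, 1)[0]


def paired_species(ids1, ids2, delimiter="+"):
    """Return the sorted labels that appear in both identifier lists."""
    species1 = {species_of(i, delimiter) for i in ids1}
    species2 = {species_of(i, delimiter) for i in ids2}
    return sorted(species1 & species2)


protein1_ids = [">human+protein1", ">chimp+protein1", ">mouse+protein1"]
protein2_ids = [
    ">human+protein2",
    ">chimp+protein2",
    ">mouse+protein2",
    ">rat+protein2",
]

# rat has no partner in the protein1 collection, so it is left out.
shared = paired_species(protein1_ids, protein2_ids)
print(shared)  # ['chimp', 'human', 'mouse']
```

Running something like this on your own identifier lines before an analysis is a quick way to see how many sequences will remain after pairing.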

Next, I typically recommend at least 30 or 40 sequences in each coevolutionary analysis, after sequences have been paired. This is controlled via the -n parameter. Aligned sequences should have a maximum of about 95% sequence identity.
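
If you want to check these two recommendations before running an analysis, something like the following sketch may help. It is independent of ppi_analysis.py and PyCogent. The function names are hypothetical, and it assumes one particular reading of the 95% guideline: that no pair of aligned sequences should be more than about 95% identical. Columns where both sequences have a gap are ignored, which is also an assumption.

```python
from itertools import combinations


def percent_identity(seq1, seq2):
    """Percent of aligned columns where the two sequences match.

    Columns where both sequences have a gap ('-') are skipped.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must be the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq1, seq2):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def max_pairwise_identity(alignment):
    """Highest percent identity over all pairs in a {name: sequence} dict."""
    pairs = combinations(alignment.values(), 2)
    return max((percent_identity(s1, s2) for s1, s2 in pairs), default=0.0)


def check_alignment(alignment, min_seqs=30, max_identity=95.0):
    """Return a list of warnings; an empty list means both checks passed."""
    warnings = []
    if len(alignment) < min_seqs:
        warnings.append("only %d sequences (want >= %d)" % (len(alignment), min_seqs))
    if max_pairwise_identity(alignment) > max_identity:
        warnings.append("some sequences are more than %.0f%% identical" % max_identity)
    return warnings


# Toy alignment: far too few sequences, and human/chimp are identical.
toy = {"human": "ACDEFGHIKL", "chimp": "ACDEFGHIKL", "mouse": "ACDEFGHIKV"}
for warning in check_alignment(toy):
    print(warning)
```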

Understanding the output

The output will be csv files containing coevolution matrices. The order of the data in each coevolution matrix is defined in the naming of the output file, so if your output file was called:

protein1_protein2...

then the columns of the matrix correspond to protein1 positions, and the rows to protein2 positions. So, if your alignments were:

>chimp+protein1
AC
>mouse+protein1
AD

and

>chimp+protein2
EFW
>mouse+protein2
EGY

your output matrix would look like:

[[ 0.  0.]
 [ 0.  1.]
 [ 0.  1.]]

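If there is ever any confusion about this, you can compare the lengths of the input alignments to the shape of the matrix. In the example above, protein1 has 2 positions and protein2 has 3, and the matrix has 3 rows and 2 columns. (This check will work, except in the rare case when both of your alignments happen to be the same length.)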
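
The shape comparison can be written as a small Python sketch. The function name matrix_orientation is hypothetical, and the matrix is assumed to have already been loaded as a list of rows; how you read your csv file into that form is up to you.

```python
def matrix_orientation(matrix, protein1_length, protein2_length):
    """Describe which protein the rows and columns of a matrix correspond to.

    matrix is a list of rows; the lengths are the alignment lengths
    (number of positions) of the two input alignments.
    """
    rows = len(matrix)
    cols = len(matrix[0]) if matrix else 0
    shape = (rows, cols)
    if shape not in {
        (protein2_length, protein1_length),
        (protein1_length, protein2_length),
    }:
        return "mismatch"
    if protein1_length == protein2_length:
        # Square matrix: the shape alone cannot settle the orientation.
        return "ambiguous"
    if shape == (protein2_length, protein1_length):
        return "rows=protein2, columns=protein1"
    return "rows=protein1, columns=protein2"


# The example above: protein1 has 2 positions (AC/AD), protein2 has 3 (EFW/EGY).
example = [[0.0, 0.0],
           [0.0, 1.0],
           [0.0, 1.0]]
print(matrix_orientation(example, 2, 3))  # rows=protein2, columns=protein1
```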

Pointers to installation notes and relevant references

You can find the Python bioinformatics toolkit, PyCogent, here:

http://pycogent.sourceforge.net

and the PyCogent paper here:

http://genomebiology.com/2007/8/8/R171

Installation instructions for PyCogent are here:

http://pycogent.sourceforge.net/install.html

My paper on comparing coevolution algorithms might also be of interest:

http://www.biomedcentral.com/content/pdf/1471-2148-8-327.pdf

Chapters 5, 6, and 7 of my dissertation, which are otherwise unpublished, may also be of interest:

http://www.caporaso.us/jg_caporaso_thesis.pdf

The coevolution module in PyCogent is at:

cogent/evolve/coevolution.py

If you change to the cogent/evolve directory, you can get usage information for this script with the command:

python coevolution.py -h

This script is useful for intramolecular coevolutionary analysis.

Citing this code

Detecting coevolution without phylogenetic trees? Tree-ignorant metrics of coevolution perform as well as tree-aware metrics. J. Gregory Caporaso, Sandra Smit, Brett C. Easton, Lawrence Hunter, Gavin A. Huttley, and Rob Knight. BMC Evolutionary Biology, December, 2008.

PyCogent: a toolkit for making sense from sequence; Rob Knight, Peter Maxwell, Amanda Birmingham, Jason Carnes, J. Gregory Caporaso, Brett C. Easton, Michael Eaton, Micah Hamady, Helen Lindsay, Zongzhi Liu, Catherine Lozupone, Daniel McDonald, Michael Robeson, Raymon Sammut, Sandra Smit, Matthew J. Wakefield, Jeremy Widmann, Shandy Wikman, Stephanie Wilson, Hua Ying, and Gavin A. Huttley; Genome Biology 2007, 8:R171; doi:10.1186/gb-2007-8-8-r171.

Greg Caporaso

gregcaporaso@gmail.com