Dear Anand,
First, some general guidelines for running the PseudoPipe pipeline on your local machine.
Because the PseudoPipe pipeline was originally designed and automated to work with Ensembl data, some manual settings are required to run it with other input data.
Attached is an archive that contains the pipeline and a small try-out data set.
===
After extraction, there are three folders within a parent directory “pgenes”:
===
– pseudopipe: pipeline code;
– ppipe_input: input data;
– ppipe_output: output data.
===
Input data:
===
You may create a separate folder within ppipe_input (and ppipe_output) for each species. The genomic input data for each species needs three folders:
– dna: contains a file named dna_rm.fa, which is the entire repeat-masked DNA of that species, and a set of FASTA files holding the unmasked DNA, one file per chromosome;
– pep: contains a FASTA file for all the proteins in the species;
– mysql: contains a list of files named “chr1_exLocs”, “chr2_exLocs”, etc., one for each chromosome, that specify exon coordinates. The only thing that matters in these files is the third and fourth columns, which should be the start and end coordinates of the exons.
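In case you need to produce the exon files yourself, here is a minimal Python sketch of one way to do it. It is not part of the pipeline. The function names, the tab delimiter, and the placeholder values written in columns 1 and 2 are all my assumptions for illustration; the only thing taken from the description above is that columns 3 and 4 hold the exon start and end and that there is one “chr<name>_exLocs” file per chromosome.

```python
from collections import defaultdict
from pathlib import Path


def exlocs_lines(exons):
    """Group (chromosome, start, end) tuples into per-chromosome text lines.

    Columns 1 and 2 are placeholders (an assumption); columns 3 and 4
    carry the exon start and end coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[str(chrom)].append((int(start), int(end)))
    lines = {}
    for chrom, spans in by_chrom.items():
        lines[chrom] = [
            f"exon{i}\t{chrom}\t{start}\t{end}"
            for i, (start, end) in enumerate(sorted(spans), 1)
        ]
    return lines


def write_exlocs(exons, out_dir):
    """Write one chr<name>_exLocs file per chromosome into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for chrom, rows in exlocs_lines(exons).items():
        path = out_dir / f"chr{chrom}_exLocs"
        path.write_text("\n".join(rows) + "\n")
        written.append(path)
    return written
```

For example, write_exlocs([("I", 100, 250), ("II", 40, 90)], "mysql") would create mysql/chrI_exLocs and mysql/chrII_exLocs. Please compare the result against the try-out files in the archive before relying on it.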
===
Environment setting:
===
You’ll need Python, BLAST and tfasty to run the pipeline. Their paths should be set at the end of /pseudopipe/bin/env.sh.
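If you are not sure where these tools are installed, a small Python sketch like the one below can look them up on your PATH so you know what to put into env.sh. The function name is hypothetical, and the executable names in the usage note are assumptions: the actual binary names depend on the BLAST and FASTA versions you have installed.

```python
import shutil


def locate_tools(names):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}
```

Calling locate_tools(["python", "tblastn", "tfasty"]) returns a dictionary in which any entry set to None marks a tool that still needs to be installed or added to your PATH.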
===
Run the pipeline
===
First go to the folder pseudopipe/bin, and run the pipeline with a command line of the form: ./pseudopipe.sh [output dir] [masked dna dir] [input dna dir] [input pep dir] [exon dir] 0
An example using the try-out data is as follows:
./pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
(This command line assumes you extracted the archive in your home directory, i.e., “~/”. Please note that the paths in the command line need to be absolute, and that the chromosome and exon files are specified with the wildcard “%s”.)
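Since the command line is long and the paths must be absolute, you may find it easier to assemble it with a short script. The Python sketch below is only an illustration: the function name and its parameters are hypothetical, and it assumes the folder layout described above (ppipe_input/<species>/dna, pep and mysql, with exon files named chr%s_exLocs).

```python
import os


def build_pseudopipe_command(base_dir, species, dna_pattern, pep_file):
    """Return the pseudopipe.sh argument list with absolute paths.

    base_dir is the extracted "pgenes" directory; dna_pattern is the
    per-chromosome FASTA file name containing the "%s" wildcard.
    """
    base = os.path.abspath(os.path.expanduser(base_dir))
    in_dir = os.path.join(base, "ppipe_input", species)
    out_dir = os.path.join(base, "ppipe_output", species)
    return [
        "./pseudopipe.sh",
        out_dir,                                      # output dir
        os.path.join(in_dir, "dna", "dna_rm.fa"),     # repeat-masked DNA
        os.path.join(in_dir, "dna", dna_pattern),     # unmasked DNA per chromosome
        os.path.join(in_dir, "pep", pep_file),        # protein FASTA
        os.path.join(in_dir, "mysql", "chr%s_exLocs"),  # exon coordinates
        "0",
    ]
```

For the try-out data you would call it with "~/pgenes", "caenorhabditis_elegans_62_220a", "Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa" and "Caenorhabditis_elegans.WS220.62.pep.fa". The resulting list can be passed to subprocess.run with cwd set to the pseudopipe/bin folder; because it is a list and not a shell string, the “%s” wildcards are passed through unchanged.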
The blast step is already included in the pipeline.
===
Output:
===
The output can be found at ppipe_output/caenorhabditis_elegans_62_220a/pgenes/ppipe_output_pgenes.txt
===
Run time
===
On a single laptop (2.6GHz, 4GB RAM): The most time-consuming step is tblastn. It may take around one day to finish an entire genome of a size comparable to C. elegans. The subsequent steps will finish in a few hours.
We’ve implemented the pipeline to run in parallel on cluster machines. However, the pipeline I sent can only run on a single machine. The parallel implementation is currently hard-coded to our local settings.
Some specific answers to your questions:
Q:
I am ready to run tBLASTn of proteome versus genome. I can repeat mask the genome during the tBLASTn run itself, would that be OK?
A:
You don’t have to run tBLASTn yourself, since it is already integrated into the pipeline. In Ensembl, the genomes are repeat-masked with RepeatMasker, and that is the input data the pipeline currently uses. I would assume any reasonable repeat-masking algorithm is fine.
Q:
For the tBLASTn, instead of using the entire genome, can I use the genome that is ‘masked’ for entire genes (not just exons). Based on gff info, I have converted the genic regions (not just exons) into stretches of Ns. Would this ‘masked’ genome be a good input for my tBLASTn?
A:
You don’t need to do that, since the pipeline will remove BLAST hits that significantly overlap known gene exons (> 30 bp overlap). Also, manually masking entire gene sequences may be problematic, since in some species we do find pseudogenes that partly overlap gene annotations.
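To make the overlap rule concrete, here is a minimal Python sketch of that kind of filter. It is not the pipeline's actual code. It assumes 1-based inclusive coordinates, hits and exons on the same chromosome, and that the 30 bp threshold is applied per exon; the function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def filter_hits(hits, exons, max_overlap=30):
    """Keep hits that overlap no exon by more than max_overlap bases."""
    return [
        (start, end)
        for start, end in hits
        if all(
            overlap_bp(start, end, ex_start, ex_end) <= max_overlap
            for ex_start, ex_end in exons
        )
    ]
```

With a single exon at 150 to 300, a hit at 100 to 200 shares 51 bases with it and is dropped, while a hit at 280 to 320 shares only 21 bases and is kept. Masking whole genes with Ns would remove that second kind of hit before the pipeline ever sees it.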
Q:
You mention in your paper that you use bite-sized portions of your proteome as query for your BLAST search. Does that mean I should chop up my proteome into peptides x amino acids in length? Is that x >= 10?
A:
No need to do that. You can keep the whole protein sequences in the input FASTA file.
Q:
Is there a latest README or even a User Guide for PseudoPipe that you can share with me?
A:
Unfortunately, we don’t have a user-friendly README file for the entire pipeline, especially for running it in an environment different from ours. I hope this email helps you set up and run the pipeline on your machine. You can also find some comments in each individual pipeline script file.
Please feel free to let me know if you need further assistance.
Best,
Baikang