Pseudogene Prediction Pipeline Question

Q:
I am unsure whether you are the correct person to contact with my question;
if you are not, could you direct me to someone who might be? I am
currently a master’s student at Ghent University working on my master’s thesis, and
I am trying to use different pipelines to predict pseudogenes.

So far, I have been able to successfully use "Shiu’s Pipeline", and I am
interested in using the pipeline I found on "http://pseudogene.org/main.php"
too. However, while trying to use it, I ran into some problems, which I was
hoping you (or someone from your lab) could help me solve. I’ll
briefly explain what I’m trying to do and what the problem is.

In my research I’m trying to find/study pseudogenes from a certain
Whole-Genome Duplication in Populus trichocarpa. Step 1 and 2 of the
pipeline (as in the README file) have proven to be successful (Note: the
README file apparently searches for a ‘splitXXXXOut’ pattern, which my file
names don’t contain, as I didn’t split the proteome file into chunks).

I believe I have a problem with step 3 of the pipeline, the masking step.
The pipeline apparently has three masking options (no masking, intron masking
and gene masking), but when I look into the script file
(extractKPExonLocations.py), it seems that only option 3 is available (there
seems to be no code for the other options), while I would like to use option 1,
no masking.
As I couldn’t find a way to perform option 1, no masking, I tried masking
the genome anyway, but 2 problems arose:
1) the data I have been using so far comes from Phytozome.org and not
Ensembl plants, which doesn’t have the necessary files for step 2
2) I can’t recreate the files necessary for step 3, masking, as
Ensembl no longer (?) provides "translation_stable_id.txt" files. Were they
perhaps replaced by the "translation_attrib.txt" files?

Because of these issues, and because I didn’t want to mask the genome in
the first place (I can filter the results afterwards anyway), I tried
skipping step 3 and continuing with the pipeline to see if it would work
anyway. However, as the following steps require "Location of maskt files
(see Step 3) above)" and "The columns in the mask file that provide start
and stop data (0-based)", my attempts have been fruitless.

Now that you (hopefully) understand the problems I encountered, I was
hoping that you, or someone else, could help me. In
particular, I would like to know whether it is possible to run the pipeline
without masking. As I said previously, I didn’t see a way to do
so, but I am still relatively new to bioinformatics, so I might be mistaken.
In addition, null exon data sets are required (are these empty
files?). If running without masking isn’t possible, could you tell me what
kind of information is stored in "translation_attrib.txt", so I could try to
recreate these files with Phytozome.org data?

I hope that you or someone else can help me with this problem, or point me
in the right direction. I know this was a long read and hopefully I have
explained myself well enough.

A:
Could you please send me all your commands, line by line, and the errors you encountered so that we can help you?
Briefly, in reply to your queries:
- If you want to use your own custom data rather than data from Ensembl, you will need to format the files and create all the input files the pipeline requires. To do this, download our example file and look at the input files provided in the ppipe_input folder.
- You do not need to use a masked genome; you can simply point the pipeline to the unmasked version, and it works exactly the same.
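
As a rough illustration of the reformatting step for custom data, here is a minimal Python sketch, assuming the Phytozome annotation is available as a GFF3 file. It collects the start and end coordinates of CDS features per chromosome and prints them as tab-separated text. The function names, the choice of CDS features, and the two-column layout are assumptions made for illustration, not the pipeline's documented input format; compare the result against the example files in the ppipe_input folder and adjust the columns and coordinate base to match.

```python
from collections import defaultdict


def intervals_by_chromosome(gff3_lines, feature_type="CDS"):
    """Group (start, end) pairs of one GFF3 feature type by sequence ID.

    GFF3 coordinates are 1-based and inclusive; they are returned unchanged.
    """
    intervals = defaultdict(list)
    for line in gff3_lines:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A GFF3 feature line has exactly nine tab-separated columns.
        if len(fields) != 9 or fields[2] != feature_type:
            continue
        intervals[fields[0]].append((int(fields[3]), int(fields[4])))
    # Sort each chromosome's intervals by start position.
    return {seqid: sorted(pairs) for seqid, pairs in intervals.items()}


def format_table(pairs):
    """Render intervals as tab-separated 'start<TAB>end' lines.

    Hypothetical layout: check the example input files for the real one.
    """
    return "".join(f"{start}\t{end}\n" for start, end in pairs)


if __name__ == "__main__":
    # Tiny made-up annotation, only to show the call pattern.
    sample = [
        "##gff-version 3\n",
        "Chr01\tphytozome\tgene\t100\t900\t.\t+\t.\tID=g1\n",
        "Chr01\tphytozome\tCDS\t500\t700\t.\t+\t0\tID=c2\n",
        "Chr01\tphytozome\tCDS\t100\t300\t.\t+\t0\tID=c1\n",
    ]
    for seqid, pairs in intervals_by_chromosome(sample).items():
        print(seqid)
        print(format_table(pairs), end="")
```

For a real annotation, the lines would come from an open file handle instead of the in-memory list, and each chromosome's table would be written to its own file under whatever naming scheme the example inputs use.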
