pseudogenes in PseudoPipe

Q:
The pseudogene databases, including Pseudofam and PseudoPipe, have been extremely helpful for a project I am working on. I was wondering whether it is possible to compare the DNA sequence of a human gene with all the pseudogenes in the PseudoPipe resources. I am looking to identify pseudogenes that may be related to the genes I am working with. I was hoping there was a way to obtain this information by using BLAST to compare the DNA sequence of a specific gene with the sequences of all the pseudogenes in the genome, similar to the NCBI BLAST or UniProt BLAST feature.

Any help or insight would be appreciated.

A:
If you have many genes to query, you may be able to use BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to build your own tool. You can download the sequences of all pseudogenes, make a BLAST database from them (https://www.ncbi.nlm.nih.gov/books/NBK279688/), and then query that database.
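
As a rough sketch of that workflow, the Python snippet below assembles the two BLAST+ command lines: `makeblastdb` to index a pseudogene FASTA file and `blastn` to search it with a gene sequence. The file names (`pseudogenes.fa`, `my_gene.fa`, `hits.tsv`), the database name and the e-value cutoff are placeholders chosen for illustration, not files or settings provided by pseudogene.org.

```python
def makeblastdb_cmd(fasta_path, db_name):
    """Build the makeblastdb command that indexes a nucleotide FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query_path, db_name, out_path, evalue=1e-10):
    """Build a blastn command that writes tabular (outfmt 6) hits."""
    return [
        "blastn",
        "-query", query_path,
        "-db", db_name,
        "-evalue", str(evalue),  # illustrative cutoff, tune for your data
        "-outfmt", "6",          # tab-separated, one hit per line
        "-out", out_path,
    ]


if __name__ == "__main__":
    # Placeholder file names: substitute your own downloads.
    for cmd in (
        makeblastdb_cmd("pseudogenes.fa", "pseudogene_db"),
        blastn_cmd("my_gene.fa", "pseudogene_db", "hits.tsv"),
    ):
        print(" ".join(cmd))
```

The script only prints the commands. With BLAST+ installed, each list can be passed to `subprocess.run(cmd, check=True)`, and a multi-sequence query FASTA lets one `blastn` call cover many genes.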

Asking for data used in finding processed pseudogenes in the human genome

Q:
Recently, I was reading one of your papers about finding processed pseudogenes, published in 2003: "Millions of Years of Evolution Preserved: A Comprehensive Catalog of the Processed Pseudogenes in the Human Genome". I want to find processed pseudogenes in several recently released mammalian genomes, so your paper is very interesting and helpful for my work. To make sure I have understood the method correctly, I want to use your original data to repeat your analysis.

However, I ran into a problem when trying to download the nonredundant human proteome set from the EBI website: the data was released in June 2002, and I can no longer download it from there. I am writing in the hope of obtaining the nonredundant human proteome set from June 2002 that you used. I know many years have passed since the paper was published and you may no longer have the original data, but I still wanted to try!

A:
The data associated with the paper is here: http://pseudogene.org/human-all/index.html. You can also find the latest human pseudogene annotation here: http://pseudogene.org/Human/

Regarding obtaining pseudogene data

Q:
Can you please help me to get pseudogene information for human, mouse, rat, Drosophila and C. elegans? I need separate FASTA or .bed files with the pseudogene annotations for each of these five species.

A:
See pseudogene.org. For any information regarding the pseudogene annotation in human, mouse, Drosophila and C. elegans, please see:
http://www.pseudogene.org/psicube/
and
http://www.pseudogene.org/Mouse/

PseudoPipe

Q:
We are interested in using PseudoPipe for
identifying pseudogenes. I downloaded the software, but the program requires
an older version of the BLAST software, including blastall and formatdb. Both
programs have been replaced in newer versions of BLAST and are no longer
available to download from the NCBI website. I am wondering if you could change
PseudoPipe to accommodate the new version of BLAST.

A:
Thank you for your suggestion. In the meantime, the required versions of FASTA and BLAST are freely available online. For convenience, we provide links to the two packages on the website http://pseudogene.org/pseudopipe/.

Pseudogene Prediction Pipeline Question

Q:
I am unsure if you are the correct person to contact with my question, so
if you are not, could you direct me to someone who might be? I am
currently a master’s student at Ghent University working on my master’s thesis, and
I am trying to use different pipelines to predict pseudogenes.

So far, I have been able to successfully use "Shiu’s Pipeline" and I am
interested in using the pipeline I found on "http://pseudogene.org/main.php"
too. However, whilst trying to use it, I ran into some problems, which I was
hoping you (or someone from your lab) could help me solve. I’ll
briefly try to explain what I’m trying to do and what the problem is.

In my research I’m trying to find/study pseudogenes from a certain
Whole-Genome Duplication in Populus trichocarpa. Step 1 and 2 of the
pipeline (as in the README file) have proven to be successful (Note: the
README file apparently searches for a ‘splitXXXXOut’ pattern, which my file
names don’t contain, as I didn’t split the proteome file into chunks).

I believe I have a problem with the pipeline in step 3, the masking step.
The pipeline has apparently 3 options (no masking, intron masking and gene
masking) for masking, but when looking into the script file
(extractKPExonLocations.py), it seems that only option 3 is available (there
seems to be no code for other options), while I would like to use option 1,
no masking.
As I couldn’t find a way to perform option 1, no masking, I tried masking
the genome anyway, but 2 problems arose:
1) the data I have been using so far comes from Phytozome.org and not
Ensembl plants, which doesn’t have the necessary files for step 2
2) I can’t try to recreate the files necessary for step 3, masking, as
Ensembl (no longer?) provides "translation_stable_id.txt" files. Were they
perhaps replaced by the "translation_attrib.txt" files?

Because of these issues, and because I didn’t want to mask the genome in
the first place (I can filter the results afterwards anyway), I tried
skipping step 3 and continuing with the pipeline, to see if it would work
anyway. However, as the following steps require "Location of maskt files
(see Step 3) above)" and "The columns in the mask file that provide start
and stop data (0-based)", my results have been fruitless.

Now that you (hopefully) understand the problems I
encountered, I was hoping that you, or someone else, could help me. In
particular I would like to know if it is possible to run the pipeline
without masking. As I said previously, I didn’t see the possibility to do
so, but I am still relatively new to bioinformatics, so I might be mistaken.
In addition to this, null exon data sets are required (which are empty
files?). In case this isn’t possible, would it be possible to tell me what
kind of information is stored in "translation_attrib.txt", so I could try to
recreate these files with Phytozome.org data?

I hope that you or someone else can help me with this problem, or point me
in the right direction. I know this was a long read and hopefully I have
explained myself well enough.

A:
Could you please send me all your commands, line by line, and the errors you encountered so we can help you?
Briefly, in reply to your queries:
* If you want to use your own custom data rather than data from Ensembl, you will need to format the files and create all the input files the pipeline requires. To do this, download our example file and look at the input files in the ppipe_input folder.
* You do not need to use a masked genome. You can simply point the pipeline at the unmasked version; it works exactly the same way.

Pseudogenes – PseudoPipe

Q:
I am trying to find SNPs in
pseudogenes, but the SNP database is built on different genome
assemblies than the pseudogene predictions from PseudoPipe. Do you have
current PseudoPipe pseudogene predictions for eukaryotic genomes? Or is there
a way to remap the genome assemblies used by PseudoPipe to a different
assembly?
If I want to use PseudoPipe, where in Ensembl can I find the input data set?

A:
Regarding your questions, there are a number of things that you can do:
* if you are interested in the human or mouse genome, pseudogene annotations for the latest assemblies (GRCh38 and GRCm38) are available from the pseudogene.org webpage; see http://www.pseudogene.org/Human/Human90.txt and http://mouse.pseudogene.org/data/Reference/Mus_musculus.GRCm38.87_pgene.txt respectively.
* the latest annotations for the worm and fly genomes are available here:
http://pseudogene.org/psicube
* if you are interested in other eukaryotic genomes whose annotation was built on older assemblies, one option is to lift the annotation over from the old assembly to a newer one. This can easily be done using the UCSC Genome Browser resource https://genome.ucsc.edu/cgi-bin/hgLiftOver. However, I would strongly advise running PseudoPipe on your own machine, because improvements in the assembly and in the protein-coding annotation will considerably improve the resulting pseudogene annotation. You can download and run PseudoPipe as described here: http://pseudogene.org/pseudopipe/
* using the “fetch file” as described here http://pseudogene.org/pseudopipe/ will also automatically download all the necessary data for you from the Ensembl server.
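
For the lift-over route, the annotation first has to be in BED format, which uses 0-based start coordinates and exclusive end coordinates. The sketch below converts tab-delimited records into six-column BED lines. It assumes 1-based inclusive input coordinates and a hypothetical column order (chromosome, start, end, strand, ID), so check the header of the file you actually download and adjust the column indices accordingly.

```python
def to_bed_line(chrom, start, end, name, strand):
    """Convert a 1-based inclusive interval to a 6-column BED line."""
    start, end = int(start), int(end)
    if start < 1 or end < start:
        raise ValueError(f"bad interval: {start}-{end}")
    # BED starts are 0-based and ends are exclusive, so only the start shifts.
    return "\t".join([chrom, str(start - 1), str(end), name, "0", strand])


def table_to_bed(lines, chrom_col=0, start_col=1, end_col=2,
                 strand_col=3, name_col=4):
    """Yield BED lines from tab-delimited rows.

    The default column indices are assumptions for illustration only.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        if not fields[start_col].isdigit():
            continue  # skip a header row
        yield to_bed_line(fields[chrom_col], fields[start_col],
                          fields[end_col], fields[name_col],
                          fields[strand_col])


if __name__ == "__main__":
    # Made-up example rows, not real pseudogene records.
    rows = ["chr\tstart\tend\tstrand\tid", "chr1\t100\t250\t+\tPG0001"]
    for bed in table_to_bed(rows):
        print(bed)
```

UCSC tools expect chromosome names such as `chr1`, so Ensembl-style names (`1`, `X`) may need a `chr` prefix before the file is uploaded.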