Regarding obtaining pseudogene data

Q:
Can you please help me get pseudogene information for human, mouse, rat, Drosophila and C. elegans? I need separate FASTA or .bed files of pseudogene annotations for each of these five species.

A:
See pseudogene.org. For any information regarding the pseudogene annotation in human, mouse, Drosophila and C. elegans, please see:
http://www.pseudogene.org/psicube/
and
http://www.pseudogene.org/Mouse/

pseudoPipe

Q:
We are interested in using PseudoPipe for
identifying pseudogenes. I downloaded the software, but the program requires
an older version of the BLAST software, including blastall and formatdb. Both
programs have been replaced in the newer version of BLAST and are no longer
available to download from the NCBI website. I am wondering if you could change
PseudoPipe to accommodate the new version of BLAST.

A:
Thank you for your suggestion. In the meantime, you can find the correct versions of fasta and blast freely available online. To make this easier, we provide links to the two packages on the website http://pseudogene.org/pseudopipe/ .
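Before running the pipeline, it can help to confirm that the legacy programs named in the question (blastall and formatdb) are actually visible on your PATH. The following is a minimal sketch using only the Python standard library; the function name find_tools is made up for illustration and is not part of PseudoPipe.

```python
import shutil

# Legacy BLAST programs named in the question above.
LEGACY_BLAST_TOOLS = ("blastall", "formatdb")


def find_tools(names=LEGACY_BLAST_TOOLS):
    """Map each program name to its full path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If either program is reported as not found, install the older packages linked from the PseudoPipe page and add their directory to PATH before starting a run.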

Pseudogene Prediction Pipeline Question

Q:
I am unsure if you are the correct person to contact with my question, so
if you are not, could you direct me to someone who might be? I am
currently a master’s student at Ghent University doing my master’s thesis and
I am trying to use different pipelines to predict pseudogenes.

So far, I have been able to successfully use "Shiu’s Pipeline" and I am
interested in using the pipeline I found on "http://pseudogene.org/main.php"
too, but whilst trying to use it, I stumbled on some problems, which I was
hoping you (or someone from your lab) could help me solve. I’ll
briefly try to explain what I’m trying to do and what the problem is.

In my research I’m trying to find/study pseudogenes from a certain
Whole-Genome Duplication in Populus trichocarpa. Step 1 and 2 of the
pipeline (as in the README file) have proven to be successful (Note: the
README file apparently searches for a ‘splitXXXXOut’ pattern, which my file
names don’t contain, as I didn’t split the proteome file into chunks).

I believe I have a problem with the pipeline in step 3, the masking step.
The pipeline has apparently 3 options (no masking, intron masking and gene
masking) for masking, but when looking into the script file
(extractKPExonLocations.py), it seems that only option 3 is available (there
seems to be no code for other options), while I would like to use option 1,
no masking.
As I couldn’t find a way to perform option 1, no masking, I tried masking
the genome anyway, but 2 problems arose:
1) the data I have been using so far comes from Phytozome.org and not
Ensembl plants, which doesn’t have the necessary files for step 2
2) I can’t try to recreate the files necessary for step 3, masking, as
Ensembl (no longer?) provides "translation_stable_id.txt" files. Were they
perhaps replaced by the "translation_attrib.txt" files?

Because of these issues, and because I didn’t want to mask the genome in
the first place (I can filter the results afterwards anyway), I tried
skipping step 3 and continuing with the pipeline, to see if it would work
anyway. However, as the following steps require "Location of maskt files
(see Step 3) above)" and "The columns in the mask file that provide start
and stop data (0-based)", my results have been fruitless.

Now that you (hopefully) understand the problems I
encountered, I was hoping that you, or someone else, could help me. In
particular I would like to know if it is possible to run the pipeline
without masking. As I said previously, I didn’t see the possibility to do
so, but I am still relatively new to bioinformatics, so I might be mistaken.
In addition to this, null exon data sets are required (which are empty
files?). In case this isn’t possible, would it be possible to tell me what
kind of information is stored in "translation_attrib.txt", so I could try to
recreate these files with Phytozome.org data?

I hope that you or someone else can help me with this problem, or point me
in the right direction. I know this was a long read and hopefully I have
explained myself well enough.

A:
Could you please send me all your commands line by line, and the errors you encountered, so we can help you?
Briefly, in reply to your queries:
* If you want to use your own custom data, not from Ensembl, you will need to format the files and create all the input files required for the pipeline to run. To do this, download our example file and look at the input files provided in the ppipe_input folder.
* You do not need to use a masked genome. You can simply point the pipeline to the unmasked version; it works exactly the same.

Pseudogenes – PseudoPipe

Q:
I am trying to find SNPs in
pseudogenes, but the database for the SNPs is built for different genome
assemblies than the pseudogene predictions from PseudoPipe. Do you have
current PseudoPipe pseudogene predictions on eukaryotic genomes? Or is there
a way to remap the genome assemblies used by the pipeline to a different
assembly?
If I want to use PseudoPipe, where in Ensembl can I find the input data set?

A:
Regarding your questions there are a number of things that you can do:
* if you are interested in the human or mouse genome, pseudogene annotations for the latest assemblies (GRCh38 and GRCm38) are available from the pseudogene.org webpage; see http://www.pseudogene.org/Human/Human90.txt and http://mouse.pseudogene.org/data/Reference/Mus_musculus.GRCm38.87_pgene.txt respectively.
* the latest annotations for the worm and fly genomes are available here:
http://pseudogene.org/psicube
* if you are interested in other eukaryotic genomes whose annotation is built on older assemblies, one option is to lift over the annotation from the old assembly to a newer one. This can easily be done using the UCSC Genome Browser resource https://genome.ucsc.edu/cgi-bin/hgLiftOver. However, I would strongly advise running PseudoPipe on your own machine instead, because improvements in the assembly and in the protein-coding annotation will considerably improve the resulting pseudogene annotation. You can download and run PseudoPipe as described here: http://pseudogene.org/pseudopipe/
* also, using the “fetch file” as described at http://pseudogene.org/pseudopipe/ will automatically download all the necessary data for you from the Ensembl server.
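Once the SNPs and the pseudogene annotation are on the same assembly (by either route above), finding SNPs in pseudogenes reduces to an interval lookup. Below is a minimal sketch, assuming 0-based half-open coordinates as in BED files; the function names and the toy records are made up for illustration and do not reflect the column layout of the pseudogene.org files.

```python
from bisect import bisect_right
from collections import defaultdict


def index_intervals(pseudogenes):
    """Group (chrom, start, end, name) records by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in pseudogenes:
        by_chrom[chrom].append((start, end, name))
    for intervals in by_chrom.values():
        intervals.sort()
    return by_chrom


def snps_in_pseudogenes(snps, pseudogenes):
    """Return (chrom, pos, pseudogene_name) for each SNP inside a pseudogene.

    Intervals are 0-based half-open [start, end); SNP positions are 0-based.
    """
    by_chrom = index_intervals(pseudogenes)
    hits = []
    for chrom, pos in snps:
        intervals = by_chrom.get(chrom, [])
        # Only intervals that start at or before pos can contain it.
        upper = bisect_right(intervals, (pos, float("inf"), ""))
        for start, end, name in intervals[:upper]:
            if start <= pos < end:
                hits.append((chrom, pos, name))
    return hits


# Toy data (hypothetical): overlapping pseudogenes on chr1, one on chr2.
pseudogenes = [
    ("chr1", 100, 200, "pg_A"),
    ("chr1", 150, 300, "pg_B"),
    ("chr2", 10, 50, "pg_C"),
]
snps = [("chr1", 120), ("chr1", 180), ("chr1", 300), ("chr2", 49), ("chr3", 5)]
hits = snps_in_pseudogenes(snps, pseudogenes)
```

The scan over candidate intervals is linear per SNP, which is fine for a sketch; for genome-scale SNP sets an interval tree or a dedicated tool would be a better fit. If your SNP database uses 1-based positions, subtract 1 before the lookup.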

Unitary Pseudogene PROMOTER

Q:
I am trying to find an example of a unitary pseudogene whose
promoter is known to be mutated as well and therefore the gene is
definitely non-functional. I can find articles stating there are many
examples of unitary pseudogenes in humans (e.g. Vitamin C) but none
seem to mention the promoter. Any thoughts?

A:
Our analyses compiled a number of activity features associated with pseudogenes (e.g. transcription, presence of functional Pol2 and TF binding sites in the upstream region, presence of open chromatin) that are available online. Please see https://www.ncbi.nlm.nih.gov/pubmed/22951037 (http://pseudogene.org/psidr/ ) and https://www.ncbi.nlm.nih.gov/pubmed/25157146 (http://pseudogene.org/psicube/) for the functional characterisation of pseudogenes. In particular, the unitary pseudogenes that show no transcription and no Pol2 or TF binding sites are the ones to look at; for those, check whether or not the promoter region is conserved.
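The selection described above can be sketched as a simple filter. This is only an illustration: the record type and its field names are hypothetical and would need to be mapped onto the actual columns of whichever table you download.

```python
from dataclasses import dataclass


@dataclass
class PseudogeneActivity:
    # Hypothetical field names, not the real column names of any resource.
    pseudogene_id: str
    biotype: str          # e.g. "unitary", "processed", "duplicated"
    transcribed: bool
    pol2_bound: bool      # Pol2 binding site in the upstream region
    tf_bound: bool        # TF binding site in the upstream region


def inactive_unitary(records):
    """Return IDs of unitary pseudogenes with no transcription, Pol2 or TF binding."""
    return [
        r.pseudogene_id
        for r in records
        if r.biotype == "unitary"
        and not (r.transcribed or r.pol2_bound or r.tf_bound)
    ]


# Toy records (made up) to show the behaviour.
records = [
    PseudogeneActivity("pg1", "unitary", False, False, False),
    PseudogeneActivity("pg2", "unitary", True, False, False),
    PseudogeneActivity("pg3", "processed", False, False, False),
]
candidates = inactive_unitary(records)
```

The resulting candidates are only a starting list; promoter conservation still has to be checked separately for each one.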

Zebrafish pseudogenes

Q:
I have a question regarding zebrafish pseudogenes. I searched for a few
zebrafish genes to check if they have any pseudogenes in
pseudogene.org, and I found that there are 15779 zebrafish pseudogenes.
But the Nature reference that you mentioned in your blog
reports a total of 154 zebrafish pseudogenes! Could you please let me know how
one can see those 154 pseudogenes, if I want to know whether my genes
of interest have pseudogenes or not?

A:
Pseudogene.org provides a set of pseudogenes resulting from automatic annotation. Zebrafish is a peculiar genome: it has undergone numerous large-scale genome duplications and is therefore full of repeats. As a result, the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation, which yielded a much smaller number of pseudogenes. Continuous improvements in the genome annotation lead to further improvements in the pseudogene annotation. I attach here the latest set of zebrafish pseudogenes.

problem in PseudoPipe

Q:
I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as follows:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs
0

And I got the output in the attachment and the following lines on the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’:
No such file or directory
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’:
No such file or directory
gzip:
/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa:
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

A:
It looks like you have a permission problem. The script tries to source the file setenvPipelineVars, which you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export statements, and from the look of it I would guess that you do not have the rights to export to the PATH. I suggest you get admin rights and run as root.
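Before switching to root, it may be worth checking two things. First, whether your user can read setenvPipelineVars at all. Second, which shell /bin/sh points to: the "sh: 1:" prefix in the log suggests the command was run through sh, and source is a bash builtin that POSIX sh is not required to provide (dash, the usual /bin/sh on Debian and Ubuntu, does not have it). Whether either of these is the actual cause here is an assumption. The sketch below only gathers the facts; the function name is made up for illustration.

```python
import os


def diagnose_source_failure(env_file, shell="/bin/sh"):
    """Collect facts relevant to a failing `source <env_file>` call."""
    return {
        # Does the environment file exist, and can the current user read it?
        "file_exists": os.path.isfile(env_file),
        "file_readable": os.access(env_file, os.R_OK),
        # Follow symlinks to see which shell really runs the command;
        # on Debian/Ubuntu this usually ends in "dash".
        "shell_resolves_to": os.path.realpath(shell),
    }


if __name__ == "__main__":
    path = os.path.expanduser(
        "~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a"
        "/pgenes/minus/setenvPipelineVars"
    )
    for key, value in diagnose_source_failure(path).items():
        print(f"{key}: {value}")
```

If the file is readable and the shell resolves to something other than bash, the shell rather than file permissions is the more likely suspect.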