Human pseudogene annotation

Q:
In your recent Nature Communications report on mouse pseudogenes (https://doi.org/10.1038/s41467-020-17157-w), you stated that “For human, we used a similar workflow to refine the reference pseudogene annotation to a high-quality set of 14,650 pseudogenes.” Could you kindly share the chromosome coordinates of these 14,650 pseudogenes with me? I am investigating the distribution patterns of RNA-editing sites in the human genome, and I cannot find a good source of pseudogene definitions. The Pseudogene.org database is too old and is not based on GRCh38.

A:
In the paper we worked with the GENCODE consortium to refine the pseudogene annotation. Since publication, we have continued to improve the human pseudogene annotation using the combination of manual and automatic pipelines described in the paper. Attached are the coordinates for the complete set of pseudogenes.

For a definition of a pseudogene, I suggest you use our paper https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-9-r51 , which defines pseudogenes as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to disruptive mutations such as frameshifts and premature stop codons.

Request re pseudogene.org

Q:
We develop MH Guide, a genome-guided cancer treatment decision support software (https://www.molecularhealth.com/us/).

I was trying to get the current annotated pseudogene information via http://pseudogene.org/Human/. The link to GENCODE seems not to work and returns “file not found” (https://www.gencodegenes.org/releases/current.html).

Could you please direct me to the file with the annotated pseudogenes?

A:
If you are looking for the current GENCODE annotation release, please follow this link: https://www.gencodegenes.org/human/ . If you want to use PseudoPipe to create a custom pseudogene annotation for a genome sequence of your choice, please follow the instructions here: http://pseudogene.org/pseudopipe/ . If you are interested in the functional annotation of pseudogenes, including information on pseudogene activity, please see http://pseudogene.org/psicube/ .
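
For readers who only need pseudogene coordinates from the GENCODE release, the following is a minimal sketch (not part of the original reply) of how they could be pulled out of the GTF file. It assumes the standard GENCODE GTF layout, in which gene records carry a gene_type attribute such as "processed_pseudogene" or "unprocessed_pseudogene"; the function name pseudogene_bed_lines is made up for illustration.

```python
import re

# Matches quoted key/value pairs in the GTF attribute column,
# e.g. gene_type "processed_pseudogene";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def pseudogene_bed_lines(gtf_lines):
    """Yield BED6 lines for gene records whose gene_type contains 'pseudogene'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "pseudogene" not in attrs.get("gene_type", ""):
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        # GTF coordinates are 1-based inclusive; BED is 0-based half-open.
        yield "\t".join(
            [chrom, str(start - 1), str(end), attrs.get("gene_id", "."), "0", strand]
        )
```

To use it, open the downloaded annotation (for example with gzip.open(path, "rt") for a .gtf.gz file, where path is wherever you saved the release) and write the yielded lines to a .bed file.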

Query regarding Pseudopipe

Q:
Since I am working on pseudogene identification for my new project, I was using your pipeline. But I am getting a few errors, which I list below. Can you please help me to resolve these errors? I shall be very grateful to you.
>
> ERRORS:
> 1. On terminal:
> sudo bash pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
> Making directories
> Copying sequences
> Fomatting the DNAs
> Preparing the blast jobs
> Skipping blast
> Processing blast output
> Skipping the processing of blast output
> Running Pseudopipe on both strands
> Working on M strand
> Finished Pseudopipe on strand M
> Working on P strand
> Finished Pseudopipe on strand P
> Generating final results
> find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’: No such file or directory
> find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’: No such file or directory
> gzip: /home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa: No such file or directory
> Finished generating pgene full alignment
> Finished running Pseudopipe
> 2. In log file inside minus and plus folder:
> need to document overlap parameter (30) and dependency on mask array files.
> mask fields [2, 3]
> Traceback (most recent call last):
> File "/home/kashmir/SOFTWARE/pgenes/pseudopipe/core/filterEnsemblGene.py", line 60, in <module>
> maskFile = openOrFail(ExonMaskTemplate % chr, 'r')
> TypeError: not all arguments converted during string formatting
> running filterEnsemblGene.py
> failed during filterEnsemblGene.py stage.

A:
From the output it looks like you had a couple of issues, starting with the blast job.

Could you please check the blast/output folder in your output directory and see whether it contains any split000*.Out files (where * is a number)? If you don’t see any output files, it means that your blast job did not run. In order to run the pipeline you need to have a couple of additional software packages installed and preferably added to the path. Specifically, you will need blast-2.2.13 and fasta-35.1.5. If you do not want to add them to the path, you can add the path to their location in the env.sh file that you can find in the bin folder of PseudoPipe.

This should allow you to run the pipeline without any issues.
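
The two checks above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names are made up, the blast/output location and the split*.Out pattern come from the description above, and blastall and formatdb are the legacy BLAST executables mentioned elsewhere in this FAQ (the FASTA executable is not checked because its name is not stated here).

```python
import glob
import os
import shutil


def find_blast_outputs(output_dir):
    """Return the sorted split*.Out files under <output_dir>/blast/output."""
    pattern = os.path.join(
        os.path.expanduser(output_dir), "blast", "output", "split*.Out"
    )
    return sorted(glob.glob(pattern))


def missing_tools(tools=("blastall", "formatdb")):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

An empty list from find_blast_outputs suggests the blast step did not produce output; a non-empty list from missing_tools names the executables that still need to be installed, added to the path, or configured in env.sh.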

Regarding obtaining pseudogene data

Q:
Can you please help me to get pseudogene information for human, mouse, rat, Drosophila and C. elegans? I need separate FASTA or .bed files corresponding to the pseudogene annotations for each of these five species.

A:
Please see pseudogene.org. For any information regarding the pseudogene annotation in human, mouse, Drosophila and C. elegans, please see:
http://www.pseudogene.org/psicube/
and
http://www.pseudogene.org/Mouse/

PseudoPipe

Q:
We are interested in using PseudoPipe for
identifying pseudogenes. I downloaded the software, but the program requires
an older version of the blast software, including blastall and formatdb. Both
programs have been replaced in newer versions of the blast software, and are not
available to download from the NCBI website. I am wondering if you could change
PseudoPipe to accommodate the new version of blast.

A:
Thank you for your suggestion. In the meantime, you can find the correct versions of fasta and blast freely available online. For convenience, we provide a link to the two packages on the website http://pseudogene.org/pseudopipe/ .

Pseudogene Prediction Pipeline Question

Q:
I am unsure if you are the correct person to contact with my question, so
if you are not, could you direct me to someone who might be? I am
currently a master’s student at Ghent University doing my master’s thesis and
I am trying to use different pipelines to predict pseudogenes.

So far, I have been able to successfully use "Shiu’s Pipeline" and I am
interested in using the pipeline I found on "http://pseudogene.org/main.php"
too. But whilst trying to use it, I stumbled on some problems, which I was
hoping you (or someone from your lab) could help me solve. I’ll
briefly try to explain what I’m trying to do and what the problem is.

In my research I’m trying to find/study pseudogenes from a certain
Whole-Genome Duplication in Populus trichocarpa. Step 1 and 2 of the
pipeline (as in the README file) have proven to be successful (Note: the
README file apparently searches for a ‘splitXXXXOut’ pattern, which my file
names don’t contain, as I didn’t split the proteome file into chunks).

I believe I have a problem with the pipeline in step 3, the masking step.
The pipeline has apparently 3 options (no masking, intron masking and gene
masking) for masking, but when looking into the script file
(extractKPExonLocations.py), it seems that only option 3 is available (there
seems to be no code for other options), while I would like to use option 1,
no masking.
As I couldn’t find a way to perform option 1, no masking, I tried masking
the genome anyway, but 2 problems arose:
1) the data I have been using so far comes from Phytozome.org and not
Ensembl plants, which doesn’t have the necessary files for step 2
2) I can’t try to recreate the files necessary for step 3, masking, as
Ensembl (no longer?) provides "translation_stable_id.txt" files. Were they
perhaps replaced by the "translation_attrib.txt" files?

Because of these issues I encountered, and didn’t want to mask the genome in
the first place (I can filter the results afterwards anyway), I tried
skipping step 3 and continue with the pipeline, to see if it would work
anyway. However, as the following steps require "Location of maskt files
(see Step 3) above)" and "The columns in the mask file that provide start
and stop data (0-based)", my results have been fruitless.

Now that you (hopefully) understand what the problems are that I
encountered, I was hoping that you, or someone else, could help me. In
particular I would like to know if it is possible to run the pipeline
without masking. As I said previously, I didn’t see the possibility to do
so, but I am still relatively new to bioinformatics, so I might be mistaken.
In addition to this, null exon data sets are required (which are empty
files?). In case this isn’t possible, would it be possible to tell me what
kind of information is stored in "translation_attrib.txt", so I could try to
recreate these files with Phytozome.org data?

I hope that you or someone else can help me with this problem, or point me
in the right direction. I know this was a long read and hopefully I have
explained myself well enough.

A:
Could you please send me all your commands line by line and the errors you encountered so we can help you?
In short, replying to your queries:
* If you want to use your own custom data, not from Ensembl, you will need to format the files and create all the input files required for the pipeline to run. For this, download our example file and look at the input files provided in the ppipe_input folder.
* You do not need to use a masked genome; you can just point the pipeline to the unmasked version, and it works exactly the same.

Pseudogenes – PseudoPipe

Q:
I am trying to find SNPs in
pseudogenes, but the database for the SNPs is built for different genome
assemblies than the pseudogene predictions from PseudoPipe. Do you have
current PseudoPipe pseudogene predictions for eukaryotic genomes? Or is there
a way to remap the genome assemblies used by the pipeline to a different
assembly?
If I want to use PseudoPipe, where in Ensembl can I find the input data set?

A:
Regarding your questions, there are a number of things that you can do:
* if you are interested in the human or mouse genome, annotations for the latest assemblies (GRCh38 and GRCm38) are available from the pseudogene.org webpage; see http://www.pseudogene.org/Human/Human90.txt and http://mouse.pseudogene.org/data/Reference/Mus_musculus.GRCm38.87_pgene.txt respectively.
* the latest annotations for the worm and fly genomes are available from here:
http://pseudogene.org/psicube
* if you are interested in other eukaryotic genomes that have annotations built on older assemblies, one option is to lift over the annotation from the old assembly to a newer one. This can easily be done using the UCSC genome browser resource https://genome.ucsc.edu/cgi-bin/hgLiftOver . However, I would strongly advise you to run PseudoPipe on your own machine, because improvements in the assembly and in the protein-coding annotation will considerably improve the resulting pseudogene annotation. You can download and run PseudoPipe as described here: http://pseudogene.org/pseudopipe/
* also, using the “fetch file” as described here http://pseudogene.org/pseudopipe/ will automatically download all the necessary data for you from the Ensembl server.
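
For larger files, the liftover option above can also be scripted. The sketch below is an assumption-laden illustration rather than part of the original reply: it supposes that the UCSC command-line liftOver utility (the standalone counterpart of the hgLiftOver web tool) is installed and that a chain file for the two assemblies has been downloaded from UCSC. The function names and all file names are hypothetical.

```python
import shutil
import subprocess


def liftover_command(old_bed, chain_file, new_bed, unmapped_bed):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", old_bed, chain_file, new_bed, unmapped_bed]


def run_liftover(old_bed, chain_file, new_bed, unmapped_bed):
    """Run liftOver, failing early if the executable is not on PATH."""
    if shutil.which("liftOver") is None:
        raise FileNotFoundError("liftOver executable not found on PATH")
    subprocess.run(
        liftover_command(old_bed, chain_file, new_bed, unmapped_bed), check=True
    )
```

Records that cannot be mapped to the new assembly are written to the unmapped file, so it is worth checking how many pseudogenes end up there before using the lifted coordinates.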

Unitary Pseudogene PROMOTER

Q:
I am trying to find an example of a unitary pseudogene whose
promoter is known to be mutated as well and therefore the gene is
definitely non-functional. I can find articles stating there are many
examples of unitary pseudogenes in humans (e.g. Vitamin C) but none
seem to mention the promoter. Any thoughts?

A:
Our analyses compiled a number of activity features associated with pseudogenes (e.g. transcription, presence of functional Pol2 and TF binding sites in the upstream region, presence of open chromatin) that are available online. Please see https://www.ncbi.nlm.nih.gov/pubmed/22951037 (http://pseudogene.org/psidr/ ) and https://www.ncbi.nlm.nih.gov/pubmed/25157146 (http://pseudogene.org/psicube/) for the functional characterisation of pseudogenes. In particular, the unitary pseudogenes that show no transcription and no Pol2 or TF binding sites should be the ones to look at, checking whether or not the promoter region is conserved.

Zebrafish pseudogenes

Q:
I have a question regarding zebrafish pseudogenes. I searched for a few
zebrafish genes to check whether they have any pseudogenes in
pseudogene.org, and I found that there are 15779 zebrafish pseudogenes.
But the Nature reference that you mentioned in your blog
reports a total of 154 zebrafish pseudogenes! Could you please let me know how
one can see those 154 pseudogenes, if I want to know whether my genes
of interest have pseudogenes or not?

A:
Pseudogene.org provides a set of pseudogenes resulting from automatic annotation. Zebrafish is a peculiar genome. It was subjected to numerous large-scale genome duplications and thus is full of repeats. As such, the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation, which resulted in a much smaller number of pseudogenes. Continuous improvements in the genome annotation lead to further improvements in the pseudogene annotation. I attach here the latest set of zebrafish pseudogenes.

Problem in PseudoPipe

Q:
I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as follows:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs
0

And I got the output in the attachment and the following lines on the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
find: `/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes': No such file or directory
find: `/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes': No such file or directory
gzip: /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa: No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

A:
It looks like you have permission problems. The script tries to source the file setenvPipelineVars, which you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export statements, and from the look of it I would guess that you do not have the rights to export to the PATH. So I suggest you get admin rights and run as root.
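
Before changing how the pipeline is run, it may help to confirm that the two setenvPipelineVars files exist and are readable by the current user. The following is a minimal sketch under the assumption that the files sit in the pgenes/minus and pgenes/plus folders named above; the helper name check_env_files is made up for illustration.

```python
import os


def check_env_files(output_dir, strands=("minus", "plus"), name="setenvPipelineVars"):
    """Report whether each strand's environment file exists and is readable."""
    report = {}
    for strand in strands:
        path = os.path.join(os.path.expanduser(output_dir), "pgenes", strand, name)
        report[strand] = {
            "path": path,
            "exists": os.path.isfile(path),
            # os.access checks read permission for the user running this script
            "readable": os.access(path, os.R_OK),
        }
    return report
```

Calling check_env_files with the PseudoPipe output directory returns one entry per strand; a file that exists but is not readable points to a permission problem on the file itself.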