Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. However, under
"Database" - "Eukaryote Pseudogenes" I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count), so I’m not sure which one I should use. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

The link now points to build 68. I would
suggest using this file, which contains the latest results, based on
release 68 of Ensembl genomes. The number of pseudogenes changes because
the annotation of protein-coding genes differs between
genome releases.

Data associated with paper “The GENCODE pseudogene resource”

Your work on the ENCODE project has helped to
produce an incredible set of data!

I had a question about your pseudogene article, "The GENCODE
pseudogene resource." You note that at least 9% of them are
transcribed. Do you have a list somewhere? I couldn’t find a
supplementary file that might contain such a list. I realize it would
be quite long, but I assume it must exist somewhere. If not, do you
happen to know if the GULO pseudogene is one of the transcribed pseudogenes?

The list of transcribed pseudogenes should be available from the resource associated with the paper.

The data associated with the paper is located at

Pseudogenes in bacteria


I see you have developed pipelines to look for pseudogenes. Is there someone still working on this problem in your lab?

It turns out there is no good method to find pseudogenes in bacteria, which is perhaps a less challenging problem than in eukaryotes, because of the lack of splicing. There is a real need for such a pipeline because genome degradation is a common pathway to specialization in pathogens. We have hundreds of genomes we would like to put through such a pipeline.

It turns out we have already identified pseudogenes in bacteria and have some old lists available. See:

Zebrafish pseudogenes

The zebrafish (Danio rerio) pseudogene dataset is based on old annotations (Ensembl 55?) and contains about 1,800 processed pseudogenes. However, a recent study reports very few pseudogenes in zebrafish (only 21 processed pseudogenes, according to Supplementary Table 14 of the published manuscript). Is this large discrepancy due to the old annotations?

This is right: the results were based on an old genome assembly, Ensembl release 55, which dates from 2009. We do notice that the pseudogene number is far too high, which we believe is partly due to the quality of the genome assembly and partly because the pipeline parameters were optimized for primates. Given that, for up-to-date pseudogene information in zebrafish, please refer to the Nature publication.


Where is the psiDR file?
The file can be downloaded at:

Is H1-hESC included in the psiDR file?
The chromatin state, promoter prediction and Pol II binding for pseudogenes in H1-hESC are included in the psiDR file.

Could you let me know briefly how the chromatin states in the psiDR file are determined?
The chromatin states were assessed using the Segway segmentation. Segway annotates the genome using 25 different labels representing active and repressive marks. We use two selection criteria to pinpoint pseudogenes with active chromatin states:
(1) the frequency of the TSS is three times higher than the frequency of any repressive markers;
(2) the gene body start (GS), gene body middle (GM) and gene body end (GE) frequencies are two times larger than the frequency of the repressive markers.
The selection criteria were chosen to match the segmentation behavior of the active genes.
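
The two criteria can be sketched in Python as follows. This is a minimal illustration, not the code used to build psiDR. The dictionary layout, the key names and the function name are assumptions made for the example. It also assumes that both criteria must hold, that "any repressive marker" means the largest repressive frequency, and that "three times higher" means strictly more than three times.

```python
def has_active_chromatin(freqs):
    """Return True if a pseudogene's Segway label frequencies look 'active'.

    freqs is a hypothetical dict: keys 'TSS', 'GS', 'GM', 'GE' hold the
    frequency of each active label, and 'repressive' holds a list with
    the frequency of each repressive label.
    """
    max_repressive = max(freqs["repressive"], default=0.0)

    # Criterion 1: TSS frequency more than 3x any repressive frequency.
    tss_ok = freqs["TSS"] > 3 * max_repressive

    # Criterion 2: GS, GM and GE each more than 2x the repressive frequency.
    body_ok = all(freqs[k] > 2 * max_repressive for k in ("GS", "GM", "GE"))

    return tss_ok and body_ok
```

Comparing against the largest repressive frequency is the strictest reading of the criteria; a different reading (for example, the summed repressive frequency) would only change the `max_repressive` line.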

Help with PseudoPipe

Why do the genomic sequences need to be repeat-masked before they are input to the pipeline?

Repeat masking keeps the low-complexity regions of the genome out of the pseudogene search.

Which database should we use for the repeat masking?

Our current pipeline downloads genome data from Ensembl, where the repeats are detected with the RepeatMasker tool. More information about PseudoPipe can be found at:

> Dear Prof. Gerstein,
> My name is Yiling Lai, a PhD student from Prof. Xingzhong Liu’s group in Institute of Microbiology, Chinese Academy of Sciences. Our research focuses on comparative genomics of nematode endoparasitic fungi Hirsutella spp. We have now started to analyse the genomic sequences and are using the PseudoPipe from your published method to identify pseudogenes in these genomes. However, some questions confuse us when we use the pipeline. The first one is why the genomic sequences need to be repeatmasked before their input to the pipeline. The second question is which database we should use to do the repeatmasking: the Repbase database, or a database established from de novo consensus sequences by RepeatScout? We would very much appreciate it if you could give us some good suggestions. Thank you very much! We’re looking forward to your reply.
> Best wishes
> Yiling Lai
> State Key Laboratory of Mycology
> Institute of Microbiology
> Chinese Academy of Sciences
> No.3 1st Beichen West Road, Chaoyang District
> Beijing 100101, PR China

PGOHUM00000250821 probably not a pseudogene

This is supported as a protein coding gene based on transcript and genomic data in human, and homology data. The differences with the human reference assembly (insertions at nt 475-476 and nt 496-497 in the CDS) are supported by transcript data and alignment to the alternate (Celera) assembly. The mouse protein NP_758465.2 (Ppp1r9b, Entrez GeneID 217124) is the same length as the human protein (NP_115984.3) and 96% identical. The region where the mouse gene is located on chromosome 11 has the same genes in the same order as the location on human chromosome 17 where this gene is annotated.
Thanks to Dr. Janet Weber from the RefSeq project group for pointing this out to us. PGOHUM00000250821 is most likely the protein-coding gene PPP1R9B. The erroneous annotation probably results from either an error or a difference in the canonical human reference genome. Please note that this locus is tagged for follow-up by the Genome Reference Consortium as a possible locus where the reference genome is incorrect (GRC Jira system, HG-191).

Cow pseudogenes?

We’re wondering if you happen to have a database for cow pseudogenes?
We haven’t done a PseudoPipe run on the cow genome.
I see that the genome is available from Ensembl, so you can download the code and run it yourself. In theory,
PseudoPipe can be run whenever the genome and the annotation files are part of Ensembl. The code to run PseudoPipe can be downloaded from


Dear Anand,

First, some general guidelines for running the PseudoPipe pipeline on your local machine.

The PseudoPipe pipeline was originally designed and automated to work with Ensembl data, so some manual settings are required to run it with other input data.

Attached is an archive that contains the pipeline and a simple try-out dataset.

There are three folders within a parent directory “pgenes” after extraction:
– pseudopipe: pipeline code;
– ppipe_input: input data;
– ppipe_output: output data.

Input data:
You may create a separate folder within ppipe_input (and ppipe_output) for each species. Each species needs three folders of genomic input data:
– dna: contains a file named dna_rm.fa, which is the entire repeat-masked DNA of that species, and the unmasked DNA split into one FASTA file per chromosome;
– pep: contains a FASTA file with all the proteins of the species;
– mysql: contains files named “chr1_exLocs”, “chr2_exLocs”, etc., one per chromosome, that specify exon coordinates. The only thing that matters in these files is their third and fourth columns, which should be the start and end coordinates of the exons.
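
As a quick sanity check on an exLocs file, the exon coordinates can be read back with a few lines of Python. This is only a sketch: the function name is made up for the example, and it assumes the columns are separated by whitespace (tabs or spaces), which you should confirm against your own files.

```python
def read_exon_coords(lines):
    """Yield (start, end) exon coordinates from the lines of an exLocs file.

    Only the third and fourth columns are used, as described above.
    Assumes whitespace-separated columns; lines that are blank or whose
    coordinate columns are not integers (e.g. a header) are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            start, end = int(fields[2]), int(fields[3])
        except ValueError:
            continue
        yield start, end


# Usage with a real file (the species folder name is illustrative):
# with open("ppipe_input/my_species/mysql/chr1_exLocs") as handle:
#     exons = list(read_exon_coords(handle))
```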

Environment setting:
You’ll need Python, BLAST and tfasty to run the pipeline. Their paths should be indicated at the end of /pseudopipe/bin/

Run the pipeline
First go to the folder pseudopipe/bin, then run a command line of the form: ./ [output dir] [masked dna dir] [input dna dir] [input pep dir] [exon dir] 0.

An example using the try-out data is as follows:
./ ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
(This command line assumes you extract the archive in your home directory, i.e., “~/”. Please note that the paths in the command line need to be absolute, and chromosome and exon files are specified with wild card “%s”.)
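
Because a full run takes a long time, it can be worth checking that every expanded path exists before launching. The sketch below is not part of the pipeline. It assumes the “%s” placeholder is filled in with the chromosome name using Python-style string formatting, and that the chromosome names you pass in match your file names. The helper name is hypothetical.

```python
import os


def missing_inputs(dna_template, exon_template, chromosomes):
    """Return the expanded, absolute paths that do not exist on disk.

    dna_template and exon_template each contain one '%s' placeholder,
    which is replaced by each chromosome name in turn.
    """
    missing = []
    for chrom in chromosomes:
        for template in (dna_template, exon_template):
            # expanduser resolves '~'; abspath makes the path absolute.
            path = os.path.abspath(os.path.expanduser(template % chrom))
            if not os.path.isfile(path):
                missing.append(path)
    return missing


# Example call for the try-out data (chromosome names are illustrative):
# base = "~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a"
# print(missing_inputs(
#     base + "/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa",
#     base + "/mysql/chr%s_exLocs",
#     ["I", "II", "III", "IV", "V", "X"],
# ))
```

An empty list means every per-chromosome DNA and exon file was found.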

The blast step is already included in the pipeline.

The output can be found at ppipe_output/caenorhabditis_elegans_62_220a/pgenes/ppipe_output_pgenes.txt

Run time
On a single laptop (2.6GHz, 4GB RAM): The most time-consuming step is tblastn. It may take around one day to finish an entire genome of a size comparable to C. elegans. The following steps will finish in a few hours.

We’ve implemented the pipeline to run in parallel on cluster machines. However, the pipeline I sent can only run on a single machine. The parallel implementation is currently hard-coded to our local settings.

Some specific answers to your questions:
I am ready to run tBLASTn of proteome versus genome. I can repeat mask the genome during the tBLASTn run itself, would that be OK?

You don’t have to run tBLASTn yourself, since it is already integrated into the pipeline. In Ensembl, the genomes are repeat-masked by RepeatMasker, and that is the input data currently used by the pipeline. I would assume any reasonable repeat-masking algorithm is fine.

For the tBLASTn, instead of using the entire genome, can I use the genome that is ‘masked’ for entire genes (not just exons). Based on gff info, I have converted the genic regions (not just exons) into stretches of Ns. Would this ‘masked’ genome be a good input for my tBLASTn?

You don’t need to do that, since the pipeline will remove BLAST hits that significantly overlap known gene exons (> 30 bp overlap). Also, manually masking the entire gene sequences may be problematic, since in some species we do find pseudogenes that partly overlap gene annotations.
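
The idea behind the exon-overlap filter can be sketched as follows. This is not PseudoPipe’s own code. The function names are made up for the example, and it assumes 1-based inclusive coordinates with all intervals on the same chromosome.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 1-based, inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def drop_exon_overlaps(hits, exons, max_overlap=30):
    """Keep hits whose overlap with every exon is at most max_overlap bp.

    hits and exons are lists of (start, end) tuples on one chromosome.
    The pairwise loop is fine for a sketch; a real genome would call for
    sorted intervals or an interval tree.
    """
    return [
        hit for hit in hits
        if all(overlap_bp(hit[0], hit[1], ex[0], ex[1]) <= max_overlap
               for ex in exons)
    ]
```

A hit that overlaps an exon by 30 bp or less is kept, which is why pseudogenes that partly overlap gene annotations can still be reported, whereas masking whole genes beforehand would hide them.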

You mention in your paper that you use bite-sized portions of your proteome as query for your BLAST search. Does that mean I should chop up my proteome into peptides x amino acids in length? Is that x >= 10?

No need to do that. You can keep the whole protein sequences in the input FASTA file.

Is there a latest README or even a User Guide for PseudoPipe that you can share with me?

Unfortunately, we don’t have a user-friendly README file for the entire pipeline, especially for running it in an environment different from ours. I hope this email helps you set up and run the pipeline on your machine. You can also find some comments in each individual pipeline script file.
Please feel free to let me know if you need further assistance.


Request for Pseudogene


We are looking for the pseudogenes of protein P53 (tumor protein 53, or tumor suppressor) and protein WSTF (also called BAZ1B) in human. We could not find any information on them. Could you please help us find a way to get the result?
Later on I found a web service called PseudoGeneQuest. I submitted my target protein sequences and got the results shown in the following forwarded emails.

The results showed that there are known pseudogenes in your database, but I couldn’t extract the data. Could you please help me do so?


I have looked at our pseudogene database and there are no pseudogenes for P53 and WSTF. I have rechecked this by redoing the homology analysis against the genome with both the P53 and WSTF sequences, and there are no other regions in the genome that are good hits to P53 or WSTF. I have also looked at the results from the other program: the matches are either to coding exons of other genes or they are not significant, i.e. the match lengths are very small and the e-values are not significant.

For example, these are the other regions in the genome homologous to the coding sequence in BLAST. Please see attached image. The only significant matches to P53 proteins are:

1. NT_010718.16 This corresponds to P53 itself

2. NT_004350.19 This corresponds to P73, another gene and not a pseudogene

3. NT_005612.16 This corresponds to P63, another gene and not a pseudogene

The other two matches are not significant and are homologous to only 20% of the length of P53.

This is the result that you obtained from the other program.

0 - QUERY:111222153038348410812
2 - KNOWN_PSEUDOGENE:ref|NT_004350.19|:NT_010755.15:3118600:3119076
2 - KNOWN_PSEUDOGENE:ref|NT_004350.19|:NT_033903.7:3114083:3118495
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_008470.18:7177265:7178188
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_023935.17:7181340:7182403
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_079573.3:7181224:7182633
3 - REAL GENE OR EXON:ref|NT_004350.19|:3122278:3122442
3 - REAL GENE OR EXON:ref|NT_005612.16|:96077137:96077361
3 - REAL GENE OR EXON:ref|NT_005612.16|:96079592:96079771
3 - REAL GENE OR EXON:ref|NT_005612.16|:96080735:96080899
3 - REAL GENE OR EXON:ref|NT_005612.16|:96081483:96081638
3 - REAL GENE OR EXON:ref|NT_010718.16|:7176274:7176414
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180194:7180331
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180364:7180564
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180845:7181012
3 - REAL GENE OR EXON:ref|NT_010718.16|:7183182:7183316

So all the good hits are to coding exons of P53, P63 or P73, presumably because P53 is homologous to P63 and P73.
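
To see at a glance which contigs the reported hits fall on, the listing above can be tallied with a short script. This is only a sketch: the field meanings are inferred from the listing itself, not from the other program’s documentation, and the function name is made up for the example.

```python
from collections import Counter


def summarize_hits(lines):
    """Count hits per (category, contig) in output like that shown above.

    The field meanings are inferred from the listing, not from the
    tool's documentation: the text after ' - ' is split on ':', the
    first field is the category and the 'ref|...|' field names the contig.
    """
    counts = Counter()
    for line in lines:
        _, sep, rest = line.strip().partition(" - ")
        if not sep:
            continue
        fields = rest.split(":")
        refs = [f for f in fields if f.startswith("ref|")]
        if not refs:
            continue  # e.g. the QUERY line carries no contig
        contig = refs[0].split("|")[1]
        counts[(fields[0], contig)] += 1
    return counts
```

Each reported contig can then be compared with the three contigs listed earlier (P53, P73 and P63).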

Similarly for WSTF, the other matches are either to known genes or not significant. You can easily check this by querying your protein sequence using BLAST.