Pseudogenes lacking on short arm of chr13, 14, 15 and 22

Found your paper ( about
pseudogenes very informative.

I have been looking at the distribution of pseudogenes(
using the Ensembl annotation as well as that found on

However, pseudogenes seem to be lacking on short arm of chr13, 14, 15 and 22.

Could you please let me know if this is a known biological pattern or
some missing annotation?

In humans, chromosomes 13, 14, 15, 21 and 22 are acrocentric. Each is made of a very long arm and a very short arm, and the short arm is homologous across the five chromosomes. Only the long arm has been sequenced and annotated. This is why there are no pseudogenes (or genes, for that matter) annotated on those short arms.

For more info please see:

Question about data

I have a question about the database.
I’m analyzing the human pseudogene database and noticed that many "processed" pseudogenes (>70%) don’t have a polyA tail.
This seems to be the opposite of what the textbooks say. Is that true?
What are the criteria for a "processed pseudogene" in

I have another question.
I ran BLAT searches using several pseudogene sequences from each polyA class ("0", "1", "2" or "3").
But most pseudogenes in polyA classes 1, 2 and 3 don’t have a convincing polyA tail when compared with the following criteria.

Polya: "0", "1", "2" or "3".
"0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
"1": has a polyA tail and also a polyadenylation signal within 50 bp of the beginning of the tail
"2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
"3": has a polyA tail but no polyadenylation signal detected.

Do the coordinates in the data depend on the human genome assembly (GRCh37/hg19)?

Pseudogenes are identified primarily by homology matching of protein sequences against the human genome. However, the pipeline that we use incorporates poly A analysis. Our group published a paper a few years ago in which we showed that ~50% of ribosomal protein pseudogenes do not have a detectable poly A signal. We believe that this is due to sequence decay and nucleotide substitutions in the genome.

For detecting and classifying poly A signals, the following criteria are used, according to the paper linked above.

We searched a 1000-bp region that was 3′ to the pseudogene homology segment, with a sliding window of 50 nucleotides for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3′-untranslated regions (3′-UTRs); 90% of 3′-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.
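The search described above, combined with the class definitions quoted in the question, can be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the function names are made up, the start of the best window is used as a stand-in for the beginning of the tail, only the canonical AATAAA signal is searched, and a signal more than 100 bp upstream is treated as "not detected" (class "3"). All of these are assumptions.

```python
import re
from typing import List, Optional, Tuple

REGION = 1000     # bp searched 3' of the pseudogene homology segment
WINDOW = 50       # sliding-window width in bp
MIN_A = 30        # a window must contain more than this many A's
SIGNAL = "AATAAA"  # canonical signal only (assumption: variants ignored)


def find_polya_window(downstream: str) -> Optional[Tuple[int, int]]:
    """Return (start, a_count) of the most A-rich window, or None."""
    seq = downstream.upper()[:REGION]
    if len(seq) < WINDOW:
        return None
    count = seq[:WINDOW].count("A")
    best_start, best_count = 0, count
    for start in range(1, len(seq) - WINDOW + 1):
        # Slide by one base: add the base entering, drop the base leaving.
        count += (seq[start + WINDOW - 1] == "A") - (seq[start - 1] == "A")
        if count > best_count:
            best_start, best_count = start, count
    return (best_start, best_count) if best_count > MIN_A else None


def find_signals(downstream: str) -> List[int]:
    """Return start positions of every AATAAA in the searched region."""
    seq = downstream.upper()[:REGION]
    # A lookahead is used so that overlapping matches are also reported.
    return [m.start() for m in re.finditer("(?=" + SIGNAL + ")", seq)]


def classify_polya(downstream: str) -> str:
    """Assign polyA class "0", "1", "2" or "3" to a downstream sequence."""
    window = find_polya_window(downstream)
    if window is None:
        return "0"                      # no polyA tail detected
    tail_start = window[0]              # assumption: window start ~ tail start
    upstream = [p for p in find_signals(downstream) if p < tail_start]
    if not upstream:
        return "3"                      # tail but no signal upstream
    distance = tail_start - max(upstream)  # nearest upstream signal
    if distance <= 50:
        return "1"
    if distance <= 100:
        return "2"
    return "3"                          # assumption: too far counts as none
```

For example, `classify_polya("C" * 300)` returns "0" because no window contains more than 30 adenines. Because the window start can precede the first A of the tail, distances measured this way are approximate.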

These criteria might not be very stringent.

And yes, the pseudogene coordinates depend on the human genome assembly from which they are derived, so the assembly version is important.

Search help for PseudoPipe program


Recently, I read your
published paper "PseudoPipe: an automated pseudogene identification
pipeline" ( Vol. 22 no. 12 2006, pages
1437–1439/doi:10.1093/bioinformatics/btl116), which impressed me very much. I
really admire the excellent work by you and your co-workers.
After reading the paper, I downloaded the PseudoPipe program (Pipeline
Source Code) at and tried to use it to identify
pseudogene sequences in a mammalian genome. But I ran into some problems
during a preliminary run. I input the existing data
(caenorhabditis_elegans_62_220a) and installed Python 2.26; however, the
PseudoPipe program failed to run the analysis, and the
fastaAlign step did not work well. This puzzles me a lot, and I would appreciate
it if you could help me solve these problems.

Please note that PseudoPipe was written to discover pseudogenes in mammalian genomes; it does not work well in C. elegans.

The pseudogene information of zebrafish in should be updated

In, the pseudogene datasets of zebrafish (Danio rerio)
were based on old annotations (Ensembl 55?). There were about 1800
processed pseudogenes. However, according to recent research
pseudogenes are rare in zebrafish. (Only 21 processed
pseudogenes, according to Supplementary Table 14 in the published

Is this large discrepancy due to the old annotations?

You are right about the zebrafish pseudogenes in the The results were based on an old genome assembly, Ensembl release 55, which was released in 2009. We have noticed that the pseudogene number is far too high, which we believe is partly due to the quality of the genome assembly and partly because the pipeline parameters were optimized for primates. Given that, for up-to-date pseudogene information in zebrafish, people should refer to the Nature publication:

Thanks for pointing this out to us.

information about


I am interested in using the information in your database to design PCR probes that would recognize usable and ensuing pseudogenes for several genes.

Do I need to obtain any type of written permission to use this information?


I checked one gene, with 9 pseudogenes listed and tried to align the
sequences to make PCR primers to detect the 10 copies, however, I realized
that being a bit naïve about pseudogenes led me down the wrong path, as I
thought the sequences would be more similar, and suited to being used to
estimate copy number for inserting foreign genes. While I did get regions
that hit 3-6 of the 10 genes, it wasn’t consistent enough.

I was wondering if you have the data about % conservation or any types of
algorithms that would predict the % conservation of pseudogene to gene and
pull out those names/gene Ids and number of pseudogenes?

It would be helpful if you can tell us a bit more about what you are trying to do.
I assume you are looking at human pseudogenes. We do have percent identity between the parent protein and the pseudogene.

I’m trying to figure out a sensible way to use the numbers of pseudogenes per gene as a natural standard curve for real-time PCR. See the attached Excel file. I chose at random genes with 9 to 1 listed pseudogenes, which theoretically would allow me to target endogenous genes of different copy number and get some type of standard curve. This is assuming equal efficiency etc.

I didn’t pay attention to the column "Identity" but now I’m thinking I can sort out genes based on high identity and try again?

I think that identity should be taken into account when you are creating the standard curve. Also, note that in the Excel file there is a fraction column (after gene ID), which indicates the fraction of a parent gene aligned to its pseudogene. The start and end coordinates of each alignment are also in the Excel file (in the columns between protein ID and gene ID). You may want to take these into consideration too.
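One way to apply this advice is to count, for each parent gene, only the pseudogenes that pass both an identity cutoff and an alignment-fraction cutoff. The sketch below is illustrative only: the key names `gene_id`, `identity` and `fraction` are hypothetical stand-ins for the spreadsheet columns, both values are assumed to be on a 0 to 1 scale, and the 0.9 cutoffs are arbitrary.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_usable_pseudogenes(
    rows: Iterable[Mapping[str, object]],
    min_identity: float = 0.9,   # arbitrary cutoff (assumption)
    min_fraction: float = 0.9,   # arbitrary cutoff (assumption)
) -> Dict[str, int]:
    """Count pseudogenes per parent gene that pass both cutoffs.

    Each row is one pseudogene; the key names are hypothetical.
    """
    counts: Counter = Counter()
    for row in rows:
        identity = float(row["identity"])
        fraction = float(row["fraction"])
        if identity >= min_identity and fraction >= min_fraction:
            counts[str(row["gene_id"])] += 1
    return dict(counts)


# Made-up rows, standing in for lines exported from the spreadsheet.
example_rows = [
    {"gene_id": "geneA", "identity": 0.97, "fraction": 0.95},
    {"gene_id": "geneA", "identity": 0.72, "fraction": 0.99},
    {"gene_id": "geneB", "identity": 0.93, "fraction": 0.40},
    {"gene_id": "geneB", "identity": 0.95, "fraction": 0.92},
]
usable = count_usable_pseudogenes(example_rows)
```

Genes whose count after filtering is close to the listed number of pseudogenes would be the more promising candidates for a copy-number standard. The alignment start and end coordinates would still need to be checked against the primer positions.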

Pseudogene minilist for PCR.xlsx

Comparing chromatin state analysis at pseudogene regions

I am very interested in comparing our chromatin state analysis at pseudogene regions with yours. I found this file on your website:

Could you please let me know if this is the right place to compare? I saw that you have H1-hESC there. If I understand correctly, you classified each pseudogene as being in either an active (1) or a silent (0) state.

The chromatin state, promoter prediction and Pol2 binding data for pseudogenes in H1-hESC are included in the psiDR file. Please let me know if you have any questions about that file.

Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. Looking however at
"Database" -"Eukaryote Pseudogenes" I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count). So I’m not sure which one I should best consider. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

Now the link is pointing to build 68. I would
suggest that you use this file, which contains the latest results based on
release 68 of the Ensembl genomes. The number of pseudogenes changes because
the annotation of protein-coding genes differs between
genome releases.