Asking for data used in finding processed pseudogenes in the human genome

Recently, I read one of your papers about finding processed pseudogenes, published in 2003: "Millions of Years of Evolution Preserved: A Comprehensive Catalog of the Processed Pseudogenes in the Human Genome". I want to find processed pseudogenes in several recently released mammalian genomes, so your paper is very interesting and helpful for my work. To make sure I have understood the method correctly, I want to use your original data to repeat your analysis.

But I ran into a problem when trying to download the nonredundant human proteome set from the EBI website. The data was published in June 2002, and I have not been able to download it from the EBI website. I am writing in the hope of getting the nonredundant human proteome set, released in June 2002, that you used. I know many years have passed since the paper was published and you may have lost the original data, but I still want to give it a try!

The data associated with the paper is here: You can also find the latest human pseudogene annotation here:

Question about a potential error with

I want to say great job with the site! I recently noticed a potential error and wanted to send an email to let you know, in case you haven’t already picked it up yourselves.

In the file located at the following address:

The start and end chromosomal locations for the pseudogenes are the same. See below:

chr19 + ENSG00000237521.1 ENST00000456448.1 OR7E24
"Transcribed: 0"
"Active Chromatin: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Constraint: 0"

chr2 - 98123508 . . .
"Transcribed: 0"
"Active Chromatin: GM12878=1;K562=0;Helas3=0;Hepg2=0;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"
"Pol2: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"
"Constraint: 0"

processed_pseudogene chr3 - 136527393 ENSG00000198075.5 ENST00000272452.2 SULT1C4
"Transcribed: 0"
"Active Chromatin: GM12878=1;K562=0;Helas3=1;Hepg2=1;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Constraint: 1"

Thanks for pointing out the problem to us. However, I’m a little confused about which file you are referring to. The parents file at the URL in your message does not match the contents you provided. The contents look more like they come from the file: But neither file has the chromosome coordinates issue you mentioned. Maybe you meant some other file?

It appears you are correct: I provided the link for the GENCODEv10 pseudogene resource instead of the v7 resource by mistake. I was, however, able to go back and find the file where I had found the mistake.

I had downloaded the Pseudogene Resource psiDR from the GENCODE website ( ) and assumed that this file was the same as the one at the link you provided ( ). It appears they are not the same: the link on the GENCODE website ( ) displays the problem that I previously described, whereas the link you provided does not.

The file with the problem I described is actually linked at this page:
Under the link entitled:
New! Pseudogene Resource psiDR
which redirects to:

I am not sure whether you are part of the administration for the GENCODE site. If you aren’t, you may want to contact them about the problem, since the file appears to represent data from your lab.

I am sorry for providing the wrong link earlier. Please let me know if you have any more trouble reproducing the problem.

I can see the problem too. I’ll contact GENCODE to have the file updated. Thanks for pointing this issue out to us!
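
For readers who want to run the same check on a downloaded file, the problem reported in this thread (identical start and end coordinates) can be screened for with a short script. This is only a minimal sketch: it assumes a tab-delimited file, and the column positions (`start_col`, `end_col`) and the sample rows are made up for illustration, not taken from the actual psiDR layout.

```python
import csv
import io

def rows_with_equal_start_end(handle, start_col, end_col, delimiter="\t"):
    """Yield (line_number, row) for rows whose start and end fields are identical.

    start_col and end_col are 0-based column indexes; they are assumptions
    and must be set to match the file actually being checked.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    for line_number, row in enumerate(reader, start=1):
        if len(row) <= max(start_col, end_col):
            continue  # skip short or malformed rows
        start, end = row[start_col], row[end_col]
        # isdigit() also skips header rows and placeholder values such as "."
        if start.isdigit() and start == end:
            yield line_number, row

# Made-up rows for illustration only (not real annotation data)
sample = io.StringIO(
    "chrA\t100\t100\t+\n"
    "chrB\t200\t350\t-\n"
)
flagged = list(rows_with_equal_start_end(sample, start_col=1, end_col=2))
print(flagged)
```

To check a real download, open the file with `open(path, newline="")` and pass the handle in place of `sample`.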

pseudogene similarities to parent genes

I am looking at your paper ("The Gencode pseudogene resource"), which
appears very relevant to something I am doing right now. Specifically,
I am interested in the sequence identity values between pseudogenes
and their parents, which are used in figure 4. Would it be possible
for you to make these available to me (or to tell me where I can
download them if they are already online)?

You may find the data at

Pseudogene ontology

I’m currently working on reasoning with ontologies, and I read your
paper about reasoning over the pseudogene set with great interest.
I’m particularly interested in the time you report for the reasoning to
complete, since performance is my biggest issue. So I wanted to ask
you what the size of the data you used was, since the pseudogene set
does not seem to be available online anymore.

The pseudogene data set used in this paper can be found at:, which is based on
Ensembl genome release 48. There are 2,294 duplicated pseudogenes and
10,187 processed pseudogenes on chromosomes 1-22, X and Y. Hope this

PGOHUM00000250823 and SETP13/SETP3 pseudogenes


Could you please investigate the support you have for PGOHUM00000250823? At NCBI, we have PGOHUM00000250823 associated with HGNCid 42932, official symbol SETP13, and RefSeq accession NG_032538.1. However, we have a nearly identical RefSeq accession NG_032022.1 associated with HGNCid 31115 and official symbol SETP3. On the current human reference assembly, GRCh38, both NG_032538.1 and NG_032022.1 align perfectly to the same locus and have no other hits of comparable quality to the reference assembly (or to alternate assemblies HuRef or CHM1_1.1). In NCBI’s latest annotation (Annotation Release 106) SETP3 was annotated on the assembly but SETP13 was not because it overlapped with SETP3. Do you have any evidence these are distinct pseudogenes? If not, NCBI’s preference would be to preserve the older nomenclature associated with SETP3 and NG_032022.1. Also, if we agree that SETP13 is redundant with SETP3 then I will proceed to notify HGNC and will CC you on that email.

I looked at the PGOHUM00000250823/SETP13 locus. Our pipeline
predicted it as a pseudogene of SET with around 90% sequence identity.
Compared to the SETP3 locus, SETP13 lacks sequences at both the 3’ and 5’
ends, which can be aligned to the UTR regions of SET. Our pipeline
missed these sequences at both ends because it searches for sequence
homologous to CDS regions only. We are actually thinking of including
some checks of UTR alignment in our revised pipeline. Thanks for
pointing this case out to us.

We have no problem with merging the SETP13 locus with SETP3.

The pseudogene information of zebrafish in should be updated

In, the pseudogene datasets for zebrafish (Danio rerio)
were based on old annotations (Ensembl 55?). There were about 1800
processed pseudogenes. However, according to recent research,
pseudogenes are rare in zebrafish. (Only 21 processed
pseudogenes, according to Supplementary Table 14 in the published

Is this large discrepancy due to the old annotations?

You are right about the zebrafish pseudogenes in the The results were based on an old genome assembly, Ensembl release 55, from 2009. We have noticed that the pseudogene number is far too high, which we believe is partially due to the quality of the genome assembly and partially because the pipeline parameters were optimized for primates. Given that, for up-to-date pseudogene information in zebrafish, people should refer to the Nature publication:

Thanks for pointing this out to us.

information about


I am interested in using the information in your database to design PCR probes that would recognize usable and ensuing pseudogenes for several genes.

Do I need to obtain any type of written permission to use this information?


I checked one gene with 9 pseudogenes listed and tried to align the
sequences to make PCR primers to detect the 10 copies. However, I realized
that being a bit naïve about pseudogenes had led me down the wrong path: I
thought the sequences would be more similar, and better suited to being used
to estimate copy number for inserted foreign genes. While I did get regions
that hit 3-6 of the 10 genes, it wasn’t consistent enough.

I was wondering if you have data about % conservation, or any type of
algorithm that would predict the % conservation of pseudogene to gene and
pull out those names/gene IDs and the number of pseudogenes?

It would be helpful if you can tell us a bit more about what you are trying to do.
I assume you are looking at human pseudogenes. We do have percent identity between the parent protein and the pseudogene.

I’m trying to figure out a sensible way to use the number of pseudogenes per gene as a natural standard curve for real-time PCR. See the attached Excel file. I chose at random genes with 9 to 1 listed pseudogenes, which theoretically would allow me to target endogenous genes of different copy number and get some type of standard curve. This assumes equal efficiency, etc.

I didn’t pay attention to the column "Identity" but now I’m thinking I can sort out genes based on high identity and try again?

I think that identity should be taken into account when you are creating the standard curve. Also, note that in the Excel file there is a fraction column (after gene ID), which indicates the fraction of a parent gene aligned to its pseudogene. The start and end coordinates of each alignment are also in the Excel file (the columns between protein ID and gene ID). You may want to take these into consideration too.
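
The advice above (take identity and aligned fraction into account) could be applied along the following lines. This is a hedged sketch, not the lab's procedure: the field names (`parent_gene`, `pseudogene_id`, `identity`, `fraction`), the 0-to-1 scale, the thresholds, and the sample records are all hypothetical and would need to be mapped onto the actual columns of the spreadsheet.

```python
from collections import defaultdict

def select_candidates(records, min_identity, min_fraction):
    """Group pseudogenes by parent gene, keeping only those that pass both cutoffs."""
    by_parent = defaultdict(list)
    for rec in records:
        if rec["identity"] >= min_identity and rec["fraction"] >= min_fraction:
            by_parent[rec["parent_gene"]].append(rec["pseudogene_id"])
    return dict(by_parent)

# Hypothetical records for illustration only
records = [
    {"parent_gene": "GENE_A", "pseudogene_id": "psA1", "identity": 0.97, "fraction": 0.95},
    {"parent_gene": "GENE_A", "pseudogene_id": "psA2", "identity": 0.70, "fraction": 0.99},
    {"parent_gene": "GENE_B", "pseudogene_id": "psB1", "identity": 0.93, "fraction": 0.40},
    {"parent_gene": "GENE_B", "pseudogene_id": "psB2", "identity": 0.95, "fraction": 0.92},
]

# Thresholds are arbitrary placeholders, not recommended values
candidates = select_candidates(records, min_identity=0.90, min_fraction=0.90)

# Expected target copies per gene: the parent itself plus each retained pseudogene
copy_numbers = {gene: 1 + len(ps) for gene, ps in candidates.items()}
print(copy_numbers)
```

Pseudogenes that fail either cutoff are dropped from the expected copy count, since primers designed on the parent are less likely to amplify them.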

Pseudogene minilist for PCR.xlsx

Comparing chromatin state analysis at pseudogene regions

I am very interested in comparing our chromatin state analysis with yours at the pseudogene regions. I found this file on your website:

Could you please let me know if this is the right place to compare? I saw you do have h1-esc there. If I understand correctly, you classified each pseudogene as being in either active (1) or silent state (0).

The chromatin state, promoter prediction and Pol2 binding for pseudogenes in H1-hESC are included in the psiDR file. Please let me know if you have any questions about that file.
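
As an illustration of reading the per-cell-line fields, the sketch below parses strings of the form quoted elsewhere on this page, such as "Active Chromatin: GM12878=1;K562=0;...;H1hesc=1". It is an assumption-laden example rather than an official parser: the function name is made up, and treating "." as "no data" is a guess based on how the excerpts look.

```python
def parse_cell_line_field(field):
    """Parse a field like 'Active Chromatin: GM12878=1;K562=0;Helas3=.'.

    Returns (feature_name, {cell_line: 0, 1, or None}).
    Items without '=' (for example the value in 'Transcribed: 0') are skipped.
    """
    name, _, values = field.strip().strip('"').partition(":")
    states = {}
    for item in values.strip().split(";"):
        cell_line, sep, value = item.partition("=")
        if not sep:
            continue  # not a cell_line=value pair
        # '.' is treated here as "no data" (assumption)
        states[cell_line.strip()] = None if value.strip() == "." else int(value)
    return name.strip(), states

name, states = parse_cell_line_field(
    '"Active Chromatin: GM12878=1;K562=0;Helas3=0;Hepg2=0;H1hesc=1"'
)
print(name, states["H1hesc"])
```

With the state for `H1hesc` extracted this way, each pseudogene's active (1) or silent (0) call can be compared against an independent chromatin state analysis.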

Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. However, looking at
"Database" - "Eukaryote Pseudogenes", I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count), so I’m not sure which one I should use. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

The link now points to build 68. I would
suggest using this file, which contains the latest results based on
release 68 of the Ensembl genomes. The number of pseudogenes changes
because the annotation of protein-coding genes differs between
genome releases.

Data associated with paper “The GENCODE pseudogene resource”

Your work on the ENCODE project has helped to
produce an incredible set of data!

I had a question about your pseudogene article, "The GENCODE
pseudogene resource." You note that at least 9% of them are
transcribed. Do you have a list somewhere? I couldn’t find a
supplementary file that might contain such a list. I realize it would
be quite long, but I assume it must exist somewhere. If not, do you
happen to know if the GULO pseudogene is one of the transcribed

The list of transcribed pseudogenes should be available from the resource associated with the paper.

The data associated with the paper is located at