Question about pseudogene.org data

Q:
I have a question about the pseudogene.org database.
I’m analyzing the human pseudogene database and noticed that many "processed" pseudogenes (>70%) don’t have a polyA tail.
This seems to be the opposite of what textbooks say. Is that true?
What are the criteria for a "processed pseudogene" in pseudogene.org?

I also have another question.
I ran BLAT searches using several pseudogene sequences from each polyA class ("0", "1", "2", or "3").
But most pseudogenes in polyA classes 1, 2, and 3 don’t have a convincing polyA tail when compared to the following criteria.

Polya: "0", "1", "2", or "3".
"0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
"1": has a polyA tail and also a polyadenylation signal within 50 bp of the beginning of the tail
"2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
"3": has a polyA tail but no polyadenylation signal detected

Do the coordinates in the pseudogene.org data depend on the human genome assembly GRCh37/hg19?

A:
Pseudogenes are identified primarily by homology matching of protein sequences against the human genome. However, the pipeline that we use incorporates poly A analysis. Our group published a paper a few years ago in which we showed that ~50% of ribosomal protein pseudogenes do not have a detectable poly A signal (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187539/). We believe that this is due to decay of the genome sequence and nucleotide substitutions.

For detecting poly A signals and for classification, the following criteria are used, according to the paper linked above:

We searched a 1000-bp region that was 3′ to the pseudogene homology segment, with a sliding window of 50 nucleotides for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3′-untranslated regions (3′-UTRs); 90% of 3′-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.
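The procedure quoted above, combined with the polyA classes listed in the question, can be roughly sketched in Python. This is only an illustration under stated assumptions, not the actual pseudogene.org pipeline: the function names (`find_polya_tail`, `classify_polya`) are hypothetical, only the canonical AATAAA signal is searched, the distance is measured from the start of the signal to the start of the most A-rich window, and a signal more than 100 bp upstream is treated as "not detected".

```python
WINDOW = 50        # sliding window size in nucleotides
REGION = 1000      # length of the 3' region that is searched
SIGNAL = "AATAAA"  # canonical polyadenylation signal (other variants ignored here)


def find_polya_tail(downstream):
    """Return (start, a_count) of the most A-rich 50-bp window in the
    first 1000 bp of `downstream`, or None if no window has > 30 A."""
    seq = downstream.upper()[:REGION]
    if len(seq) < WINDOW:
        return None
    count = seq[:WINDOW].count("A")
    best_start, best_count = 0, count
    for start in range(1, len(seq) - WINDOW + 1):
        # Update the A count incrementally as the window slides by one base.
        count += (seq[start + WINDOW - 1] == "A") - (seq[start - 1] == "A")
        if count > best_count:
            best_start, best_count = start, count
    if best_count > 30:
        return best_start, best_count
    return None


def classify_polya(downstream):
    """Assign polyA class 0-3 to the sequence 3' of a pseudogene."""
    tail = find_polya_tail(downstream)
    if tail is None:
        return 0  # no polyA tail detected
    tail_start, _ = tail
    seq = downstream.upper()[:REGION]
    # Closest signal lying entirely upstream of the candidate tail.
    signal_pos = seq.rfind(SIGNAL, 0, tail_start)
    if signal_pos == -1:
        return 3  # tail but no signal
    distance = tail_start - signal_pos
    if distance <= 50:
        return 1
    if distance <= 100:
        return 2
    return 3  # assumption: a signal farther than 100 bp counts as none


if __name__ == "__main__":
    example = "C" * 200 + "AATAAA" + "G" * 20 + "A" * 50 + "C" * 100
    print(classify_polya(example))
```

Because the tail position is taken to be the start of the best window rather than the first A of the run, a short or degraded A-rich stretch can still pass the > 30 A threshold, which is consistent with tails that do not look convincing by eye.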

These criteria might not be very stringent, so some of the detected poly A tails may not look convincing.

And yes, the pseudogene coordinates depend on the human genome assembly from which they are derived, so the genome version number is important.

Gerstein Lab FAQs
