Q:
I’m looking at some of the data connected with your recent publication and was wondering if I could get clarification on the BioType attribute in the following file:
http://www.pseudogene.org/psicube/data/Worm-Annotation.bed
There appear to be three biotypes in this file:
processed_pseudogene
pseudogene
unprocessed_pseudogene
Looking through the paper and the supplementary material, I can find references to processed_pseudogene and unprocessed_pseudogene, but not to the generic pseudogene. Based on the S1 material, I would not expect to see this third biotype.
–snip–
(a) Classification
Pseudogenes were classified as “processed” if they have lost their parental gene structures.
Conversely, we classified pseudogenes as “unprocessed”/ “duplicated” if they retained the
same exon-intron structure as their parent loci. In ambiguous cases we used other features to
resolve the provenance of the pseudogene. Where the pseudogene represented a fragment of
the parent, and the homology ended precisely at a splice junction the pseudogene was called
“unprocessed” (“duplicated”). Conversely, where the fragment contained the fusion of two or
more exons the pseudogene was called “processed”. If the parent had a single exon CDS, the
presence of parent gene structure in the 5′ UTR region (identified by alignment of mRNA and
EST evidence) allowed the pseudogene to be called “unprocessed”/“duplicated”. Meanwhile,
the presence of a pseudopoly(A) signal (the position of the parent poly(A) signal at the
pseudogene locus) followed by a tract of A-rich sequence in the genome (indicating the
insertion site of the polyadenylated parental mRNA) indicated a “processed” pseudogene. If
there was no other evidence available to resolve the route by which the pseudogene was
created, we used the position of the pseudogene relative to its parent. As such “processed”
pseudogenes are reinserted into the genome with an approximately random distribution while
“unprocessed”/“duplicated” pseudogenes tend to be more closely associated with the parent
locus. Parsimony therefore suggests that pseudogenes that lie near to the parent locus are
more likely to have arisen via a gene-duplication event than retrotransposition, and this was
used as a tie-breaker in defining the pseudogene biotype.
–snip–
I hope I haven’t missed anything obvious, but any clarification would help greatly.
A:
When we classify pseudogenes by biotype, we distinguish processed pseudogenes from duplicated (unprocessed) pseudogenes. The biotype depends on how the pseudogene formed (retrotransposition vs. duplication), and this is the classification described in the supplementary material. The third label that appears in some of the files on the psicube website is not a biotype per se. These pseudogenes are mostly highly degraded or short fragments, and we could not confidently assign a definite biotype to them. In other words, pseudogenes labeled “pseudogene” have an undetermined biotype. Instead of marking them “NA” (not available or unknown), we opted to simply call them “pseudogene”.
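
For anyone tallying biotypes from one of these files, the following is a minimal Python sketch of how the generic label could be treated as "undetermined" rather than as a third formation class. It is an illustration, not part of the psicube pipeline: the function names, the "undetermined" replacement string, and the example rows are hypothetical, and the column holding the biotype is passed in as a parameter because the file layout is not described here and should be checked against the actual file.

```python
from collections import Counter

# Label used for pseudogenes whose formation route (retrotransposition
# vs duplication) could not be assigned with confidence.
UNDETERMINED_LABEL = "pseudogene"


def normalize_biotype(biotype):
    """Map the generic 'pseudogene' label to an explicit 'undetermined'."""
    biotype = biotype.strip()
    return "undetermined" if biotype == UNDETERMINED_LABEL else biotype


def count_biotypes(lines, biotype_col):
    """Tally biotypes from tab-separated lines.

    biotype_col is a 0-based column index. Which column holds the biotype
    is an assumption the caller must verify against the real file.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and common BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if biotype_col >= len(fields):
            continue  # row too short to contain the biotype column
        counts[normalize_biotype(fields[biotype_col])] += 1
    return counts


# Made-up rows for illustration only; not taken from Worm-Annotation.bed.
example_rows = [
    "chrI\t100\t500\tpg1\tprocessed_pseudogene",
    "chrI\t900\t1500\tpg2\tunprocessed_pseudogene",
    "chrII\t50\t120\tpg3\tpseudogene",
    "chrII\t300\t380\tpg4\tpseudogene",
]
print(count_biotypes(example_rows, biotype_col=4))
```

To run it on a real file, open the file and pass the file object as `lines`, with `biotype_col` set to whichever column actually carries the biotype.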