I am trying to get fly pseudogene information available from pseudogene.org.
I want to know the parent gene of any pseudogene. Pseudogene.org provides “parent proteins”, such as FBpp0112526. However, I cannot find this ID in FlyBase. Is it a FlyBase ID? If not, which database is it from?
The fly pseudogene information currently available on the pseudogene.org website is old. As you can see, it is based on Ensembl build 50, whereas the current Ensembl release is 75. The FBpp00… ID is an Ensembl protein ID based on FlyBase. However, many of these IDs have been deprecated between the two releases. We are currently preparing a new annotation file for fly pseudogenes based on the final stable gene annotation, and it will be available online shortly. If you still want to use the pseudogene.org fly pseudogene annotation, you can run all the parent protein IDs in the file through Ensembl BioMart to see which IDs are still current and which are retired. Ensembl BioMart also gives you the option to retrieve the corresponding transcript and gene ID for each protein ID.
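For illustration, here is a minimal Python sketch of that check. It assumes you have already exported a tab-separated table from BioMart with gene, transcript and protein ID columns. The column names (`gene_id`, `transcript_id`, `protein_id`) and all IDs in the toy data are placeholders invented for the example, not real BioMart headers or FlyBase identifiers, so adjust them to match your own export.

```python
import csv
import io


def load_biomart_mapping(tsv_text):
    """Map protein ID -> (transcript ID, gene ID) from a BioMart-style TSV export."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping = {}
    for row in reader:
        protein_id = row["protein_id"].strip()
        if protein_id:  # skip rows whose protein column is empty
            mapping[protein_id] = (row["transcript_id"], row["gene_id"])
    return mapping


def split_current_and_retired(parent_ids, mapping):
    """Separate parent protein IDs found in the export from those that are missing."""
    current = {pid: mapping[pid] for pid in parent_ids if pid in mapping}
    retired = [pid for pid in parent_ids if pid not in mapping]
    return current, retired


# Toy export with placeholder IDs (assumption: not real identifiers).
EXPORT = (
    "gene_id\ttranscript_id\tprotein_id\n"
    "FBgn_A\tFBtr_A\tFBpp_A\n"
    "FBgn_B\tFBtr_B\t\n"
)

# Parent protein IDs as they might be read from the pseudogene.org file.
parents = ["FBpp_A", "FBpp_OLD"]

current, retired = split_current_and_retired(parents, load_biomart_mapping(EXPORT))
print("still current:", current)
print("not in current release:", retired)
```

An ID that is missing from the export is only presumed retired; it is worth confirming a few by hand before discarding the corresponding pseudogenes.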
By downloading the fly pseudogenes from pseudogene.org, I can get >1000 pseudogenes, but if I use BioMart, after selecting pseudogene, I can only get 175 pseudogenes. Why?
Since all the pseudogenes at pseudogene.org were identified by your lab, you must have their parent information (gene name or transcript name). Could you provide that information? I do not need parent protein name.
By the way, what pipeline did the lab use to identify the pseudogenes? Do the pseudogenes have UTRs? Which paper did the lab publish on how the pseudogenes were identified?
As I said before, the fly pseudogenes available from pseudogene.org are based on a very old gene annotation (Ensembl 50). The quality of the pseudogene annotation depends on the quality of the gene annotation. Since the fly gene annotation for Ensembl build 50 was just a draft, many of the pseudogene entries that we obtained from build 50 are actually false positives. We are currently working on the latest fly pseudogene annotation and will make it available soon (within the next couple of weeks). Our latest annotation contains about 150 pseudogenes. This set was obtained using a combination of manual and automatic annotation. The automatic annotation was produced with PseudoPipe, a pseudogene annotation pipeline.
Also, if you select pseudogenes in BioMart, you will find only the Ensembl-annotated pseudogenes. Those pseudogenes were identified using the Ensembl annotation pipeline.
The Gerstein lab has published numerous papers on pseudogene annotation. For the full list, please see: http://papers.gersteinlab.org/subject/pseudogenes/index.html
The pseudogenes do have UTRs; however, at the moment we do not provide a UTR annotation for fly pseudogenes.
How many of your 150 pseudogenes are among the 175 pseudogenes in Ensembl obtained via BioMart?
Your pseudogene pipeline starts from protein sequences; is that why your report has no UTRs?
I attach here (see below) the latest fly pseudogene annotation.
Regarding your questions:
1. There is a reasonable overlap between the Ensembl pseudogenes and our set. However, I have to mention that the Ensembl pseudogenes are based on automatic annotation, while our pseudogenes are also manually annotated.
2. Yes, our pseudogene annotation pipeline uses the protein data.
Thanks for that. But does the attached file contain only the pseudogenes common to the Gerstein lab annotation and the Ensembl annotation? I ask because each row has an Ensembl ID.
The file contains the latest Gerstein lab annotation. Our annotation was done using a combination of automatic and manual annotation, so it is of higher quality than the Ensembl one. The pseudogenes do have Ensembl IDs for easier processing.
I am confused. Could you, for example, show me one pseudogene that is annotated by Gerstein lab, but not by Ensembl?
Maybe I was not clear: our pseudogenes are available through Ensembl, but there are Ensembl-only pseudogenes that have no counterpart in our data set. Also, we define their biotype, whereas in Ensembl you won’t find the biotype information.
Oh? I heard that the Gerstein lab submitted its latest annotation not long ago, and that it will not be available to the public right now. Is the file you attached earlier exactly that latest, still unpublished annotation?
Was the file you sent me obtained through Ensembl BioMart? If so, how should I set the "Filters" there in order to get the same file as you?
Could I also have the pseudogenes that are not included in Ensembl pseudogene list?
By the way, what is the difference between "processed_pseudogene" and "unprocessed_pseudogene"?
Sorry to keep bothering you, and thanks for your patience.
The file that I sent you is our latest and as yet unpublished annotation, and yes, it is not publicly available at the moment. But this will be the official list of pseudogenes to use for the fly genome, since it is a high-quality set in which each pseudogene annotation has been validated through manual inspection.
The “processed” and “unprocessed” nomenclature refers to the pseudogene biotype, a classification of pseudogenes based on their mode of creation (e.g. processed pseudogenes were formed through retrotransposition, while unprocessed pseudogenes are usually the product of duplication). If no specific biotype is given, e.g. just “pseudogene” in the biotype field, that means we could not assign a definite biotype to that particular element.
If you want to compare our pseudogene set with the one from Ensembl, I would recommend using BEDTools: create a BED file for each set and intersect them.
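To show what such an intersection reports, here is a small pure-Python sketch of the same idea. It is not BEDTools itself, only an illustration that assumes two small BED files already loaded as text; the coordinates and names below are invented toy values. With BEDTools, the equivalent would be `bedtools intersect` with `-u` (entries in A that overlap B) or `-v` (entries in A with no overlap in B). Strand is ignored here.

```python
def parse_bed(text):
    """Parse BED lines into (chrom, start, end, name); BED is 0-based, half-open."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else ""
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records


def overlaps(a, b):
    """True if two records share a chromosome and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_by_overlap(set_a, set_b):
    """Return (records of A overlapping B, records of A not overlapping B)."""
    shared, only_a = [], []
    for rec in set_a:
        if any(overlaps(rec, other) for other in set_b):
            shared.append(rec)
        else:
            only_a.append(rec)
    return shared, only_a


# Toy data (assumption: invented coordinates and names, not real pseudogenes).
LAB_BED = "2L\t100\t500\tlab_pg1\n3R\t1000\t1500\tlab_pg2\n"
ENSEMBL_BED = "2L\t400\t900\tens_pg1\nX\t50\t80\tens_pg2\n"

lab = parse_bed(LAB_BED)
ensembl = parse_bed(ENSEMBL_BED)

shared, lab_only = split_by_overlap(lab, ensembl)
_, ensembl_only = split_by_overlap(ensembl, lab)

print("in both sets:", [r[3] for r in shared])
print("lab only:", [r[3] for r in lab_only])
print("Ensembl only:", [r[3] for r in ensembl_only])
```

The nested loop is fine for a few hundred pseudogenes per set; for genome-scale files, BEDTools itself is the better choice.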