Questions about using PseudoPipe

Posted on May 19, 2019 by gersteinfaq

Q1:
First of all, I must show great respect for your brilliant work developing the PseudoPipe software.
I am now working on my graduate paper and need to use this software, but I have run into some problems, so any guidance or assistance from you would be appreciated.
I downloaded the software package from your website and unpacked it in my home directory (that is, ~/), but when I tested it according to your manual, it reported errors as below:
I have tried several ways to fix it, even modifying the source code, but failed. It has been driving me somewhat crazy, haha.
Could you please provide some suggestions? Thanks in advance!

A1:
It looks like your installation is not referencing Python properly. Please edit the env.sh file so that it points to the appropriate path for Python on your system.

Q2:
Following your suggestion, I have now set all the environment variables in env.sh, but I still got an error while running the software (see Fig. 1 below).
So I tried to fix the code of pseudopipe.sh, and I finally got it to run just by changing "source setenvPipelineVars" to "source ./setenvPipelineVars" at line 141. I then got the final result file (Fig. 2) by running your sample data. Is the result correct?
I don’t know if anybody has reported a similar error before. If not, I hope this will contribute to improving your powerful software. It would also be great if your manual or README could show what the standard output and final result file look like when testing the sample data.

A2:
The results look right. Thank you for your suggestions; we will take them into account in a future update of the pipeline.

Good luck with your analysis.

Question about publication data “Comparative analysis of pseudogenes across three phyla”

Posted on May 11, 2019 by gersteinfaq

Q:
I’m looking at some of the data connected with your recent publication and was wondering if I could get clarification on the BioType attribute in the following file:

http://www.pseudogene.org/psicube/data/Worm-Annotation.bed

In this file there appear to be three biotypes:

processed_pseudogene
pseudogene
unprocessed_pseudogene

Looking through the paper and the supplementary material, I can find references to processed_pseudogene and unprocessed_pseudogene, but not to the generic pseudogene. Reading the S1 material, I would not expect to see this third biotype.

–snip–
(a) Classification
Pseudogenes were classified as “processed” if they have lost their parental gene structures.
Conversely, we classified pseudogenes as “unprocessed”/ “duplicated” if they retained the
same exon-intron structure as their parent loci. In ambiguous cases we used other features to
resolve the provenance of the pseudogene. Where the pseudogene represented a fragment of
the parent, and the homology ended precisely at a splice junction the pseudogene was called
“unprocessed” (“duplicated”). Conversely, where the fragment contained the fusion of two or
more exons the pseudogene was called “processed”. If the parent had a single exon CDS, the
presence of parent gene structure in the 5′ UTR region (identified by alignment of mRNA and
EST evidence) allowed the pseudogene to be called “unprocessed”/“duplicated”. Meanwhile,
the presence of a pseudopoly(A) signal (the position of the parent poly(A) signal at the
pseudogene locus) followed by a tract of A-rich sequence in the genome (indicating the
insertion site of the polyadenylated parental mRNA) indicated a “processed” pseudogene. If
there was no other evidence available to resolve the route by which the pseudogene was
created, we used the position of the pseudogene relative to its parent. As such “processed”
pseudogenes are reinserted into the genome with an approximately random distribution while
“unprocessed”/“duplicated” pseudogenes tend to be more closely associated with the parent
locus. Parsimony therefore suggests that pseudogenes that lie near to the parent locus are
more likely to have arisen via a gene-duplication event than retrotransposition, and this was
used as a tie-breaker in defining the pseudogene biotype.
–snip–

I hope I haven’t missed anything obvious, but any clarification would help greatly.

A:
When we classify pseudogenes by biotype, we have processed pseudogenes and duplicated pseudogenes. The biotype depends on the pseudogene formation process (retrotransposition vs. duplication), and this is the description that you see in the supplementary material. The third biotype that you find in some of the files on the psicube website is not actually a biotype per se. These pseudogenes are mostly highly degraded or short fragments, and we could not assign a definite biotype to them with high confidence. In other words, the pseudogenes with “pseudogene” as biotype actually have an undetermined biotype. Instead of labeling them “NA” (not available or unknown), we opted to simply call them “pseudogene”.
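One way to see how many entries carry each of the three labels is to tally them directly from the annotation file. The sketch below assumes the file is tab-separated and that the biotype appears verbatim as one of the fields; the column position is deliberately not assumed, the function name `count_biotypes` is hypothetical, and the example records are made up for illustration.

```python
from collections import Counter

# The three labels seen in the file; plain "pseudogene" means the biotype
# could not be determined.
BIOTYPES = {"processed_pseudogene", "unprocessed_pseudogene", "pseudogene"}


def count_biotypes(lines):
    """Tally biotype labels found in tab-separated annotation lines.

    Assumption: each record carries its biotype verbatim in one of its
    tab-separated fields; the exact column is not assumed.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and BED header lines
        for field in line.split("\t"):
            if field in BIOTYPES:
                counts[field] += 1
                break  # count each record once
    return counts


# Made-up records, for illustration only (not taken from the real file).
example = [
    "chrI\t100\t500\tpgA\tprocessed_pseudogene",
    "chrI\t900\t1500\tpgB\tpseudogene",
    "chrII\t50\t300\tpgC\tunprocessed_pseudogene",
    "chrII\t700\t800\tpgD\tpseudogene",
]
counts = count_biotypes(example)
undetermined = counts["pseudogene"]
print(dict(counts), "undetermined:", undetermined)
```

For a real file, an open file handle can be passed in place of the list, since the function only iterates over lines.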

Pseudogenes catalogue

Posted on May 11, 2019 by gersteinfaq

Q:
I came across your paper ‘The GENCODE pseudogene resource’.

It is a great paper.

Could you please tell me where I would be able to find the list of pseudogenes mentioned in the paper?

I didn’t find any downloadable database in the paper.

A:
See psicube.pseudogene.org and http://pseudogene.org/psidr

Pseudogene ontology

Posted on May 11, 2019 by gersteinfaq

Q:
I’m currently working on reasoning with ontologies, and read with great
interest your paper about reasoning over the pseudogene set.
I’m particularly interested in the time you give for the reasoning to
complete, since performance is my biggest issue. So I wanted to ask
you what the size of the data you used was, since the pseudogene set
does not seem to be available online anymore.

A:
The pseudogene data set used in this paper can be found at
http://www.pseudogene.org/sdpgenes/all_pgenes/ and is based on
Ensembl genome release 48. There are 2,294 duplicated pseudogenes and
10,187 processed pseudogenes on chromosomes 1-22, X and Y. Hope this
helps.

PGOHUM00000250823 and SETP13/SETP3 pseudogenes

Posted on May 3, 2019 by gersteinfaq

Q:
Could you please investigate the support you have for PGOHUM00000250823? At NCBI, we have PGOHUM00000250823 associated with HGNCid 42932, official symbol SETP13, and RefSeq accession NG_032538.1. However, we have a nearly identical RefSeq accession NG_032022.1 associated with HGNCid 31115 and official symbol SETP3. On the current human reference assembly, GRCh38, both NG_032538.1 and NG_032022.1 align perfectly to the same locus and have no other hits of comparable quality to the reference assembly (or to alternate assemblies HuRef or CHM1_1.1). In NCBI’s latest annotation (Annotation Release 106) SETP3 was annotated on the assembly but SETP13 was not because it overlapped with SETP3. Do you have any evidence these are distinct pseudogenes? If not, NCBI’s preference would be to preserve the older nomenclature associated with SETP3 and NG_032022.1. Also, if we agree that SETP13 is redundant with SETP3 then I will proceed to notify HGNC and will CC you on that email.

A:
I looked at the PGOHUM00000250823/SETP13 locus. Our pipeline
predicted it as a pseudogene of SET with around 90% sequence identity.
Compared to the SETP3 locus, SETP13 lacks sequences at both the 3’ and 5’
ends, which can be aligned to the UTR regions of SET. Our pipeline
missed these sequences at both ends because it searches for sequence
homologous to CDS regions only. We are actually thinking of including
some checks of UTR alignment in our revised pipeline. Thanks for
pointing this case out to us.

We have no objection to merging the SETP13 locus with SETP3.

Questions regarding pseudogene.org

Posted on May 3, 2019 by gersteinfaq

Q1:

I am trying to get the fly pseudogene information available from pseudogene.org.
I want to know the parent gene of any pseudogene. Pseudogene.org provides “parent proteins”, such as FBpp0112526. However, I cannot find this ID in FlyBase. Is it a FlyBase ID? If not, which database is that ID from?

A1:
The fly pseudogene information currently available on the pseudogene.org website is old. As you can see, it is from Ensembl build 50, whereas the current Ensembl release is 75. The FBpp00… ID is an Ensembl protein ID based on FlyBase. However, many of these IDs have been deprecated between the two releases. We are currently preparing a new annotation file for fly pseudogenes based on the final stable gene annotation, and it will be available online shortly. If you still want to use the pseudogene.org fly pseudogene annotation, you can look up all the parent protein IDs in the file using Ensembl BioMart to see which IDs are still current and which are retired. Ensembl BioMart also gives you the option to get the corresponding transcript and gene ID for each protein ID.

Q2:
By downloading the fly pseudogenes from pseudogene.org, I get >1000 pseudogenes, but if I use BioMart and select pseudogenes, I get only 175. Why?

Since all the pseudogenes at pseudogene.org were identified by your lab, you must have their parent information (gene name or transcript name). Could you provide that information? I do not need the parent protein name.

By the way, what pipeline did the lab use to identify the pseudogenes? Do the pseudogenes have UTRs? Which paper did the lab publish on how the pseudogenes were identified?

A2:
As I said before, the fly pseudogenes that are available from pseudogene.org are based on a very old gene annotation (Ensembl 50). The quality of the pseudogene annotation depends on the quality of the gene annotation. Since the fly gene annotation for Ensembl build 50 was just a draft, many of the pseudogene entries that we obtained from build 50 are actually false positives. We are currently working on the latest fly pseudogene annotation and will make it available soon (in the next couple of weeks). Our latest annotation has about 150 pseudogenes. This set was obtained using a combination of manual and automatic annotation. The automatic annotation was obtained using PseudoPipe, a pseudogene annotation pipeline.

Also if you select pseudogenes in BioMart, you will find only the Ensembl annotated pseudogenes. Those pseudogenes were identified using the Ensembl annotation pipeline.

Gerstein lab has published numerous papers regarding pseudogene annotation. For the full list please see: http://papers.gersteinlab.org/subject/pseudogenes/index.html

The pseudogenes do have UTRs; however, at the moment we do not provide a UTR annotation for fly pseudogenes.

Q3:
How many of your 150 pseudogenes are among the 175 pseudogenes in Ensembl obtained via BioMart?

Does your pseudogene pipeline start from protein sequences, and is that why your report has no UTRs?

A3:
I attach here (see below) the latest fly pseudogene annotation.
Regarding your questions:

1. There is a reasonable overlap between the Ensembl pseudogenes and our set. However, I have to mention that the Ensembl pseudogenes are based on automatic annotation, while our pseudogenes are also manually annotated.

2. Yes, our pseudogene annotation pipeline uses the protein data information.

Q4:
Thanks for that. But does the attached file contain only the pseudogenes common to the Gerstein lab annotation and the Ensembl annotation? I ask because each row has an Ensembl ID.

A4:
The file contains the latest Gerstein lab annotation. Our annotation was done using a combination of automatic and manual annotation, so it is of higher quality than the Ensembl one. The pseudogenes do have Ensembl IDs for easier processing.

Q5:
I am confused. Could you, for example, show me one pseudogene that is annotated by Gerstein lab, but not by Ensembl?

A5:
Maybe I was not clear: our pseudogenes are available through Ensembl, but there are Ensembl-only pseudogenes that have no counterpart in our data set. We also define their biotype, while in Ensembl you won’t find the biotype information.

Q6:
Oh? I heard that the Gerstein lab submitted its latest annotation not long ago, and that it is not yet available to the public. Is the file you attached exactly that latest, still unpublished annotation?

Was the file you sent me obtained from Ensembl BioMart? If so, how should I set the "Filters" there in order to get the same file?

Could I also have the pseudogenes that are not included in the Ensembl pseudogene list?

By the way, what is "processed_pseudogene" vs. "unprocessed_pseudogene"?

Sorry to keep bothering you, and thanks for your patience.

A6:
The file that I sent you is our latest, as yet unpublished annotation, and yes, it is not publicly available at the moment. But this will be the official list of pseudogenes to use for the fly genome since it is a high-quality set, with each pseudogene annotation validated through manual inspection.

The “processed” and “unprocessed” nomenclature refers to the pseudogene biotype, a classification of pseudogenes based on their mode of creation (e.g. processed pseudogenes were formed through retrotransposition, while unprocessed pseudogenes are usually the product of duplication). If there is no defined nomenclature, e.g. just “pseudogene” in the biotype field, that means we could not assign a definite biotype to that particular element.

If you want to compare our pseudogene set with the one from Ensembl, I would recommend using bedtools: create a BED file for each set and intersect them.
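The intersection suggested above can also be sketched in plain Python for small sets. This sketch assumes each set is a list of (chromosome, start, end, name) tuples in BED-style 0-based, half-open coordinates; the function name `overlapping_pairs` and the example records are hypothetical, strand is ignored, and any overlap of at least 1 bp counts.

```python
from collections import defaultdict


def overlapping_pairs(set_a, set_b):
    """Return (name_a, name_b) pairs whose intervals share at least 1 bp.

    Intervals are (chrom, start, end, name) with half-open coordinates,
    so [start, end) overlaps [s, e) when start < e and s < end.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in set_b:
        by_chrom[chrom].append((start, end, name))

    pairs = []
    for chrom, start, end, name in set_a:
        for b_start, b_end, b_name in by_chrom.get(chrom, ()):
            if start < b_end and b_start < end:
                pairs.append((name, b_name))
    return pairs


# Made-up intervals, for illustration only.
lab_set = [("2L", 100, 500, "lab1"), ("2L", 1000, 1200, "lab2"), ("X", 50, 80, "lab3")]
ensembl_set = [("2L", 450, 700, "ens1"), ("2L", 1200, 1300, "ens2"), ("3R", 50, 80, "ens3")]

pairs = overlapping_pairs(lab_set, ensembl_set)
shared = {a for a, _ in pairs}
lab_only = [name for _, _, _, name in lab_set if name not in shared]
print(pairs, lab_only)
```

The nested loop compares every pair of intervals on a chromosome, which is fine for a few hundred pseudogenes; for genome-scale sets, a sorted sweep or a dedicated interval tool is the better choice.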

Pseudogenes lacking on short arm of chr13, 14, 15 and 22

Posted on May 3, 2019 by gersteinfaq

Q:
I found your paper (http://genomebiology.com/2012/13/9/r51) about
pseudogenes very informative.

I have been looking at the distribution of pseudogenes
(http://nagarjunv.blogspot.se/2013/12/pseudogene-distribution-across-human.html)
using the Ensembl annotation as well as the one found on pseudogene.org.

However, pseudogenes seem to be lacking on the short arms of chr13, 14, 15 and 22.

Could you please let me know if this is a known biological pattern or
some missing annotation?

A:
In human, chromosomes 13, 14, 15, 21 and 22 are acrocentric. They are made of a very long arm and a very short arm that is homologous across the five chromosomes. Only the long arm has been sequenced and annotated. This is why there are no pseudogenes (or genes, for that matter) annotated on those chromosome arms.

For more info please see:
http://www.nature.com/nature/journal/v428/n6982/full/nature02379.html
and
http://www.sanger.ac.uk/about/history/hgp/chr13.html

Question about pseudogene.org data

Posted on May 3, 2019 by gersteinfaq

Q:
I have a question about the pseudogene.org database.
I’m analyzing the human pseudogene database and noticed that many "processed" pseudogenes (>70%) don’t have a polyA tail.
This seems to be the opposite of what textbooks say. Is that true?
What are the criteria for a "processed pseudogene" in pseudogene.org?

I also have another question.
I ran BLAT searches using several pseudogene sequences from each "polyA" class ("0", "1", "2" or "3").
But most sequences in polyA classes 1, 2 and 3 don’t have a convincing polyA tail when compared with the following criteria.

Polya: "0" or "1" or "2" or "3".
"0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
"1": has a polyA tail and also a polyadenylation signal within 50 bp of the beginning of the tail
"2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
"3": has a polyA tail but no polyadenylation signal detected.

Do the coordinates in the pseudogene.org data depend on the human genome assembly GRCh37/hg19?

A:
Pseudogenes are identified primarily by homology matching of protein sequence against the human genome. However, the pipeline that we use incorporates poly A analysis. Our group published a paper a few years ago where we showed that ~ 50% of ribosomal protein pseudogenes do not have a detectable poly A signal. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187539/ . We believe that this is due to decay in genome sequence and nucleotide substitutions.

For detecting poly A signals and classification, the following criteria are used, according to the paper linked above:

We searched a 1000-bp region that was 3′ to the pseudogene homology segment, with a sliding window of 50 nucleotides for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3′-untranslated regions (3′-UTRs); 90% of 3′-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.
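The search described in this passage can be sketched roughly as follows. This is an illustration of the stated criteria (a 50-nt sliding window, more than 30 adenines, an AATAAA signal upstream of the candidate tail), not the PseudoPipe code. The function name `find_polya`, the return format and the tie-breaking rule (the first best window wins) are assumptions, and only the canonical AATAAA signal is checked.

```python
def find_polya(downstream, window=50, min_a=30, signal="AATAAA"):
    """Look for a poly(A) tail candidate in the region 3' of a pseudogene.

    downstream: sequence 3' of the homology segment (the text uses 1000 bp).
    Returns (window_start, a_count, signal_pos). window_start is None when
    no window has more than min_a adenines; signal_pos is None when no
    signal lies entirely upstream of the chosen window.
    """
    seq = downstream.upper()
    if len(seq) < window:
        return None, 0, None

    # Count A's in the first window, then slide one base at a time.
    count = seq[:window].count("A")
    best_start, best_count = 0, count
    for start in range(1, len(seq) - window + 1):
        count -= seq[start - 1] == "A"           # base leaving the window
        count += seq[start + window - 1] == "A"  # base entering the window
        if count > best_count:
            best_start, best_count = start, count

    if best_count <= min_a:
        return None, best_count, None

    # Closest signal that ends at or before the start of the chosen window.
    signal_pos = seq.rfind(signal, 0, best_start)
    return best_start, best_count, (signal_pos if signal_pos != -1 else None)


# Made-up sequence: a signal at position 20 and a 40-A run starting at 40.
example = "GC" * 10 + "AATAAA" + "GC" * 7 + "A" * 40 + "GC" * 30
result = find_polya(example)
no_tail = find_polya("GC" * 100)
print(result, no_tail)
```

Note that the reported position is the start of the most adenine-rich window, which can lie a little upstream of the first A of the tail itself.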

These criteria might not be very stringent.

And yes, the pseudogene coordinates depend on the human genome assembly from which they are derived, hence the human genome version number is important.

Search help for PseudoPipe program

Posted on May 2, 2019 by gersteinfaq

Q:
Recently, I read your published paper "PseudoPipe: an automated pseudogene
identification pipeline" (Vol. 22 no. 12 2006, pages
1437–1439, doi:10.1093/bioinformatics/btl116), which impressed me very much. I
really admire the excellent work by you and your co-workers.
After reading the paper, I downloaded the PseudoPipe program (Pipeline
Source Code) from http://pseudogene.org/ and tried to use it to identify
pseudogene sequences in mammalian genomes. However, I ran into some problems
during a preliminary test. I used the existing data
(caenorhabditis_elegans_62_220a) as input and installed Python 2.26, but the
PseudoPipe program failed to run parseFastaAlignment.py. From what I can tell,
the fastaAlign step did not work properly. This puzzles me a lot, and I would
appreciate it if you could help me solve it.

A:
Please note that PseudoPipe was written to discover pseudogenes in mammalian genomes; it does not work well on C. elegans.

The pseudogene information of zebrafish in Pseudogene.org should be updated

Posted on May 2, 2019 by gersteinfaq

Q:
In Pseudogene.org, the pseudogene datasets for zebrafish (Danio rerio)
are based on old annotations (Ensembl 55?). They contain about 1800
processed pseudogenes. However, according to recent research
(http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html),
pseudogenes are rare in zebrafish (only 21 processed
pseudogenes, according to Supplementary Table 14 of the published
manuscript).

Is this large discrepancy due to the old annotations?

A:
You are right about the zebrafish pseudogenes in pseudogene.org. The results were based on an old genome assembly, Ensembl release 55, from 2009. We have noticed that the pseudogene number is far too high, which we believe is partly due to the quality of the genome assembly and partly because the pipeline parameters were optimized using primate genomes. Given that, for up-to-date pseudogene information in zebrafish, people should refer to the Nature publication: http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html.

Thanks for pointing this out to us.
