Question about a potential error with Pseudogene.org

Q1:
I want to say great job with the Pseudogene.org site! I recently noticed a potential error and wanted to send an email to let you know, in case you haven’t already picked it up yourselves.

In the file located at the following address:

http://www.pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt

The start and end chromosomal locations for the pseudogenes are the same. See below:

ENST00000344844.3 unprocessed_pseudogene chr19 + 9314984 9314984 ENSG00000237521.1 ENST00000456448.1 OR7E24 "Transcribed: 0" "Active Chromatin: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=1" "Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=." "TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0" "Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0" "Constraint: 0"

ENST00000359901.3 unprocessed_pseudogene chr2 - 98123508 98123508 . . . "Transcribed: 0" "Active Chromatin: GM12878=1;K562=0;Helas3=0;Hepg2=0;H1hesc=1" "Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=." "TFBS: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0" "Pol2: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0" "Constraint: 0"

ENST00000459808.1 processed_pseudogene chr3 - 136527393 136527393 ENSG00000198075.5 ENST00000272452.2 SULT1C4 "Transcribed: 0" "Active Chromatin: GM12878=1;K562=0;Helas3=1;Hepg2=1;H1hesc=1" "Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=." "TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0" "Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0" "Constraint: 1"

A1:
Thanks for pointing out the problem. However, I’m a little confused about which file you are referring to. The parents file at the URL in your message (http://www.pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt) does not match the contents you provided. The contents look more like they come from this file: http://pseudogene.org/psidr/psiDR.v0.txt. But neither file has the chromosome coordinate issue you mentioned. Maybe you meant some other file?

Q2:
It appears you are correct: I provided the link for the GENCODE v10 pseudogene resource instead of the v7 resource by mistake. I was, however, able to go back and find the file where I had found the mistake.

I had downloaded the Pseudogene Resource psiDR from the GENCODE website ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) and assumed that this file was the same as the one at the link you provided ( http://pseudogene.org/psidr/psiDR.v0.txt ). It appears they are not the same: the file linked from the GENCODE website ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) shows the problem that I previously described, whereas the file at your link does not.

The file with the problem I described is actually linked at this page: http://www.gencodegenes.org/psidr/
Under the link entitled:
New! Pseudogene Resource psiDR
which redirects to: ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz

I am not sure whether you are part of the administration for the GENCODE site, but if you aren’t, you may want to contact them about the problem, since the data appear to come from your lab.

I am sorry for providing the wrong link earlier. Please let me know if you have any more trouble reproducing the problem.

A2:
I can see the problem too. I’ll contact GENCODE to have the file updated. Thanks for pointing this issue out to us!
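
To screen a downloaded copy of a file for the problem discussed in this thread, it can help to compare the two coordinate columns directly. The following is a minimal sketch, assuming a whitespace-delimited layout in which the start and end coordinates are the fifth and sixth fields, as in the records quoted in the question. The function name and the column positions are illustrative assumptions and are not part of any Pseudogene.org or GENCODE tool.

```python
def find_zero_length_records(lines, start_col=4, end_col=5):
    """Return (transcript_id, chrom, position) for records whose start equals end.

    Column indices are 0-based and are an assumption based on the records
    quoted in this thread; adjust them for the actual file layout.
    """
    flagged = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) <= max(start_col, end_col):
            continue  # too few fields to hold both coordinates
        start, end = fields[start_col], fields[end_col]
        # Only compare fields that are plain non-negative integers.
        if start.isdigit() and end.isdigit() and start == end:
            flagged.append((fields[0], fields[2], int(start)))
    return flagged


example = [
    "ENST00000344844.3 unprocessed_pseudogene chr19 + 9314984 9314984 "
    "ENSG00000237521.1 ENST00000456448.1 OR7E24",
    "ENST00000459808.1 processed_pseudogene chr3 - 136527393 136527393 "
    "ENSG00000198075.5 ENST00000272452.2 SULT1C4",
]
for record in find_zero_length_records(example):
    print(record)
```

Both example records are flagged because their start and end fields are identical. To check a real file, pass an open file handle instead of the `example` list.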

Having some problems while executing PseudoPipe

Q:
I would like to use PseudoPipe, but I have been having problems running it. To test the program, I used the provided example input files (from Caenorhabditis elegans). I modified env.sh to point to python (python2.7), blast (blastall 2.2.25) and tfasty (tfasty36). However, no pseudogenes were found when I ran the command shown below. I have attached a screenshot of the error. Could you give me some guidance on solving this problem?

The command that I am using is:
./pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.I.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chrI_exLocs 0

I downloaded PseudoPipe from:
http://www.pseudogene.org/pseudopipe/

A:
You are not getting any output because you need to use fasta-35.1.5 (tfasty35). Newer versions of FASTA (e.g. tfasty36) have a different output format that cannot be processed by the downstream programs in our pipeline. We are currently working on updating and improving the pipeline, but for the time being please use tfasty35.
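
As a quick sanity check before running the pipeline, one could verify that the configured tfasty executable is the tfasty35 build. This is only a sketch: it assumes the executable name carries the version suffix (tfasty35, tfasty36), as in the names used in this thread, and the helper names are hypothetical rather than part of PseudoPipe.

```python
import os
import re


def tfasty_version_suffix(path):
    """Return the numeric suffix of a tfasty executable name, or None.

    Assumes names such as 'tfasty35' or 'tfasty36'; any other name yields None.
    """
    match = re.fullmatch(r"tfasty(\d+)", os.path.basename(path))
    return match.group(1) if match else None


def is_supported_tfasty(path):
    """True only for the tfasty35 build that this answer recommends."""
    return tfasty_version_suffix(path) == "35"


# Hypothetical install locations, for illustration only.
for candidate in ("/opt/fasta-35.1.5/bin/tfasty35", "/usr/local/bin/tfasty36"):
    status = "ok" if is_supported_tfasty(candidate) else "unsupported by the pipeline"
    print(candidate, "->", status)
```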

Question about fly pseudogenes Sisu et al., 2014 publication

Q:
We here at FlyBase are reviewing the Sisu et al., 2014 publication (PMID: 25157146) to check on the state of our pseudogene annotations. The paper talks about 145 pseudogenes, but the bed file at the PsiCube site (http://pseudogene.org/psicube/) lists only 108. We’ve really poked around the paper and PsiCube to find the remaining 37, but to no avail.

Would you please point us in the right direction, or send us a full list of the 145 pseudogene calls (with coordinates and parental gene/protein calls)?

A:
Thank you for pointing this out. The error will be rectified. The full file is pasted below:

# DUP = duplicated pseudogenes
# PSSD = processed pseudogenes
# FRAG = pseudogenes with ambiguous biotype
#
#
#Chr Start End Strand PgeneID Biotype
chr2L 3162515 3163289 + FBtr0077575 DUP
chr2L 21404115 21404579 + FBtr0085889 DUP
chr2L 21404963 21405361 + FBtr0085890 DUP
chr2L 21405657 21405970 + FBtr0085891 DUP
chr2L 21542989 21543706 + FBtr0085895 DUP
chr2L 20923955 20924534 + FBtr0089857 DUP
chr2L 20928824 20929525 + FBtr0089858 DUP
chr2L 21418852 21419260 + FBtr0091806 DUP
chr2L 21428850 21429382 + FBtr0091809 DUP
chr2L 21438910 21439318 + FBtr0091815 DUP
chr2L 14781715 14783110 - FBtr0100604 DUP
chr2L 5621602 5623760 - FBtr0100853 DUP
chr2L 21589784 21590593 - FBtr0301172 DUP
chr2L 21577205 21577676 - FBtr0301174 DUP
chr2L 20639015 20639578 + FBtr0305612 DUP
chr2L 22131026 22131127 - FBtr0306298 DUP
chr2L 3694105 3694371 - FBtr0307120 DUP
chr2L 16728035 16729599 - FBtr0310391 DUP
chr2L 16700940 16703197 - FBtr0310392 DUP
chr2L 16699921 16700825 - FBtr0310393 DUP
chr2L 21282811 21284273 - FBtr0330681 DUP
chr2L 22066517 22067170 + FBtr0301969 FRAG
chr2L 15836725 15837444 - FBtr0304145 FRAG
chr2L 22340622 22341291 + FBtr0307110 FRAG
chr2L 19074992 19075412 - FBtr0081172 PSSD
chr2L 20862547 20863775 - FBtr0081448 PSSD
chr2L 22226151 22226771 + FBtr0085952 PSSD
chr2L 20901134 20901524 + FBtr0089856 PSSD
chr2L 6972243 6972796 + FBtr0305347 PSSD
chr2LHet 176023 179707 - FBtr0302459 DUP
chr2R 15649932 15650075 - FBtr0086355 DUP
chr2R 11092159 11093839 + FBtr0087364 DUP
chr2R 4456874 4457652 + FBtr0088689 DUP
chr2R 4320391 4320889 - FBtr0088762 DUP
chr2R 7754448 7755084 + FBtr0300860 DUP
chr2R 20405347 20406675 + FBtr0302916 DUP
chr2R 2887656 2889059 + FBtr0303442 DUP
chr2R 10247938 10249285 - FBtr0305617 DUP
chr2R 11261395 11262226 - FBtr0306709 DUP
chr2R 667816 671151 - FBtr0306722 DUP
chr2R 2926635 2927318 + FBtr0306743 DUP
chr2R 969389 969936 - FBtr0307111 DUP
chr2R 617944 618746 - FBtr0307119 DUP
chr2R 14289351 14289802 + FBtr0310489 DUP
chr2R 14289979 14290254 + FBtr0310490 DUP
chr2R 7586597 7587822 - FBtr0088081 FRAG
chr2R 619019 619416 + FBtr0111304 FRAG
chr2R 4044762 4045419 - FBtr0303310 FRAG
chr2R 271084 271321 + FBtr0304148 FRAG
chr2R 11262351 11263204 - FBtr0306724 FRAG
chr2RHet 2909278 2910923 - FBtr0301970 DUP
chr2RHet 334447 335080 + FBtr0302396 DUP
chr2RHet 338744 339377 + FBtr0302397 DUP
chr2RHet 1142083 1142548 + FBtr0302913 DUP
chr2RHet 2387787 2388177 - FBtr0302353 FRAG
chr2RHet 2128429 2129740 + FBtr0302915 FRAG
chr2RHet 2316691 2318206 - FBtr0302232 PSSD
chr3L 2171777 2172773 - FBtr0072914 DUP
chr3L 6141071 6141715 + FBtr0076993 DUP
chr3L 9506745 9507059 - FBtr0091689 DUP
chr3L 21952611 21953348 + FBtr0112457 DUP
chr3L 17878354 17878628 - FBtr0301175 DUP
chr3L 16593548 16595795 - FBtr0301925 DUP
chr3L 20971865 20972750 + FBtr0302444 DUP
chr3L 24539238 24540086 + FBtr0303009 DUP
chr3L 24542736 24543545 + FBtr0303010 DUP
chr3L 19417802 19418026 + FBtr0303863 DUP
chr3L 19471384 19471770 + FBtr0303926 DUP
chr3L 17861840 17863673 + FBtr0304978 DUP
chr3L 17867815 17869561 + FBtr0304979 DUP
chr3L 24527189 24528567 + FBtr0307117 DUP
chr3LHet 899 1989 + FBtr0114264 FRAG
chr3LHet 2277400 2277931 + FBtr0305903 FRAG
chr3LHet 687420 688819 + FBtr0302346 PSSD
chr3R 1094803 1095232 + FBtr0078783 DUP
chr3R 8221729 8224651 - FBtr0082602 DUP
chr3R 21094383 21095687 + FBtr0084817 DUP
chr3R 23670811 23671136 + FBtr0085225 DUP
chr3R 25684763 25685587 - FBtr0085524 DUP
chr3R 26037249 26037475 + FBtr0085613 DUP
chr3R 26038625 26038864 + FBtr0089614 DUP
chr3R 3352091 3353790 + FBtr0090038 DUP
chr3R 15211639 15212893 + FBtr0112481 DUP
chr3R 4086428 4087620 - FBtr0300631 DUP
chr3R 8719694 8720251 - FBtr0303313 DUP
chr3R 11731969 11732472 + FBtr0304144 DUP
chr3R 1674675 1675175 + FBtr0306740 DUP
chr3R 21093731 21094186 + FBtr0306845 DUP
chr3R 1853489 1854343 + FBtr0307114 DUP
chr3R 23786492 23787198 + FBtr0307118 DUP
chr3R 5887408 5888036 + FBtr0091606 FRAG
chr3R 69328 71233 + FBtr0113190 FRAG
chr3R 5079267 5080405 + FBtr0303309 PSSD
chr3R 17555224 17555617 - FBtr0304882 PSSD
chr3R 17007 21933 + FBtr0308945 PSSD
chr3RHet 412765 412948 - FBtr0302440 DUP
chr3RHet 859445 859735 - FBtr0302347 PSSD
chr4 48156 52259 - FBtr0089180 DUP
chr4 33566 45680 - FBtr0089181 DUP
chr4 26789 32391 - FBtr0089182 DUP
chrU 3448785 3449605 + FBtr0114269 DUP
chrU 8029209 8030316 - FBtr0308947 DUP
chrU 1397433 1397911 - FBtr0114236 FRAG
chrU 5607529 5607780 - FBtr0114258 FRAG
chrU 1206488 1206901 + FBtr0302912 FRAG
chrU 2072072 2074236 - FBtr0114183 PSSD
chrX 373897 375842 - FBtr0070095 DUP
chrX 371883 373342 - FBtr0070097 DUP
chrX 6255528 6256993 - FBtr0070923 DUP
chrX 6176311 6177608 - FBtr0070931 DUP
chrX 6174266 6174785 - FBtr0070932 DUP
chrX 9154730 9155365 + FBtr0071318 DUP
chrX 17792676 17793531 + FBtr0074499 DUP
chrX 20254611 20255089 - FBtr0112509 DUP
chrX 7791378 7792005 - FBtr0299927 DUP
chrX 7792469 7793210 - FBtr0300634 DUP
chrX 9153318 9154436 - FBtr0304143 DUP
chrX 15470388 15470997 + FBtr0304150 DUP
chrX 11481316 11483206 + FBtr0306144 DUP
chrX 11483409 11485299 + FBtr0306145 DUP
chrX 11485502 11487392 + FBtr0306146 DUP
chrX 11487595 11489485 + FBtr0306147 DUP
chrX 11489688 11491578 + FBtr0306148 DUP
chrX 11491781 11493671 + FBtr0306149 DUP
chrX 21026458 21027648 - FBtr0307579 DUP
chrX 19846847 19847947 - FBtr0307580 DUP
chrX 19842112 19843212 - FBtr0307581 DUP
chrX 19837392 19838482 - FBtr0307582 DUP
chrX 19832512 19833429 - FBtr0307583 DUP
chrX 19827821 19828630 - FBtr0307584 DUP
chrX 19822508 19823613 - FBtr0307585 DUP
chrX 19813674 19814773 - FBtr0307586 DUP
chrX 19808805 19809718 - FBtr0307587 DUP
chrX 19803641 19804554 - FBtr0307588 DUP
chrX 6844543 6845299 + FBtr0307391 FRAG
chrX 6846369 6846842 + FBtr0071001 PSSD
chrX 20746139 20746330 + FBtr0303307 PSSD
chrX 3691530 3693041 - FBtr0305348 PSSD
chrYHet 312456 313714 - FBtr0114243 DUP
chrYHet 319739 320997 - FBtr0114244 DUP
chrYHet 327052 328489 - FBtr0114245 DUP
chrYHet 307129 307365 - FBtr0114289 DUP
chrYHet 340030 340818 + FBtr0114241 FRAG
chrYHet 205196 205372 - FBtr0302914 FRAG
chrYHet 337134 338414 + FBtr0114242 PSSD
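
To cross-check the list against the counts in the paper, the table can be parsed and tallied by biotype. The sketch below assumes the six whitespace-separated columns named in the header line (Chr, Start, End, Strand, PgeneID, Biotype). The function name is illustrative, and the strand handling tolerates a typographic dash in place of a hyphen, since pasted text sometimes converts one to the other.

```python
from collections import Counter

# Characters that may stand in for a minus strand in pasted text:
# hyphen-minus, en dash, and the Unicode minus sign.
MINUS_VARIANTS = {"-", "\u2013", "\u2212"}


def parse_pgene_rows(lines):
    """Parse 'Chr Start End Strand PgeneID Biotype' rows into dictionaries."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        chrom, start, end, strand, pgene_id, biotype = line.split()
        rows.append({
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "strand": "-" if strand in MINUS_VARIANTS else strand,
            "id": pgene_id,
            "biotype": biotype,
        })
    return rows


sample = [
    "#Chr Start End Strand PgeneID Biotype",
    "chr2L 3162515 3163289 + FBtr0077575 DUP",
    "chr2L 14781715 14783110 \u2013 FBtr0100604 DUP",
    "chr2L 22066517 22067170 + FBtr0301969 FRAG",
    "chr2L 19074992 19075412 - FBtr0081172 PSSD",
]
rows = parse_pgene_rows(sample)
print(len(rows), Counter(row["biotype"] for row in rows))
```

Running the same function over the full pasted list gives the total number of calls and the DUP/PSSD/FRAG breakdown to compare with the publication.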

psiDR query

Q:
I am looking for tissue-specific transcription data for pseudogenes in psiDR. However, I could only find details about their transcription in a few cell lines. Could you point me to a resource or file where this information is available?

A:
Pseudogene transcription is evaluated using various human cell lines and HumanBodyMap data. The latest information on human pseudogene transcription is available in the psiCUBE database: http://pseudogene.org/psicube/
We used RPKM values to assess pseudogene transcription levels, as described in http://www.pnas.org/content/111/37/13361.short

I have attached the pseudogene transcription information with the calculated RPKM values in each cell line.
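
For reference, RPKM (reads per kilobase of transcript per million mapped reads) normalizes a raw read count by the feature length and the sequencing depth. The sketch below shows the standard formula only; the function name and example numbers are illustrative and are not taken from the psiCUBE data or from the pipeline used in the paper.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads.

    RPKM = read_count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Illustrative numbers: 500 reads on a 2,000 bp pseudogene in a library of
# 25 million mapped reads.
print(rpkm(500, 2000, 25_000_000))  # 10.0
```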

pseudogene similarities to parent genes

Q:
I am looking at your paper ("The Gencode pseudogene resource"), which appears very relevant to something I am doing right now. Specifically, I am interested in the sequence identity values between pseudogenes and their parents, which are used in Figure 4. Would it be possible for you to make these available to me (or to tell me where I can download them if they are already online)?

A:
You may find the data at http://pseudogene.org/psidr/similarity.dat

Pseudogene identification pipeline for bacterial genome

Q:
I am writing to you regarding pseudogene detection in bacterial genomes. I was wondering whether there is software or a pipeline I can use to identify pseudogenes within a bacterial genome.

A:
The best way is to use our pseudogene annotation pipeline, PseudoPipe. You can download the stand-alone version, which can easily be run on your computer and does not require a cluster:
http://pseudogene.org/pseudopipe/

Pseudogene talk at ASHG

Q:
I recently attended the ASHG conference where you gave a talk on pseudogene copy number variation based on the 1000 genomes project. I tried looking for this study online and didn’t find anything that was obviously part of your presentation. I was wondering if this data has already been published, and if so if you would let me know what the name of the study was.

A:
I think the studies you are looking for are:

http://www.pnas.org/content/111/37/13361.abstract
and
http://genome.cshlp.org/content/23/12/2042.full.pdf+html

The first is the latest paper from our lab on pseudogene analysis, and the second is a paper on CNVs and retroduplications based on the 1000 Genomes Project.

Mouse Transcribed Pseudogene Data

Q:
I’m currently working on how pseudogenes can act as competitive endogenous RNAs in humans, and would like to expand my study to include mice. I recently read a paper from your lab, “Comparative analysis of pseudogenes across three phyla”, and in the supplementary information you mention that you identified 878 transcribed pseudogenes in the mouse genome. Is there a list of these pseudogenes and their associated parent genes available on the pseudogene.org website or elsewhere?

A:
I think this draft list should be on the psiCUBE site.

Questions about using PseudoPipe

Q1:
First of all, I must express my great respect for your brilliant work in developing the PseudoPipe software.
I am now working on my graduate paper and need to use this software. However, I have run into some problems, so any guidance or assistance from you would be appreciated.
I downloaded the software package from your website and unpacked it in my home directory (that is, ~/), but when I tested it according to your manual, it reported the errors below:
I have tried several ways to fix it, even modifying the source code, but failed. It has been driving me somewhat crazy, haha.
Can you please provide some suggestions? Thanks in advance!

A1:
It looks like your installation is not referencing python properly. Please edit the env.sh file with the appropriate source/path for python in your system.

Q2:
Following your suggestion, I have now finished setting all the environment variables in env.sh, but I still got an error while running the software (see Fig. 1 below).
So I tried to fix the code of pseudopipe.sh, and I finally got it to run just by changing "source setenvPipelineVars" to "source ./setenvPipelineVars" at line 141. I then got the final result file (see Fig. 2) by running your sample data. Is the result correct?
I don’t know whether anybody has reported a similar error before. If not, I hope this will help improve your powerful software. It would also be great if your manual or README showed what the standard output and the final result file look like when testing the sample data.

A2:
The results look right. Thank you for your suggestions; we will take them into account in a future update of the pipeline.

Good luck with your analysis.

Question about publication data “Comparative analysis of pseudogenes across three phyla”

Q:
I’m looking at some of the data connected with your recent publication and was wondering if I could get clarification on the BioType attribute in the following file:

http://www.pseudogene.org/psicube/data/Worm-Annotation.bed

In here there appear to be 3 biotypes

processed_pseudogene
pseudogene
unprocessed_pseudogene

Looking through the paper and the supplementary material, I can find references to processed_pseudogene and unprocessed_pseudogene, but not to the generic pseudogene. Reading the S1 material, I would not expect to see this third biotype.

–snip–
(a) Classification
Pseudogenes were classified as “processed” if they have lost their parental gene structures.
Conversely, we classified pseudogenes as “unprocessed”/ “duplicated” if they retained the
same exon-intron structure as their parent loci. In ambiguous cases we used other features to
resolve the provenance of the pseudogene. Where the pseudogene represented a fragment of
the parent, and the homology ended precisely at a splice junction the pseudogene was called
“unprocessed” (“duplicated”). Conversely, where the fragment contained the fusion of two or
more exons the pseudogene was called “processed”. If the parent had a single exon CDS, the
presence of parent gene structure in the 5′ UTR region (identified by alignment of mRNA and
EST evidence) allowed the pseudogene to be called “unprocessed”/“duplicated”. Meanwhile,
the presence of a pseudopoly(A) signal (the position of the parent poly(A) signal at the
pseudogene locus) followed by a tract of A-rich sequence in the genome (indicating the
insertion site of the polyadenylated parental mRNA) indicated a “processed” pseudogene. If
there was no other evidence available to resolve the route by which the pseudogene was
created, we used the position of the pseudogene relative to its parent. As such “processed”
pseudogenes are reinserted into the genome with an approximately random distribution while
“unprocessed”/“duplicated” pseudogenes tend to be more closely associated with the parent
locus. Parsimony therefore suggests that pseudogenes that lie near to the parent locus are
more likely to have arisen via a gene-duplication event than retrotransposition, and this was
used as a tie-breaker in defining the pseudogene biotype.
–snip–

I hope I haven’t missed anything obvious, but any clarification would help greatly.

A:
When we classify pseudogenes by biotype, we distinguish processed pseudogenes and duplicated pseudogenes. The biotype depends on how the pseudogene formed (retrotransposition vs. duplication), and this is the description that you see in the supplementary material. The third biotype that you find in some of the files on the psiCUBE website is not a biotype per se: these pseudogenes are mostly highly degraded or short fragments, and we could not confidently assign a definite biotype to them. In other words, pseudogenes labeled “pseudogene” have an undetermined biotype. Instead of labeling them “NA” (not available or unknown), we opted to simply call them “pseudogene”.
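
When working with the annotation files, it may be convenient to make this interpretation explicit in code. The sketch below maps the three labels seen in the file to the meanings described in this answer. The dictionary and function names are illustrative and are not part of any psiCUBE tooling.

```python
# Interpretation of the BioType labels, following the answer above.
BIOTYPE_MEANING = {
    "processed_pseudogene": "processed (formed by retrotransposition)",
    "unprocessed_pseudogene": "duplicated (formed by gene duplication)",
    "pseudogene": "undetermined (degraded or fragmentary; no confident call)",
}


def describe_biotype(label):
    """Return the meaning of a BioType label, or flag it as unrecognized."""
    return BIOTYPE_MEANING.get(label.strip(), "unrecognized label")


for label in ("processed_pseudogene", "unprocessed_pseudogene", "pseudogene"):
    print(label, "->", describe_biotype(label))
```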