Unitary pseudogene promoter

I am trying to find an example of a unitary pseudogene whose
promoter is known to be mutated as well and therefore the gene is
definitely non-functional. I can find articles stating there are many
examples of unitary pseudogenes in humans (e.g. Vitamin C) but none
seem to mention the promoter. Any thoughts?

Our analyses compiled a number of activity features associated with pseudogenes (e.g. transcription, presence of functional Pol2 and TF binding sites in the upstream region, presence of open chromatin), and these are available online. Please see https://www.ncbi.nlm.nih.gov/pubmed/22951037 (http://pseudogene.org/psidr/ ) and https://www.ncbi.nlm.nih.gov/pubmed/25157146 (http://pseudogene.org/psicube/) for the functional characterisation of pseudogenes. In particular, the unitary pseudogenes that show no transcription and no Pol2 or TF binding sites are the ones to look at; for those, check whether or not the promoter region is conserved.

Zebrafish pseudogenes

I have a question regarding zebrafish pseudogenes. I searched for a few
zebrafish genes on pseudogene.org to check whether they have any
pseudogenes, and I found that there are 15779 zebrafish pseudogenes listed.
But the Nature reference that you mention in your blog reports a total of
154 zebrafish pseudogenes! Could you please let me know how I can see
those 154 pseudogenes, so that I can check whether my genes of interest
have pseudogenes or not?

Pseudogene.org provides a set of pseudogenes resulting from automatic annotation. Zebrafish is a peculiar genome: it underwent numerous large-scale genome duplications and is therefore full of repeats. As a result, the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation, which resulted in a much smaller number of pseudogenes. Continuous improvements in the genome annotation lead to further improvements in the pseudogene annotation. I attach here the latest set of zebrafish pseudogenes.

problem in PseudoPipe

I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as follows:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a

And I got the output in the attachment and the following lines on the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
No such file or directory
No such file or directory
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

It looks like you have a permission problem. The script tries to source the file setenvPipelineVars, which you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export statements, and from the look of it I would guess that you do not have the rights to export to the PATH. So I suggest you get admin rights and run as root.
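It may also help to first confirm that the sourced file is present and readable in both strand directories. The following is a minimal sketch in Python, assuming the directory layout described above (pgenes/minus and pgenes/plus under the output directory); the function name check_env_files is hypothetical and not part of PseudoPipe.

```python
import os

def check_env_files(output_dir, name="setenvPipelineVars"):
    """Report, per strand directory, whether the sourced file exists and is readable."""
    report = {}
    for strand in ("minus", "plus"):
        path = os.path.join(output_dir, "pgenes", strand, name)
        report[strand] = {
            "path": path,
            "exists": os.path.isfile(path),
            # os.access checks the permissions of the current user
            "readable": os.access(path, os.R_OK),
        }
    return report
```

Calling check_env_files with the path of your output directory returns one entry per strand; an entry with "exists" or "readable" set to False points at the directory to inspect.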

pseudogene.org error message

I would just like to bring to your attention the error message below, which I recently received when attempting to access data on pseudogene.org (see image).

We were not able to reproduce your error. To understand what happened and find a solution, it would be of considerable help if you could let us know the exact commands you ran that resulted in this error.

Question re rice pseudogene

I am using your rice pseudogene dataset to do some analysis. However, I found that you did not mention which rice genome version you used for the data analysis, so I cannot anchor the pseudogenes to the genome I am using. Would you please provide this information?

As I understand it, you are using the pseudogenes described in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2708354/ . The Data source section of the paper states that the annotations were done on the rice genome version 5 from TIGR. You can find all the information regarding the rice genome version 5 at ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa .

Terms in Pseudopipe output, etc

I am looking for details on PseudoPipe’s output terms such as "frac", "ins", "del", "shift", "stop", "polya". Also, how does PseudoPipe confirm the pseudogenes in its results?

frac = fraction of parent gene that matches the pseudogene
ins = number of insertions in the pseudogene compared to parent sequence
del = number of deletions in the pseudogene compared to parent sequence
shift = number of frame shifts in the pseudogene compared to parent sequence
stop = number of stop codons in the pseudogene compared to parent sequence
polya = flag indicating the presence or absence of a polyA tail
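
As an illustration of how these fields can be used, here is a minimal sketch in Python that reads a result table and keeps the entries carrying at least one disablement (frame shift or stop codon). It assumes a tab-separated file whose header row contains the column names listed above; the exact layout of your output file may differ, and the function names read_pgenes and with_disablements are hypothetical.

```python
import csv
import io

INT_FIELDS = ("ins", "del", "shift", "stop", "polya")

def read_pgenes(handle):
    """Parse a tab-separated table with a header row into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in INT_FIELDS:
            row[key] = int(row[key])
        row["frac"] = float(row["frac"])
        rows.append(row)
    return rows

def with_disablements(rows):
    """Keep entries that have at least one frame shift or stop codon."""
    return [r for r in rows if r["shift"] > 0 or r["stop"] > 0]

# Made-up example data, for illustration only.
example = io.StringIO(
    "frac\tins\tdel\tshift\tstop\tpolya\n"
    "0.95\t1\t0\t2\t1\t1\n"
    "0.60\t0\t3\t0\t0\t0\n"
)
rows = read_pgenes(example)
print(len(with_disablements(rows)))
```

For a real file, pass an open file handle to read_pgenes instead of the in-memory example.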

Also see below the code of the script fetchEnsemblFiles.py, which downloads the input data for eukaryotes from the Ensembl website:

#!/usr/bin/env python

# some examples of files and locations
# pub
# lrwxrwxrwx 1 ftpuser ftpusers 30 Dec 7 16:33 current_homo_sapiens -> release-36/homo_sapiens_36_35i
# -rw-rw-r-- 1 ftpuser ftpusers 67675771 Nov 15 14:48 Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa.gz
# -rw-rw-r-- 1 ftpuser ftpusers 40802343 Nov 15 14:55 Homo_sapiens.NCBI35.dec.dna_rm.chromosome.1.fa.gz
# -rw-rw-r-- 1 ftpuser ftpusers 3817861 Nov 15 19:46 Homo_sapiens.NCBI35.dec.pep.known.fa.gz
# -rw-rw-r-- 1 ftpuser ftpusers 2957452 Dec 2 22:45 exon.txt.table.gz
# -rw-rw-r-- 1 ftpuser ftpusers 1747738 Dec 2 22:45 exon_stable_id.txt.table.gz
# -rw-rw-r-- 1 ftpuser ftpusers 1489045 Dec 2 22:45 exon_transcript.txt.table.gz
# -rw-rw-r-- 1 ftpuser ftpusers 4626 Dec 2 21:57 homo_sapiens_core_36_35i.mysql40_compatible.sql.gz
# -rw-rw-r-- 1 ftpuser ftpusers 4753 Dec 2 21:57 homo_sapiens_core_36_35i.sql.gz

import os, os.path, re, sys
from ftplib import FTP

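class collect:
    # accumulates the lines of an FTP directory listing
    def __init__(self): self.data = []
    def more(self, l): self.data.append(l)

def maybeRetrFile(fromPath, toPath):
    # download fromPath to toPath unless a (possibly uncompressed) copy exists
    what = 'from %s --> to %s' % (fromPath, toPath)
    if os.path.exists(toPath):
        print('skipping ' + what)
        return
    if toPath.endswith('.gz') and os.path.exists(toPath[:-3]):
        print('skipping (uncompressed) ' + what)
        return

    print(what)
    # binary mode, since retrbinary delivers raw bytes
    toFile = open(toPath, 'wb')
    ec.retrbinary('RETR ' + fromPath, toFile.write, blocksize=100000)
    toFile.close()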

target = sys.argv[1].strip().lower().replace(' ', '_')

release = 'current_'

if len(sys.argv) > 2:
    release = 'release-' + sys.argv[2] + '/'

# NOTE: the definitions of host and where are missing from this listing.
# The two values below are assumptions based on the Ensembl FTP layout.
host = 'ftp.ensembl.org'
where = 'pub/' + release + 'mysql'

# set up initial connection
print('Logging into ' + host)
ec = FTP(host)
ec.login()  # anonymous login

# look for target in a listing of pub
files = collect()
print('Listing ' + where)
ec.dir(where, files.more)
tEntries = [l for l in files.data if target + '_core_' in l and '->' not in l]
if len(tEntries) != 1:
    print(target + ' is either missing or not unique:')
    print(tEntries)
    print('\n'.join(files.data))
    sys.exit(1)

# "parse" current link name
curPat = re.compile(target + r'_core_(.+)_(.+)\Z')
tPath = tEntries[0].split()[-1]
mo = curPat.match(tPath)
if not mo:
    print("don't understand release naming scheme: " + tPath)
    sys.exit(1)
[maj, minor] = mo.groups()
# e.g. '36' and '35i' give '36_35i', as in homo_sapiens_core_36_35i
majMin = maj + '_' + minor
outDir = target + '_' + majMin

print('Release: ' + release[:-1] + ', tPath: ' + tPath + ', target: ' + target + ', maj: ' + maj + ', majMin: ' + majMin + ', outDir: ' + outDir)

## if os.path.exists(outDir):
##     print('up to date: ' + tPath)
##     ec.close()
##     sys.exit(0)

# need to get files. first, set up directories.
[dDir, mDir, pDir] = [outDir + d for d in ['/dna/', '/mysql/', '/pep/']]
if not os.path.exists(dDir): os.makedirs(dDir, 0o744)
if not os.path.exists(mDir): os.makedirs(mDir, 0o744)
if not os.path.exists(pDir): os.makedirs(pDir, 0o744)

# retrieve dna
dnaPat = re.compile(r'\.dna(_rm)?\.chromosome\..+\.fa\.gz\Z')
dFiles = collect()
where = 'pub/' + release + 'fasta/%s/dna' % target
print('Changing dir to ' + where)
ec.dir(where, dFiles.more)
dKeep = [l for l in dFiles.data if dnaPat.search(l)]
for f in dKeep:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, dDir + fn)

# retrieve pep
where = 'pub/' + release + 'fasta/%s/pep' % target
pFiles = collect()
print('Changing dir to ' + where)
ec.dir(where, pFiles.more)
for f in pFiles.data:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, pDir + fn)

# retrieve mysql
# older releases?: mFiles = ['exon.txt.table', 'exon_transcript.txt.table', 'gene_stable_id.txt.table', 'seq_region.txt.table', 'transcript.txt.table', 'translation.txt.table', 'translation_stable_id.txt.table', target+'_core_'+majMin+'.sql', target+'_core_'+majMin+'.mysql40_compatible.sql']
# older releases which have *_stable_id.txt: mFiles = ['exon.txt', 'exon_transcript.txt', 'gene_stable_id.txt', 'seq_region.txt', 'transcript.txt', 'translation.txt', 'translation_stable_id.txt', target+'_core_'+majMin+'.sql']
mFiles = ['exon.txt', 'exon_transcript.txt', 'seq_region.txt', 'transcript.txt', 'translation.txt', target + '_core_' + majMin + '.sql']

where = 'pub/' + release + 'mysql/%s_core_%s' % (target, majMin)
print('Changing dir to ' + where)
for mf in mFiles:
    maybeRetrFile(where + '/' + mf + '.gz', mDir + mf + '.gz')

# retrieve GTF (stored alongside the mysql files)
where = 'pub/' + release + 'gtf/%s' % (target)
print('Changing dir to ' + where)
gtfPat = re.compile(r'\.gtf\.gz\Z')
gFiles = collect()
ec.dir(where, gFiles.more)
gKeep = [l for l in gFiles.data if gtfPat.search(l)]
for f in gKeep:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, mDir + fn)


print('Processing Fetched Files')
#os.system('%s/processEnsemblFiles.sh %s' % (sys.path[0], outDir))

Regarding PseudoPipe MySQL file

I am using PseudoPipe to find pseudogenes from a query Chromosome. I have a chromosome nucleotide sequence file and a protein sequences file.

I do not understand what the MySQL file is and how I can get it, and the same goes for one more file, the masking file.

PseudoPipe is configured to run on nucleotide and protein sequence files as formatted and available for download from the Ensembl server.
Regarding your issues:

1. A MySQL file is a file downloaded from a MySQL database, and thus has its own specific format. Ensembl uses this database to store the exon coordinates for all the protein-coding genes, with an exon id, chromosome number, start and end position, strand, etc. I therefore suggest you format your exon information accordingly. As an example you can use the "chrI_exLocs" file located in the mysql folder of the C. elegans example that you downloaded along with PseudoPipe.

2. A masking file is a nucleotide file (in FASTA format) in which all the repeat sequences of the genome are masked. If you want to create it yourself, you should use a repeat masker and format the output like the dna_rm.fa file that you see in the dna folder of the C. elegans example.
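
If your repeat masker produces a soft-masked FASTA file (repeats in lower case), it can be converted to a hard-masked one (repeats replaced by N) with a few lines of Python. This is only a sketch: it assumes soft-masked input, and you should compare the result against the dna_rm.fa example to confirm the format matches. The function name hard_mask is hypothetical.

```python
def hard_mask(lines):
    """Yield FASTA lines with lower-case (soft-masked) bases replaced by N."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header lines are kept unchanged
        else:
            yield "".join("N" if base.islower() else base for base in line)

# Made-up example sequence, for illustration only.
soft = [">chrI", "ACGTacgtACGT", "ttttGGGG"]
for out in hard_mask(soft):
    print(out)
```

For a real genome, iterate over an open file handle instead of the in-memory list and write each yielded line to the output file.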

Question about a potential error with Pseudogene.org

I want to say great job with the Pseudogene.org site! I recently noticed a potential error and wanted to send an email to inform you, in case you haven’t already picked it up yourselves.

In the file located at the following address:


The start and end chromosomal locations for the pseudogenes are the same. See below:



chr19 +
ENSG00000237521.1 ENST00000456448.1 OR7E24
"Transcribed: 0" "Active Chromatin: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Constraint: 0"

chr2 -
98123508 . .
. "Transcribed: 0"
"Active Chromatin: GM12878=1;K562=0;Helas3=0;Hepg2=0;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"
"Pol2: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"
"Constraint: 0"

processed_pseudogene chr3 -
136527393 ENSG00000198075.5 ENST00000272452.2
SULT1C4 "Transcribed: 0"
"Active Chromatin: GM12878=1;K562=0;Helas3=1;Hepg2=1;H1hesc=1"
"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."
"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"
"Constraint: 1"

Thanks for pointing out the problem to us. However, I’m a little confused about which file you are referring to. The parents file whose URL is in your message (http://www.pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt) does not match the contents you provided. The contents look more like they come from the file http://pseudogene.org/psidr/psiDR.v0.txt. But neither file has the chromosome coordinates issue you mentioned. Maybe you meant some other file?

It appears you are correct: I provided the link for the GENCODE v10 pseudogene resource instead of the v7 resource by mistake. I was, however, able to go back and find the file where I had found the mistake.

I had downloaded the Pseudogene Resource psiDR from the GENCODE website ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) and assumed that this file was the same as the one at the link you provide ( http://pseudogene.org/psidr/psiDR.v0.txt ). It appears, however, that they are not the same: the file at the GENCODE link ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) displays the problem that I previously described, whereas the file at the link you provide does not.

The file with the problem I described is actually linked at this page: http://www.gencodegenes.org/psidr/
Under the link entitled:
New! Pseudogene Resource psiDR
which redirects to: ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz

I am not sure whether you are part of the administration of the GENCODE site or not, but if you aren’t, you may want to contact them regarding the problem, since it appears to be data from your lab that is represented.

I am sorry for providing the wrong link earlier. Please let me know if you have any more trouble reproducing the problem.

I can see the problem too. I’ll contact GENCODE to have the file updated. Thanks for pointing this issue out to us!