Pseudogenes – PseudoPipe

I am trying to find SNPs in
pseudogenes but the database for the SNP’s is built for different genome
assemblies than pseudogenes predictions from PseudoPipe. Do you have the
current pipeline pseudogenes predictions on eukaryotic genomes? Or is there
a way to remap the genome assemblies used by Pipeline to a different
If I want to use PsedoPipe, where in Ensembl can I find the input data set?

Regarding your questions there are a number of things that you can do:
* if you are interested in the human/mouse genome, these are available for the latest assembly GRC38 from the webpage , see and respectively.
* the latest annotations for the worm and fly genomes, these are available from here :
* if you are interested in other eukaryotic genomes that have annotation build on older assemblies, one option is to do a lift over of the annotation from an old assembly to a newer one. This can easily be done using the UCSC genome browser resource, however I would very much advise to actually run pseudo pipe on your machine given the fact that improvement in assembly and protein coding annotation will considerably improve the output of the pseudogene annotation. You can download and run pseudo pipe as described here:
* also using the “fetch file” as described here will automatically download all the necessary data for you from the ensembl server.

Unitary Pseudogene PROMOTER

I am trying to find an example of a unitary pseudogene whose
promoter is known to be mutated as well and therefore the gene is
definitely non-functional. I can find articles stating there are many
examples of unitary pseudogenes in humans (e.g. Vitamin C) but none
seem to mention the promoter. Any thoughts?

Our analyses compiled a number of activity features associated with pseudogenes (e.g. transcription, presence of functional Pol2 and TF binding sites in the upstream region, presence of open chromatin) that are available in online. Please see ( ) and ( for the functional characterisation of pseudogenes. In particular the unitary pseudogenes that do not have transcription, Pol2 and TF binding sites should be the ones to look at and to check the conservation or not of the promoter region.

Zebrafish pseudogenes

I have a question regarding Zebrafish pseudogenes. I searched few
zebrafish genes to check if they have any pseudogenes existing in the, I found that there are 15779 zebrafish pseudogenes.
But when I read the nature reference that you mentioned in your blog
has total 154 zebrafish pseudogenes! Could you please let me know how
can one see those 154 pseudogenes, if I want to know whether my genes
of interest having pseudogenes or not?

A: provides a set of pseudogenes resulted from automatic annotation. Zebrafish is a peculiar genome. It was subjected to numerous large scale genome duplication and thus is full of repeats. As such the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation that resulted in a subsequent much smaller number of pseudogenes. The continuous improvements in the genome annotation result in further improvements in pseudogenes annotation. I attach here the latest set of zebrafish pseudogenes.

problem in PseudoPipe

I tried to run the software PseudoPipe
( using the
example as following:
./ ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a

And I got the output in the attachment and attfollowing lines in the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
No such file or directory
No such file or directory
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

Looks like you have permission problems. The script tries to source the file setenvPipelineVars that you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export functions and from the look of it I would guess that you do not have rights to export to the Path. So I suggest you get admin rights and run as root. error message

I would just like to bring the below error message to your attention that I recently received when attempting to access data on (see image)

We were not able to reproduce your error. In order to understand what happened and find a solution, it would be of a considerable help if you could let us know the exact commands you made that resulted in this error.

Question re rice pseudogene

I am using your pseudogene dataset of rice to do some analysis. However, I found that you did not mention which Rice genome version you used for data analysis, so I cannot anchor the pseudogenes to the genome I used. Would you please give the information.

As I understand you are using the pseudogenes described in this paper: . The Data source section of the paper highlights the fact that the annotations were done on the Rice genome version 5 from TIGR. You can find all the information regarding the rice genome version 5 at .

Terms in Pseudopipe output, etc

I am looking the details of Pseudopipe’s output terms such as "frac", "ins", "del", "shift", "stop", "polya". Also how pseudopipe makes confirm the pseudogenes in its results.

frac = fraction of parent gene that matches the pseudogene
ins = number of insertions in the pseudogene compared to parent sequence
del = number of deletions in the pseudogene compared to parent sequence
shift = number of frame shifts in the pseudogene compared to parent sequence
stop = number of stop codons in the pseudogene compared to parent sequence
polya = flag indicating the presence or absence of a polyA tail

Also see below the code associated with the script for downloading the input data for eukaryotes from ensembl website:

#!/usr/bin/env python

# some examples of files and locations
# pub
# lrwxrwxrwx 1 ftpuser ftpusers 30 Dec 7 16:33 current_homo_sapiens -> release-36/homo_sapiens_36_35i
#-rw-rw-r– 1 ftpuser ftpusers 67675771 Nov 15 14:48 Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa.gz
#-rw-rw-r– 1 ftpuser ftpusers 40802343 Nov 15 14:55 Homo_sapiens.NCBI35.dec.dna_rm.chromosome.1.fa.gz
#-rw-rw-r– 1 ftpuser ftpusers 3817861 Nov 15 19:46 Homo_sapiens.NCBI35.dec.pep.known.fa.gz
#-rw-rw-r– 1 ftpuser ftpusers 2957452 Dec 2 22:45 exon.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1747738 Dec 2 22:45 exon_stable_id.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1489045 Dec 2 22:45 exon_transcript.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 4626 Dec 2 21:57 homo_sapiens_core_36_35i.mysql40_compatible.sql.gz
#-rw-rw-r– 1 ftpuser ftpusers 4753 Dec 2 21:57 homo_sapiens_core_36_35i.sql.gz

import os, os.path, re, sys
from ftplib import FTP

class collect:
def __init__(self): = []
def more(self, l):

def maybeRetrFile(fromPath, toPath):
what = ‘from %s –> to %s’ %(fromPath, toPath)
if os.path.exists(toPath):
print ‘skipping ‘+what
if toPath.endswith(‘.gz’) and os.path.exists(toPath[:-3]):
print ‘skipping (uncompressed) ‘+what

print what
toFile = open(toPath, ‘w’)
ec.retrbinary(‘RETR ‘+fromPath, toFile.write, blocksize=100000)

target = sys.argv[1].strip().lower().replace(‘ ‘, ‘_’)

release = ‘current_’

if len(sys.argv) > 2:
release = ‘release-‘ + sys.argv[2] + ‘/’

# set up initial connection
print ‘Logging into ‘+host
ec = FTP(host)

# look for target in a listing of pub
files = collect()
print ‘Listing ‘+where
ec.dir(where, files.more)
tEntries = [l for l in if target+”_core_” in l and ‘->’ not in l ]
if len(tEntries) != 1:
print target + ‘ is either missing or not unique:’
print tEntries
print ‘\n’.join(

# “parse” current link name
curPat = re.compile(r”+target+’_core_(.+)_(.+)\Z’)
tPath = tEntries[0].split()[-1]
mo = curPat.match(tPath)
if not mo:
print ‘dont\’t understand release naming scheme: ‘+ tPath
[maj, min] = mo.groups()
outDir = target + ‘_’ + majMin

print ‘Release: ‘+release[0:len(release)-1]+’, ‘+’tPath: ‘+tPath+’, ‘+’target: ‘+target+’, ‘+’maj: ‘+maj+’, ‘+’majMin: ‘+majMin+’, ‘+’outDir: ‘+outDir

## if os.path.exists(outDir):
## print ‘up to date: ‘ + tPath
## ec.close()
## sys.exit(0)

# need to get files. first, set up directories.
[dDir, mDir, pDir] = [outDir+d for d in [‘/dna/’, ‘/mysql/’, ‘/pep/’]]
if not os.path.exists(dDir): os.makedirs(dDir, 0744)
if not os.path.exists(mDir): os.makedirs(mDir, 0744)
if not os.path.exists(pDir): os.makedirs(pDir, 0744)

# retrieve dna
dnaPat = re.compile(r’\.dna(_rm)?\.chromosome\..+\.fa\.gz\Z’)
dFiles = collect()
where = ‘pub/’+release+’fasta/%s/dna’ % target
print ‘Changing dir to ‘+where
ec.dir(where, dFiles.more)
dKeep = [l for l in if]
for f in dKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, dDir+fn)

# retrieve pep
where = ‘pub/’+release+’fasta/%s/pep’ % target
pFiles = collect()
print ‘Changing dir to ‘+where
ec.dir(where, pFiles.more)
for f in
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, pDir+fn)

# retrieve mysql
# older releases?: mFiles = [‘exon.txt.table’, ‘exon_transcript.txt.table’, ‘gene_stable_id.txt.table’, ‘seq_region.txt.table’, ‘transcript.txt.table’, ‘translation.txt.table’, ‘translation_stable_id.txt.table’, target+’_core_’+majMin+’.sql’, target+’_core_’+majMin+’.mysql40_compatible.sql’]
#older releases which have *_stable_id.txt: mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘gene_stable_id.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, ‘translation_stable_id.txt’, target+’_core_’+majMin+’.sql’]
mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, target+’_core_’+majMin+’.sql’]

where = ‘pub/’+release+’mysql/%s_core_%s’ % (target, majMin)
print ‘Changing dir to ‘+where
for mf in mFiles:
maybeRetrFile(where+’/’+mf+’.gz’, mDir+mf+’.gz’)

# retrieve GTF
where = ‘pub/’+release+’gtf/%s’ % (target)
print ‘Changing dir to ‘+where
gtfPat = re.compile(r’\.gtf\.gz\Z’)
gFiles = collect()
ec.dir(where, gFiles.more)
gKeep = [l for l in if]
for f in gKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, mDir+fn)


print ‘Processing Fetched Files’
#os.system(‘%s/ %s’ % (sys.path[0], outDir))

Regarding PseudoPipe MySQL file

I am using PseudoPipe to find pseudogenes from a query Chromosome. I have a chromosome nucleotide sequence file and a protein sequences file.

I am not getting what is MySQL file and how can get this and one more file of masking.

PseudoPipe is configured to run on nucleotide and protein sequence files as formatted and available for download from the ensembl server.
Regarding your issues:

1. A MySQL file is a file dowloaded from a MySQL database , and thus has it’s specific format. Ensemble uses this database to store exons co-ordinates for all the protein coding genes starting with an exon id, chromosome number, start and end position, strand, etc . As such I suggest you format your exons information accordingly . As example you can use the” chrI_exLocs” file located in the mysql folder from the C.elegans example that you downloaded along with pseudopipe.

2. A masking file is a nucleotide files (in fasta format) that masks all the repeat sequences from the genome. If you want to create it yourself you should use a repeat masker and format it accordingly to the file that you see in the dna folder in the C.elegans example dna_rm.fa .