Zebrafish pseudogenes

Q:
I have a question regarding Zebrafish pseudogenes. I searched few
zebrafish genes to check if they have any pseudogenes existing in the
pseudogene.org, I found that there are 15779 zebrafish pseudogenes.
But when I read the nature reference that you mentioned in your blog
has total 154 zebrafish pseudogenes! Could you please let me know how
can one see those 154 pseudogenes, if I want to know whether my genes
of interest having pseudogenes or not?

A:
Pseudogene.org provides a set of pseudogenes resulted from automatic annotation. Zebrafish is a peculiar genome. It was subjected to numerous large scale genome duplication and thus is full of repeats. As such the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation that resulted in a subsequent much smaller number of pseudogenes. The continuous improvements in the genome annotation result in further improvements in pseudogenes annotation. I attach here the latest set of zebrafish pseudogenes.

problem in PseudoPipe

Q:
I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as following:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs
0

And I got the output in the attachment and attfollowing lines in the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’:
No such file or directory
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’:
No such file or directory
gzip:
/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa:
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

A:
Looks like you have permission problems. The script tries to source the file setenvPipelineVars that you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export functions and from the look of it I would guess that you do not have rights to export to the Path. So I suggest you get admin rights and run as root.

pseudogene.org error message

Q:
I would just like to bring the below error message to your attention that I recently received when attempting to access data on pseudogenes.org. (see image)

A:
We were not able to reproduce your error. In order to understand what happened and find a solution, it would be of a considerable help if you could let us know the exact commands you made that resulted in this error.

Question re rice pseudogene

Q:
I am using your pseudogene dataset of rice to do some analysis. However, I found that you did not mention which Rice genome version you used for data analysis, so I cannot anchor the pseudogenes to the genome I used. Would you please give the information.

A:
As I understand you are using the pseudogenes described in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2708354/ . The Data source section of the paper highlights the fact that the annotations were done on the Rice genome version 5 from TIGR. You can find all the information regarding the rice genome version 5 at ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa .

Terms in Pseudopipe output, etc

Q:
I am looking the details of Pseudopipe’s output terms such as "frac", "ins", "del", "shift", "stop", "polya". Also how pseudopipe makes confirm the pseudogenes in its results.

A:
frac = fraction of parent gene that matches the pseudogene
ins = number of insertions in the pseudogene compared to parent sequence
del = number of deletions in the pseudogene compared to parent sequence
shift = number of frame shifts in the pseudogene compared to parent sequence
stop = number of stop codons in the pseudogene compared to parent sequence
polya = flag indicating the presence or absence of a polyA tail

Also see below the code associated with the script fetchEnsemblFiles.py for downloading the input data for eukaryotes from ensembl website:


#!/usr/bin/env python

# some examples of files and locations
# pub
# lrwxrwxrwx 1 ftpuser ftpusers 30 Dec 7 16:33 current_homo_sapiens -> release-36/homo_sapiens_36_35i
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/dna
#-rw-rw-r– 1 ftpuser ftpusers 67675771 Nov 15 14:48 Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa.gz
#-rw-rw-r– 1 ftpuser ftpusers 40802343 Nov 15 14:55 Homo_sapiens.NCBI35.dec.dna_rm.chromosome.1.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/pep
#-rw-rw-r– 1 ftpuser ftpusers 3817861 Nov 15 19:46 Homo_sapiens.NCBI35.dec.pep.known.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/mysql/homo_sapiens_core_36_35i
#-rw-rw-r– 1 ftpuser ftpusers 2957452 Dec 2 22:45 exon.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1747738 Dec 2 22:45 exon_stable_id.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1489045 Dec 2 22:45 exon_transcript.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 4626 Dec 2 21:57 homo_sapiens_core_36_35i.mysql40_compatible.sql.gz
#-rw-rw-r– 1 ftpuser ftpusers 4753 Dec 2 21:57 homo_sapiens_core_36_35i.sql.gz

import os, os.path, re, sys
from ftplib import FTP

class collect:
def __init__(self): self.data = []
def more(self, l): self.data.append(l)

def maybeRetrFile(fromPath, toPath):
what = ‘from %s –> to %s’ %(fromPath, toPath)
if os.path.exists(toPath):
print ‘skipping ‘+what
return
else:
if toPath.endswith(‘.gz’) and os.path.exists(toPath[:-3]):
print ‘skipping (uncompressed) ‘+what
return

print what
toFile = open(toPath, ‘w’)
ec.retrbinary(‘RETR ‘+fromPath, toFile.write, blocksize=100000)
toFile.close()

target = sys.argv[1].strip().lower().replace(‘ ‘, ‘_’)

release = ‘current_’

if len(sys.argv) > 2:
release = ‘release-‘ + sys.argv[2] + ‘/’

# set up initial connection
host=’ftp.ensemblgenomes.org’
print ‘Logging into ‘+host
ec = FTP(host)
ec.login()

# look for target in a listing of pub
files = collect()
where=’pub/’+release+’mysql’
print ‘Listing ‘+where
ec.dir(where, files.more)
tEntries = [l for l in files.data if target+”_core_” in l and ‘->’ not in l ]
if len(tEntries) != 1:
print target + ‘ is either missing or not unique:’
print tEntries
print ‘\n’.join(files.data)
ec.close()
sys.exit(-1)

# “parse” current link name
curPat = re.compile(r”+target+’_core_(.+)_(.+)\Z’)
tPath = tEntries[0].split()[-1]
mo = curPat.match(tPath)
if not mo:
print ‘dont\’t understand release naming scheme: ‘+ tPath
ec.close()
sys.exit(-1)
[maj, min] = mo.groups()
majMin=maj+’_’+min
outDir = target + ‘_’ + majMin

print ‘Release: ‘+release[0:len(release)-1]+’, ‘+’tPath: ‘+tPath+’, ‘+’target: ‘+target+’, ‘+’maj: ‘+maj+’, ‘+’majMin: ‘+majMin+’, ‘+’outDir: ‘+outDir

## if os.path.exists(outDir):
## print ‘up to date: ‘ + tPath
## ec.close()
## sys.exit(0)

# need to get files. first, set up directories.
[dDir, mDir, pDir] = [outDir+d for d in [‘/dna/’, ‘/mysql/’, ‘/pep/’]]
if not os.path.exists(dDir): os.makedirs(dDir, 0744)
if not os.path.exists(mDir): os.makedirs(mDir, 0744)
if not os.path.exists(pDir): os.makedirs(pDir, 0744)

# retrieve dna
dnaPat = re.compile(r’\.dna(_rm)?\.chromosome\..+\.fa\.gz\Z’)
dFiles = collect()
where = ‘pub/’+release+’fasta/%s/dna’ % target
print ‘Changing dir to ‘+where
ec.dir(where, dFiles.more)
dKeep = [l for l in dFiles.data if dnaPat.search(l)]
for f in dKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, dDir+fn)

# retrieve pep
where = ‘pub/’+release+’fasta/%s/pep’ % target
pFiles = collect()
print ‘Changing dir to ‘+where
ec.dir(where, pFiles.more)
for f in pFiles.data:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, pDir+fn)

# retrieve mysql
# older releases?: mFiles = [‘exon.txt.table’, ‘exon_transcript.txt.table’, ‘gene_stable_id.txt.table’, ‘seq_region.txt.table’, ‘transcript.txt.table’, ‘translation.txt.table’, ‘translation_stable_id.txt.table’, target+’_core_’+majMin+’.sql’, target+’_core_’+majMin+’.mysql40_compatible.sql’]
#older releases which have *_stable_id.txt: mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘gene_stable_id.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, ‘translation_stable_id.txt’, target+’_core_’+majMin+’.sql’]
mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, target+’_core_’+majMin+’.sql’]

where = ‘pub/’+release+’mysql/%s_core_%s’ % (target, majMin)
print ‘Changing dir to ‘+where
for mf in mFiles:
maybeRetrFile(where+’/’+mf+’.gz’, mDir+mf+’.gz’)

# retrieve GTF
where = ‘pub/’+release+’gtf/%s’ % (target)
print ‘Changing dir to ‘+where
gtfPat = re.compile(r’\.gtf\.gz\Z’)
gFiles = collect()
ec.dir(where, gFiles.more)
gKeep = [l for l in gFiles.data if gtfPat.search(l)]
for f in gKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, mDir+fn)

ec.close()

print ‘Processing Fetched Files’
#os.system(‘%s/processEnsemblFiles.sh %s’ % (sys.path[0], outDir))