pseudogene.org error message

Q:
I would just like to bring the below error message to your attention that I recently received when attempting to access data on pseudogenes.org. (see image)

A:
We were not able to reproduce your error. In order to understand what happened and find a solution, it would be of a considerable help if you could let us know the exact commands you made that resulted in this error.

Question re rice pseudogene

Q:
I am using your pseudogene dataset of rice to do some analysis. However, I found that you did not mention which Rice genome version you used for data analysis, so I cannot anchor the pseudogenes to the genome I used. Would you please give the information.

A:
As I understand you are using the pseudogenes described in this paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2708354/ . The Data source section of the paper highlights the fact that the annotations were done on the Rice genome version 5 from TIGR. You can find all the information regarding the rice genome version 5 at ftp://ftp.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa .

Terms in Pseudopipe output, etc

Q:
I am looking the details of Pseudopipe’s output terms such as "frac", "ins", "del", "shift", "stop", "polya". Also how pseudopipe makes confirm the pseudogenes in its results.

A:
frac = fraction of parent gene that matches the pseudogene
ins = number of insertions in the pseudogene compared to parent sequence
del = number of deletions in the pseudogene compared to parent sequence
shift = number of frame shifts in the pseudogene compared to parent sequence
stop = number of stop codons in the pseudogene compared to parent sequence
polya = flag indicating the presence or absence of a polyA tail

Also see below the code associated with the script fetchEnsemblFiles.py for downloading the input data for eukaryotes from ensembl website:


#!/usr/bin/env python

# some examples of files and locations
# pub
# lrwxrwxrwx 1 ftpuser ftpusers 30 Dec 7 16:33 current_homo_sapiens -> release-36/homo_sapiens_36_35i
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/dna
#-rw-rw-r– 1 ftpuser ftpusers 67675771 Nov 15 14:48 Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa.gz
#-rw-rw-r– 1 ftpuser ftpusers 40802343 Nov 15 14:55 Homo_sapiens.NCBI35.dec.dna_rm.chromosome.1.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/pep
#-rw-rw-r– 1 ftpuser ftpusers 3817861 Nov 15 19:46 Homo_sapiens.NCBI35.dec.pep.known.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/mysql/homo_sapiens_core_36_35i
#-rw-rw-r– 1 ftpuser ftpusers 2957452 Dec 2 22:45 exon.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1747738 Dec 2 22:45 exon_stable_id.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 1489045 Dec 2 22:45 exon_transcript.txt.table.gz
#-rw-rw-r– 1 ftpuser ftpusers 4626 Dec 2 21:57 homo_sapiens_core_36_35i.mysql40_compatible.sql.gz
#-rw-rw-r– 1 ftpuser ftpusers 4753 Dec 2 21:57 homo_sapiens_core_36_35i.sql.gz

import os, os.path, re, sys
from ftplib import FTP

class collect:
def __init__(self): self.data = []
def more(self, l): self.data.append(l)

def maybeRetrFile(fromPath, toPath):
what = ‘from %s –> to %s’ %(fromPath, toPath)
if os.path.exists(toPath):
print ‘skipping ‘+what
return
else:
if toPath.endswith(‘.gz’) and os.path.exists(toPath[:-3]):
print ‘skipping (uncompressed) ‘+what
return

print what
toFile = open(toPath, ‘w’)
ec.retrbinary(‘RETR ‘+fromPath, toFile.write, blocksize=100000)
toFile.close()

target = sys.argv[1].strip().lower().replace(‘ ‘, ‘_’)

release = ‘current_’

if len(sys.argv) > 2:
release = ‘release-‘ + sys.argv[2] + ‘/’

# set up initial connection
host=’ftp.ensemblgenomes.org’
print ‘Logging into ‘+host
ec = FTP(host)
ec.login()

# look for target in a listing of pub
files = collect()
where=’pub/’+release+’mysql’
print ‘Listing ‘+where
ec.dir(where, files.more)
tEntries = [l for l in files.data if target+”_core_” in l and ‘->’ not in l ]
if len(tEntries) != 1:
print target + ‘ is either missing or not unique:’
print tEntries
print ‘\n’.join(files.data)
ec.close()
sys.exit(-1)

# “parse” current link name
curPat = re.compile(r”+target+’_core_(.+)_(.+)\Z’)
tPath = tEntries[0].split()[-1]
mo = curPat.match(tPath)
if not mo:
print ‘dont\’t understand release naming scheme: ‘+ tPath
ec.close()
sys.exit(-1)
[maj, min] = mo.groups()
majMin=maj+’_’+min
outDir = target + ‘_’ + majMin

print ‘Release: ‘+release[0:len(release)-1]+’, ‘+’tPath: ‘+tPath+’, ‘+’target: ‘+target+’, ‘+’maj: ‘+maj+’, ‘+’majMin: ‘+majMin+’, ‘+’outDir: ‘+outDir

## if os.path.exists(outDir):
## print ‘up to date: ‘ + tPath
## ec.close()
## sys.exit(0)

# need to get files. first, set up directories.
[dDir, mDir, pDir] = [outDir+d for d in [‘/dna/’, ‘/mysql/’, ‘/pep/’]]
if not os.path.exists(dDir): os.makedirs(dDir, 0744)
if not os.path.exists(mDir): os.makedirs(mDir, 0744)
if not os.path.exists(pDir): os.makedirs(pDir, 0744)

# retrieve dna
dnaPat = re.compile(r’\.dna(_rm)?\.chromosome\..+\.fa\.gz\Z’)
dFiles = collect()
where = ‘pub/’+release+’fasta/%s/dna’ % target
print ‘Changing dir to ‘+where
ec.dir(where, dFiles.more)
dKeep = [l for l in dFiles.data if dnaPat.search(l)]
for f in dKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, dDir+fn)

# retrieve pep
where = ‘pub/’+release+’fasta/%s/pep’ % target
pFiles = collect()
print ‘Changing dir to ‘+where
ec.dir(where, pFiles.more)
for f in pFiles.data:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, pDir+fn)

# retrieve mysql
# older releases?: mFiles = [‘exon.txt.table’, ‘exon_transcript.txt.table’, ‘gene_stable_id.txt.table’, ‘seq_region.txt.table’, ‘transcript.txt.table’, ‘translation.txt.table’, ‘translation_stable_id.txt.table’, target+’_core_’+majMin+’.sql’, target+’_core_’+majMin+’.mysql40_compatible.sql’]
#older releases which have *_stable_id.txt: mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘gene_stable_id.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, ‘translation_stable_id.txt’, target+’_core_’+majMin+’.sql’]
mFiles = [‘exon.txt’, ‘exon_transcript.txt’, ‘seq_region.txt’, ‘transcript.txt’, ‘translation.txt’, target+’_core_’+majMin+’.sql’]

where = ‘pub/’+release+’mysql/%s_core_%s’ % (target, majMin)
print ‘Changing dir to ‘+where
for mf in mFiles:
maybeRetrFile(where+’/’+mf+’.gz’, mDir+mf+’.gz’)

# retrieve GTF
where = ‘pub/’+release+’gtf/%s’ % (target)
print ‘Changing dir to ‘+where
gtfPat = re.compile(r’\.gtf\.gz\Z’)
gFiles = collect()
ec.dir(where, gFiles.more)
gKeep = [l for l in gFiles.data if gtfPat.search(l)]
for f in gKeep:
fn = f.split()[-1]
maybeRetrFile(where+’/’+fn, mDir+fn)

ec.close()

print ‘Processing Fetched Files’
#os.system(‘%s/processEnsemblFiles.sh %s’ % (sys.path[0], outDir))

Regarding PseudoPipe MySQL file

Q:
I am using PseudoPipe to find pseudogenes from a query Chromosome. I have a chromosome nucleotide sequence file and a protein sequences file.

I am not getting what is MySQL file and how can get this and one more file of masking.

A:
PseudoPipe is configured to run on nucleotide and protein sequence files as formatted and available for download from the ensembl server.
Regarding your issues:

1. A MySQL file is a file dowloaded from a MySQL database , and thus has it’s specific format. Ensemble uses this database to store exons co-ordinates for all the protein coding genes starting with an exon id, chromosome number, start and end position, strand, etc . As such I suggest you format your exons information accordingly . As example you can use the” chrI_exLocs” file located in the mysql folder from the C.elegans example that you downloaded along with pseudopipe.

2. A masking file is a nucleotide files (in fasta format) that masks all the repeat sequences from the genome. If you want to create it yourself you should use a repeat masker and format it accordingly to the file that you see in the dna folder in the C.elegans example dna_rm.fa .

Question about a potential error with Pseudogene.org

Q1:
I want to say great job with the Pseudogene.org site! I recently noticed a potential error and wanted to send a email to inform you if you haven’t already picked it up yourselves….

In the file located at the following address:

http://www.pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt

The start and end chromosomal locations for the pseudogenes are the same. See below:

ENST00000344844.3

unprocessed_pseudogene

chr19 +

9314984

9314984

ENSG00000237521.1 ENST00000456448.1 OR7E24

"Transcribed: 0" "Active Chromatin: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=1"

"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."

"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"

"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"

"Constraint: 0"

ENST00000359901.3

unprocessed_pseudogene

chr2 –

98123508

98123508 . .

. "Transcribed: 0"

"Active Chromatin: GM12878=1;K562=0;Helas3=0;Hepg2=0;H1hesc=1"

"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."

"TFBS: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"

"Pol2: GM12878=1;K562=1;Helas3=1;Hepg2=1;H1hesc=0"

"Constraint: 0"

ENST00000459808.1

processed_pseudogene chr3 –

136527393

136527393 ENSG00000198075.5 ENST00000272452.2

SULT1C4 "Transcribed: 0"

"Active Chromatin: GM12878=1;K562=0;Helas3=1;Hepg2=1;H1hesc=1"

"Open Chromatin: GM12878=0;K562=0;Helas3=.;Hepg2=.;H1hesc=."

"TFBS: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"

"Pol2: GM12878=0;K562=0;Helas3=0;Hepg2=0;H1hesc=0"

"Constraint: 1"

A1:
Thanks for pointing us the problem. However, I’m a little confused of what file you are referring to. The parents file with url in your message (http://www.pseudogene.org/psicube/data/gencode.v10.pgene.parents.txt) does not match the contents you provided. The contents look more like from the file: http://pseudogene.org/psidr/psiDR.v0.txt. But neither file has the chromosome coordinates issue you mentioned. Maybe you meant some other file?

Q2:
It appears you are correct, i provided the link for the GENCODEv10 pseudogene resource instead of the v7 resource by mistake. I was, however, able to go back and find the file where I had found the mistake.

I had downloaded the Pseudogene Resource psiDR from the GENCODE website ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) and assumed that this file is the same as the link you provide ( http://pseudogene.org/psidr/psiDR.v0.txt ). Although it appears they are not… The link on the GENCODE website ( ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz ) displays the problem that I previously described, whereas the link you provide does not.

The file with the problem I described is actually linked at this page: http://www.gencodegenes.org/psidr/
Under the link entitled:
New! Pseudogene Resource psiDR
which redirects to: ftp://ftp.sanger.ac.uk/pub/gencode/psidr/psiDR.v0.txt.gz

I am not sure if you part of the administration for the GENCODE site or not, but potentially if you aren’t, you would like to contact them regarding the problem since it appears to be data from your lab that is represented.

I am sorry for providing the wrong link earlier. Please let me know if you have anymore trouble reproducing the problem.

A2:
I can see the problem too. I’ll contact GENCODE to have the file updated. Thanks for pointing this issue to us!

Having some problems while executing PseudoPipe

Q:
I would like to use Pseudopipe. But I have been having some problems while executing PseudoPipe. To test the program I used the example input files (from Caenorhabditis Elegans) which were given. I modified the env.sh. I indicated the paths to python (python2.7), blast (blastall 2.2.25) and tfasty (tfasty36). But no pseudogenes were found when I executed the command which is shown below. I added a screenshot of the error as attachment. Could you give me some guidance to solve this problem?

The command that I am using is:
./pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.I.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chrI_exLocs 0

I downloaded PseudoPipe from:
http://www.pseudogene.org/pseudopipe/

A:
The reason you are not getting any output is because you need to use fasta-35.1.5 (tfasty35). The newer versions of the fasta (e.g. tfasty36) have a different output format that cannot be processed by the downstream programs in our pipeline. We are currently working on updating and improving the pipeline, but for the time being please do use tfasty35.