Request for the Java chromod package needed by the CoassociationAnalyzer.java and GSCCoassociationAnalyzer.java scripts that Kevin Yip wrote (April 14, 2011)

Q:
I’m writing to ask whether you could share your Java "chromod" package with me. I want to use the CoassociationAnalyzer.java and GSCCoassociationAnalyzer.java scripts that Kevin Yip wrote (April 14, 2011), but they rely on the chromod package (package org.gersteinlab.chromod).

If it’s not a top-secret lab package and you could share it with me, I would be hugely indebted!

A:
Please download it at http://www.cse.cuhk.edu.hk/~kevinyip/outbox/chromod.jar . Let me know if you encounter any problem when using it.

Inquiry about your article "The Importance of Bottlenecks in Protein Networks: Correlation with Gene Essentiality and Expression Dynamics"

Q:
I was wondering if you could help me.
I read your interesting article "The Importance of Bottlenecks in
Protein Networks: Correlation with Gene Essentiality and Expression
Dynamics".

I have trouble understanding the definition of hubs and bottlenecks ("We
defined hubs as all proteins that are in the top 20% of the degree
distribution (i.e., proteins that have the 20% highest number of neighbors).
Accordingly, we defined
bottlenecks as the proteins that are in the top 20% in terms of
betweenness.")

For example: if we want to calculate proteins that are in top 10% of degree
distributions, in a PPI network with 1000 nodes, we consider 100 highest
degree nodes?

or

we take the highest degree, which is for example 700, and the hubs are the
proteins whose degree is within 10% of it, i.e., above 630?

Which one of these interpretations is correct?

A:
Your first interpretation is correct, i.e., if there are 1000 proteins in the network, we consider the top 100 proteins with the highest degrees.
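
The rank-based reading can be sketched in a few lines of Python. This is only an illustration: the function name top_fraction and the toy degrees are made up, and ties at the cutoff are broken arbitrarily here, which the paper may handle differently.

```python
def top_fraction(scores, fraction=0.2):
    """Return the nodes ranked in the top `fraction` by score.

    `scores` maps each protein to its degree (for hubs) or its
    betweenness (for bottlenecks).
    """
    k = round(len(scores) * fraction)          # e.g. 1000 nodes * 0.1 -> 100
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Toy network: 10 proteins with made-up degrees.
degrees = {"P%d" % i: d for i, d in enumerate([50, 40, 30, 9, 8, 7, 6, 5, 4, 3])}
hubs = top_fraction(degrees, fraction=0.2)
print(sorted(hubs))
```

The cutoff is a count of proteins (a rank), not a threshold relative to the maximum degree.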

Question about morph jobs “not yet completed” and errors

Q:
Greetings. Apologies for bothering you, but your morphing site suggests that
I contact you if a job is not finished in a day or so. The following jobs
were submitted last Friday:
– 692711-16199
– 692809-16356
– b692486-16007

For b692486-16007, at http://www.molmovdb.org/cgi-bin/morph.cgi?ID=
I get the following message: "Your request could not be processed. The
following error was detected: Morph not found in database."

A:
An investigation into your morphs has been completed, and we have identified likely causes of these errors. In all cases, it appears as if the errors are a consequence of the PDB file formats. Details on specific morphs are given below:

For kb protein: 2qke_monomer.pdb --> fsTB_avg_min.pdb --> 2qke_monomer.pdb
One major issue is that some residues appear to be in non-standard formats. For instance, have a look at the beginning of the ATOM records for the PDB 2qke_monomer.pdb:
ATOM 1 N MET A 1 -8.932 -20.214 2.255 1.00129.79 N
ATOM 2 CA MET A 1 -8.182 -19.102 2.920 1.00129.79 C
ATOM 3 C MET A 1 -8.441 -19.104 4.433 1.00129.79 C
ATOM 4 O MET A 1 -8.910 -18.109 4.991 1.00129.79 O
ATOM 5 CB MET A 1 -8.608 -17.750 2.326 1.00129.79 C
ATOM 6 CG MET A 1 -8.651 -17.719 0.800 1.00129.79 C
ATOM 7 SD MET A 1 -9.005 -16.077 0.094 1.00129.79 S
ATOM 8 CE MET A 1 -10.801 -15.970 0.277 1.00129.79 C

Everything appears to be perfectly fine with this MET residue. But have a look at the corresponding MET (again, the first residue in the file) within the PDB file to which we are trying to morph (ie, MET 1 in fsTB_avg_min.pdb):
ATOM 1 N MET 1 -10.344 9.596 10.785 1.00999.99
ATOM 2 HT1 MET 1 -10.167 8.615 10.490 1.00999.99
ATOM 3 HT2 MET 1 -9.599 10.204 10.388 1.00999.99
ATOM 4 HT3 MET 1 -11.263 9.902 10.406 1.00999.99
ATOM 5 CA MET 1 -10.344 9.693 12.268 1.00999.99
ATOM 6 HA MET 1 -10.330 10.740 12.539 1.00999.99
ATOM 7 CB MET 1 -11.608 9.045 12.838 1.00999.99
ATOM 8 HB1 MET 1 -11.402 8.006 13.048 1.00999.99
ATOM 9 HB2 MET 1 -12.394 9.105 12.100 1.00999.99
ATOM 10 CG MET 1 -12.105 9.700 14.117 1.00999.99
ATOM 11 HG1 MET 1 -11.283 9.762 14.816 1.00999.99
ATOM 12 HG2 MET 1 -12.888 9.087 14.538 1.00999.99
ATOM 13 SD MET 1 -12.756 11.359 13.840 1.00999.99
ATOM 14 CE MET 1 -11.572 12.350 14.748 1.00999.99
ATOM 15 HE1 MET 1 -10.811 12.710 14.072 1.00999.99
ATOM 16 HE2 MET 1 -11.114 11.749 15.519 1.00999.99
ATOM 17 HE3 MET 1 -12.078 13.190 15.200 1.00999.99
ATOM 18 C MET 1 -9.108 9.019 12.853 1.00999.99
ATOM 19 O MET 1 -8.352 9.629 13.609 1.00999.99

Of course, the morph server is trying to morph each residue into the corresponding residue of the other PDB file. However, this is very difficult here, most likely because the two residue records are laid out completely differently: MET 1 in fsTB_avg_min.pdb has more than twice as many atom records (19 versus 8, the extra ones being hydrogens) and lacks the chain identifier and element columns, making it impossible to perform the morph. Notably, MET 1 is not unique in this regard; many other residues appear to have completely different numbers of atoms and formats.

I would also mention that all morphs are pairwise, whereas you have annotated this one as a 3-way morph. What we tried to generate was therefore really the following: 2qke_monomer.pdb --> fsTB_avg_min.pdb

For ka protein:
truncated5c5e.pdb --> KaiA_transitionState.pdb --> kaiA_fromTernary.pdb --> KaiA_transitionState.pdb --> truncated5c5e.pdb

Again, one immediate issue here is that all morphs are pairwise. Thus, the following individual pairwise morphs are possible:
truncated5c5e.pdb --> KaiA_transitionState.pdb
KaiA_transitionState.pdb --> kaiA_fromTernary.pdb
kaiA_fromTernary.pdb --> KaiA_transitionState.pdb
KaiA_transitionState.pdb --> truncated5c5e.pdb

However, a single continuous morph through all 5 structures given (really a cycle of 4 morphs) is not possible.

Secondly, if you look closely at the files kaiA_fromTernary.pdb and KaiA_transitionState.pdb, these seem to be completely different sequences (i.e., the sequences of residues are very different). Some sequence differences can indeed be tolerated by the morph server, but once sequence similarity drops below a certain level, morphing becomes unreliable and eventually impossible.
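
Before submitting a pair, you can check for yourself how similar the two sequences are. The sketch below is not part of the morph server: the function names are hypothetical, it assumes both files follow the fixed-column PDB layout and use the same residue numbering, and it only compares residue names at shared residue numbers (it does not align the sequences).

```python
def ca_residues(pdb_lines):
    """Return [(residue number, residue name), ...] read from the CA atoms."""
    residues = []
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, residue name 18-20, residue number 23-26.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residues.append((int(line[22:26]), line[17:20].strip()))
    return residues

def sequence_identity(lines_a, lines_b):
    """Fraction of shared residue numbers that carry the same residue name."""
    a = dict(ca_residues(lines_a))
    b = dict(ca_residues(lines_b))
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    same = sum(1 for num in shared if a[num] == b[num])
    return same / len(shared)

# Usage (file names taken from the question above):
# with open("kaiA_fromTernary.pdb") as fa, open("KaiA_transitionState.pdb") as fb:
#     print(sequence_identity(fa, fb))
```

A value far below 1.0 suggests the pair is a poor candidate for morphing; the server's actual tolerance is not stated here.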

For kc protein: c1_from_BCcomplex.pdb --> c1_from_40om.pdb --> c1_from_BCcomplex.pdb

Here, as with the first morph, one immediate issue is that the residue formats seem to be completely different and incompatible. For instance, have a look at VAL 19 in c1_from_BCcomplex.pdb:
ATOM 1 N VAL A 19 41.315 27.606 60.932 1.00 43.58 N
ATOM 2 CA VAL A 19 40.989 28.635 59.949 1.00 41.68 C
ATOM 3 C VAL A 19 42.235 29.400 59.515 1.00 24.69 C
ATOM 4 O VAL A 19 42.771 30.221 60.265 1.00 37.44 O
ATOM 5 CB VAL A 19 39.924 29.626 60.510 1.00 46.84 C
ATOM 6 CG1 VAL A 19 39.678 30.796 59.562 1.00 31.21 C
ATOM 7 CG2 VAL A 19 38.613 28.894 60.761 1.00 49.29 C
ATOM 8 HA VAL A 19 40.613 28.209 59.163 1.00 50.01 H
ATOM 9 HB VAL A 19 40.237 29.983 61.356 1.00 56.21 H
ATOM 10 HG11 VAL A 19 39.011 31.383 59.952 1.00 37.45 H
ATOM 11 HG12 VAL A 19 40.509 31.279 59.434 1.00 37.45 H
ATOM 12 HG13 VAL A 19 39.360 30.452 58.712 1.00 37.45 H
ATOM 13 HG21 VAL A 19 37.962 29.523 61.109 1.00 59.15 H
ATOM 14 HG22 VAL A 19 38.296 28.519 59.924 1.00 59.15 H
ATOM 15 HG23 VAL A 19 38.766 28.185 61.405 1.00 59.15 H

Now compare this to the corresponding residue VAL 19 in the file c1_from_40om.pdb:
ATOM 1 N VAL A 19 -23.156 44.101 -9.426 1.00 64.75 N
ATOM 2 CA VAL A 19 -22.022 43.812 -10.291 1.00 58.63 C
ATOM 3 C VAL A 19 -21.671 42.331 -10.275 1.00 54.58 C
ATOM 4 O VAL A 19 -21.400 41.761 -9.218 1.00 55.90 O
ATOM 5 CB VAL A 19 -20.783 44.627 -9.874 1.00 52.74 C
ATOM 6 CG1 VAL A 19 -19.630 44.361 -10.818 1.00 47.35 C
ATOM 7 CG2 VAL A 19 -21.116 46.109 -9.837 1.00 61.62 C

The VAL 19 within this second file looks good, but there seems to be something wrong with the format of VAL 19 in the first file. It is likely that the errors are a result of a) the incompatible residue formats, and b) the unrecognizable format given in the first file. Here, again, I use VAL 19 only as an example; many other residues in your file seem to have this issue.

In sum, we advise homogenizing the file formats and adopting conventional residue formats, if possible. For instance, you might run a Python script that extracts only the atoms consistent with standard formats. We cannot guarantee with 100% certainty that this will fix everything, but it is the best starting point for resolving these errors.
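
As a starting point, here is a minimal sketch of such a script. It rests on one assumption drawn from the excerpts above: that the main discrepancy between the paired files is the presence of hydrogen records in only one of them. The function names are hypothetical, the code expects fixed-column PDB lines, and it does not renumber atom serials or add missing chain identifiers.

```python
def is_hydrogen(line):
    """Guess whether an ATOM record describes a hydrogen atom."""
    element = line[76:78].strip().upper()      # element symbol, columns 77-78
    if element:
        return element in ("H", "D")
    # No element column (as in fsTB_avg_min.pdb): fall back to the atom name.
    # Hydrogen names look like HA, HG11, HT1 or, in older files, 1HB.
    name = line[12:16].strip().upper()
    return name.lstrip("0123456789").startswith("H")

def heavy_atoms_only(pdb_lines):
    """Keep every line except the hydrogen ATOM records."""
    return [l for l in pdb_lines if not (l.startswith("ATOM") and is_hydrogen(l))]

# Usage:
# with open("c1_from_BCcomplex.pdb") as src:
#     kept = heavy_atoms_only(src.readlines())
# with open("c1_from_BCcomplex_heavy.pdb", "w") as dst:
#     dst.writelines(kept)
```

Running both files of a pair through the same filter should leave each residue with the same set of heavy atoms, which makes the two files easier to compare before resubmitting.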

Thank you again for using the server, and please do not hesitate to contact us if you have further questions or experience further difficulty.

Submitting MSA to coevolution webpage

Q:
I am trying to submit an MSA to your "coevolution" webpage, and it keeps failing on me.
I keep getting an error email saying "could not be completed". Error message: Not enough sequences.

I’m sure that my MSA is in FASTA format, so I’m not really sure what is going wrong.
If you could help me out, it would be much appreciated.

A:
The tool performs a number of filtering steps to ensure the
reliability of the results. The error message states that after
filtering, the number of remaining sequences is too small for a
reliable analysis. You may change the filtering criteria using the
advanced options (hidden by default), but please note that by doing
so you may get results that are unreliable.
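
To get a feel for how quickly filtering can shrink an alignment, you can count the surviving sequences locally. The sketch below does not reproduce the server's filters, which are not listed here: the two criteria (dropping duplicate sequences and sequences that are mostly gaps) and the 0.5 threshold are purely hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def surviving(records, max_gap_fraction=0.5):
    """Drop duplicates and gap-heavy sequences (hypothetical criteria)."""
    seen, kept = set(), []
    for header, seq in records:
        gaps = seq.count("-") + seq.count(".")
        if not seq or gaps / len(seq) > max_gap_fraction or seq in seen:
            continue
        seen.add(seq)
        kept.append((header, seq))
    return kept

msa = """>s1
MKT-LV
>s2
MKT-LV
>s3
M-----
>s4
MRTALV
"""
records = read_fasta(msa)
print(len(records), "sequences before,", len(surviving(records)), "after")
```

If only a handful of sequences survive even mild filters like these, adding more homologs to the alignment is likely a better fix than relaxing the server's criteria.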

Terms in Pseudopipe output, etc

Q:
I am looking for details on PseudoPipe’s output terms such as "frac", "ins", "del", "shift", "stop", and "polya". I would also like to know how PseudoPipe confirms the pseudogenes in its results.

A:
frac = fraction of parent gene that matches the pseudogene
ins = number of insertions in the pseudogene compared to parent sequence
del = number of deletions in the pseudogene compared to parent sequence
shift = number of frame shifts in the pseudogene compared to parent sequence
stop = number of stop codons in the pseudogene compared to parent sequence
polya = flag indicating the presence or absence of a polyA tail
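
If you want to work with these fields programmatically, something like the following may help. It is a sketch under assumptions: it presumes a tab-delimited file with a header row that names the six columns above, and that polya is stored as 0 or 1. The "id" column and the helper looks_disrupted are made up for illustration and are not PseudoPipe's own criterion.

```python
import csv
import io

# The six documented fields and the type assumed for each.
FIELDS = {"frac": float, "ins": int, "del": int, "shift": int, "stop": int, "polya": int}

def read_rows(handle):
    """Yield one dict per row, converting the six documented fields to numbers."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for name, convert in FIELDS.items():
            row[name] = convert(row[name])
        yield row

def looks_disrupted(row):
    """True if the candidate carries at least one frameshift or stop codon."""
    return row["shift"] > 0 or row["stop"] > 0

# Made-up rows, for illustration only.
sample = (
    "id\tfrac\tins\tdel\tshift\tstop\tpolya\n"
    "cand1\t0.95\t1\t0\t2\t1\t1\n"
    "cand2\t0.40\t0\t0\t0\t0\t0\n"
)
rows = list(read_rows(io.StringIO(sample)))
print([r["id"] for r in rows if looks_disrupted(r)])
```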

Also see below the code of the script fetchEnsemblFiles.py, which downloads the input data for eukaryotes from the Ensembl website:


#!/usr/bin/env python
# Fetch the DNA, peptide, MySQL and GTF files for one species from the FTP site.
# Usage: fetchEnsemblFiles.py <species name> [release number]

# some examples of files and locations
# pub
# lrwxrwxrwx 1 ftpuser ftpusers 30 Dec 7 16:33 current_homo_sapiens -> release-36/homo_sapiens_36_35i
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/dna
#-rw-rw-r-- 1 ftpuser ftpusers 67675771 Nov 15 14:48 Homo_sapiens.NCBI35.dec.dna.chromosome.1.fa.gz
#-rw-rw-r-- 1 ftpuser ftpusers 40802343 Nov 15 14:55 Homo_sapiens.NCBI35.dec.dna_rm.chromosome.1.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/fasta/pep
#-rw-rw-r-- 1 ftpuser ftpusers 3817861 Nov 15 19:46 Homo_sapiens.NCBI35.dec.pep.known.fa.gz
#
#/pub/release-36/homo_sapiens_36_35i/data/mysql/homo_sapiens_core_36_35i
#-rw-rw-r-- 1 ftpuser ftpusers 2957452 Dec 2 22:45 exon.txt.table.gz
#-rw-rw-r-- 1 ftpuser ftpusers 1747738 Dec 2 22:45 exon_stable_id.txt.table.gz
#-rw-rw-r-- 1 ftpuser ftpusers 1489045 Dec 2 22:45 exon_transcript.txt.table.gz
#-rw-rw-r-- 1 ftpuser ftpusers 4626 Dec 2 21:57 homo_sapiens_core_36_35i.mysql40_compatible.sql.gz
#-rw-rw-r-- 1 ftpuser ftpusers 4753 Dec 2 21:57 homo_sapiens_core_36_35i.sql.gz

import os, os.path, re, sys
from ftplib import FTP

class collect:
    """Accumulates the lines of an FTP directory listing."""
    def __init__(self): self.data = []
    def more(self, l): self.data.append(l)

def maybeRetrFile(fromPath, toPath):
    """Download fromPath to toPath unless it (or its uncompressed form) already exists."""
    what = 'from %s --> to %s' % (fromPath, toPath)
    if os.path.exists(toPath):
        print('skipping ' + what)
        return
    else:
        if toPath.endswith('.gz') and os.path.exists(toPath[:-3]):
            print('skipping (uncompressed) ' + what)
            return

    print(what)
    # binary mode, because the files are gzip-compressed
    toFile = open(toPath, 'wb')
    ec.retrbinary('RETR ' + fromPath, toFile.write, blocksize=100000)
    toFile.close()

if len(sys.argv) < 2:
    sys.exit('usage: fetchEnsemblFiles.py <species name> [release number]')

target = sys.argv[1].strip().lower().replace(' ', '_')

release = 'current_'

if len(sys.argv) > 2:
    release = 'release-' + sys.argv[2] + '/'

# set up initial connection
host = 'ftp.ensemblgenomes.org'
print('Logging into ' + host)
ec = FTP(host)
ec.login()

# look for target in a listing of pub
files = collect()
where = 'pub/' + release + 'mysql'
print('Listing ' + where)
ec.dir(where, files.more)
tEntries = [l for l in files.data if target + "_core_" in l and '->' not in l]
if len(tEntries) != 1:
    print(target + ' is either missing or not unique:')
    print(tEntries)
    print('\n'.join(files.data))
    ec.close()
    sys.exit(-1)

# "parse" current link name
curPat = re.compile(re.escape(target) + r'_core_(.+)_(.+)\Z')
tPath = tEntries[0].split()[-1]
mo = curPat.match(tPath)
if not mo:
    print("don't understand release naming scheme: " + tPath)
    ec.close()
    sys.exit(-1)
maj, minor = mo.groups()
majMin = maj + '_' + minor
outDir = target + '_' + majMin

print('Release: ' + release[:-1] + ', tPath: ' + tPath + ', target: ' + target + ', maj: ' + maj + ', majMin: ' + majMin + ', outDir: ' + outDir)

## if os.path.exists(outDir):
##     print('up to date: ' + tPath)
##     ec.close()
##     sys.exit(0)

# need to get files. first, set up directories.
[dDir, mDir, pDir] = [outDir + d for d in ['/dna/', '/mysql/', '/pep/']]
if not os.path.exists(dDir): os.makedirs(dDir, 0o744)
if not os.path.exists(mDir): os.makedirs(mDir, 0o744)
if not os.path.exists(pDir): os.makedirs(pDir, 0o744)

# retrieve dna
dnaPat = re.compile(r'\.dna(_rm)?\.chromosome\..+\.fa\.gz\Z')
dFiles = collect()
where = 'pub/' + release + 'fasta/%s/dna' % target
print('Changing dir to ' + where)
ec.dir(where, dFiles.more)
dKeep = [l for l in dFiles.data if dnaPat.search(l)]
for f in dKeep:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, dDir + fn)

# retrieve pep
where = 'pub/' + release + 'fasta/%s/pep' % target
pFiles = collect()
print('Changing dir to ' + where)
ec.dir(where, pFiles.more)
for f in pFiles.data:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, pDir + fn)

# retrieve mysql
# older releases?: mFiles = ['exon.txt.table', 'exon_transcript.txt.table', 'gene_stable_id.txt.table', 'seq_region.txt.table', 'transcript.txt.table', 'translation.txt.table', 'translation_stable_id.txt.table', target+'_core_'+majMin+'.sql', target+'_core_'+majMin+'.mysql40_compatible.sql']
# older releases which have *_stable_id.txt: mFiles = ['exon.txt', 'exon_transcript.txt', 'gene_stable_id.txt', 'seq_region.txt', 'transcript.txt', 'translation.txt', 'translation_stable_id.txt', target+'_core_'+majMin+'.sql']
mFiles = ['exon.txt', 'exon_transcript.txt', 'seq_region.txt', 'transcript.txt', 'translation.txt', target + '_core_' + majMin + '.sql']

where = 'pub/' + release + 'mysql/%s_core_%s' % (target, majMin)
print('Changing dir to ' + where)
for mf in mFiles:
    maybeRetrFile(where + '/' + mf + '.gz', mDir + mf + '.gz')

# retrieve GTF
where = 'pub/' + release + 'gtf/%s' % (target)
print('Changing dir to ' + where)
gtfPat = re.compile(r'\.gtf\.gz\Z')
gFiles = collect()
ec.dir(where, gFiles.more)
gKeep = [l for l in gFiles.data if gtfPat.search(l)]
for f in gKeep:
    fn = f.split()[-1]
    maybeRetrFile(where + '/' + fn, mDir + fn)

ec.close()

print('Processing Fetched Files')
#os.system('%s/processEnsemblFiles.sh %s' % (sys.path[0], outDir))

List of ‘LoF-tolerant’ genes

Q:
I read "Integrative Annotation of Variants from 1092 Humans: Application to Cancer Genomics" and "A systematic survey of loss-of-function variants in human protein-coding genes", and I am interested in the list of genes in the ‘LoF-tolerant’ category. I would appreciate it if you could provide it.

A:
Please see below the list of LoF-tolerant genes from the Science paper.
This list is based on the data from Phase 1 of the 1000 Genomes project.

ABHD14B
AC002511.1
AC007342.1
AC007601.1
AC008676.1
AC009041.2
AC009113.1
AC013480.1
AC018755.11
AC020763.1
AC022148.1
AC022692.1
AC079612.1
AC083883.1
AC091435.1
AC091435.2
AC092171.2
AC092329.1
AC096920.1
AC100788.1
AC100803.1
AC111170.3
AC116447.1
AC118758.1
AC124944.1
AC129492.6
AC130686.1
AC132186.2
AC133919.6
ACSM3
ACTR3C
AF131215.4
AGAP6
AHCTF1
AKR1E2
AL022324.1
AL031587.1
AL035696.1
AL122001.2
AL139385.1
AL355102.1
AL356270.2
AL359236.1
AL359392.1
AL359878.1
AL391137.1
AL449106.1
AL596442.1
ALMS1
AP000354.1
AP001793.1
AP002962.1
APOBEC3B
AQP12B
ARID3A
ARL9
ARMS2
ATP13A5
BPHL
BTN3A2
C10orf113
C10orf68
C11orf21
C12orf60
C13orf26
C14orf180
C14orf182
C17orf107
C17orf77
C17orf97
C18orf56
C19orf71
C1orf227
C20orf185
C21orf88
C2orf57
C3orf14
C3orf49
C4orf17
C5orf27
C5orf49
C6orf123
C8orf44
C9orf43
CALHM2
CAPN11
CAPN9
CASP12
CCDC163P
CCDC7
CD200R1
CD200R1L
CD207
CEACAM4
CELA1
CENPBD1
CFHR1
CLYBL
COL16A1
COL23A1
COL6A5
COX6B2
CPN2
CPNE1
CR392000.1
CRIPAK
CST9
CWH43
CYP2A13
CYP2A7
CYP2C18
CYP2C19
CYP2D6
CYP4B1
DCLRE1A
DDIT4L
DEFB126
DEFB128
DEM1
DNAJC28
DSCR8
DSG1
EBF4
EIF3CL
ENPP7
FAM111B
FAM187B
FAM25A
FAM71D
FAM75A6
FBXL21
FMO2
FMO6P
FTHL17
FUT2
GAB4
GBAP1
GBP3
GBP7
GDPD4
GLT6D1
GPR142
GPRC6A
GRIN3B
GSTT2
GSTT2B
GUF1
H2BFM
HBM
HBP1
HSD17B13
HTN3
IDI2
IDO2
IFNE
IL34
ITIH5
JMJD1C
KRT31
KRT37
KRT77
KRTAP1-1
KRTAP13-2
KRTAP1-5
KRTAP4-8
KRTAP9-1
LAD1
LCN10
LILRA2
LILRA3
LILRB1
LIPJ
LPA
LRRC39
MAGEB16
MAGEE2
MAN2A1
MBL2
METTL7B
MEX3C
MOGAT1
MS4A12
MSR1
MST1R
NACA2
NIPA2
NOXO1
NT5C1B-RDH14
OLFM4
OR10AD1
OR10D3
OR10G7
OR10R2
OR10X1
OR11G2
OR13C2
OR13C4
OR13D1
OR1B1
OR1J2
OR2A5
OR2C1
OR2D2
OR2D3
OR2G6
OR2T11
OR2T27
OR2T4
OR2V2
OR3A1
OR4C11
OR4C16
OR4D10
OR4D6
OR4L1
OR4P4
OR4S2
OR4X1
OR4X2
OR51F1
OR51H1P
OR51I2
OR51Q1
OR51V1
OR52A1
OR52A4
OR52B4
OR52I2
OR52K2
OR52M1
OR52N4
OR5AC2
OR5AR1
OR5B17
OR5H1
OR5H15
OR5K4
OR5M1
OR5M10
OR5M11
OR6C4
OR6C74
OR6Q1
OR7G1
OR7G3
OR8B3
OR8I2
OTOP1
OXGR1
PCDHA3
PCDHGA8
PKD2L1
PKHD1L1
PLA2G4D
PLA2R1
PLEKHG5
PNLIPRP3
POM121L4P
PPEF2
PRAMEF4
PRB4
PSG9
PSORS1C2
PTCHD3
PTGDR
PTX4
PXDNL
PZP
RAI1
RESP18
RFPL1
RHD
RP11-113D6.6
RP11-297N6.4
RP11-455G16.1
RP11-481A20.11
RP11-521B24.1
RP11-542P2.1
RP11-766F14.2
SATL1
SCN8A
SDR42E1
SEC14L4
SEMA4C
SERPINA9
SERPINB3
SLC22A14
SLCO1B1
SLFN12L
SNX31
SPATA18
SPATA4
SPATA8
SPERT
SPTBN5
SPZ1
STARD6
SUMF2
TAAR2
TAS2R46
TAS2R7
TBC1D29
TCHHL1
TCP10L2
TCTEX1D1
TIGD6
TLR10
TLR5
TMEM198
TMEM82
TMPRSS7
TNK1
TRIM22
TRIM38
TRIM73
TRPM1
TSPAN19
TTC24
TXNRD3IT1
UBE2NL
UGT2B10
UGT2B28
ULBP3
UNC93A
USP50
UTS2D
VN1R1
Z82214.1
ZAN
ZFP91
ZNF28
ZNF284
ZNF417
ZNF469
ZNF474
ZNF527
ZNF681
ZNF790
ZNF80
ZNF804A
ZNF812
ZNF860

Questions about “Architecture of the human regulatory network derived from ENCODE data”

Q:
I am reading your paper and have a question about the TF-target gene network data downloaded from http://encodenets.gersteinlab.org/. Which refGene version and gene symbols did you use when identifying the TF target genes from the ChIP-seq data? I find that some symbols are not included in the hg19 refGene file I downloaded from UCSC.

A:
The server was down for a while, and I wasn’t sure which names you were talking about. I now think the names are from GENCODE, but I cannot recall the exact release we used. I believe the names generally do not change between releases. You can see all the releases here; the names should be in one of the metadata files.
http://www.gencodegenes.org/releases/

Task failed coevolution

Q:
I’ve been trying to use your server, but evidently it is not working
correctly; even your example data fails to process.

I am eager to use this server (and likewise cite you), so please let
me know at your earliest convenience if/how I can use this server.

A:
The seed alignment of PF01036 has been changed recently.
BACR_HALSA is no longer one of the seed sequences. I have just changed
the example to use BACR_HALAR instead, and the program ran fine.
Please let me know if you encounter any problem running the program on
your own data.

Questions about chromatin data

Q:
I would like to compare some data she has with ChIP-chip/ChIP-seq data in the worm. We have found wig files but these are not very useful. Can you direct us to a site with peak calls? (How were the peaks called?)

A:
The published worm and fly data, including peak calls, are at:
https://www.encodeproject.org/comparative

The peak calling is described in Boyle et al. and on the website, e.g.:
https://www.encodeproject.org/comparative/regulation/#Humanset6

Regarding PseudoPipe MySQL file

Q:
I am using PseudoPipe to find pseudogenes in a query chromosome. I have a chromosome nucleotide sequence file and a protein sequence file.

I do not understand what the MySQL file is or how to obtain it, and I have the same question about the masking file.

A:
PseudoPipe is configured to run on nucleotide and protein sequence files as formatted and available for download from the Ensembl server.
Regarding your issues:

1. A MySQL file is a file downloaded from a MySQL database, and thus has its own specific format. Ensembl uses this database to store exon coordinates for all the protein-coding genes, with each record giving an exon ID, chromosome number, start and end position, strand, etc. I therefore suggest you format your exon information accordingly. As an example, you can use the "chrI_exLocs" file located in the mysql folder of the C. elegans example that you downloaded along with PseudoPipe.

2. A masking file is a nucleotide file (in FASTA format) in which all the repeat sequences of the genome are masked. If you want to create it yourself, you should use a repeat masker and format the output like the dna_rm.fa file in the dna folder of the C. elegans example.
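
If your repeat masker produced a soft-masked FASTA (repeats in lowercase), a short conversion step may be all that is needed. The sketch below assumes the expected file follows the Ensembl dna_rm convention, in which repeats are replaced by N; please compare the result against the dna_rm.fa example before using it. The function name hard_mask is hypothetical.

```python
def hard_mask(fasta_lines):
    """Replace lowercase (soft-masked) bases with N; header lines pass through unchanged."""
    out = []
    for line in fasta_lines:
        if line.startswith(">"):
            out.append(line)
        else:
            out.append("".join("N" if c.islower() else c for c in line))
    return out

# Usage (file names are placeholders):
# with open("chrI.softmasked.fa") as src, open("chrI.dna_rm.fa", "w") as dst:
#     dst.writelines(hard_mask(src))

print(hard_mask([">chrI", "ACGTacgtACGT"]))
```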