Temp issues w/Packing-Eff

Q1:

I read your journal article about Packing-Eff and found it very informative and resourceful. I would like to try to use it on some protein models that I built using comparative modelling (MODELLER).

However, when I try to access the website, Packing-Eff Online, I’m afraid it is down. Can you help me with the problem?

Q2:
I am studying the packing of residues in proteins and tried the online
version of "Packing-Eff". Unfortunately, I could not find any relation
between the output amino-acid numbering or total number of residues and the
input PDB file (I used PDB: 451C). Is this normal?

It would be kind of you to help me solve this problem.

A1 & A2:
Sorry about this. We briefly experienced a systems failure, but have since recovered. Please try again now.

A question about a CNVnator result

Q:

Recently, I used CNVnator to detect CNVs in the dog genome,
using dog genome resequencing data from the Illumina GA platform. I generated a sorted, duplicate-removed .bam file using bwa and samtools, and then ran the following commands to call CNVs:

./cnvnator -genome Canis_familiaris.CanFam3.1.71.dna_rm.toplevel.fa -root GW2.root -tree GW2_sort.bam
./cnvnator -genome Canis_familiaris.CanFam3.1.71.dna_rm.toplevel.fa -root GW2.root -his 1000 -d genome_split/
./cnvnator -root GW2.root -stat 1000
./cnvnator -root GW2.root -partition 1000
./cnvnator -root GW2.root -call 1000 >GW2_result

I obtained the result file (GW2_result) and then converted it to VCF format using cnvnator2VCF.pl, producing a GW2_result_vcf file. I found the result somewhat weird (I am new to genome CNV analysis), because I see so many large duplications and indels in the genome. I think the result file needs some filtering, but I do not know how to filter it and could not find any filtering information or standards via Google. Can you help me? Thank you very much! One of my results is in the attachment; please check!

A:
Thanks for your interest in CNVnator.
I am not sure what you mean by many. How many?
Perhaps some of those are gaps in the reference genome, with the duplications lying around those gaps.
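For simple post-filtering of CNVnator calls, a small script can drop weak or poorly supported calls. The sketch below is a hypothetical example, assuming the column layout documented for CNVnator's "-call" output (type, coordinates, size, normalized read depth, four e-values, then q0, the fraction of supporting reads with zero mapping quality); the thresholds are illustrative, not official recommendations.

```python
# Hypothetical filter for CNVnator "-call" output (assumed column order:
# type, coordinates, size, normalized RD, e-val1..e-val4, q0).
def parse_call_line(line):
    """Split one CNVnator call line into a dict of typed fields."""
    f = line.split()
    return {"type": f[0], "coords": f[1], "size": int(float(f[2])),
            "norm_rd": float(f[3]), "e_val1": float(f[4]), "q0": float(f[8])}

def keep_call(call, max_e=1e-5, max_q0=0.5):
    """Keep calls with a significant e-value and a modest q0 fraction.

    q0 close to 1 means most supporting reads map ambiguously; such calls
    often lie in repeats or near assembly gaps and are commonly discarded.
    """
    return call["e_val1"] < max_e and 0 <= call["q0"] <= max_q0

# Two made-up example calls: one well supported, one repeat-like.
lines = [
    "deletion chr1:1000-6000 5000 0.21 1e-10 1e-8 1e-9 1e-7 0.05",
    "duplication chr2:500-90500 90000 1.92 0.2 0.3 0.1 0.4 0.98",
]
kept = [c for c in map(parse_call_line, lines) if keep_call(c)]
```

Calls with q0 near 1 tend to sit in repeats or near assembly gaps, which matches the gap explanation above.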

Alternative to StoneHinge?

Q:
I am a research student working on protein structures using computational methods. I have used the tool StoneHinge to determine the hinge-region residues and the % protein rigidity for a protein. To confirm and report the significance of the putative hinge residue, I induced single-residue mutations and noted the changes in the % packing rigidity. I was unable to find any significant changes. Therefore, to confirm the importance of the putative hinge residue and the effects of mutation on the hinge movement/rotation, is there any other tool?

A:
Did you see our "related resources" page?

http://www2.molmovdb.org/wiki/info/index.php/Related_Resources

Some of the items there may be of some help.

SIN database, request detailed format

Q:

I am interested in the evolution of protein-protein interaction networks, and
recently became an enthusiastic user of your Structural Interaction Network
(SIN) database.

While downloading the data from the SIN website
(http://networks.gersteinlab.org/structint/), I noticed that more detailed
formats are available upon request for SIN versions 0.9, 1.0 and 2.0.
In particular, I am interested in which Pfam domains are involved in each
interaction, and which yeast crystal structure (hopefully with PDB
identifiers) each interaction is based on.

Would it be possible to obtain this information? I would really appreciate
that. I hope to be able to use it to survey physical properties of the
interactions throughout the network, and connect it to the evolutionary
simulations I’m working on at the lab.

I have a few questions about the DynaSIN. Sorry for this long email, I tried to be as clear as possible. It would be really great if you could help me answer those questions!

Question (1) and (2) are regarding the ‘Interaction Data’ section, file ‘interface_final2.txt’:

(1) What is the significance of the order in which protein A and protein B (second and third columns, respectively) are presented? In other words – if protein A and B are swapped, should the other entries (PDB IDs and surface residues) be calculated in a different way? I thought that swapping protein A and B should give the same result, but I noticed that for interaction 566 and 508, swapping protein A and B result in different PDB IDs and different surface residues for the PDB IDs they have in common:

566 HFE_HUMAN TFR1_HUMAN Permanent 1A6Z_A;1A6Z_B;26,30,49,97,122,202,204,236,243,;54,55,53,31,60,99,11,10, 1A6Z_A;1A6Z_D;; 1A6Z_C;1A6Z_B;; 1A6Z_C;1A6Z_D;26,30,49,97,122,204,236,243,;54,55,53,31,60,11,99,10, 1DE4_A;1DE4_B;30,49,121,122,204,233,236,243,;55,53,1,60,99,11,8,10, 1DE4_A;1DE4_E;; 1DE4_A;1DE4_H;; 1DE4_D;1DE4_B;; 1DE4_D;1DE4_E;30,49,97,120,122,202,204,206,207,233,236,239,243,;55,53,60,3,98,99,11,12,13,8,10, 1DE4_D;1DE4_H;; 1DE4_G;1DE4_B;; 1DE4_G;1DE4_E;; 1DE4_G;1DE4_H;30,49,97,120,121,122,202,204,233,236,;55,53,62,31,1,60,98,99,11,8,10,

508 TFR1_HUMAN HFE_HUMAN Permanent 1DE4_C;1DE4_A;629,640,;85,146, 1DE4_C;1DE4_D;; 1DE4_C;1DE4_G;; 1DE4_F;1DE4_A;; 1DE4_F;1DE4_D;629,658,;146,64, 1DE4_F;1DE4_G;; 1DE4_I;1DE4_A;; 1DE4_I;1DE4_D;; 1DE4_I;1DE4_G;629,640,;85,146,

(2) Do the surface residue numbers (column 5 and subsequent columns) correspond to their position in the full protein sequence as defined in UniProt, or to the residue IDs in the PDB file? I assume the latter (but still wanted to make sure) because sometimes the surface residue numbers exceed the protein length. For example, in interaction 554, first PDB description:

554 CDC42_HUMAN RHG01_HUMAN Transient 1AM4_D;1AM4_A;532,561,563,564,;189,191,198,126,197,220, …

For PDB ID 1AM4, chain D (protein CDC42) is 191 amino acids long (see http://www.uniprot.org/uniprot/P60953), yet the surface residues listed are 532, 561, 563 and 564.

And (3), a more general question regarding the definition of ‘transient’ and ‘permanent’ interactions. In the Bhardwaj et al (2011) paper it was mentioned that:

"It should be noted here that the term ‘‘permanent’’ does not indicate that the relevant protein interacts with its partner in a strictly permanent fashion (i.e., it does not remain bound to the partner for the duration of its life time). This term (along with ‘‘transient’’ interaction) is based on the convention previously adopted by Kim et al".

I searched the Kim et al (Science 2006) paper for a definition, but I couldn’t find it in the main text or supporting information. Could you please let me know what is the definition, or point out where the definition is? That would be very helpful.

A:
You might want to look at dynasin.molmovdb.org.

Unfortunately, the E. coli set does not include the same level of detail that
we provide for the human set on our website. Indeed, the E. coli set, though
part of our study, was not the main focus of the study that motivated the
creation of DynaSIN [ref provided below].

Having said that, it should be possible to parse through our E. coli set and
to download the appropriate data from BioMart by searching for gene-PDB
mappings. Again, thank you for your interest in this work.

Bhardwaj et al (2011) Integration of protein motions with molecular networks
reveals different mechanisms for permanent and transient interactions. Protein
Science 20:1745-1754.

1) This is indeed a strange observation in the file. It should not be
happening, unless there’s an implicit convention of which I’m unaware. The
analysis and file compilation were performed by a previous member of our
group. Since I cannot explain what you’ve observed for interactions 508 and
566, I’ll have to defer your question to the post-doc who managed these
files. I will cc you on the email I send to him now.

2) You are correct: the surface residues are numbered according to their
numbering in the actual PDB files, and not according to their respective
UniProt residue indices.

3) You’re correct that, in the Kim et al 2006 paper, the terms "transient" and
"permanent" are never given explicit definitions. Rather, certain implied
definitions are attached to these terms in that paper. These definitions and
the reasoning are as follows:

A "transient" interaction is one in which multiple distinct pairs of
proteins interact using a shared interface on either protein. So, for
instance, let’s say that interface "a" on protein "A" interacts with
interface "b" on protein "B". Let’s also say that it’s possible for
interface "a" on protein "A" to interact with a completely different protein
(say, protein "C"). Since both "C" and "B" need to use interface "a" on "A",
it is not possible for both proteins C and B to interact with A at the same
time. That is to say, such interactions are mutually exclusive. Assuming
that both interactions are, at some point in time, essential for biological
processes, there must be a transient nature to these interactions, enabling
B and C to interact with A at different times.

A "permanent" interaction, on the other hand, is one in which there are
no other competing pairs. The analogy here would be if "a" on "A" is inferred
to interact ONLY with "b" on "B". In theory, the interaction between "A" and
"B" may be permanent, since no other proteins need to interact with "a"
on "A".
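The interface-sharing rule above can be expressed in a few lines. This is an illustrative sketch, not the DynaSIN code: each interaction is a hypothetical (protein, interface, partner, partner interface) tuple, and an interaction is called transient when the interface on either side is shared with another partner.

```python
from collections import defaultdict

# Toy interaction list following the A/B/C example above.
interactions = [
    # (protein, interface on protein, partner, interface on partner)
    ("A", "a", "B", "b"),
    ("A", "a", "C", "c"),   # C competes with B for interface "a" on A
    ("D", "d", "E", "e"),   # no competitor anywhere: permanent
]

# Record which partners use each (protein, interface) surface.
partners_per_interface = defaultdict(set)
for p1, i1, p2, i2 in interactions:
    partners_per_interface[(p1, i1)].add(p2)
    partners_per_interface[(p2, i2)].add(p1)

def classify(p1, i1, p2, i2):
    """Transient if either interface is shared with another partner."""
    competing = (len(partners_per_interface[(p1, i1)]) > 1
                 or len(partners_per_interface[(p2, i2)]) > 1)
    return "transient" if competing else "permanent"

labels = [classify(*x) for x in interactions]
```

Because B and C both need surface "a" on A, both of those interactions come out transient, while the D/E pair is permanent.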

We’ll wait to hear back from one of the other authors of the DynaSIN paper,
but if anything I said above is unclear, or if you have any other queries,
please don’t hesitate to let us know.

Thanks for bringing it up; it’s been a while since I looked at the code behind DynaSIN (I have moved on from the Gerstein Lab). Anyway, ideally the order of the proteins should not make a difference; swapping protein A and B should not change the contact residues. How many cases do you see where the order of the proteins made a difference?

The good thing is that these contact residues were not used for deriving the main results of the paper; they were only provided as an additional piece of data. Plus, if you think that the list of contact residues has some issues, it is very easy to extract interface residues yourself. That also gives you the freedom to change the distance cutoff.
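A minimal way to extract interface residues with a custom distance cutoff is sketched below. This is an illustrative stand-alone script, not the code used for DynaSIN: it treats any residue pair with atoms within `cutoff` angstroms of each other as interface residues, and the demo uses hypothetical coordinates rather than a real structure.

```python
import math

def parse_atoms(pdb_text):
    """Extract (chain, residue number, x, y, z) from standard ATOM records."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atoms.append((line[21], int(line[22:26]),
                          float(line[30:38]), float(line[38:46]),
                          float(line[46:54])))
    return atoms

def interface_residues(atoms, chain1, chain2, cutoff=5.0):
    """Residue numbers on each chain with any atom within cutoff of the other."""
    res1, res2 = set(), set()
    for _, r1, *xyz1 in (a for a in atoms if a[0] == chain1):
        for _, r2, *xyz2 in (a for a in atoms if a[0] == chain2):
            if math.dist(xyz1, xyz2) <= cutoff:
                res1.add(r1)
                res2.add(r2)
    return sorted(res1), sorted(res2)

# Tiny synthetic structure (made-up coordinates, PDB fixed-column layout).
def atom_line(chain, resseq, x, y, z):
    return f"ATOM      1  CA  ALA {chain}{resseq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"

pdb = "\n".join([atom_line("A", 10, 0.0, 0.0, 0.0),
                 atom_line("B", 55, 3.0, 0.0, 0.0),
                 atom_line("B", 99, 30.0, 0.0, 0.0)])
chainA_res, chainB_res = interface_residues(parse_atoms(pdb), "A", "B")
```

Raising or lowering `cutoff` directly changes which residues are called interfacial, which is exactly the freedom mentioned above.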

Understanding transcriptional regulation by integrative analysis of transcription factor binding data

Q1:

In the article, it is mentioned that recent studies often had the problem that they depended on techniques like microarrays, which is why those studies were not able to measure expression levels of gene isoforms very accurately. It is also said that in this study those problems would not exist, because ENCODE data was used. So I looked up the ENCODE project, but I am not quite sure why this data should be more accurate.

A1:
As we described in the paper, the ENCODE project generated CAGE data that measure the expression level of each TSS (transcription start site) of a gene. These data enable us to assess the effect of the TF binding signal near a TSS on the expression level of that TSS.

Q2: Another point I am not sure about is how this model is used. What kind of data do you have to feed into the program? Do you use transcription factor binding data, or do you just choose your transcription factor and the start-site sequence, and the program tells you the probability of getting an mRNA transcript? And if the first option is true, why is it easier to get the binding data of a transcription factor than the expression data? Because if you have interactions with the chromatin structure, the latter should be more accurate, shouldn’t it?

A2: The input to the model is the TF binding signal near each TSS (for all TFs with ChIP-seq data available from ENCODE) AND the expression levels of all TSSes. Since we are using a supervised model, we randomly select 2000 TSSes for training the model and test the performance of the model on the remaining data. I think your confusion is: since it is easy and more accurate to measure gene expression by RNA-seq or other experiments, why bother using ChIP-seq TF binding data to make predictions? The goal of our model is not to predict gene expression. The goal is to use the model to quantify the relationship between gene expression and TF binding. We want to know: how much gene expression can be explained by the TF binding signal? Which TF is more important? TF binding at which positions contributes more? And other questions.
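As a rough illustration of this supervised setup, the sketch below fits a model on synthetic data. It assumes a simple linear form (ordinary least squares on TF binding signals; the paper's actual model may differ), mirrors the 2000-TSS training split described above, and reports held-out R-squared, i.e. how much expression variance the binding signals explain.

```python
import numpy as np

# Synthetic data: rows are TSSes, columns are TF binding signals near
# each TSS; y plays the role of CAGE expression. All values are made up.
rng = np.random.default_rng(0)
n_tss, n_tf = 3000, 5
X = rng.normal(size=(n_tss, n_tf))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0])       # only some TFs matter
y = X @ true_w + rng.normal(scale=0.1, size=n_tss)  # small noise term

# Randomly pick 2000 TSSes for training; test on the rest.
idx = rng.permutation(n_tss)
train, test = idx[:2000], idx[2000:]

# Fit ordinary least squares (with an intercept column) on training TSSes.
w, *_ = np.linalg.lstsq(np.c_[X[train], np.ones(len(train))],
                        y[train], rcond=None)

# Held-out R^2: the fraction of expression variance explained by binding.
pred = np.c_[X[test], np.ones(len(test))] @ w
r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
```

The fitted weights then answer the "which TF is more important" question for this toy example: the TFs with zero true weight get coefficients near zero.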

Q3: I am also curious whether the developed model is already used for the more predictive transcription factors, or whether it was not intended to be used that way. If it was applied, do you know some groups who did so? I’m quite interested in whether they could produce consistent data with this method.

A3: To my knowledge, many other groups have also tested models to study the relationship between gene expression and TF binding and/or histone modification. You may find the paper by Zhengqing Ouyang in PNAS (PMID:19995984), the one by Xianjun Dong in Genome Biology (PMID:22950368), and many other publications. Again, the goal is to understand the regulation conferred by TF binding and histone modifications, rather than to predict gene expression.

Correlation ACT error

Q:

I am trying to run the correlation Java program and I get the following when I run the example:

java -jar EncodeTfCor2.jar human_genome_file.txt bedlist 1000000 0
Parsing genome chromosomes and tf bindings …
parsing human_genome_file.txt…
parsing lists in bedlist
Building data matrix …
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
at encodetfcor2.TfSitesDataMatrixBuilder.<init>(TfSitesDataMatrixBuilder.java:126)
at encodetfcor2.Main.main(Main.java:65)

Any idea what’s going on? I have zero familiarity with Java, so I am completely lost.

Well, I got rid of it and installed a new version, and this time I ran the SNP data as the example and it worked. I have no idea what happened. One quick question, though: I ran the example with the SNPs from the four individuals and I got the following matrix:

1.000000 0.984057 0.983579 0.941439
0.984057 1.000000 0.985570 0.956917
0.983579 0.985570 1.000000 0.952203
0.941439 0.956917 0.952203 1.000000

The track_names.txt says the following:

chinese.sites.chr1.parsed
korean.sites.parsed.chr1
venter.sites.parsed.chr1
watson.sites.parsed.chr1

so is the actual matrix then:

names chinese korean venter watson
chinese 1.000000 0.984057 0.983579 0.941439
korean 0.984057 1.000000 0.985570 0.956917
venter 0.983579 0.985570 1.000000 0.952203
watson 0.941439 0.956917 0.952203 1.000000

The readme file isn’t very clear on that. Thanks.

A:
Yes, the matrix is labelled correctly.
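Pairing the matrix with track_names.txt can be done mechanically. The sketch below assumes, as confirmed above, that the i-th name in track_names.txt labels both the i-th row and the i-th column of the output matrix.

```python
# Names as in track_names.txt (shortened to the individual for readability)
# and the correlation matrix exactly as printed by the tool.
track_names = ["chinese", "korean", "venter", "watson"]
matrix_text = """\
1.000000 0.984057 0.983579 0.941439
0.984057 1.000000 0.985570 0.956917
0.983579 0.985570 1.000000 0.952203
0.941439 0.956917 0.952203 1.000000"""

matrix = [[float(v) for v in row.split()] for row in matrix_text.splitlines()]

# Look up any pairwise correlation by name instead of by index.
labelled = {(track_names[i], track_names[j]): matrix[i][j]
            for i in range(len(matrix)) for j in range(len(matrix[i]))}
```

With the dictionary in hand, `labelled[("chinese", "watson")]` gives the chinese-vs-watson correlation directly, and the matrix's symmetry is easy to spot-check.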

The pseudogene information of zebrafish in Pseudogene.org should be updated

Q:
In Pseudogene.org, the pseudogene datasets of zebrafish (Danio rerio)
were based on old annotations (Ensembl 55?). There were about ~1800
processed pseudogenes. However, according to a recent study
(http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html),
pseudogenes are rare in zebrafish (only 21 processed pseudogenes,
according to Supplementary Table 14 of the published manuscript).

Is this large discrepancy due to the old annotations?

A:
You are right about the zebrafish pseudogenes in pseudogene.org. The results were based on an old genome assembly, Ensembl release 55, which dates from 2009. We do notice that the pseudogene number is way too high, which we believe is partially due to the quality of the genome assembly and partially due to the pipeline parameters having been optimized for primates. Given that, for up-to-date pseudogene information in zebrafish, people should refer to the Nature publication: http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html.

Thanks for pointing this out to us.

protein sequences co-evolution software

Q:

I’m writing to you in connection with your research on computational tools for the study of residue co-evolution in protein sequences, described in Bioinformatics (2008): http://coevolution.gersteinlab.org

We have a summer internship opportunity here at Dupont Industrial Biosciences (IB) in Palo Alto and the proposed project would involve evaluating different methods for identifying co-evolving residues, so that the suitable method or methods could be applied to proteins and protein families of interest to the company. If this approach is successful, it could help guide future protein engineering efforts here at Dupont IB.

If you happen to know a candidate who would be interested in this internship opportunity, I would welcome your recommendations. I’m in the process of interviewing a few people, but would be glad to talk to additional qualified candidates.

This internship is somewhat unusual because it is not part of a bioinformatics group, so the intern would need to make independent judgments regarding the merits and drawbacks of different approaches and regarding the technical implementation of the project.

My second question is whether there are any terms or conditions associated with using the co-evolution computational tools from your lab? Are the terms different if we were to run these programs on a local computer here within the company (rather than submitting our sequences to the remote server)? I didn’t see any indications to that effect on the coevolution.gersteinlab.org page or in the publication, but it is an important aspect to clarify before using external software within the company, so I hope you can let me know what the rules are or suggest the person I should contact.

A:
I’ll look for an intern. There are no conditions on the use of this software: it’s open source. Just cite us as described on the permissions page.

what kind of indels are incorporated into the diploid genome assembly of the NA12878 individual?

Q:

I would like to ask about what kind of indels are incorporated into the
diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this
genome. What confuses me is that when I perform a BLAST search with
one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf
(P2_M_061510_21_73), it shows up in both the maternal and paternal
haplotypes. Was there any size cutoff used for the indels that were
selected for this assembly?

A:
In the latest version, no fosmid indels/SVs were used, only the output of GATK.

list of LoF tolerant genes (140) and list of essential genes (115)

Q:
I read with great interest your exciting paper, "Interpretation of genomic variants using a unified biological network approach".
In the last section of the Results, you describe the validation of your logistic regression model using a list of 140 LoF-tolerant genes (MacArthur et al 2012) and a list of 115 essential genes (Liao et al 2008). Even though I also read both papers, I couldn’t really find the lists of genes mentioned above (e.g. the supplementary table of Liao’s essential genes lists 120 genes, not 115).
So, I was wondering if you’d be so kind as to share the list of 140 LoF-tolerant genes and the list of 115 essential genes.

A:
In our PLoS Computational Biology paper, in Supplementary Table S8, the genes with significance_score=0 (second column) are LoF-tolerant genes and the genes with significance_score=3 are essential genes. This file contains 140 LoF-tolerant and 115 essential genes.

I think Liao et al report 120 essential genes, but with gene ID conversions we lost 5 of them.
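Extracting the two lists from a file like Supplementary Table S8 is a one-pass filter. The sketch below is hypothetical: it assumes a tab-separated file with a header whose second column is named significance_score, and the gene names are made up; the actual column names in the published table may differ.

```python
import csv
import io

# Stand-in for Supplementary Table S8: tab-separated, with
# significance_score in the second column (0 = LoF-tolerant, 3 = essential).
table_s8 = io.StringIO(
    "gene\tsignificance_score\n"
    "GENE1\t0\n"
    "GENE2\t3\n"
    "GENE3\t1\n"
    "GENE4\t0\n"
)

lof_tolerant, essential = [], []
for row in csv.DictReader(table_s8, delimiter="\t"):
    score = int(row["significance_score"])
    if score == 0:          # LoF-tolerant genes
        lof_tolerant.append(row["gene"])
    elif score == 3:        # essential genes
        essential.append(row["gene"])
```

Running the same filter over the real Table S8 should yield the 140 LoF-tolerant and 115 essential genes described above.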