Inforequest: “Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs”

Q:

Recently I have read one of your articles: "Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs".

If possible, could you send me the academic version of this program for Linux? Or is there a location where I can download the implementation?

A:
see web link via http://papers.gersteinlab.org/papers/interolog

Yeast Network Hierarchy

Q:
I am very interested in your work on network rewiring. I have been working on experimental validation of network rewiring approaches investigating how this can be used to reprogram regulatory networks to improve heterologous protein production in Yeast. I am now in the process of analysing transcriptional rewiring phenotypes I have identified in a combinatorial library based screen. I have noticed some very interesting enrichment criteria in the groups of rewired promoters and open reading frames with regards to network structure.

I was hoping to look at how these rewired components are natively arranged with regards to their network hierarchy. I would like to use the hierarchical network model you proposed in your paper (http://www.ncbi.nlm.nih.gov/pubmed/21045205?dopt=Abstract) but I have been having trouble reconstructing it from the pdf supplemental data. I am really keen on using your model to study my experimental data further. If you have any suggestions on how I could best go about this, I would be most grateful.

A:
You might find the following links useful:

http://www.gersteinlab.org/proj/nethierarchy
http://papers.gersteinlab.org/papers/nethierarchy/
website with an earlier version of the yeast hierarchy.

http://papers.gersteinlab.org/papers/mirnet
http://papers.gersteinlab.org/papers/wormawg
information on worm & fly hierarchies

http://papers.gersteinlab.org/papers/encodenets
Human hierarchy

http://papers.gersteinlab.org/papers/callgraph
Bacterial hierarchy

I would also direct you to the wiki page:

http://info.gersteinlab.org/Hierarchy

Under the heading "Phenotypic Effects of Network Rewiring in Transcriptional Regulatory Hierarchies", this page lists, in a user-friendly format, all the data you would need to reproduce the hierarchies, with all the datasets described and annotated.

This page has the initial regulatory network of E. coli and Yeast and it also provides you with the original breadth-first search hierarchies. In addition, it lists all the changes in the hierarchy upon deletion of each gene. There is an extensive description of what each column in each file means.

Further, in order for you to better understand the algorithm/program we used, I am also attaching a lightweight Perl script that generates the hierarchy from a given network (BFS.pl); it is well annotated with an explanation of each step. I am also attaching another Perl script that I used to list the changes in the hierarchy upon deletion of each gene (count_changes_modified_hierarchy.pl). Paths will be broken for input files, but it should be enough for you to get a flavor of how we quantified changes in the modified hierarchies.
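
As a rough illustration of the kind of breadth-first level assignment described above, here is a minimal Python sketch. It is not the attached BFS.pl. The function name `bfs_levels`, the edge-list input, and the bottom-up convention (regulators that regulate no other regulator sit at level 1, and every other regulator sits one level above its nearest bottom-level descendant) are assumptions made for illustration only.

```python
from collections import deque


def bfs_levels(edges):
    """Assign a hierarchy level to each regulator.

    edges: iterable of (regulator, target) pairs (hypothetical input format).
    Returns {regulator: level}, with level 1 as the bottom of the hierarchy.
    """
    edges = list(edges)
    regulators = {src for src, _ in edges}

    # Keep only regulator -> regulator edges; self-loops are ignored.
    parents = {r: set() for r in regulators}
    regulates_a_regulator = set()
    for src, tgt in edges:
        if tgt in regulators and src != tgt:
            parents[tgt].add(src)
            regulates_a_regulator.add(src)

    # Assumed bottom level: regulators that regulate no other regulator.
    levels = {r: 1 for r in regulators - regulates_a_regulator}

    # Breadth-first search upward from the bottom level.
    queue = deque(sorted(levels))
    while queue:
        node = queue.popleft()
        for parent in sorted(parents[node]):
            if parent not in levels:
                levels[parent] = levels[node] + 1
                queue.append(parent)

    # Regulators that cannot reach the bottom level (for example, members
    # of an isolated cycle) are left unassigned in this sketch.
    return levels


example = [("A", "B"), ("B", "C"), ("C", "g1"), ("A", "C"), ("D", "g2")]
print(bfs_levels(example))
```

Re-running such a function on the network with one gene's edges removed, and comparing the two level dictionaries, gives a simple way to see which regulators change level upon a deletion.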

Query regarding paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
I recently read your paper "Classification of human genomic regions
based on experimentally determined binding sites of more than 100
transcription-related factors" in Genome Biology, since I am interested
in enhancers. If I understand things correctly, you identified ~13k
putative enhancers in K562 cells, but I cannot locate the list of loci
in the supplemental materials. I was wondering if you would be willing
to share that list with me?

A:
see http://encodenets.gersteinlab.org/metatracks/

SIN database, request detailed format

Q:

I am interested in the evolution of protein-protein interaction networks, and
recently became an enthusiastic user of your Structural Interaction Network
(SIN) database.

While downloading the data from the SIN website
(http://networks.gersteinlab.org/structint/), I noticed that more detailed
formats are available upon request for SIN versions 0.9, 1.0 and 2.0.
In particular, I am interested in which Pfam domains are involved in each
interaction, and which yeast crystal structures (hopefully PDB
identifiers) the interactions are based on.

Would it be possible to obtain this information? I would really appreciate
that. I hope to be able to use it to survey physical properties of the
interactions throughout the network, and connect it to the evolutionary
simulations I’m working on at the lab.

I have a few questions about the DynaSIN. Sorry for this long email, I tried to be as clear as possible. It would be really great if you could help me answer those questions!

Questions (1) and (2) concern the ‘Interaction Data’ section, file ‘interface_final2.txt’:

(1) What is the significance of the order in which protein A and protein B (second and third columns, respectively) are presented? In other words – if protein A and B are swapped, should the other entries (PDB IDs and surface residues) be calculated in a different way? I thought that swapping protein A and B should give the same result, but I noticed that for interactions 566 and 508, swapping protein A and B results in different PDB IDs and different surface residues for the PDB IDs they have in common:

566 HFE_HUMAN TFR1_HUMAN Permanent 1A6Z_A;1A6Z_B;26,30,49,97,122,202,204,236,243,;54,55,53,31,60,99,11,10, 1A6Z_A;1A6Z_D;; 1A6Z_C;1A6Z_B;; 1A6Z_C;1A6Z_D;26,30,49,97,122,204,236,243,;54,55,53,31,60,11,99,10, 1DE4_A;1DE4_B;30,49,121,122,204,233,236,243,;55,53,1,60,99,11,8,10, 1DE4_A;1DE4_E;; 1DE4_A;1DE4_H;; 1DE4_D;1DE4_B;; 1DE4_D;1DE4_E;30,49,97,120,122,202,204,206,207,233,236,239,243,;55,53,60,3,98,99,11,12,13,8,10, 1DE4_D;1DE4_H;; 1DE4_G;1DE4_B;; 1DE4_G;1DE4_E;; 1DE4_G;1DE4_H;30,49,97,120,121,122,202,204,233,236,;55,53,62,31,1,60,98,99,11,8,10,

508 TFR1_HUMAN HFE_HUMAN Permanent 1DE4_C;1DE4_A;629,640,;85,146, 1DE4_C;1DE4_D;; 1DE4_C;1DE4_G;; 1DE4_F;1DE4_A;; 1DE4_F;1DE4_D;629,658,;146,64, 1DE4_F;1DE4_G;; 1DE4_I;1DE4_A;; 1DE4_I;1DE4_D;; 1DE4_I;1DE4_G;629,640,;85,146,

(2) Do the surface residue numbers (column 5 and subsequent columns) correspond to their position in the full protein sequence as defined in UniProt? Or to the residue ID in the PDB file? I assume the latter (but still wanted to make sure) because sometimes the surface residue numbers exceed the protein length. For example in interaction 554, first PDB description:

554 CDC42_HUMAN RHG01_HUMAN Transient 1AM4_D;1AM4_A;532,561,563,564,;189,191,198,126,197,220, …

For the PDB ID 1AM4 (see ), chain D (protein CDC42) is 191 amino acids long (see http://www.uniprot.org/uniprot/P60953) and the surface residues are 532,561,563 and 564.

And (3), a more general question regarding the definition of ‘transient’ and ‘permanent’ interactions. In the Bhardwaj et al (2011) paper it was mentioned that:

"It should be noted here that the term ‘‘permanent’’ does not indicate that the relevant protein interacts with its partner in a strictly permanent fashion (i.e., it does not remain bound to the partner for the duration of its life time). This term (along with ‘‘transient’’ interaction) is based on the convention previously adopted by Kim et al".

I searched the Kim et al (Science 2006) paper for a definition, but I couldn’t find it in the main text or supporting information. Could you please let me know what the definition is, or point out where it is given? That would be very helpful.

A:
you might want to look at dynasin.molmovdb.org

Unfortunately, the E. coli set does not include the same level of detail which
we provide for the human set on our website. Indeed, the E. coli set, though
part of our study, was not the main focus of the study that motivated the
creation of DynaSIN [ref provided below].

Having said that, however, it should be possible to parse through our E. coli
set and to download the appropriate data from biomart by searching for gene-PDB
mappings. Again, thank you for your interest in this work.

Bhardwaj et al (2011) Integration of protein motions with molecular networks
reveals different mechanisms for permanent and transient interactions. Protein
Science 20:1745-1754.

1) This is indeed a strange observation in the file. It should not be
happening, unless there’s an implicit convention of which I’m unaware. The
analysis and file compilation were performed by a previous member of our group.
Since I cannot explain what you’ve observed for interactions 508 and 566, I’ll
have to defer your question to the post-doc who managed these files. I will cc
you on the email I am sending to him now.

2) You are correct: the surface residues are numbered according to their
numbering in the actual PDB files, and not according to their respective
UniProt residue indices.

3) You’re correct that, in the Kim et al 2006 paper, the terms "transient" and
"permanent" are never given explicit definitions. Rather, certain implied
definitions are attached to these terms in that paper. These definitions and
the reasoning are as follows:
A "transient" interaction is one in which multiple distinct pairs of
proteins interact by using a shared interface on either protein. So, for
instance, let’s say that interface "a" on protein "A" interacts with interface
"b" on protein "B". Let’s also say that it’s possible for interface "a" on
protein "A" to interact with a completely different protein (say, protein "C").
Since both "C" and "B" need to use surface "a" on "A", it is not possible for
both proteins C and B to interact with A at the same time. That is to say, such
interactions are mutually exclusive. Assuming that both interactions are, at
some point in time, essential for biological processes, it must be the case
that there’s a transient nature to these interactions, thereby enabling B and C
to interact with A at different times.
A "permanent" interaction, on the other hand, is one in which there are
no other competing pairs. The analogy here would be if "a" on "A" is inferred
to interact ONLY with "b" on "B". In theory, the interaction between "A" and
"B" may be permanent, since no other proteins need to interact with "a" on "A".

We’ll wait to hear back from one of the other authors of the DynaSIN paper, but
if anything I said above is unclear, or if you have any other queries, please
don’t hesitate to let us know.
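
The implied definitions of "transient" and "permanent" given in point 3 can be expressed as a small Python sketch. This is an illustration of the reasoning only, not the pipeline used in either paper. The function name `classify_interactions` and the input format (one tuple of protein, interface label, partner protein, partner interface label per interaction) are hypothetical, and the sketch assumes the interface labels have already been assigned by some structural analysis.

```python
from collections import defaultdict


def classify_interactions(interactions):
    """Label each interaction as "transient" or "permanent".

    interactions: list of (prot_a, iface_a, prot_b, iface_b) tuples
    (hypothetical format; interface labels are assumed to be given).
    """
    # Record which partners use each (protein, interface) pair.
    partners = defaultdict(set)
    for prot_a, iface_a, prot_b, iface_b in interactions:
        partners[(prot_a, iface_a)].add(prot_b)
        partners[(prot_b, iface_b)].add(prot_a)

    labels = {}
    for prot_a, iface_a, prot_b, iface_b in interactions:
        # An interface used by more than one partner implies mutually
        # exclusive interactions, hence "transient" under this definition.
        shared = (len(partners[(prot_a, iface_a)]) > 1
                  or len(partners[(prot_b, iface_b)]) > 1)
        labels[(prot_a, prot_b)] = "transient" if shared else "permanent"
    return labels


example = [
    ("A", "a", "B", "b"),  # interface "a" on A is also used by C
    ("A", "a", "C", "c"),
    ("D", "d", "E", "e"),  # no competing partner on either interface
]
print(classify_interactions(example))
```

In this toy input, A–B and A–C both come out as transient because they compete for interface "a", while D–E comes out as permanent.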

Thanks for bringing it up; it’s been a while since I had a look at the code behind DynaSIN (I have moved from Gerstein Lab). Anyway, ideally, the order of the proteins should not make a difference; swapping protein A and B should not change the contact residues. How many such cases do you see where the order of the proteins made a difference?
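
One way to count such cases is to parse each record and compare the two orientations of the same protein pair. The Python sketch below is illustrative only. It assumes the layout seen in the lines quoted in the question: whitespace-separated fields, with each chain-pair field written as `chainA;chainB;residuesA;residuesB` and residue lists comma-separated with a trailing comma. The two sample records are shortened excerpts of interactions 566 and 508 as quoted above, and the function names are hypothetical.

```python
def parse_record(line):
    """Parse one record in the assumed interface_final2.txt layout."""
    fields = line.split()
    rec_id, protein_a, protein_b, kind = fields[:4]
    contacts = []
    for field in fields[4:]:
        chain_a, chain_b, res_a, res_b = field.split(";")
        contacts.append((
            chain_a,
            chain_b,
            [int(x) for x in res_a.split(",") if x],
            [int(x) for x in res_b.split(",") if x],
        ))
    return {"id": rec_id, "a": protein_a, "b": protein_b,
            "type": kind, "contacts": contacts}


def pdb_ids(record):
    """Set of PDB entry IDs (the part before "_") used in a record."""
    return {chain.split("_")[0]
            for contact in record["contacts"]
            for chain in contact[:2]}


def swapped_mismatches(records):
    """Pairs of record IDs for (A, B) and (B, A) whose PDB ID sets differ."""
    by_pair = {(r["a"], r["b"]): r for r in records}
    mismatches = []
    for (a, b), rec in by_pair.items():
        rev = by_pair.get((b, a))
        # "a < b" reports each swapped pair only once.
        if rev is not None and a < b and pdb_ids(rec) != pdb_ids(rev):
            mismatches.append((rec["id"], rev["id"]))
    return mismatches


# Shortened excerpts of the two records quoted in the question.
rec_566 = ("566 HFE_HUMAN TFR1_HUMAN Permanent "
           "1A6Z_A;1A6Z_B;26,30,49,97,122,202,204,236,243,;54,55,53,31,60,99,11,10, "
           "1A6Z_A;1A6Z_D;; "
           "1DE4_A;1DE4_B;30,49,121,122,204,233,236,243,;55,53,1,60,99,11,8,10,")
rec_508 = ("508 TFR1_HUMAN HFE_HUMAN Permanent "
           "1DE4_C;1DE4_A;629,640,;85,146, 1DE4_C;1DE4_D;;")

records = [parse_record(rec_566), parse_record(rec_508)]
print(swapped_mismatches(records))
```

Applied to the full file, the length of the returned list would answer the question of how many swapped pairs disagree, at least at the level of which PDB entries are listed.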

The good thing is that these contact residues were not used for deriving the main results of the paper; they were only provided as an additional piece of data. Plus, if you think that the list of contact residues has some issues, it’s very easy to extract interface residues yourself. That also gives you the freedom to change the distance cutoff.
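
A minimal Python sketch of such an extraction follows. It assumes the atom coordinates of the two chains have already been read from the PDB file into lists of `(residue_number, x, y, z)` tuples; reading the file is left out. The function name `interface_residues` and the default cutoff of 5.0 Å are illustrative assumptions, not the values used for DynaSIN.

```python
import math


def interface_residues(chain_a, chain_b, cutoff=5.0):
    """Residues of each chain with any atom within `cutoff` of the other chain.

    chain_a, chain_b: lists of (residue_number, x, y, z) atom tuples
    (hypothetical input format). The cutoff is in the same units as the
    coordinates; 5.0 is an arbitrary illustrative default.
    """
    contacts_a, contacts_b = set(), set()
    for res_a, *xyz_a in chain_a:
        for res_b, *xyz_b in chain_b:
            if math.dist(xyz_a, xyz_b) <= cutoff:
                contacts_a.add(res_a)
                contacts_b.add(res_b)
    return sorted(contacts_a), sorted(contacts_b)


# Toy coordinates, not taken from any real structure.
chain_a = [(10, 0.0, 0.0, 0.0), (11, 10.0, 0.0, 0.0)]
chain_b = [(50, 3.0, 0.0, 0.0), (51, 30.0, 0.0, 0.0)]

print(interface_residues(chain_a, chain_b))              # default cutoff
print(interface_residues(chain_a, chain_b, cutoff=8.0))  # looser cutoff
```

Because the distance test is symmetric, swapping the two chains only swaps the two returned lists, which is the behaviour one would expect from the file as well. The all-pairs loop is quadratic in the number of atoms, which is adequate for a sketch; a spatial index would be preferable for large complexes.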

protein sequences co-evolution software

Q:

I’m writing to you in connection with your research on the computational tools for the study of residue co-evolution in protein sequences, described in Bioinformatics (2008), http://coevolution.gersteinlab.org

We have a summer internship opportunity here at Dupont Industrial Biosciences (IB) in Palo Alto and the proposed project would involve evaluating different methods for identifying co-evolving residues, so that the suitable method or methods could be applied to proteins and protein families of interest to the company. If this approach is successful, it could help guide future protein engineering efforts here at Dupont IB.

If you happen to know a candidate who would be interested in this internship opportunity, I would welcome your recommendations. I’m in the process of interviewing a few people, but would be glad to talk to additional qualified candidates.

This internship is somewhat unusual because it is not part of a bioinformatics group, so the intern would need to make independent judgments regarding the merits and drawbacks of different approaches and regarding the technical implementation of the project.

My second question is whether there are any terms or conditions associated with using the co-evolution computational tools from your lab? Are the terms different if we were to run these programs on a local computer here within the company (rather than submitting our sequences to the remote server)? I didn’t see any indications to that effect on the coevolution.gersteinlab.org page or in the publication, but it is an important aspect to clarify before using external software within the company, so I hope you can let me know what the rules are or suggest the person I should contact.

A:
I’ll look for an intern. There are no conditions on the use of this software; it’s open source. Just cite us as described on the permissions page.

Data received – Re: Your model and input data to the “…integrative analysis of transcription factor binding data” paper

Q:
Many thanks for the excellent ENCODE papers! This is an unprecedented source for life scientists, and we appreciate that accordingly!

Would you be so kind as to give us access to the model and input data for your random forest model that predicts gene expression based on transcription factor binding?

Could you please also name the source of TSS CAGE? At UCSC, our only suspects were the Riken CAGE*TSS files, or CSHL LongRNA and ShortRNA files.
We would like to run and adapt your model to the extremely tight co-regulation of ribosomal protein genes. We believe that the ENCODE TFs may account for a major part of their regulation.

Naturally, we would properly cite your works (incl. Cheng & Gerstein, 2011). Should you prefer, we are open to any reasonable forms of collaboration.

A:

See http://archive.gersteinlab.org/proj/chromodel

The human TSS CAGE data are from Roderic’s Lab.

here is the Human CAGE TSS file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.gff.gz

here is a readme file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.txt

and here are some additional explanations of how the file was made:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_june2011.pdf

Data associated w/paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:

I have read with great interest your
recent paper "Classification of human genomic regions based on
experimentally determined binding sites of more than 100
transcription-related factors".

I am interested in one specific genomic region that I am studying. Based on
the available data in the UCSC browser I believe that this region is an
Enhancer, since it contains the characteristic epigenetic marks, HS-DNA
sites, and binding of a few transcription factors.

I have tried to find the predictions made in your paper for that specific
region. However, I have not found a link to the predictions (maybe I
have just missed it). I think my region should be a BAR and a DRM. Since you
have screened ChIP-seq data from a large number of transcriptional
regulatory factors, I would be highly interested to know which specific
transcription factors bind to the region. Additionally, it would be great to
know if this region belongs to the set of DRMs for which you identified a
potential target gene.

Are the results from your study available for download? If not, would it be
possible to get the results for a specific region?

A:
see http://encodenets.gersteinlab.org/metatracks

Data associated with paper “Redefining Nodes and Edges: Relating 3D Structures to Protein Networks Provides Insight into their Evolution”

Q:

I’m hoping to analyse the data from your 2006 "Redefining Nodes and Edges: Relating 3D Structures to Protein Networks Provides Insight into their Evolution" paper. Do you have the full dataset including the pdb ids/chain ids relating to each interaction in the network?

A:
Have you seen the associated paper website: http://papers.gersteinlab.org/papers/structint

subway network maps

Q:

I’ve really enjoyed seeing your subway network maps. Our institute at Duke has an external review coming up, and I was wondering if any of the code you’ve developed might be applicable to visualize the published interactions within our institute as well as outside with other investigators across the Duke campus.

Thoughts, or is this still too early?

And I bet I’m not the only one who would be interested in this. Are you planning to publish this method? I would think many universities/groups would find this informative to understand where the interactions are happening across campus and where they aren’t happening enough.

A:
I am very enthusiastic that you find our analysis of publication
networks interesting. We just sent out a more extensive description
of this to the analysis group.

In relation to the software, we actually constructed a package for
doing literature networks and distributed some software associated
with it. It is called PubNet and it is available at
pubnet.gersteinlab.org. Unfortunately the web server does not work
that well as of late, mostly due to changes in the NCBI’s PubMed
system, but you can still download the software and do the queries
manually if you want to. I hope this is of use.