OrthoClust Input Question

I plan to use your program OrthoClust, but I am a little confused about the format of the network input files. The README just says a list of the nodes; if the first column is node A, is the second column a node co-associated with node A?

You are right. A net file is simply what some people call an edge list. The two numbers on each row form an edge. You can see examples in the data folder on GitHub.
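To make the edge list idea concrete, here is a minimal sketch of reading such a file into an adjacency structure. It assumes whitespace-separated columns and undirected edges, neither of which is stated above; `read_edgelist` is a hypothetical helper, not part of OrthoClust.

```python
from collections import defaultdict

def read_edgelist(lines):
    """Parse an edge list: each non-empty line holds two node ids."""
    adjacency = defaultdict(set)
    for line in lines:
        fields = line.split()          # assumption: whitespace-separated columns
        if len(fields) < 2:
            continue                   # skip blank or malformed lines
        a, b = fields[0], fields[1]
        adjacency[a].add(b)
        adjacency[b].add(a)            # assumption: edges are undirected
    return adjacency

# Four edges: a triangle 1-2-3 and a separate pair 4-5.
example = ["1 2", "1 3", "2 3", "4 5"]
net = read_edgelist(example)
print(sorted(net["1"]))
```

With a real file, something like `read_edgelist(open(path))` would do the same thing line by line.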

OrthoClust questions

I am contacting you regarding the OrthoClust program that your group has on GitHub. I have a couple of questions about how to apply the program to new datasets. First, how was the co-appearance matrix calculated from the OrthoClust output? Second, is it necessary to modify the initial number of spin states (q) or the coupling constant (k) parameters that were used in the 2014 Genome Biology paper? I am not able to find these options in the current release and am wondering where these values can be changed in the code.

The current implementation on GitHub is based on a heuristic rather than the simulated annealing method used in the 2014 Genome Biology paper. The initial number of spin states q is no longer a parameter you have to supply; it is set to the total number of nodes in the system. As explained in the README, the coupling constant k is supplied in one of the input files (the coupling information file), as the 3rd column. In my example (the ortho_info file in the data folder), the third column is all 1, meaning k=1.
For the co-appearance matrix, note that the output file is a tab-delimited file with three columns. The 1st and 2nd columns are the species id and the gene id given by the input files. The 3rd column is a module id. Suppose there are N1 genes in species 1 and N2 genes in species 2; the co-appearance matrix then has dimensions (N1+N2) by (N1+N2). One should build a map from the genes in the individual species to indices running from 1 to (N1+N2). Suppose there are n genes in module 1; then all pairwise combinations of these n genes should be marked as 1 in the corresponding matrix elements.
One output file can be used to make a co-appearance matrix (with only 0 and 1). If you have multiple output files from multiple runs of the algorithm, you arrive at the final co-appearance matrix shown in the Genome Biology paper by adding the results together. Of course, in order to make a plot like the heat map shown in the paper, one has to further cluster the matrix to arrange the rows and columns.
If you use Julia, I may be able to send you a little script.
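The construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script mentioned in the answer: `read_modules` and `coappearance` are hypothetical names, indices are 0-based rather than 1-based, and the diagonal is left at 0 because only pairs of distinct genes are counted.

```python
from itertools import combinations

def read_modules(lines):
    """Parse 'species<TAB>gene<TAB>module' rows into {(species, gene): module}."""
    assignment = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                                  # skip blank or short lines
        species, gene, module = fields[:3]
        assignment[(species, gene)] = module
    return assignment

def coappearance(runs):
    """Sum the 0/1 co-appearance matrices of one or more runs."""
    genes = sorted({key for run in runs for key in run})
    index = {key: i for i, key in enumerate(genes)}   # gene -> row/column index
    n = len(genes)
    matrix = [[0] * n for _ in range(n)]
    for run in runs:
        by_module = {}
        for key, module in run.items():
            by_module.setdefault(module, []).append(index[key])
        for members in by_module.values():
            for i, j in combinations(members, 2):     # every pair in the module
                matrix[i][j] += 1
                matrix[j][i] += 1
    return genes, matrix

# Two made-up runs over two genes per species.
run1 = read_modules(["1\tg1\t1", "1\tg2\t1", "2\th1\t1", "2\th2\t2"])
run2 = read_modules(["1\tg1\t1", "1\tg2\t2", "2\th1\t1", "2\th2\t2"])
genes, matrix = coappearance([run1, run2])
for row in matrix:
    print(row)
```

Module ids are only compared within a single run, so it does not matter if different runs label the same module differently. Ordering the rows and columns for a heat map would still need a separate clustering step.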

Control parameters in the annealing process in the OrthoClust R package

Recently, I have been trying to use the OrthoClust R package for multiple-species network clustering. I did not find the control parameters for the annealing process described in your paper: "Standard simulated annealing was employed. Spin values were randomly assigned initially, and updated via a heat bath algorithm. The initial temperature was chosen in a way such that the flipping rate (the probability that a node changes its spin state) was higher than 1 – 1/q. The temperature was gradually decreased with a cooling factor 0.9, until the flipping rate was less than 1%." I also did not find the simulated annealing algorithm in the MATLAB file OrthoClustN.m (it appears to be replaced by a greedy algorithm). Please help me solve this problem. Thank you for your time.

The annealing procedure is very slow for practical problems. In the revision stage of our manuscript, we discovered the greedy algorithm (the Louvain algorithm), so we wrapped it in the MATLAB code and implemented it in R too. It’s a very well-regarded algorithm, and we strongly encourage you to try the MATLAB code for your purpose.
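For readers who still want to see what the annealing procedure quoted in the question looks like, here is a minimal single-network sketch. It is not the code used for the paper: the energy is a generic modularity-style Potts energy with no cross-species coupling, the starting temperature `t0` is a fixed guess rather than being tuned to the flipping-rate criterion, and `anneal_potts`, `clique_pair`, `sweeps`, and `max_steps` are hypothetical names introduced here. Only the cooling factor 0.9 and the 1% stopping rule come from the quoted text.

```python
import math
import random

def anneal_potts(adj, q, t0=5.0, cooling=0.9, sweeps=20, seed=0, max_steps=200):
    """Heat-bath annealing; adj is a list of neighbour sets (no self-loops)."""
    rng = random.Random(seed)
    n = len(adj)
    degree = [len(nb) for nb in adj]
    two_m = sum(degree)                               # twice the number of edges
    spins = [rng.randrange(q) for _ in range(n)]      # random initial spins
    deg_in_state = [0.0] * q                          # total degree per spin state
    for i in range(n):
        deg_in_state[spins[i]] += degree[i]

    temperature = t0
    for _ in range(max_steps):
        flips = 0
        for _ in range(sweeps):
            for i in range(n):
                old = spins[i]
                deg_in_state[old] -= degree[i]        # take node i out first
                links = [0] * q
                for j in adj[i]:
                    links[spins[j]] += 1
                # Local energy of placing node i in state s (lower is better).
                energy = [-(links[s] - degree[i] * deg_in_state[s] / two_m)
                          for s in range(q)]
                e_min = min(energy)
                weights = [math.exp(-(e - e_min) / temperature) for e in energy]
                new = rng.choices(range(q), weights=weights)[0]  # heat-bath draw
                spins[i] = new
                deg_in_state[new] += degree[i]
                flips += new != old
        if flips / (sweeps * n) < 0.01:               # flipping rate below 1%
            break
        temperature *= cooling                        # cooling factor 0.9
    return spins, temperature

def clique_pair():
    """Toy network: two 4-node cliques joined by the single edge 3-4."""
    edges = [(a, b) for a in range(4) for b in range(a + 1, 4)]
    edges += [(a + 4, b + 4) for a, b in edges]
    edges.append((3, 4))
    adj = [set() for _ in range(8)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

spins, final_t = anneal_potts(clique_pair(), q=8)
print(spins, final_t)
```

On the toy network one would expect the two cliques to end up in two different spin states, but the result is stochastic and depends on the seed. Every node is visited many times at every temperature, which illustrates why this approach becomes slow on large networks.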

Spectral biclustering

I recently read your 2003 paper titled "Spectral biclustering of microarray data: Coclustering genes and conditions".

I would like to investigate implementing your approach on a GPU.
Is there any code (Matlab? Python?) you would be willing to share as a result of the paper?

Sorry, we’re just using simple SVD routines in MATLAB. No meaningful code available. -marK

OrthoClust – for more than two species

I just read your recently published paper on the OrthoClust approach. It is well-grounded work from both practical and mathematical points of view.

I ran your R scripts on my own data and they worked perfectly fine. However, I am wondering how I can use the script for more than two species.

I would appreciate it if you could help me find a solution.

Thanks for your interest in OrthoClust. OrthoClust definitely works on more than two species. The R script is a primitive version for illustrating the concept outlined in the paper. We understand the importance of the N-species generalization, and we have put up new MATLAB code for N species. It makes use of efficient code written by Mucha and Porter that implements the Louvain algorithm for modularity optimization. The 3rd-party code as well as our wrapper is now in the gersteinlab GitHub.
Apart from MATLAB, we are planning to provide a wrapper for Python or R later.
The N-species code is not exactly what we used for the paper, so if you find any bugs or have any questions, please let me know. We are trying to make a more user-friendly package anyway.

rulefit3 in encodenets

I read your paper about the co-associations among TF binding events (Architecture of the human regulatory network derived from ENCODE data) and became interested in your original clustering algorithm. In our laboratory, we are developing a new clustering algorithm for large amounts of genomic data, and we have implemented a prototype. However, the accuracy of our algorithm is not yet well established, and we have to evaluate it. Thus, we want to use your algorithm as a basis for comparison. How can we use it? If the program is available to us, could you tell us how to use it?

In that paper we used the Rulefit3 package from Prof. Jerome Friedman; there is an R package available at the link below. Our use of the algorithm is extensively documented in Section C of the Supplementary Materials.


Architecture of the human regulatory network derived from ENCODE data http://dx.doi.org/10.1038/Nature11245