Error reports for the LARVA software

Q1:
I am using the LARVA software to investigate noncoding hotspot mutations, but it reported the following error message:

Error: Mutation counts file example.snv.bed has too few columns on line 1. Expected at least 5, but found 4. Exiting.

The command I used: ./larva -vf example.snv.bed -af example.anno.bed -o larva.out -b

I am confused because the “example.snv.bed” file really does have 5 columns separated by tabs, but the error says only 4 were found. I have tried many things but still cannot figure it out. Could you please help?

#####
The example.snv.bed file looks like this:

chrM 5650 5651 BLCA_GD blca01

chrM 8863 8864 BLCA_GD blca01

chr1 1111476 1111477 BLCA_GD blca01

chr1 1632977 1632978 BLCA_GD blca01

chr1 1657153 1657154 BLCA_GD blca01

chr1 2584370 2584371 BLCA_GD blca01
####

The example.anno.bed file looks like this:

The fourth column is the annotation info (only a subset is shown).

I would really appreciate your help.

A1:
It looks like the variant file and annotation file excerpts you attached to your email contain the same data (based on columns 1-3). I suspect that wasn’t your intended use of LARVA. Could you please send me the actual set of annotations you’re using? It would be a huge help in uncovering the root cause of the error.
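
Because the error concerns the number of columns found on a line, it can help to count the tab-separated fields in the input files directly. The following Python sketch is only a diagnostic aid and is not part of LARVA. The function names are hypothetical, and it assumes the files are meant to be tab-delimited, as described in the question. In this check, a line whose columns are separated by spaces rather than tabs is reported as having a single column.

```python
def count_columns(lines, delimiter="\t"):
    """Return (line_number, column_count) for every non-empty line."""
    counts = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\r\n")
        if not stripped:
            continue  # skip blank lines
        counts.append((number, len(stripped.split(delimiter))))
    return counts


def lines_with_too_few_columns(lines, minimum, delimiter="\t"):
    """Return (line_number, column_count) for lines below the minimum."""
    return [(n, c) for n, c in count_columns(lines, delimiter) if c < minimum]


# Illustrative input: the first line uses tabs, the second uses spaces.
sample = [
    "chrM\t5650\t5651\tBLCA_GD\tblca01\n",
    "chr1 1111476 1111477 BLCA_GD blca01\n",
]
print(lines_with_too_few_columns(sample, minimum=5))
```

To check a real file, pass an open file handle instead of the sample list, for example lines_with_too_few_columns(open("example.snv.bed"), minimum=5).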

Q2:
As you said, I think the input annotation file may be what causes the error. Actually, I do not fully understand what the annotation file should be.

In your paper published in 2015, the abstract says: “We make LARVA available as a software tool and release our highly mutated annotations as an online resource (larva.gersteinlab.org).”

So, using the highly mutated annotations you provided may be appropriate. However, this website (“larva.gersteinlab.org”) can no longer be reached. I hope you can provide some help.

Sorry to bother you with these little things. I used the RegulomeDB annotation file as LARVA’s input annotation file, and the first error I sent you last time disappeared, but there was a new error like this:

$ processing chromosomes………………….

Error: Invalid length of 0 in annotation file, line 2

Length must be greater than zero

RegulomeDB annotation file (only the first 4 columns were used): [[see image]]

A2:
I apologize for the accessibility issues with the LARVA website. There was a recent change on the backend that messed up the IP address routing to the website. I’ve contacted our IT people about the issue, but until they fix things on their end, the LARVA website can be accessed with its raw IP address: http://54.164.95.124/

Also, concerning your RegulomeDB issue, you get an "Invalid length of 0" error because the annotation on the second line uses the same coordinate for start and end. The program considers the annotation length to be (end-start), so the second annotation appears to have zero length, which doesn’t really make sense. In fact, it looks like the entire file is made of single nucleotides. This would make sense for the variant file, but for the annotation file, the intention is that the annotations represent intervals on the genome that perform some function. These are typically regions like exons, promoters, enhancers, etc. The idea is to see if these annotations are being hit with a large number of mutations. Single nucleotides don’t really match that annotation definition.
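
The (end-start) length rule described above can be checked before running the program. The following Python sketch is not LARVA's own code. It is a hypothetical pre-check that assumes a tab-delimited annotation file with chromosome, start, and end in the first three columns and no header lines. The coordinates in the example are made up for illustration.

```python
def find_nonpositive_intervals(lines):
    """Return (line_number, chrom, start, end) where end - start <= 0."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # not enough columns to hold chrom, start, end
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end - start <= 0:
            bad.append((number, chrom, start, end))
    return bad


# Made-up annotations: the second one has the same start and end.
annotations = [
    "chr1\t1000\t2000\tpromoter\n",
    "chr1\t5000\t5000\tsingle_site\n",
]
print(find_nonpositive_intervals(annotations))
```

Any line reported by this check would have a length of zero or less under the (end-start) definition.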

I hope this helps.

Architecture of the human regulatory network derived from ENCODE data

Q:
I have a question about the following excerpt from page 37 of the
supplementary materials:

“In this paper, we mainly present a TF-centric analysis. We have also
analyzed other types of genomic contexts, such as gene-centric contexts, to
reveal the effect of context-specific TF co-associations to gene expression,
as well as chromatin state contexts to reveal relationships of TF
co-associations to various enrichments of chromatin marks. We plan to
present these results in a future publication.”

Were those results ever published? If so, could you please point me to them?
I’m looking for updated regulatory networks based on ENCODE data.

A:
Try:

http://papers.gersteinlab.org/papers/metatrack
http://papers.gersteinlab.org/papers/loregic

Zebrafish pseudogenes

Q:
I have a question regarding zebrafish pseudogenes. I searched for a few
zebrafish genes on pseudogene.org to check whether they have any
pseudogenes, and I found that there are 15779 zebrafish pseudogenes.
But the Nature reference that you mentioned in your blog reports a
total of 154 zebrafish pseudogenes! Could you please let me know how
one can see those 154 pseudogenes, so that I can check whether my genes
of interest have pseudogenes or not?

A:
Pseudogene.org provides a set of pseudogenes resulting from automatic annotation. Zebrafish is a peculiar genome. It was subjected to numerous large-scale genome duplications and thus is full of repeats. As such, the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation, which resulted in a much smaller number of pseudogenes. The continuous improvements in the genome annotation result in further improvements in pseudogene annotation. I attach here the latest set of zebrafish pseudogenes.

Orthoclust Input Question

Q:
I plan to use your program OrthoClust but I am a little confused about the input format of the network files. The README just says a list of the nodes; if the first column is node A, is the second column a node co-associated with node A?

A:
You are right. A net file is simply what some people call an edgelist. Two numbers in a row form an edge. You can see examples in the data folder found on GitHub.
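
To illustrate the edgelist idea, here is a minimal Python sketch that reads such rows into a list of edges and builds a neighbor lookup. It is not OrthoClust code. The function names are hypothetical, and the sketch assumes whitespace-separated integer node IDs, undirected edges, and that any columns after the first two can be ignored.

```python
from collections import defaultdict


def read_edgelist(lines):
    """Parse rows of 'nodeA nodeB' into a list of (nodeA, nodeB) edges."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        edges.append((int(fields[0]), int(fields[1])))
    return edges


def neighbors(edges):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)  # assumption: edges are undirected
        adjacency[b].add(a)
    return adjacency


# Illustrative rows: node 1 is linked to node 2, and node 2 to node 3.
example_rows = ["1 2\n", "2 3\n"]
example_edges = read_edgelist(example_rows)
print(example_edges)
print(sorted(neighbors(example_edges)[2]))
```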

Query on SCA coevolution

Q:
I have conducted structure-based statistical coupling analyses (SCA) on each
of several mitochondrial proteins using 800 sequences (including one
sequence from our organism, one 3RKO structure sequence, and 788 protein
sequences from different genera), and we obtained the coevolutionary
scores and spatial distances for every pair of residues. The aim of
our study is to analyze the coevolutionary role of some important
residues (selected by PAML analyses) with respect to the key residues
responsible for proton translocation in the proton-translocating channel of
respiratory Complex I. The problem is that we are not sure how to do this in a
more statistical way. For example, we have the scores and distances between
a given selected residue and the residues in the proton channel, as well as
the other residues of the same protein. To find out whether a given selected
residue plays a different coevolutionary role toward proton channel residues
than toward other residues, we used t-tests to compare the scores (s), the
distances (d), or the scores/distances ratio (s/d) between those two types of
residues, but we are not sure whether this kind of analysis is appropriate.
For example, we don’t know whether the score obtained by the SCA analyses in
the platform has already taken the distance into account, or whether it is
just the score obtained no matter where the two residues are. We know that the
influence between any two given residues might be correlated with both their
characteristics and the spatial distance between them.

Do you have any good ideas on this, or can you suggest a more reasonable
statistical way to address our queries and the problem above?

A:
The scores were calculated based on the MSA alone without
considering the spatial distance between residues.

You may want to plot the global distribution of scores, and look
for scores that are significantly larger than the rest but cannot be
explained by the distance on the primary sequence alone. Indirect
coupling between residues through other residues is also something to be
aware of. There have been a lot of new papers about co-evolutionary
analysis lately (e.g., from Rama Ranganathan’s and Debora Marks’s labs).
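
As a rough numerical counterpart to looking at the global distribution of scores, the following Python sketch flags residue pairs whose score is far above the overall mean and whose positions are far apart on the primary sequence. This is only one possible way to do it and is not part of any SCA package. The function name, the z-score cutoff, and the minimum sequence separation are all hypothetical choices that would need to be tuned, and the sketch does not address indirect coupling.

```python
import statistics


def flag_high_scores(pairs, z_cutoff=2.0, min_separation=5):
    """Flag (i, j, score) pairs that stand out from the global distribution.

    pairs: iterable of (residue_i, residue_j, score) tuples.
    Returns (i, j, score, z) for pairs with z >= z_cutoff whose
    positions differ by at least min_separation in the sequence.
    """
    pairs = list(pairs)
    scores = [score for _, _, score in pairs]
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)
    if spread == 0:
        return []  # all scores identical, nothing stands out
    flagged = []
    for i, j, score in pairs:
        z = (score - mean) / spread
        if z >= z_cutoff and abs(i - j) >= min_separation:
            flagged.append((i, j, score, z))
    return flagged
```

The function can be called on the full list of residue pairs from one protein, and the flagged pairs can then be compared against the residues of interest, such as the proton channel residues.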