A question regarding VAT (twoBitToFa: command not found)

Q1:
The psiDR is a valuable resource for my research.

In the corresponding publication (The GENCODE pseudogene resource, Pei et
al., Genome Biology, 2012), you describe in "Material & Methods:
Identification of the parents of pseudogenes and sequence similarity to
the parent" that the exons of parent and pseudogenes were used to align
them via ClustalW2.

Is it possible to provide me the alignments? That would be great.

A1:
Edit @PATH

Q2:
Can we run VAT (snpMapper) on a simple mutation list (as shown below)?
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO
14 30525048 . . . . . .
19 28364092 . . . . . .
2 144165461 . . . . . .
2 144224307 . . . . . .
4 98318053 . . . . . .

No rs numbers, no sample IDs and no group:sample files are present in this case, as I just wanted to run VAT for a list of somatic mutations to see if they are annotated to a coding or non-coding region.

There will be no duplicates in the list either, as I have already accounted for their frequency beforehand.

A2:
I will update the VAT documentation. Thanks for pointing it out to us. Regarding input that is not a VCF file,
just make a dummy VCF file; that is, add the extra columns, tab-delimited. Please e-mail me if you have more questions. However, bear with me, as my computer just died and I am trying to get it fixed.
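
As a rough illustration of the dummy-file approach, here is a minimal Python sketch that turns a list of (chromosome, position) pairs into tab-delimited VCF lines. The function name dummy_vcf_lines is hypothetical and not part of VAT. The "." placeholders simply mirror the example in the question; whether VAT needs real alleles for a given analysis is not stated here, so if the alleles are known they would belong in the REF and ALT columns.

```python
# Hypothetical helper (not part of VAT): pad a simple mutation list
# out to the eight standard VCF columns, separated by tabs.
VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def dummy_vcf_lines(positions):
    """Return VCF lines for an iterable of (chrom, pos) pairs.

    Every column other than CHROM and POS is filled with the '.'
    placeholder, which is an assumption taken from the question's example.
    """
    lines = ["##fileformat=VCFv4.0", "\t".join(VCF_COLUMNS)]
    for chrom, pos in positions:
        fields = [str(chrom), str(int(pos))] + ["."] * 6
        lines.append("\t".join(fields))
    return lines


if __name__ == "__main__":
    mutations = [("14", 30525048), ("19", 28364092), ("2", 144165461)]
    print("\n".join(dummy_vcf_lines(mutations)))
```

Writing the returned lines to a file, one per line, gives a file with the same layout as the example in the question, but with real tab characters between columns.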

Data stored in molmovdb

Q:
I am interested in using some of the data stored in molmovdb, but I am unable to access the data. For example, from the page:

http://molmovdb.org/cgi-bin/sets.cgi

Once I click on automated set 2 the page returns “Sorry, no values in this range ( to ). Please try again!”

This happens on many (if not all) of the links. Is this data still available, or has the site been effectively decommissioned?

A:
This is selected in lib/select.pm. See the explanation of %selectors at the top of that file.

For example, for ‘auto’, the non-redundant automatic set, the selector uses the getNR function, which ultimately reads this file to get the morph IDs:

/usr/local/server/htdocs/tmp/auto.txt

You can similarly obtain the automatic set, which comes not from a text file but from an SQL query; you’ll see what I mean.

Qs about breakseq tool

Q:
I have just installed the Breakseq tool developed by your lab to analyse structural variants in a pancreatic cancer genome.

All the required modules have been downloaded; however, I could not find documentation on how to run the tool.

Is there a manual or an example showing how to run the tool?

Or could I contact someone in the lab who is familiar with Breakseq?

A:
Everything we have is at http://sv.gersteinlab.org/breakseq

Info request: "Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs"

Q:

Recently I read one of your articles: "Annotation Transfer Between Genomes: Protein–Protein Interologs and Protein–DNA Regulogs".

If possible, could you send me the academic version of this program for Linux? Or is there a location where I can download the implementation?

A:
See the web link via http://papers.gersteinlab.org/papers/interolog

Question about pseudogene.org data

Q:
I have a question about the pseudogene.org database.
I’m analyzing the human pseudogene database and noticed that many "processed" pseudogenes (>70%) don’t have a polyA tail.
This seems to be the opposite of what textbooks say. Is that true?
What are the criteria for a "processed pseudogene" in pseudogene.org?

I also have another question.
I ran BLAT searches using several pseudogene sequences from each polyA class ("0", "1", "2" or "3").
But most pseudogenes in polyA classes 1, 2 and 3 don’t have a convincing polyA tail when compared against the following criteria.

Polya: "0" or "1" or "2" or "3".
"0": no polyA tail (> 30 A in a 50 bp window) detected for the pseudogene
"1": has a polyA tail and also a polyadenylation signal within 50 bp of the beginning of the tail
"2": has a polyA tail and a polyadenylation signal within 50-100 bp of the beginning of the tail
"3": has a polyA tail but no polyadenylation signal detected.

Do the coordinates in the pseudogene.org data depend on the human genome assembly GRCh37/hg19?

A:
Pseudogenes are identified primarily by homology matching of protein sequence against the human genome. However, the pipeline that we use incorporates poly A analysis. Our group published a paper a few years ago where we showed that ~ 50% of ribosomal protein pseudogenes do not have a detectable poly A signal. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187539/ . We believe that this is due to decay in genome sequence and nucleotide substitutions.

For detecting poly A signals and for classification, the following criteria are used, according to the paper linked above:

We searched a 1000-bp region that was 3′ to the pseudogene homology segment, with a sliding window of 50 nucleotides for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3′-untranslated regions (3′-UTRs); 90% of 3′-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.

These criteria might not be very stringent.

And yes, the pseudogene coordinates depend on the human genome assembly from which they are derived, so the genome version number is important.
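
To make the sliding-window search quoted above more concrete, here is a minimal Python sketch. It is not the pipeline's actual code. The function names are hypothetical, only the AATAAA signal is checked (the paper also mentions other signals), and two things are assumptions: the start of the most adenine-rich window is treated as the beginning of the tail, and the distance to the signal is measured from the first base of the signal to that point.

```python
def most_a_rich_window(seq, window=50):
    """Return (start, a_count) of the most adenine-rich window, or None if seq is too short."""
    if len(seq) < window:
        return None
    count = seq[:window].count("A")
    best_start, best_count = 0, count
    for start in range(1, len(seq) - window + 1):
        # Slide by one base: add the base entering the window, drop the one leaving.
        count += (seq[start + window - 1] == "A") - (seq[start - 1] == "A")
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


def classify_polya(downstream, region=1000, window=50, min_a=30, signal="AATAAA"):
    """Classify the sequence 3' of a pseudogene into polyA class 0, 1, 2 or 3."""
    seq = downstream[:region].upper()
    hit = most_a_rich_window(seq, window)
    if hit is None or hit[1] <= min_a:
        return 0  # no window with more than min_a adenines
    tail_start = hit[0]  # assumption: window start approximates the tail start
    # Closest signal lying entirely upstream of the candidate tail.
    signal_pos = seq.rfind(signal, 0, tail_start)
    if signal_pos == -1:
        return 3  # tail but no signal
    distance = tail_start - signal_pos  # assumption: measured from signal start
    if distance <= 50:
        return 1
    if distance <= 100:
        return 2
    return 3  # signal present but farther than 100 bp from the tail
```

For example, classify_polya("C" * 200 + "AATAAA" + "C" * 20 + "A" * 60) falls in class 1 under these assumptions, because the signal starts 26 bp upstream of the adenine run.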

Yeast Network Hierarchy

Q:
I am very interested in your work on network rewiring. I have been working on experimental validation of network rewiring approaches investigating how this can be used to reprogram regulatory networks to improve heterologous protein production in Yeast. I am now in the process of analysing transcriptional rewiring phenotypes I have identified in a combinatorial library based screen. I have noticed some very interesting enrichment criteria in the groups of rewired promoters and open reading frames with regards to network structure.

I was hoping to look at how these rewired components are natively arranged with regard to their network hierarchy. I would like to use the hierarchical network model you proposed in your paper (http://www.ncbi.nlm.nih.gov/pubmed/21045205?dopt=Abstract), but I have been having trouble reconstructing it from the PDF supplemental data. I am really keen to use your model to study my experimental data further; if you have any suggestions on how I could best go about this, I would be most grateful.

A:
You might find the following links useful:

http://www.gersteinlab.org/proj/nethierarchy
http://papers.gersteinlab.org/papers/nethierarchy/
website with an earlier version of the yeast hierarchy.

http://papers.gersteinlab.org/papers/mirnet
http://papers.gersteinlab.org/papers/wormawg
information on worm & fly hierarchies

http://papers.gersteinlab.org/papers/encodenets
Human hierarchy

http://papers.gersteinlab.org/papers/callgraph
Bacterial hierarchy

I would also direct you to the wiki page:

http://info.gersteinlab.org/Hierarchy

Under the heading "Phenotypic Effects of Network Rewiring in Transcriptional Regulatory Hierarchies", this page lists, in a user-friendly format, all the data you would need to reproduce the hierarchies, with all the datasets well described and annotated.

This page has the initial regulatory network of E. coli and Yeast and it also provides you with the original breadth-first search hierarchies. In addition, it lists all the changes in the hierarchy upon deletion of each gene. There is an extensive description of what each column in each file means.

Further, to help you better understand the algorithm/program we used, I am also attaching a lightweight Perl script that generates the hierarchy from a given network (BFS.pl); it is well annotated with an explanation of each step. I am also attaching another Perl script that I used to list the changes in the hierarchy upon deletion of each gene (count_changes_modified_hierarchy.pl). Paths to the input files will be broken, but it should be enough for you to get a flavor of how we quantified changes in the modified hierarchies.
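
As a rough illustration of the breadth-first idea only (this is not the attached BFS.pl, whose details are not reproduced here), the Python sketch below makes several assumptions: the network is a list of (regulator, target) pairs, nodes that regulate nothing form the bottom level 0, each other node sits one level above the nearest node it regulates, and autoregulation is ignored. Both function names are hypothetical.

```python
from collections import deque


def bfs_levels(edges):
    """Assign a level to each node by breadth-first search upward from the bottom.

    Assumptions of this sketch: edges are (regulator, target) pairs, nodes with
    no targets are level 0, and self-loops are ignored. Nodes with no path down
    to a bottom node (for example, an isolated cycle) receive no level.
    """
    targets_of, regulators_of = {}, {}
    for reg, tgt in edges:
        for node in (reg, tgt):
            targets_of.setdefault(node, set())
            regulators_of.setdefault(node, set())
        if reg != tgt:
            targets_of[reg].add(tgt)
            regulators_of[tgt].add(reg)
    levels = {node: 0 for node, tgts in targets_of.items() if not tgts}
    queue = deque(levels)
    while queue:
        node = queue.popleft()
        for reg in regulators_of[node]:
            if reg not in levels:
                levels[reg] = levels[node] + 1
                queue.append(reg)
    return levels


def level_changes_after_deletion(edges, gene):
    """Return {node: (old_level, new_level)} for nodes whose level changes
    when every edge touching `gene` is removed."""
    before = bfs_levels(edges)
    after = bfs_levels([(r, t) for r, t in edges if gene not in (r, t)])
    return {n: (before.get(n), after[n]) for n in after if before.get(n) != after[n]}
```

With the toy network [("A", "B"), ("B", "C"), ("A", "C"), ("D", "B")], C is at the bottom, A and B sit one level above it, and D sits two levels above. Deleting C moves B to the bottom and D down one level, while A keeps its level.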

Questions re tool for coevolution analysis

Q:
We are interested in
adding a link to your tool on the website. I have been playing around with
the tool but have been having some difficulty. I loaded the example, ran the
analysis and downloaded the results with no problem. Then, I tried to use
the data from the example to re-run the analysis by uploading the PF01036
fasta file (as an MSA) and listing BACR_HALSA as the reference sequence. I
did not load the tree data or the structure data. I submitted the request
and received the following error:

The coevolution analysis task that you submitted at 2013-09-11 11:35:07.0
could not be completed.

Error message:Not enough sequences.

In order to justify the addition of the link to the website, we need to make
sure that the web interface is simple and easy to use. Can you help me
understand what the problem might be (clearly I have enough sequences since
the example runs without any issues)? Also, can you explain to me the link
between the 1C3W reference structure and the sequence data?

Any help you can give would be greatly appreciated.

A:
The error message is due to an internal check of the system. Since
coevolution analysis requires a good number of sequences to give
reliable statistics, there is a minimum threshold of the number of
sequences after the filtering step. When the example is loaded, I think
some of the filtering settings are made so that it can pass the minimum
threshold. When one manually uploads the MSA, the default settings could
be different.

You may try changing the setting for the minimum number of
sequences in the "Advanced options" section (which is by default hidden)
and rerun. Let me know if you still get any error message.