Funseq2 Web Server

Q:
The Funseq2 Web Server goes down these days. Would it be available in the next few days?

A:
The Funseq2 web server is up and running now. It has some suspicious activity on the server recently and we are keeping on monitoring it.
If you are submitting your own query, please try to use the correct format, or it will shows ‘service unavailable’ service.

As an alternative, you can also download the whole genome annotations for both hg19 and hg38 from funseq3.gersteinlab.org, then use tabix to query.

interested in Funseq2

Q:
I found your paper regarding to Funseq2 and quite interested at how do you assign weight or calculated weight for each category. From weighted scoring schema, I could see different categories have different weight, but I am not sure how do you decide them .

A lot bit about me: I am interested pediatric genetic diseases and working on a birth cohort at Beijing Children Hospital as assistant professor.

A:
It’s an entropy-based scheme in the paper. It’s also described in
various FunSeq lectures (on lectures.gersteinlab.org).

The details of Funseq2 can be found in our paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0480-5. Simply, In Funseq2, we firstly to define a weighted score for each feature based on their distribution of features in random selected common variants. Discrete and continuous features use slightly different way (refer the formula 1 and 2 in the paper).
for a discrete feature, like ‘In sensitive regions’: [see image]

if there are 20 out of 2000000 random common variants are overlapping with sensitive regions, the Pd will be 20/2000000 = 0.0001 , then [see image]
will be used to get the weight for ‘In sensitive regions’

For the continuous feature, it uses:
[see image]

how to filter TF binding peaks for a plant ENCODE project

Q:
My lab is doing a few plant ENCODE projects, and we have done ChIP-Seq for ~100 maize TF and is analyzing the data. We followed most of your 2013 paper “architecture of the human regulatory network…”. Something confused me a bit is that we have on average ~10,000 peaks for each TF (from SPP and IDR 0.01). If I associate them to genes based on the distance to TSS, we have a huge TF-gene or TF-TF network. almost everyone is interacting. For example, the 100 TF to 100 TF network has 5k edges, I guess many of them could be false positive due to the weak ChIP-seq peaks. In your paper, you used TIP (in your Cheng et al 2011 NAR) to further filter out some interaction. We are trying that as well. But I don’t understand how did you get the input for TIP (500,542 promoter associated interaction, page 3 of your paper) from 2,948,387 promoter proximal peaks. Is there something I missed?

I also have another question about TF function in general. I am not sure whether we can claim the TF binding is "non-functional”, if the TF gene itself showed low co-expression correlation with the target gene. Or silencing the TF gene did not affect the target gene expression. Because the regulation could be complex with multiple TF targeting one genes. Those show co-expression/correlation might be target genes that the TF play major role. While TF can still contribute to the expression of target genes but it only contribute a small percentage with other TF playing a more dominant role. So can i say that those TF binding has no function?

A:
My understanding is: TIP assume each TF has a specific binding profile around TSS cross the genome in the human genome. TIP then estimate an empirical distribution of signal/peaks around TSS, convert it to weight and calculate a score for a peak. This assumption is based on the human genome. It may not be applied to other genomes directly if there is no clear pattern in around TSS. Before you use the tool, please double check the binding profile of each TF in plants. You can check and adapt the source code of TIP from Github: https://github.com/gersteinlab/TIP

For TF ChIP-seq, if the constructed regulatory network very dense, you may try to use a more stringent cutoff to reduce the false positives regulations.

As to whether gene co-expression reflect TF regulatory function, as you mentioned, you already aware that the mechanism is very complex. The co-expression definitely cannot sufficiently prove this regulatory function. But we still can get some reliable inferences based on the co-expression according to many previous studies. Also if you have multiple data sources, the result can be refined by advanced machine learning techniques. you can refer a new paper from our lab recently, we use elastic-net to refine the TF-gene network(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=30545857&dopt=Abstract).

Ask for the information of data funseq

Q:
I have downloaded the Whole genome scores(hg19) both Version 2.1.6 and 2.1.0 in the project website you provided http://funseq2.gersteinlab.org/downloads. There no score when I queried the codign region, but in the Whole Genome Query interface displays the results, such as chr1:11073808-11073808. I will be so appreciated if you could kindly tell me the reason of this problem. Thank you for your kind consideration of this request.

A:
The genome score you downloaded only includes non-coding variants. For coding regions, the score mostly reply on VAT annotation (another tool by our lab: vat.gersteinlab.org). Also whole genome score including both coding and non-coding will be a very large file, which over 50G after compressed. So we provide a query server: http://funseq3.gersteinlab.org/ . Thanks also for pointing out this issue on the download page, and we will update the webpage with clear and detailed file descriptions.

Questions about FunSeq2

Q:
Recently, I used FunSeq2 to identify non-coding regulatory variations in my
bladder cancer research. In promoter analysis, I discovered the original
file, gencode.v19.promoter.bed, which downloaded from
http://funseq2.gersteinlab.org/static/data_context2.1.2/gencode/, having the
promoter areas of ranking from 1 to 8979 bp, that was inconsistent with the
definition in your article (promoters defined as -2.5 kb from transcription
starting sites).

So, I checked the process script 3.gencode.process.pl, which was downloaded
from http://funseq2.gersteinlab.org/scripts_dev/ and suited for gencode.v16.
This script generated gencode.v16.promoter.bed
(http://funseq2.gersteinlab.org/static/data_context2.1.0/gencode/), and the
BED file’s 3rd column minus 2nd column all equals 2500 (2.5 Kb). After
comparing, I noticed that the latest gencode.v19.promoter.bed has the
additional 5th column, so I realized the 3.gencode.process.pl script had
been re-edited, but I did’t find the latest version on the internet.
Therefore, I wonder whether the latest 3.gencode.process.pl redefined the
meaning of promoter. If it does, can I get one copy of this script?

A:
The promoter file was derived from PCAWG promoter set, which may consider chromHMM segmentation information. Yao have updated this in the v2.1.2, then I keep it in the latest version. User can replace the right file using their own definition of promoters.

The promoter file included in Funseq 2.1.2 is based on PCAWG consortium’s definition, which considers ChromHMM segmentation information. So it will not be exactly 2kb or 2.5kb upstream of TSS.

Using LARVA and FunSeq2 for variant analysis

Q:
I have read your articles describing FunSeq2 and LARVA. I
find these two frameworks to be the most complete and well-adapted and so, I
am very interested in using them for my analysis. I have installed both
tools and started to run them following the instructions in the
documentation, but I am still encountering a few problems.

First, I have run the web-based version of FunSeq2 on several of my VCF
files and it seems to return the wanted result, with around 10,000+ entries
for each sample. However, when running the tool on the same files in command
line (with the -nc option), I obtain a different result, with no significant
entries returned.

The output returned is:

… Input format check : vcf …
… Format ok …
… Start filtering SNVs with minor allele frequency = 0 …
Warning: sample Sample1 – no SNVs left after filtering against natrual
variations …

I receive a similar result when attempting to run the program on multiple
files at once (both in command line and on the web).

I am also trying to use LARVA on these files; I have managed to install the
tool and I am currently testing it using the example-variants-1.txt file
from the regression suite as the variant file, but the program returns
“Segmentation Fault: 11” with no other error message.

Therefore, I would like to know if you have encountered these errors before
and if so, please let me know about any steps that I can try to correct
them.

A:
I’m glad to hear that you’ve decided to use LARVA for your analyses. I did some investigating with the LARVA codebase to try to figure out what might be causing the segmentation fault. One thing I found was that one of the helper scripts (bigWigAverageOverBed) is provided in its Linux (64-bit) version, so if you run LARVA on a different type of system (e.g. a Mac), the script won’t work. There are versions for other operating systems here (at the end of the page), but for simplicity we only provided the 64-bit Linux version. If that doesn’t fix the issue, could you please tell me everything you can about the environment in which you’re running LARVA (CPU, RAM, operating system, etc.) and the command line parameters you used.

Also, for help on Funseq2, I refer you to my colleague, Shake Lou (cc’ed).

One more thing I just thought of: how are all your input files formatted?

As to the issue about Funseq2, here is some suggestions:

1. The Funseq webserver version is obsolete, and we recommend you to use github version.
2. The latest 2.1.6 version has fixed a bug that might lead to some variant missed from the output.
3. Please use bed format as the output format. I will update vcf format output later.
4. You can also try funseq3.gersteinlab.org, which we have pre-calculated each position’s score for the hg19 genome. If you have a large number of variants to query, we have another good news. We are also testing a rich format whole genome Funseq output file and can let you retrieve the Funseq annotation simply from the command line. If you are interested in this file, we can give you the pre-release testing once it passed our internal QC very soon.