The cost of sequencing: cost of library preparation & the cost of running the sequencer

Q:
I’m quite interested in the cost of sequencing a genome. I read the
paper "The real cost of sequencing: higher than you think!", which was published
in Genome Biology. It’s a good paper, but the authors didn’t separate the
cost of library preparation from the cost of running the sequencer. So my
question is: from your experience, could you tell me the ratio of the cost
of library preparation to the cost of running the sequencer? It’s quite
important for me in designing an experiment.

A:
As indicated in the paper, the two numbers you mention are the cost of library preparation ($500) and the cost of running the sequencer ($6000), respectively. Note that current figures may be different.
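For the ratio asked about, the arithmetic on those two figures can be sketched as follows. The variable names are only illustrative, and the numbers are the ones quoted from the paper, so the result would shift with current prices.

```python
from fractions import Fraction

# Figures quoted from the paper; current prices may differ.
library_prep_usd = 500
sequencer_run_usd = 6000

# Ratio of library preparation cost to sequencer run cost, in lowest terms.
ratio = Fraction(library_prep_usd, sequencer_run_usd)

# Share of library preparation in the combined cost of these two items only.
prep_share = library_prep_usd / (library_prep_usd + sequencer_run_usd)

print(f"prep : run = {ratio.numerator} : {ratio.denominator}")
print(f"prep share of combined cost = {prep_share:.1%}")
```

With these inputs the ratio reduces to 1 : 12, i.e. library preparation is a little under 8% of the two costs combined.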

Data re “Architecture of the human regulatory network derived from ENCODE data”

Q:
I am very familiar with the ENCODE TF datasets, as I’ve been applying them to various problems in my PhD. I was interested in the expression analysis across human tissues for the ((miR –> TF) –> targets) FFL. There is a reference in the Supplementary file (section H) to the protein-coding expression atlas of Su et al. 2004 for the TF and protein-coding targets in this loop, but there doesn’t seem to be a reference for the corresponding expression data for miRNAs. I assume it would be Landgraf et al. 2007, ‘A mammalian microRNA expression atlas based on small RNA library sequencing’, since this allows matched tissues and samples with Su et al. However, it might be some other dataset. It would be helpful to be able to replicate/extend the FFL analysis using the correct data. Would you be able to forward this email to the relevant person(s) to confirm whether microRNA expression was taken from the Landgraf atlas? Many thanks for your help.

Slight correction: The FFL studied for expression pattern of
components is the other way round: ((TF –> miR) –> targets).

A:
The miRNA expression data are actually from
Lu et al., Nature 2005:
http://www.nature.com/nature/journal/v435/n7043/full/nature03702.html

If you go to
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
and look under the heading "MicroRNA Expression Profiles Classify Human Cancers",
you will see the files

Common_miRNA.gct
and
Common_Affy.zip
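
If it helps, here is a minimal sketch of reading a .gct file in Python. It assumes the file follows the standard GCT v1.2 layout (a "#1.2" line, a line with the row and column counts, then a tab-separated table whose first two columns are Name and Description); we have not checked that here for the specific files above. The function name read_gct and the tiny example table are made up for illustration.

```python
import csv
import io

def read_gct(handle):
    """Parse a GCT v1.2 table into (sample_names, {row_name: [values]})."""
    version = handle.readline().strip()
    if version != "#1.2":
        raise ValueError(f"unexpected GCT version line: {version!r}")
    # Second line: number of data rows, then number of sample columns.
    n_rows, n_cols = (int(x) for x in handle.readline().split()[:2])
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[2:]  # columns 0 and 1 are Name and Description
    table = {}
    rows_seen = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [float(v) for v in row[2:]]
        rows_seen += 1
    if rows_seen != n_rows or len(samples) != n_cols:
        raise ValueError("row/column counts do not match the GCT header")
    return samples, table

# Made-up example in GCT layout (not real values from the files above).
example = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "miR-a\tna\t1.0\t2.0\t3.0\n"
    "miR-b\tna\t4.5\t0.0\t7.25\n"
)
samples, table = read_gct(io.StringIO(example))
print(samples, table["miR-b"])
```

For a real file, the same function can be called on an open text-mode file handle instead of the in-memory example.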

PDB data for: Relating Three-Dimensional Structures to Protein Networks Provides Evolutionary Insights

Q:

Regarding your seminal paper "Relating Three-Dimensional Structures to
Protein Networks Provides Evolutionary Insights".
Amongst the supplementary data I could not find the PDB entries that were
used for each interaction in the SIN.
I would much appreciate it if you could send me these data.

A:
The information should be on the site
http://networks.gersteinlab.org/structint

Yip et al 2012 Genome Biology

Q:
I really enjoyed your paper and am looking forward to using
some of the genomic regions you published at http://metatracks.encodenets.gersteinlab.org/
in my research.

I had a couple of questions about them.

BARs–are those the regions predicted by the random forest, or are they
the training set (bins overlapped by a TF ChIP-seq peak)?

PRMs–I may have missed it, but what is the definition of a "promoter"?
I’m guessing it was -1000 to +200bp around a TSS.
(This is to clarify the sentence "bins at the TSSs of expressed genes"
at the bottom of page 17.)

Since the PRMs don’t all span the same genomic distance, I presume
that only bins predicted by the random forest classifier are included
in the files?

Finally, do you have plans to make (or have you already made) available
the software for creating region files of BARs, DRMs and DRM-targets
in other tissues?

A:
The BARs are the regions output by Random Forest, although they overlap substantially with the input training sets.

The positive examples for learning PRMs are the 100bp bins at exactly the TSSs of expressed genes. Random Forest then learned the feature patterns of these bins, and searched for similar bins in the whole genome.

After the predictions, adjacent bins all predicted as PRMs were merged to form regions. The files available on the supplementary web site contain these regions.
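
The merging step described above can be sketched roughly as follows. This is an illustration of merging adjacent 100bp bins, not the code used for the paper; the function name, the 0-based coordinate convention and the example bins are assumptions.

```python
def merge_adjacent_bins(bins, bin_size=100):
    """Merge predicted bins that touch each other into regions.

    bins: iterable of (chrom, start) pairs for bins predicted as positive,
          with 0-based start coordinates that are multiples of bin_size.
    Returns a sorted list of (chrom, start, end) with end exclusive.
    """
    regions = []
    for chrom, start in sorted(set(bins)):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and last[2] == start:
            # This bin starts where the previous region ends: extend it.
            regions[-1] = (chrom, last[1], start + bin_size)
        else:
            regions.append((chrom, start, start + bin_size))
    return regions

# Hypothetical predicted bins, for illustration only.
predicted = [("chr1", 1000), ("chr1", 1100), ("chr1", 1200),
             ("chr1", 5000), ("chr2", 1300)]
merged = merge_adjacent_bins(predicted)
print(merged)
```

Because runs of adjacent bins vary in length, the merged regions naturally span different genomic distances.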

Since the computer programs were written around the data available from ENCODE, they were not written in a way that can be easily adapted to other situations. We do not currently have a plan to make them available.

Comparing chromatin state analysis at pseudogene regions

Q:
I am very interested in comparing our chromatin state analysis with yours at the pseudogene regions. I found this file on your website: http://www.pseudogene.org/psidr/psiDR.v0.txt

Could you please let me know if this is the right file to compare against? I saw that you have H1-hESC there. If I understand correctly, you classified each pseudogene as being in either an active (1) or a silent (0) state.

A:
The chromatin state, promoter prediction and Pol2 binding for pseudogenes in H1-hESC are included in the psiDR file. Please let me know if you have any questions about that file.

read cleaning for FusionSeq

Q:
Is read cleaning (such as adapter removal, trimming of low-quality bases and polyA/T trimming) required for FusionSeq? When we cleaned the publicly available FASTQ file "SRR018259", FusionSeq could not find the correct fusions. On the other hand, when we ran FusionSeq without cleaning, it found the same fusions reported in the paper. I would like to know which FASTQ (raw or cleaned) is recommended as input to FusionSeq.

And if cleaning is recommended, which is the better way to remove adapters?
(1) trimming adapters from reads (read lengths then differ)
(2) removing the read itself (read lengths stay the same)

A:
Thank you for your interest in FusionSeq and for your note. We typically run
the programs on clean reads (i.e. without adapters), but we keep all
reads, regardless of their quality scores. The filtering of potential artifactual
fusion transcripts is performed in subsequent steps within FusionSeq.
Please also note that the filtering step may require some tuning
depending on the specific library preparation protocol. Hence, I would
recommend removing the reads that contain adapters (no trimming).
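
As a rough sketch of option (2), the following Python drops every read that contains an adapter and leaves the remaining reads untrimmed. It is not part of FusionSeq. The adapter string is a placeholder that must be replaced with the one used in your protocol, exact matching is a simplification, and the function names and the two example reads are made up.

```python
import io

# Placeholder adapter for illustration only; substitute the adapter
# actually used in your library preparation.
ADAPTER = "AGATCGGAAGAGC"

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual

def drop_adapter_reads(records, adapter=ADAPTER):
    """Keep only reads with no exact adapter match; kept reads are not trimmed."""
    return [rec for rec in records if adapter not in rec[1]]

# Two made-up reads: r1 is clean, r2 contains the placeholder adapter.
example = (
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGT" + ADAPTER + "AC\n+\n" + "I" * 19 + "\n"
)
records = list(read_fastq(io.StringIO(example)))
kept = drop_adapter_reads(records)
print([rec[0] for rec in kept])
```

For paired-end data you would probably also want to drop the mate of any removed read so that the two files stay in sync; that is not handled in this sketch.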

Question regarding paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
I am reading your paper "Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors" and I have some questions.
In Figure 1, what do the colors mean?
I also couldn’t understand the plots in Figure 4. What are the black dots, the error bars and the black line?
I would be grateful if you could answer my questions.

A:
In Figure 1, different colors are used for different types of regions. For each type of region, one color is used as the background color and another is used to show the signal level.

Figure 4 shows standard box-and-whisker plots (http://en.wikipedia.org/wiki/Box_plot). The dots are the means of the distributions. The upper and lower lines are the non-outlier maximum and minimum values, respectively. The black lines in the middle are the medians.
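
To make these elements concrete, here is a small sketch that computes them for a made-up sample. It assumes the common convention that points beyond 1.5 times the interquartile range from the box are outliers; the exact rule and quartile method used for the figure may differ. The function name and the data are illustrative only.

```python
import statistics

def box_plot_summary(values, whisker=1.5):
    """Return the quantities drawn in a box-and-whisker plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme values that are not outliers.
    inliers = [v for v in data if low_fence <= v <= high_fence]
    return {
        "mean": statistics.mean(data),    # the dot
        "median": median,                 # the line in the middle of the box
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inliers),      # non-outlier minimum
        "whisker_high": max(inliers),     # non-outlier maximum
    }

# Made-up data with one obvious outlier (100).
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

In this example the outlier pulls the mean (14.5) well above the median (5.5), which is why the dot and the middle line need not coincide.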

Question regarding paper “Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library”

Q:

I read your excellent BreakSeq paper "Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library", and now I have some whole-genome sequencing data to analyze. The breakpoint library you provide (http://sv.gersteinlab.org/breakseq/) is based on human genome NCBI build 36, but I now use NCBI build 37. Should I lift over the coordinates to NCBI build 37, or realign the junction sequences to NCBI build 37 myself? Or is there a pre-compiled breakpoint junction library for NCBI build 37? By the way, do you have any suggestions about adding the SVs identified in the 1000 Genomes Project to the breakpoint junction library?

A:
There are two sets of SV breakpoints that should be relevant to you:

(1) The published 1000 Genomes pilot data in Mills et al Nature 2010: http://www.nature.com/nature/journal/v470/n7332/extref/nature09708-s9.xls
(2) The 1000 Genomes phase I data that is going to be published soon: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/

The published pilot data is on NCBI build 36. Using liftover to convert the genomic coordinates to NCBI build 37 should suffice. You might want to double check whether the SV size and the junction sequences are consistent before and after the liftover.
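
One possible way to do that double check is sketched below. It assumes you have already run the coordinate conversion and re-extracted the junction sequences from the build 37 reference yourself. The record layout, the function name and the example SVs are all hypothetical.

```python
def check_liftover(before, after, tolerance=0):
    """Compare SV records before and after coordinate conversion.

    before, after: dict mapping sv_id -> (chrom, start, end, junction_seq).
    Returns a list of (sv_id, reason) for records that look inconsistent.
    """
    problems = []
    for sv_id, (chrom, start, end, junction) in before.items():
        if sv_id not in after:
            problems.append((sv_id, "not lifted"))
            continue
        new_chrom, new_start, new_end, new_junction = after[sv_id]
        if new_chrom != chrom:
            problems.append((sv_id, "chromosome changed"))
        if abs((new_end - new_start) - (end - start)) > tolerance:
            problems.append((sv_id, "size changed"))
        if new_junction != junction:
            problems.append((sv_id, "junction sequence changed"))
    return problems

# Made-up records for illustration only.
build36 = {
    "sv1": ("chr1", 1000, 1500, "ACGTTGCA"),
    "sv2": ("chr1", 9000, 9800, "GGCCTTAA"),
    "sv3": ("chr2", 400, 900, "TTTTCCCC"),
}
build37 = {
    "sv1": ("chr1", 1200, 1700, "ACGTTGCA"),  # shifted, same size and junction
    "sv2": ("chr1", 9300, 9900, "GGCCTTAA"),  # size shrank from 800 to 600
}
problems = check_liftover(build36, build37)
print(problems)
```

Records flagged this way are probably better realigned or dropped than added to the library as they are.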

The phase I data is on NCBI build 37. You may simply take the junction sequences at the breakpoints to add to the library.
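
As a simplified illustration of taking a junction sequence at the breakpoints, the sketch below handles only a deletion and joins the flanks on either side of it. The 0-based half-open coordinates, the default flank length, the function name and the toy reference are assumptions, and this is not the BreakSeq library format.

```python
def deletion_junction(reference, start, end, flank=30):
    """Sequence spanning a deletion breakpoint on the non-reference allele.

    reference: chromosome sequence as a string.
    start, end: 0-based half-open interval [start, end) of the deleted bases.
    flank: number of bases to take on each side of the breakpoint.
    """
    if not 0 <= start < end <= len(reference):
        raise ValueError("deletion coordinates fall outside the reference")
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + right

# Toy reference: 5 A, 5 C, 5 G, 5 T. Deleting the C and G blocks joins A to T.
toy_reference = "AAAAACCCCCGGGGGTTTTT"
junction = deletion_junction(toy_reference, 5, 15, flank=3)
print(junction)
```

Other SV types, such as insertions and inversions, would need their own handling.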

Data associated w/paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:

I have read with great interest your
recent paper "Classification of human genomic regions based on
experimentally determined binding sites of more than 100
transcription-related factors".

I am interested in one specific genomic region that I am studying. Based on
the available data in the UCSC browser I believe that this region is an
Enhancer, since it contains the characteristic epigenetic marks, HS-DNA
sites, and binding of a few transcription factors.

I have tried to find the predictions made in your paper for that specific
region. However, I have not found a link to the predictions (maybe I
have just missed it). I think my region should be a BAR and a DRM. Since you
have screened ChIP-seq data from a large number of transcriptional
regulatory factors, I would be very interested to know which specific
transcription factors bind to the region. Additionally, it would be great to
know whether this region belongs to the set of DRMs for which you identified a
potential target gene.

Are the results from your study available for download? If not, would it be
possible to get the results for a specific region?

A:
See http://encodenets.gersteinlab.org/metatracks

Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

Q:
I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. However, looking at
"Database" - "Eukaryote Pseudogenes", I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count). So I’m not sure which one I should use. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

A:
The link now points to build 68. I would
suggest that you use this file, which contains the latest results based on
release 68 of the Ensembl genomes. The number of pseudogenes changes
because the annotation of protein-coding genes differs between
genome releases.