PDB data for: Relating Three-Dimensional Structures to Protein Networks Provides Evolutionary Insights

Q:

Regarding your seminal paper "Relating Three-Dimensional Structures to
Protein Networks Provides Evolutionary Insights".
Amongst the supplementary data I could not find the PDB entries that were
used for each interaction in the SIN.
I would much appreciate it if you could send me this data.

A:
The information should be on the site:
http://networks.gersteinlab.org/structint

Yip et al 2012 Genome Biology

Q:
I really enjoyed your paper and am looking forward to using
some of the genomic regions you published at http://metatracks.encodenets.gersteinlab.org/
in my research.

I had a couple of questions about them.

BARs: are those the regions predicted by the random forest, or are they
the training set (bins overlapped by a TF ChIP-seq peak)?

PRMs: I may have missed it, but what is the definition of a "promoter"?
I’m guessing it was -1000 to +200bp around a TSS.
(This is to clarify the sentence "bins at the TSSs of expressed genes"
at the bottom of page 17.)

Since the PRMs don’t all span the same genomic distance, I presume
that only bins predicted by the random forest classifier are included
in the files?

Finally, do you have plans to make (or have already made) available
the software for creating region files of BARs, DRMs and DRM-targets
in other tissues?

A:
The BARs are the output regions of Random Forest. They do greatly overlap with the input training sets though.

The positive examples for learning PRMs are the 100bp bins at exactly the TSSs of expressed genes. Random Forest then learned the feature patterns of these bins, and searched for similar bins in the whole genome.

After the predictions, adjacent bins all predicted as PRMs were merged to form regions. The files available on the supplementary web site contain these regions.
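
The merging step described above can be illustrated with a minimal sketch. This is not the program used in the paper; the function name, the (chromosome, start) input layout, and the example coordinates are assumptions made for illustration. Only the 100bp bin size and the rule "adjacent predicted bins are merged" come from the answer.

```python
def merge_adjacent_bins(bins, bin_size=100):
    """Merge adjacent predicted bins into regions.

    bins: iterable of (chrom, start) pairs, where start is the 0-based
          start of a bin of length bin_size (hypothetical input layout).
    Returns a list of (chrom, start, end) half-open regions.
    """
    regions = []
    for chrom, start in sorted(set(bins)):
        last = regions[-1] if regions else None
        if last and last[0] == chrom and last[2] == start:
            # This bin starts exactly where the previous region ends: extend it.
            regions[-1] = (chrom, last[1], start + bin_size)
        else:
            # Gap or new chromosome: start a new region.
            regions.append((chrom, start, start + bin_size))
    return regions


# Made-up bins predicted as PRMs, for illustration only.
predicted = [("chr1", 1000), ("chr1", 1100), ("chr1", 1200),
             ("chr1", 1500), ("chr2", 1200)]
regions = merge_adjacent_bins(predicted)
for region in regions:
    print(*region, sep="\t")
```

Because merged regions contain different numbers of bins, the regions in the files do not all span the same genomic distance.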

Since the computer programs were written based on the available data from ENCODE, they were not written in a way that can be easily adapted to other situations. We do not currently have a plan to make them available.

Comparing chromatin state analysis at pseudogene regions

Q:
I am very interested in comparing our chromatin state analysis at the pseudogene regions. I found this file at your website: http://www.pseudogene.org/psidr/psiDR.v0.txt

Could you please let me know if this is the right place to compare? I saw you do have H1-hESC there. If I understand correctly, you classified each pseudogene as being in either an active (1) or a silent (0) state.

A:
The chromatin state, promoter prediction and Pol2 binding for pseudogenes in H1-hESC are included in the psiDR file. Please let me know if you have any questions about that file.

Read cleaning for FusionSeq

Q:
Is read cleaning (such as adapter removal, low-quality base trimming and polyA/T trimming) required for FusionSeq? When we cleaned the publicly available FASTQ file "SRR018259", FusionSeq could not find the correct fusions. On the other hand, when we ran FusionSeq without cleaning, it found the same fusions reported in the paper. I would like to know which FASTQ (raw or cleaned) is recommended as input to FusionSeq.

And if cleaning is recommended, which is the better way to remove adapters?
(1) trimming adapters from reads (read lengths differ)
(2) removing the read itself (read lengths stay the same)

A:
Thank you for your interest in FusionSeq and for your note. We typically run
the programs on clean reads (i.e. without adapters), but we keep all
reads, regardless of the score. The filtering of potential artifactual
fusion transcripts is performed in subsequent steps within FusionSeq.
Please also note that the filtering step may require some tuning
depending on the specific library preparation protocol. Hence, I would
recommend removing the reads that have adapters (no trimming).
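
As a rough illustration of option (2), the sketch below drops every read that contains an adapter and leaves all other reads untouched, with no trimming and no quality filtering. It is not part of FusionSeq. The function names, the adapter string, and the three example reads are placeholders, and exact substring matching is a simplification (a real cleaner would also handle partial adapters and mismatches).

```python
def parse_fastq(lines):
    """Yield (header, sequence, plus, quality) records from FASTQ lines."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def drop_adapter_reads(records, adapter):
    """Keep only reads whose sequence does not contain the adapter.

    Reads are removed whole (no trimming), and quality scores are ignored,
    so low-scoring reads without adapters are kept.
    """
    return [rec for rec in records if adapter not in rec[1]]


# Placeholder adapter; use the one from your own library preparation.
ADAPTER = "AGATCGGAAGAGC"

# Made-up reads for illustration only.
example = [
    "@read1", "ACGTACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIIIIIII",
    "@read2", "ACGTAGATCGGAAGAGCACG", "+", "IIIIIIIIII##########",
    "@read3", "TTTTACGTACGTACGTAAAA", "+", "####################",
]
kept = drop_adapter_reads(parse_fastq(example), ADAPTER)
kept_names = [rec[0] for rec in kept]
print(kept_names)
```

All retained reads keep their original length, and read3 is kept despite its low quality scores, since artifact filtering happens later within FusionSeq.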

Question regarding paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
I am reading your paper "Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors" and I have some questions.
In figure 1, what do the colors mean?
I also couldn’t understand the plots in figure 4. What are the black dots, the error bars and the black line?
I would be grateful if you could answer my questions.

A:
In figure 1, different colors are used for different types of regions. For each type of region, one color is used as the background color and another is used to show the signal level.

Figure 4 shows standard box-and-whisker plots (http://en.wikipedia.org/wiki/Box_plot). The dots are the means of the distributions. The upper and lower lines are the non-outlier maximum and minimum values, respectively. The black lines in the middle are the medians.
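
To make the plotted quantities concrete, here is a small sketch that computes them for a made-up sample. It assumes the common convention that outliers are values more than 1.5 times the interquartile range beyond the quartiles; the outlier rule actually used for the figure is not stated here, so treat that threshold, the function name, and the quartile method as assumptions.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return the quantities drawn in a box-and-whisker plot.

    whisker=1.5 is the usual outlier convention (an assumption here).
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "mean": statistics.mean(data),   # the dot
        "median": median,                # the line in the middle of the box
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),      # non-outlier minimum
        "whisker_high": max(inside),     # non-outlier maximum
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Made-up sample with one extreme value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

In this sample the extreme value pulls the mean (the dot) above the median (the middle line), while the upper whisker stops at the largest non-outlier value.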

Question regarding paper “Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library”

Q:

I read your excellent BreakSeq paper "Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library", and now I have some whole genome sequencing data to analyze. The breakpoint library you provide (http://sv.gersteinlab.org/breakseq/) is based on human genome NCBI build 36, but I now use NCBI build 37. So should I lift over the coordinates to NCBI build 37, or realign the junction sequences to NCBI build 37 myself first? Or is there a pre-compiled breakpoint junction library for NCBI build 37? By the way, do you have any suggestions about adding the SVs identified in the 1000 Genomes Project to the breakpoint junction library?

A:
There are two sets of SV breakpoints that should be relevant to you:

The published 1000 Genomes pilot data in Mills et al Nature 2010: http://www.nature.com/nature/journal/v470/n7332/extref/nature09708-s9.xls
The 1000 Genomes phase I data that is going to be published soon: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/

The published pilot data is on NCBI build 36. Using liftover to convert the genomic coordinates to NCBI build 37 should suffice. You might want to double check whether the SV size and the junction sequences are consistent before and after the liftover.
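
The double check suggested above could look something like the following sketch. It does not run liftover itself; it only compares an SV before and after conversion. The function name, the (chromosome, start, end) layout, and the example coordinates and sequences are all made up for illustration.

```python
def check_liftover(old, new, old_junction=None, new_junction=None):
    """Compare one SV before and after liftover.

    old, new: (chrom, start, end) on build 36 and build 37; new is None
              if the SV could not be converted (hypothetical layout).
    old_junction, new_junction: optional junction sequences extracted
              from each build around the breakpoint.
    Returns a list of problems; an empty list means the SV looks consistent.
    """
    if new is None:
        return ["unmapped"]
    problems = []
    if old[0] != new[0]:
        problems.append("chromosome changed")
    if old[2] - old[1] != new[2] - new[1]:
        problems.append("size changed")
    if old_junction is not None and new_junction is not None:
        if old_junction.upper() != new_junction.upper():
            problems.append("junction sequence changed")
    return problems


# Made-up coordinates and sequences, for illustration only.
ok = check_liftover(("chr1", 1000, 1500), ("chr1", 11000, 11500),
                    "ACGTTGCA", "acgttgca")
resized = check_liftover(("chr2", 2000, 2600), ("chr2", 12000, 12590))
lost = check_liftover(("chr3", 3000, 3400), None)
print(ok, resized, lost)
```

SVs that come back with a non-empty problem list are the ones worth inspecting by hand, or realigning, before adding them to the library.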

The phase I data is on NCBI build 37. You may simply take the junction sequences at the breakpoints to add to the library.

Data associated w/paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:

I have read with great interest your
recent paper "Classification of human genomic regions based on
experimentally determined binding sites of more than 100
transcription-related factors".

I am interested in one specific genomic region that I am studying. Based on
the available data in the UCSC browser I believe that this region is an
Enhancer, since it contains the characteristic epigenetic marks, HS-DNA
sites, and binding of a few transcription factors.

I have tried to find the predictions made in your paper for that specific
region. However, I have not found a link to the predictions (maybe I
have just missed it). I think my region should be a BAR and a DRM. Since you
have screened ChIP-seq data from a large number of transcriptional
regulatory factors, I would be highly interested to know which specific
transcription factors bind to the region. Additionally, it would be great to
know if this region belongs to the set of DRMs for which you identified a
potential target gene.

Are the results from your study available for download? If not, would it be
possible to get the results for a specific region?

A:
See http://encodenets.gersteinlab.org/metatracks

Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

Q:
I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. Looking however at
"Database" -"Eukaryote Pseudogenes" I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count). So I’m not sure which one I should best consider. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

A:
The link now points to build 68. I would
suggest that you use this file, which contains the latest results based on
release 68 of the Ensembl genomes. The number of pseudogenes changes because
the annotation of protein-coding genes differs between
genome releases.

Data associated with paper “The GENCODE pseudogene resource”

Q:
Your work on the ENCODE project has helped to
produce an incredible set of data!

I had a question about your pseudogene article, "The GENCODE
pseudogene resource." You note that at least 9% of them are
transcribed. Do you have a list somewhere? I couldn’t find a
supplementary file that might contain such a list. I realize it would
be quite long, but I assume it must exist somewhere. If not, do you
happen to know if the GULO pseudogene is one of the transcribed
pseudogenes?

A:
The list of transcribed pseudogenes should be available from the resource associated with the paper.

The data associated with the paper is located at http://pseudogene.org/psidr/

Data associated with paper “Redefining Nodes and Edges: Relating 3D Structures to Protein Networks Provides Insight into their Evolution”

Q:

I’m hoping to analyse the data from your 2006 "Redefining Nodes and Edges: Relating 3D Structures to Protein Networks Provides Insight into their Evolution" paper. Do you have the full dataset including the pdb ids/chain ids relating to each interaction in the network?

A:
Have you seen the associated paper website: http://papers.gersteinlab.org/papers/structint