Inquiry about STRESS

Q:
I am writing this e-mail to inquire about STRESS software.

We have learned from your paper (Structure 2016,24:826-837)
that STRESS software can be used for identifying allosteric pockets.
We are interested in using the software for our drug discovery research.
We will perform evaluation of the software for a start.

Will you allow us to use STRESS software for the purpose of our
commercial drug discovery project free of charge?

As this is an urgent project, we would highly appreciate if you could
reply soon.

A:
see license at https://sites.gersteinlab.org/permissions/

Pseudogene Prediction Pipeline Question

Q:
I am unsure if you are the correct person to contact you for my question, so
if that is the case, could you direct me to someone who might? I am
currently a master student at Ghent University doing my master’s thesis and
I trying to use different pipelines to predict pseudogenes.

So far, I have been able to succesfully use "Shiu’s Pipeline" and I an
interested in using the pipeline I found on "http://pseudogene.org/main.php"
too. but whilst trying to using it, I stumbled on some problems, which I was
hoping you (or someone from your lab) could help me with solving. I’ll
briefly try to explain what I’m trying to do and what the problem is.

In my research I’m trying to find/study pseudogenes from a certain
Whole-Genome Duplication in Populus trichocarpa. Step 1 and 2 of the
pipeline (as in the README file) have proven to be successful (Note: the
README file apparently searches for a ‘splitXXXXOut’ pattern, which my file
names don’t contain, as I didn’t split the proteome file into chunks).

I believe I have a problem with the pipeline in step 3, the masking step.
The pipeline has apparently 3 options (no masking, intron masking and gene
masking) for masking, but when looking into the script file
(extractKPExonLocations.py), it seems that only option 3 is available (there
seems to be no code for other options), while I would like to use option 1,
no masking.
As I couldn’t find a way to perform option 1, no masking, I tried masking
the genome anyway, but 2 problems arose:
1) the data I have been using so far comes from Phytozome.org and not
Ensembl plants, which doesn’t have the necessary files for step 2
2) I can’t try to recreate the files necessary for step 3, masking, as
Ensembl (no longer?) provides "translation_stable_id.txt" files. Were they
perhaps replaced by the "translation_attrib.txt" files?

Because of these issues I encountered, and didn’t want to mask the genome in
the first place (I can filter the results afterwards anyway), I tried
skipping step 3 and continue with the pipeline, to see if it would work
anyway. However, as the following steps require "Location of maskt files
(see Step 3) above)" and "The columns in the mask file that provide start
and stop data (0-based)", my results have been fruitless.

Now that you (hopefully) understand what the problems are that I
encountered, I was hoping if you, or someone else, could help me. In
particular I would like to know if it is possible to run the pipeline
without masking. As I said previously, I didn’t see the possibility to do
so, but I am still relatively new to bioinformatics, so I might be mistaken.
In addition to this, null exon data sets are required (which are empty
files?). In case this isn’t possible, would it be possible to tell me what
kind of information is stored in "translation_attrib.txt, so I coudl try to
recreate these files with Phytozome.org data?

I hope that you or someone else can help me with this problem, or point me
in the right direction. I know this was a long read and hopefully I have
explained myself well enough.

A:
Could you please send me all your commands line by line and the errors you encountered so we can help you.
In short replying to you’re queries:
— If you have you want to use your own custom data, not from Ensembl, you will need to format the files and create all the input files required for the pipeline to run. For this download our example file and look at the input files presented in ppipe_input folder.
— You do not need to use a masked genome, you can just pinpoint the pipeline to use the unmasked version, it works exactly the same.

AllelSeq Info

Q:
I am trying to find Allel specific expression from a hybrid mouse RNA-Seq. I
have the complete genome of the two parental strains and the corresponding
SNPs and Indels as vcf format. I don’t have a complete genome and genotype
for the hybrid mouse.

I am looking to your AllelSeq tools you have published and attempted to
create a custom genome using vcf2diploid_v0.2.6a. Would you please guild me
OR forward to someone in your group who will guide me if the tools will be
useful in such context? I see the tool was developed with Trios.

A:
AlleleSeq/vcf2diploid can be used both with trios and with single samples.
Either way, you would need to have a .vcf for the sample (hybrid mouse) in order to use the tools though.

HiC-Spector data

Q:
We have read with much interest your article about the HiC-Spector method.
We are currently working on a method that we hope will help identify
conserved features across different HiC-maps. As the problem we are studying
and the one tackled in your article are closely related, we think it would
be useful for us to test our method using your data set as the ground truth.
We kindly ask whether you would be able to provide us with the HiC maps used
in the article for this purpose.

A:
https://github.com/gersteinlab/HiC-spector/blob/master/data/readme_data

Illumina/PlatinumGenomes — Can’t access NA12878 files on ftp site (#4)

Q:
The map files I’m looking at are in NA12878_diploid_2017_jan7 and the VCF I’m looking at is IPG_2016_1.0_callset/NA12878.vcf.gz.

Just to make sure I’m not misunderstanding, does the map file contain variant coordinates, or is it another form of chain file that gives the offsets of match blocks? I’ve been assuming they’re coordinates of variants listed in the VCF file. Also, I’m assuming the lines with one or more zeroes are indels?

Here are some examples of my attempts to match the coordinates between the map files and the VCF.

First, these three lines are taken from chr1_NA12878.map:

249172383 249098765 249117098
249193024 249119406 249137737
249193998 249120380 249138710

When I take the first number from each of these lines and grep it out of the VCF file, I get no hits, even if I add 1 or subtract 1 to account for differences in the use of 0-based versus 1-based coordinates:

cat NA12878.vcf.gz | gunzip | grep 249172383
cat NA12878.vcf.gz | gunzip | grep 249172382
cat NA12878.vcf.gz | gunzip | grep 249172384

cat NA12878.vcf.gz | gunzip | grep 249193024
cat NA12878.vcf.gz | gunzip | grep 249193023
cat NA12878.vcf.gz | gunzip | grep 249193025

cat NA12878.vcf.gz | gunzip | grep 249193998
cat NA12878.vcf.gz | gunzip | grep 249193997
cat NA12878.vcf.gz | gunzip | grep 249193999

I then tried the opposite: I took coordinates from the VCF and grepped those from the map files, but also with no hits. Here are three lines from the VCF file:

chr19 59097308 . G A . PASS MTD=cgi,bwa_freebayes,bwa_platypus,isaac_strelka,bw
a_gatk;KM=8.86;KFP=0;KFF=0 GT 0|1
chr19 59097977 . T C . PASS MTD=isaac_strelka;KM=7.00;KFP=0;KFF=0 GT 0|1
chr19 59098781 . G T . PASS MTD=bwa_gatk;KM=5.14;KFP=0;KFF=0 GT 0|1

When I grep these coordinates (plus or minus 1) from chr19_NA12878.map I get no hits:

cat chr19_NA12878.map | grep 59097308
cat chr19_NA12878.map | grep 59097307
cat chr19_NA12878.map | grep 59097309

cat chr19_NA12878.map | grep 59097977
cat chr19_NA12878.map | grep 59097976
cat chr19_NA12878.map | grep 59097978

cat chr19_NA12878.map | grep 59098781
cat chr19_NA12878.map | grep 59098780
cat chr19_NA12878.map | grep 59098782

Could it be that this newest diploid genome was generated from hg38 rather than hg19?

The problem isn’t with the link. We were able to download the files. The problem is that the coordinates in the downloaded VCF and map files don’t seem to match up. Please scroll down for my original description of the problem.

A:
The .map files are actually another form of the .chain files.
The first coordinate you looked at (249172383) seems to be the start coordinate of the block that accounts for the coordinate shift due to an indel preceding it:

$grep -w ^249172383 chr1_NA12878.map -B 3
249112564 249038946 0
249112565 249038947 249057281
249172382 249098764 0
249172383 249098765 249117098

$grep -w 249172381 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249172381 . CT C . PASS MTD=bwa_gatk;KM=3.12;KFP=1;KFF=1 GT 0|1

There is a ‘T’ in the reference at 249172382 that is present in PAT, but not in MAT (thus, ‘0’ in the third column). This corresponds to a 1 bp block in REF (249172382) and PAT and a 0 bp block in MAT.
The next block starts after the deletion in MAT and the corresponding REF coordinate is 249172383.

Similarly, for your second example: it is a 2bp shift this time and so the new block starts at +3 bp in REF with respect to the corresponding indel coordinate:

$grep -w ^249193024 chr1_NA12878.map -B 3
249172382 249098764 0
249172383 249098765 249117098
249193022 249119404 0
249193024 249119406 249137737

$grep -w 249193021 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193021 . TAG T . PASS MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=12.40;KFP=0;KFF=0 GT 0|1

The third example is similar to the first one:

$grep -w ^249193998 chr1_NA12878.map -B 3
249193022 249119404 0
249193024 249119406 249137737
249193997 249120379 0
249193998 249120380 249138710

$grep -w 249193996 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193996 . GA G . PASS MTD=bwa_freebayes,bwa_platypus,bwa_gatk;KM=5.81;KFP=0;KFF=0 GT 0|1

The opposite examples are all SNVs and do not cause coordinate shifts and thus aren’t present in the .map files.

The reference sequence used to create the personal genome is hg19 and the .fasta file can be found in the same archive: reference_callsets_files/reference_annotation/female.hg19.fasta.gz

I am also attaching a README file containing a brief description of the file format (from vcf2diploid, http://alleleseq.gersteinlab.org/tools.html).

README file for vcf2diploid tool distribution v_0.2.6a (19th Sept 2014)

This version of vcf2diploid integrates vcf2diploid_v_0.2.6 with generation of read depth for AlleleSeq filtering of potential SNPs residing in CNV locations.
There is also an additional option on vcf2diploid for an output folder.

The read-depth-file generator calculates a normalized read depth for a SNP in a 2000-bp window (+- 1000bp) around the SNP against an average depth computed
from the entire genome. The output generated can be directly used for the AlleleSeq pipeline.

CITATION:
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.
AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

makePersonalGenome.mk

After installation/compilation of vcf2diploid, modify the file makePersonalGenome.mk to run the following
1) vcf2diploid
2) vcf2snp
3) read-depth-file generator

USAGE: make -f makePersonalGenome DATA_DIR=/path/to/your/VCFdir OUTPUT_SAMPLE_NAME=NA12878 FILE_NAME_BAM=filename.in.DATA_DIR.bam FILE_NAME_VCF=filename.in.DATA_DIR.vcf
–currently DATA_DIR is set as the directory where the BAM and VCF files are kept and where the output directory is going to be.
–options can be modified in the file

Acknowledgement: The modifications in this version are created by R. Kitchen of the Gerstein Lab@Yale.

he following section describes vcf2diploid_v_0.2.6

1. Compilation
$ make

2. Running
java -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

You can try the program by running ‘test_run.sh’ script in the ‘example’
directory. See also "Important notes" below.

3. Constructing personal annotation and splice-junction library

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt

* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).

Important notes

All characters between ‘>’ and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name ‘chr1’.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of ‘chr’ at the beginning is allows, i.e. ‘chr1’ and ‘1’ are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA — see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN — http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with ‘0’ for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref MAP format
X X X
X X X \
X X X > P1 M1 R1
X X – > P4 M4 0
X X – , > 0 M6 R4
– X X ‘ ,-> P6 M7 R5
X X X ‘
X X X
X X X

For question and comments contact: Alexej Abyzov (alexej.abyzov) and
Mark Gerstein (mark.gerstein)

Encode TF binding site data

Q:
I am currently trying to figure out which TFs may be regulating my
modules in the synovium of rheumatoid arthritis patients (generated using
WGCNA of microarray samples). Am I right in understanding I can use your
ENCODE data to do so on this website?

http://encodenets.gersteinlab.org/

Second question is which one of these files should I use to compare my gene
expression module genes with your TF binding gene lists? I noticed the raw
one has huge numbers of genes for each TF, so should I use the other i.e.
filtered?

enets1.Proximal_raw.txt

enets2.Proximal_filtered.txt

Is it a problem the ENCODE TF binding sites are generated from cell lines
and not in the diseased tissue I am interested in? Apologies for my naivety!

A:
quick answers below

…Am I right in understanding I can use your
ENCODE data to do so on this website?

ANSWER: yes

I noticed the raw
one has huge numbers of genes for each TF, so should I use the other i.e.
filtered?

ANSWER: I’d go w/ the filtered one (enets2) first.

Is it a problem the ENCODE TF binding sites are generated from cell lines
and not in the diseased tissue I am interested in?

ANSWER: don’t think so – they are meant to be a ref.