Illumina/PlatinumGenomes — Can’t access NA12878 files on ftp site (#4)

Q:
The map files I’m looking at are in NA12878_diploid_2017_jan7 and the VCF I’m looking at is IPG_2016_1.0_callset/NA12878.vcf.gz.

Just to make sure I’m not misunderstanding, does the map file contain variant coordinates, or is it another form of chain file that gives the offsets of match blocks? I’ve been assuming they’re coordinates of variants listed in the VCF file. Also, I’m assuming the lines with one or more zeroes are indels?

Here are some examples of my attempts to match the coordinates between the map files and the VCF.

First, these three lines are taken from chr1_NA12878.map:

249172383 249098765 249117098
249193024 249119406 249137737
249193998 249120380 249138710

When I take the first number from each of these lines and grep it out of the VCF file, I get no hits, even if I add 1 or subtract 1 to account for differences in the use of 0-based versus 1-based coordinates:

cat NA12878.vcf.gz | gunzip | grep 249172383
cat NA12878.vcf.gz | gunzip | grep 249172382
cat NA12878.vcf.gz | gunzip | grep 249172384

cat NA12878.vcf.gz | gunzip | grep 249193024
cat NA12878.vcf.gz | gunzip | grep 249193023
cat NA12878.vcf.gz | gunzip | grep 249193025

cat NA12878.vcf.gz | gunzip | grep 249193998
cat NA12878.vcf.gz | gunzip | grep 249193997
cat NA12878.vcf.gz | gunzip | grep 249193999

I then tried the opposite: I took coordinates from the VCF and grepped those from the map files, but also with no hits. Here are three lines from the VCF file:

chr19 59097308 . G A . PASS MTD=cgi,bwa_freebayes,bwa_platypus,isaac_strelka,bw
a_gatk;KM=8.86;KFP=0;KFF=0 GT 0|1
chr19 59097977 . T C . PASS MTD=isaac_strelka;KM=7.00;KFP=0;KFF=0 GT 0|1
chr19 59098781 . G T . PASS MTD=bwa_gatk;KM=5.14;KFP=0;KFF=0 GT 0|1

When I grep these coordinates (plus or minus 1) from chr19_NA12878.map I get no hits:

cat chr19_NA12878.map | grep 59097308
cat chr19_NA12878.map | grep 59097307
cat chr19_NA12878.map | grep 59097309

cat chr19_NA12878.map | grep 59097977
cat chr19_NA12878.map | grep 59097976
cat chr19_NA12878.map | grep 59097978

cat chr19_NA12878.map | grep 59098781
cat chr19_NA12878.map | grep 59098780
cat chr19_NA12878.map | grep 59098782

Could it be that this newest diploid genome was generated from hg38 rather than hg19?

The problem isn’t with the link. We were able to download the files. The problem is that the coordinates in the downloaded VCF and map files don’t seem to match up. Please scroll down for my original description of the problem.

A:
The .map files are actually another form of the .chain files.
The first coordinate you looked at (249172383) seems to be the start coordinate of the block that accounts for the coordinate shift due to an indel preceding it:

$grep -w ^249172383 chr1_NA12878.map -B 3
249112564 249038946 0
249112565 249038947 249057281
249172382 249098764 0
249172383 249098765 249117098

$grep -w 249172381 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249172381 . CT C . PASS MTD=bwa_gatk;KM=3.12;KFP=1;KFF=1 GT 0|1

There is a ‘T’ in the reference at 249172382 that is present in PAT, but not in MAT (thus, ‘0’ in the third column). This corresponds to a 1 bp block in REF (249172382) and PAT and a 0 bp block in MAT.
The next block starts after the deletion in MAT and the corresponding REF coordinate is 249172383.

Similarly, for your second example: it is a 2bp shift this time and so the new block starts at +3 bp in REF with respect to the corresponding indel coordinate:

$grep -w ^249193024 chr1_NA12878.map -B 3
249172382 249098764 0
249172383 249098765 249117098
249193022 249119404 0
249193024 249119406 249137737

$grep -w 249193021 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193021 . TAG T . PASS MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=12.40;KFP=0;KFF=0 GT 0|1

The third example is similar to the first one:

$grep -w ^249193998 chr1_NA12878.map -B 3
249193022 249119404 0
249193024 249119406 249137737
249193997 249120379 0
249193998 249120380 249138710

$grep -w 249193996 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193996 . GA G . PASS MTD=bwa_freebayes,bwa_platypus,bwa_gatk;KM=5.81;KFP=0;KFF=0 GT 0|1

The opposite examples are all SNVs and do not cause coordinate shifts and thus aren’t present in the .map files.

The reference sequence used to create the personal genome is hg19 and the .fasta file can be found in the same archive: reference_callsets_files/reference_annotation/female.hg19.fasta.gz

I am also attaching a README file containing a brief description of the file format (from vcf2diploid, http://alleleseq.gersteinlab.org/tools.html).

README file for vcf2diploid tool distribution v_0.2.6a (19th Sept 2014)

This version of vcf2diploid integrates vcf2diploid_v_0.2.6 with generation of read depth for AlleleSeq filtering of potential SNPs residing in CNV locations.
There is also an additional option on vcf2diploid for an output folder.

The read-depth-file generator calculates a normalized read depth for a SNP in a 2000-bp window (+- 1000bp) around the SNP against an average depth computed
from the entire genome. The output generated can be directly used for the AlleleSeq pipeline.

CITATION:
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.
AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

makePersonalGenome.mk

After installation/compilation of vcf2diploid, modify the file makePersonalGenome.mk to run the following
1) vcf2diploid
2) vcf2snp
3) read-depth-file generator

USAGE: make -f makePersonalGenome DATA_DIR=/path/to/your/VCFdir OUTPUT_SAMPLE_NAME=NA12878 FILE_NAME_BAM=filename.in.DATA_DIR.bam FILE_NAME_VCF=filename.in.DATA_DIR.vcf
–currently DATA_DIR is set as the directory where the BAM and VCF files are kept and where the output directory is going to be.
–options can be modified in the file

Acknowledgement: The modifications in this version are created by R. Kitchen of the Gerstein Lab@Yale.

he following section describes vcf2diploid_v_0.2.6

1. Compilation
$ make

2. Running
java -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

You can try the program by running ‘test_run.sh’ script in the ‘example’
directory. See also "Important notes" below.

3. Constructing personal annotation and splice-junction library

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt

* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).

Important notes

All characters between ‘>’ and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name ‘chr1’.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of ‘chr’ at the beginning is allows, i.e. ‘chr1’ and ‘1’ are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA — see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN — http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with ‘0’ for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref MAP format
X X X
X X X \
X X X > P1 M1 R1
X X – > P4 M4 0
X X – , > 0 M6 R4
– X X ‘ ,-> P6 M7 R5
X X X ‘
X X X
X X X

For question and comments contact: Alexej Abyzov (alexej.abyzov) and
Mark Gerstein (mark.gerstein)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s