Question about liftover from pat/mat to ref

Q1:
I’ve used your AlleleSeq package to get NA12878 diploid genome alignment. I have two questions:

1. How can I liftover from paternal/maternal alignments to hg19 coordinates? Based on the documents, the map files generated by vcf2diploid are used to convert from hg19 to pat/mat, not the other direction. Last year I found an R script "map.to.ref.R" from your website and used that to convert from pat/mat to hg19. However I cannot find it anymore. Do you have an updated script for this liftover? If "map.to.ref.R" is still valid to use, what reference should we cite in our manuscripts?

2. I noticed that both the "MergeBowtie.py" script in AlleleSeq and the R script "map.to.ref.R" assume bowtie format in input alignment files. Do you have upgraded versions that are compatible with SAM format?

A1:
1. We use the chainSwap tool (http://hgdownload.soe.ucsc.edu/admin/exe/) to flip the maternal.chain and paternal.chain files (generated by vcf2diploid) to the other direction. Then a .bed file in the parental coordinates can be converted into a .bed file in the reference coordinates using the UCSC liftOver tool and the swapped chain files.

This approach is straightforward for any arbitrary interval, but if you are converting alignments, it may require additional scripting or using other tools. We’ve received a few issues/questions with the script you are mentioning and since we didn’t develop it and aren’t currently planning to maintain, we removed the link from the web-site for now.

2. We do not have updated released versions yet, but this is one of the things that will be introduced in future versions and thus I might be able to help. What exactly are you interested in: is it just compatibility with SAM-formatted bowtie1 (i.e. ungapped alignments) output or are you trying to use another aligner?

Q2:
Thanks for your detailed answers. They are very helpful.
We used bowtie2 or tophat2 as the aligners for DNA or RNA.
Currently we hope to liftover the pat/may alignments to hg19 for downstream analyses (e.g. HiC or DEG). We can’t compare results if they are not liftovered to the same reference.
Could I take the mapped position in sam, convert it to bed, liftover to hg19, and then chip this new position back to the sam? Do you have a better solution?

A2:
The start coordinates of the reads can be transferred this way, but I don’t think inserting them back to the sam file will produce a correct sam/bam file overall: some of the other fields (CIGAR string, mismatches, etc) will need to be adjusted as well. Depending on the further analysis this may be an issue. CrossMap http://crossmap.sourceforge.net/ seems to work with sam/bam and chain files. I have never used it though.

Maybe it is easier to do the analysis on the parental alignments? Say, for DEG in order to generate read counts table, one might consider transferring the annotation to mat and pat coordinates. Then for every exon/gene extract the reads mapped to it in both alignments and use the number of unique read-ids as the gene read count.

Where does the merge script come in your approach? Do you want to merge the alignments before transferring them to hg19 and then do further analyses (which I am guessing, do not involve looking into allele-specific counts)?

AllelSeq Info

Q:
I am trying to find Allel specific expression from a hybrid mouse RNA-Seq. I
have the complete genome of the two parental strains and the corresponding
SNPs and Indels as vcf format. I don’t have a complete genome and genotype
for the hybrid mouse.

I am looking to your AllelSeq tools you have published and attempted to
create a custom genome using vcf2diploid_v0.2.6a. Would you please guild me
OR forward to someone in your group who will guide me if the tools will be
useful in such context? I see the tool was developed with Trios.

A:
AlleleSeq/vcf2diploid can be used both with trios and with single samples.
Either way, you would need to have a .vcf for the sample (hybrid mouse) in order to use the tools though.

Illumina/PlatinumGenomes — Can’t access NA12878 files on ftp site (#4)

Q:
The map files I’m looking at are in NA12878_diploid_2017_jan7 and the VCF I’m looking at is IPG_2016_1.0_callset/NA12878.vcf.gz.

Just to make sure I’m not misunderstanding, does the map file contain variant coordinates, or is it another form of chain file that gives the offsets of match blocks? I’ve been assuming they’re coordinates of variants listed in the VCF file. Also, I’m assuming the lines with one or more zeroes are indels?

Here are some examples of my attempts to match the coordinates between the map files and the VCF.

First, these three lines are taken from chr1_NA12878.map:

249172383 249098765 249117098
249193024 249119406 249137737
249193998 249120380 249138710

When I take the first number from each of these lines and grep it out of the VCF file, I get no hits, even if I add 1 or subtract 1 to account for differences in the use of 0-based versus 1-based coordinates:

cat NA12878.vcf.gz | gunzip | grep 249172383
cat NA12878.vcf.gz | gunzip | grep 249172382
cat NA12878.vcf.gz | gunzip | grep 249172384

cat NA12878.vcf.gz | gunzip | grep 249193024
cat NA12878.vcf.gz | gunzip | grep 249193023
cat NA12878.vcf.gz | gunzip | grep 249193025

cat NA12878.vcf.gz | gunzip | grep 249193998
cat NA12878.vcf.gz | gunzip | grep 249193997
cat NA12878.vcf.gz | gunzip | grep 249193999

I then tried the opposite: I took coordinates from the VCF and grepped those from the map files, but also with no hits. Here are three lines from the VCF file:

chr19 59097308 . G A . PASS MTD=cgi,bwa_freebayes,bwa_platypus,isaac_strelka,bw
a_gatk;KM=8.86;KFP=0;KFF=0 GT 0|1
chr19 59097977 . T C . PASS MTD=isaac_strelka;KM=7.00;KFP=0;KFF=0 GT 0|1
chr19 59098781 . G T . PASS MTD=bwa_gatk;KM=5.14;KFP=0;KFF=0 GT 0|1

When I grep these coordinates (plus or minus 1) from chr19_NA12878.map I get no hits:

cat chr19_NA12878.map | grep 59097308
cat chr19_NA12878.map | grep 59097307
cat chr19_NA12878.map | grep 59097309

cat chr19_NA12878.map | grep 59097977
cat chr19_NA12878.map | grep 59097976
cat chr19_NA12878.map | grep 59097978

cat chr19_NA12878.map | grep 59098781
cat chr19_NA12878.map | grep 59098780
cat chr19_NA12878.map | grep 59098782

Could it be that this newest diploid genome was generated from hg38 rather than hg19?

The problem isn’t with the link. We were able to download the files. The problem is that the coordinates in the downloaded VCF and map files don’t seem to match up. Please scroll down for my original description of the problem.

A:
The .map files are actually another form of the .chain files.
The first coordinate you looked at (249172383) seems to be the start coordinate of the block that accounts for the coordinate shift due to an indel preceding it:

$grep -w ^249172383 chr1_NA12878.map -B 3
249112564 249038946 0
249112565 249038947 249057281
249172382 249098764 0
249172383 249098765 249117098

$grep -w 249172381 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249172381 . CT C . PASS MTD=bwa_gatk;KM=3.12;KFP=1;KFF=1 GT 0|1

There is a ‘T’ in the reference at 249172382 that is present in PAT, but not in MAT (thus, ‘0’ in the third column). This corresponds to a 1 bp block in REF (249172382) and PAT and a 0 bp block in MAT.
The next block starts after the deletion in MAT and the corresponding REF coordinate is 249172383.

Similarly, for your second example: it is a 2bp shift this time and so the new block starts at +3 bp in REF with respect to the corresponding indel coordinate:

$grep -w ^249193024 chr1_NA12878.map -B 3
249172382 249098764 0
249172383 249098765 249117098
249193022 249119404 0
249193024 249119406 249137737

$grep -w 249193021 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193021 . TAG T . PASS MTD=isaac_strelka,bwa_freebayes,bwa_platypus,bwa_gatk;KM=12.40;KFP=0;KFF=0 GT 0|1

The third example is similar to the first one:

$grep -w ^249193998 chr1_NA12878.map -B 3
249193022 249119404 0
249193024 249119406 249137737
249193997 249120379 0
249193998 249120380 249138710

$grep -w 249193996 inp_files/IPG_2016_1.0_callset/NA12878.vcf
chr1 249193996 . GA G . PASS MTD=bwa_freebayes,bwa_platypus,bwa_gatk;KM=5.81;KFP=0;KFF=0 GT 0|1

The opposite examples are all SNVs and do not cause coordinate shifts and thus aren’t present in the .map files.

The reference sequence used to create the personal genome is hg19 and the .fasta file can be found in the same archive: reference_callsets_files/reference_annotation/female.hg19.fasta.gz

I am also attaching a README file containing a brief description of the file format (from vcf2diploid, http://alleleseq.gersteinlab.org/tools.html).

README file for vcf2diploid tool distribution v_0.2.6a (19th Sept 2014)

This version of vcf2diploid integrates vcf2diploid_v_0.2.6 with generation of read depth for AlleleSeq filtering of potential SNPs residing in CNV locations.
There is also an additional option on vcf2diploid for an output folder.

The read-depth-file generator calculates a normalized read depth for a SNP in a 2000-bp window (+- 1000bp) around the SNP against an average depth computed
from the entire genome. The output generated can be directly used for the AlleleSeq pipeline.

CITATION:
Rozowsky J, Abyzov A, Wang J, Alves P, Raha D, Harmanci A, Leng J, Bjornson R, Kong Y, Kitabayashi N, Bhardwaj N, Rubin M, Snyder M, Gerstein M.
AlleleSeq: analysis of allele-specific expression and binding in a network framework.
Mol Syst Biol. 2011 Aug 2;7:522. doi: 10.1038/msb.2011.54.

makePersonalGenome.mk

After installation/compilation of vcf2diploid, modify the file makePersonalGenome.mk to run the following
1) vcf2diploid
2) vcf2snp
3) read-depth-file generator

USAGE: make -f makePersonalGenome DATA_DIR=/path/to/your/VCFdir OUTPUT_SAMPLE_NAME=NA12878 FILE_NAME_BAM=filename.in.DATA_DIR.bam FILE_NAME_VCF=filename.in.DATA_DIR.vcf
–currently DATA_DIR is set as the directory where the BAM and VCF files are kept and where the output directory is going to be.
–options can be modified in the file

Acknowledgement: The modifications in this version are created by R. Kitchen of the Gerstein Lab@Yale.

he following section describes vcf2diploid_v_0.2.6

1. Compilation
$ make

2. Running
java -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

where sample_id is the ID of individual whose genome is being constructed
(e.g., NA12878), file.fa is FASTA file(s) with reference sequence(s), and
file.vcf is VCF4.0 file(s) with variants. One can specify multiple FASTA and
VCF files at a time. Splitting the whole genome in multiple files (e.g., with
one FASTA file per chromosome) reduces memory usage.
Amount of memory used by Java can be increased as follows

java -Xmx4000m -jar vcf2diploid.jar -id sample_id -chr file.fa … [-vcf file.vcf …]

You can try the program by running ‘test_run.sh’ script in the ‘example’
directory. See also "Important notes" below.

3. Constructing personal annotation and splice-junction library

* Using chain file one can lift over annotation of reference genome to personal
haplotpes. This can be done with the liftOver tools
(see http://hgdownload.cse.ucsc.edu/admin/exe).

For example, to lift over Gencode annotation once can do

$ liftOver -gff ref_annotation.gtf mat.chain mat_annotation.gtf not_lifted.txt

* To construct personal splice-junction library(s) for RNAseq analysis one can
use RSEQtools (http://archive.gersteinlab.org/proj/rnaseq/rseqtools).

Important notes

All characters between ‘>’ and first white space in FASTA header are used
internally as chromosome/sequence names. For instance, for the header

>chr1 human

vcf2diploid will upload the corresponding sequence into the memory under the
name ‘chr1’.
Chromosome/sequence names should be consistent between FASTA and VCF files but
omission of ‘chr’ at the beginning is allows, i.e. ‘chr1’ and ‘1’ are treated as
the same name.

The output contains (file formats are described below):
1) FASTA files with sequences for each haplotype.
2) CHAIN files relating paternal/maternal haplotype to the reference genome.
3) MAP files with base correspondence between paternal-maternal-reference
sequences.

File formats:
* FASTA — see http://www.ncbi.nlm.nih.gov/blast/fasta.shtml
* CHAIN — http://genome.ucsc.edu/goldenPath/help/chain.html
* MAP file represents block with equivalent bases in all three haplotypes
(paternal, maternal and reference) by one record with indices of the first
bases in each haplotype. Non-equivalent bases are represented as separate
records with ‘0’ for haplotypes having non-equivalent base (see
clarification below).

Pat Mat Ref MAP format
X X X
X X X \
X X X > P1 M1 R1
X X – > P4 M4 0
X X – , > 0 M6 R4
– X X ‘ ,-> P6 M7 R5
X X X ‘
X X X
X X X

For question and comments contact: Alexej Abyzov (alexej.abyzov) and
Mark Gerstein (mark.gerstein)

Question about using AlleleSeq tool

Q:

I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parent ID’s, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.

Chain files in AlleleSeq

Q1:
We are very interested in your AlleleSeq package. I have downloaded the maternal.chain and paternal.chain from http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip<http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip.

From the documentation @http://info.gersteinlab.org/AlleleSeq:
Chain files
Using the chain file, one can use the LifeOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

[liuh@helix NA12878_diploid_genome_dec16_2013]$ more maternal.chain
chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1
10329 1 0
109 1 0
30199 3 0
43187 4 0
40 1 0

My bed file:
[liuh@helix bcf3]$ awk ‘{print $1,$2,$3}’ cyto.vcf.bed |head
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907
chr1 14929 14930
chr1 15014 15015
chr1 16287 16288
chr1 16297 16298
chr1 16377 16378
chr1 16494 16495

My script:
module load ucsc; liftOver cyto.vcf.bed maternal.chain cyto.liftover unMapped.cyto

I tried to liftOver my bed from hg19 to hg18 without any problem. It means
that the bed file format should not have any issue.

A1:
thanks for interest to our software.
I may be wrong but it looks to me that the liftOver failed because of using different chromosome naming convention in .bed and .chain files.
In .bed file chromosomes are named with prefix ‘chr’, while in chain files they don’t have such prefix.

Q2:
Thanks so much for your time and kind help! It works well after I removed prefix ‘chr’ in bed file.

Do you have chain file(s) which can convert the annotation coordinates from maternal or paternal genome to reference genome?

I have mpileup and VarScan outputs with the annotation coordinates from maternal and paternal genome, and want to annotate it using annovar (but the annotation coordinates from the reference is required).

A2:
Incidentally, a previous user of AlleleSeq looking for conversion of mat/pat
to ref genome has given us an R code for public consumption. The only gripe
is we have not had time to test it extensively, hence we did not provide it
on the website previously.

I have the link here: http://alleleseq.gersteinlab.org/tools.html. The R
script converts mat/pat back to ref genome using the chain files provided by
AlleleSeq. Please see if it suits your purpose.

what kind of indels are incorporated into the diploid genome assembly of the NA12878 individual?

Q:

I would like to ask about what kind of indels are incorporated into the
diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this
genome. What makes me confused is that when I perform a BLAST search with
one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf
(P2_M_061510_21_73), it shows up in both the maternal and paternal
haplotypes. Is there any size cutoff used for the indels that have been
selected for this assembly?

A:
in the latest version no fosmid indesl/SVs were used. Only output of GATK.

prefix ‘chr’ in liftOver

Q:

From the documentation @http://info.gersteinlab.org/AlleleSeq:

Chain files

Using the chain file, one can use the LifeOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

249242013 1

10329 1 0

109 1 0

30199 3 0

My bed file:

chr1 14541 14542

chr1 14652 14653

chr1 14676 14677

chr1 14906 14907

A:

It looked like the liftOver failed because of using different chromosome naming convention in .bed and .chain files. In .bed file chromosomes are named with prefix ‘chr’, while in chain files they don’t have such prefix.

fosmid indel

Q:

I would like to ask about what kind of indels are incorporated into the

diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this

genome. What makes me confused is that when I perform a BLAST search with

one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf

(P2_M_061510_21_73), it shows up in both the maternal and paternal

haplotypes. Is there any size cutoff used for the indels that have been

selected for this assembly?

A:

Unfortunately, in the latest version, no fosmid indels/SVs were used; only the variant output of GATK Best Practices v3 was used, even though fosmid data was indeed used to construct the earlier versions of the diploid genome. We might include them in the future. Thank you.

Do you need parents’ genotype data?

Q:

I am looking for a tool to detect allele specific expression from resequencing and RNA-seq data. I find AlleleSeq could be quite powerful. I noticed the input for the software needs parents genotype data; it requires a VCF file which contains trio genotype to create maternal and paternal genome. But in my case, if I only have genotype information from a single individual, how could I use AlleleSeq?

A:

You dont have to genotype parents. You only need to have variants phased in any way you can/wish (vcf2diploid tool only looks at one column with info for the individual of interest and does not consider other columns). Having trio sequenced is an easy and, probably, the best way to do it.

If you have the mothers genotype only, then you can phase a good fraction of heterozygous variants. Each unphased variants will be randomly assigned to a particular haplotype, so half of them will also be correct. And, of course, all homozygous variants will be phased.

Mismatches between the paternal and maternal chromosomes

Q:
I believe I have discovered numerous errors in the NA12878 dataset. We are working with the most recent version,
NA12878_diploid_genome_may3_2011. They are all single base pair mismatches between the paternal and maternal chromosomes in regions that the accompanying .map file marks as contigs.

A:
.map file shows continuous equivalent (without gaps) blocks between haplotypes. BUT THEY DO INCLUDE SNPs. So, heterozygous SNPs will result in base mismatch within a block.