Question about using AlleleSeq tool

Q:

I am a student doing a research project on allele-specific expression (ASE),
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI population HapMap data. I tried to find phased
YRI trio data (from 1000 Genomes) to input into the vcf2diploid tool.
Unfortunately, the data I found includes only the parents’ IDs, not the
child’s. Without the child’s data, I am unable to run the AlleleSeq
pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.

Chain files in AlleleSeq

Q1:
We are very interested in your AlleleSeq package. I have downloaded the maternal.chain and paternal.chain from http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip.

From the documentation at http://info.gersteinlab.org/AlleleSeq:
Chain files
Using the chain file, one can use the liftOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, every record was returned as unmapped.

[liuh@helix NA12878_diploid_genome_dec16_2013]$ more maternal.chain
chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1
10329 1 0
109 1 0
30199 3 0
43187 4 0
40 1 0

My bed file:
[liuh@helix bcf3]$ awk '{print $1,$2,$3}' cyto.vcf.bed | head
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907
chr1 14929 14930
chr1 15014 15015
chr1 16287 16288
chr1 16297 16298
chr1 16377 16378
chr1 16494 16495

My script:
module load ucsc; liftOver cyto.vcf.bed maternal.chain cyto.liftover unMapped.cyto

I was able to liftOver the same bed file from hg19 to hg18 without any
problem, so the bed file format itself should not be the issue.

A1:
Thank you for your interest in our software.
I may be wrong, but it looks to me as though liftOver failed because the .bed and .chain files use different chromosome naming conventions.
In the .bed file, chromosomes are named with the prefix ‘chr’, while in the chain files they have no such prefix.
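
The naming fix described above can be sketched in a few lines of Python. This is only an illustration, not part of AlleleSeq: the function name strip_chr_prefix is hypothetical, and it assumes the chromosome name is the first column of each BED line.

```python
def strip_chr_prefix(bed_lines):
    """Drop a leading 'chr' from the chromosome column of each BED line.

    Lines that do not start with 'chr' (for example 'track' or 'browser'
    header lines) are passed through unchanged.
    """
    return [line[3:] if line.startswith("chr") else line for line in bed_lines]


# Example with two records from the question above.
records = ["chr1\t14541\t14542", "chr1\t14652\t14653"]
for line in strip_chr_prefix(records):
    print(line)
```

The converted lines can then be written to a new bed file and passed to liftOver together with maternal.chain or paternal.chain.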

Q2:
Thanks so much for your time and kind help! It works well after I removed the prefix ‘chr’ from the bed file.

Do you have chain file(s) which can convert the annotation coordinates from the maternal or paternal genome to the reference genome?

I have mpileup and VarScan outputs with annotation coordinates from the maternal and paternal genomes, and want to annotate them using ANNOVAR (which requires annotation coordinates on the reference).

A2:
Incidentally, a previous user of AlleleSeq who needed to convert mat/pat
coordinates to the ref genome has given us an R script for public use. The
only caveat is that we have not had time to test it extensively, hence we
did not provide it on the website previously.

The link is here: http://alleleseq.gersteinlab.org/tools.html. The R
script converts mat/pat back to the ref genome using the chain files
provided by AlleleSeq. Please see if it suits your purpose.
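
To illustrate the idea behind such a conversion (this is not the R script mentioned above), here is a minimal Python sketch. It assumes the chain files follow the UCSC chain format, where each data line is "size dt dq": an ungapped block of the given size, followed by dt bases present only in the reference and dq bases present only in the haplotype. It also assumes both strands are ‘+’, as in the maternal.chain excerpt shown earlier. The function names are hypothetical and coordinates are 0-based.

```python
def parse_chain_blocks(lines):
    """Turn one chain (header plus data lines) into ungapped blocks.

    Returns a list of (ref_start, hap_start, size) tuples, 0-based.
    """
    header = lines[0].split()
    # chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
    ref_pos, hap_pos = int(header[5]), int(header[10])
    blocks = []
    for line in lines[1:]:
        fields = [int(x) for x in line.split()]
        if not fields:          # a blank line ends the chain
            break
        size = fields[0]
        blocks.append((ref_pos, hap_pos, size))
        ref_pos += size
        hap_pos += size
        if len(fields) == 3:    # the last line of a chain has only the size
            ref_pos += fields[1]    # dt: gap on the reference side
            hap_pos += fields[2]    # dq: gap on the haplotype side
    return blocks


def hap_to_ref(blocks, hap_coord):
    """Map a haplotype coordinate back to the reference, or None if it
    falls in sequence that has no equivalent on the reference."""
    for ref_start, hap_start, size in blocks:
        if hap_start <= hap_coord < hap_start + size:
            return ref_start + (hap_coord - hap_start)
    return None


# First lines of the maternal.chain excerpt quoted in Q1.
chain = [
    "chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1",
    "10329 1 0",
    "109 1 0",
    "30199 3 0",
]
blocks = parse_chain_blocks(chain)
print(hap_to_ref(blocks, 10329))  # 10330: one reference base was skipped after the first block
```

A linear scan over the blocks is used here for clarity; for whole-genome chain files a binary search over block starts would be faster.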

What kind of indels are incorporated into the diploid genome assembly of the NA12878 individual?

Q:

I would like to ask about what kind of indels are incorporated into the
diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this
genome. What makes me confused is that when I perform a BLAST search with
one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf
(P2_M_061510_21_73), it shows up in both the maternal and paternal
haplotypes. Is there any size cutoff used for the indels that have been
selected for this assembly?

A:
In the latest version, no fosmid indels/SVs were used; only the output of GATK was used.

Prefix ‘chr’ in liftOver

Q:

From the documentation at http://info.gersteinlab.org/AlleleSeq:

Chain files

Using the chain file, one can use the liftOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, every record was returned as unmapped.

249242013 1
10329 1 0
109 1 0
30199 3 0

My bed file:
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907

A:

It looks like liftOver failed because the .bed and .chain files use different chromosome naming conventions. In the .bed file, chromosomes are named with the prefix ‘chr’, while in the chain files they have no such prefix.

Fosmid indels

Q:

I would like to ask about what kind of indels are incorporated into the
diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this
genome. What makes me confused is that when I perform a BLAST search with
one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf
(P2_M_061510_21_73), it shows up in both the maternal and paternal
haplotypes. Is there any size cutoff used for the indels that have been
selected for this assembly?

A:

Unfortunately, in the latest version, no fosmid indels/SVs were used; only the variant output of GATK Best Practices v3 was used, even though fosmid data was indeed used to construct the earlier versions of the diploid genome. We might include them in the future. Thank you.

Do you need parents’ genotype data?

Q:

I am looking for a tool to detect allele-specific expression from resequencing and RNA-seq data. AlleleSeq looks quite powerful. I noticed that the software needs the parents’ genotype data as input; it requires a VCF file containing trio genotypes to create the maternal and paternal genomes. But in my case, I only have genotype information from a single individual. How could I use AlleleSeq?

A:

You don’t have to genotype the parents. You only need to have the variants phased in any way you can/wish (the vcf2diploid tool only looks at the one column with info for the individual of interest and does not consider other columns). Having a trio sequenced is an easy and probably the best way to do it.

If you have the mother’s genotype only, then you can phase a good fraction of the heterozygous variants. Each unphased variant will be randomly assigned to a particular haplotype, so about half of them will also be correct. And, of course, all homozygous variants will be phased.
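
The assignment rules described above can be sketched as follows. This is not the vcf2diploid source code, only an illustration under the assumption that genotypes use the standard VCF GT notation (‘|’ for phased, ‘/’ for unphased). The function name assign_haplotypes and the hap1/hap2 labels are hypothetical, and missing genotypes such as ‘./.’ are not handled.

```python
import random


def assign_haplotypes(genotype, rng=random):
    """Return (hap1_allele, hap2_allele) allele indices for one GT string."""
    if "|" in genotype:
        # Phased: keep the order given in the VCF.
        a, b = genotype.split("|")
        return int(a), int(b)
    a, b = (int(x) for x in genotype.split("/"))
    # Homozygous calls are trivially phased. Unphased heterozygous calls
    # are placed on a random haplotype, so about half end up correct.
    if a != b and rng.random() < 0.5:
        a, b = b, a
    return a, b


print(assign_haplotypes("0|1"))   # phased heterozygous: order preserved
print(assign_haplotypes("1/1"))   # homozygous: same allele on both haplotypes
print(assign_haplotypes("0/1"))   # unphased heterozygous: random order
```

Passing a seeded random.Random instance as rng makes the random assignment reproducible between runs.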

Mismatches between the paternal and maternal chromosomes

Q:
I believe I have discovered numerous errors in the NA12878 dataset. We are working with the most recent version,
NA12878_diploid_genome_may3_2011. They are all single base pair mismatches between the paternal and maternal chromosomes in regions that the accompanying .map file marks as contigs.

A:
The .map file shows continuous equivalent (gap-free) blocks between the haplotypes, but these blocks do include SNPs. So, heterozygous SNPs will result in base mismatches within a block.
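
As a small illustration of this point, the following Python sketch compares the paternal and maternal sequence of one gap-free block and reports the offsets where they differ. The function name block_mismatches and the toy sequences are made up for the example; they are not taken from the NA12878 data.

```python
def block_mismatches(paternal_seq, maternal_seq):
    """Return 0-based offsets at which two block sequences differ."""
    if len(paternal_seq) != len(maternal_seq):
        raise ValueError("a gap-free block has the same length on both haplotypes")
    return [
        i
        for i, (p, m) in enumerate(zip(paternal_seq, maternal_seq))
        if p.upper() != m.upper()
    ]


# Toy block: same length on both haplotypes, one differing base.
print(block_mismatches("ACGTAC", "ACATAC"))  # [2]
```

Each reported offset would be expected to coincide with a heterozygous SNP in the individual’s variant calls.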