Chain files in AlleleSeq

Q1:
We are very interested in your AlleleSeq package. I have downloaded the maternal.chain and paternal.chain from http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip<http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip.

From the documentation @http://info.gersteinlab.org/AlleleSeq:
Chain files
Using the chain file, one can use the LifeOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

[liuh@helix NA12878_diploid_genome_dec16_2013]$ more maternal.chain
chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1
10329 1 0
109 1 0
30199 3 0
43187 4 0
40 1 0

My bed file:
[liuh@helix bcf3]$ awk ‘{print $1,$2,$3}’ cyto.vcf.bed |head
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907
chr1 14929 14930
chr1 15014 15015
chr1 16287 16288
chr1 16297 16298
chr1 16377 16378
chr1 16494 16495

My script:
module load ucsc; liftOver cyto.vcf.bed maternal.chain cyto.liftover unMapped.cyto

I tried to liftOver my bed from hg19 to hg18 without any problem. It means
that the bed file format should not have any issue.

A1:
thanks for interest to our software.
I may be wrong but it looks to me that the liftOver failed because of using different chromosome naming convention in .bed and .chain files.
In .bed file chromosomes are named with prefix ‘chr’, while in chain files they don’t have such prefix.

Q2:
Thanks so much for your time and kind help! It works well after I removed prefix ‘chr’ in bed file.

Do you have chain file(s) which can convert the annotation coordinates from maternal or paternal genome to reference genome?

I have mpileup and VarScan outputs with the annotation coordinates from maternal and paternal genome, and want to annotate it using annovar (but the annotation coordinates from the reference is required).

A2:
Incidentally, a previous user of AlleleSeq looking for conversion of mat/pat
to ref genome has given us an R code for public consumption. The only gripe
is we have not had time to test it extensively, hence we did not provide it
on the website previously.

I have the link here: http://alleleseq.gersteinlab.org/tools.html. The R
script converts mat/pat back to ref genome using the chain files provided by
AlleleSeq. Please see if it suits your purpose.

A question about CNVnator resullt

Q:

Recently,I use CNVnator software detecting dog genome CNVs
using dog genome resequensing data from illumina GA platform.I have get the sorted and removed duplicated .bam file using bwa and samtools and then use command as follows to get CNVs result:

./cnvnator -genome Canis_familiaris.CanFam3.1.71.dna_rm.toplevel.fa -root GW2.root -tree GW2_sort.bam
./cnvnator -genome Canis_familiaris.CanFam3.1.71.dna_rm.toplevel.fa -root GW2.root -his 1000 -d genome_split/
./cnvnator -root GW2.root -stat 1000
./cnvnator -root GW2.root -partition 1000
./cnvnator -root GW2.root -call 1000 >GW2_result

I get result file(GW2_result),and then I convert it to VCF format using cnvnator2V! CF.pl,and get GW2_result_vcf file.I found the result is same what werid (I am new in genome CNVs analysis) because I find so many large-sizes duplications and indels in genome.I think the result file need same filter.But I do not know how to filter and do not find any filter information and standard by google,can you help me?Thank you very much!One of my results is in attachment,please check!

A:
thanks for interest to CNVnator.
Not sure what do you mean by many. How many?
Perhaps, some of those are gaps in the reference genome. While duplications are around those gaps.

what kind of indels are incorporated into the diploid genome assembly of the NA12878 individual?

Q:

I would like to ask about what kind of indels are incorporated into the
diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this
genome. What makes me confused is that when I perform a BLAST search with
one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf
(P2_M_061510_21_73), it shows up in both the maternal and paternal
haplotypes. Is there any size cutoff used for the indels that have been
selected for this assembly?

A:
in the latest version no fosmid indesl/SVs were used. Only output of GATK.

Single Cell Allele Specific Expression and AlleleSeq

Q:
We have developed a method to perform single cell allele specific expression by labeling individual mRNA molecules with single base specificity in situ. In other words, we can differentially label and detect transcripts with a single SNP difference with our fluorescent probes.

As part of our validation, we were happy to find your published diploid genome of GM12878, and used it to design probes for various candidate genes to see what patterns of allelic imbalance we could see at a single cell level. We were hoping that you might be able to provide us with some form of quantification for the allelic imbalance (such as # of reads that aligned to the maternal versus paternal allele) for some of the genes on your list of "Genes with allele specific expression" that you published on your website, such as SKA and SUZ12. We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.

I look forward to hearing from you as being able to relate our single cell measurements with the genomic measurements is an exciting prospect, and I thank you again for having made available such a useful resource.

A:
thanks for interest to AlleleSeq.
Allelic imbalances can be inferred for each SNP from NA12878_AS_SNPs.vcf file. The file has counts for ref and alt alleles. Is it what you are asking?

We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.
If you point us to few such cases we would be happy to look and resolve the inconsistencies. However, my feeling is that, it is probably reflects differences in gene annotation in different databases.

Your paper sounds very cool. I would be happy to read it when it comes out or even before that (if you consider that possible).

prefix ‘chr’ in liftOver

Q:

From the documentation @http://info.gersteinlab.org/AlleleSeq:

Chain files

Using the chain file, one can use the LifeOver tool to convert the annotation coordinates from reference genome to personal haplotypes.

However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.

249242013 1

10329 1 0

109 1 0

30199 3 0

My bed file:

chr1 14541 14542

chr1 14652 14653

chr1 14676 14677

chr1 14906 14907

A:

It looked like the liftOver failed because of using different chromosome naming convention in .bed and .chain files. In .bed file chromosomes are named with prefix ‘chr’, while in chain files they don’t have such prefix.

fosmid indel

Q:

I would like to ask about what kind of indels are incorporated into the

diploid genome assembly of the NA12878 individual, available from your lab:

http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip

In the readme it says that 829,454 indels were used to construct this

genome. What makes me confused is that when I perform a BLAST search with

one 1.7 kb deletion from NA12878.2010_06.and.fosmid.deletions.phased.vcf

(P2_M_061510_21_73), it shows up in both the maternal and paternal

haplotypes. Is there any size cutoff used for the indels that have been

selected for this assembly?

A:

Unfortunately, in the latest version, no fosmid indels/SVs were used; only the variant output of GATK Best Practices v3 was used, even though fosmid data was indeed used to construct the earlier versions of the diploid genome. We might include them in the future. Thank you.

Do you need parents’ genotype data?

Q:

I am looking for a tool to detect allele specific expression from resequencing and RNA-seq data. I find AlleleSeq could be quite powerful. I noticed the input for the software needs parents genotype data; it requires a VCF file which contains trio genotype to create maternal and paternal genome. But in my case, if I only have genotype information from a single individual, how could I use AlleleSeq?

A:

You dont have to genotype parents. You only need to have variants phased in any way you can/wish (vcf2diploid tool only looks at one column with info for the individual of interest and does not consider other columns). Having trio sequenced is an easy and, probably, the best way to do it.

If you have the mothers genotype only, then you can phase a good fraction of heterozygous variants. Each unphased variants will be randomly assigned to a particular haplotype, so half of them will also be correct. And, of course, all homozygous variants will be phased.