I have enjoyed your papers on allele specificity and I have a question about
using AlleleSeq. I understand time is short and valuable and would very much
appreciate it. I am making the CNV files with format:
chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435
for input to the AlleleSeq pipeline.
I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3, into the required CNV file, and it appears to work. However, my
problem is that it is running prohibitively slowly.
My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.
If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.
The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.
The rate at which addRD runs slows down dramatically with time. Though I am
not proficient in c++, I examined the code to see if I could identify why it
is slowing with time. I guess it is due to the search through the ROOT file
for each window around each SNP. The search restarts from the beginning for
each SNP, and so as the SNP locations become further along the chromosome
this search takes longer. I imagine that a great deal of time could be saved
by initialising each search based on the previous search?
If this approach is not possible for me, please could you advise on whether
the following algorithm would be appropriate for input to AlleleSeq:
1) divide the BAM file into windows of with W and count the number of reads
in each window.
2) Calculate the mean Read Depth (perhaps as a function of GC content): mu
1) for each SNP in my SNV file:
Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the Read Depth
(perhaps correct RD for GC content)
Calculate the normalised read depth = RD/(2 mu L/W)
output the SNP location and normalised read depth to file.
Have you tried using the latest version of Personal Genome Constructor?
When generating the .cnv file suitable for AlleleSeq, the pipeline uses bedtools to get read depth around each hetSNP instead of CNVnator. From my experience, this doesn’t take more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).