AlleleDB: ChrX ASB data for CTCF?

Q:
I’m looking for data to tell me
whether there is ASB for CTCF at a handful of common SNPs on ChrX, but don’t
see ChrX in your database. Any chance you have this data anyway and could
look up a few common SNPs? Or know of another place I could find this info?

A:
Thank you for your interest in our AlleleDB resource! I am afraid the current AlleleDB version focuses on the autosomes only, and I haven’t worked on chr X data when I left. I have CCed Timur who has taken over the resource and might be able to provide more information/updates.

a question about 3V

Q:
Regarding your paper entitled "3V: cavity, channel and cleft
volume calculator and extractor", which I read carefully.

I’ve a question for you. In the abstract, it is written the following:
"It rapidly finds internal volumes by taking the difference between
two rolling-probe solvent-excluded surfaces,…", but I think you
mean "two imaginary rolling-probe solvent-excluded surfaces" because
after looking at your code, I haven’t seen any analytic SES
formulation therein. I guess you are just using two probe spheres of
distinct radii to account for cavities, not the analytic SES
themselves. Am I right?

A:
I am not certain about your use of the term "imaginary", but I would say my method is a "discrete approximation" to the SES. And because it is discrete (i.e. a 3D grid) one can simply subtract one grid from another. See attached figures.

With small grid sizes (0.2 A), I see very little discrepancy to the analytical solution.

problem in PseudoPipe

Q:
I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as following:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs
0

And I got the output in the attachment and attfollowing lines in the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’:
No such file or directory
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’:
No such file or directory
gzip:
/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa:
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe

Could you please help me in solving the problem?

A:
Looks like you have permission problems. The script tries to source the file setenvPipelineVars that you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus . If you open that file you’ll see a couple of export functions and from the look of it I would guess that you do not have rights to export to the Path. So I suggest you get admin rights and run as root.

1000G enquiry – Breakpoints File Interpretation

Q1:
I’m trying to interpret your breakpoints file at
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/breakpoints/1KG_phase3_all_bkpts.v5.txt.gz.

Is this file the same as Supplementary Table 3 in the SV map paper?

A1:
Yes, they are the same.

Q2:
What VCF should be used to interpret this file? I’m having difficulty
finding a VCF that has all the IDs accounted for.

Does the breakpoints file contain information that is meant to
override that in the VCF? So if the VCF and the breakpoints file
disagree on the position of a variant, the breakpoints file should be
considered correct?

A2:
The VCF file SV events are all SVs identified after taking their unions among other steps. The breakpoint file only contains SVs identified with breakpoint-level resolution by each variant caller. They do not override each other but should be treated as separate datasets. The breakpoint file can be considered to contain more detailed information of the SV region in the union call file.

Q3:
It looks like the breakpoints file contains an INSSEQ column, giving
(anchored) sequences that are inserted at the same time as deletion
events. That makes the deletion into a substitution of the shorter
sequence for the longer sequence, right?

A3:
Yes, these deletions contain mostly micro-insertions (1-20bp) at the deletion site.

Q4:
It would be ideal for my application if I could get a VCF containing
the information from this file. Is that already available? Have the
more precise breakpoint calls been rolled into e.g.
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz
already? If not, do you have advice on how to cram this information
into a VCF while preserving its semantics?

A4:
I am not aware of a breakpoint file in VCF format. You may start with considering including just the chromosome, start, end and type information.

AlleleSeq CNV file from CNVnator

Q:
I have enjoyed your papers on allele specificity and I have a question about
using AlleleSeq. I understand time is short and valuable and would very much
appreciate it. I am making the CNV files with format:

chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435

for input to the AlleleSeq pipeline.

I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3, into the required CNV file, and it appears to work. However, my
problem is that it is running prohibitively slowly.

My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.

If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.

The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.

The rate at which addRD runs slows down dramatically with time. Though I am
not proficient in c++, I examined the code to see if I could identify why it
is slowing with time. I guess it is due to the search through the ROOT file
for each window around each SNP. The search restarts from the beginning for
each SNP, and so as the SNP locations become further along the chromosome
this search takes longer. I imagine that a great deal of time could be saved
by initialising each search based on the previous search?

If this approach is not possible for me, please could you advise on whether
the following algorithm would be appropriate for input to AlleleSeq:

1) divide the BAM file into windows of with W and count the number of reads
in each window.

2) Calculate the mean Read Depth (perhaps as a function of GC content): mu

1) for each SNP in my SNV file:

Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the Read Depth

(perhaps correct RD for GC content)

Calculate the normalised read depth = RD/(2 mu L/W)

output the SNP location and normalised read depth to file.

A:
Have you tried using the latest version of Personal Genome Constructor?

http://alleleseq.gersteinlab.org/vcf2diploid_v0.2.6a.zip

When generating the .cnv file suitable for AlleleSeq, the pipeline uses bedtools to get read depth around each hetSNP instead of CNVnator. From my experience, this doesn’t take more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).