Q:
I need to identify pseudogenes for my research project, and I found the PseudoPipe program interesting. Could you send me the program?
A:
see pseudogene.org
Q:
I have a question about the following excerpt from page 37 of the supp.
materials:
"In this paper, we mainly present a TF-centric analysis. We have also
analyzed other types of genomic contexts, such as gene-centric contexts, to
reveal the effect of context-specific TF co-associations to gene expression,
as well as chromatin state contexts to reveal relationships of TF
co-associations to various enrichments of chromatin marks. We plan to
present these results in a future publication.”
Were those results ever published? If so, could you please point me to them?
I'm looking for updated regulatory networks based on ENCODE data.
A:
try:
http://papers.gersteinlab.org/papers/metatrack
http://papers.gersteinlab.org/papers/loregic
Q:
I have a question regarding zebrafish pseudogenes. I searched a few
zebrafish genes to check whether they have any pseudogenes in
pseudogene.org, and found that there are 15,779 zebrafish pseudogenes.
But the Nature reference that you mentioned in your blog has a total
of 154 zebrafish pseudogenes! Could you please let me know how one can
see those 154 pseudogenes, if I want to know whether my genes of
interest have pseudogenes or not?
A:
Pseudogene.org provides a set of pseudogenes resulting from automatic annotation. Zebrafish is a peculiar genome: it has undergone several large-scale genome duplications and is consequently full of repeats. As a result, the automatic annotation overstates the number of pseudogenes present. We followed up the automatic annotation with manual curation, which yielded a much smaller number of pseudogenes. Continuous improvements in the genome annotation lead to further improvements in the pseudogene annotation. I attach here the latest set of zebrafish pseudogenes.
Q:
I plan to use your program OrthoClust, but I am a little confused about the input network files. The README just says a list of the nodes; if the first column is node A, is the second column a node co-associated with node A?
A:
You are right. A net file is simply what some people call an edgelist: two numbers in a row form an edge. You can see examples in the data folder on GitHub.
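For concreteness, here is a hypothetical example of such an edgelist and a minimal parser; the file contents are made up for illustration, not taken from the OrthoClust repository:

```python
# Hypothetical OrthoClust-style .net edgelist: each line holds the two
# node IDs that form one edge (contents are invented for illustration).
edgelist_text = """\
1 2
1 3
2 3
4 5
"""

def parse_edgelist(text):
    """Return a list of (node_a, node_b) edges from edgelist text."""
    edges = []
    for line in text.strip().splitlines():
        a, b = line.split()
        edges.append((int(a), int(b)))
    return edges

edges = parse_edgelist(edgelist_text)
```

So a row `1 2` means nodes 1 and 2 are connected, exactly as you describe.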
Q:
I have conducted structure-based statistical coupling analyses (SCA) on
several mitochondrial proteins using 800 multiple sequences (including one
sequence from our organism, one 3RKO structure sequence, and 788 protein
sequences from different genera), and we obtained the coevolutionary
scores and spatial distances between any pair of residues. The aim of
our study is to analyze the coevolutionary role of certain given residues
(selected by PAML analyses) on the key residues responsible for proton
translocation in the proton-translocating channel of respiratory
Complex I. The problem is that we are not sure how to do this in a more
statistical way. For example, we have the scores and distances of a given
selected residue with respect to the residues in the proton channel and to
the other residues of the same protein. To look for a possible difference
in the coevolutionary role of a given (selected) residue on proton-channel
residues versus other residues, we compared the scores (s), distances (d),
and ratios (s/d) between those two types of residues using t-tests, but we
are not sure whether this kind of analysis is appropriate. For instance,
we don't know whether the score obtained by the SCA analysis on the
platform already takes the spatial distance into account, or whether it is
just the score regardless of where the two residues are. We know that the
influence between any two residues might be correlated with both their
characteristics and the spatial distance between them.
Do you have any suggestions, or a more rigorous statistical approach to
the questions above?
A:
The scores were calculated from the MSA alone, without
considering the spatial distance between residues.
You may want to plot the global distribution of scores and look
for scores that are significantly larger than the rest but cannot be
explained by distance along the primary sequence alone. Indirect
coupling between residues through other residues is also something to be
aware of. There have been many new papers on co-evolutionary
analysis lately (e.g., from Rama Ranganathan’s and Debora Marks’s labs).
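As a rough illustration of that suggestion, here is a sketch of flagging score outliers relative to pairs at a similar primary-sequence separation. This is not the SCA pipeline itself: the synthetic scores, bin width, and z-score cutoff are all assumptions made for the example.

```python
# Sketch: flag residue pairs whose coupling score is unusually high
# compared with pairs at a similar primary-sequence separation.
# Synthetic data, bin width, and z-cutoff are illustrative assumptions.
import random

random.seed(0)

# synthetic (score, separation) pairs standing in for SCA output:
# scores decay with sequence separation d, plus Gaussian noise
pairs = [(random.gauss(1.0 / (1 + d / 10), 0.1), d)
         for d in range(1, 200) for _ in range(20)]
pairs.append((1.5, 150))  # one planted outlier: strong coupling, far apart

def flag_outliers(pairs, bin_width=20, z_cut=4.0):
    """Return pairs scoring > z_cut SDs above their separation bin's mean."""
    bins = {}
    for s, d in pairs:
        bins.setdefault(d // bin_width, []).append(s)
    stats = {}
    for b, vals in bins.items():
        mu = sum(vals) / len(vals)
        sd = (sum((v - mu) ** 2 for v in vals) / len(vals)) ** 0.5 or 1e-9
        stats[b] = (mu, sd)
    return [(s, d) for s, d in pairs
            if (s - stats[d // bin_width][0]) / stats[d // bin_width][1] > z_cut]

hits = flag_outliers(pairs)
```

The planted distant pair stands out even after conditioning on sequence separation, which is the kind of signal worth following up (while still keeping the indirect-coupling caveat in mind).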
Q:
I’m looking for data to tell me
whether there is ASB for CTCF at a handful of common SNPs on ChrX, but don’t
see ChrX in your database. Any chance you have this data anyway and could
look up a few common SNPs? Or know of another place I could find this info?
A:
Thank you for your interest in our AlleleDB resource! I am afraid the current AlleleDB version covers the autosomes only, and I had not worked on chr X data by the time I left. I have CCed Timur, who has taken over the resource and might be able to provide more information/updates.
Q:
I tried to run the software PseudoPipe
(http://pseudogene.org/DOWNLOADS/pipeline_codes/ppipe.tar.gz) using the
example as following:
./pseudopipe.sh ~/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa
~/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa
/home/liuhui/bin/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs
0
And I got the output in the attachment and the following lines on the screen:
Making directories
Copying sequences
Fomatting the DNAs
Preparing the blast jobs
Finished blast
Processing blast output
Finished processing blast output
Running Pseudopipe on both strands
Working on M strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand M
Working on P strand
sh: 1: source: Permission denied
Finished Pseudopipe on strand P
Generating final results
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’:
No such file or directory
find:
`/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’:
No such file or directory
gzip:
/home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa:
No such file or directory
Finished generating pgene full alignment
Finished running Pseudopipe
Could you please help me in solving the problem?
A:
Looks like you have permission problems. The script tries to source the file setenvPipelineVars, which you will find in /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus and /home/liuhui/bin/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus. If you open that file you’ll see a couple of export statements, and from the look of it I would guess that you do not have the rights to export to the PATH. So I suggest you get admin rights and run as root.
Q1:
I’m trying to interpret your breakpoints file at
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/supporting/breakpoints/1KG_phase3_all_bkpts.v5.txt.gz.
Is this file the same as Supplementary Table 3 in the SV map paper?
A1:
Yes, they are the same.
Q2:
What VCF should be used to interpret this file? I’m having difficulty
finding a VCF that has all the IDs accounted for.
Does the breakpoints file contain information that is meant to
override that in the VCF? So if the VCF and the breakpoints file
disagree on the position of a variant, the breakpoints file should be
considered correct?
A2:
The SV events in the VCF file are all SVs identified after taking the union of the call sets, among other steps. The breakpoint file contains only the SVs identified at breakpoint-level resolution by each variant caller. They do not override each other and should be treated as separate datasets; the breakpoint file can be considered to contain more detailed information about the SV regions in the union call file.
Q3:
It looks like the breakpoints file contains an INSSEQ column, giving
(anchored) sequences that are inserted at the same time as deletion
events. That makes the deletion into a substitution of the shorter
sequence for the longer sequence, right?
A3:
Yes, these deletions contain mostly micro-insertions (1-20bp) at the deletion site.
Q4:
It would be ideal for my application if I could get a VCF containing
the information from this file. Is that already available? Have the
more precise breakpoint calls been rolled into e.g.
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/integrated_sv_map/ALL.wgs.integrated_sv_map_v2.20130502.svs.genotypes.vcf.gz
already? If not, do you have advice on how to cram this information
into a VCF while preserving its semantics?
A4:
I am not aware of a breakpoint file in VCF format. You may start by including just the chromosome, start, end, and type information.
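As a rough illustration of that suggestion, here is a sketch that turns breakpoint records into minimal VCF lines with symbolic ALT alleles, using the reserved `END` and `SVTYPE` INFO keys from the VCF specification. The four-column (chrom, start, end, type) record layout is an assumed stand-in, not the actual column layout of the breakpoint file.

```python
# Sketch: convert (chrom, start, end, type) breakpoint records into minimal
# VCF lines with symbolic ALT alleles. The record layout is an assumption
# for illustration, not the breakpoint file's real format.
records = [("1", 1000, 5000, "DEL"), ("2", 7000, 7600, "DUP")]

def to_vcf_lines(records):
    """Return header plus one symbolic-allele VCF line per SV record."""
    lines = [
        "##fileformat=VCFv4.1",
        '##INFO=<ID=END,Number=1,Type=Integer,Description="End position">',
        '##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of SV">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    for chrom, start, end, svtype in records:
        lines.append("%s\t%d\t.\tN\t<%s>\t.\t.\tEND=%d;SVTYPE=%s"
                     % (chrom, start, svtype, end, svtype))
    return lines

vcf_lines = to_vcf_lines(records)
```

Any extra columns from the breakpoint file (e.g. inserted sequence) could be carried along as additional INFO fields in the same way.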
Q:
I have enjoyed your papers on allele specificity, and I have a question about
using AlleleSeq. I understand your time is short and valuable, and I would
very much appreciate your help. I am making the CNV files with the format:
chrm snppos rd
1 52066 0.902113
1 695745 0.909802
1 742429 0.976435
for input to the AlleleSeq pipeline.
I am using the alleleSeq_cnvScript tool to convert the output from CNVnator
v0.3 into the required CNV file, and it appears to work. However, my
problem is that it runs prohibitively slowly.
My alignment BAM is from 1000 genomes Phase III low coverage WGS, and is
24GB in size.
If I process only Chromosome 1, I have a ROOT file of 156 MB. I have six
million SNPs in a SNV file of 171 MB.
The addRD program runs using only 4% of the 16GB of RAM I have available,
but will take many weeks to complete at the current rate.
The rate at which addRD runs slows down dramatically over time. Though I am
not proficient in C++, I examined the code to see whether I could identify
why. I guess it is the search through the ROOT file for the window around
each SNP: the search restarts from the beginning for each SNP, so as the
SNP locations move further along the chromosome, each search takes longer.
I imagine a great deal of time could be saved by initialising each search
from where the previous one ended?
If this approach is not possible for me, could you please advise on whether
the following algorithm would be appropriate for producing input to AlleleSeq:
1) Divide the BAM file into windows of width W and count the number of reads
in each window.
2) Calculate the mean read depth (perhaps as a function of GC content): mu.
3) For each SNP in my SNV file:
Use bamtools to select the reads in the window of size W centred on
the SNP location and calculate the read depth
(perhaps correcting RD for GC content).
Calculate the normalised read depth = RD/(2 mu L/W).
Output the SNP location and normalised read depth to file.
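The moving-pointer idea mentioned above (resuming each window search where the previous one ended rather than rescanning from the start of the chromosome) can be sketched as follows. This is a toy illustration with made-up read positions, not the addRD code:

```python
# Toy sketch of the moving-pointer window search: because SNPs are visited
# in coordinate order, each search resumes where the previous one ended
# instead of restarting from position zero (all positions invented).
read_positions = [10, 25, 40, 55, 70, 90, 120, 150]  # sorted read starts

def window_depths(read_positions, snp_positions, w=30):
    """Count reads within +/- w/2 of each SNP in one left-to-right pass."""
    depths = []
    lo = 0
    for snp in sorted(snp_positions):
        start, end = snp - w // 2, snp + w // 2
        while lo < len(read_positions) and read_positions[lo] < start:
            lo += 1  # never re-examine reads left behind by earlier windows
        hi = lo
        while hi < len(read_positions) and read_positions[hi] <= end:
            hi += 1
        depths.append(hi - lo)
    return depths

depths = window_depths(read_positions, [40, 100])
```

Because `lo` only ever moves forward, the whole pass is linear in the number of reads plus SNPs, rather than quadratic as in a restart-from-zero scan.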
A:
Have you tried the latest version of the Personal Genome Constructor?
http://alleleseq.gersteinlab.org/vcf2diploid_v0.2.6a.zip
When generating the .cnv file for AlleleSeq, that pipeline uses bedtools to get the read depth around each hetSNP instead of CNVnator. In my experience, this takes no more than a few hours on a ~100GB WGS .bam file (single thread, all chromosomes).
Q:
I am contacting you because I found your paper extremely interesting and very close to my own work. I would really like to present your work to my colleagues at our weekly lab meeting. Hence, I was wondering whether you have some slides that I could use for this purpose.
A:
see
http://lectures.gersteinlab.org/summary/Genomic-Privacy-n-Individualized-RNAseq-Incompatible-or-Feasible–20161111-i0idash16/
+
other privacy tagged stuff at
http://lectures.gersteinlab.org/summary/