Question about publication data “Comparative analysis of pseudogenes across,three phyla”

Q:
I’m looking at some of the data connected with your recent publication and was wondering if I could get clarification on the BioType attribute in the following file:

http://www.pseudogene.org/psicube/data/Worm-Annotation.bed

In here there appear to be 3 biotypes

processed_pseudogene
pseudogene
unprocessed_pseudogene

Looking through the paper and the supplementary material I can find reference to processed_pseudogene and unprocessed_pseudogene, but not the generic pseudogene? Reading the S1 material I would not expect to see this 3rd biotype

–snip–
(a) Classification
Pseudogenes were classified as “processed” if they have lost their parental gene structures.
Conversely, we classified pseudogenes as “unprocessed”/ “duplicated” if they retained the
same exon-intron structure as their parent loci. In ambiguous cases we used other features to
resolve the provenance of the pseudogene. Where the pseudogene represented a fragment of
the parent, and the homology ended precisely at a splice junction the pseudogene was called
“unprocessed” (“duplicated”). Conversely, where the fragment contained the fusion of two or
more exons the pseudogene was called “processed”. If the parent had a single exon CDS, the
presence of parent gene structure in the 5′ UTR region (identified by alignment of mRNA and
EST evidence) allowed the pseudogene to be called “unprocessed”/“duplicated”. Meanwhile,
the presence of a pseudopoly(A) signal (the position of the parent poly(A) signal at the
pseudogene locus) followed by a tract of A-rich sequence in the genome (indicating the
insertion site of the polyadenylated parental mRNA) indicated a “processed” pseudogene. If
there was no other evidence available to resolve the route by which the pseudogene was
created, we used the position of the pseudogene relative to its parent. As such “processed”
pseudogenes are reinserted into the genome with an approximately random distribution while
“unprocessed”/“duplicated” pseudogenes tend to be more closely associated with the parent
locus. Parsimony therefore suggests that pseudogenes that lie near to the parent locus are
more likely to have arisen via a gene-duplication event than retrotransposition, and this was
used as a tie-breaker in defining the pseudogene biotype.
–snip–

I hope I haven’t missed anything obvious, but any clarification would help greatly.

A:
When we classify the pseudogenes according to their biotype we have processed pseudogenes and duplicated pseudogenes. This biotype is dependent on the pseudogene formation process (retrotransposition vs duplication) and this is the description that you see in the supplementary material. The third biotype that you find in some of the files on psicube website is actually not a biotype per se, these pseudogenes are most of the time highly degraded or short fragments and we could not assign with high confidence a definite biotype to them. In other words the pseudogenes with “pseudogene” as biotype have actually an undetermined biotype. But instead of saying “NA” (not available or unknown) we opted to simply call them “pseudogene".

Spectral biclustering

Q:
I recently read
your 2003 paper titled "Spectral biclustering of microarray data: Coclustering
genes and conditions".

I would like to investigate implementing your approach on a GPU.
Is there any code (Matlab? Python?) you would be willing to share as a result of the paper?

A:
Sorry we’re just using simple SVD routines in matlab. No meaningful code available. -marK

OrthoClust – for more than two species

Q:
I just read your recently published paper on OrthoClust approach. It is a well grounded work in both practically and mathematically point of views.

I ran your R scripts for my own data and It worked perfectly fine, however I am wondering how can I use the script for more than two species?

It could be appreciated if you help me to find the solution.

A:
Thanks for your interest in OrthoClust. Orthoclust definitely works on more than 2. The R script is a primitive version for illustrating the concept outlined in the paper. We understand the importance of N-species generalization. We have put a new MATLAB code for N-species. It made use of an efficient code written by Mucha and Porter that implemented the Louvain algorithm for modularity optimization. The 3rd party code as well as our wrapper is now in the gersteinlab github.
Apart from MATLAB, we are planning to provide wrapper for Python or R later.
The N-species code is not exactly the thing we did for the paper. So if you find any bug or question, please let me know. we are trying to make a more user friendly package anyway.

questions about AlleleDB and genotyping

Q:
I am using your lab database AlleleDB for allelic imbalance SNPs of ChIP-seq data and I really find it a formidable resource for further research.

I noticed that many of the significant SNPs for allelic imbalance are SNPs that have only the reference allele in the ChIP-seq data. See the first 10 lines as an example from your file ASB.auto.v2.1.aug16.txt.tgz that can be downloaded from AlleleDB database :

chr start end TF_indiv_ASB ref alt cA cC cG cT p.binomial p.betabinomial
chr1 91604 91605 SA1_NA19099_ASB C T 0 45 0 0 5.68E-14 0.000376021311925
chr1 714018 714019 POL2_NA18505_ASB A G 2 0 19 0 0.000221252441406 0.003114747121148
chr1 714018 714019 POL2_NA19099_ASB A G 2 0 19 0 0.000221252441406 0.002559182427128
chr1 762484 762485 RPB2_NA11894_ASB C A 0 11 0 0 0.0009765625 0.004566586027278
chr1 762600 762601 RPB2_NA11894_ASB T C 0 12 0 0 0.00048828125 0.002968280917731
chr1 840035 840036 SA1_NA18951_ASB T G 0 0 1 168 2.44E-37 0.001745366786663
chr1 937397 937398 SA1_NA19099_ASB G A 0 0 81 0 8.27E-25 5.03E-05
chr1 1167795 1167796 RPB2_NA11894_ASB C T 0 26 0 0 2.98E-08 2.29E-05
chr1 1407371 1407372 RPB2_NA11831_ASB T C 0 0 0 41 9.09E-13 0.000260638330353
chr1 1551110 1551111 POL2_NA19099_ASB C G 0 30 0 0 1.86E-09 1.67E-06

If I understand well allelic imbalance, only SNPs with genotypes that are heterozygous are tested for allelic imbalance.
In this case, the first SNP in the above table (for instance) has a genotype C/T, but shows 45 reads with C and 0 reads with T. This is indeed a strong allelic imbalance. In your data, there are many examples like that.

However, I heard many biologists saying that cell lines are not stable and mutations accumulate over time.
Is it possible that the first SNP is actually homozygous C/C in the cells used for ChIP-seq experiment because of mutations that occurred during cell culture, given that the genotyping by 1000 Genome project and the ChIP-seq experiment might have been done at different times or by different labs?

What do you think?

A:
Thank you for your interest in AlleleDB!
Right, only heterozygous SNVs are tested for the imbalance. We used the 1000G Project variants for these analyses, did not call them ourselves. Indeed, allelic calculations are very sensitive to genotype calls. Thus, a false-positive heterozygous SNV call, as well as a mutation in the cell-line as you described (not sure if there can be many of those), could lead to a false allelic imbalance call. If this is crucial for your analyses, a simple strategy would probably be filtering the call-set to keep only the SNVs where the ‘weaker’ allele is supported by at least one read, though this would remove some true strong imbalance calls as well.

question re. data in paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
As describe in your paper entitled "Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors", it is mentioned "we identified 13,539 potential enhancers (full list available in the Additional files), among which 50 were randomly chosen". But in the additional files, only 50 enhancer co-ordinates are mentioned. Can you please provide me either the source/list of the all 13,539 enhancers.

Many thanks in anticipation of your quick reply,

A:
see http://encodenets.gersteinlab.org/metatracks

related scripts or equations for implementing analyses in paper “Genomic analysis of the hierarchical structure of regulatory networks”

Q:
I recently have been working on constructing human regulatory networks. After reading your paper <Genomic analysis of the hierarchical structure of regulatory networks> published on PNAS, I found it very amazing and useful, which may be applied for my study. I want to construct hierarchical structure of transcription factors (TFs) in humans, and my data is the expression level of these TFs and their targets obtained by RNA sequencing. Can we use your BFS method to construct the network? As we know little about the computational algorithm of BFS, would you please provide related scripts or equations for implementing it easily?

Thank you very much for occupying your precious time reading my letter and I’m looking forward to your guidance.

A:
Hi, see http://info.gersteinlab.org/Hierarchy

Request for a SI document

Q1:
The file in the url:
http://www.nature.com/nature/journal/v489/n7414/extref/nature11245-s1.pdf
(SI of the paper "Architecture of the human regulatory network derived from
ENCODE data") is damaged and cannot be read.

Can you please send me a copy?

A1:
The file in the url:
http://www.nature.com/nature/journal/v489/n7414/extref/nature11245-s1.pdf
(SI of the paper "Architecture of the human regulatory network derived from
ENCODE data") is damaged and cannot be read.

Can you please send me a copy?

Q2:
Thank you for your prompt reply.

The nature11245__ALL.pdf file you generated works fine (using Adobe XI on Windows 7) but as I mentioned, the file download from nature’s website is damaged (on windows and Linux machines).

FYI, here is Linux stderr (using ocular):

Error: PDF file is damaged – attempting to reconstruct xref table…

Error: Couldn’t find trailer dictionary

Error: Couldn’t read xref table

Connecting to deprecated signal QDBusConnectionInterface::serviceOwnerChanged(QString,QString,QString)

A2:
Thank you for letting us know. This is actually a more serious problem than what I had been expecting. We may need to contact them about this very soon, as other users will experience the same problem. Thanks again.

It might be a problem of fonts that mac os has but windows/linux don’t. You might want to try produce the pdf on windows and try to open it on mac os and linux and if it works just substitute the file on nature site (which is not a trivial task I guess).

Ask for help about your paper (PLoS ONE, 2010)

Q:
Currently, I’m working on reconstructing gene regulatory network. It’s
really an interesting topic and I would like to estimate which tools is
suitable for our experimental data. I have read your published paper
"Improved Reconstruction of In Silico Gene Regulatory Networks by
Integrating Knockout and Perturbation
Data". In this paper, I can’t understand the section of learning noise from
deletion data.
Step 1: Calculate the probability of regulation Pb->a for each pair of genes
(b,a). I want to know how to calculate this probability, and this value of
probability can decide potential regulation or not?
I want to ask you that how to work in this section, and I’m appreciated if
you can help me to figure out.

A: Basically we used the expression levels currently believed to be
unaffected by a deletion to form a Gaussian background. Then if a gene
has an expression level far away from the mean of this Gaussian
distribution (by calculating the probability that the expression is as
extreme or more extreme than the observed one based on the Gaussian), we
consider the gene to be affected by the deletion.

Sequence and quality scores for Brainspan MRF file

Q1:
I am contacted you regarding some questions of "mrfQuantifier" in RSEQtools.

We have cloned many novel transcripts from the whole human brain and wanted
to relate their expression to specific periods/regions using RNA-seq data
from the Brainspan. We were able to download the MRF file that only contains
alignment blocks from Brainspan. We have noticed that you have described in
the Bioinformatics paper that mrfQuantifier does not perform the expression
quantification on the transcript level.

So, Could you let me know why mrfQuantifier does not perform on the
transcript level? Do you mean it does not address the problem of multireads?
Also, "mrf2sam" does not work with MRF files we downloaded from Brainspain
and it shows error "PROBLEM: Unknown presentColumn:
H376_V_51_A1C_L_RNASeq.mrf". Could that be it needs the sequences
information? Also, is any recent tool in RESQtools that would perform on the
transcript level for expression quantification?

A1:
iqseq can perform transcript level quant. Here is my understanding to your questions.

1) When RSEQtools were first published in 2011, it was designed to just take care of the gene or exon form level quantification. Isoform quantification is challenging, partially because of the multi-reads.

Then in 2012, our lab published another paper IQseq (https://code.google.com/p/iqseq/downloads/list). IQseq actually take all the reads mapped to one gene, for those mapped to multiple isoforms, EM algorithm is used to allocate the reads iteratively to each transcript.

2) For the second question, I am not sure what is happening. If possible could please send me the sample mrf files for me to check what’s going on there?

Hopefully this could help a little bit. Please let me know if you have any questions.

Q2:
We have our novel transcript annotations (eg, exon boundaries), and we downloaded all MRF format files of RNA-seq reads of developmental brain from Brainspan to try to quantify the expression levels.
Under "Supplemental Data" from http://www.brainspan.org/static/download.html , there are all the MRF files we downloaded. I have tried to run them with mrf2sam and failed with error.
My question is, would IQseq tool works with MRF files containing only alignment blocks (genomic coordinates only) without sequence information since Brainspan MRF files contain no such information? or it still needs sequences and quality score information to be able to run the tool. The complete set of MRF files can be obtained from Under "Supplemental Data" from http://www.brainspan.org/static/download.html for testing. If would be great if you could also check whether or not IQSeq would work on these set of MRF files.

A2:
After consulting some of my colleagues, I have the answer like this.

For one of the files, if you type

mrf2sam <H376_IIA_50_LGE_L_RNASeq.mrf H376_IIA_50_LGE_L_RNASeq.sam

it works, but if you type mrf2sam H376_IIA_50_LGE_L_RNASeq.mrf
it wont work.

I am not the author of RSEQtools, but I guess that the reason for the command line to be like this is that everything should be piped to avoid I/O for the middle files.

If you still can not figure it out, please let me know what it your command and the error message. Hope this would help!