Single Cell Allele Specific Expression and AlleleSeq

Posted on May 1, 2019 by gersteinfaq

Q:
We have developed a method to perform single cell allele specific expression by labeling individual mRNA molecules with single base specificity in situ. In other words, we can differentially label and detect transcripts with a single SNP difference with our fluorescent probes.

As part of our validation, we were happy to find your published diploid genome of GM12878, and used it to design probes for various candidate genes to see what patterns of allelic imbalance we could see at a single cell level. We were hoping that you might be able to provide us with some form of quantification for the allelic imbalance (such as # of reads that aligned to the maternal versus paternal allele) for some of the genes on your list of "Genes with allele specific expression" that you published on your website, such as SKA and SUZ12. We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.

I look forward to hearing from you as being able to relate our single cell measurements with the genomic measurements is an exciting prospect, and I thank you again for having made available such a useful resource.

A:
Thanks for your interest in AlleleSeq.
Allelic imbalance can be inferred for each SNP from the NA12878_AS_SNPs.vcf file, which contains read counts for the ref and alt alleles. Is that what you are asking for?
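
As a rough illustration of how a pair of ref/alt read counts can be turned into an imbalance measure, the sketch below computes the reference-allele fraction and a two-sided exact binomial p-value against a 50:50 expectation. The function names and the example counts are hypothetical, nothing is assumed about how the counts are laid out inside NA12878_AS_SNPs.vcf, and this is not necessarily the test that AlleleSeq itself applies.

```python
from math import comb


def ref_fraction(ref_count: int, alt_count: int) -> float:
    """Fraction of reads supporting the reference allele."""
    total = ref_count + alt_count
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return ref_count / total


def binomial_two_sided_p(ref_count: int, alt_count: int) -> float:
    """Two-sided exact binomial p-value under a 50:50 null.

    With p = 0.5 the distribution is symmetric, so the p-value is twice
    the lower tail up to the smaller count, capped at 1.
    """
    n = ref_count + alt_count
    k = min(ref_count, alt_count)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * lower_tail)


# Hypothetical counts for one heterozygous SNP (not taken from the VCF).
print(ref_fraction(30, 10))
print(binomial_two_sided_p(30, 10))
```

A fraction near 0.5 with a large p-value suggests balanced expression; when many SNPs are tested, the p-values would additionally need a multiple-testing correction.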

We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.
If you point us to a few such cases, we would be happy to look into them and resolve the inconsistencies. However, my feeling is that this probably reflects differences in gene annotation between databases.

Your paper sounds very cool. I would be happy to read it when it comes out or even before that (if you consider that possible).

Question re ENCODE data on website

Posted on May 1, 2019 by gersteinfaq

Q:
I’ve been incorporating the ENCODE data from your webpage (http://encodenets.gersteinlab.org/) in my analyses. The data is fantastic, but I have some questions regarding the enets*.GM_proximal_*filtered_network.txt data sets.

The filtered data set actually contains more regulators than the unfiltered data set, making me speculate that the unfiltered data file is not complete:
[bb447@compute-8-2 TF]$ cut -f1 enets6.GM_proximal_unfiltered_network.txt | sort -u | wc -l
50
[bb447@compute-8-2 TF]$ cut -f1 enets8.GM_proximal_filtered_network.txt | sort -u | wc -l
67

Could it be possible that the file is incomplete?

A:
The updated files have been uploaded to the site. Thanks again for pointing this out.
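
For readers who want to run the same kind of check, the sketch below lists the regulators that appear in the filtered file but not in the unfiltered one. It assumes, as the `cut -f1` commands in the question do, that the regulator is the first tab-separated column; the function names are hypothetical, and blank lines are skipped.

```python
import sys


def regulators(path: str) -> set:
    """Return the distinct values in the first tab-separated column."""
    found = set()
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:  # skip blank lines
                found.add(line.split("\t", 1)[0])
    return found


def missing_from_unfiltered(unfiltered_path: str, filtered_path: str) -> set:
    """Regulators present in the filtered file but absent from the unfiltered one."""
    return regulators(filtered_path) - regulators(unfiltered_path)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_regulators.py UNFILTERED.txt FILTERED.txt
    for name in sorted(missing_from_unfiltered(sys.argv[1], sys.argv[2])):
        print(name)
```

A non-empty result means the unfiltered file lacks regulators that the filtered file contains, which is the symptom described in the question.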

ACT software question

Posted on May 1, 2019 by gersteinfaq

Q:
I tried to use your ACT software to make an aggregation plot and am slightly confused.
If possible, could you please look at my input (at the end of this email) and
tell me where I have misunderstood?

1) Why does position -2 (-15bp) have signal 6.7 instead of (10+5+7)/3=7.3?
2) Why do positions <-2 (25bp away) have any values, and how were these values obtained (as my signal is only up to 17bp)?
3) How were the values for -1 and 0 obtained?

Thank you in advance,
Hennady.

bed.txt

chr1 20 25 +
chr1 280 288 +

signal.txt

chr1 1 10
chr1 2 5
chr1 3 7
chr1 300 6
chr1 301 8
chr1 302 9

execution
-bash-3.2$ python ACT.py --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# ACT.py --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# annotationCount: 2
Bin Center mean stdev
-10 -95 3.5 0.0
-9 -85 3.5 0.0
-8 -75 3.5 0.0
-7 -65 3.5 0.0
-6 -55 3.5 0.0
-5 -45 3.5 0.0
-4 -35 3.5 0.0
-3 -25 3.5 0.0
-2 -15 6.7 3.12889756943
-1 -5 7.0 0.0
0 4 7.0 0.0
1 14 7.0 0.0
2 24 7.8 3.69684550214
3 34 8.0 1.41421356237
4 44 8.0 1.41421356237
5 54 8.0 1.41421356237
6 64 8.0 1.41421356237
7 74 8.0 1.41421356237
8 84 8.0 1.41421356237
9 94 8.0 1.41421356237

A:
By default, ACT assumes that the signal file is a step-wise signal: each value holds until the next listed position. In your signal.txt file, for example, all positions on chr1 between nucleotides 3 and 300 are assigned the value 7, so the bin averages are not computed the way your (10+5+7)/3 calculation assumes.

In addition, your bed file has two annotations (20 to 25 and 280 to 288, both +). For bins below -2, only the values upstream of the 280 to 288 annotation are used.
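
The step-wise reading of the signal file can be sketched as follows. This is only an illustration of the interpretation described above, not ACT's own code: the function names are hypothetical, and how ACT treats positions before the first listed entry is not assumed here (the sketch simply leaves them undefined).

```python
from bisect import bisect_right

# Breakpoints from signal.txt in the question: (position, value).
SIGNAL = [(1, 10), (2, 5), (3, 7), (300, 6), (301, 8), (302, 9)]
POSITIONS = [pos for pos, _ in SIGNAL]


def stepwise_value(position: int):
    """Value at `position` when each entry holds until the next breakpoint.

    Returns None before the first breakpoint, where no value is defined.
    """
    index = bisect_right(POSITIONS, position) - 1
    if index < 0:
        return None
    return SIGNAL[index][1]


def window_mean(start: int, end: int) -> float:
    """Mean step-wise value over the inclusive range [start, end]."""
    values = [stepwise_value(p) for p in range(start, end + 1)]
    defined = [v for v in values if v is not None]
    if not defined:
        raise ValueError("no signal defined in this window")
    return sum(defined) / len(defined)


print(stepwise_value(3), stepwise_value(150), stepwise_value(299))  # 7 7 7
print(stepwise_value(300))  # 6
print(window_mean(260, 269))  # 7.0
```

Under this reading, a 10 bp window upstream of the second annotation (positions 260 to 269) averages to 7 even though signal.txt lists no entry there, which is why bins far from the listed positions still carry values.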

reference for web site partslist

Posted on May 1, 2019 by gersteinfaq

Q:
Could you give me a reference for your web site:

http://bioinfo.mbb.yale.edu/align/rankings/?updown=-&subcategory=Folds&category=occurrences&rankby=20genomes&half=30+Highest

A:
Thanks for your interest.

The reference is http://papers.gersteinlab.org/papers/partslist-nar

information about pseudogene.org

Posted on May 1, 2019 by gersteinfaq

Q1:

I am interested in using the information in your database to design PCR probes that would recognize usable and ensuing pseudogenes for several genes.

Do I need to obtain any type of written permission to use this information?

A1:
Nope!

Q2:
I checked one gene with 9 pseudogenes listed and tried to align the sequences to make PCR primers to detect the 10 copies. However, I realized that being a bit naïve about pseudogenes led me down the wrong path, as I thought the sequences would be more similar and suitable for estimating copy number when inserting foreign genes. While I did get regions that hit 3-6 of the 10 genes, it wasn’t consistent enough.

I was wondering if you have data on % conservation, or any type of algorithm that would predict the % conservation of pseudogene to gene and pull out the names/gene IDs and number of pseudogenes?

A2:
It would be helpful if you can tell us a bit more about what you are trying to do.
I assume you are looking at human pseudogenes. We do have percent identity between the parent protein and the pseudogene.

Q3:
I’m trying to figure out a sensible way to use the number of pseudogenes per gene as a natural standard curve for real-time PCR. See the attached Excel file. I chose at random genes with 9 down to 1 listed pseudogenes, which theoretically would allow me to target endogenous genes of different copy number and get some type of standard curve. This assumes equal efficiency, etc.

I didn’t pay attention to the column "Identity" but now I’m thinking I can sort out genes based on high identity and try again?

A3:
I think that identity should be taken into account when you are creating the standard curve. Also, note that the Excel file has a fraction column (after gene ID), which gives the fraction of a parent gene that is aligned to its pseudogene. The start and end coordinates of each alignment are also in the Excel file (the columns between protein ID and gene ID). You may want to take these into consideration too.
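
One way to apply this advice is to keep only pseudogenes that are both highly identical to the parent and aligned over most of its length, then recount the copies per gene. The sketch below is a minimal illustration under stated assumptions: the rows are made-up placeholders, the key names are hypothetical stand-ins for the spreadsheet columns, identity and fraction are assumed to be on a 0 to 1 scale, and the 0.9 thresholds are arbitrary.

```python
from collections import Counter

# Made-up placeholder rows; the real spreadsheet's column names and value
# scales may differ (identity and fraction are assumed to lie in 0..1).
ROWS = [
    {"gene_id": "GENE_A", "identity": 0.97, "fraction": 0.95},
    {"gene_id": "GENE_A", "identity": 0.62, "fraction": 0.40},
    {"gene_id": "GENE_B", "identity": 0.99, "fraction": 1.00},
]


def well_conserved(rows, min_identity=0.9, min_fraction=0.9):
    """Keep pseudogenes that are both highly identical and nearly full length."""
    return [r for r in rows
            if r["identity"] >= min_identity and r["fraction"] >= min_fraction]


def copies_per_gene(rows):
    """Count retained pseudogenes per parent gene."""
    return Counter(r["gene_id"] for r in rows)


kept = well_conserved(ROWS)
print(copies_per_gene(kept))
```

Genes whose retained pseudogene count still spans a range of values would be the more plausible candidates for a copy-number standard curve.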

Pseudogene minilist for PCR.xlsx

Architecture of the human regulatory network derived from

Posted on May 1, 2019 by gersteinfaq

Re: Architecture of the human regulatory network derived from ENCODE data
10.1038/nature11245

Hi Dr. Gerstein: This is a very nice paper and is very important for my
current study. Do you have tools/software for the TF co-association analysis
(figure 1 and supplemental sections B and C) mentioned in this paper? Can I get them?

A:
Anshul did the co-association analysis for this Networks paper. I
think he knows that part the best.

As for the co-association analysis in the ENCODE main paper, it can
be repeated using the GSC package available at the ENCODE statistics web
site (http://www.encodestatistics.org/). The first thing you need to do
is to determine (manually or by other means) a segmentation of the
genome, where TF binding is assumed segment-wise stationary. If you have
no specific preference on how the segmentation should be done, you can
use the GSC Python segmentation tool to do that, which will try to
perform an automatic segmentation (the results of which would be better
if you have more data). Then you can run the GSC Python program to
perform segmented block sampling to compute pairwise p-values of your
binding data.

DREAM 3 challenge & paper “Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data”

Posted on May 1, 2019 by gersteinfaq

Q1:
I am interested in exploring further the work done by you and your team members in the DREAM 3 challenge, as reported in the paper stated below. Do you provide the code/program for the public to view? Thanks.

"Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data"

A1:
The current form of the software is quite tailored to the competition, and we do not have a general, publicly distributable version. I can send it to you if you think it would be useful.

Q2:
I am OK with the current software, which you said is quite tailored to the competition. Please send it to me. I really appreciate it. Thanks.

A2:
Please find the version that we submitted to DREAM attached, together with some data and some script files for running it. If you have Apache Ant installed, simply issue the command "ant runall3" to run the program on the DREAM3 files. The size-10 networks are included, and the size-50 and size-100 networks can be downloaded from the DREAM web site.

Data received – Re: Your model and input data to the “…integrative analysis of transcription factor binding data” paper

Posted on May 1, 2019 by gersteinfaq

Q:
Many thanks for the excellent ENCODE papers! This is an unprecedented source for life scientists, and we appreciate that accordingly!

Would you be so kind as to give us access to the model and input data for your random forest model that predicts gene expression based on transcription factor binding?

Could you please also name the source of TSS CAGE? At UCSC, our only suspects were the Riken CAGE*TSS files, or CSHL LongRNA and ShortRNA files.
We would like to run and to adapt your model to the extremely tight co-regulation of ribosome protein genes. We believe that the ENCODE TF’s may account for a major part of their regulation.

Naturally, we would properly cite your works (incl. Cheng & Gerstein, 2011). Should you prefer, we are open to any reasonable forms of collaboration.

A:
See http://archive.gersteinlab.org/proj/chromodel

The human TSS CAGE data are from Roderic’s Lab.

Here is the human CAGE TSS file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.gff.gz

Here is a readme file:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_June2011.txt

and here are some additional explanations of how the file was made:
ftp://genome.crg.es/pub/Encode/data_analysis/TSS/Gencodev7_CAGE_TSS_clusters_june2011.pdf

The cost of sequencing cost of library preparation & the cost of running the sequencer

Posted on May 1, 2019 by gersteinfaq

Q:
I’m quite interested in the cost of sequencing a genome. I read the paper "The real cost of sequencing: higher than you think!", which was published in Genome Biology. It’s a good paper, but the authors didn’t separate the cost of library preparation from the cost of running the sequencer. So my question is: from your experience, could you tell me the ratio of the cost of library preparation to the cost of running the sequencer? It’s quite important for me in designing an experiment.

A:
As indicated in the paper, the two numbers you mentioned are the cost of library preparation ($500) and the cost of running the sequencer ($6000), respectively, i.e. a ratio of roughly 1:12. Note that current figures may be different.

Data re “Architecture of the human regulatory network derived from ENCODE data”

Posted on May 1, 2019 by gersteinfaq

Q:
I am very familiar with the ENCODE TF datasets, as I’ve been applying them to various problems in my PhD. I am interested in the expression analysis across human tissues for the ((miR –> TF) –> targets) FFL. The Supplementary file (section H) references the protein-coding expression atlas of Su et al. 2004 for the TF and protein-coding targets in this loop, but there doesn’t seem to be a reference for the corresponding miRNA expression data. I assume it would be Landgraf et al. 2007, ‘A mammalian microRNA expression atlas based on small RNA library sequencing’, since this allows matched tissues and samples with Su et al. However, it might be some other dataset. It would be helpful to be able to replicate/extend the FFL analysis using the correct data. Would you be able to forward this email to the relevant person(s) to confirm whether the microRNA expression was taken from the Landgraf atlas? Many thanks for your help.

Slight correction: The FFL studied for expression pattern of
components is the other way round: ((TF –> miR) –> targets).

A:
The miRNA expression data are actually from
Lu et al., Nature 2005
http://www.nature.com/nature/journal/v435/n7043/full/nature03702.html

If you go to
http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
and look under the heading "MicroRNA Expression Profiles Classify Human Cancers",
you will find the files

Common_miRNA.gct
and
Common_Affy.zip

Gerstein Lab FAQs

Frequently Asked Questions

Daily Archives: May 1, 2019
