Single Cell Allele Specific Expression and AlleleSeq

We have developed a method to perform single cell allele specific expression by labeling individual mRNA molecules with single base specificity in situ. In other words, we can differentially label and detect transcripts with a single SNP difference with our fluorescent probes.

As part of our validation, we were happy to find your published diploid genome of GM12878, and used it to design probes for various candidate genes to see what patterns of allelic imbalance we could see at a single cell level. We were hoping that you might be able to provide us with some form of quantification for the allelic imbalance (such as # of reads that aligned to the maternal versus paternal allele) for some of the genes on your list of "Genes with allele specific expression" that you published on your website, such as SKA and SUZ12. We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.

I look forward to hearing from you as being able to relate our single cell measurements with the genomic measurements is an exciting prospect, and I thank you again for having made available such a useful resource.

Thanks for your interest in AlleleSeq.
Allelic imbalance can be inferred for each SNP from the NA12878_AS_SNPs.vcf file, which has read counts for the ref and alt alleles. Is that what you are asking for?
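
As a rough illustration of how per-SNP ref and alt counts can be turned into an imbalance measure, here is a minimal sketch in Python. It assumes the two counts have already been read out of the file (the exact field layout is not shown here), and it uses an exact two-sided binomial test against an expected 50/50 ratio. The function name and the example counts are hypothetical.

```python
from math import comb

def allelic_imbalance(ref_count, alt_count):
    """Return (reference-allele fraction, two-sided binomial p-value vs. 0.5)."""
    n = ref_count + alt_count
    if n == 0:
        raise ValueError("no reads cover this SNP")
    ref_fraction = ref_count / n
    # Under p = 0.5 every outcome k has probability comb(n, k) / 2**n, so the
    # two-sided p-value sums all outcomes no more likely than the observed one.
    observed = comb(n, ref_count)
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return ref_fraction, tail / 2 ** n

# Hypothetical counts, for illustration only.
balanced = allelic_imbalance(5, 5)
skewed = allelic_imbalance(10, 0)
```

When many SNPs are tested, the p-values would normally need a multiple-testing correction, which this sketch does not include.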

We also noted that some of the genes on the "Genes with allele specific expression" have heterozygous SNPs in them that are not on the "SNP’s resulting in allele specific behavior table" and vice versa, so we weren’t exactly sure how to interpret that difference.
If you point us to a few such cases, we would be happy to look into them and resolve the inconsistencies. However, my feeling is that this probably reflects differences in gene annotation between databases.

Your paper sounds very cool. I would be happy to read it when it comes out or even before that (if you consider that possible).

Question re ENCODE data on website


I’ve been incorporating the ENCODE data from your webpage in my analyses.
The data is fantastic, but I have
questions regarding the enets*.GM_proximal_*filtered_network.txt data.

The filtered dataset actually contains more regulators than the unfiltered
data set, making me speculate that the unfiltered data file is not complete:
[bb447@compute-8-2 TF]$ cut -f1 enets6.GM_proximal_unfiltered_network.txt | sort -u | wc -l
[bb447@compute-8-2 TF]$ cut -f1 enets8.GM_proximal_filtered_network.txt | sort -u | wc -l

Could it be possible that the file is incomplete?

The updated files have been uploaded to the site. Thanks again for pointing this out.

ACT software question

I tried to use your ACT software for an aggregation plot and am slightly confused.
If possible, could you please look at my input (at the end of this email) and
tell me what I have misunderstood?

1) Why does position -2 (-15bp) have signal 6.7 instead of (10+5+7)/3=7.3?
2) Why do positions <-2 (25bp away) have any values, and how were these values obtained (as my signal is only up to 17bp)?
3) How were the values for -1 and 0 obtained?

Thank you in advance,


chr1 20 25 +
chr1 280 288 +


chr1 1 10
chr1 2 5
chr1 3 7
chr1 300 6
chr1 301 8
chr1 302 9

-bash-3.2$ python --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# --nbins=10 --mbins=0 --radius=100 bed.txt signal.txt
# annotationCount: 2
Bin Center mean stdev
-10 -95 3.5 0.0
-9 -85 3.5 0.0
-8 -75 3.5 0.0
-7 -65 3.5 0.0
-6 -55 3.5 0.0
-5 -45 3.5 0.0
-4 -35 3.5 0.0
-3 -25 3.5 0.0
-2 -15 6.7 3.12889756943
-1 -5 7.0 0.0
0 4 7.0 0.0
1 14 7.0 0.0
2 24 7.8 3.69684550214
3 34 8.0 1.41421356237
4 44 8.0 1.41421356237
5 54 8.0 1.41421356237
6 64 8.0 1.41421356237
7 74 8.0 1.41421356237
8 84 8.0 1.41421356237
9 94 8.0 1.41421356237

The default for ACT is to assume that the signal file is a step-wise signal input, so for example in your signal.txt file, all positions on chr1 between nucleotides 3 and 300 are assigned the value 7 (hence it is not acting as your calculation below might suggest).

In addition, your bed file has two annotations (20 to 25 and 280 to 288, both +). For the positions <-2 bins away, only the values upstream of the 280 to 288 annotation are used.
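
The step-wise reading can be checked with a small sketch. This is not ACT's code. It assumes that bins are 10 bp wide and anchored at each annotation's start, that positions with no defined signal (before the first signal entry, or off the start of the chromosome) contribute 0, and that each bin total is divided by the bin width times the annotation count. Under those assumptions it gives the same values as the "mean" column in the output above (for example 3.5, 6.7, 7.0 and 7.8). It does not attempt the stdev column.

```python
from bisect import bisect_right

# signal.txt and bed.txt from the question, as (position, value) and starts.
signal = [(1, 10.0), (2, 5.0), (3, 7.0), (300, 6.0), (301, 8.0), (302, 9.0)]
starts = [20, 280]  # both annotations are on the + strand

positions = [p for p, _ in signal]
values = [v for _, v in signal]

def step_value(pos):
    """Value of the last signal entry at or before pos; 0 if there is none."""
    i = bisect_right(positions, pos)
    return values[i - 1] if i > 0 else 0.0

def bin_mean(bin_index, width=10):
    """Average step-wise signal in one bin, pooled over all annotations."""
    total = 0.0
    for start in starts:
        lo = start + bin_index * width
        total += sum(step_value(p) for p in range(lo, lo + width))
    return total / (width * len(starts))

means = {b: round(bin_mean(b), 1) for b in range(-10, 10)}
```

In this reading, bin -2 covers positions 0 to 9 for the first annotation (values 0, 10, 5, then 7 seven times) and 260 to 269 for the second (all 7), which is where 6.7 comes from. Bins further upstream fall off the chromosome for the first annotation, so only the second one contributes.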

information about


I am interested in using the information in your database to design PCR probes that would recognize usable and ensuing pseudogenes for several genes.

Do I need to obtain any type of written permission to use this information?


I checked one gene with 9 pseudogenes listed and tried to align the
sequences to make PCR primers to detect the 10 copies. However, I realized
that being a bit naïve about pseudogenes led me down the wrong path, as I
thought the sequences would be more similar and therefore suitable for
estimating copy number when inserting foreign genes. While I did get regions
that hit 3-6 of the 10 genes, it wasn’t consistent enough.

I was wondering if you have the data about % conservation or any types of
algorithms that would predict the % conservation of pseudogene to gene and
pull out those names/gene Ids and number of pseudogenes?

It would be helpful if you can tell us a bit more about what you are trying to do.
I assume you are looking at human pseudogenes. We do have percent identity between the parent protein and the pseudogene.

I’m trying to figure out a sensible way to use the numbers of the pseudogene/gene as a natural standard curve for real time PCR. See attached Excel file. I chose at random genes with 9 to 1 listed pseudogenes, which theoretically would allow me to target endogenous genes of different copy number and get some type of standard curve. This is assuming equal efficiency etc.

I didn’t pay attention to the column "Identity" but now I’m thinking I can sort out genes based on high identity and try again?

I think that identity should be taken into account when you are creating the standard curve. Also, note that in the Excel file there is a "fraction" column (after gene ID), which indicates the fraction of a parent gene aligned to its pseudogene. The start and end coordinates of an alignment are also in the Excel file (columns between protein ID and gene ID). You may want to take these into consideration too.
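
One way to act on this is to count, for each parent gene, only the pseudogenes that pass an identity and a fraction cutoff, and to use one plus that count as the expected number of primer targets. The sketch below assumes the rows have already been read from the spreadsheet as (parent gene ID, identity, fraction) tuples. The gene names, the values, and the cutoffs are all made up for illustration.

```python
from collections import Counter

# Hypothetical rows: (parent gene ID, identity, fraction aligned).
rows = [
    ("GENE_A", 0.98, 0.95),
    ("GENE_A", 0.97, 0.90),
    ("GENE_A", 0.71, 0.40),  # too divergent and too short to count
    ("GENE_B", 0.99, 1.00),
]

def expected_copies(rows, min_identity=0.95, min_fraction=0.90):
    """Map each parent gene to 1 + the number of pseudogenes passing both cutoffs."""
    passing = Counter(
        gene for gene, identity, fraction in rows
        if identity >= min_identity and fraction >= min_fraction
    )
    genes = {gene for gene, _, _ in rows}
    return {gene: 1 + passing[gene] for gene in genes}

copies = expected_copies(rows)
```

The identity in the table is measured at the protein level, so primer sites would still need to be checked against the nucleotide sequences.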

Pseudogene minilist for PCR.xlsx

Architecture of the human regulatory network derived from ENCODE data


Re: Architecture of the human regulatory network derived from ENCODE data

Hi Dr. Gerstein: This is a very nice paper and is very important in my
current study. Do you have tools/software for the TF co-association analysis
(figure 1 and supplemental sections B and C) mentioned in this paper? Can I get it?

Anshul did the co-association analysis for this Networks paper. I
think he knows that part the best.

As for the co-association analysis in the ENCODE main paper, it can
be repeated using the GSC package available at the ENCODE statistics web
site. The first thing you need to do
is to determine (manually or by other means) a segmentation of the
genome, where TF binding is assumed segment-wise stationary. If you have
no specific preference on how the segmentation should be done, you can
use the GSC Python segmentation tool to do that, which will try to
perform an automatic segmentation (the results of which would be better
if you have more data). Then you can run the GSC Python program to
perform segmented block sampling to compute pairwise p-values of your
binding data.


DREAM 3 challenge & paper “Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data”


I am interested in exploring further the work done by you and your team
members in the DREAM 3 challenge, as reported in the paper stated below. Do you
make the code/program available for the public to view? Thanks.

"Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data"

I am OK with the current software, which you said is quite tailored to the competition. Please send it to me. I really appreciate it. Thanks.

The current form of the software is quite tailored for the
competition, and we do not have a general, publicly distributable
version. I can send it to you if you think it would be useful.

Please find the version that we submitted to DREAM attached, together with some data and some script files for running it. If you have Apache Ant installed, simply issue the command "ant runall3" to run the program on the DREAM3 files. The size-10 networks are included, and the size-50 and size-100 networks can be downloaded from the DREAM web site.