Query regarding paper “Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors”

Q:
I recently read your paper "Classification of human genomic regions
based on experimentally determined binding sites of more than 100
transcription-related factors" in Genome Biology, since I am interested
in enhancers. If I understand things correctly, you identified ~13k
putative enhancers in K562 cells, but I cannot locate the list of loci
in the supplemental materials. I was wondering if you would be willing
to share that list with me?

A:
See http://encodenets.gersteinlab.org/metatracks/

Query regarding BreakSeq usage

Q:

I am using BreakSeq to find the mechanisms of structural variations (SVs) that were mapped using a different package. I got stuck while using the svMech module, probably because it lacks a user manual.
I only want to find the mechanism of each SV, so I have commented out the ancestral state and feature analysis steps in the annotate script under the bin directory of BreakSeq.
It works fine if I give only deletions in the GFF file. But when I give insertions in the GFF file, it exits with the following error:

********** Creating standard breakpoint library **********
Traceback (most recent call last):
  File "/home/pankaj/breakseq/breakseq-1.3/bin/svUtil/svStd.py", line 20, in <module>
    out_fna.write(">%s\n%s\n"%(sv.id,sv.get_sequence()))
  File "/home/pankaj/breakseq/breakseq-1.3/lib/biopy/io/SV.py", line 103, in get_sequence
    return self.base.get_sequence(self.name, self.start, self.end)
AttributeError: 'NoneType' object has no attribute 'get_sequence'
Command exited with non-zero status 1
0.13user 0.04system 0:00.21elapsed 83%CPU (0avgtext+0avgdata 60800maxresident)k
0inputs+8outputs (0major+4306minor)pagefaults 0swaps

Could you please answer the following questions regarding BreakSeq?

(1) For insertions, do I need to provide the inserted sequence explicitly, or does the package find it internally?

(2) Does this package also find the mechanism of translocations? If yes, which keyword should I use in the 3rd column of the GFF file?

A:
1) You have to provide the inserted sequence (see http://sv.gersteinlab.org/breakseq/ for an example).

2) It does not currently support translocations (they are not mentioned in our paper).
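
Before running the pipeline, it may help to tally which SV types appear in the 3rd column of the GFF file, so that insertions (which need a sequence) and unsupported types can be spotted early. The following is only a minimal sketch: the function name `count_sv_types`, the toy records, and the type keywords "Deletion" and "Insertion" are assumptions for illustration, not taken from the BreakSeq documentation.

```python
from collections import Counter


def count_sv_types(gff_lines):
    """Tally the value of the 3rd GFF column (the SV type) across records."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a well-formed GFF record
        counts[fields[2]] += 1
    return counts


# Toy records; the type keywords and attribute column are assumed.
example = [
    "# toy GFF records for illustration only",
    "chr1\tSV\tDeletion\t1000\t2000\t.\t.\t.\tID=sv1",
    "chr1\tSV\tInsertion\t5000\t5000\t.\t.\t.\tID=sv2",
    "chr2\tSV\tDeletion\t300\t900\t.\t.\t.\tID=sv3",
]
print(dict(count_sv_types(example)))
```

Any type other than the ones you expect in the tally is worth checking by hand before running svMech.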

Search help for PseudoPipe program

Q:

Recently, I read your published paper "PseudoPipe: an automated pseudogene
identification pipeline" (Vol. 22 no. 12 2006, pages 1437–1439,
doi:10.1093/bioinformatics/btl116), which impressed me very much. I
really admire the excellent work by you and your co-workers.
After reading the paper, I downloaded the PseudoPipe program (Pipeline
Source Code) from http://pseudogene.org/ and tried to use it to identify
pseudogene sequences in a mammalian genome. However, I ran into some
problems during a preliminary test. I used the existing data
(caenorhabditis_elegans_62_220a) as input and installed Python 2.26;
however, the PseudoPipe program failed to run parseFastaAlignment.py. My
analysis suggests that the fastaAlign step did not work properly. This
puzzles me a lot, and I would appreciate it if you could help me solve it.

A:
Please note that PseudoPipe was written to discover pseudogenes in mammalian genomes; it does not work well in C. elegans.

ncRNA position doesn’t match

Q:
I have downloaded the psiDR for comparing the results with previously posted
lincRNAs at the UCSC web site.

I found that the following entry doesn’t match with the current hg19
positions assigned at the UCSC genome browser:

gene_id "ENSG00000224184.1"; transcript_id "ENSG00000224184.1"; gene_type
"lincRNA"; gene_status "NOVEL"; gene_name "AC096559.1"; transcript_type
"lincRNA"; transcript_status "NOVEL"; transcript_name "AC096559.1"; level 2;
tag "ncRNA_host"; havana_gene "OTTHUMG00000151709.2";

Coordinates at psiDR are: chr2:11,988,748-12,718,474

Coordinates at UCSC are: chr2:12,716,164-12,783,038

I don’t know whether this also happens with the coordinates of other
elements.

I can’t find a way to explain this difference other than a mistake in the
annotation process, but maybe I’m wrong and there is a better explanation.

A:
We use the GENCODE gene annotation model. If you check Ensembl for "ENSG00000224184.1", you will see that it matches the coordinates at psiDR.
I think the UCSC track includes the actual clone boundaries. You can e-mail the UCSC help desk; they are generally very responsive. Please bear in mind that coordinates also change a bit with updated genome assemblies as well as refined gene annotation models.
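
To see how the two reported regions relate, one can compute how much they overlap. The sketch below is illustrative only: the helper names `parse_region` and `overlap_length` are made up for this example, and it assumes both coordinate pairs are 1-based and inclusive on the same assembly.

```python
def parse_region(region):
    """Parse a string like 'chr2:11,988,748-12,718,474' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end


def overlap_length(region_a, region_b):
    """Number of shared bases, assuming 1-based inclusive coordinates."""
    chrom_a, start_a, end_a = parse_region(region_a)
    chrom_b, start_b, end_b = parse_region(region_b)
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


psidr = "chr2:11,988,748-12,718,474"  # coordinates reported for psiDR
ucsc = "chr2:12,716,164-12,783,038"   # coordinates reported for UCSC
print(overlap_length(psidr, ucsc))  # 2311
```

Under these assumptions the two regions share only about 2.3 kb at the end of the psiDR interval, so they are largely adjacent and not shifted copies of the same span.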

Question about definition of startOverlap and endOverlap in VAT

Q:
I have been using your variant annotation tool VAT and I have a question about what the definition is of startOverlap and endOverlap. I went through the example workflow and I have annotated another file, but I do not have any variants annotated with these types in my files. I went through your website but I could not find a listing of definitions for the terms. Thank you.

A:
These features are annotated only for indels. Essentially, when an indel affects the start of a gene or the end of a gene, it is annotated as startOverlap or endOverlap, respectively. You can find extensive documentation for VAT at vat.gersteinlab.org; please click on the "Documentation" tab. Please let me know if you have any more questions.
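
As a rough illustration of the idea (not VAT's actual implementation), the sketch below labels an indel by whether its span covers the first or the last position of a gene. The function name `classify_indel`, the 1-based inclusive coordinates, and the decision to ignore strand are all assumptions made for this example.

```python
def classify_indel(indel_start, indel_end, gene_start, gene_end):
    """Return overlap labels for an indel relative to a gene.

    Coordinates are assumed to be 1-based and inclusive, and strand is
    ignored, so "start" here simply means the lower gene coordinate.
    """
    labels = []
    if indel_start <= gene_start <= indel_end:
        labels.append("startOverlap")
    if indel_start <= gene_end <= indel_end:
        labels.append("endOverlap")
    return labels


# Hypothetical gene spanning positions 100-200.
print(classify_indel(95, 105, 100, 200))   # indel spans the gene start
print(classify_indel(195, 205, 100, 200))  # indel spans the gene end
print(classify_indel(120, 125, 100, 200))  # indel lies inside the gene
```

An indel that lies entirely inside the gene receives neither label, which is consistent with a file containing no such annotations when no indel crosses a gene boundary.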

Completion of job on coevolution server

Q:

We are trying to find the co-evolving positions in a protein family of interest. I submitted a job to the co-evolution server several days ago, but I have not received a response yet. Could you please let me know the estimated time of completion of my job?

A:
There was a long queue of pending tasks, and one of them had been stuck in the queue for some time. I have removed it to let the others run. Please see if you can get your results within a day. If not, please let me know and I will check the system again.

Question about using AlleleSeq tool

Q:

I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.

I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parents’ IDs, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.

I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.

A:
Thank you for your interest in the AlleleSeq pipeline.

The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.
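
A quick way to confirm which individuals a VCF file actually contains is to read the sample IDs from its `#CHROM` header line, where they follow the nine fixed columns. The sketch below is illustrative only: the helper names and the sample IDs (`MOTHER_ID`, `FATHER_ID`, `CHILD_ID`) are placeholders, not real 1000 Genomes identifiers and not part of AlleleSeq or vcf2diploid.

```python
def vcf_sample_ids(vcf_lines):
    """Return the sample IDs listed on the '#CHROM' header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # Columns 0-8 are CHROM..FORMAT; sample IDs start at index 9.
            return columns[9:]
    return []


def missing_samples(vcf_lines, required):
    """List the required sample IDs that are absent from the VCF header."""
    present = set(vcf_sample_ids(vcf_lines))
    return [sample for sample in required if sample not in present]


# Toy header with placeholder sample IDs (parents only).
header = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tMOTHER_ID\tFATHER_ID",
]
print(missing_samples(header, ["CHILD_ID", "MOTHER_ID", "FATHER_ID"]))
```

If the subject's ID is reported as missing, the file does not contain the genotypes needed for that individual.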