Q:
I recently read your paper "Classification of human genomic regions
based on experimentally determined binding sites of more than 100
transcription-related factors" in Genome Biology, since I am interested
in enhancers. If I understand things correctly, you identified ~13k
putative enhancers in K562 cells, but I cannot locate the list of loci
in the supplemental materials. I was wondering if you would be willing
to share that list with me?
Daily Archives: May 2, 2019
query regarding Breakseq usage
Q:
I am using BreakSeq to find the mechanism of structural variations (SVs) mapped using a different package. I got stuck while using the svMech module, probably because it has no user manual.
I only want to find the mechanism of each SV, so I have commented out the ancestral state and feature analysis in the annotate script under the bin directory of BreakSeq.
It works fine if I give only deletions in the GFF file. But when I give insertions in the GFF file, it exits with the following error:
********** Creating standard breakpoint library **********
Traceback (most recent call last):
File "/home/pankaj/breakseq/breakseq-1.3/bin/svUtil/svStd.py", line 20, in <module>
out_fna.write(">%s\n%s\n"%(sv.id,sv.get_sequence()))
File "/home/pankaj/breakseq/breakseq-1.3/lib/biopy/io/SV.py", line 103, in get_sequence
return self.base.get_sequence(self.name, self.start, self.end)
AttributeError: 'NoneType' object has no attribute 'get_sequence'
Command exited with non-zero status 1
0.13user 0.04system 0:00.21elapsed 83%CPU (0avgtext+0avgdata 60800maxresident)k
0inputs+8outputs (0major+4306minor)pagefaults 0swaps
Could you please answer the following questions about BreakSeq?
(1) For insertions, do I need to provide the inserted sequence explicitly, or does the package find it internally?
(2) Does the package also find the mechanism of translocations? If yes, which keyword should I use in the 3rd column of the GFF file?
A:
1) You have to provide the inserted sequence (see http://sv.gersteinlab.org/breakseq/ for an example).
2) It does not currently support translocations (they are not mentioned in our paper).
Search help for PseudoPipe program
Q:
Recently, I read your
published paper "PseudoPipe: an automated pseudogene identification
pipeline" (Vol. 22 no. 12 2006, pages
1437–1439, doi:10.1093/bioinformatics/btl116), which impressed me very much. I
really admire the excellent work by you and your co-workers.
After reading the paper, I downloaded the PseudoPipe program (Pipeline
Source Code) at http://pseudogene.org/ and tried to use it to identify
pseudogene sequences in mammalian genomes. But I ran into some problems
during a preliminary test. I used the existing data
(caenorhabditis_elegans_62_220a) as input and installed Python 2.26; however, the
PseudoPipe program failed to run parseFastaAlignment.py. As far as I can tell, the
fastaAlign step did not work properly. It really puzzles me, and I would appreciate
it if you could help me solve this.
A:
Please note that PseudoPipe was written to discover pseudogenes in mammalian genomes; it does not work well in C. elegans.
ncRNA position doesn’t match
Q:
I have downloaded the psiDR for comparing the results with previously posted
lincRNAs at the UCSC web site.
I found that the following entry doesn’t match with the current hg19
positions assigned at the UCSC genome browser:
gene_id "ENSG00000224184.1"; transcript_id "ENSG00000224184.1"; gene_type
"lincRNA"; gene_status "NOVEL"; gene_name "AC096559.1"; transcript_type
"lincRNA"; transcript_status "NOVEL"; transcript_name "AC096559.1"; level 2;
tag "ncRNA_host"; havana_gene "OTTHUMG00000151709.2";
Coordinates at psiDR are: chr2:11,988,748-12,718,474
Coordinates at UCSC are: chr2:12,716,164-12,783,038
Don’t know whether or not that happens with the coordinates of other
elements.
I can’t find a way to explain this difference other than a mistake in the
annotation process, but maybe I’m wrong and there is a better explanation.
A:
We use the GENCODE gene annotation model. If you check Ensembl for "ENSG00000224184.1", you will see that it matches the coordinates at psiDR.
I think the UCSC track includes the actual clone boundaries. You can e-mail the UCSC help desk; they are generally very responsive. Please bear in mind that coordinates also change a bit with updated genome assemblies as well as refined gene annotation models.
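To see how far apart the two annotations are, one can compute how much the two reported intervals overlap. The sketch below is only an illustration: it assumes both coordinate strings are 1-based and inclusive, and the helper names `parse_region` and `overlap_length` are made up for this example.

```python
def parse_region(text):
    """Parse 'chr2:11,988,748-12,718,474' into (chrom, start, end)."""
    chrom, span = text.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start, end


def overlap_length(region_a, region_b):
    """Number of shared bases, assuming 1-based inclusive coordinates."""
    chrom_a, start_a, end_a = parse_region(region_a)
    chrom_b, start_b, end_b = parse_region(region_b)
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


psidr = "chr2:11,988,748-12,718,474"  # coordinates quoted from psiDR
ucsc = "chr2:12,716,164-12,783,038"   # coordinates quoted from UCSC
print(overlap_length(psidr, ucsc))    # 2311
```

Under these assumptions the two intervals share only about 2.3 kb at the end of the psiDR interval, so they clearly describe different extents rather than a small coordinate shift.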
Question about definition of startOverlap and endOverlap in VAT
Q:
I have been using your variant annotation tool VAT and I have a question about what the definition is of startOverlap and endOverlap. I went through the example workflow and I have annotated another file, but I do not have any variants annotated with these types in my files. I went through your website but I could not find a listing of definitions for the terms. Thank you.
A:
These features are annotated only for indels. Essentially, when an indel affects the start of a gene or the end of a gene, it is annotated as startOverlap or endOverlap, respectively. You can find extensive documentation for VAT at vat.gersteinlab.org. Please click on the "Documentation" tab. Please let me know if you have any more questions.
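As a rough illustration of the definition above (this is not VAT's implementation), the idea can be sketched as an interval check. The function name `classify_indel` is hypothetical, and the sketch assumes 1-based inclusive coordinates and a gene on the plus strand, so that the gene start is its lowest coordinate.

```python
def classify_indel(indel_start, indel_end, gene_start, gene_end):
    """Return 'startOverlap', 'endOverlap', or None.

    Hypothetical helper: assumes 1-based inclusive coordinates and a
    plus-strand gene. An indel covering the gene start is reported as
    startOverlap even if it also covers the gene end.
    """
    if indel_start <= gene_start <= indel_end:
        return "startOverlap"
    if indel_start <= gene_end <= indel_end:
        return "endOverlap"
    return None


print(classify_indel(95, 105, 100, 200))   # startOverlap
print(classify_indel(195, 205, 100, 200))  # endOverlap
print(classify_indel(120, 125, 100, 200))  # None
```

Under this reading, SNPs and indels that fall entirely inside or outside a gene would never receive these labels, which would explain why they do not appear in every annotated file.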
Completion of job on coevolution server
Q:
We are trying to find the co-evolving positions in a protein family of interest. I submitted a job on the co-evolution server several days ago, but I have not received a response yet. Could you please let me know the estimated time of completion of my job?
A:
There was a long queue of pending tasks, and one of them had been stuck in the queue for some time. I have removed it to let the others run. Please see if you can get your results within a day. If not, please let me know and I will check the system again.
Question about using AlleleSeq tool
Q:
I am a student doing a research project in allele-specific expression,
and am planning to use your lab’s AlleleSeq tool.
I am trying to use the YRI Population HapMap data. I went and tried to
find phased YRI trio data (from 1000 Genomes) to input into the
vcf2diploid tool. Unfortunately, I found data that includes only the
parent ID’s, but not the child’s. Since I don’t have the child data, I
am unable to use the AlleleSeq pipeline.
I was wondering if you could give me some suggestions on how to do ASE
given only the parental data.
A:
Thank you for your interest in the AlleleSeq pipeline.
The AlleleSeq pipeline assumes that the ‘child’ is the subject in which you
are trying to find ASE. Hence the genotypes of the subject are required.
Chain files in AlleleSeq
Q1:
We are very interested in your AlleleSeq package. I have downloaded the maternal.chain and paternal.chain from http://sv.gersteinlab.org/NA12878_diploid/NA12878_diploid_dec16.2012.zip.
From the documentation at http://info.gersteinlab.org/AlleleSeq:
Chain files
Using the chain file, one can use the liftOver tool to convert the annotation coordinates from reference genome to personal haplotypes.
However, when I tried to liftOver my bed file using maternal.chain, all returned unMapped.
[liuh@helix NA12878_diploid_genome_dec16_2013]$ more maternal.chain
chain 249198044 1 249250621 + 0 249250621 1_maternal 249242013 + 0 249242013 1
10329 1 0
109 1 0
30199 3 0
43187 4 0
40 1 0
…
My bed file:
[liuh@helix bcf3]$ awk '{print $1,$2,$3}' cyto.vcf.bed |head
chr1 14541 14542
chr1 14652 14653
chr1 14676 14677
chr1 14906 14907
chr1 14929 14930
chr1 15014 15015
chr1 16287 16288
chr1 16297 16298
chr1 16377 16378
chr1 16494 16495
…
My script:
module load ucsc; liftOver cyto.vcf.bed maternal.chain cyto.liftover unMapped.cyto
I was able to liftOver my bed file from hg19 to hg18 without any problem, which
suggests that the bed file format is not the issue.
A1:
Thanks for your interest in our software.
I may be wrong, but it looks to me as if liftOver failed because the .bed and .chain files use different chromosome naming conventions.
In the .bed file, chromosomes are named with the prefix ‘chr’, while in the chain files they have no such prefix.
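One way to make the names match is to strip the prefix from the first column of the bed file before running liftOver. This is a minimal sketch; the function names and the output filename are illustrative and not part of AlleleSeq or liftOver.

```python
def strip_chr_prefix(bed_line):
    """Remove a leading 'chr' from the chromosome column of one bed line."""
    if bed_line.startswith("chr"):
        return bed_line[3:]
    return bed_line


def convert_bed(in_path, out_path):
    """Write a copy of the bed file with 'chr' removed from each record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_chr_prefix(line))


print(strip_chr_prefix("chr1 14541 14542"))  # 1 14541 14542
```

For example, `convert_bed("cyto.vcf.bed", "cyto.nochr.bed")` would produce a file (the output name is hypothetical) that can be passed to liftOver together with maternal.chain.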
Q2:
Thanks so much for your time and kind help! It works well after I removed the ‘chr’ prefix in the bed file.
Do you have chain file(s) that can convert annotation coordinates from the maternal or paternal genome to the reference genome?
I have mpileup and VarScan outputs with coordinates on the maternal and paternal genomes, and I want to annotate them using annovar (which requires coordinates on the reference).
A2:
Incidentally, a previous user of AlleleSeq looking for conversion of mat/pat
to ref genome has given us an R script for public use. The only caveat
is that we have not had time to test it extensively, which is why we had not
provided it on the website previously.
I have the link here: http://alleleseq.gersteinlab.org/tools.html. The R
script converts mat/pat back to ref genome using the chain files provided by
AlleleSeq. Please see if it suits your purpose.
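For readers who want to see the idea behind such a conversion, here is a minimal sketch in Python. It is not the R script mentioned above. It assumes a single chain on the plus strand, laid out as in the UCSC chain format (header fields `tStart` and `qStart` at positions 5 and 10, then lines of `size dt dq`), with the reference as the first sequence and the haplotype as the second, as in the maternal.chain header quoted earlier. The toy chain is made up for illustration.

```python
def parse_chain(lines):
    """Return ungapped blocks as (ref_start, hap_start, size), 0-based."""
    header = lines[0].split()
    ref_pos, hap_pos = int(header[5]), int(header[10])  # tStart, qStart
    blocks = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            break  # a blank line ends the chain
        size = int(fields[0])
        blocks.append((ref_pos, hap_pos, size))
        ref_pos += size
        hap_pos += size
        if len(fields) == 3:  # gap sizes follow every block except the last
            ref_pos += int(fields[1])  # bases present only in the reference
            hap_pos += int(fields[2])  # bases present only in the haplotype
    return blocks


def haplotype_to_reference(blocks, hap_pos):
    """Map a 0-based haplotype position to the reference, or None in a gap."""
    for ref_start, hap_start, size in blocks:
        if hap_start <= hap_pos < hap_start + size:
            return ref_start + (hap_pos - hap_start)
    return None


# Made-up chain: two 1-bp deletions in the haplotype relative to the reference.
toy_chain = [
    "chain 1000 1 30 + 0 30 1_maternal 28 + 0 28 1",
    "10 1 0",
    "5 1 0",
    "13",
]
blocks = parse_chain(toy_chain)
print(haplotype_to_reference(blocks, 12))  # 13
```

Positions that fall in a haplotype-only insertion have no reference counterpart, so the sketch returns `None` for them; a real conversion would need a policy for such cases.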
Data associated w/paper “Construction and Analysis of an Integrated Regulatory Network Derived from High-Throughput Sequencing Data”
Q:
I recently read your article “Construction
and Analysis of an Integrated Regulatory Network Derived from
High-Throughput Sequencing Data”. In the last year, I measured mRNA and
miRNA expression in the different types of mouse skeletal muscle fibers to
discover the different regulatory circuits activated in fast and slow
myofibers. I designed a preliminary network using the databases of miRNA –
target mRNA and protein – protein interactions, and I have started to
include my expression data in order to understand the biological meaning. I
was wondering if it is possible to use your more accurate mouse regulatory
network for my data. Is this network free to use? In the article and in the
website of your laboratory I did not find any file or link with the complete
networks that you describe. I am not a computational biologist, but the
paper is very interesting and I think that the network that you design with
your method could be very useful for the scientific community.
A:
I attach three files for our three mouse networks: 1) how miRNAs target genes (this is not our calculation; it was downloaded from TargetScan),
2) how TFs target genes, and 3) how TFs target miRNAs, based on ChIP-Seq data for 12 TFs.
The files are in plain text format. The first column lists the regulators and the second column lists the targets. The bracket next to a gene name gives the class of the gene: TF for transcription factors, MIR for miRNAs, and X for non-TF protein-coding genes.
Thank you for your interest in our paper. I hope this information will be useful for your work.
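A short sketch of how files in the format described above might be read. The exact layout is an assumption: the code assumes tab-separated columns and a class label written directly after the name in square or round brackets, such as `Gata1[TF]`. The example names and the function names are hypothetical.

```python
import re

# A name followed by a bracketed class label: TF, MIR, or X (assumed layout).
NODE = re.compile(r"^(?P<name>.+?)\s*[\[(](?P<cls>TF|MIR|X)[\])]$")


def parse_node(field):
    """Split 'Gata1[TF]' into ('Gata1', 'TF'); class is None if absent."""
    text = field.strip()
    match = NODE.match(text)
    if match is None:
        return text, None
    return match.group("name"), match.group("cls")


def parse_edge(line):
    """Parse one 'regulator<TAB>target' line into two (name, class) pairs."""
    regulator, target = line.rstrip("\n").split("\t")[:2]
    return parse_node(regulator), parse_node(target)


print(parse_edge("Gata1[TF]\tmmu-miR-451[MIR]\n"))
# (('Gata1', 'TF'), ('mmu-miR-451', 'MIR'))
```

Collecting the parsed pairs into a list of edges gives a structure that can be joined with expression data by gene or miRNA name.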
accessing Database of Macromolecular Movements
Q:
I am developing software for assessing similarity among flexible proteins. I
would like to test it on the Database of Macromolecular Movements;
however, I found no means to download multiple files. I
was wondering whether it is possible to get a data set with files containing
protein motions without separately accessing each and every entry in your
database.
What I would like to have, if it is possible of course, is the curated
files of the conformational changes and the corresponding PDB IDs. Let
me know whether it is possible or not.
A:
This may be doable, but it depends on what exactly you need. Do you want the frame-by-frame morph files, the video files, or the PDB IDs, or some other form of the data? Also, we actually have two databases: one is a manually curated set of about 200 conformational changes, and the other database is user-submitted. If you tell me more about the kinds of things you may need, I can likely send you the compressed files.
At the first URL below, you will find the curated set of motions. The second URL is an outbox that’s prepared for you, which contains these morphs (frame-by-frame) from the curated motions:
http://www.molmovdb.org/cgi-bin/browse.cgi
http://homes.gersteinlab.org/people/dc547/.curated_morphs__G_Mate/
Please let us know if you have trouble getting access for any reason.