Zebrafish pseudogenes

Posted on May 14, 2013 by gersteinfaq

Q:
In Pseudogene.org, the zebrafish (Danio rerio) pseudogene datasets were based on old annotations (Ensembl 55?). There were about 1800 processed pseudogenes. However, according to a recent study (http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html), pseudogenes are rare in zebrafish (only 21 processed pseudogenes, according to Supplementary Table 14 of the published manuscript). Is this large discrepancy due to the old annotations?

A:
That is right: the results were based on an old genome assembly, Ensembl release 55, from 2009. We have noticed that the pseudogene count is far too high, which we believe is partly due to the quality of the genome assembly and partly because the pipeline parameters were optimized for primates. Given that, for up-to-date pseudogene information in zebrafish, please refer to the Nature publication.

VAT installation

Posted on May 8, 2013 by gersteinfaq

VAT seems to require a number of libraries. Is there a full writeup of VAT installation?

You can see the dependencies at http://vat.gersteinlab.org/download.php, and this blog post provides a good step-by-step outline of VAT installation: http://ngsda.blogspot.com/2011/06/vat.html

Van der Waals Radii

Posted on January 28, 2013 by gersteinfaq

Q:
I am writing about your article, The Packing Density in Proteins: Standard Radii and Volumes, published in JMB in 1999. In the article, in particular in Table 2, you list a series of radii associated with each atom according to the number of hydrogens attached to it and a number you call the “valence”. However, the valences of carbon are 2 and 4, and the list shows a valence 3 carbon; likewise, the valences of nitrogen are 3 and 5, and the table shows a valence 4 one. Could you please explain exactly what you mean by the term “valence”? In particular, I am interested in knowing the types of heavy atoms found in glutamine and alanine residues, and their radii.

A:
Here, the term “valence” is perhaps best described in Table 1 (instead of Table 2). What is meant by the “n-term” (here, used synonymously with valence) is usually a geometric descriptor designating the orientation of other atomic species around that atom (for example, n=4 usually means that the atom builds a tetrahedron, whereas n=3 usually means that the atom is trigonal planar). Strictly speaking, and perhaps more accurately, n just designates the total number of atoms bound to a central atom. So, in your example of carbon’s n=3 in Table 2, these are carbon atoms which are connected to 3 other atoms (an example of C3H0 may be the carbonyl carbon in a protein backbone, and C3H1 may be a carbon atom in a phenyl group of PHE). In your example of nitrogen’s n=4, the N4H3 may represent the epsilon-amino group in LYS, since it is bound to 4 other atoms (one carbon and 3 hydrogen atoms).
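
As a rough illustration of this reading of n, a label such as C3H0 can be assembled by counting the atoms bound to the central atom. This is only a minimal sketch and not code from the paper: the function name atom_type_label is hypothetical, and the neighbor lists are typed in by hand for the three examples mentioned above.

```python
def atom_type_label(element, bonded_elements):
    """Build a label such as 'C3H0' from an atom's element and its bonded neighbors.

    n is the total number of atoms bound to the central atom (hydrogens included);
    the H count is the number of those neighbors that are hydrogens.
    """
    n = len(bonded_elements)
    hydrogens = sum(1 for e in bonded_elements if e == "H")
    return f"{element}{n}H{hydrogens}"


# Hand-written neighbor lists (illustrative only, not parsed from a structure).
print(atom_type_label("C", ["C", "O", "N"]))       # backbone carbonyl carbon -> C3H0
print(atom_type_label("C", ["C", "C", "H"]))       # aromatic CH in PHE -> C3H1
print(atom_type_label("N", ["C", "H", "H", "H"]))  # epsilon-amino nitrogen in LYS -> N4H3
```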

PsiDR

Posted on October 23, 2012 by gersteinfaq

Q:
Where is the psiDR file?
A:
The file can be downloaded at: http://www.pseudogene.org/psidr/psiDR.v0.txt

Q:
Is H1-hesc included in the psiDR file?
A:
The chromatin state, promoter prediction and pol2 binding for pseudogenes in H1-hesc are included in the psiDR file.

Q:
Could you let me know briefly how the chromatin states in the psiDR file are determined?
A:
The chromatin states were assessed using the Segway segmentation. Segway annotates the genome using 25 different labels representing active and repressive marks. We use two selection criteria to pinpoint pseudogenes with active chromatin states:
(1) the frequency of the TSS is three times higher than the frequency of any repressive markers;
(2) the gene body start (GS), gene body middle (GM) and gene body end (GE) frequencies are two times larger than the frequency of the repressive markers.
The selection criteria were chosen to match the segmentation behavior of the active genes.
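
The two criteria can be sketched as a simple filter. This is a hedged illustration, not the code used to build psiDR. It assumes that both criteria must hold, that "n times higher" means strictly greater than n times the largest repressive frequency, and that per-pseudogene label frequencies are already available as a dictionary. The function name, the repressive label names "R1" and "R2", and all numbers below are made up.

```python
def has_active_chromatin(freq, repressive_labels, tss_fold=3.0, body_fold=2.0):
    """Return True if a pseudogene passes both selection criteria.

    freq maps a segmentation label name to its frequency over the pseudogene.
    Assumption: both criteria must hold, with strict inequalities.
    """
    # Largest frequency among the repressive labels (0.0 if none are given).
    max_repressive = max((freq.get(label, 0.0) for label in repressive_labels),
                         default=0.0)
    # Criterion 1: TSS frequency versus any repressive marker.
    tss_ok = freq.get("TSS", 0.0) > tss_fold * max_repressive
    # Criterion 2: gene body start, middle and end versus repressive markers.
    body_ok = all(freq.get(key, 0.0) > body_fold * max_repressive
                  for key in ("GS", "GM", "GE"))
    return tss_ok and body_ok


# Made-up frequencies for two hypothetical pseudogenes.
active = {"TSS": 0.30, "GS": 0.20, "GM": 0.15, "GE": 0.18, "R1": 0.05, "R2": 0.02}
weak = {"TSS": 0.10, "GS": 0.20, "GM": 0.15, "GE": 0.18, "R1": 0.05, "R2": 0.02}
print(has_active_chromatin(active, ["R1", "R2"]))  # True
print(has_active_chromatin(weak, ["R1", "R2"]))    # False
```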

Consult for help about PseudoPipe

Posted on October 23, 2012 by gersteinfaq

Q:
Why do the genomic sequences need to be repeatmasked before being input to the pipeline?

A:
This keeps the low-complexity regions of the genome out of the pseudogene search.

Q:
Which database should we use for the repeatmasking?

A:
Our current pipeline downloads genome data from Ensembl, where the repeats are detected with the RepeatMasker tool. More information about PseudoPipe can be found at: https://faq.gersteinlab.org/category/pseudogenes/.
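
For readers preparing their own genome, here is a minimal sketch of what masking does to a sequence. It assumes a soft-masked input in which repeats are written in lowercase, a common convention that you should verify for your own files. The helper names and the toy sequence are hypothetical, and this is not part of PseudoPipe.

```python
def hard_mask(sequence, mask_char="N"):
    """Replace lowercase (soft-masked) bases with mask_char; keep everything else."""
    return "".join(mask_char if base.islower() else base for base in sequence)


def masked_fraction(sequence):
    """Fraction of bases that are soft-masked (lowercase) or hard-masked (N)."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base.islower() or base == "N")
    return masked / len(sequence)


toy = "ACGTacacacacGGTTNN"  # made-up sequence, not real genome data
print(hard_mask(toy))        # ACGTNNNNNNNNGGTTNN
print(masked_fraction(toy))
```

Hard-masked bases cannot seed alignments, which is how the masked regions are kept out of the similarity search.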

>
> Dear Prof. Gerstein,
>
> My name is Yiling Lai, a PhD student from Prof. Xingzhong Liu’s group in Institute of Microbiology, Chinese Academy of Sciences. Our research focus on comparative genomics of nematode endoparasitic fungi Hirsutella spp. Now we start to analyse the genomic sequences and use the PseudoPipe from your published method to identify pseudogenes in these genomes. However, some questions confuse us when we use the pipeline. The first one is why the genomic sequences need to be repeatmasked before their inputs to the pipeline. The second question is which database we should use to do the repeatmasking, the repbase database or database established from de novo consensus sequences by RepeatScout? We would be very appreciated if you could give us some good suggestions. Thank you very much! We’re looking forward for your reply.
>
> Best wishes
>
>
> Yiling Lai
>
>
> State Key Laboratory of Mycology
>
> Institute of Microbiology
>
> Chinese Academy of Sciences
>
> No.3 1st Beichen West Road, Chaoyang District
>
> Beijing 100101, PR China

Data matrices and R scripts for the paper “Quantifying environmental adaptation of metabolic pathways in metagenomics”

Posted on July 10, 2012 by gersteinfaq

Q:
With respect to your metagenomics paper which appeared in PNAS (http://www.pnas.org/content/106/5/1374.long), I was wondering if it is still possible to access the entire (1) environmental features vs. geographic sites and (2) metabolic features vs. geographic sites matrices as used in the paper? Also, is it still possible to access the R scripts that were used in this work?

A:
See http://metagenomics.gersteinlab.org/

Conflicting SEQRES records

Posted on June 24, 2012 by gersteinfaq

Q:
I’m trying to visualize a morph between two structures. However, I’m prompted with the error message “Conflicting SEQRES records”. How may I resolve this?

A:
This is by far the most common cause of failure. The server uses SEQRES data to determine the actual protein sequence. However, this frequently conflicts with the sequence intuited from ATOM records (including in official PDB files from http://www.rcsb.org). If the server cannot automatically resolve these conflicts, it will fail. The easiest workaround is to submit files with the SEQRES records deleted; if you supplied a PDB ID rather than a file, you will need to download the proper file from the PDB, then modify and upload it. However, this may sometimes lead to other distortions, depending on the sequence numbering.
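
Deleting the SEQRES records can be done in any text editor. As one possible shortcut, here is a minimal sketch that drops every line starting with SEQRES and leaves the rest of the file untouched. The function names and the tiny sample record are hypothetical, and this is not a tool provided by the server.

```python
def strip_seqres(pdb_text):
    """Return PDB text with every SEQRES record removed; other lines are kept as-is."""
    kept = [line for line in pdb_text.splitlines(keepends=True)
            if not line.startswith("SEQRES")]
    return "".join(kept)


def strip_seqres_file(in_path, out_path):
    """Read a PDB file, remove its SEQRES records, and write the result to out_path."""
    with open(in_path) as src:
        cleaned = strip_seqres(src.read())
    with open(out_path, "w") as dst:
        dst.write(cleaned)


# Tiny made-up example (not a real structure).
sample = (
    "HEADER    EXAMPLE\n"
    "SEQRES   1 A    2  ALA GLY\n"
    "ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00 20.00           N\n"
    "END\n"
)
print(strip_seqres(sample))
```

For a downloaded file, a call such as strip_seqres_file("input.pdb", "input_noseqres.pdb") would write the cleaned copy to upload; both file names here are placeholders.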

Preventing submissions on the morph server from becoming public

Posted on June 22, 2012 by gersteinfaq

Q:
I understand that, by default, the structures I submit to your morph server become public on your database. However, I am submitting coordinates that have not yet been published, or which are commercial. Thus, I’d prefer that my submission not be made public. Can you help me?

A:
We strongly discourage private submissions because they go against the spirit of the database, which is not only intended to provide free morphing to individual users, but also to serve as a browseable and searchable repository of morphs which are useful to others. We understand, however, that some morphs may reveal confidential information and so beyond moral suasion nothing prevents you from using the “Private” check box on our single- and multi-chain morph submission forms. This sets a flag in our database that tells our movie gallery page not to display the morph. Also, the search tool on our front page will not return it. The only way a member of the public could possibly find your morph is if they were able to intercept the email you got from our server with the link to it. We consider the probability that this will happen to be very low. Generally we cannot handle specific requests for further security unless as part of an official collaboration.

Using CNVnator

Posted on June 4, 2012 by gersteinfaq

Q:
CNVnator appears to be very popular software, yet there seems to be no official guide or any directions on how to get started with it. Could you be kind enough to provide one, please? Does your license allow commercial services based on your program?

A:
Please download the software and read the README file.

Alex Abyzov

probe radius

Posted on April 24, 2012 by gersteinfaq

I would like to calculate the solvent accessible surface of certain proteins. Using your method (available online at http://helixweb.nih.gov/structbio/basic.html), one can set the probe radius to a maximum of 1.6 Å. Because I would like to mimic methylene groups as solvent, the probe radius should be ~1.9 Å (Johnson R.M. et al, Biochemistry 2006, 45, 8507-8515). Is it possible to increase the probe radius to 1.9 Å and calculate the accessible surface using your method?

I think you can do this by downloading the software from
http://www2.molmovdb.org/wiki/info/index.php/Macromolecular_Geometry

Gerstein Lab FAQs

Frequently Asked Questions

Author Archives: gersteinfaq