Installing VAT

Q:
Thank you for offering VAT free to the scientific community. A few users of our cluster have requested that this software be installed on our 6000-node cluster.
But I ran into a problem while trying to compile it. Hopefully you can provide some help. Thank you in advance.

I have followed your instructions at
http://vat.gersteinlab.org/documentation.php
but when I try to make VAT, I get the following errors. I have attached the
config.log file:

/data/rl/vat-2.0.0> env | eg FLAGS

LDFLAGS=-L/usr/local/gsl-1.14/lib -L/usr/local/gd-2.0.35/lib
-L/usr/local/libs3-2.0/lib -L/usr/local/bios/lib
CPPFLAGS=-I/usr/local/gsl-1.14/include -I/usr/local/gd-2.0.35/include
-I/usr/local/libs3-2.0/include -I/usr/local/bios/include

/data/rl/vat-2.0.0> make
make all-recursive
make[1]: Entering directory `/gs1/users/rl/vat-2.0.0'
Making all in lib
make[2]: Entering directory `/gs1/users/rl/vat-2.0.0/lib'
GEN alloca.h
GEN configmake.h
GEN langinfo.h
GEN stdlib.h
GEN unistd.h
GEN wchar.h
GEN wctype.h
make all-recursive
make[3]: Entering directory `/gs1/users/rl/vat-2.0.0/lib'
make[4]: Entering directory `/gs1/users/rl/vat-2.0.0/lib'
CC localcharset.o
AR libgnu.a
make[4]: Leaving directory `/gs1/users/rl/vat-2.0.0/lib'
make[3]: Leaving directory `/gs1/users/rl/vat-2.0.0/lib'
make[2]: Leaving directory `/gs1/users/rl/vat-2.0.0/lib'
Making all in src
make[2]: Entering directory `/gs1/users/rl/vat-2.0.0/src'
CC vcf.o
CC util.o
CC shutil.o
CC cfio.o
CC md5.o
CC growbuffer.o
CC s3.o
AR libvat.a
CC gencode2interval.o
CCLD gencode2interval
CC interval2sequences.o
CCLD interval2sequences
CC snpMapper.o
CCLD snpMapper
CC indelMapper.o
CCLD indelMapper
CC svMapper.o
CCLD svMapper
CC genericMapper.o
CCLD genericMapper
CC vcf2images.o
CCLD vcf2images
vcf2images.o: In function `generateLegend':
/data/rl/vat-2.0.0/src/vcf2images.c:516: undefined reference to `gdImagePng'
vcf2images.o: In function `main':
/data/rl/vat-2.0.0/src/vcf2images.c:564: undefined reference to `gdImagePng'
collect2: ld returned 1 exit status
make[2]: *** [vcf2images] Error 1
make[2]: Leaving directory `/gs1/users/rl/vat-2.0.0/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/gs1/users/rl/vat-2.0.0'
make: *** [all] Error 2
/data/rl/vat-2.0.0> echo $?

A:
Thank you for your interest in VAT. The linker cannot resolve gdImagePng, which is part of the GD library.
The GD library must be installed and configured before VAT is built.
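
One way to narrow this down is to check whether the GD shared library on the machine actually exports gdImagePng. The sketch below does this with Python's ctypes. It rests on assumptions the log does not confirm: that a shared libgd is installed, and that the missing symbol comes from a GD build without PNG (libpng) support, which is one possible cause. The helper name library_exports is made up for illustration.

```python
import ctypes
import ctypes.util
import os


def library_exports(library, symbol):
    """Return True/False if the shared library exports `symbol`,
    or None if the library cannot be located or loaded.

    `library` is either a full path to a shared object or a short
    name such as "gd" that is looked up in the standard locations.
    """
    path = library if os.path.isfile(library) else ctypes.util.find_library(library)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None
    # ctypes raises AttributeError for symbols the library does not export,
    # so hasattr() works as a symbol lookup.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # For a GD installed under a non-standard prefix, pass the full path
    # to its shared object instead of the short name "gd".
    print("gdImagePng exported:", library_exports("gd", "gdImagePng"))
```

If this prints False, rebuilding GD with PNG support before rebuilding VAT would be the next thing to try. None means the library was not found in the standard search locations.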

Problems compiling libproteingeometry-2-3-1

Q:
I am experiencing problems trying to compile your package (libproteingeometry-2-3-1). I tried to compile it under Ubuntu 11.10 on a 32-bit Centrino-based VAIO PC (gcc version 4.6.1) and got the following error message:

utypes.h:16:15: error: conflicting types for ‘float_t’
/usr/include/i386-linux-gnu/bits/mathdef.h:36:21: note: previous declaration of
‘float_t’ was here

Is it possible that this error is related to the CPPFLAGS settings?

I have also had difficulty compiling on a 64-bit Intel i7 machine under Ubuntu 11.10 with the same gcc version. This time the message was:

../src-lib/.libs/libproteingeometry: undefined reference to `sincos'
../src-lib/.libs/libproteingeometry: undefined reference to `ceil'
../src-lib/.libs/libproteingeometry: undefined reference to `atan2'
../src-lib/.libs/libproteingeometry: undefined reference to `acos'
../src-lib/.libs/libproteingeometry: undefined reference to `sin'
../src-lib/.libs/libproteingeometry: undefined reference to `rint'
../src-lib/.libs/libproteingeometry: undefined reference to `sqrtf'
../src-lib/.libs/libproteingeometry: undefined reference to `pow'
../src-lib/.libs/libproteingeometry: undefined reference to `sqrt'
../src-lib/.libs/libproteingeometry: undefined reference to `floor'
collect2: ld returned 1 exit status

A:
This problem was resolved with a workaround. The package was compiled in the /usr/local directory of an Intel i5 machine, and all the compiled files were copied to the corresponding directory on the i7 computer. Much the same was done for the 32-bit VAIO, using an Ubuntu 9.10 CD on another 32-bit system.

There may have been collisions with environment flags set during the earlier compilation of other packages. In particular, Amber10 and Gromacs-4-5-5 had already been installed on the i7 and the VAIO.

Information in .root file

Q:
By using CNVnator, I managed to create the .root file, but I can’t go any further: when I try to create the histograms, it seems to be working, but it never creates any files after it’s done.
A:
No separate output files are created at this step. New information is added to the .root file you provided on the command line.
During the next calculation step, CNVnator will extract this information from the file.

To browse the content of the .root file, you can start ROOT and open a browser (type “new TBrowser”).

Please see http://root.cern.ch for details.

PGOHUM00000250821 probably not a pseudogene

Q:
This is supported as a protein coding gene based on transcript and genomic data in human, and homology data. The differences with the human reference assembly (insertions at nt 475-476 and nt 496-497 in the CDS) are supported by transcript data and alignment to the alternate (Celera) assembly. The mouse protein NP_758465.2 (Ppp1r9b, Entrez GeneID 217124) is the same length as the human protein (NP_115984.3) and 96% identical. The region where the mouse gene is located on chromosome 11 has the same genes in the same order as the location on human chromosome 17 where this gene is annotated.
A:
Thanks to Dr. Janet Weber from the RefSeq project group for pointing this out to us. PGOHUM00000250821 is most likely the protein-coding gene PPP1R9B. The erroneous annotation probably results from either an error or a difference in the canonical human reference genome. Please note that this locus is tagged for follow-up by the Genome Reference Consortium as a possible locus where the reference genome is incorrect (GRC Jira system as HG-191, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/chr17/ ).

CNVnator license

Q:
Does your license allow us to provide commercial services based on your program?

A:
Commercial services can use CNVnator for free, provided that the original software, developers, and paper are credited and cited.

Alex Abyzov

***********************************************************
Department of Molecular Biophysics and Biochemistry,
Yale University, 260 Whitney ave., P.O. Box 208114,
New Haven, CT, 06520, USA
Phone: 1-(203)-432-5405
e-mail: abyzov@gersteinlab.org
URL: http://homes.gersteinlab.org/people/aabyzov
***********************************************************

Cow pseudogenes?

Q:
We’re wondering if you happen to have a database for cow pseudogenes?
A:
We haven’t done a PseudoPipe run on the cow genome.
I see that the genome is available from Ensembl. You can download the code and run it. In theory,
PseudoPipe can be executed when the genome and the annotation files are part of Ensembl. The code to run PseudoPipe can be downloaded from
http://www.pseudogene.org/DOWNLOADS/pipeline_codes/

PseudoPipe

Dear Anand,

First, some general guidelines for running the PseudoPipe pipeline on your local machine.

The PseudoPipe pipeline was originally designed and automated to work with Ensembl data, so some manual settings are required to run it with other input data.

Attached is an archive that contains the pipeline and a simple try-out data set.

===
There are three folders within a parent directory “pgenes” after extraction:
===
– pseudopipe: pipeline code;
– ppipe_input: input data;
– ppipe_output: output data.

===
Input data:
===
You may create a separate folder within ppipe_input (and ppipe_output) for each species. The genomic input data for each species needs three folders:
– dna: contains a file named dna_rm.fa, which is the entire repeat-masked DNA of that species, and a set of FASTA files with the unmasked DNA, one per chromosome;
– pep: contains a FASTA file with all the proteins of the species;
– mysql: contains files named “chr1_exLocs”, “chr2_exLocs”, etc., one per chromosome, that specify exon coordinates. The only columns that matter in these files are the third and fourth, which should hold the start and end coordinates of the exons.
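
As a rough check of an exLocs file before a run, the exon coordinates can be read as sketched below. This assumes whitespace-separated columns, since the delimiter is not stated here, and only uses the third and fourth columns. The function name read_exon_coords and the sample lines in the usage note are made up for illustration.

```python
def read_exon_coords(lines):
    """Return a list of (start, end) tuples taken from columns 3 and 4.

    Assumption: columns are separated by whitespace. Blank lines are skipped;
    lines with fewer than four columns or non-integer coordinates raise
    ValueError so that malformed input is caught before a long run.
    """
    exons = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 columns, got {len(fields)}")
        exons.append((int(fields[2]), int(fields[3])))
    return exons


if __name__ == "__main__":
    # Hypothetical content; only columns 3 and 4 are meaningful.
    sample = ["geneA exon1 100 200", "geneA exon2 300 450"]
    print(read_exon_coords(sample))
```

With a real file, pass the open file object, for example `read_exon_coords(open("chr1_exLocs"))`.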

===
Environment setting:
===
You’ll need Python, BLAST and tfasty to run the pipeline. Their paths should be set at the end of /pseudopipe/bin/env.sh

===
Run the pipeline
===
First go to the folder pseudopipe/bin, then run a command of the form: ./pseudopipe.sh [output dir] [masked dna dir] [input dna dir] [input pep dir] [exon dir] 0

An example using the try-out data is as follows:
./pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
(This command line assumes you extracted the archive in your home directory, i.e., “~/”. Please note that the paths in the command line need to be absolute, and that the chromosome and exon files are specified with the wildcard “%s”.)
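
Because the command line is long and every path must be absolute, it can help to assemble it programmatically. The sketch below rebuilds the argument list from the folder layout described above. The function name build_pseudopipe_args and its parameters are made up for illustration, and the “%s” wildcard is passed through unchanged.

```python
import os


def build_pseudopipe_args(pgenes_dir, species, chrom_fasta_pattern, pep_fasta):
    """Assemble the pseudopipe.sh command line with absolute paths.

    pgenes_dir          -- parent directory holding ppipe_input and ppipe_output
    species             -- per-species folder name inside both of them
    chrom_fasta_pattern -- per-chromosome FASTA name containing the "%s" wildcard
    pep_fasta           -- protein FASTA file name
    """
    root = os.path.abspath(os.path.expanduser(pgenes_dir))
    species_in = os.path.join(root, "ppipe_input", species)
    return [
        "./pseudopipe.sh",
        os.path.join(root, "ppipe_output", species),           # output dir
        os.path.join(species_in, "dna", "dna_rm.fa"),          # repeat-masked DNA
        os.path.join(species_in, "dna", chrom_fasta_pattern),  # unmasked DNA per chromosome
        os.path.join(species_in, "pep", pep_fasta),            # protein sequences
        os.path.join(species_in, "mysql", "chr%s_exLocs"),     # exon coordinates
        "0",
    ]


if __name__ == "__main__":
    args = build_pseudopipe_args(
        "~/pgenes",
        "caenorhabditis_elegans_62_220a",
        "Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa",
        "Caenorhabditis_elegans.WS220.62.pep.fa",
    )
    print(" ".join(args))
```

The resulting list can be handed to `subprocess.run(args, cwd=...)` with the working directory set to pseudopipe/bin.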

The blast step is already included in the pipeline.

===
Output:
===
The output can be found at ppipe_output/caenorhabditis_elegans_62_220a/pgenes/ppipe_output_pgenes.txt

===
Run time
===
On a single laptop (2.6 GHz, 4 GB RAM), the most time-consuming step is tblastn. It may take around one day for an entire genome comparable in size to that of C. elegans. The subsequent steps will finish in a few hours.

We’ve implemented the pipeline to run in parallel on cluster machines. However, the pipeline I sent can only run on a single machine. The parallel implementation is currently hard-coded to our local settings.

Some specific answers to your questions:
Q:
I am ready to run tBLASTn of proteome versus genome. I can repeat mask the genome during the tBLASTn run itself, would that be OK?

A:
You don’t have to run tBLASTn yourself since it is already integrated into the pipeline. In Ensembl, the genomes are repeat-masked by RepeatMasker; that is the input data the pipeline currently uses. I would assume any reasonable repeat-masking algorithm is fine.

Q:
For the tBLASTn, instead of using the entire genome, can I use the genome that is ‘masked’ for entire genes (not just exons). Based on gff info, I have converted the genic regions (not just exons) into stretches of Ns. Would this ‘masked’ genome be a good input for my tBLASTn?

A:
You don’t need to do that, since the pipeline removes BLAST hits that significantly overlap known gene exons (> 30 bp overlap). Also, manually masking entire gene sequences may be problematic, since in some species we do find pseudogenes that partly overlap gene annotations.
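
To illustrate the idea of the exon-overlap filter, here is a minimal sketch. It is not the pipeline's actual code. It assumes 1-based inclusive coordinates and that hits and exons are already on the same chromosome; the function names are made up for illustration.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two closed intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def drop_exon_overlapping_hits(hits, exons, max_overlap=30):
    """Keep only hits whose overlap with every exon is at most max_overlap bp."""
    return [
        hit
        for hit in hits
        if all(overlap_bp(hit[0], hit[1], ex[0], ex[1]) <= max_overlap for ex in exons)
    ]


if __name__ == "__main__":
    exons = [(100, 200)]
    hits = [(50, 120), (50, 140), (300, 400)]
    # (50, 120) shares 21 bp with the exon and is kept;
    # (50, 140) shares 41 bp and is dropped; (300, 400) shares none.
    print(drop_exon_overlapping_hits(hits, exons))
```

This also shows why masking whole genes beforehand is stricter than the filter: a hit that overlaps an exon by only a few bases would still be kept here.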

Q:
You mention in your paper that you use bite-sized portions of your proteome as query for your BLAST search. Does that mean I should chop up my proteome into peptides x amino acids in length? Is that x >= 10?

A:
No need to do that. You can keep the whole protein sequences in the input FASTA file.

Q:
Is there a latest README or even a User Guide for PseudoPipe that you can share with me?

A:
Unfortunately, we don’t have a user-friendly README file for the entire pipeline, especially for running it in an environment different from ours. I hope this email helps you set up and run the pipeline on your machine. You can also find some comments in each individual pipeline script file.
Please feel free to let me know if you need further assistance.

Best,
Baikang

Interaction Data set

Q:

help of your paper “Redefining Nodes and Edges: Relating 3D Structures to Protein Networks Provides Insight
into their Evolution”. Now I need to get the proteins in Pfam that are involved in interactions, and also their crystal structures.
I would be very grateful if you could send me a link to a more detailed format of the SIN v0.9 data.

A:

My understanding from your email is that you would like to know the Pfam IDs
and the corresponding crystal structures (ie, the PDB IDs) for the
interactions involved in the SIN. To do this, you will have to process two
separate datasets together, but this will not be difficult. Here are the
steps:

i) Access the raw SIN data (http://networks.gersteinlab.org/structint/). On
this page, click on “composite dataset” under the download column for SIN
v0.9 data. This is a list of open reading frame IDs corresponding to each
interaction (the first and third columns), as well as whether the
interaction is taken from Pfam.
ii) Open the text file I’ve attached with this email. Each row contains
several pieces of information, but what you would like to do is find the PDB
IDs (contained in the 2nd column) corresponding to each Ensembl Gene ID (the
first column). This Ensembl Gene ID is taken from (i) above.

I should mention that there are two problems with the procedure outlined
above.
The first is that I noticed it will not provide crystal structures for all
interactions. I’m not sure why this is the case. Secondly, for some
interactions, multiple crystal structures are available, and it is not clear
which structure was used in Pfam. Nitin (CC’ed on this email) may know how
to deal with these issues. If you are still having difficulty, please
contact Nitin or me again after further efforts to get the data you need.
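
The two steps amount to a simple join on the gene ID, which could be sketched as follows. This assumes both files are whitespace-separated and use the same ID namespace; the function names and the sample IDs in the usage part are placeholders, not real data. An empty list in the result marks an interaction partner without a crystal structure, and a list with several entries marks the ambiguous case mentioned above.

```python
from collections import defaultdict


def load_pdb_map(mapping_lines):
    """Map each gene ID (column 1) to the set of PDB IDs (column 2)."""
    pdb_by_gene = defaultdict(set)
    for line in mapping_lines:
        fields = line.split()
        if len(fields) >= 2:
            pdb_by_gene[fields[0]].add(fields[1])
    return pdb_by_gene


def annotate_interactions(sin_lines, pdb_by_gene):
    """Attach PDB IDs to both partners of each interaction.

    The partner IDs are read from columns 1 and 3 of the SIN file.
    Returns tuples (id_a, id_b, pdb_ids_a, pdb_ids_b) with sorted PDB lists.
    """
    rows = []
    for line in sin_lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        id_a, id_b = fields[0], fields[2]
        rows.append((
            id_a,
            id_b,
            sorted(pdb_by_gene.get(id_a, ())),
            sorted(pdb_by_gene.get(id_b, ())),
        ))
    return rows


if __name__ == "__main__":
    # Placeholder rows standing in for the attached mapping file and the SIN download.
    mapping = ["GENE_A 1ABC", "GENE_A 2XYZ", "GENE_B 3DEF"]
    sin = ["GENE_A - GENE_B pfam", "GENE_A - GENE_C other"]
    for row in annotate_interactions(sin, load_pdb_map(mapping)):
        print(row)
```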

integrated regulatory network

Q:

I read your recent paper “Construction and Analysis of an Integrated
Regulatory Network Derived from High-Throughput Sequencing Data” in PLOS
Computational Biology with great interest. I would like to know whether the
data for your integrated regulatory networks are available, or whether you would be
willing to share them. I am part of a group of statisticians in Evry (France)
working on probabilistic models for biological networks. Our aim is to
retrieve the groups of nodes having similar topological behaviours. The
fact that your data has three types of nodes, a hierarchical structure among
TFs and miRNAs and that you made a biological analysis of this structure
makes it very interesting for us to validate or not the methods we
developed. Would it be possible for you to send me the C. elegans network
and the corresponding hierarchical structure? Any use of it would of course
be referenced.

A:

I have uploaded the worm network data to http://archive.gersteinlab.org/proj/mirnet
It comprises 3 files:

cel_TF_Target_GID.net: TF->gene interactions
cel_TF_MIR_GID.net: TF->miR interactions
cel_miR_conservedTarget_Kris3way_GID.net: miR->gene interactions

The node type is labeled as “MIR”, “TF” or “X” in brackets.
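
The exact line layout of the .net files is not described here, so the following is only a sketch under a stated assumption: each line holds two whitespace-separated node tokens (source, then target), and each token ends with its type in round or square brackets, e.g. a hypothetical `geneA(TF)`. The pattern and function names are made up for illustration and should be adjusted after inspecting the real files.

```python
import re

# Assumed token shape: a name followed by (TYPE) or [TYPE], TYPE in MIR/TF/X.
NODE_PATTERN = re.compile(r"^(?P<name>.+?)\s*[\[(](?P<type>MIR|TF|X)[\])]$")


def parse_node(token):
    """Split a token such as 'geneA(TF)' into ('geneA', 'TF')."""
    match = NODE_PATTERN.match(token)
    if match is None:
        raise ValueError(f"unrecognised node token: {token!r}")
    return match.group("name"), match.group("type")


def parse_edges(lines):
    """Return a list of ((source, source_type), (target, target_type)) tuples."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        edges.append((parse_node(fields[0]), parse_node(fields[1])))
    return edges


if __name__ == "__main__":
    # Hypothetical lines, not taken from the real files.
    print(parse_edges(["geneA(TF) geneB(X)", "mir-1[MIR] geneC[X]"]))
```

Reading all three files this way and concatenating the edge lists gives one network in which the three node types can be told apart.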

Request for Pseudogene

Q:

We are basically looking for the pseudogenes of protein P53 (tumor protein 53, or tumor suppressor) and protein WSTF (also known as BAZ1B) in human. There is no information on Pseudogene.org. Could you please help us find a way to get the result?
Later on I found a web service called PseudoGeneQuest. I submitted my target protein sequences and got the results shown in the following forwarded emails.

The results showed that there are known pseudogenes in your database; however, I couldn’t extract the data. Could you please help me to do so?

A:

I have looked at our pseudogene database and there are no pseudogenes for P53 and WSTF. I have further rechecked this by redoing the homology analysis against the genome based on both the P53 and WSTF sequences, and there are no other regions in the genome that are good hits to P53 and WSTF. I have also looked at the results from the other program: either the matches are to coding exons of other genes or they are not significant, i.e. the match lengths are very small and the e-values are not significant.

For example, these are the other regions in the genome homologous to the coding sequence in BLAST. Please see the attached image. The only significant matches to the P53 protein are:
1. NT_010718.16. This corresponds to P53 itself.
2. NT_004350.19. This corresponds to P73, another gene and not a pseudogene.
3. NT_005612.16. This corresponds to P63, another gene and not a pseudogene.

The other two matches are not significant and are homologous to only 20% of the length of P53.

This is the result that you obtained from the other program.

0 - QUERY:111222153038348410812
2 - KNOWN_PSEUDOGENE:ref|NT_004350.19|:NT_010755.15:3118600:3119076
2 - KNOWN_PSEUDOGENE:ref|NT_004350.19|:NT_033903.7:3114083:3118495
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_008470.18:7177265:7178188
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_023935.17:7181340:7182403
2 - KNOWN_PSEUDOGENE:ref|NT_010718.16|:NT_079573.3:7181224:7182633
3 - REAL GENE OR EXON:ref|NT_004350.19|:3122278:3122442
3 - REAL GENE OR EXON:ref|NT_005612.16|:96077137:96077361
3 - REAL GENE OR EXON:ref|NT_005612.16|:96079592:96079771
3 - REAL GENE OR EXON:ref|NT_005612.16|:96080735:96080899
3 - REAL GENE OR EXON:ref|NT_005612.16|:96081483:96081638
3 - REAL GENE OR EXON:ref|NT_010718.16|:7176274:7176414
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180194:7180331
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180364:7180564
3 - REAL GENE OR EXON:ref|NT_010718.16|:7180845:7181012
3 - REAL GENE OR EXON:ref|NT_010718.16|:7183182:7183316

So all the good hits are to coding exons of P53 or P63 or P73 presumably because P53 is homologous to P63, P73 etc.
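
To see at a glance which contigs the hits fall on, the output above can be tallied with a short script. This is a sketch that assumes the layout visible in the listing: a class code, " - ", a label, then colon-separated fields in which the first `ref|...|` entry names the matched contig and the last two fields are coordinates. The meaning of the fields is inferred from the listing, not from the program's documentation, and the function names are made up for illustration.

```python
from collections import Counter


def parse_hit(line):
    """Return (label, contig, start, end), or None for the QUERY line."""
    _code, rest = line.split(" - ", 1)
    fields = rest.strip().split(":")
    label = fields[0]
    if label == "QUERY":
        return None
    contig = fields[1].split("|")[1]  # 'ref|NT_004350.19|' -> 'NT_004350.19'
    return label, contig, int(fields[-2]), int(fields[-1])


def count_hits_per_contig(lines):
    """Count hits per (label, contig) pair, skipping blank and QUERY lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        hit = parse_hit(line)
        if hit is not None:
            counts[(hit[0], hit[1])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "0 - QUERY:111222153038348410812",
        "2 - KNOWN_PSEUDOGENE:ref|NT_004350.19|:NT_010755.15:3118600:3119076",
        "3 - REAL GENE OR EXON:ref|NT_005612.16|:96077137:96077361",
        "3 - REAL GENE OR EXON:ref|NT_005612.16|:96079592:96079771",
    ]
    for key, n in sorted(count_hits_per_contig(sample).items()):
        print(key, n)
```

Run on the full listing, every contig that appears is one of the three discussed above (P53, P73, P63).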

Similarly for WSTF, the other matches are either to known genes or the matches are not significant. You can easily check this by querying your protein sequence using BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&PROG_DEF=blastn&BLAST_PROG_DEF=megaBlast&SHOW_DEFAULTS=on&SHOW_DEFAULTS=on&BLAST_SPEC=OGP__9606__9558)