A question regarding VAT (twoBitToFa: command not found)

The psiDR is a valuable resource for my research.

In the corresponding publication (The GENCODE pseudogene resource, Pei et
al., Genome Biology, 2012), you describe in "Material & Methods:
Identification of the parents of pseudogenes and sequence similarity to
the parent" that the exons of parent and pseudogenes were used to align
them via ClustalW2.

Is it possible to provide me the alignments? That would be great.

Edit @PATH

That is, can we run VAT (snpMapper) on a simple mutation list (as shown below)?
14 30525048 . . . . . .
19 28364092 . . . . . .
2 144165461 . . . . . .
2 144224307 . . . . . .
4 98318053 . . . . . .

No rs numbers, no sample IDs and no group:sample files are present in this case, as I just wanted to run VAT for a list of somatic mutations to see if they are annotated to a coding or non-coding region.

There will be no duplicates in the list too, as I have already handled their frequency beforehand.

I will update documentation on VAT. Thanks for pointing it out to us. Regarding the input that is a non-vcf file,
just make a dummy vcf file. That is have extra columns that are tab-delimited.. Please e-mail me if you have more questions. However, bear with me as my computer just died and I am trying to get it fixed.

Question about pseudogene.org data

I have a question about psueodgene.org database.
I’m analyzing human pseudogene database and noticed that many "processed" pseudogene (>70%) don’t have polyA.
It seems like opposite of what textbook says. Is that true?
What’s the criteria of "processed pseudogene" in pseudogene.org?

I came to find another question.
I tried to blat search using several pseudogene sequence from each class of "polyA: "0" or "1" or "2" or "3" ".
But most of PolyA class 1,2,3 don’t have convincing polyA tail compare to following criteria.

Polya: "0" or "1" or "2" or "3".
"0":no polyA tail (> 30 A in 50 bp window) detected of the pseudogene
"1" : has polyA tail and also polyadenylation signal with 50 bp of the begining of the tail
"2" : has polyA tail and polyadenylation signal within 50-100 bp of the begining of the tail
"3": has polyA tail but no polyadenylation detected.

Does number coordinate of pseudogene.org data depends on human genome assembly GRCh37/hg19?

Pseudogenes are identified primarily by homology matching of protein sequence against the human genome. However, the pipeline that we use incorporates poly A analysis. Our group published a paper a few years ago where we showed that ~ 50% of ribosomal protein pseudogenes do not have a detectable poly A signal. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC187539/ . We believe that this is due to decay in genome sequence and nucleotide substitutions.

For detecting poly A signals and classification, the following criteria is used according to the paper linked above.

We searched a 1000-bp region that was 3′ to the pseudogene homology segment, with a sliding window of 50 nucleotides for a region of elevated polyadenine content (>30 bp), and picked the most adenine-rich 50-bp segment as the most likely candidate. An interval of 1000 nucleotides was used because of the possible existence of 3′-untranslated regions (3′-UTRs); 90% of 3′-UTRs are of length less than 942 bp (Makalowski et al. 1996). In addition, we searched in the same 1000-bp region for candidate AATAAA or other polyadenylation signals and checked whether they were upstream of the candidate polyadenine tail site.

This criteria might not be very stringent.

And yes, the pseudogene coordinates are dependent on the human genome from which it is derived, hence the human genome version number is important.

ncRNA position doesn’t match

I have downloaded the psiDR for comparing the results with previously posted
lincRNAs at the UCSC web site.

I found that the following entry doesn’t match with the current hg19
positions assigned at the UCSC genome browser:

gene_id "ENSG00000224184.1"; transcript_id "ENSG00000224184.1"; gene_type
"lincRNA"; gene_status "NOVEL"; gene_name "AC096559.1"; transcript_type
"lincRNA"; transcript_status "NOVEL"; transcript_name "AC096559.1"; level 2;
tag "ncRNA_host"; havana_gene "OTTHUMG00000151709.2";

Coordinates at psiDR are: chr2:11,988,748-12,718,474

Coordinates at UCSC are: chr2:12,716,164-12,783,038

Don’t know whether or not that happens with the coordinates of other

I can’t find a way to explain this difference other than a mistake in the
annotation process, but maybe I’m wrong and there is a better explanation.

We use the GENCODE gene annotation model. If you check Ensembl for "ENSG00000224184.1", you will see that it matches the coordinates at psidDR.
I think the UCSC track includes the actual clone boundaries. You can e-mail to the UCSC help desk. They are generally very responsive. Please bear in mind that coordinates also change a bit with updated genome assembly as well refined gene annotation models.

Question about definition of startOverlap and endOverlap in VAT

I have been using your variant annotation tool VAT and I have a question about what the definition is of startOverlap and endOverlap. I went through the example workflow and I have annotated another file, but I do not have any variants annotated with these types in my files. I went through your website but I could not find a listing of definitions for the terms. Thank you.

These features are annotated only for indels. Essentially, when a indels affects the START of a gene or the end of a gene, it is annotated as startOverlap and end Overlap respectively. You can find extensive documentation for VAT at vat.gersteinlab.org. Please click on "Documentation" tab. Please let me know if you have any more questions.

PGOHUM00000250821 probably not a pseudogene

This is supported as a protein coding gene based on transcript and genomic data in human, and homology data. The differences with the human reference assembly (insertions at nt 475-476 and nt 496-497 in the CDS) are supported by transcript data and alignment to the alternate (Celera) assembly. The mouse protein NP_758465.2 (Ppp1r9b, Entrez GeneID 217124) is the same length as the human protein (NP_115984.3) and 96% identical. The region where the mouse gene is located on chromosome 11 has the same genes in the same order as the location on human chromosome 17 where this gene is annotated.
Thanks to Dr. Janet Weber from the Refseq project group for pointing this to us. PGOHUM00000250821 is most likely a protein-coding gene PPP1R9B. The erroneous annotations probably results due to either an error or difference in the canonical human reference genome. Please note that this locus is tagged for follow-up by the Genome Reference Consortium as a possible locus where the reference genome is incorrect (GRC Jira system as HG-191, http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/issues/chr17/ ).

Cow pseudogenes?

[tag sb]
We’re wondering if you happen to have a database for cow pseudogenes
We haven’t done a Pesudopipe run on cow genome.
I see that the genome is available from Ensembl. You can download the code and run it. In theory,
Pseudopipe can be executed when the genome and the annotation files are a part of Ensembl. The code to run Pseudopipe can be downlaoded from


Dear Anand,

First some general guidelines of running the pseudopipe pipeline in your local machine.

Since the pseudopipe pipeline was originally designed and automated to work with ensembl data, so some manual settings are required to run it with other input data.

Attached is an archive that consists of the pipeline and a simple try-out data.

There are three folders within a parent directory “pgenes” after extraction:
– pseudopipe: pipeline code;
– ppipe_input: input data;
– ppipe_output: output data.

Input data:
You may create a separate folder within the ppipe_input (and ppipe_output) for each species. There need to be three folders for each species genomic input data,
– dna: contains a file named dna_rm.fa, which is entire repeat masked dna from that species, and a list file for all unmasked dna divided into different chromosomes in FASTA format;
– pep: contains a FASTA file for all the proteins in the species;
– mysql: contains a list of files named as “chr1_exLocs”, “chr2_exLocs”, etc. to specify exons coordinates, one for each chromosome. Only thing matters for these files are their third and fourth columns, which should be start and end coordinates of exons.

Environment setting:
You’ll need python, blast and tfasty to run the pipeline. Their paths should be indicated at the end of /pseudopipe/bin/env.sh

Run the pipeline
First go to the folder pseudopipe/bin, and run with command line in the form of: ./pseudopipe.sh [output dir] [masked dna dir] [input dna dir] [input pep dir] [exon dir] 0.

An example using the try-out data is as follow:
./pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
(This command line assumes you extract the archive in your home directory, i.e., “~/”. Please note that the paths in the command line need to be absolute, and chromosome and exon files are specified with wild card “%s”.)

The blast step is already included in the pipeline.

The output can be found at ppipe_output/caenorhabditis_elegans_62_220a/pgenes/ppipe_output_pgenes.txt

Run time
On a single laptop (2.6GHz, 4GB RAM): The most time consuming step is tblastn. It may take around one day to finish an entire genome in a comparable size of C. elegans. The following steps will finish in a few hours.

We’ve implemented the pipeline to run parallel in cluster machines. However, the pipeline I sent can only run on a single machine. The parallel implementation is currently hard-coded to our local settings.

Some specific answers to your questions:
I am ready to run tBLASTn of proteome versus genome. I can repeat mask the genome during the tBLASTn run itself, would that be OK?

You don’t have to run the tBLASTn by yourself since it is already integrated into the pipeline. In ENSEMBL, the genomes are repeat masked by RepeatMasker, that’s the input data currently used the pipeline. I would assume any reasonable repeat mask algorithm is fine.

For the tBLASTn, instead of using the entire genome, can I use the genome that is ‘masked’ for entire genes (not just exons). Based on gff info, I have converted the genic regions (not just exons) into stretches of Ns. Would this ‘masked’ genome be a good input for my tBLASTn?

You don’t need to do that since the pipeline will remove blast hits significantly overlap known gene exons ( > 30 bp overlap). Also, manually masking the entire gene sequences may be problematic, since we do find in some species the pseudogenes with some overlap with genes annotation.

You mention in your paper that you use bite-sized portions of your proteome as query for your BLAST search. Does that mean I should chop up my proteome into peptides x amino acids in length? Is that x >= 10?

No need to do that. You can keep the whole protein sequences in the input FASTA file.

Is there a latest README or even a User Guide for PseudoPipe that you can share with me?

Unfortunately, we don’t have a user friendly README file for the entire pipeline, especially for it to run in different environment from ours. I hope this email can help you set it up and run the pipeline in your machine. And also you can find some comments on each individual pipeline script file.
Please feel free to let me know if you need further assistance.