Comparing chromatin state analysis at pseudogene regions

Q:
I am very interested in comparing our chromatin state analysis with yours at pseudogene regions. I found this file on your website: http://www.pseudogene.org/psidr/psiDR.v0.txt

Could you please let me know whether this is the right file to compare against? I see that you have H1-hESC there. If I understand correctly, you classified each pseudogene as being in either an active (1) or a silent (0) state.

A:
The chromatin state, promoter prediction, and Pol2 binding for pseudogenes in H1-hESC are included in the psiDR file. Please let me know if you have any questions about that file.

Pseudogene database: the link “current human pseudogenes” on the main webpage leads to build 61

Q:
I have a question regarding your Pseudogene database: the link "current
human pseudogenes" on the main webpage leads to build 61. Looking however at
"Database" - "Eukaryote Pseudogenes", I found build 68 for human
pseudogenes. The latter seems to contain fewer pseudogenes than build 61
(lower count). So I’m not sure which one I should use. Probably
build 68 is the latest version and the link on the main page is not up to
date, right?

A:
The link now points to build 68. I suggest
using this file, which contains the latest results, based on
release 68 of the Ensembl genomes. The number of pseudogenes changes
between builds because the annotation of protein-coding genes differs
between genome releases.

Data associated with paper “The GENCODE pseudogene resource”

Q:
Your work on the ENCODE project has helped to
produce an incredible set of data!

I had a question about your pseudogene article, "The GENCODE
pseudogene resource." You note that at least 9% of them are
transcribed. Do you have a list somewhere? I couldn’t find a
supplementary file that might contain such a list. I realize it would
be quite long, but I assume it must exist somewhere. If not, do you
happen to know if the GULO pseudogene is one of the transcribed
pseudogenes?

A:
The list of transcribed pseudogenes should be available from the resource associated with the paper.

The data associated with the paper is located at http://pseudogene.org/psidr/

Zebrafish pseudogenes

Q:
On Pseudogene.org, the zebrafish (Danio rerio) pseudogene dataset is based on old annotations (Ensembl 55?) and contains about 1800 processed pseudogenes. However, according to recent research (http://www.nature.com/nature/journal/v496/n7446/full/nature12111.html), pseudogenes are rare in zebrafish (only 21 processed pseudogenes, according to Supplementary Table 14 of the published manuscript). Is this large discrepancy due to the old annotations?

A:
That is right: the results were based on an old genome assembly from Ensembl release 55, which dates from 2009. We have noticed that the pseudogene count is far too high. We believe this is due partly to the quality of the genome assembly and partly to the fact that the pipeline parameters were optimized for primates. For up-to-date pseudogene information in zebrafish, please refer to the Nature publication.

PsiDR

Q:
Where is the psiDR file?
A:
The file can be downloaded at: http://www.pseudogene.org/psidr/psiDR.v0.txt
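
For readers who want to inspect the file programmatically, here is a minimal sketch. It assumes the file is tab-delimited with a header row, which should be checked against the actual file. The helper names and the column name `state` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io
import urllib.request
from collections import Counter

PSIDR_URL = "http://www.pseudogene.org/psidr/psiDR.v0.txt"


def fetch_text(url=PSIDR_URL):
    """Download the file and return its contents as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def read_table(text):
    """Parse tab-delimited text with a header row into a list of dicts.

    The tab-delimited layout is an assumption, not a documented format.
    """
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def tally(rows, column):
    """Count how often each value occurs in one column."""
    return Counter(row[column] for row in rows)


# Made-up rows for illustration only; they are not taken from the real file.
sample = "id\tstate\npgene_a\t1\npgene_b\t0\npgene_c\t1\n"
rows = read_table(sample)
print(tally(rows, "state"))
```

To work on the real file, pass `fetch_text()` to `read_table` and replace `state` with a column name taken from the file's own header.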

Q:
Is H1-hesc included in the psiDR file?
A:
The chromatin state, promoter prediction, and Pol2 binding for pseudogenes in H1-hESC are included in the psiDR file.

Q:
Could you let me know briefly how the chromatin states in the psiDR file are determined?
A:
The chromatin states were assessed using the Segway segmentation. Segway annotates the genome using 25 different labels representing active and repressive marks. We use two selection criteria to pinpoint pseudogenes with active chromatin states:
(1) the TSS frequency is three times higher than the frequency of any repressive marker;
(2) the gene body start (GS), gene body middle (GM) and gene body end (GE) frequencies are each two times higher than the frequency of any repressive marker.
The selection criteria were chosen to match the segmentation behavior of the active genes.
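
The two criteria above can be sketched in code as follows. This is an illustrative reading, not the pipeline's actual implementation: it assumes that both criteria must hold, that "times higher" means a strict comparison against the largest repressive frequency, and that label frequencies are supplied as a plain dictionary. The function name, the repressive label `R`, and the example numbers are hypothetical.

```python
from typing import Iterable, Mapping

# Gene body labels named in criterion (2).
GENE_BODY_LABELS = ("GS", "GM", "GE")


def has_active_chromatin(
    freq: Mapping[str, float],
    repressive_labels: Iterable[str],
) -> bool:
    """Apply the two selection criteria to one pseudogene's label frequencies.

    Assumption: a missing label counts as frequency 0.0.
    """
    # Comparing against the largest repressive frequency covers
    # "any repressive marker" in both criteria.
    max_repressive = max(
        (freq.get(label, 0.0) for label in repressive_labels),
        default=0.0,
    )
    # Criterion (1): TSS frequency more than three times the repressive one.
    tss_ok = freq.get("TSS", 0.0) > 3 * max_repressive
    # Criterion (2): GS, GM and GE each more than two times the repressive one.
    body_ok = all(
        freq.get(label, 0.0) > 2 * max_repressive for label in GENE_BODY_LABELS
    )
    return tss_ok and body_ok


# Hypothetical frequencies for two pseudogenes.
active_like = {"TSS": 0.40, "GS": 0.25, "GM": 0.30, "GE": 0.22, "R": 0.10}
silent_like = {"TSS": 0.20, "GS": 0.25, "GM": 0.30, "GE": 0.22, "R": 0.10}

print(has_active_chromatin(active_like, ["R"]))
print(has_active_chromatin(silent_like, ["R"]))
```

In this sketch the first example passes both criteria, while the second fails criterion (1) because its TSS frequency is only twice the repressive frequency.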

Consult for help about PseudoPipe

Q:
Why do the genomic sequences need to be repeat-masked before they are input to the pipeline?

A:
Masking excludes the low-complexity regions of the genome from the pseudogene search.

Q:
Which database should we use for repeat masking?

A:
Our current pipeline downloads genome data from Ensembl, where the repeats are detected with the RepeatMasker tool. More information about PseudoPipe can be found at: https://faq.gersteinlab.org/category/pseudogenes/.
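
As background, Ensembl distributes genome FASTA files both soft-masked (repeats in lowercase) and hard-masked (repeats replaced by N). The sketch below shows how a soft-masked sequence can be converted to a hard-masked one, so that masked regions cannot produce matches in a similarity search. Whether a given PseudoPipe setup expects hard-masked input is an assumption to check against the pipeline's own documentation; the function names are hypothetical.

```python
from typing import Iterable, Iterator


def hard_mask(sequence: str) -> str:
    """Replace soft-masked (lowercase) bases with N."""
    return "".join("N" if base.islower() else base for base in sequence)


def hard_mask_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Hard-mask the sequence lines of a FASTA file, leaving headers unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header line, passed through as-is
        else:
            yield hard_mask(line)


# Small made-up record: the lowercase run stands for a detected repeat.
record = [">contig_1\n", "ACGTacgtacgtTTGA\n"]
for out_line in hard_mask_fasta(record):
    print(out_line)
```

For a genome that is not in Ensembl, the same conversion can be applied to soft-masked output from a repeat-detection run made with whichever repeat library is chosen.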

>
> Dear Prof. Gerstein,
>
> My name is Yiling Lai, a PhD student from Prof. Xingzhong Liu’s group in the Institute of Microbiology, Chinese Academy of Sciences. Our research focuses on comparative genomics of the nematode endoparasitic fungi Hirsutella spp. We have started to analyse the genomic sequences and are using PseudoPipe, from your published method, to identify pseudogenes in these genomes. However, some questions confuse us when we use the pipeline. The first is why the genomic sequences need to be repeatmasked before they are input to the pipeline. The second is which database we should use for the repeatmasking: the Repbase database, or a database built from de novo consensus sequences by RepeatScout? We would very much appreciate any suggestions. Thank you very much! We look forward to your reply.
>
> Best wishes
>
> Yiling Lai
>
> State Key Laboratory of Mycology
> Institute of Microbiology
> Chinese Academy of Sciences
> No.3 1st Beichen West Road, Chaoyang District
> Beijing 100101, PR China