PGOHUM00000250823 and SETP13/SETP3 pseudogenes

Q:

Could you please investigate the support you have for PGOHUM00000250823? At NCBI, we have PGOHUM00000250823 associated with HGNCid 42932, official symbol SETP13, and RefSeq accession NG_032538.1. However, we have a nearly identical RefSeq accession NG_032022.1 associated with HGNCid 31115 and official symbol SETP3. On the current human reference assembly, GRCh38, both NG_032538.1 and NG_032022.1 align perfectly to the same locus and have no other hits of comparable quality to the reference assembly (or to alternate assemblies HuRef or CHM1_1.1). In NCBI’s latest annotation (Annotation Release 106) SETP3 was annotated on the assembly but SETP13 was not because it overlapped with SETP3. Do you have any evidence these are distinct pseudogenes? If not, NCBI’s preference would be to preserve the older nomenclature associated with SETP3 and NG_032022.1. Also, if we agree that SETP13 is redundant with SETP3 then I will proceed to notify HGNC and will CC you on that email.

A:
I looked at the PGOHUM00000250823/SETP13 locus. Our pipeline predicted it as a pseudogene of SET, with around 90% sequence identity. Compared with the SETP3 locus, SETP13 lacks sequence at both the 5’ and 3’ ends that can be aligned to the UTR regions of SET. Our pipeline missed these sequences at both ends because it searches for homology to CDS regions only. We are in fact thinking of including some checks of UTR alignment in our revised pipeline. Thanks for pointing this case out to us.

We have no problem with merging the SETP13 locus into SETP3.

“Results not accessible” on the FunSeq page

Q:

Could you please check your servers? The FunSeq results page is not producing any output; it just returns "Page not found" after I run the tool on an uploaded VCF file.

A:
We should definitely improve the server. It deletes results once per week, so unfortunately I cannot see your results now. I have just checked the server, and it does return results.

When it shows ‘Page not found’, there are several possible reasons:

1. The file format is incorrect.
2. No variants are left after the filtering against 1000 Genomes. If this is the case, please try a different MAF cutoff (for example, 1).
3. FunSeq does not analyze indels, so any indels are filtered out. If no variants remain after that, you will get an error.

If you still experience the same problem, feel free to contact us. Please give us the job ID number and I can check it for you.
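
In case it helps with reason 3 above, here is a minimal Python sketch (not part of FunSeq itself) for pre-checking a VCF before uploading: it drops indels, which FunSeq filters out anyway, and reports how many SNVs remain. The file names are placeholders.

    # Pre-check a VCF for FunSeq: keep SNVs, drop indels (placeholder file names).
    import sys

    def split_vcf(in_path="input.vcf", out_path="input.snvs_only.vcf"):
        n_snv = n_indel = 0
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                if line.startswith("#"):          # keep all header lines
                    fout.write(line)
                    continue
                fields = line.rstrip("\n").split("\t")
                ref, alts = fields[3], fields[4].split(",")
                # treat any REF/ALT longer than one base as an indel
                if len(ref) == 1 and all(len(a) == 1 for a in alts):
                    fout.write(line)
                    n_snv += 1
                else:
                    n_indel += 1
        print(f"SNVs kept: {n_snv}, indels dropped: {n_indel}", file=sys.stderr)

    if __name__ == "__main__":
        split_vcf()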

Questions regarding pseudogene.org

Q1:

I am trying to get the fly pseudogene information available from pseudogene.org.
I want to know the parent gene of each pseudogene. Pseudogene.org provides “parent proteins”, such as FBpp0112526. However, I cannot find this ID in FlyBase. Is it a FlyBase ID? If not, which database is it from?

A1:
The fly pseudogene information currently available on the pseudogene.org website is old: as you can see, it is from Ensembl build 50, while the current Ensembl release is 75. The FBpp00… ID is an Ensembl protein ID based on FlyBase; however, many of these IDs have been deprecated between the two releases. We are currently preparing a new annotation file for fly pseudogenes based on the final stable gene annotation, and it will be available online shortly. If you still want to use the pseudogene.org fly pseudogene annotation, you can look up all the parent protein IDs in the file using Ensembl BioMart and see which IDs are still current and which have been retired. Ensembl BioMart also gives you the option of retrieving the corresponding transcript and gene ID for each protein ID.
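
As a rough illustration of the BioMart lookup suggested above, here is a small Python sketch that queries the Ensembl BioMart web service for a list of FBpp protein IDs and returns the matching transcript and gene IDs; retired IDs simply come back with no row. The dataset, filter and attribute names ("dmelanogaster_gene_ensembl", "ensembl_peptide_id", etc.) are assumptions about the current BioMart configuration and may need adjusting.

    # Look up fly parent protein IDs via the Ensembl BioMart web service.
    import requests

    MART_URL = "http://www.ensembl.org/biomart/martservice"

    XML_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE Query>
    <Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="1">
      <Dataset name="dmelanogaster_gene_ensembl">
        <Filter name="ensembl_peptide_id" value="{ids}"/>
        <Attribute name="ensembl_peptide_id"/>
        <Attribute name="ensembl_transcript_id"/>
        <Attribute name="ensembl_gene_id"/>
      </Dataset>
    </Query>"""

    def lookup_parents(protein_ids):
        """Return {protein_id: (transcript_id, gene_id)} for IDs that are still current."""
        xml = XML_TEMPLATE.format(ids=",".join(protein_ids))
        response = requests.get(MART_URL, params={"query": xml}, timeout=120)
        response.raise_for_status()
        mapping = {}
        for line in response.text.strip().splitlines():
            pep, tx, gene = line.split("\t")
            mapping[pep] = (tx, gene)
        return mapping

    if __name__ == "__main__":
        # FBpp0112526 is the example ID from the question above.
        print(lookup_parents(["FBpp0112526"]))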

Q2:
By downloading the fly pseudogenes from pseudogene.org, I can get more than 1,000 pseudogenes, but if I use BioMart and select pseudogenes, I only get 175. Why?

Since all the pseudogenes at pseudogene.org were identified by your lab, you must have their parent information (gene name or transcript name). Could you provide that information? I do not need the parent protein name.

By the way, what pipeline did the lab use to identify the pseudogenes? Do the pseudogenes have UTRs? Which paper did the lab publish describing how the pseudogenes were identified?

A2:
As I said before, the fly pseudogenes available from pseudogene.org are based on a very old gene annotation (Ensembl build 50). The quality of the pseudogene annotation depends on the quality of the gene annotation, and since the fly gene annotation in Ensembl build 50 was just a draft, many of the pseudogene entries we obtained from build 50 are actually false positives. We are currently working on the latest fly pseudogene annotation and will make it available soon (in the next couple of weeks). Our latest annotation contains about 150 pseudogenes. This set was obtained using combined manual and automatic annotation; the automatic annotation was produced with PseudoPipe, our pseudogene annotation pipeline.

Also, if you select pseudogenes in BioMart, you will find only the Ensembl-annotated pseudogenes, which were identified using the Ensembl annotation pipeline.

Gerstein lab has published numerous papers regarding pseudogene annotation. For the full list please see: http://papers.gersteinlab.org/subject/pseudogenes/index.html

The pseudogenes do have UTRs; however, at the moment we do not provide a UTR annotation for fly pseudogenes.

Q3:
How many of your 150 pseudogenes are among the 175 pseudogenes obtained from Ensembl via BioMart?

Your pseudogene pipeline starts from protein sequences; is that why your report has no UTRs?

A3:
I attach here (see below) the latest fly pseudogene annotation.
Regarding your questions:

1. There is a reasonable overlap between the Ensembl pseudogenes and our set. However, I should mention that the Ensembl pseudogenes are based on automatic annotation only, while our pseudogenes are also manually annotated.

2. Yes, our pseudogene annotation pipeline uses protein sequence information.

Q4:
Thanks for that. But does the attached file contain only the pseudogenes common to the Gerstein lab annotation and the Ensembl annotation, since each row has an Ensembl ID?

A4:
The file contains the latest Gerstein lab annotation. Our annotation was done using a combination of automatic and manual annotation, so it is of higher quality than the Ensembl one. The pseudogenes are given Ensembl IDs for easier processing.

Q5:
I am confused. Could you, for example, show me one pseudogene that is annotated by the Gerstein lab but not by Ensembl?

A5:
Maybe I was not clear: our pseudogenes are available through Ensembl, but there are also Ensembl-only pseudogenes that have no counterpart in our data set. In addition, we define each pseudogene’s biotype, whereas in Ensembl you will not find that biotype information.

Q6:
Oh? I heard the Gerstein lab submitted its latest annotation not long ago, and that it is not yet available to the public. Is the file you attached earlier exactly that latest, unpublished annotation?

Was the file you sent me obtained through Ensembl BioMart? If so, how should I set the "Filters" there in order to get the same file?

Could I also have the pseudogenes that are not included in the Ensembl pseudogene list?

By the way, what is "processed_pseudogene" vs "unprocessed_pseudogene" ?

Sorry to keep bothering you, and thanks for your patience.

A6:
The file that I sent you is our latest, as yet unpublished annotation, and yes, it is not publicly available at the moment. But this will be the official list of pseudogenes to use for the fly genome, since it is a high-quality set, with each pseudogene annotation validated through manual inspection.

The “processed” and “unprocessed” nomenclature refers to the pseudogene biotype, a classification of pseudogenes based on their mode of creation (e.g. processed pseudogenes were formed through retrotransposition, while unprocessed pseudogenes are usually the product of duplication). If there is no defined nomenclature, e.g. just “pseudogene” in the biotype field, that means we could not assign a definite biotype to that particular element.

If you want to compare our pseudogene set with the one from Ensembl, I would recommend using BEDTools: create a BED file for each set and intersect them.
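
For example, a minimal Python sketch of that comparison might look like the following; it assumes "bedtools" is on your PATH, and the BED file names are placeholders.

    # Intersect our fly pseudogene set with the Ensembl set using BEDTools.
    import subprocess

    def shared_pseudogenes(ours="gersteinlab_fly_pgenes.bed",
                           ensembl="ensembl_fly_pgenes.bed",
                           out="shared_pgenes.bed"):
        # -u reports each entry of -a at most once if it overlaps anything in -b
        with open(out, "w") as fh:
            subprocess.run(["bedtools", "intersect", "-a", ours, "-b", ensembl, "-u"],
                           stdout=fh, check=True)

    if __name__ == "__main__":
        shared_pseudogenes()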

coevolution webserver

Q:

I have read the article “An integrated system for studying residue coevolution in proteins” and have started to use the coevolution webserver http://coevolution.gersteinlab.org/coevolution/. The help page says SCA is one of the coevolution scoring functions, but I cannot find it on the webpage. Could you please tell me what is wrong?

A:
SCA is only available in the downloadable version, not on the website anymore, since it sometimes ran for an extremely long time. If you want to use SCA, you can use the downloadable version, or the latest SCA method from the Ranganathan group, which I think is better than the version we implemented.

Query about QTL calling from the Wang et al. PsychENCODE paper

Q:

I have a quick query about the Wang et al. paper from the PsychENCODE study.
Were the QTLs identified from all the samples or from the control samples only?
I have checked the paper, the online resources and the supplementary methods but cannot seem to work this out.

A:
The QTLs were identified from both control and disease samples. You can find the sample information in Table S11 (Summary of dataset).

Inquiry regarding PsychENCODE eQTL resource

Q1:

Were the eQTLs calculated on 1,886 unique individuals?

A1:
No, the eQTLs were calculated on 1,387 filtered adult samples with matching gene expression and genotypes.

Q2:
Fig. S34 mentions that only 1,432 individuals were genotyped. How was the genotype information determined for the remaining 454 individuals?

A2:
We did not have genotype information for the remaining 454 individuals, so we did not include them in any QTL analysis.

Q3:
The numbers of samples with genotypes enumerated in Table S1 and Table S11 do not appear to match. For example, Table S1 reports 450 GTEx samples (97 DFC), but Table S11 reports 25 GTEx genotypes from the prefrontal cortex. There may be some subtlety between these two tables that I have missed; could you please clarify how to interpret them properly?

A3:
The Genotypes column in Table S11 only includes the filtered, high-genotyping-quality samples (for example, genotype imputation accuracy score R2 > 0.3) that have matched RNA-seq data.
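
For what it is worth, a tiny Python/pandas sketch of that filtering is shown below; the metadata file name and column names ("imputation_r2", "has_rnaseq", "study", "brain_region") are hypothetical placeholders, not the actual PsychENCODE field names.

    # Count samples passing the genotype-quality and matched-RNA-seq filters.
    import pandas as pd

    samples = pd.read_csv("sample_metadata.tsv", sep="\t")       # placeholder path
    filtered = samples[(samples["imputation_r2"] > 0.3) & samples["has_rnaseq"]]

    # Counts roughly comparable to the Genotypes column of Table S11
    print(filtered.groupby(["study", "brain_region"]).size())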

Information about program code for ENCODE paper

Q:
Over the last few days I have been reading your paper "Architecture of the human regulatory network derived from ENCODE data". I am doing something related and would like to perform your kind of analysis in addition, or to merge the two ideas somehow. For this purpose I was looking for program code published for the analysis in your work, but so far I have only found the workflow description in the SI. If it is possible, I would be delighted if you could share the relevant code with me, which would make my life much easier and my analysis much quicker. I am primarily interested in everything that would allow me to infer the hierarchy diagrams for the TF network and the TF-miRNA network. By the way: is there any reason why you did not include histone modification and DNA methylation data?

A:
Some code is associated with separate papers; e.g., see:

http://papers.gersteinlab.org/papers/hier-rewiring
encodenets.gersteinlab.org

training set size used for PPI network construction with the Bayesian method

Q:
I am trying to construct a "gene co-phenotype" background network using the Bayesian approach described in your paper "A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data" (Science, 17 October 2003).

After reading the supplementary methods for this paper, I have a question about how to set the training data set size.
In the paper, 8,250 positive and 2,691,903 negative training gene pairs are used. It is recommended that the training data set be balanced to match the true situation when using a naive Bayes method. Could you give me some instructions on how you set the positive/negative training data set sizes? I would be very glad to hear from you.

A:

The best I can do here is point you to:
http://papers.gersteinlab.org/papers/funcpred-goldstd

Pseudogenes lacking on the short arms of chr13, 14, 15 and 22

Q:
I found your paper (http://genomebiology.com/2012/13/9/r51) about
pseudogenes very informative.

I have been looking at the distribution of pseudogenes
(http://nagarjunv.blogspot.se/2013/12/pseudogene-distribution-across-human.html)
using the Ensembl annotation as well as the one found on pseudogene.org.

However, pseudogenes seem to be lacking on the short arms of chr13, 14, 15 and 22.

Could you please let me know if this is a known biological pattern or
some missing annotation?

A:
In humans, chromosomes 13, 14, 15, 21 and 22 are acrocentric: they consist of a very long arm and a very short arm that is homologous across the five chromosomes. Only the long arms have been sequenced and annotated. This is why there are no pseudogenes (or genes, for that matter) annotated on those chromosome arms.

For more info please see:
http://www.nature.com/nature/journal/v428/n6982/full/nature02379.html
and
http://www.sanger.ac.uk/about/history/hgp/chr13.html

Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors

Q:
in your research paper "Classification of human genomic regions based
on experimentally determined binding sites of more than 100
transcription-related factors" you identified regions with extremely
high and low degrees of co-binding, termed respectively "HOT" and "LOT"
regions. Since I’m very interested in this classification, I tried to
reproduce this analysis on transcription-related factor (TRF) data of a
particular cell type.

I first downloaded the Genome Structure Correction scripts from

http://www.encodestatistics.org/svn/genome_structural_correction/python_encode_statistics/trunk/

and ran "block_bootstrap.py" on every pair of TRF data, thus obtaining
a matrix with z-scores. I then computed a raw z-score for each TRF,
defined as the average z-score with all other TRFs in the matrix. I
finally sorted these raw z-scores numerically and normalized them
linearly, so as to assign a weight of 1 to the TRF with the lowest raw
z-score and a weight of 1/n to the TRF with the highest raw z-score.
I’m afraid it is not clear to me what I should do next for the
identification of HOT and LOT regions: I would be very grateful if you
could help me with this analysis.

A:
The z-scores that you have computed can be considered as the global weights of the TRFs, where a TRF that more frequently binds to the same locations as other TRFs receives a lower weight, in order to de-emphasize the global co-binding effects.

For each bin j, the weighted binding score (i.e., the degree of region-specific co-occurrence) was computed as d_j = \sum_i w_i s_{ij}, where i runs over all TRFs, w_i is the weight of TRF i as defined above, and s_{ij} is its discretized binding signal at bin j (from 1 to 5, with 5 for the top 25th percentile and 1 for zeros). The top 1% of bins with the highest d_j were defined as the HOT regions, while the 1% of bins with the lowest non-zero d_j were defined as the LOT regions.
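
To make the weighting and cutoff steps concrete, here is a minimal numpy sketch of the scheme described above; `zscores` is taken to be the TRF-by-TRF co-binding z-score matrix (e.g. from block_bootstrap.py) and `signal` the discretized TRF-by-bin binding-signal matrix. The variable names and the handling of the 1% cutoffs are illustrative assumptions, not the exact code used in the paper.

    # Compute TRF weights from co-binding z-scores and call HOT/LOT bins.
    import numpy as np

    def hot_lot_regions(zscores, signal, top_frac=0.01):
        n_trf = zscores.shape[0]
        raw = zscores.mean(axis=1)                 # raw z-score per TRF
        order = np.argsort(raw)                    # lowest raw z-score first
        weights = np.empty(n_trf)
        # linear weights: 1 for the lowest raw z-score, 1/n for the highest
        weights[order] = np.linspace(1.0, 1.0 / n_trf, n_trf)

        d = weights @ signal                       # d_j = sum_i w_i * s_ij
        nonzero = np.flatnonzero(d > 0)
        k = max(1, int(top_frac * signal.shape[1]))

        hot = np.argsort(d)[-k:]                   # top 1% of bins by d_j
        lot = nonzero[np.argsort(d[nonzero])[:k]]  # bottom 1% of non-zero d_j
        return hot, lot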

Please note that in the original calculations, when the block sampling program was run, the human genome was segmented into three classes: DNase hypersensitive peaks, DNase hypersensitive non-peak hotspots, and other regions. The idea was that the prior TRF binding probabilities in these three classes of regions could be quite different, and thus they should be considered separately during the sampling process.