Human pseudogene annotation

Posted on January 6, 2022 by gersteinfaq

Q:
In your recent Nature Communications report on mouse pseudogenes (https://doi.org/10.1038/s41467-020-17157-w), you stated that “For human, we used a similar workflow to refine the reference pseudogene annotation to a high-quality set of 14,650 pseudogenes.” Could you kindly share the chromosome coordinates of these 14,650 pseudogenes with me? I am investigating the distribution patterns of RNA-editing sites in the human genome, and I cannot find a good source of pseudogene definitions. The Pseudogene.org database is too old and is not based on GRCh38.

A:
In the paper, we worked with the GENCODE consortium to refine the pseudogene annotation. Since publication, we have continued to improve the human pseudogene annotation using a combination of manual and automatic pipelines, as described in the paper. Attached are the coordinates for the complete set of pseudogenes.

For a definition of a pseudogene, I suggest you use our paper (https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-9-r51), which defines pseudogenes as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frameshifts and premature stop codons.
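
To make the definition more concrete, here is a minimal Python sketch that flags the two kinds of disruption named above in a candidate coding sequence. The function name `find_disruptions` and the toy sequences are illustrative assumptions; this is not the annotation pipeline described in the paper, only a simple check on a single sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_disruptions(cds: str) -> list[str]:
    """Return a list of disruptive features found in a candidate coding sequence.

    Hypothetical helper for illustration only.
    """
    cds = cds.upper()
    problems = []
    # A length that is not a multiple of 3 means the reading frame
    # cannot be intact from start to end (a possible frameshift).
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons and scan all but the last one,
    # since a stop codon is expected only at the very end.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index}")
            break
    return problems

# Toy sequences (made up for this example)
intact = "ATGGCTGAACGTTAA"      # ATG GCT GAA CGT TAA
with_stop = "ATGGCTTAACGTTAA"   # ATG GCT TAA CGT TAA
frameshift = "ATGGCGAACGTTAA"   # one base deleted, 14 nt

for name, seq in [("intact", intact), ("with_stop", with_stop), ("frameshift", frameshift)]:
    print(name, find_disruptions(seq))
```

An empty list means no disruption was detected by this check; it does not by itself establish that a locus is functional.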

Question about uORF annotation in NAR paper

Posted on December 14, 2020 by gersteinfaq

Q:
I am examining your uORF annotations with great interest but am unsure how to interpret a few of the entries in the file below on the github site.

Complete list of predictions (complete_uORF_predictions_hg19.zip · 35.29 MB)

If you look at these two uORF_IDs:

ENST00000307677.4.uORF_ATC.5

ENST00000422920.1.uORF_ATA.4

They are annotated with the same start and end coordinates, but different start codons (ATC / ATA).

Also, looking at the region I cannot find either start codon in the hg19 reference.

Any idea what is going on here?

A:
Basically, the start codon here appears to overlie a splice site. Alternative splicing means you could either end up with an ATC or an ATA at that location depending on which processed transcript you are looking at (see image below). That’s why these uORFs have the same start and end coordinate, but different start codons.

We had wrestled a bit with the question of whether or not to call these two separate uORFs. However, they do have different mRNA/protein sequences, so that’s why they received separate entries in our catalog.
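
The splice-site effect described above can be sketched in Python. Everything in this example is made up for illustration (the toy genome, the exon coordinates, and the helper names `spliced_sequence` and `orf_from`); it only shows how two processed transcripts can share uORF start and end coordinates in the genome while carrying different start codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def spliced_sequence(genome: str, exons: list[tuple[int, int]]) -> str:
    """Join exon intervals (0-based, end-exclusive) into a processed transcript."""
    return "".join(genome[start:end] for start, end in exons)

def orf_from(transcript: str, start: int) -> list[str]:
    """Read codons from `start` until the first in-frame stop codon."""
    codons = []
    for i in range(start, len(transcript) - 2, 3):
        codon = transcript[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons

# Toy genome: exon 1 ends in "AT", then an intron, then a downstream exon
# with two alternative acceptor sites three bases apart.
exon1 = "GGCAT"              # genomic 0-5
intron = "GTAAGTTTTCAG"      # genomic 5-17
downstream = "CAGAGGCTGAGG"  # genomic 17-29
genome = exon1 + intron + downstream

# Transcript A uses the first acceptor; transcript B uses the second one.
transcript_a = spliced_sequence(genome, [(0, 5), (17, 29)])
transcript_b = spliced_sequence(genome, [(0, 5), (20, 29)])

# In both transcripts the uORF starts at the same genomic base (position 3)
# and ends at the same genomic TGA (positions 24-27).
uorf_a = orf_from(transcript_a, 3)
uorf_b = orf_from(transcript_b, 3)

print(uorf_a)       # ['ATC', 'AGA', 'GGC', 'TGA']
print(uorf_b)       # ['ATA', 'GGC', 'TGA']
print(genome[3:6])  # the unspliced genome shows neither ATC nor ATA here
```

Because the third base of the start codon comes from the downstream exon, neither codon appears as a contiguous triplet at that position in the reference sequence.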

Supervised enhancer prediction with epigenetic pattern recognition and targeted validation

Posted on December 14, 2020 by gersteinfaq

Q:
I am reading your paper “Supervised enhancer prediction with epigenetic pattern recognition and targeted validation”, and I would greatly appreciate it if you could provide some results that appear to be missing from Figure 2.

I am interested in the AUPR comparison of the matched-filter results with the peak-calling results, but I could not find the "gray" numbers.

Fig. 2 a, ….the gray numbers in the parentheses refer to the performance of the peak-based models.

A:
Thank you for bringing this to our attention and apologies for any confusion. We lost the numbers during one of the revisions. I am attaching a SI figure from an older version of the manuscript that answers your question.

In the table, I compare the AUROC and AUPR of the different matched-filter models (outside parentheses) with the corresponding peak-based measures (within parentheses) for the same histone marks. In this particular case, the comparison is based on overlap with a single STARR-seq experiment, but the trends remain the same even after combining information from multiple STARR-seq experiments within the same cell line.
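
For readers who want to run the same kind of comparison on their own predictions, here is a minimal Python sketch of the two metrics. The labels and scores are invented toy values, not data from the paper, and average precision is used as one common estimator of AUPR; the paper's exact computation may differ.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy data: 1 = region overlaps a validated enhancer, 0 = it does not.
labels = [1, 1, 0, 1, 0, 0]
matched_filter_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]  # hypothetical
peak_scores = [0.9, 0.4, 0.8, 0.6, 0.7, 0.2]            # hypothetical

# Print in the same layout as the table: matched filter (peak-based)
print(f"AUROC {auroc(labels, matched_filter_scores):.2f} ({auroc(labels, peak_scores):.2f})")
print(f"AUPR  {average_precision(labels, matched_filter_scores):.2f} "
      f"({average_precision(labels, peak_scores):.2f})")
```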

Full set of tQTLs and isoQTLs from Wang et al. 2018

Posted on December 14, 2020 by gersteinfaq

Q:
We have made great use of the publicly available PEC resources on http://resource.psychencode.org/, in particular the QTL data. However, I have not been able to locate the full set of isoQTLs and tQTLs without any p-value/FDR filtering, as is available for eQTLs. Is there somewhere I can access this easily? Or does access to the full set of tQTLs and isoQTLs require an application to Synapse?

A:
Currently we don’t provide access to the full set. The full set is very large and we need to discuss where we should share these data. I will let you know once we have any updates.

Request for example input and output files of Hotspot Community pipeline

Posted on December 14, 2020 by gersteinfaq

Q:
We are interested in using the HotCommics pipeline to identify hotspot
communities from our own cancer mutation data. However, we have had
difficulty running the pipeline because we could not find a
description of the input files for the snpMapping and
hotSpotCalculation steps. Could you kindly provide us with some
example input files so that we can format our input appropriately?

A:
Thank you for your interest in our work. The input file for the SNP
mapping step is the input file for the VAT tool, which can be the VCF
file that you are working with. Alternatively, you can use a
tab-separated file with the header shown below.

#CHROM hg19_pos ID Ref Alt Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Info

http://vat.gersteinlab.org
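
As a rough aid for formatting your own data, the following Python sketch checks that a tab-separated file carries the header listed above. The column names come from this answer; the function name `check_mutation_table`, the example row, and the assumption that `hg19_pos` must be an integer are ours for illustration and are not part of the HotCommics or VAT code.

```python
import csv
import io

EXPECTED_HEADER = [
    "#CHROM", "hg19_pos", "ID", "Ref", "Alt",
    "Tumor_Sample_Barcode", "Matched_Norm_Sample_Barcode", "Info",
]

def check_mutation_table(handle):
    """Validate the header and return the data rows as dictionaries.

    Hypothetical helper: raises ValueError on a malformed header or row.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line_no, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(EXPECTED_HEADER):
            raise ValueError(f"line {line_no}: expected {len(EXPECTED_HEADER)} fields")
        row = dict(zip(EXPECTED_HEADER, fields))
        int(row["hg19_pos"])  # assumption: position is an integer coordinate
        rows.append(row)
    return rows

# Example with a single made-up record (all values are placeholders).
example = "\t".join(EXPECTED_HEADER) + "\n" + "\t".join(
    ["chr1", "12345", ".", "A", "G", "TUMOR_01", "NORMAL_01", "."]
) + "\n"

rows = check_mutation_table(io.StringIO(example))
print(rows[0]["hg19_pos"], rows[0]["Tumor_Sample_Barcode"])
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.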

For hotspot community identification, you will have to run the
community identification module for each PDB structure to which your
mutations have mapped:

https://github.com/gersteinlab/HotCommics/tree/master/communityIdentification

Once you have generated these communities and have a list of the PDB
structures to which mutations have mapped, you will need to provide
that list of PDBs for the hotspot calculation.

Encode for cancer genomics to predict gene expression

Posted on December 14, 2020 by gersteinfaq

Q:
I am just beginning my first-ever project, using the extended gene definition provided in the ENCODE for cancer genomics dataset to predict gene expression. I would be incredibly grateful for an explanation of the layout of the text files. I have been trying, without success, to understand how the extended gene was used to interpret the mutations and expression changes in the published article.

A:
Thanks for your interest in the research and the extended gene annotation. We are preparing BED-formatted extended gene annotations, and they will be available soon on our project website (http://encodec.encodeproject.org/). We will keep you informed.

ALoFT in hg38?

Posted on December 14, 2020 by gersteinfaq

Q:
Secondly, we are working with .vcf files in GRCh38 build. Is there a way to run ALoFT using this build, or will we need to do a liftover back down to hg19?

A:
Currently, ALoFT cannot be used with hg38, and we have no plans to upgrade it. For SNPs, we already provide exome-wide scores based on a liftover to hg38. However, if you want other annotated features or scores for indels, you will need to lift your variants back down to hg19. While this is not ideal, it will work.
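
For readers unfamiliar with what a liftover does, here is a toy Python sketch of the idea: positions are mapped between assemblies through aligned blocks, and positions that fall in a gap between blocks have no counterpart. The block coordinates and the `lift` function are invented for illustration; real conversions use the chain files distributed for each pair of assemblies together with a dedicated liftover tool.

```python
from bisect import bisect_right

# Hypothetical aligned blocks between a source and a target assembly:
# (source_start, target_start, length), 0-based, sorted by source_start.
BLOCKS = [(100, 90, 50), (160, 140, 40), (250, 300, 100)]

def lift(pos, blocks=BLOCKS):
    """Map a source position to the target assembly, or return None if unmapped."""
    starts = [block[0] for block in blocks]
    # Find the last block that starts at or before the position.
    index = bisect_right(starts, pos) - 1
    if index < 0:
        return None  # position lies before the first aligned block
    source_start, target_start, length = blocks[index]
    if pos >= source_start + length:
        return None  # position falls in a gap between aligned blocks
    return target_start + (pos - source_start)

for position in (120, 155, 160, 349, 350):
    print(position, "->", lift(position))
```

Variants that return `None` in a real liftover would be dropped, so it is worth counting how many records are lost before scoring the remainder.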

Data access in Psychencode repository

Posted on December 14, 2020 by gersteinfaq

Q: Would you be able to point me to the repository where all these data are stored? I am looking into the Psychencode repository in synapse but it’s not clear if all the data presented in the publications are included in there and if so, are grouped into one folder? We are particularly interested in the bulk and scRNASeq for now.
https://www.synapse.org/#!Synapse:syn5553626

A:
We have recently created a portal for easier access to the data generated through the PEC. Please see the [SingleCellRNAseq study](https://psychencode.synapse.org/Explore/Studies/DetailsPage?study=syn7067037), which these data came from. Note the link under the study description for the single-cell data used in Wang et al.

Data access approvals are handled by the NIMH through the NIMH Repository and Genomics Resources. Instructions are on the study page. If you do not have access and have questions about the process, let me know.

Does pBAM work for methylation data?

Posted on October 8, 2020 by gersteinfaq

Q: Does pBAM work for methylation data?

A: The current version of pTools does not work for methylation data, but we are currently working on this issue. The next version will have support for methylation BAM files.

Server errors on molmovdb

Posted on September 3, 2020 by gersteinfaq

Q:
I was trying a couple of tools: Morph Server and RigidFinder and in both cases I get a server error indicating that files could not be written. Specifically, RigidFinder complains: "Cann’t write to file ‘/tmp/rid74285/upfile1.pdb’." and Morph Server says "Can’t create morph directory!". If this site is still being maintained, please consider addressing these issues.

A:
Thank you for your interest in our servers and for letting us know of
problems you’ve encountered.

RigidFinder’s disk filled up. I cleared some space and it should be
working again.

Molmovdb, however, is a more complicated issue. It needs an upgrade
since it is more than 15 years old. Occasionally, we simply roll back
to a previous version, but then any submissions made since that
version are lost. We also cannot say when the next rollback will be.

We apologize for any inconvenience this may have caused. We provide
related software in our FAQs for those who are interested.
http://www2.molmovdb.org/wiki/info/index.php/Related_Resources

Gerstein Lab FAQs
