Request for example input and output files of Hotspot Community pipeline

Q:
We are interested in using the HotCommics pipeline to identify hotspot
communities from our own cancer mutation data. However, we are having
difficulty running the pipeline because we could not find a
description of the input files for the snpMapping and
hotSpotCalculation steps. Could you kindly provide some example
input files so that we can format our input appropriately?

A:
Thank you for your interest in our work. The input file for the SNP
mapping step is the input file for the VAT tool, which can be the VCF
file that you are working with. Alternatively, you can use a
tab-separated file with the header described below.

#CHROM hg19_pos ID Ref Alt Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Info

http://vat.gersteinlab.org
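
As a rough illustration of the tab-separated alternative, the sketch below checks that a file's header matches the eight columns given in the answer before the file is passed to the SNP mapping step. The function name check_header and the assumption that the header is the first line of the file are ours, not part of the pipeline, and the data row in the example is invented.

```python
import io

# Column names taken from the header described in the answer, in the same order.
EXPECTED_COLUMNS = [
    "#CHROM", "hg19_pos", "ID", "Ref", "Alt",
    "Tumor_Sample_Barcode", "Matched_Norm_Sample_Barcode", "Info",
]

def check_header(handle):
    """Return a list of problems found in the first line of a TSV handle.

    Hypothetical helper: assumes the header is the first line of the file.
    """
    first_line = handle.readline().rstrip("\r\n")
    columns = first_line.split("\t")
    problems = []
    if len(columns) != len(EXPECTED_COLUMNS):
        problems.append(
            f"expected {len(EXPECTED_COLUMNS)} columns, found {len(columns)}"
        )
    for position, (found, expected) in enumerate(zip(columns, EXPECTED_COLUMNS), 1):
        if found != expected:
            problems.append(
                f"column {position}: expected {expected!r}, found {found!r}"
            )
    return problems

# Example with an in-memory file; the data row is invented for illustration.
example = io.StringIO(
    "\t".join(EXPECTED_COLUMNS) + "\n"
    + "\t".join(["chr1", "12345", "rs0", "A", "G", "TUMOR_1", "NORMAL_1", "."]) + "\n"
)
print(check_header(example))
```

An empty list means the header matches. Anything else lists the columns to fix.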

For the hotspot community identification, you will have to run the
community identification module for each PDB structure onto which your
mutations have mapped:

https://github.com/gersteinlab/HotCommics/tree/master/communityIdentification

Once you have generated these communities and have a list of the PDBs
onto which mutations have mapped, provide that list of PDBs to the
hotspot calculation step.
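
The last step needs a plain list of the PDB entries that received at least one mapped mutation. A minimal sketch of assembling such a list follows. The layout of the mapping records (pairs of mutation ID and PDB ID), the function name unique_pdb_ids, and the one-ID-per-line output are all assumptions for illustration, so please check the repository for the format the hotspot calculation step actually expects.

```python
def unique_pdb_ids(mapped_mutations):
    """Collect the distinct PDB IDs from (mutation_id, pdb_id) pairs.

    Hypothetical helper: the real mapping output may be laid out differently.
    IDs are upper-cased so that '1abc' and '1ABC' count as one entry.
    """
    return sorted({pdb_id.strip().upper() for _, pdb_id in mapped_mutations})

# Invented example records: two mutations fall on the same structure.
mapped = [("mut1", "1abc"), ("mut2", "1ABC"), ("mut3", "2xyz")]
pdb_list = unique_pdb_ids(mapped)

# One PDB ID per line is an assumed output layout.
pdb_list_text = "\n".join(pdb_list) + "\n"
```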

Small question about the paper “Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences”

Q:
Recently, I read a paper published in Cell, titled "Passenger Mutations in More Than 2,500 Cancer Genomes: Overall Molecular Functional Impact and Consequences". Because my research topic is similar to that of this paper, I have one question about Figure 2B. In this heatmap, I see 80 motifs in total along the bottom but only 70 rows above them, so I am a little confused: how did you know that the ETS motif matched the marked row?

A:
The rows in the figure correspond to different cancer cohorts or meta-cohorts. We also provide this information on the cancer cohort with significant differential burdening in Supplement 1 in the paper.

PCAWG passenger mutation analysis

Q1:
I was trying to download a subset of data from your recent paper (https://www.cell.com/cell/fulltext/S0092-8674(20)30113-6). However, the website is returning a ‘not found’ error (http://pcawg.gersteinlab.org/). I am especially interested in ‘Gene list categories’. Therefore, I kindly request that you share the relevant files listed under ‘Gene List Categories’ on the website, so that I can use them in my analysis.

A1:
The website works fine for me. Are you sure it doesn’t work? … Please let me know which specific file you are trying to download.

Q2:
Thanks a lot for the reply.

I need the gene list categories listed under PCAWG-specific annotations (http://pcawg.gersteinlab.org/#Annotations)

Essential Genes
Immune Response Genes
DNA repair Genes
Metabolic Genes
Cancer Pathway Genes
Non-Essential Genes
Cell Cycle Genes

For some reason, when I click on a link, it directly downloads an HTML file containing an error. It would be great if you could share these files.

A2:
You can download relevant files from the link listed below.

http://pcawg.gersteinlab.org/Datasets/Annotations/categories/
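
If a download from that directory again yields an HTML error page instead of a gene list, a quick check like the one below can catch it before the file enters an analysis. This is only a sketch: the helper name looks_like_html is ours, and the assumption that the gene lists are plain text rather than HTML has not been verified against the files.

```python
def looks_like_html(data: bytes) -> bool:
    """Heuristic: does the start of a downloaded file look like an HTML page?"""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")

# Invented examples: an error page versus a plain-text gene list.
error_page = b"<!DOCTYPE html>\n<html><body>404 Not Found</body></html>"
gene_list = b"TP53\nBRCA1\n"
```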

Running SVFX

Q:
I would like to run your new SVFX method on some structural variants. For full disclosure, I’m working on a method to assess the pathogenicity of germline SVs, and would like to compare it with yours. Based on reading your preprint, I believe our methods are quite distinct in terms of training data. I think it’s great you’ve already put code on GitHub, but I’m not sure what data files are needed to run the code. Could you put me in touch with one of your students to help me run SVFX locally?

A:
Thanks for your interest in SVFX. We report our feature list in Supplementary Table 1.

Overall, our features are extracted from a range of genomic annotations and various functional genomics and epigenomics signal files.

You can download the signal files from the IHEC or Roadmap Epigenomics data portals. As you might have noticed, we created multiple tissue-specific models for our analysis.

For the germline model, we also built a feature matrix based on the H1-hESC cell line, which performed quite well. On the SVFX GitHub page, we have uploaded the BED files for the different annotations used in our study (under the data folder).
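
To make the idea of annotation-derived features concrete, here is a minimal sketch of one such feature: the fraction of a structural variant covered by intervals from a BED annotation. It is not taken from the SVFX code. The function name, the in-memory interval representation, and the choice of overlap fraction as the feature are assumptions for illustration. BED coordinates are 0-based and half-open, which the arithmetic below relies on.

```python
def overlap_fraction(sv_start, sv_end, intervals):
    """Fraction of [sv_start, sv_end) covered by BED-style intervals.

    `intervals` is a list of (start, end) pairs on the same chromosome,
    0-based and half-open as in BED. Overlapping intervals are merged
    first so that no base is counted twice.
    """
    if sv_end <= sv_start:
        raise ValueError("structural variant must have positive length")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        # Clip each annotation interval to the variant.
        start, end = max(start, sv_start), min(end, sv_end)
        if start >= end:
            continue
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / (sv_end - sv_start)
```

In practice one such value would be computed per annotation file and per variant, giving one column of a feature matrix.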

PNAS paper supplement duplication

Q1:
I am reading with interest your recent paper (Kumar, Clarke, and Gerstein, PNAS), but I suspect that supplement 1 and 2 are the same, and neither has a list of 434 genes. Could you please supply the list?

A1:
Thank you very much for your interest in the paper. Supplement 1 includes hotspot communities based on the pan-cancer analysis (i.e., when we compute statistics over multiple cancer cohorts in TCGA). In contrast, supplement 2 lists putative driver genes with hotspot communities for specific cancer types. Note that column F in supplement 2 lists the names of the particular cancer cohorts.

Regarding the number of genes, 434 genes are based on the pan-cancer analysis.
For each gene, there are multiple PDB entries. For the analysis in our paper, we selected a representative structure with the highest residue coverage. However, to be exhaustive and to allow researchers to analyze proteins of interest, our supplement includes all PDB entries for a given gene. We have tried to explain this in our Methods section.
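
The selection of one representative structure per gene can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the paper: the tuple layout (gene, PDB ID, coverage), the function name pick_representatives, and the first-seen tie-breaking rule are hypothetical, and the records are invented.

```python
def pick_representatives(entries):
    """Pick, for each gene, the PDB entry with the highest residue coverage.

    `entries` is an iterable of (gene, pdb_id, coverage) tuples, where
    coverage is assumed to be the fraction of residues resolved.
    Ties keep the entry seen first.
    """
    best = {}
    for gene, pdb_id, coverage in entries:
        if gene not in best or coverage > best[gene][1]:
            best[gene] = (pdb_id, coverage)
    return {gene: pdb_id for gene, (pdb_id, _) in best.items()}

# Invented records for illustration only.
records = [
    ("GENE_A", "1AAA", 0.62),
    ("GENE_A", "2AAA", 0.91),
    ("GENE_B", "1BBB", 0.40),
]
representatives = pick_representatives(records)
```

Because the supplement keeps every PDB entry for a gene, a row count taken from it will generally exceed the number of genes.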

Q2:
Thanks for your quick reply; but, no, this does not remove my confusion. Please take a moment to check the link from your paper at PNAS. When I download pnas.1901156116.sd01.xlsx, the file has 217 lines (not 434) and includes the column F that breaks down by cancer type.

A2:
I am attaching our original tables to this email. It appears that the table was somehow duplicated on the PNAS website. We will work with the PNAS team to get it fixed.

Supplemental_tables.xlsx