I am using PseudoPipe and I am wondering about the different types of output it produces.
I looked into the script and found several types: GENE-SINGLE, PSSD, FRAG, GENE-MULT, and DUP. Could you explain the meaning of each type?
From what I can see, you are looking at an intermediate result file, not at the final output. The final output should contain only three biotypes: PSSD, DUP and FRAG.
PSSD indicates processed pseudogenes, DUP indicates duplicated pseudogenes, and FRAG indicates pseudogene loci to which we cannot assign a biotype (processed or duplicated) with certainty.
GENE-SINGLE and GENE-MULTI are intermediary biotype definitions. The SINGLE refers to the fact that the pseudogene locus contains only one exon (similar to processed pseudogenes) and MULTI refers to the fact that the potential pseudogenic locus contains multiple exons (similar to duplicated pseudogenes).
If a proposed locus has over 95% sequence identity to the parent gene, covers over 95% of the parent gene sequence, and has no identifiable disablements, we initially label it GENE-SINGLE or GENE-MULTI, depending on its exon count. For such very high similarity we tend to be conservative and do not label the locus as a pseudogene. If subsequent searches turn up additional evidence (e.g. a polyA tail, truncations, etc.), we relabel the locus as a pseudogene. For example, if we find a polyA tail you might see PSSD|GENE-SINGLE, and in that case we relabel the locus as a processed pseudogene.
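The labeling rule described above can be sketched roughly as follows. This is only an illustration of the logic in the answer, not PseudoPipe's actual code: the `Locus` fields and the function names are hypothetical, and only the two 95% cutoffs and the label strings come from the text. The answer does not say what happens to a multi-exon locus with a polyA tail, so the sketch leaves that label unchanged.

```python
from dataclasses import dataclass
from typing import Optional

# Cutoffs taken from the answer above; everything else is an assumption.
IDENTITY_CUTOFF = 0.95
COVERAGE_CUTOFF = 0.95


@dataclass
class Locus:
    identity: float       # fraction of identical positions vs. the parent gene
    coverage: float       # fraction of the parent gene sequence covered
    n_exons: int          # number of exons found at the locus
    n_disablements: int   # frameshifts, premature stops, etc.
    has_polya: bool       # whether a polyA tail was detected


def provisional_label(locus: Locus) -> Optional[str]:
    """Return the intermediate GENE-* label, or None if the rule does not apply."""
    gene_like = (
        locus.identity > IDENTITY_CUTOFF
        and locus.coverage > COVERAGE_CUTOFF
        and locus.n_disablements == 0
    )
    if not gene_like:
        return None  # ordinary PSSD/DUP/FRAG assignment, not modeled here
    base = "GENE-SINGLE" if locus.n_exons == 1 else "GENE-MULTI"
    if locus.has_polya and base == "GENE-SINGLE":
        return "PSSD|" + base  # extra evidence: candidate processed pseudogene
    return base


def final_label(label: str) -> Optional[str]:
    """Map a provisional label to a final biotype (None = not reported)."""
    if label.startswith("PSSD|"):
        return "PSSD"
    return None  # conservative: a plain GENE-* locus is not called a pseudogene
```

For instance, `provisional_label(Locus(0.99, 0.98, 1, 0, True))` gives `"PSSD|GENE-SINGLE"`, which `final_label` maps to `"PSSD"`.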
In your recent Nature Communications report on mouse pseudogenes (https://doi.org/10.1038/s41467-020-17157-w), you stated that “For human, we used a similar workflow to refine the reference pseudogene annotation to a high-quality set of 14,650 pseudogenes.” Could you kindly share the chromosome coordinates of these 14,650 pseudogenes with me? I am investigating the distribution patterns of RNA-editing sites in the human genome, and I cannot find a good source of pseudogene definitions. The database at Pseudogene.org is too old and is not based on GRCh38.
In the paper we worked with the GENCODE consortium to refine the pseudogene annotation. Since the paper's publication we have continued to improve the human pseudogene annotation using a combination of manual and automatic pipelines, as described in the paper. Attached are the coordinates for the complete set of pseudogenes.
For a definition of pseudogene, I suggest you use our paper https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-9-r51, which defines pseudogenes as defunct genomic loci with sequence similarity to functional genes but lacking coding potential due to the presence of disruptive mutations such as frameshifts and premature stop codons.
I recently came across your paper, "Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes."
I’m interested in the substitution rates in human pseudogenes. Figure 2A from your paper (pasted below) plots these rates. Would you be able to send me these rates as a table?
Additionally, has your group calculated the substitution rates for more families of pseudogenes? (The NAR 2003 paper only analyzed ribosomal protein pseudogene sequences.) I tried poking around psiDR, but was not able to find this type of information readily available.
These substitution rate matrices would be very helpful for my research.
We develop MH Guide, a genome-guided cancer treatment decision support software (https://www.molecularhealth.com/us/).
I was trying to get the current annotated pseudogene information via http://pseudogene.org/Human/. The link to GENCODE seems not to work and returns “file not found” (https://www.gencodegenes.org/releases/current.html).
Could you please direct me to the file with the annotated pseudogenes?
If you are looking for the current GENCODE annotation, please follow this link to the current release: https://www.gencodegenes.org/human/ . If you want to use PseudoPipe to create a custom pseudogene annotation of a genome sequence of your choice, please follow the instructions here: http://pseudogene.org/pseudopipe/ . If you are interested in the functional annotation of pseudogenes, with information on pseudogene activity, please see http://pseudogene.org/psicube/ .
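Once a GENCODE annotation file has been downloaded, the pseudogene records can be pulled out with a few lines of Python. The sketch below assumes the GTF format used by GENCODE, in which each gene record carries a `gene_type` attribute whose value contains the word `pseudogene` (for example `processed_pseudogene`). The function name and the choice of BED columns are illustrative assumptions, not part of any GENCODE or PseudoPipe tool.

```python
import re
from typing import Iterable, Iterator


def pseudogene_bed_lines(gtf_lines: Iterable[str]) -> Iterator[str]:
    """Yield one BED6 line per gene record whose gene_type mentions 'pseudogene'."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # keep gene-level records only
        # Attributes look like: gene_id "X"; gene_type "Y"; ...
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        gene_type = attrs.get("gene_type", "")
        if "pseudogene" not in gene_type:
            continue
        chrom, strand = fields[0], fields[6]
        start = int(fields[3]) - 1  # GTF is 1-based, BED is 0-based half-open
        end = int(fields[4])
        name = attrs.get("gene_id", ".") + "|" + gene_type
        yield "\t".join([chrom, str(start), str(end), name, "0", strand])
```

For a gzip-compressed GTF, the lines can be supplied with `gzip.open(path, "rt")` from the standard library, and the yielded strings written to a `.bed` file one per line.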
I am working on pseudogene identification for a new project and have been using your pipeline, but I am running into a few errors, which I describe below. Could you please help me resolve them? I would be very grateful.
> 1. On terminal:
> sudo bash pseudopipe.sh ~/pgenes/ppipe_output/caenorhabditis_elegans_62_220a ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/dna_rm.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/dna/Caenorhabditis_elegans.WS220.62.dna.chromosome.%s.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/pep/Caenorhabditis_elegans.WS220.62.pep.fa ~/pgenes/ppipe_input/caenorhabditis_elegans_62_220a/mysql/chr%s_exLocs 0
> Making directories
> Copying sequences
> Fomatting the DNAs
> Preparing the blast jobs
> Skipping blast
> Processing blast output
> Skipping the processing of blast output
> Running Pseudopipe on both strands
> Working on M strand
> Finished Pseudopipe on strand M
> Working on P strand
> Finished Pseudopipe on strand P
> Generating final results
> find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/minus/pgenes’: No such file or directory
> find: ‘/home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/plus/pgenes’: No such file or directory
> gzip: /home/kashmir/pgenes/ppipe_output/caenorhabditis_elegans_62_220a/pgenes/*/pgenes/*.all.fa: No such file or directory
> Finished generating pgene full alignment
> Finished running Pseudopipe
> 2. In log file inside minus and plus folder:
> need to document overlap parameter (30) and dependency on mask array files.
> mask fields [2, 3]
> Traceback (most recent call last):
> File "/home/kashmir/SOFTWARE/pgenes/pseudopipe/core/filterEnsemblGene.py", line 60, in <module>
> maskFile = openOrFail(ExonMaskTemplate % chr, 'r')
> TypeError: not all arguments converted during string formatting
> running filterEnsemblGene.py
> failed during filterEnsemblGene.py stage.
From the output, it looks like you had a couple of issues, starting with the blast job.
Could you please check the blast/output folder in your output directory for any split000*.Out files (where * is a number)? If you don’t see any output files, your blast job did not run. In order to run the pipeline you need to have a couple of additional software packages installed and preferably added to the path. Specifically, you will need blast-2.2.13 and fasta-35.1.5. If you do not want to add them to the path, you can add the path to their location in the env.sh file, which you can find in the bin folder of PseudoPipe.
This should allow you to run the pipeline without any issues.
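The check described above can also be scripted. The following is a minimal sketch that looks in the `blast/output` folder of a PseudoPipe output directory for files matching `split000*.Out`, as named in the reply; the function names are hypothetical, and treating an empty file as "blast did not run" is an added assumption.

```python
from pathlib import Path
from typing import List


def find_blast_outputs(output_dir: str, pattern: str = "split000*.Out") -> List[Path]:
    """List blast result files under <output_dir>/blast/output."""
    blast_out = Path(output_dir) / "blast" / "output"
    if not blast_out.is_dir():
        return []
    return sorted(p for p in blast_out.glob(pattern) if p.is_file())


def blast_ran(output_dir: str) -> bool:
    """True if at least one non-empty blast result file exists (assumption)."""
    return any(p.stat().st_size > 0 for p in find_blast_outputs(output_dir))
```

Calling `blast_ran` on the output directory passed as the first argument to `pseudopipe.sh` returns `False` when the blast step produced nothing, which would be consistent with the missing `pgenes` folders in the log above.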
I am … developing an application that matches cancer patients to treatments based on each person’s genetic profile. We are looking for an updated list of human pseudogenes to use in evaluating submitted DNA variants. Can you tell me if the Pseudofam data files on the pseudogene.org website are still being updated? If not, perhaps you could recommend an alternative source?
It is best to get an updated list of pseudogenes from pseudogene.org, which is continually updated, i.e. http://pseudogene.org/Human/. Yucheng
The pseudogene databases, including Pseudofam and PseudoPipe, have been extremely helpful for a project I am working on. Do you know how I could compare the DNA sequence of a human gene with all the pseudogenes in the PseudoPipe resources? I am looking to identify pseudogenes that may be related to the genes I am working with. I was hoping there was a way to obtain this information by running BLAST with the DNA sequence of a specific gene against the sequences of all the pseudogenes in the genome, similar to the NCBI BLAST or UniProt BLAST feature.
Any help or insight would be appreciated.
If you have many genes to query, you could use BLAST+ (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download) to build your own tool. You can download the sequences of all pseudogenes and make a BLAST database from them (https://www.ncbi.nlm.nih.gov/books/NBK279688/ ), which you can then query.
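As a rough sketch of that workflow, the Python below builds the two BLAST+ command lines (`makeblastdb` to index the pseudogene sequences and `blastn` to query them) and parses the default 12-column tabular output (`-outfmt 6`). It assumes BLAST+ is installed and on the path; the helper names, file names and the e-value cutoff are placeholders rather than recommendations.

```python
import subprocess
from typing import Dict, Iterable, List

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def makeblastdb_cmd(fasta: str, db_name: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(query: str, db_name: str, out_tsv: str, evalue: float = 1e-5) -> List[str]:
    """Command that searches gene sequences against the pseudogene database."""
    return ["blastn", "-query", query, "-db", db_name,
            "-outfmt", "6", "-evalue", str(evalue), "-out", out_tsv]


def run(cmd: List[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def parse_outfmt6(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Turn tabular BLAST lines into dictionaries with typed values."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        hit: Dict[str, object] = dict(zip(OUTFMT6_COLUMNS, line.split("\t")))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits
```

A typical session would call `run(makeblastdb_cmd("pseudogenes.fa", "pgenes_db"))` once, then `run(blastn_cmd("my_genes.fa", "pgenes_db", "hits.tsv"))`, and finally read `hits.tsv` with `parse_outfmt6` to rank candidate pseudogenes per gene by bit score or percent identity.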
I recently read your 2003 paper on finding processed pseudogenes, "Millions of Years of Evolution Preserved: A Comprehensive Catalog of the Processed Pseudogenes in the Human Genome". I want to find processed pseudogenes in several recently released mammalian genomes, and your paper is very interesting and helpful for my work. To make sure I have understood the method correctly, I would like to repeat your analysis using your original data.
However, I ran into a problem when trying to download the nonredundant human proteome set from the EBI website: the data was released in June 2002, and I cannot download it successfully. I am writing in the hope of obtaining the June 2002 nonredundant human proteome set that you used. I know many years have passed since the paper was published and you may no longer have the original data, but I still wanted to try!
The data associated with the paper is here: http://pseudogene.org/human-all/index.html. You can also find the latest human pseudogene annotation here: http://pseudogene.org/Human/
I would like to cite your work on pseudogenes in my Master's thesis, but I cannot find a suitable citation for it. Could you provide me with one?
Many are listed at http://papers.gersteinlab.org/subject/pseudogenes/
Could you please help me get pseudogene information for human, mouse, rat, Drosophila and C. elegans? I need FASTA or .bed files containing only the pseudogene annotations, one for each of these five species.
See pseudogene.org. For any information regarding the pseudogene annotation in human, mouse, Drosophila and C. elegans, please see: