Understanding the molecular signatures underlying the evolution of phenotypic traits represents a key challenge in both contemporary evolutionary biology and genomics. While gene duplication and amino acid divergence have frequently been linked to the evolution of novel traits, gene loss has received less attention as an evolutionary force per se. In fact, loss events have largely been associated with the non-functionalisation of redundant genes through the accumulation of deleterious mutations, a process termed pseudogenisation, following gene duplication or the transposition of processed transcripts. Yet, true gene loss mechanisms, including complete gene elimination or pseudogenisation, have been increasingly linked to phenotypic modifications.
A gene is considered lost in a given lineage if it complies with two conditions: first, it must derive from an ancestral sequence yielding an intact protein-coding gene; and secondly, it should display evidence of erosion such as complete absence from the corresponding orthologous genomic locus or accumulation of gene disrupting mutations that likely results in non-functionalisation.
Despite the current wealth of available genomes and the rapid increase in high-quality genome assemblies, exemplified by ongoing projects such as Genome-10K and i5k, the assessment of gene loss events still suffers from technical inertia. Additionally, some studies suggest that real pseudogenes can be mistakenly annotated as functional protein-coding genes, since ORF-disrupting mutations, including frameshifts or in-frame premature stop codons, are often treated as sequencing or assembly artefacts and automatically corrected by whole-genome annotators.
Although automatic and semi-automatic pipelines are currently available for the identification of redundant gene loss, the few systematic approaches capable of inferring episodes of non-redundant gene loss present severe limitations, such as requiring whole genomes or exhaustive manual curation, which is impractical when dealing with the hundreds of genomes available today.
Here we introduce PseudoChecker, the first integrated online platform for gene inactivation inference. This software aims to facilitate and promote the study of gene loss as a driver of evolutionary change, providing an easy-to-use, systematic, highly accurate, and automated approach. Our comparative genomics-based method consists of an online computational pipeline able to infer the coding status of a given eukaryotic nuclear protein-coding gene in single or multiple species of interest by taking advantage of existing genomic data.
Using minimal user input and a set of established parameters, PseudoChecker is capable of:
(i) identifying gene loss events automatically, remotely and in a relatively short amount of time, highlighting the mutational evidence for an unlimited set of target species with available genomic data;
(ii) unveiling ancestral gene loss events by accurately displaying conserved gene inactivating mutations across closely related taxa within a given analysis;
(iii) measuring the erosion level of a candidate gene in any target species by assigning an index of pseudogenisation, the PseudoIndex;
(iv) including external functional gene datasets into the analyses;
(v) exporting the produced data throughout the analysis, useful for performing downstream complementary tasks including phylogenetic reconstructions and selection analyses.
PseudoChecker's rationale is based on inferring the coding status of a given candidate gene in a target species using, as a reference, an orthologous coding sequence. The platform takes into account the coding sequence conservation across related species, requiring previous knowledge of the phylogenetic context. Gene annotation is followed by the screening of gene-eroding features. This general-purpose bioinformatics tool was designed to be easily applied in two different situations: (i) de novo candidate gene annotation, for instance for unannotated genomes; (ii) re-annotation of candidate genes from previous automatic genome annotations, to verify those annotations and identify pseudogenes erroneously annotated as functional protein-coding genes.
To run a PseudoChecker analysis, the user only needs to provide the following inputs:
(i) a single reference nucleotide coding sequence (CDS) and the respective exon nucleotide sequence(s), both annotated and retrieved from a given reference species (FASTA format) (if distinct gene isoforms exist, i.e. splice variants, the user must select a single reference sequence);
(ii) for each target species, the corresponding genomic sequence, against which the reference coding exons will be mapped to predict the gene CDS in the target species (FASTA format). The user is responsible for ensuring that each inserted sequence is orthologous to the in-study gene. Optionally, the user may include complete functional nucleotide coding sequences in a given analysis, also in FASTA format. These are referred to as the predetermined coding sequences and are incorporated into the second component of PseudoChecker's pipeline (see below) and, consequently, into the final output.
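As an illustration of how the inputs listed above might be checked before submission, the following is a minimal Python sketch. It is not part of PseudoChecker: the function names `parse_fasta` and `exons_match_cds` are hypothetical, and the check itself (that the reference exons, concatenated from 5' to 3', reproduce the reference CDS) is an assumption about what a consistent input looks like rather than a documented requirement of the platform.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}.

    Hypothetical helper for illustration; headers are kept whole
    (without the leading '>') and sequences are upper-cased.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def exons_match_cds(cds, exons):
    """Return True if the exons, joined in 5'-to-3' order, equal the CDS.

    Assumed sanity check, not a documented PseudoChecker validation step.
    """
    return "".join(exons) == cds


if __name__ == "__main__":
    # Toy reference input: one CDS split into two coding exons.
    fasta = ">ref_cds\nATGGCC\nAAATGA\n>exon1\nATGGCC\n>exon2\nAAATGA\n"
    recs = parse_fasta(fasta)
    print(exons_match_cds(recs["ref_cds"], [recs["exon1"], recs["exon2"]]))
```

A check of this kind only confirms that the reference CDS and exon set are mutually consistent; it says nothing about orthology, which remains the user's responsibility.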
Once the input data are correctly assigned and the parameters governing the three components of the integrated pipeline are selected, the pipeline is executed as follows:
(i) Coding sequence prediction: for each target species, PseudoChecker annotates the orthologous exons and, consequently, predicts the sequence of the in-study gene. This is done by performing a progressive, deterministic pairwise alignment of each reference coding exon, from the 5' to the 3' end of the gene, against the corresponding input genomic sequence of each target species. Each alignment is computed with a semi-global variant of the classical Needleman-Wunsch global alignment algorithm (1970).
(ii) Alignment by MACSE: once the first step is concluded, PseudoChecker runs MACSE, a standalone multiple sequence alignment program. Here, a deterministic (pairwise or multiple) alignment between predicted sequence(s) (coding or pseudogenised), predetermined coding sequences (optional), and the reference coding sequence (functional) is produced. The alignment takes into account the underlying amino acid translation of each sequence and the possible presence of frameshifts and premature stop codons, while preserving the underlying codon structure. Algorithmically speaking, MACSE's solution is an improved version of the Needleman-Wunsch algorithm that, when computing each optimal pairwise alignment, adds alignment costs associated with frameshift mutations and stop codons.
(iii) Report of the mutational evidence and PseudoIndex computation: for each target species, and in agreement with the alignment produced by MACSE, existing cross-species conserved and non-conserved deleterious mutations, including frameshift mutations and in-frame premature stop codons, relative to the reference CDS, are identified. Based on the first component of the pipeline, splice site disrupting mutations (any deviation from the consensus GT/GC-AG splice site pairs), full gene loss or exon loss are also revealed. Finally, considering the presence/absence of full or partial target gene sequences and the degree of mutational evidence, a pseudogenisation score, the PseudoIndex, is assigned to the corresponding sequence of each target species.
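The three classes of ORF-disrupting evidence described in step (iii) can be illustrated with a minimal Python sketch. This is not PseudoChecker's implementation: the function names are hypothetical, the input is assumed to be a plain pairwise alignment in which gaps are written as `-` (MACSE's own output conventions are not modelled), and every gap run, including terminal ones, is treated as an indel. The PseudoIndex itself is not computed here, since its scoring scheme is not given in this section.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 0-based codon indices of in-frame stops before the final codon."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3].upper() for i in range(0, usable, 3)]
    return [k for k, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]


def frameshift_indels(ref_aln, tgt_aln):
    """Return sorted (start, length, kind) for gap runs not divisible by 3.

    A gap run in the target is a deletion relative to the reference;
    a gap run in the reference is an insertion in the target.
    """
    events = []
    for kind, seq in (("deletion", tgt_aln), ("insertion", ref_aln)):
        pos = 0
        while pos < len(seq):
            if seq[pos] != "-":
                pos += 1
                continue
            end = pos
            while end < len(seq) and seq[end] == "-":
                end += 1
            if (end - pos) % 3 != 0:
                events.append((pos, end - pos, kind))
            pos = end
    return sorted(events)


def splice_sites_intact(intron):
    """Return True if the intron has a GT or GC donor and an AG acceptor."""
    donor, acceptor = intron[:2].upper(), intron[-2:].upper()
    return donor in {"GT", "GC"} and acceptor == "AG"


if __name__ == "__main__":
    print(premature_stops("ATGTAAGGCTGA"))
    print(frameshift_indels("ATGGCCAAA", "ATG-CCAAA"))
    print(splice_sites_intact("GTAAGTTTTCAG"))
```

Keeping the three checks separate mirrors the text: stop codons and frameshifts are read from the codon-aware alignment, whereas splice sites are read from the genomic sequence flanking each predicted exon.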
Detailed information regarding input data submission, parameter selection and interpretation of results is available at the PseudoChecker 'Instructions' page.
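To make the exon-mapping idea of step (i) concrete, the following Python sketch aligns a single reference exon against a longer genomic sequence with a semi-global variant of Needleman-Wunsch, in which genomic sequence flanking the exon is not penalised. It is a simplified illustration under stated assumptions, not PseudoChecker's code: the function name `semiglobal_align`, the linear gap penalty and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are all chosen here for the example and are not the platform's parameters.

```python
def semiglobal_align(exon, genome, match=1, mismatch=-1, gap=-2):
    """Align the whole exon to the best-matching stretch of the genome.

    Returns (score, start, end, aligned_exon, aligned_genome), where
    genome[start:end] is the region covered by the alignment.
    Scoring values are illustrative assumptions.
    """
    n, m = len(exon), len(genome)
    # score[i][j]: best score for exon[:i] aligned so that it ends at genome[:j].
    # Row 0 stays at zero, so skipping a genomic prefix is free.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap

    def sub(i, j):
        return match if exon[i - 1] == genome[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two bases
                score[i - 1][j] + gap,            # exon base against a gap
                score[i][j - 1] + gap,            # genome base against a gap
            )

    # Free trailing genomic sequence: take the best cell in the last row.
    end = max(range(m + 1), key=lambda j: score[n][j])
    best = score[n][end]

    # Trace back until the whole exon has been consumed.
    i, j = n, end
    aligned_exon, aligned_genome = [], []
    while i > 0:
        if j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            aligned_exon.append(exon[i - 1])
            aligned_genome.append(genome[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_exon.append(exon[i - 1])
            aligned_genome.append("-")
            i -= 1
        else:
            aligned_exon.append("-")
            aligned_genome.append(genome[j - 1])
            j -= 1
    return (best, j, end,
            "".join(reversed(aligned_exon)),
            "".join(reversed(aligned_genome)))


if __name__ == "__main__":
    print(semiglobal_align("ATGGCC", "TTTTATGGCCTTTT"))
```

In a progressive scheme such as the one described in step (i), the `end` coordinate returned for one exon could serve as the lower bound of the search region for the next exon; that chaining is omitted here for brevity.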