(I) Submitting a Job

To run a Pseudo_Checker analysis, the biological input data must be inserted and the parameters for the first two steps of the pipeline must be selected on the main page of the software. Once the input data and parameters are set, Pseudo_Checker runs the submitted job internally. When the job is completed, the software automatically redirects the user to the corresponding results page.

BIOLOGICAL DATA INPUT

1. Coding exons and coding sequence (CDS) of the reference species' candidate gene: the nucleotide coding sequence (CDS) of the candidate gene and its exon nucleotide sequence(s), both annotated in and retrieved from a given reference species (FASTA format). Exons should be inserted in order from the 5' to the 3' end of the candidate gene, and the CDS should be placed after the 3'-most coding exon, identified by the header '>CDS'. The header of each sequence must occupy a single line.

2. Genomic sequence per target species: the genomic sequence orthologous to the in-study gene for each target species (FASTA format). Although not mandatory, results are easier to interpret when the header of each sequence is the name of the target species itself (e.g. >Balaenoptera_musculus, >Canis_lupus_familiaris). The header of each sequence must occupy a single line. As target sequences, Pseudo_Checker supports partial or full genomic contigs, scaffolds and genomic sequencing reads. Each inserted sequence must correspond to the same DNA strand as the reference gene.
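The layout of the reference input described above can be checked before submission. The following is a minimal sketch, not part of Pseudo_Checker itself: the function names and the exon headers (e.g. '>exon1') are hypothetical, and it assumes that UTR-free coding exons, concatenated in 5' to 3' order, reproduce the CDS exactly.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def check_reference_input(text):
    """Return a list of problems found in the reference exon/CDS input.

    Assumption (not stated by the tool): concatenated UTR-free exons
    must equal the sequence under the '>CDS' header.
    """
    records = parse_fasta(text)
    if len(records) < 2:
        return ["expected at least one exon record followed by a '>CDS' record"]
    *exons, (last_header, cds) = records
    if last_header != "CDS":
        return ["the last record must carry the header '>CDS'"]
    joined = "".join(seq for _, seq in exons)
    if joined == cds:
        return []
    if cds in joined:
        return ["exons contain extra flanking sequence (possible UTR)"]
    return ["concatenated exons do not match the CDS"]
```

An empty list means the input follows the expected layout; a 'possible UTR' message indicates that the UTR trimming option may be needed.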

BASIC OPTIONS - CODING SEQUENCE PREDICTION PARAMETERS

1. Similarity scoring scheme: scoring function used for computing each exon alignment at the coding sequence prediction step of Pseudo_Checker's pipeline.

a) Closely related species (Optimised Similarity Scoring Scheme): to be used when the reference species and the target species are closely related;

b) Slightly divergent species (Optimised Similarity Scoring Scheme): to be used when the reference species and the target species are slightly more evolutionarily divergent;

c) Best-fit (Similarity Scoring Scheme; default and recommended): to be used when the evolutionary divergence between the reference and target species, and/or the conservation state of the candidate gene across species, is unclear. This is a dynamic (and slightly more time-consuming) similarity scoring scheme intended to make Pseudo_Checker's exon alignments more robust to the evolutionary divergence of the in-study species and/or gene. In the worst-case scenario, it tests 40 different combinations of match/mismatch scoring schemes, with the ultimate goal of finding alignments that yield predicted intact exons, that is, exons flanked by conserved splice sites (GT/GC-AG splice site pairs) and free of reading frame-disrupting indels.

Note that, regardless of the chosen similarity scoring scheme, the prediction quality of the gene's coding sequence in each target species increases with the phylogenetic proximity between the reference and target species.
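The notion of a predicted intact exon used by the Best-fit scheme can be illustrated as follows. This is a simplified sketch under stated assumptions, not Pseudo_Checker's actual implementation: the function names are hypothetical, the flanking dinucleotides are passed in explicitly, and any gap run whose length is not a multiple of three is treated as frame-disrupting.

```python
DONOR_SITES = {"GT", "GC"}   # 5' splice site dinucleotides accepted here
ACCEPTOR_SITE = "AG"         # 3' splice site dinucleotide


def has_frame_disrupting_indel(ref_aln, tgt_aln):
    """True if any gap run in either aligned sequence has a length not divisible by 3."""
    for seq in (ref_aln, tgt_aln):
        run = 0
        for ch in seq + "N":  # sentinel base flushes a trailing gap run
            if ch == "-":
                run += 1
            else:
                if run % 3:
                    return True
                run = 0
    return False


def is_intact_exon(ref_aln, tgt_aln, acceptor=None, donor=None):
    """Check splice sites and reading frame for one aligned exon.

    acceptor / donor: dinucleotides flanking the exon in the target
    genomic sequence; pass None where no splice site is expected
    (upstream of the first exon, downstream of the last one).
    """
    if acceptor is not None and acceptor.upper() != ACCEPTOR_SITE:
        return False
    if donor is not None and donor.upper() not in DONOR_SITES:
        return False
    return not has_frame_disrupting_indel(ref_aln, tgt_aln)
```

Under this reading, a dynamic scheme would re-align an exon with successive match/mismatch scores and stop as soon as `is_intact_exon` returns True.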

2. UTR trimming: enables the automatic trimming of any untranslated regions (UTRs) lying within the reference species' 5' and/or 3' coding exon of the candidate gene, since their absence is mandatory for an accurate prediction of the gene's CDS in each target species. Analyses initiated with coding exons containing at least one UTR fragment will be interrupted.

3. Extension of the reference species' 3' coding exon alignment to find a missing downstream final stop codon in the original alignment: useful if the C-terminus of the protein encoded by the in-study gene is slightly divergent in size across the tested lineages. The maximum extension range allowed by Pseudo_Checker corresponds to 15 nucleotides.
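The extension described above can be pictured as an in-frame scan of the target genomic sequence immediately downstream of the aligned final exon. The sketch below is illustrative only: the function name and arguments are hypothetical, and it assumes that the original alignment ends on a codon boundary.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_EXTENSION = 15  # nucleotides, the maximum range stated above


def find_downstream_stop(genomic, aln_end, max_extension=MAX_EXTENSION):
    """Scan in-frame for a stop codon just downstream of the last exon alignment.

    genomic: target genomic sequence (same strand as the reference gene)
    aln_end: 0-based index of the first nucleotide after the aligned exon,
             assumed here to fall on a codon boundary
    Returns the index of the first base of the stop codon, or None.
    """
    limit = min(aln_end + max_extension, len(genomic))
    for i in range(aln_end, limit - 2, 3):
        if genomic[i:i + 3].upper() in STOP_CODONS:
            return i
    return None
```

With the default range, at most five downstream codons are inspected; a stop codon that is out of frame or further away is not reported.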

4. Minimum exon alignment identity (%): the minimum percentage of alignment identity required for an exon alignment to be considered valid. This parameter determines whether a given exon alignment is treated as a real, biologically meaningful alignment rather than a non-specific one. If the alignment identity is below this value, the exon is considered lost in the corresponding target species and is therefore excluded from the final predicted sequence. Predicted coding sequences lacking one or more exons are considered partial coding sequences. In contrast, sequences that include all predicted orthologous exons are considered full coding sequences. The value of this parameter should be adjusted according to the evolutionary relationship between the reference and target species. In particular, if the reference species is highly divergent from the target species, the user should choose a lower value.
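The effect of the minimum identity threshold on the predicted coding sequence can be sketched as follows. This is an illustration under assumptions rather than the tool's own code: the function names are hypothetical, and identity is computed here as matching columns over the total number of alignment columns, which may differ from the definition Pseudo_Checker uses.

```python
def percent_identity(ref_aln, tgt_aln):
    """Percentage of alignment columns with identical, non-gap characters."""
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned sequences must have the same length")
    if not ref_aln:
        return 0.0
    matches = sum(1 for r, t in zip(ref_aln, tgt_aln) if r == t and r != "-")
    return 100.0 * matches / len(ref_aln)


def classify_cds(exon_alignments, min_identity):
    """Classify a predicted coding sequence from its per-exon alignments.

    exon_alignments: list of (ref_aln, tgt_aln) pairs, one per reference
                     coding exon, in 5' to 3' order
    Returns (status, kept), where status is 'full', 'partial' or 'absent'
    and kept lists the indices of exons meeting the threshold.
    """
    kept = [i for i, (ref, tgt) in enumerate(exon_alignments)
            if percent_identity(ref, tgt) >= min_identity]
    if not kept:
        status = "absent"
    elif len(kept) == len(exon_alignments):
        status = "full"
    else:
        status = "partial"
    return status, kept
```

Lowering `min_identity` for divergent species keeps more exons in the prediction, at the cost of admitting less specific alignments.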

ADVANCED OPTIONS

1. Predetermined coding sequences: optional external functional nucleotide coding sequences (FASTA format). Coding sequences entered into this textbox are incorporated into the second component of Pseudo_Checker's pipeline, the Alignment by MACSE. Although not mandatory, results are easier to interpret when the header of each submitted coding sequence matches the name of the corresponding species (e.g. >Pan_troglodytes, >Sus_scrofa). The header of each sequence must occupy a single line.

2. MACSE alignment costs: alignment costs composing the similarity scoring scheme employed by MACSE when computing its alignment within the second component of Pseudo_Checker's pipeline:

a) Reliable sequences: MACSE alignment costs associated with reliable sequences. These include the reference species' CDS, the predetermined coding sequences, and the coding sequences predicted as functional during the first step of Pseudo_Checker's pipeline, that is, sequences that do not exhibit frameshift gaps and/or in-frame premature stop codons within any exon alignment;

b) Less reliable sequences: MACSE alignment costs associated with less reliable sequences. These include the coding sequences predicted during the first step of Pseudo_Checker's pipeline that display frameshift gaps and/or in-frame premature stop codons within at least one exon alignment;

c) Common costs: MACSE alignment costs associated with both types of sequences (reliable and less reliable).

The default MACSE alignment costs provided by Pseudo_Checker are those of the default similarity scoring scheme provided by the MACSE authors, which is suitable for most cases. Still, each of these parameters can be user-defined. As the amino acid substitution matrix, Pseudo_Checker uses BLOSUM62, the default MACSE matrix, for which the default alignment costs are optimised. More detailed information regarding MACSE alignment costs can be found in the MACSE documentation.

However, not every alignment produced by MACSE is suitable for a Pseudo_Checker analysis. Because the alignment depends on the MACSE similarity scoring scheme defined on Pseudo_Checker's home page, choices that are inadequate for a given sequence dataset might introduce frameshift mutations and/or premature stop codons into the reference species' CDS and/or the (optional) predetermined coding sequences. In our experience this is unlikely to occur, but since these sequences are assumed to be functional, Pseudo_Checker automatically interrupts the analysis in such situations.

(II) Interpreting the results of a Pseudo_Checker analysis

Pseudo_Checker displays the results of a given analysis in an interactive and intuitive web interface, divided into sections, providing different levels of information.

1. Alignment by MACSE: MACSE produces an alignment containing information at both the nucleotide and amino acid levels for each aligned sequence, which Pseudo_Checker displays as the top and bottom sequence, respectively. For convenient visualisation, the reference species' CDS is always shown at the top, and the alignment is colour-graded according to the resulting codon structure, with adjacent blocks of three nucleotides shown on different background colours. The following conventions apply:

a) Frame-preserving gaps: at the nucleotide level, represented by a codon '- - -' on a white background; at the amino acid level, no special representation is applied;

b) Frameshift mutations: at the nucleotide level, represented by a partial codon with one or two exclamation marks (!), each highlighted in orange; at the amino acid level, no representation is used;

c) In-frame stop codons: at the nucleotide level, represented in red font on a white background; at the amino acid level, represented by an asterisk (*), also in red font;

d) Amino acid differences: an amino acid that differs from the reference sequence at the same alignment position is shown on a grey background.

MACSE places exclamation marks within partial codons that derive from frameshift mutations, in order to preserve the structure of the reading frame. These partial codons may appear in different forms, '!!N', '!NN', 'N!!', 'NN!', 'N!N' or '!N!', with 'N' corresponding to any of the 4 DNA nucleotide bases. A partial codon annotation may admit more than one interpretation. For instance, a partial codon represented by '!!N', '!N!' or 'N!!' can represent either the deletion of two nucleotides or the insertion of a single nucleotide. However, since frameshift mutations are inferred with respect to a functional reference sequence, each partial codon must be interpreted by taking into account the reference species' codon observed at the same alignment site.

First, if a given reference codon is represented by 'NNN', where N represents any of the 4 DNA nucleotide bases, and the codon observed in the target sequence at the same alignment site is a partial codon containing at least one '!', this should be understood as resulting from frameshift deletions, wherein each exclamation mark represents a single nucleotide deletion. In contrast, if a given reference codon is represented by a set of three gaps '- - -', and the target codon aligning at the same site is represented by '!!N', '!NN', 'N!!', 'NN!', 'N!N' or '!N!', this should be interpreted as an insertion of the nucleotide base(s) shown as 'N'.
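The two interpretation rules above can be written as a small function. This is a minimal sketch with a hypothetical function name; it takes one reference codon and the target codon aligned at the same site, and accepts the gap codon written either as '---' or '- - -'.

```python
def interpret_partial_codon(ref_codon, tgt_codon):
    """Interpret a MACSE partial codon relative to the reference codon.

    Returns None if the target codon carries no '!' marks,
    ('insertion', bases) if the reference codon is a full gap, or
    ('deletion', n) where n is the number of deleted nucleotides.
    """
    ref = ref_codon.replace(" ", "")
    tgt = tgt_codon.replace(" ", "")
    if "!" not in tgt:
        return None
    if ref == "---":
        # Reference has no codon here: the visible bases were inserted.
        inserted = "".join(base for base in tgt if base != "!")
        return ("insertion", inserted)
    # Reference has a full codon: each '!' marks one deleted nucleotide.
    return ("deletion", tgt.count("!"))
```

For example, `interpret_partial_codon("ATG", "A!G")` reports a single nucleotide deletion, whereas `interpret_partial_codon("- - -", "!!C")` reports the insertion of 'C'.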

[Figure: 'Alignment by MACSE' summary panel]

Four levels of information related to the alignment are also presented: alignment length, number of aligned sequences, average pairwise amino acid alignment identity relative to the reference species, and the number of amino acid sites that are identical across all aligned sequences. When predetermined coding sequences are supplied, the number included in the alignment is also shown. In addition, partial coding sequences are listed, as are absent sequences (sequences that are not included in the alignment because their respective species do not present any exon orthologous to the in-study gene).
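The four summary values can be illustrated on a toy amino acid alignment. The sketch below is an assumption about how such values might be derived, not Pseudo_Checker's exact formulas: the function name is hypothetical, the reference is taken to be the first sequence, and columns containing gaps are never counted as identical.

```python
def alignment_summary(aligned):
    """Summarise an amino acid alignment whose first entry is the reference.

    aligned: list of equal-length aligned protein strings
    Returns a dict with the alignment length, the number of sequences,
    the average identity (%) of non-reference sequences to the reference,
    and the number of columns identical across all sequences.
    """
    ref, others = aligned[0], aligned[1:]
    if any(len(seq) != len(ref) for seq in others):
        raise ValueError("all aligned sequences must have the same length")

    def identity_to_ref(seq):
        same = sum(1 for a, b in zip(ref, seq) if a == b and a != "-")
        return 100.0 * same / len(ref)

    identities = [identity_to_ref(seq) for seq in others]
    average = sum(identities) / len(identities) if identities else 100.0
    identical_sites = sum(
        1 for column in zip(*aligned)
        if len(set(column)) == 1 and column[0] != "-"
    )
    return {
        "length": len(ref),
        "n_sequences": len(aligned),
        "avg_identity_to_ref": average,
        "identical_sites": identical_sites,
    }
```

For the toy alignment `["MKLV", "MKLV", "MK-I"]`, the average identity to the reference is 75.0% and two sites are identical across all sequences.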

2. Detected Mutations and PseudoIndex: under the MACSE alignment, a summary of the detected frameshift mutations and stop codons per target species, corresponding exon, as well as their respective position within the alignment, is presented.

[Figure: 'Detected Mutations per Full Coding Sequence' panel, listing frameshift mutations, in-frame premature stop codons and the PseudoIndex]

Importantly, partial coding sequences are excluded from this feature: detected mutations are reported only for full coding sequences.

As mentioned above, MACSE automatically places exclamation marks (!) at the most appropriate alignment location in order to maintain the original structure of the reading frame. However, when one or more exons are missing, as is the case for partial coding sequences, and their absence disrupts the reading frame of the aligned sequence, it is, in our experience, very likely that exclamation marks will arise next to the exons flanking the missing ones to compensate for the disruption. This is problematic because it may be difficult for the user to distinguish real biological mutations from alignment adjustments that MACSE introduces to preserve the reading frame.

Nonetheless, an additional tool, the PseudoIndex, that measures the erosion level of a candidate gene in each target species, is supplied for species exhibiting either partial or full coding sequences. Detailed information regarding this metric is available at the 'PseudoIndex' page.

3. Additional MACSE alignment metrics: an optional section, shown by clicking on the 'Display' button at the top of the page, that provides additional MACSE alignment metrics for each aligned sequence relative to the reference sequence.

[Figure: Additional MACSE alignment metrics relative to the reference CDS]

4. Coding sequence prediction statistics (only available for multi-exon candidate genes): shown by clicking on the same 'Display' button at the top of the page, this section provides several levels of information regarding the prediction of the coding sequence of the in-study gene for each target species. For each target species, it is also possible to export the individual predicted exons and to visualise the alignments between each of the reference species' coding exons and the corresponding input genomic sequence.

[Figure: Coding sequence prediction statistics]

5. User input parameters: shown by using the same 'Display' button at the top of the page, this section displays the input parameters used to compute the current Pseudo_Checker job.

6. Exporting the produced data: by clicking on the 'Export' button at the top of the page, it is possible to export the alignment produced by MACSE, as well as the predicted coding sequences at both the nucleotide and amino acid levels. Exporting the MACSE-aligned sequences allows for easy downstream phylogenetic and selection analyses with methods based on codon models of sequence evolution. In particular, exporting and visualising the predicted coding sequences at the amino acid level makes it possible to see the direct effect of any frameshifts (recognised within the MACSE alignment at the nucleotide level) on the protein sequence, including aberrant amino acids or out-of-frame premature stop codons.

(III) Checking the results of a previously submitted job

By accessing the 'Submitted Jobs' page and entering the Analysis ID of a previously submitted job, the user is redirected to the corresponding results page. This spares the user from waiting for an analysis to complete and also allows the results of a finished analysis to be consulted later. Results can only be displayed after the job of interest has completed successfully.

(IV) Loading the example test case

By clicking on 'Load example data' and then 'Run analysis' on Pseudo_Checker's main page, the example test case, which infers the coding status of the CCL27 gene in 5 different species, is executed.

The cattle ortholog of the candidate gene is used as the reference. Four cetaceans are tested as target species, and a more distantly related species, Canis lupus familiaris (dog), is also included in the analysis. For each of these species, the corresponding genomic sequence was previously downloaded from NCBI, each of which underlies a previous automatic annotation of the in-study gene. In agreement with a previously published study, Pseudo_Checker confirms the inactivated status of CCL27 in the tested cetacean species, which display both conserved and non-conserved deleterious mutations that disrupt the normal function of the gene in these lineages. Conversely, Pseudo_Checker suggests that the CCL27 gene is functional in dog.

ᴪPSEUDOCHECKER

Integrated online platform for gene inactivation inference
