INTRODUCTION TO PSEUDOINDEX

Accurately measuring the level of pseudogenisation of a given gene poses several challenges. For instance, evolutionary changes in the exon-intron structure of conserved genes, including splice site shifts, lineage-specific exons and precise intron deletions, can all mimic inactivating mutations in genes that might, in fact, be functional. Additionally, even real mutations might not indicate gene loss: for example, when a frameshift indel is compensated downstream by an additional frameshift that restores the original reading frame, or when frameshifts and/or premature stop codons arise close to the region encoding the C-terminus of the resulting protein, which is under weaker evolutionary constraint.

Considering all these factors, manually screening a predicted DNA sequence can be a laborious and puzzling task. To overcome these challenges, we have built into Pseudo_Checker the PseudoIndex, a user-assistance metric intended to measure, at a glance, the erosion state of a given gene in a given species by inspecting the presence and magnitude of its mutational evidence.

Explicitly, for each target species, the PseudoIndex takes into account three components: (i) the absent-exons component, which measures the percentage of gene content that is present in the reference sequence but does not align with the target genomic sequence; (ii) the shifted codons component, which measures the percentage of codons that are read out of the reference reading frame; and (iii) the truncated sequence component, which measures the percentage of the target sequence that is not translated into protein due to the presence of a premature stop codon.

The PseudoIndex assigned to each target species varies on a discrete scale from 0 to 5, with 0 suggesting full functionality of the candidate gene and 5 indicating its full inactivation. In detail, a PseudoIndex of 0 indicates that the corresponding species presents an intact, or almost intact, version of the in-study gene. A PseudoIndex of 1 or 2 indicates that, although the predicted gene shows some mutational evidence, this likely does not affect the functionality of the resulting protein. A PseudoIndex of 3 indicates a doubtful case, for which the coding status of the corresponding gene should be manually inspected. Finally, a PseudoIndex of 4 or 5 suggests the loss of the in-study gene.

The final PseudoIndex value attributed to each target species corresponds to the highest of the computed sub-PseudoIndex values. Importantly, the user has the ultimate say in interpreting the level of mutational evidence and, therefore, in classifying the in-study gene in a given species as coding or pseudogenised. The PseudoIndex should thus not be seen as the ultimate verdict on coding status, but rather as a user-friendly metric that combines multiple factors to assist that interpretation.
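
The combination rule and the reading of the scale can be sketched in a few lines of Python. This is only an illustration of the description above, not Pseudo_Checker's own code; the function names and the wording of the returned labels are hypothetical.

```python
def final_pseudoindex(absent_exons: int, shifted_codons: int, truncated_sequence: int) -> int:
    """Return the highest of the three sub-PseudoIndex values (each expected in 0..5)."""
    subs = (absent_exons, shifted_codons, truncated_sequence)
    for value in subs:
        if not 0 <= value <= 5:
            raise ValueError(f"sub-PseudoIndex out of range: {value}")
    return max(subs)


def interpret(pseudoindex: int) -> str:
    """Paraphrase the meaning of a PseudoIndex value (labels are illustrative)."""
    if pseudoindex == 0:
        return "intact or almost intact"
    if pseudoindex in (1, 2):
        return "some mutational evidence, likely still functional"
    if pseudoindex == 3:
        return "doubtful, inspect manually"
    return "likely lost"  # 4 or 5


index = final_pseudoindex(absent_exons=0, shifted_codons=3, truncated_sequence=1)
print(index, "->", interpret(index))
```

Taking the maximum means that a single strongly affected component is enough to flag a gene, even when the other two components look clean.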

(I) ABSENT-EXONS COMPONENT

Within the absent-exons component of PseudoIndex, Pseudo_Checker measures the harmful impact of the absence of single or multiple coding exons of the in-study gene in each target species. For this, Pseudo_Checker starts by computing the weight of each coding exon in the reference gene, dividing its nucleotide length by the length of the entire reference coding sequence. The percentage of absent gene content for a target species is then the sum of these weights over all absent exons, multiplied by 100. Different values yield different sub-PseudoIndex values.

Absent gene content (%)	Sub-PseudoIndex
≤ 10	0
> 10 and ≤ 15	1
> 15 and ≤ 20	2
> 20 and ≤ 25	3
> 25 and ≤ 30	4
> 30	5
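
As a worked illustration of this component, the following Python sketch computes the percentage of absent gene content from reference exon lengths and maps it to a sub-PseudoIndex using the thresholds in the table above. The function names, the input format and the example exon lengths are assumptions made for illustration; they are not taken from Pseudo_Checker itself.

```python
def bin_percentage(pct: float) -> int:
    """Map a percentage to a value in 0..5 using the table's upper bounds."""
    for value, upper in enumerate((10, 15, 20, 25, 30)):
        if pct <= upper:
            return value
    return 5  # anything above 30 %


def absent_gene_content(exon_lengths: list, absent: set) -> float:
    """Percentage of the reference coding sequence covered by absent exons.

    exon_lengths: nucleotide length of each reference coding exon.
    absent: zero-based positions of the exons missing in the target species.
    Summing the lengths first and dividing once is equivalent to summing
    the per-exon weights.
    """
    total = sum(exon_lengths)
    return 100 * sum(exon_lengths[i] for i in absent) / total


# Hypothetical gene with four coding exons; the second one is absent.
lengths = [120, 90, 150, 240]
pct = absent_gene_content(lengths, {1})
print(pct, bin_percentage(pct))  # 15.0 1
```

Note that a value lying exactly on a boundary (here 15 %) falls into the lower class, because the upper bounds in the table are inclusive.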

(II) SHIFTED CODONS COMPONENT

For the shifted codons component of PseudoIndex, Pseudo_Checker measures the impact that frameshift mutations have on the predicted sequence of the in-study gene for a given target species. Here, our approach considers isolated frameshifts (a single frameshift occurring within a given sequence), multiple frameshifts that do not compensate each other, compensatory frameshifts, and reading frame disruptions caused by the absence of single or multiple exons. To this aim, the shifted codons component considers two factors. First, it calculates the percentage of shifted codons that a given sequence displays. In detail, Pseudo_Checker counts the total number of codons read in a shifted reading frame from the 5’ end to the 3’ end of the sequence, divides this number by the total number of codons within the sequence, and multiplies the result by 100. Gapped codons (read as ’---’) are excluded from both counts. This value is only calculated within the predicted coding sequence, starting at the first observed in-frame start codon and ending at the last available codon. Different values of this percentage result in different preliminary sub-PseudoIndex values.

Shifted codons (%)	Preliminary sub-PseudoIndex
≤ 10	0
> 10 and ≤ 15	1
> 15 and ≤ 20	2
> 20 and ≤ 25	3
> 25 and ≤ 30	4
> 30	5
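
The percentage calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the sequence is given as a list of codons with a parallel list of flags saying whether each codon is read out of the reference frame, and the start codon is taken to be ATG. Neither the input format nor the function names come from Pseudo_Checker.

```python
def bin_percentage(pct: float) -> int:
    """Map a percentage to a value in 0..5 using the table's upper bounds."""
    for value, upper in enumerate((10, 15, 20, 25, 30)):
        if pct <= upper:
            return value
    return 5


def shifted_codon_percentage(codons, shifted):
    """Percentage of shifted codons from the first in-frame ATG to the last codon.

    codons: codon strings, with '---' marking a gapped codon.
    shifted: one boolean per codon, True when it is read out of frame.
    Returns None when no in-frame start codon is found.
    """
    start = next(
        (i for i, (c, s) in enumerate(zip(codons, shifted)) if c == "ATG" and not s),
        None,
    )
    if start is None:
        return None
    # Gapped codons are excluded from both the numerator and the denominator.
    flags = [s for c, s in zip(codons[start:], shifted[start:]) if c != "---"]
    return 100 * sum(flags) / len(flags)


# Hypothetical sequence: 10 non-gapped codons, the last two read out of frame.
codons = ["ATG", "GCC", "---", "AAA", "CTG", "GGT", "TTC", "GAA", "CAG", "ACT", "TAA"]
shifted = [False] * 9 + [True, True]
pct = shifted_codon_percentage(codons, shifted)
print(pct, bin_percentage(pct))  # 20.0 2
```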

This rationale assumes that frameshift mutations arising before the first observable start codon do not correspond to real mutational events. Most commonly, the first codon of a predicted coding sequence is a start codon; however, when it is not (for instance, because of start codon shifts during evolution or alignment-related problems), alternative start codons downstream of the first codon should not be dismissed. In such scenarios, frameshifts that occur upstream of the first observable start codon should be penalised less than frameshifts arising downstream of it.

Thus, if at least one frameshift mutation arises upstream of the first observable start codon in a given sequence, a minimum value of 3 is attributed to the sub-PseudoIndex for this component. Consequently, the final sub-PseudoIndex for this component is the higher of 3 and the preliminary sub-PseudoIndex derived from the percentage of shifted codons.

In contrast, in the absence of frameshift mutations upstream of the first in-frame start codon, the sub-PseudoIndex depends solely on the preliminary sub-PseudoIndex. If no start codon is detected within the sequence of a given species, which hinders the assessment of frameshifts, a value of 3 is attributed to the sub-PseudoIndex for this component.
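
The three cases above reduce to a small decision rule, sketched here in Python. The function and parameter names are hypothetical, and the preliminary value is assumed to have been derived already from the percentage of shifted codons.

```python
from typing import Optional


def shifted_codons_subindex(
    preliminary: Optional[int],
    frameshift_upstream_of_start: bool,
    start_codon_found: bool = True,
) -> int:
    """Combine the preliminary value with the start-codon rules.

    preliminary: value in 0..5 from the shifted-codon percentage
                 (may be None when no start codon was found).
    """
    if not start_codon_found:
        return 3  # frameshifts cannot be assessed without a start codon
    if frameshift_upstream_of_start:
        return max(3, preliminary)  # floor of 3, never lowers a higher value
    return preliminary


print(shifted_codons_subindex(1, frameshift_upstream_of_start=True))   # 3
print(shifted_codons_subindex(5, frameshift_upstream_of_start=True))   # 5
print(shifted_codons_subindex(2, frameshift_upstream_of_start=False))  # 2
```

The floor only ever raises the value: a sequence whose preliminary sub-PseudoIndex is already 4 or 5 keeps that value.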

(III) TRUNCATED SEQUENCE COMPONENT

Lastly, the truncated sequence component of PseudoIndex considers the percentage of truncated sequence that each target gene sequence displays. This is defined as the number of non-gapped codons that are not translated into protein because they follow a premature stop codon, divided by the number of codons within the sequence and multiplied by 100. The premature stop codon may be either in-frame or out-of-frame; in the latter case it is read as a real stop codon as a consequence of an upstream disruption of the reading frame.

As with the previous component of PseudoIndex, this percentage is only calculated between the first observable in-frame start codon and the last available codon. Different values of this metric also yield different preliminary sub-PseudoIndex values.

Truncated sequence component (%)	Preliminary sub-PseudoIndex
≤ 10	0
> 10 and ≤ 15	1
> 15 and ≤ 20	2
> 20 and ≤ 25	3
> 25 and ≤ 30	4
> 30	5

Nevertheless, if a premature stop codon arises prior to the first observed start codon of the in-analysis sequence, whether in-frame or out-of-frame (the latter read as an effective stop codon due to an upstream disruption of the original reading frame), the minimum sub-PseudoIndex assigned to the corresponding species for this component is 3. Consequently, the final sub-PseudoIndex is the maximum of 3 and the preliminary sub-PseudoIndex derived from the percentage of truncated sequence, as explained above.

In contrast, if a sequence does not display any premature stop codon upstream of the first observed start codon, the sub-PseudoIndex depends only on the percentage of truncated sequence. Finally, if no start codon is found, the sub-PseudoIndex assigned for this component is 3.
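
To make the truncated sequence component concrete, here is a simplified Python sketch. It rests on several assumptions that the text does not specify: the input is the list of codons as they are actually read (so an out-of-frame stop already appears as a stop codon), the start codon is ATG, the stop codon itself is not counted as truncated, gapped codons are left out of the denominator, and a stop at the very last position is treated as the regular stop. All names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def bin_percentage(pct: float) -> int:
    """Map a percentage to a value in 0..5 using the table's upper bounds."""
    for value, upper in enumerate((10, 15, 20, 25, 30)):
        if pct <= upper:
            return value
    return 5


def truncated_sequence_subindex(codons) -> int:
    """Sub-PseudoIndex of the truncated sequence component (simplified)."""
    codons = [c for c in codons if c != "---"]  # ignore gapped codons
    if "ATG" not in codons:
        return 3  # no start codon found
    start = codons.index("ATG")
    window = codons[start:]  # first start codon to last available codon
    # First premature stop inside the window (the final codon is excluded).
    stop = next((i for i, c in enumerate(window[:-1]) if c in STOP_CODONS), None)
    truncated = 0 if stop is None else len(window) - stop - 1
    preliminary = bin_percentage(100 * truncated / len(window))
    # A stop codon upstream of the start codon imposes a floor of 3.
    if any(c in STOP_CODONS for c in codons[:start]):
        return max(3, preliminary)
    return preliminary


early_stop = ["ATG", "GCC", "AAA", "TGA", "CTG", "GGT", "TTC", "GAA", "CAG", "TAA"]
print(truncated_sequence_subindex(early_stop))  # 6 of 10 codons lost -> 5
```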

ᴪPSEUDOCHECKER

Integrated online platform for gene inactivation inference
