Introduction: 5' splice site GT>GC or +2T>C variants have been frequently reported to
cause human genetic disease and are routinely scored as pathogenic splicing mutations. However, we
have recently demonstrated that such variants in human disease genes may not invariably be pathogenic.
Moreover, we found that no splicing prediction tools appear to be capable of reliably distinguishing
those +2T>C variants that generate wild-type transcripts from those that do not.
Methodology: Herein, we evaluated the performance of a novel deep learning-based tool, SpliceAI, in
the context of three datasets of +2T>C variants, all of which had been characterized functionally in
terms of their impact on pre-mRNA splicing. The first two datasets were our recently described “in
vivo” dataset of 45 known disease-causing +2T>C variants and the “in vitro” dataset of 103 +2T>C
substitutions subjected to a full-length gene splicing assay. The third dataset comprised 12 BRCA1
+2T>C variants that were recently analyzed by saturation genome editing.
Results: Comparison of the SpliceAI-predicted and experimentally obtained functional impact assessments
of these variants (and smaller datasets of +2T>A and +2T>G variants) revealed that although
SpliceAI performed rather better than other prediction tools, it was still far from perfect. A key
issue is that the impact of those +2T>C (and +2T>A) variants that generated wild-type transcripts
represents a quantitative change that can range from barely detectable to almost full expression of
wild-type transcripts, with wild-type transcripts often co-existing with aberrantly spliced transcripts.
Conclusion: Our findings highlight the challenges that we still face in attempting to accurately identify