XB-ART-38694
PLoS Comput Biol January 1, 2008; 4 (8): e1000147.
A general definition and nomenclature for alternative splicing events.
Understanding the molecular mechanisms responsible for the regulation of the transcriptome present in eukaryotic cells is one of the most challenging tasks in the postgenomic era. In this regard, alternative splicing (AS) is a key phenomenon contributing to the production of different mature transcripts from the same primary RNA sequence. As a plethora of different transcript forms is available in databases, a first step to uncover the biology that drives AS is to identify the different types of splicing variation they reflect. In this work, we present a general definition of the AS event along with a notation system that involves the relative positions of the splice sites. This nomenclature univocally and dynamically assigns a specific "AS code" to every possible pattern of splicing variation. On the basis of this definition and the corresponding codes, we have developed a computational tool (AStalavista) that automatically characterizes the complete landscape of AS events in a given transcript annotation of a genome, thus providing a platform to investigate the transcriptome diversity across genes, chromosomes, and species. Our analysis reveals that a substantial part (in human, more than a quarter) of the observed splicing variations is ignored in common classification pipelines. We have used AStalavista to investigate and to compare the AS landscape of different reference annotation sets in human and in other metazoan species and found that proportions of AS events change substantially depending on the annotation protocol, species-specific attributes, and coding constraints acting on the transcripts. The AStalavista system therefore provides a general framework to conduct specific studies investigating the occurrence, impact, and regulation of AS.
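To make the idea of a position-based "AS code" concrete, the following is a minimal Python sketch. It assumes the symbol convention used by AStalavista (`^` for a splice donor, `-` for a splice acceptor, `0` for a variant that uses none of the variable sites), which the abstract itself does not spell out. The function name `as_code`, the input format, and the rule for ordering the two variants are hypothetical choices made for illustration only.

```python
def as_code(variant_a, variant_b):
    """Build an AS-code-like string for a pairwise splicing event.

    Each variant is a list of (position, kind) tuples containing only the
    splice sites that differ between the two transcripts; kind is either
    "donor" or "acceptor". Positions are assumed to be forward-strand
    coordinates, so sorting by position means sorting from 5' to 3'.
    """
    symbols = {"donor": "^", "acceptor": "-"}  # assumed symbol convention

    # Number all variable sites jointly by their relative 5'-to-3' position.
    all_sites = sorted(set(variant_a) | set(variant_b))
    rank = {site: i + 1 for i, site in enumerate(all_sites)}

    def encode(variant):
        if not variant:
            return "0"  # this variant uses none of the variable sites
        return "".join(f"{rank[s]}{symbols[s[1]]}" for s in sorted(variant))

    # Assumed ordering rule: the empty variant first, otherwise the variant
    # that contains the most upstream variable site.
    ordered = sorted(
        [variant_a, variant_b],
        key=lambda v: min((rank[s] for s in v), default=0),
    )
    return ",".join(encode(v) for v in ordered)


# Skipped exon: one transcript includes an exon (acceptor, then donor).
print(as_code([], [(200, "acceptor"), (300, "donor")]))  # 0,1-2^
# Alternative donors: the two transcripts use different donor sites.
print(as_code([(120, "donor")], [(100, "donor")]))       # 1^,2^
```

Because the sites are numbered relative to one another rather than by genomic coordinate, structurally identical events in different genes receive the same string.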
PubMed ID: 18688268
PMC ID: PMC2467475
Article link: PLoS Comput Biol
Genes referenced: cir1 clec10a mapt vegfa
|Figure 1. Comparison of nomenclatures for alternative splicing. Examples of splicing structures in the 5 human genes VEGFA (A), CLEC10A (B), TCL6 (C), AURKC (D), and AIF1 (E). In each case a schema of the exon–intron structure is shown where variable sites are numbered consecutively from 5′ to 3′. Subsequently, the splicing structure is described with Malko's 5-component strings, Nagasaki's bit matrices and integer vectors, the nomenclature of the ASD/ATD/AEdb databases, and the AS code we propose in this work. The nomenclature of ASD/ATD/AEdb ambiguously assigns the same identifier to the structures in VEGFA (A) and TCL6 (C), and likewise to those in CLEC10A (B) and AURKC (D). In CLEC10A (B), the bit matrix system assumes independence between both sides of the exon and therefore cannot identify a single AS event. In AURKC (D), the vector (1,3) is assignable from the bit matrices, but it is not considered as part of the alternative donor event (9,13). Authors of the ASD/ATD/AEdb nomenclature propose the term “CIR” for complex intron retention structures. However, as in AIF1 (E), the selection of the central intron can be problematic, as the names “CIR-II-5p3p-5p-IR-3p”, “CIR-CIR-II5p3p-5p-5p”, or “CIR-II5p4p-CIR-IR-3p-3p” are all conceivable.|
|Figure 2. Pairwise AS events in the TCL6 gene. Schematic overview of the RefSeq transcripts of the TCL6 gene (top) and all pairwise AS events (A–N) they describe according to Definition 4. For each event, the corresponding AS code and the structure with the variable splice sites numbered from 5′ to 3′ are presented. Besides traditional events such as skipped exon (A and G), retained intron (B), mutually exclusive exons (H), alternative donor (C) and acceptor site (F), novel events are observed that involve more than one of the latter types (D and E) or are connected to differences in the transcription start/polyadenylation site (I through N). Note that in our method L, M and N are considered as three different events that expose the same structure (i.e., –,–).|
|Figure 3. Comparison of the AS landscape in human reference annotations. Distribution of AS events that are not related to alternative transcription starts/polyadenylation sites and contain exclusively introns with canonical splice sites in different reference annotations of the human genome: EnsEmbl, RefSeq, and Gencode. Numbers represent the event count for each different structure, and the proportions of the 4 simplest splicing patterns are colored as follows: exon skipping in blue, alternate donors in green, alternate acceptors in red, and retained introns in yellow; the fraction of all types of more complex events is shown together in grey, with the number of different structures observed there given in brackets. In general, the AS landscape is similar across the three datasets, with the biggest difference being a comparatively larger fraction of complex events in EnsEmbl.|
|Figure 4. Landscape of AS events in the 5′ UTR vs. CDS. Landscape of AS events in RefSeq with all variable splice sites included in the 5′ UTR (A) in comparison to the ones included in the genomic region of the CDS (B). The structurally different groups are colored as in Figure 3. ES is more frequent in the CDS, whereas IR is observed more often in the 5′ UTR. Whereas in CDS alternative acceptors are more frequent than alternative donors, the landscape of events in the 5′ UTR exhibits a reverse ratio with a bias against alternative acceptors. The more complex AS events are mainly located in the region of the CDS.|
|Figure 5. Bias of potential stop codons in the splice site sequences. Proportion of the coding exons that truncate the ORF when artificially extended into the intronic region at the splice donor (blue diamonds) or splice acceptor sites (red crosses). The horizontal axis shows the number of artificial codons taken from the intronic sequence (i.e., the 1st, 2nd, 3rd, etc. codon downstream of the splice donor or upstream of the splice acceptor, respectively). The vertical axis to the left gives the percentage of sites that show an in-frame stop with the theoretical inclusion of the respective codon. For the regions A, B, and C, sequence logos are shown where dotted lines indicate the exon boundary and intrinsic potential stop codons are shaded in grey. When considering exclusively the extension of one (complete) codon into the intron, one third fewer ORFs would be truncated when extending at the acceptor site compared to the donor site (A vs. B). The observation can partially be explained by in-frame stop codons intrinsic to the different splice site consensus sequences. A secondary peak of stop codons is observed ∼9 extended codons upstream of the acceptor site at a common position for the branch point (consensus sequence C). Sequence logos have been produced with the tool “seqlogo”. Branch point sequences have been kindly provided by the Ast laboratory (http://ast.bioinfo.tau.ac.il/BranchSite.htm).|
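The donor-side extension test described in this caption can be illustrated with a short Python sketch. It is not the authors' pipeline: the function name `first_stop_in_extension`, the `offset` parameter (the number of leading intron bases assumed to complete the last exon codon), and the toy sequences are assumptions made for illustration. The acceptor side would need the analogous scan running upstream from the intron end.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_in_extension(intron_seq, max_codons, offset=0):
    """Scan complete codons read from the donor site into the intron.

    Returns the 1-based index of the first in-frame stop codon among the
    first max_codons artificial codons, or None if there is none.
    offset (0, 1, or 2) is the number of intron bases skipped before the
    first complete artificial codon starts.
    """
    seq = intron_seq.upper()[offset:]
    for k in range(max_codons):
        codon = seq[3 * k: 3 * k + 3]
        if len(codon) < 3:
            break  # ran out of intronic sequence
        if codon in STOP_CODONS:
            return k + 1
    return None


# Toy donor sequence GTAAGT: frame 0 reads GTA, AGT (no stop), whereas
# skipping one base reads TAA as the first artificial codon.
print(first_stop_in_extension("GTAAGT", 2))            # None
print(first_stop_in_extension("GTAAGT", 1, offset=1))  # 1
```

Counting, per codon index, how many exons in an annotation return that index would give a profile of the kind plotted on the horizontal axis.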
|Figure 6. Landscape of AS in noncoding transcripts. The landscape of AS in CDSs of coding transcripts (A) compared to events occurring in noncoding transcripts (B), with the different classes colored as in Figure 3. Complex events and retained introns are more frequent in noncoding transcripts, whereas the fraction of ES is clearly higher in coding regions. Relative to alternative acceptors, alternative donors are more frequent in the noncoding transcripts.|
|Figure 7. Comparative genomics of the AS landscape in 12 metazoa. For each of the 12 compared species a pie diagram shows the distribution of events across 5 structurally different classes (color scheme as in Figure 3). Vertebrates, and mammals in particular, exhibit more exon skipping and complex events and fewer retained introns than invertebrates. Estimations of evolutionary distances are given according to .|
|Figure 8. Algorithm for the extraction of pairwise AS events. The algorithm extracts from a splicing graph G(V,E) all events that are described by transcript pairs (St,Su) in a locus C. Using a priority queue W, the nodes si of the splicing graph are iterated from 5′ to 3′ according to pos(si). Initially the queue contains the root; subsequently it is filled with all nodes sj that are connected by outedges of si, provided they are supported by either St or Su.|
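The traversal described in this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published algorithm: each transcript is reduced to a list of splice-site positions, the graph is built from the two transcripts only (so every edge is trivially supported by St or Su), artificial root and leaf nodes are represented by negative and positive infinity, and an event is reported whenever variable sites are enclosed by two sites common to both transcripts. The function name `pairwise_events` and the labels `"t"` and `"u"` are hypothetical.

```python
import heapq

ROOT, LEAF = float("-inf"), float("inf")  # artificial start and end nodes


def pairwise_events(sites_t, sites_u):
    """Return a list of (variable sites of t, variable sites of u) tuples."""
    out_edges = {}                # node -> set of successor nodes
    support = {ROOT: {"t", "u"}}  # node -> transcripts that pass through it
    for name, sites in (("t", sites_t), ("u", sites_u)):
        path = [ROOT, *sorted(sites), LEAF]
        for a, b in zip(path, path[1:]):
            out_edges.setdefault(a, set()).add(b)
            support.setdefault(b, set()).add(name)

    queue, seen = [ROOT], {ROOT}  # priority queue ordered by position
    pending = {"t": [], "u": []}  # variable sites since the last common site
    events = []
    while queue:
        node = heapq.heappop(queue)  # next node in 5'-to-3' order
        if len(support[node]) == 2:
            # A common site closes any open stretch of variation.
            if pending["t"] or pending["u"]:
                events.append((pending["t"], pending["u"]))
                pending = {"t": [], "u": []}
        else:
            (owner,) = support[node]
            pending[owner].append(node)
        for nxt in out_edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(queue, nxt)
    return events


# Transcript t contains an exon (sites 200 and 300) that u skips.
print(pairwise_events([100, 200, 300, 400], [100, 400]))  # [([200, 300], [])]
```

The priority queue guarantees that nodes are visited in positional order even though successors of different transcripts are pushed at different times, which is what lets a single pass delimit each stretch of variation.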