XB-ART-44836
Genesis. March 1, 2012; 50 (3): 143-54.
From expression cloning to gene modeling: the development of Xenopus gene sequence resources.
The Xenopus community has made concerted efforts over the last 10-12 years to systematically improve the available sequence information for this amphibian model organism, which is ideally suited to the study of early development in vertebrates. Here I review progress in the collection of both sequence data and physical clone reagents for protein-coding genes. I conclude that we have cDNA sequences for around 50% and full-length clones for about 35% of the genes in Xenopus tropicalis, and similar numbers but a smaller proportion for Xenopus laevis. In addition, I demonstrate that the gaps in the current genome assembly create problems for the computational elucidation of gene sequences, and suggest some ways to ameliorate the effects of these gaps.
PubMed ID: 22344767
PMC ID: PMC3488295
Article link: Genesis.
Grant support: U117597137 Medical Research Council, MC_U117597137 Medical Research Council, MRC_MC_U117597137 Medical Research Council
Genes referenced: ids
Article Images:
|Fig 1. Using EST assemblies to accurately define transcript sequences. (a) EST contigs for a pair of Xenopus laevis homeolog genes (H+ transporting F1 ATP synthase, epsilon subunit, atp5e), illustrating sensitivity of clustering and accuracy of assembly. The EST assembly for each homeolog is shown, where the consensus sequence is constructed from the aligned EST sequences. Protein alignments are used to determine the frame and position of the ORF. (b) BLASTn alignment of the open reading frames (plus 12 bases upstream) of these two genes illustrates how important knowing both of these sequences is to the process of morpholino design. The Gurdon IDs for the assemblies for the A and B homeologs are Xl2.1-LANE.XL433b16.5 and Xl2.1-Ls19H.BX846122.5, respectively.|
|Fig 2. Accumulation of Xenopus cDNA and EST sequences over time. Computational analysis of sequence submission dates reveals the rates of arrival of different types of gene sequence for the two frogs. Upper panel: cDNA numbers analyzed by individual submission date pooled by month. Lower panel: EST numbers grouped by UniGene library entry, using earliest submission date noted for each library, and pooled by month.|
|Fig 3. Mis-modeling of genes leads to incorrect sequences in public databases. There are gaps in the Xenopus tropicalis genome assembly in the ifngr2 gene locus, and this has created problems for the gene modeling process. The figure compares alignments for two gene models from different sources and two cDNA-based sequences: X. tropicalis gene model A, Ensembl ENSXETT00000000072; gene model B, NCBI XM_002942799.1; EST assembly consensus, Gurdon Xt7.1-TNeu110f24.3; and X. laevis cDNA, NCBI NM_001099874.1. Panel (a) first 400 bp of the assembled EST contig. Visible striations in the EST alignments section indicate the quality of the assembly. The section of the contig sequence missing from the genome sequence is outlined in red. Panel (b) three-way alignment between the 5′-most parts of the EST consensus sequence and the two gene model sequences. Colored boxes and dashed lines link sections of sequence to their positions in the detailed view from the UCSC genome browser page. The red arrow indicates the most likely assembly gap to contain the ‘missing’ sequence. Panel (c) UCSC genome browser view for the whole ifngr2 locus on scaffold_1108, Xt-v4.1, showing BLAT alignments for the EST consensus and gene model sequences. Public EST alignments mapped by UCSC are also shown. Panel (d) a multiple sequence alignment of the resulting translated protein sequences (first ∼70 residues), including a full-length cDNA sequence from Xenopus laevis, showing where the gene-model-derived protein sequences diverge from the cDNA-based sequences.|
|Fig 4. Systematic analysis for gene model transcript completeness. Transcript sequences are computationally analyzed (a) for the longest open reading frame (ORF), using in-frame ATG (green) and STOP codons (red). Missing UTRs are identified (b) where the ORF runs to the ends of the transcript. Truncated ORFs are identified (c) where protein matches are detected upstream of the most 5′ ATG and/or there is no 3′ STOP codon.|
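The completeness checks described in the Fig 4 caption can be illustrated with a minimal Python sketch. This is not the article's own pipeline: the names `Orf`, `longest_orf`, and `completeness_flags` are hypothetical, only the three forward frames are scanned, the standard stop codons (TAA, TAG, TGA) are assumed, and "runs to the ends of the transcript" is approximated as fewer than three bases of flanking sequence. The protein-match test for 5′ truncation is not implemented; the sketch only flags ORFs with no in-frame STOP upstream of the ATG as candidates for that check.

```python
from dataclasses import dataclass
from typing import Optional

STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


@dataclass
class Orf:
    frame: int           # 0, 1 or 2
    start: int           # index of the A in the ATG
    end: int             # exclusive; includes the STOP codon when present
    has_stop: bool       # False if the frame runs off the 3' end
    open_upstream: bool  # True if no in-frame STOP precedes the ATG


def _longer(best: Optional[Orf], cand: Orf) -> Orf:
    """Keep whichever ORF spans more bases (first one wins ties)."""
    if best is None or (cand.end - cand.start) > (best.end - best.start):
        return cand
    return best


def longest_orf(seq: str) -> Optional[Orf]:
    """Longest ATG-initiated ORF over the three forward frames."""
    seq = seq.upper().replace("U", "T")
    best = None
    for frame in range(3):
        segment = 0   # index of the STOP-delimited segment in this frame
        atg = None    # first ATG seen in the current segment
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and atg is None:
                atg = i
            elif codon in STOPS:
                if atg is not None:
                    best = _longer(best, Orf(frame, atg, i + 3, True, segment == 0))
                atg = None
                segment += 1
        if atg is not None:  # no STOP before the 3' end of the transcript
            end = frame + 3 * ((len(seq) - frame) // 3)
            best = _longer(best, Orf(frame, atg, end, False, segment == 0))
    return best


def completeness_flags(seq: str) -> dict:
    """Flag missing UTRs and possible ORF truncation for one transcript."""
    orf = longest_orf(seq)
    if orf is None:
        return {"orf": None}
    return {
        "orf": orf,
        "missing_5utr": orf.start < 3,
        "missing_3utr": orf.has_stop and len(seq) - orf.end < 3,
        "truncated_3prime": not orf.has_stop,
        # Only a candidate: the caption's criterion needs protein matches
        # upstream of the ATG, which this sketch does not compute.
        "candidate_5prime_truncation": orf.open_upstream,
    }
```

Calling `completeness_flags` on a transcript string returns the chosen ORF together with the four boolean flags; in practice the candidate 5′ truncations would still need to be confirmed against protein alignments, as the caption describes.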