
XB-ART-34265

BMC Genomics 2005 Apr 01;6:144. doi: 10.1186/1471-2164-6-144.


Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters.

Pavy N , Paule C , Parsons L , Crow JA , Morency MJ , Cooke J , Johnson JE , Noumen E , Guillet-Claude C , Butterfield Y , Barber S , Yang G , Liu J , Stott J , Kirkpatrick R , Siddiqui A , Holt R , Marra M , Seguin A , Retzel E , Bousquet J , MacKay J .

Abstract
The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss). We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including blast searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignation and Hidden Markov Model searches against PFAM protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the Uniref100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of translationally controlled tumour proteins and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data either based on their occurrence in the cDNA libraries or on functional annotations. 
This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
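The abstract states that consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define transcripts. One way to picture that step is as a connected-components grouping: any two consensus sequences that share a clone end up in the same transcript. The following is a minimal sketch under that assumption, not the authors' pipeline; the function name, the two input mappings and all identifiers are hypothetical.

```python
from collections import defaultdict


def link_transcripts(read_to_consensus, read_to_clone):
    """Group consensus sequences that share a cDNA clone into transcripts.

    read_to_consensus: dict mapping read id -> consensus id (hypothetical input)
    read_to_clone:     dict mapping read id -> clone id (hypothetical input)
    Returns a list of sets of consensus ids, one set per transcript.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect, for every clone, the consensus sequences its reads belong to.
    clone_to_consensus = defaultdict(set)
    for read_id, consensus_id in read_to_consensus.items():
        find(consensus_id)  # register consensus sequences that link to nothing
        clone_id = read_to_clone.get(read_id)
        if clone_id is not None:
            clone_to_consensus[clone_id].add(consensus_id)

    # Consensus sequences reached from the same clone form one transcript.
    for members in clone_to_consensus.values():
        ordered = sorted(members)
        for other in ordered[1:]:
            union(ordered[0], other)

    groups = defaultdict(set)
    for consensus_id in parent:
        groups[find(consensus_id)].add(consensus_id)
    return sorted(groups.values(), key=sorted)
```

Under this reading, the number of returned groups corresponds to the transcript count, which is why it can be smaller than the number of consensus sequences.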

PubMed ID: 16236172
PMC ID: PMC1277824
Article link: BMC Genomics


	Figure 1. Composition of white spruce consensus sequences (contigs and singletons) according to the direction of the reads (3' or 5') and their redundancy in the database (number of clones).
	Figure 2. Sequence sizes. Size distribution of the consensus sequences derived from the pine (PGI5.0) and white spruce (ArboreaSet) assemblies.
	Figure 3. Sequence similarities. Number of white spruce transcript sequences similar to Uniref100 proteins and to Arabidopsis, pine and Cycas sequences, according to the blast e-value cutoff.
	Figure 4. Hierarchical presentation of the number of spruce transcripts with or without similarities with pine, Arabidopsis, rice and poplar. The numbers were derived by filtering the tblastx search results with an e-value < 1e-10.
	Figure 5. Protein families. Occurrence of the 30 most abundant protein families in the white spruce dataset identified by HMM searches with an e-value < 1e-10 against the PFAM database.
	Figure 6. Number of spruce consensus sequences (identified by HMM searches against PFAM) relative to the size of the gene families in Arabidopsis (a) and rice (b). Each point represents a protein family detected by the HMM searches with p-score < 1e-10. Point coordinates are the number of genes found in the analysed Angiosperm genome (x axis) and the number of contigs found in the spruce database (y axis), after a log transformation. The red, blue and green lines represent the ratios 1:1, 1:2, and 1:4, respectively. Red points represent sequences found 4 times more in white spruce than in Arabidopsis: 1. AWPM-19-like family [PF05512], 2. Chalcone and stilbene synthases, C-terminal domain [PF02797], 3. Phosphoenolpyruvate carboxykinase [PF01293]. Blue points represent sequences found 4 times more in spruce than in rice : 4. Ribosomal protein S28e [PF01200], 5. Cyclin-dependent kinase regulatory subunit [PF01111], 6. TIR domain [PF01582], 7. Splicing factor 3B subunit 10 [PF07189], 8. Ribosomal Proteins L2, C-terminal domain [PF03947]. Green points represent sequences found 4 times more in spruce compared to both Arabidopsis and rice: 9. Translationally controlled tumour protein [PF00838], 10. S-adenosyl-L-homocysteine hydrolase [PF05221], 11. S-adenosylmethionine synthetase, C-terminal domain [PF02773].
	Figure 7. SpruceDB core tables and data sources. Data from flat files on ESTs, Assemblies and blast hits is loaded into the core tables Read, Contig, Contig_Element and Blast_Hsp. Additional information on taxonomy identifiers and Uniref100 peptides is obtained from shared databases.
	Figure 8. Examples of the interface of the SpruceDB database. A) Use of Query 1 to search for contigs matching "cinnamoyl alcohol dehydrogenase" among the blastx results loaded in the database. B) Display of the results indicating alignment parameters (alignment length, similarity and identity level). C) BioDATA page reached by clicking on MNC5693153 in the Query 1 results. The upper figure illustrates the alignment of the members of the contigs in a color-coded manner. Read names written in blue and white refer to 5' and 3' reads, respectively. D) Query 8, which retrieves sequence aliases and library names for specified MN_Ids. E) Query 8 results showing libraries GQ004 and GQ006.
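
The captions of Figures 7 and 8 name four core tables (Read, Contig, Contig_Element, Blast_Hsp) and two kinds of queries: a keyword search over blastx hits and a lookup of library names for a contig. The sketch below shows how such a layout might be queried, using an in-memory SQLite database. Only the table names come from the captions; every column name, the helper functions, and the demo rows (contig_A, lib_X and so on) are hypothetical, and the real SpruceDB schema may well differ.

```python
import sqlite3

# Table names follow the Figure 7 caption; all columns are assumed for illustration.
SCHEMA = """
CREATE TABLE "Read" (read_id TEXT PRIMARY KEY, library TEXT, direction TEXT);
CREATE TABLE Contig (contig_id TEXT PRIMARY KEY);
CREATE TABLE Contig_Element (
    contig_id TEXT REFERENCES Contig(contig_id),
    read_id   TEXT REFERENCES "Read"(read_id)
);
CREATE TABLE Blast_Hsp (
    contig_id       TEXT REFERENCES Contig(contig_id),
    hit_description TEXT,
    evalue          REAL
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany('INSERT INTO "Read" VALUES (?, ?, ?)', [
        ("read_1", "lib_X", "5'"),
        ("read_2", "lib_Y", "3'"),
        ("read_3", "lib_X", "5'"),
    ])
    conn.executemany("INSERT INTO Contig VALUES (?)",
                     [("contig_A",), ("contig_B",), ("contig_C",)])
    conn.executemany("INSERT INTO Contig_Element VALUES (?, ?)", [
        ("contig_A", "read_1"),
        ("contig_A", "read_2"),
        ("contig_B", "read_3"),
    ])
    conn.executemany("INSERT INTO Blast_Hsp VALUES (?, ?, ?)", [
        ("contig_A", "cinnamoyl alcohol dehydrogenase (demo row)", 1e-50),
        ("contig_B", "cinnamoyl alcohol dehydrogenase (demo row)", 1e-5),
        ("contig_C", "chalcone synthase (demo row)", 1e-80),
    ])
    return conn


def contigs_matching(conn, keyword, max_evalue=1e-10):
    """Keyword search over hit descriptions, keeping hits below the e-value cutoff."""
    return conn.execute(
        """SELECT contig_id, hit_description, evalue
           FROM Blast_Hsp
           WHERE hit_description LIKE ? AND evalue < ?
           ORDER BY evalue""",
        (f"%{keyword}%", max_evalue),
    ).fetchall()


def libraries_for_contig(conn, contig_id):
    """Library names of the reads assembled into the given contig."""
    rows = conn.execute(
        """SELECT DISTINCT r.library
           FROM Contig_Element AS ce
           JOIN "Read" AS r ON r.read_id = ce.read_id
           WHERE ce.contig_id = ?
           ORDER BY r.library""",
        (contig_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The default cutoff of 1e-10 mirrors the e-value threshold quoted in the abstract and captions; keeping it as a parameter makes it easy to explore how the number of matches changes with stringency.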
