Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data.
Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text-based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text-based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene-oriented databases, but mostly give indirect access to non-sequence data via navigational links. Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text-based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species. This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes.
Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic-scale data resources.
PubMed ID: 18928517
PMC ID: PMC2587480
Article link: BMC Bioinformatics.
Grant support: Wellcome Trust
Genes referenced: myf5 myod1 t
Article Images:
Figure 1. Generic application logic used in indirect sequence similarity search for gene data. (1.) the user pastes a gene sequence into the browser window and sends it to the search engine; (2.) the gene sequence is blasted against the database of sequences associated with the gene data; (3.) IDs of matching sequence are returned to the search engine; (4.) the matching sequence IDs are used to query the local managing database for available gene data; (5.) a list of matching gene data and descriptive text is returned to the search engine; (6.) an html formatted page containing the retrieved gene data and descriptive text is returned to the user's browser.
Figure 2. Example output of quickImage. The query sequence was X. tropicalis myf5, used to retrieve image data for this and related genes. The upper panel shows alignment and similarity between the query sequence and the matching image source sequences. The first three sets of retrieved images are shown; for each set, the accession number of the image source sequence and the best BLAST matches against human, mouse and Xenopus proteins are provided for identification purposes, as well as the originating image collection and species. Images marked A and B show highly similar expression of myf5 in the two frog species at the same development stage. The image marked C shows an interestingly similar expression pattern for the related gene myod/myf3 at a slightly later stage.
Figure 3. Example output of quickLit. The query sequence was X. tropicalis brachyury, used to retrieve literature references for this and related genes. The retrieved references are shown for the first few matching sequences. The retrieved data shows a high degree of apparent relevance as indicated by the title of each paper, and clear organisation of reference by species. Reference summaries and associated sequence data were downloaded from NCBI GenBank and various model organism databases.
Figure 4. Example output of quickGene. The query sequence was X. tropicalis brachyury, used to search gene name data from Entrez Gene. Note the variable nature of the retrieved gene names for this set of related genes.
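The six-step application logic in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `gene_data` table and its columns, and the E-value cutoff are all hypothetical, chosen only to show how sequence IDs from a BLAST search might be joined to non-sequence data and ordered by similarity. The sketch assumes the search is run with the NCBI BLAST+ `blastn` program writing tabular output (`-outfmt 6`) and that the managing database is SQLite.

```python
import html
import sqlite3
import subprocess

# Hypothetical schema for the local managing database (an assumption, not from the paper).
SCHEMA = "CREATE TABLE gene_data (seq_id TEXT, item TEXT, description TEXT)"


def run_blast(query_fasta_path, blast_db):
    """Step 2: search the query against the sequence database.

    Assumes NCBI BLAST+ is installed; the 1e-5 cutoff is an arbitrary example.
    """
    result = subprocess.run(
        ["blastn", "-query", query_fasta_path, "-db", blast_db,
         "-outfmt", "6", "-evalue", "1e-5"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_hits(tabular_text):
    """Step 3: return (sequence ID, best bit score) pairs, most similar first."""
    best = {}
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if line.startswith("#") or len(fields) < 12:
            continue  # skip comments and malformed rows
        subject_id, bitscore = fields[1], float(fields[11])
        if bitscore > best.get(subject_id, float("-inf")):
            best[subject_id] = bitscore
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


def fetch_gene_data(conn, hits):
    """Steps 4-5: look up the non-sequence data attached to each matching ID."""
    results = []
    for subject_id, bitscore in hits:
        rows = conn.execute(
            "SELECT item, description FROM gene_data "
            "WHERE seq_id = ? ORDER BY item",
            (subject_id,),
        ).fetchall()
        for item, description in rows:
            results.append({"seq_id": subject_id, "bitscore": bitscore,
                            "item": item, "description": description})
    return results


def render_html(results):
    """Step 6: format the retrieved data as a simple HTML list."""
    lines = ["<ul>"]
    for r in results:
        lines.append("<li>{} ({}): {}</li>".format(
            html.escape(r["item"]), html.escape(r["seq_id"]),
            html.escape(r["description"])))
    lines.append("</ul>")
    return "\n".join(lines)
```

Because the lookup iterates over hits in descending bit-score order, the returned items stay grouped by matching sequence and ordered by similarity to the query, which mirrors the organisation described in the abstract. Sequences with no attached data simply contribute no rows.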