3'-UTR SIRF: a database for identifying clusters of short interspersed repeats in 3' untranslated regions.
Short (~5 nucleotides) interspersed repeats regulate several aspects of post-transcriptional gene expression. Previously we developed an algorithm (REPFIND) that assigns P-values to all repeated motifs in a given nucleic acid sequence and reliably identifies clusters of short CAC-containing motifs required for mRNA localization in Xenopus oocytes.
To facilitate the identification of genes possessing clusters of repeats that regulate post-transcriptional aspects of gene expression in mammalian genes, we used REPFIND to create a database of all repeated motifs in the 3' untranslated regions (UTRs) of genes from the Mammalian Gene Collection (MGC). The MGC database includes seven vertebrate species: human, cow, rat, mouse and three non-mammalian vertebrate species. A web-based application was developed to search this database of repeated motifs and generate species-specific lists of genes containing specific classes of repeats in their 3'-UTRs. This computational tool is called 3'-UTR SIRF (Short Interspersed Repeat Finder), and it reveals that hundreds of human genes contain an abundance of short CAC-rich and CAG-rich repeats in their 3'-UTRs that are similar to those found in mRNAs localized to the neurites of neurons. We tested four candidate mRNAs for localization in rat hippocampal neurons by in situ hybridization. Our results show that two candidate CAC-rich (Syntaxin 1B and Tubulin beta4) and two candidate CAG-rich (Sec61alpha and Syntaxin 1A) mRNAs are localized to distal neurites, whereas two control mRNAs lacking repeated motifs in their 3'-UTR remain primarily in the cell body.
Computational data generated with 3'-UTR SIRF indicate that hundreds of mammalian genes have an abundance of short CA-containing motifs that may direct mRNA localization in neurons. In situ hybridization shows that four candidate mRNAs are localized to distal neurites of cultured hippocampal neurons. These data suggest that short CA-containing motifs may be part of a widely utilized genetic code that regulates mRNA localization in vertebrate cells. The use of 3'-UTR SIRF to search for new classes of motifs that regulate other aspects of gene expression should yield important information in future studies addressing cis-regulatory information located in 3'-UTRs.
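The kind of search described above (listing genes whose 3'-UTRs contain a given class of repeat) can be illustrated with a small, self-contained sketch. It is not the 3'-UTR SIRF implementation. The table and column names follow the Figure 1 caption, but the column types, the toy rows, the accession labels, the function names `build_demo_db` and `find_clusters`, and the definition of cluster size as `cluster_end - cluster_start` are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names follow the
# Figure 1 caption; the column types are assumptions.
SCHEMA = """
CREATE TABLE insdseq (
    INSDSeq_ID INTEGER PRIMARY KEY,
    INSDSeq_primaryAccession TEXT
);
CREATE TABLE "match" (
    INSDSeq_ID INTEGER REFERENCES insdseq(INSDSeq_ID),
    p_value REAL,
    motif TEXT,
    num_motifs INTEGER,
    cluster_start INTEGER,
    cluster_end INTEGER
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO insdseq VALUES (?, ?)",
        [(1, "ACC_A"), (2, "ACC_B")],  # placeholder accessions, not real ones
    )
    conn.executemany(
        'INSERT INTO "match" VALUES (?, ?, ?, ?, ?, ?)',
        [
            (1, 1e-6, "CACAC", 8, 100, 420),
            (1, 0.2, "GGATT", 2, 10, 60),
            (2, 1e-3, "TCAGCA", 5, 30, 250),
            (2, 1e-8, "CACTTCAC", 4, 5, 90),  # 8 nt, outside the 5-7 nt window
        ],
    )
    return conn


def find_clusters(conn, core, max_p, min_len=5, max_len=7):
    """Return clusters whose motif contains `core`, with P-value <= max_p
    and motif length within [min_len, max_len], most significant first."""
    query = """
        SELECT i.INSDSeq_primaryAccession, m.motif, m.p_value,
               m.cluster_end - m.cluster_start AS cluster_size
        FROM "match" AS m
        JOIN insdseq AS i USING (INSDSeq_ID)
        WHERE m.motif LIKE ?
          AND m.p_value <= ?
          AND length(m.motif) BETWEEN ? AND ?
        ORDER BY m.p_value
    """
    params = (f"%{core}%", max_p, min_len, max_len)
    return conn.execute(query, params).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for row in find_clusters(demo, "CAC", 0.01):
        print(row)
```

The same query run against a shuffled-sequence table such as `match_random` would give the background against which hits in real 3'-UTRs can be compared.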
PubMed ID: 17663765
PMC ID: PMC1973087
Article link: BMC Bioinformatics.
Genes referenced: gnl3 nanos1 sec61a1 slc25a20 stx1a stx1b stx5 tubb4a
Article Images:
|Figure 3. Localization of the Rat and Human Tubβ4 3'-UTRs in Xenopus oocytes. The 3'-UTRs of rat and human Tubβ4 (Acc. # 82522352 and BC013683, respectively) and of human Stx1B2 (Acc. # BC062298) were synthesized and labelled in vitro with Alexa-Fluor-546-UTP. These fluorescently labelled RNAs were then microinjected into stage II Xenopus oocytes. All three RNAs localize to the vegetal pole, which is oriented downwards in all panels. A fragment of the Xenopus β-globin gene (XβG) was used as a negative control for localization, whereas the mitochondrial cloud RNA localization element from the Xenopus Xcat-2 mRNA (MCLE) was used as a positive control. Note that the extent of Stx1B2 localization is higher than that of either Tubβ4 RNA. Arrows mark the RNA localized towards the vegetal pole, and GV indicates the germinal vesicle (nucleus) in these cells, which are ~300 μm in diameter.|
|Figure 1. Schematic representation of the information stored in the 3'-UTR SIRF database. Sequences were extracted from the Mammalian Gene Collection (NCBI) and stored in the 'insdseq' table of the database. REPFIND was then used to identify clusters of all perfect repeats in the 3'-UTRs of these sequences. The results of this computational analysis were stored in the 'match' table. A similar table, 'match_random', was generated from the same sequences after their nucleotides had been randomly shuffled. All information included in the insdseq table is from the NCBI database, except INSDSeq_Create_release, which records when the table entry was created, and INSDSeq_Update_release, which records when the table entry was modified. INSDSeq_ID serves as the identification number for the table. It plays the same role as INSDSeq_primaryAccession but is used instead because, as an integer, it is more efficient for indexing. INSDSeq_ID in the match and match_random tables indicates the gene corresponding to the cluster identified by REPFIND. In addition, the P-value, sequence of the repeat (motif), number of motifs, start (cluster_start), and end (cluster_end) of each cluster are shown. These last two entries are used to calculate the size of each identified cluster.|
|Figure 2. Cumulative cluster frequencies of CAC-containing motifs in human 3'-UTRs. Trends was used to determine the cumulative frequencies of clusters of 5–7 nucleotide long CAC-containing repeated motifs in the 'match' table (blue line) and 'match_random' table (red line). As can be seen, the frequencies of CAC-containing motifs with low P-values are much higher in real 3'-UTRs than they are in the shuffled ones. This type of separation is seen in all seven vertebrate species and with independently shuffled control data sets (data not shown).|
|Figure 4. Mouse, Rat, and Human Tubβ4 3'-UTRs all have an abundance of CAC-containing motifs. Even though the human Tubβ4 3'-UTR has little sequence similarity when it is aligned with the mouse or rat orthologs, all three genes are shown to have a highly significant number of CAC motifs when individually assessed by REPFIND. For the rat and mouse sequences, REPFIND was performed without filtering low complexity regions and the human background was used. The accession number for the mouse Tubβ4 gene is BC054831. Motifs depicted in grey would have yielded higher (less significant) P-values, and therefore were not used to generate the P-values shown.|
|Figure 5. REPFIND analysis of dendritic mRNAs CamKIIα and Arc. The 3'-UTRs of rat Arc (Acc. #NM_019361) and human CamKIIα (Acc. #BC012321) were analyzed for all repeats. As can be seen, CAG or CAG-containing motifs comprise the top scoring cluster for each 3'-UTR. Motifs depicted as small vertical colored bars indicate the cluster with the most significant P-value. The red bars below each 3'-UTR represent RNA sequences that have dendritic RNA localization activity and were mapped in previous studies using reporter assays [28, 29].|
|Figure 7. Endogenous CAC- and CAG-rich mRNAs are localized to distal processes in mammalian neurons. In situ hybridization was used to reveal the subcellular distribution of each mRNA in rat hippocampal neurons that had been cultured for 8 days after plating. Stx5 was used as a negative control for localization since it has no repeats and resides exclusively in the cell body. CamKIIα was used as a positive control since it is well known to localize to distal processes. White arrows show labelling in distal processes. All images were collected at identical laser settings using confocal microscopy, and all images were processed together as a montage image to enhance contrast. In addition, all cells came from the same experiment and each cell has multiple processes in the focal plane, but often a single process is preferentially labelled. The identity of processes as either axons or dendrites is not yet known. Specific mRNAs were detected in distal processes with both CAC-rich mRNAs (Tubβ4 and Syn1B2) and both CAG-rich mRNAs (Syn1A and Sec61α) that were identified with 3'-UTR SIRF. The cell bodies in these images are approximately 15 μm in diameter.|
|Figure 8. Semi-quantitative analysis of the localization of endogenous mRNAs. To estimate the extent of localization of each endogenous mRNA, images were collected from 30–40 cells using identical laser settings from the same experiment shown in Figure 7. All raw images were assembled into a montage and a threshold was applied to help identify mRNA labelled in distal processes. A cell was considered to be positive for localization if mRNA could be detected in a process more than 40 μm away from the cell body. If no signal could be detected more than 10 μm away from the cell body, the cell was considered to be negative for mRNA localization. About 15–30 percent of all cells showed some signal in processes 10–40 μm away from the cell body. These cells were excluded from the graph since they added little information to this analysis.|
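The scoring rule in the Figure 8 caption can be written out as a short sketch. The 40 μm and 10 μm thresholds come from the caption. The function names, the use of a single "farthest detected signal" distance per cell, and the choice to compute the percentage over scored cells only are assumptions made for illustration, not the authors' analysis code.

```python
from typing import Iterable, Optional

POSITIVE_UM = 40.0  # signal farther than this from the cell body: positive
NEGATIVE_UM = 10.0  # no signal beyond this distance: negative


def classify_cell(max_signal_distance_um: float) -> Optional[str]:
    """Classify one cell by the farthest distance (in micrometres) at which
    mRNA signal was detected in a process. Returns None for the
    intermediate 10-40 micrometre class, which the caption excludes."""
    if max_signal_distance_um > POSITIVE_UM:
        return "positive"
    if max_signal_distance_um <= NEGATIVE_UM:
        return "negative"
    return None


def percent_localized(distances_um: Iterable[float]) -> float:
    """Percentage of scored (non-excluded) cells that are positive.
    Using scored cells as the denominator is an assumption."""
    labels = [classify_cell(d) for d in distances_um]
    scored = [label for label in labels if label is not None]
    if not scored:
        return 0.0
    return 100.0 * scored.count("positive") / len(scored)


if __name__ == "__main__":
    # Made-up distances for five cells, in micrometres.
    print(percent_localized([55, 5, 25, 80, 0]))
```

In this made-up example the cell at 25 μm falls into the excluded intermediate class, so two of the four scored cells count as positive.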