Nagy A et al. (2008), Identification and correction of abnormal, inco...

XB-ART-38279

BMC Bioinformatics 2008 Aug 27;9:353. doi: 10.1186/1471-2105-9-353.

Identification and correction of abnormal, incomplete and mispredicted proteins in public databases.

Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Bányai L, Patthy L.

Abstract
Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. 
MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
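
The first three conflict types listed in the abstract are logical checks on a protein's predicted features, so they can be illustrated with a small rule function. The following is a minimal sketch and not the MisPred implementation: the `ProteinAnnotation` class, its field names, and the function `localization_conflicts` are hypothetical, and it assumes that signal peptides, transmembrane helices and domain localization classes have already been predicted by other tools. Conflict 1 is written in the form used in the figure captions (extracellular domains present, but neither a signal peptide nor a transmembrane helix).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProteinAnnotation:
    """Hypothetical summary of the predicted features of one protein."""
    name: str
    has_signal_peptide: bool = False
    transmembrane_helices: int = 0
    extracellular_domains: List[str] = field(default_factory=list)
    cytoplasmic_domains: List[str] = field(default_factory=list)
    nuclear_domains: List[str] = field(default_factory=list)


def localization_conflicts(p: ProteinAnnotation) -> List[int]:
    """Return the numbers (1-3) of the conflict types that the entry triggers."""
    conflicts = []
    no_tm = p.transmembrane_helices == 0
    # Conflict 1: extracellular domains without any secretory signal.
    if p.extracellular_domains and not p.has_signal_peptide and no_tm:
        conflicts.append(1)
    # Conflict 2: extracellular and cytoplasmic domains without a membrane anchor.
    if p.extracellular_domains and p.cytoplasmic_domains and no_tm:
        conflicts.append(2)
    # Conflict 3: extracellular and nuclear domains in the same protein.
    if p.extracellular_domains and p.nuclear_domains:
        conflicts.append(3)
    return conflicts


# Feature values below follow the figure captions for these two entries.
lplc4 = ProteinAnnotation(
    name="LPLC4_HUMAN",
    extracellular_domains=["LBP_BPI_CETP", "LBP_BPI_CETP_C"],
)
yl15 = ProteinAnnotation(
    name="YL15_CAEEL",
    extracellular_domains=["Kunitz_BPTI"],
    nuclear_domains=["Homeobox"],
)

print(localization_conflicts(lplc4))  # [1]
print(localization_conflicts(yl15))   # [1, 3]
```

A single entry can trigger several conflicts at once, which is why the function returns a list instead of stopping at the first match.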

PubMed ID: 18752676
PMC ID: PMC2542381
Article link: BMC Bioinformatics

	Figure 1. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry LPLC4_HUMAN. The protein contains extracellular domains LBP_BPI_CETP and LBP_BPI_CETP_C but was found to lack both a signal peptide and transmembrane helices. The human sequence was corrected (LPLC4_HUMAN_corrected) by targeted search of the human genome with its mouse ortholog, CAM20161 [EMBL:CAM20161], which has a signal peptide. The alignment shows the N-terminal parts of LPLC4_HUMAN, CAM20161 and LPLC4_HUMAN_corrected. The predicted signal peptides of CAM20161 and LPLC4_HUMAN_corrected are in yellow and underlined.
	Figure 2. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry C209C_MOUSE. The protein contains an extracellular C-type lectin domain but was found to lack both a signal peptide and transmembrane helices, whereas all closely related proteins (e.g. C209A_MOUSE, C209D_MOUSE [Swiss-Prot:Q91ZX1, Q91ZW8]) are type II transmembrane proteins. The sequence of this protein was corrected by targeted search of mouse genomic and EST sequences. The alignment shows the N-terminal parts of C209C_MOUSE, C209C_MOUSE_corrected, C209A_MOUSE and C209D_MOUSE. The predicted transmembrane helices of C209C_MOUSE_corrected, C209A_MOUSE and C209D_MOUSE are in red and underlined.
	Figure 3. Error detected by MisPred routine for Conflict 1: the case of the Swiss-Prot entry YL15_CAEEL. The hypothetical homeobox protein C02F12.5 [EnsEMBL:C02F12.5] predicted for chromosome X contains an extracellular Kunitz_BPTI domain but was found to lack both a signal peptide and transmembrane helices. This protein, which also contains a nuclear Homeobox domain, arose through in silico fusion of a gene related to the homeobox protein HM07_CAEEL and a gene related to the Kunitz_BPTI-containing protein CBG14258, Q619J1_CAEBR. (A) Alignment of YL15_CAEEL and Q619J1_CAEBR shows close homology only in the C-terminal region, highlighted in yellow. (B) Alignment of YL15_CAEEL_corr1 and HM07_CAEEL. (C) Alignment of YL15_CAEEL_corr2 and Q619J1_CAEBR.
	Figure 4. Error detected by MisPred routine for Conflict 4: the case of the Swiss-Prot entry EPHA5_RAT. This protein contains a C-terminally truncated SAM_1 domain that deviates significantly from the normal size of this domain family. It is noteworthy that orthologs from mouse, human and chicken contain an intact SAM_1 domain. The sequence of this protein was corrected by targeted search of the rat genome using the sequences of the full-length orthologs. The alignment shows the C-terminal parts of EPHA5_RAT, EPHA5_RAT_corrected, EPHA5_MOUSE [Swiss-Prot:Q60629], EPHA5_HUMAN [Swiss-Prot:P54756] and EPHA5_CHICK [Swiss-Prot:P54755]. The region of the predicted SAM_1 domain of EPHA5_RAT_corrected that is absent in EPHA5_RAT is underlined and highlighted in yellow.
	Figure 5. Error detected by MisPred routine for Conflict 5: the case of the protein Q9NXI4_HUMAN. The cDNA of this hypothetical protein FLJ20227, cloned from colon mucosa, is derived from a chimera of two genes located on chromosome 11 and chromosome 2. The N-terminal part of the protein (underlined and highlighted in yellow) is derived from the gene encoding the PR domain zinc finger protein 10, PRD10_HUMAN (A); the C-terminal part of the protein (underlined and highlighted in blue) is derived from the gene encoding liver fatty acid-binding protein, FABPL_HUMAN (B).
	Figure 6. Error detected by MisPred routine for Conflict 2. ENSXETP00000040601 of Xenopus tropicalis corresponds to the frog ortholog of Ephrin receptor A7, but lacks a typical transmembrane helix between its extracellular FN3 and cytoplasmic Pkinase domains. The mispredicted sequence was corrected by identifying the missing transmembrane sequence using frog EST sequences such as EL820950 [GenBank:EL820950]. The alignment shows the regions containing the transmembrane helices of Gallus gallus Ephrin receptor A7 [RefSeq:NP_990414], ENSXETP00000040601 and ENSXETP00000040601_corrected. The predicted transmembrane helices of NP_990414 and ENSXETP00000040601_corrected are in red and underlined; the mispredicted region of ENSXETP00000040601 is in italics.
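
The domain-integrity check illustrated in Figure 4 (Conflict 4) flags a domain whose size deviates significantly from the normal size of its family. A minimal sketch of such a check follows. The function name `truncated_domains`, the input layout, and the 0.6 cutoff are assumptions made for illustration; the source does not state the threshold MisPred applies, and the domain names and lengths in the example are placeholders, not real family data.

```python
from typing import Dict, List, Tuple


def truncated_domains(
    hits: List[Tuple[str, int]],
    family_lengths: Dict[str, int],
    min_fraction: float = 0.6,  # assumed cutoff, not taken from the source
) -> List[str]:
    """Return the names of domain hits that cover too little of their family.

    hits           -- (domain family name, aligned length in residues) pairs
    family_lengths -- typical full length of each domain family
    min_fraction   -- smallest acceptable ratio of aligned to typical length
    """
    flagged = []
    for name, aligned_length in hits:
        expected = family_lengths[name]
        if aligned_length < min_fraction * expected:
            flagged.append(name)
    return flagged


# Placeholder values: DOMAIN_A is truncated, DOMAIN_B is nearly complete.
example_hits = [("DOMAIN_A", 30), ("DOMAIN_B", 250)]
example_lengths = {"DOMAIN_A": 64, "DOMAIN_B": 260}

print(truncated_domains(example_hits, example_lengths))  # ['DOMAIN_A']
```

A flagged domain is only a candidate error; as in the EPHA5_RAT case, comparison with full-length orthologs is what indicates whether the missing region can be recovered from the genome.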
