BMC Bioinformatics
2017 Jun 02;18(1):288. doi: 10.1186/s12859-017-1686-9.
Simple adjustment of the sequence weight algorithm remarkably enhances PSI-BLAST performance.
Oda T, Lim K, Tomii K.
Abstract
BACKGROUND: PSI-BLAST, an extremely popular tool for sequence similarity search, uses a position-specific scoring matrix (PSSM) constructed from a multiple sequence alignment (MSA). A PSSM allows the detection of more distant homologs than a general amino acid substitution matrix does. Accurate estimation of the weights for the sequences in an MSA is crucial for PSSM construction. PSI-BLAST divides a given MSA into multiple blocks, for which sequence weights are calculated. When a block becomes very narrow, the weights are estimated from very few alignment columns and the calculation can give odd results.
RESULTS: We demonstrate that PSI-BLAST indeed generates a significant fraction of blocks with a width of less than 5, which degrades its performance. We revised the PSI-BLAST code to prevent blocks from becoming narrower than a given minimum block width (MBW). We call the modified program PSI-BLASTexB. When MBW is 25, PSI-BLASTexB consistently and notably outperforms PSI-BLAST on three independent benchmark sets. The performance boost is even more drastic when an MSA, instead of a single sequence, is used as the query.
CONCLUSIONS: Our results demonstrate that the generation of narrow blocks during the sequence weight calculation is a critically important factor that restricts PSI-BLAST search performance. By preventing narrow blocks, PSI-BLASTexB improves PSI-BLAST performance remarkably. Binaries and source code of PSI-BLASTexB (MBW = 25) are available at https://github.com/kyungtaekLIM/PSI-BLASTexB.
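The idea of enforcing a minimum block width can be sketched in a few lines of Python. This is only an illustration of the concept, not the PSI-BLASTexB source: the function name `extend_block`, the symmetric extension around the original block, and the clamping at the alignment ends are all assumptions made here; the authors' actual procedure is described in the paper's Methods and implemented in their repository.

```python
def extend_block(start, end, aln_len, mbw=25):
    """Widen the block [start, end] (0-based, inclusive) to at least mbw columns.

    Hypothetical helper: extends symmetrically and shifts the window back
    inside the alignment when it would run past either end.
    """
    width = end - start + 1
    if width >= mbw:
        return start, end          # block is already wide enough

    target = min(mbw, aln_len)     # cannot be wider than the alignment itself
    missing = target - width
    left = missing // 2            # columns added on the left
    new_start = start - left
    new_end = end + (missing - left)

    if new_start < 0:              # ran off the left edge: shift right
        new_end -= new_start
        new_start = 0
    if new_end > aln_len - 1:      # ran off the right edge: shift left
        new_start -= new_end - (aln_len - 1)
        new_end = aln_len - 1
    return max(new_start, 0), new_end


# A one-column block at position 22 of a 60-column alignment
print(extend_block(22, 22, 60, mbw=25))
```

With `mbw=1` the function returns every block unchanged, which corresponds to the unmodified behaviour that the figure legends label as MBW = 1.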
Fig. 1. Examples showing the sequence weight calculation of PSI-BLAST and PSI-BLASTexB. a Sequence weights (shown on the right side) of all positions in the MSA were calculated from a single block covering the whole alignment. b PSI-BLAST divided the MSA into three blocks (blue, orange, and green) and calculated sequence weights for each block. Sequence weights calculated from the blocks are shown on the right side in the same colors. For the orange block, which is one aa long, PSI-BLASTexB extends the block so that its width becomes MBW (red block). Weights calculated from the red block are also shown. See Methods for detailed procedures. “seq7” has no amino acid at position 23. For that reason, the sequence weights of the other sequences at that position are calculated ignoring “seq7”.
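To make the per-block weight calculation concrete, the sketch below computes position-based sequence weights in the style of Henikoff and Henikoff (listed in the references) over a chosen range of columns. It is a simplified illustration under stated assumptions: the function name `block_weights` is hypothetical, sequences with a gap in a column are simply skipped for that column, and PSI-BLAST's own handling of gaps and of which sequences belong to a block differs in detail.

```python
from collections import Counter


def block_weights(msa, start, end, gap="-"):
    """Henikoff-style position-based weights over columns start..end (inclusive).

    msa is a list of equal-length aligned strings. Each column contributes
    1 / (k * n) to a sequence, where k is the number of distinct residues in
    the column and n is the number of sequences sharing this sequence's residue.
    """
    raw = [0.0] * len(msa)
    for col in range(start, end + 1):
        counts = Counter(seq[col] for seq in msa if seq[col] != gap)
        if not counts:
            continue                      # column contains only gaps
        k = len(counts)
        for i, seq in enumerate(msa):
            if seq[col] != gap:           # assumption: gapped sequences are skipped
                raw[i] += 1.0 / (k * counts[seq[col]])
    total = sum(raw)
    return [w / total for w in raw] if total else raw


msa = ["AAC", "AAD", "AGC"]
print(block_weights(msa, 0, 2))   # weights from the whole toy alignment
print(block_weights(msa, 1, 1))   # weights from a one-column block
```

In the one-column case the weights depend entirely on the residues that happen to occupy that single column, which is the kind of situation a minimum block width is meant to avoid.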
Fig. 2. Distributions of block widths used for PSSW calculation with varying numbers of iterations. Results of searches against SCOP20_training, SCOP20_validation, and CATH20-SCOP are presented in (a), (b), and (c), respectively. Searches that converged before the eighth iteration were not used. Numbers of sequences (and blocks) used in (a), (b), and (c) are, respectively, 2502 (468286), 2473 (459375), and 1009 (135176). Numbers of searches that had not converged before each iteration are provided in Additional file 4: Table S1.
Fig. 3. ROC curves of PSI-BLAST and PSI-BLASTexB. a ROC curves of PSI-BLAST (MBW = 1) and PSI-BLASTexB (MBW = 5, 13, 25, or 41) at the fifth iteration against SCOP20_training. b ROC curves for searches with different numbers of iterations against SCOP20_validation. Thin, medium, and thick lines show the second, third, and fifth iterations, respectively. c ROC curves of PSI-BLAST and PSI-BLASTexB at the fifth iteration against CATH20-SCOP. Black lines represent an FDR of 10%.
Fig. 4. Relations between the ROC5 score improvement and the fraction of narrow blocks. The X-axis shows (number of one-aa-long blocks during PSSM construction)/(length of the query). The Y-axis shows the ROC5 score improvement of PSI-BLASTexB relative to PSI-BLAST. Each dot represents the result of a single query. Queries with only one TP hit (the self-hit) were excluded. Results for SCOP20_training (2752 queries), SCOP20_validation (2752 queries), and CATH20-SCOP (858 queries) at the second iteration are presented in a, b, and c, respectively.
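For readers unfamiliar with the ROC5 score compared in Fig. 4, the sketch below computes a ROCn score for one query from a ranked hit list. It assumes the commonly used definition following Gribskov and Robinson (listed in the references): the number of true positives ranked above each of the first n false positives, summed and divided by n times the total number of true positives. The function name `roc_n`, the treatment of lists with fewer than n false positives, and the choice of denominator are assumptions; the paper's exact evaluation protocol may differ.

```python
def roc_n(labels, n=5, total_tp=None):
    """ROCn score for one query.

    labels: hits sorted best-first, True for a true positive (TP),
            False for a false positive (FP).
    total_tp: number of TPs that could be found; defaults to the TPs in labels.
    """
    if total_tp is None:
        total_tp = sum(labels)
    if total_tp == 0:
        return 0.0

    tp_seen = 0
    fp_seen = 0
    area = 0
    for is_tp in labels:
        if is_tp:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen        # TPs ranked above this FP
            if fp_seen == n:
                break
    # Assumption: if the list holds fewer than n FPs, the missing FPs are
    # treated as ranking below every listed hit.
    area += (n - fp_seen) * tp_seen
    return area / (n * total_tp)


hits = [True, True, False, True, False]
print(roc_n(hits, n=2))
```

Computing this score for the same query with both programs and comparing the two values gives one point of the kind plotted in Fig. 4.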
Fig. 5. ROC curves with the “-in_msa” option of PSI-BLAST and PSI-BLASTexB against SCOP20_validation. Thick and thin lines show ROC curves at the fifth and third iterations, respectively. The black straight line shows an FDR of 10%.
Fig. 6. Schematic representation of narrow-width block generation by homologous over-extension (HOE). We performed a PSI-BLAST search [22] at the NCBI website against the UniProtKB/Swiss-Prot database [24], using as the query a protein sequence (UniProtKB [23] accession number: Q5JHS2) that contains two conserved domains (Pfam [10] IDs: PF13419 and PF00535). A hit (UniProtKB accession number: Q9S586) consisting of a single-domain protein (Pfam ID: PF13419) with HOE (white boxes) overlaps with another hit (UniProtKB accession number: A2E3C6) matched only to the PF00535 domain, resulting in a 3 aa-long block (gray bar).
Altschul, PSI-BLAST pseudocounts and the minimum description length principle. 2009.
Altschul, Basic local alignment search tool. 1990.
Altschul, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. 1997.
Angermüller, Discriminative modelling of context-specific amino acid substitution probabilities. 2012.
Aspnäs, Code optimization of the subroutine to remove near identical matches in the sequence database homology search tool PSI-BLAST. 2010.
Biegert, Sequence context-specific profiles for homology searching. 2009.
Boratyn, BLAST: a more efficient report with usability improvements. 2013.
Boutet, UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. 2016.
Camacho, BLAST+: architecture and applications. 2009.
Finn, The Pfam protein families database: towards a more sustainable future. 2016.
Fox, SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. 2014.
Gonzalez, Homologous over-extension: a challenge for iterative similarity searches. 2010.
Gough, Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. 2001.
Gribskov, Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. 1996.
Henikoff, Amino acid substitution matrices from protein blocks. 1992.
Henikoff, Position-based sequence weights. 1994.
Katoh, MAFFT multiple sequence alignment software version 7: improvements in performance and usability. 2013.
Li, PSI-Search: iterative HOE-reduced profile SSEARCH searching. 2012.
Pundir, UniProt Protein Knowledgebase. 2017.
Remmert, HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. 2011.
Schäffer, Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. 2001.
Sillitoe, CATH: comprehensive structural and functional annotations for genome sequences. 2015.
Suzek, UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. 2015.
Yamada, Revisiting amino acid substitution matrices for identifying distantly related proteins. 2014.