Deep proteomics of the Xenopus laevis egg using an mRNA-derived reference database.
Mass spectrometry-based proteomics enables the global identification and quantification of proteins and their posttranslational modifications in complex biological samples. However, proteomic analysis requires a complete and accurate reference set of proteins and is therefore largely restricted to model organisms with sequenced genomes. Here, we demonstrate the feasibility of deep genome-free proteomics by using a reference proteome derived from heterogeneous mRNA data. We identify more than 11,000 proteins with 99% confidence from the unfertilized Xenopus laevis egg and estimate protein abundance with approximately 2-fold precision. Our reference database outperforms the provisional gene models based on genomic DNA sequencing and references generated by other methods. Surprisingly, we find that many proteins in the egg lack mRNA support and that many of these proteins are found in blood or liver, suggesting that they are taken up from the blood plasma, together with yolk, during oocyte growth and maturation, potentially contributing to early embryogenesis. To facilitate proteomics in nonmodel organisms, we make our platform available as an online resource that converts heterogeneous mRNA data into a protein reference set. Thus, we demonstrate the feasibility and power of genome-free proteomics while shedding new light on embryogenesis in vertebrates.
PubMed ID: 24954049
PMC ID: PMC4090281
Article link: Curr Biol.
Grant support: P40 OD010997 NIH HHS, R01 DK077197 NIDDK NIH HHS, R01 GM103785 NIGMS NIH HHS, R01 HD073104 NICHD NIH HHS
Genes referenced: actr2 apc fbxo43 kif22 mapre1 mos npm1 pcna piwil1 ran topbp1
Article Images:
|Figure 1. MS Data Can Be Used to Evaluate Relative Reference Database Quality. Spectra from a tryptic digest of yeast lysate were searched against the standard yeast protein database (Full DB). Shown are the numbers of total peptide spectral matches (PSMs, blue), unique peptides (orange), and proteins (black) that were confidently identified. To simulate poor reference databases, we removed half (Half DB) or three-quarters (Quarter DB) of the proteins from the reference database. The numbers of identified PSMs and unique peptides scale approximately with the number of proteins in the database. To test how the addition of nonsense sequences would affect the number of identified peptides, we added randomized human proteins to the full yeast database (Full DB + Nonsense). The numbers of peptides and proteins are negatively affected. To simulate a reference database in which proteins are fragmented, we divided every protein in the reference into two proteins at a random position. Whereas the number of identified peptides slightly decreases, the number of identified proteins substantially increases.|
|Figure 2. Overview of the Steps for Constructing the High-Quality Protein Reference Set PHROG. Transcripts from four different sources were combined, trimmed and cleaned using SeqClean, masked using RepeatMasker, and clustered and assembled using TGICL/CAP3. The assembled transcripts were aligned against a collection of model vertebrate proteins using BLASTX. The results were used for identifying the correct translation frame, for frameshift correction (if appropriate), and for removing sequences without significant similarity to known proteins. After translation using BioPerl, the longest peptide for each protein was identified, and the ends were trimmed to match tryptic peptides. The collection was processed to remove 100% redundant proteins using CD-HIT, and gene symbols were assigned to the remaining members using the reciprocal or single best BLAST hit against human proteins. The numbers indicate the numbers of transcripts or proteins in each group.|
|Figure 3. Comparison of Protein Reference Databases for the Fractionated X. laevis Egg Sample. (A) Number of unique peptides identified at 0.5% FDR at the peptide level. As a reference database, PHROG significantly outperforms the publicly available proteins from Xenbase and even the preliminary gene models from the 7.0 genome assembly. (B) Comparison of the number of proteins identified in the egg, with additional filtering to 1% FDR at the protein level and maximal parsimony.|
|Figure 4. Estimation of Protein Abundance in the Xenopus Egg. (A) Previously published protein concentrations for 49 proteins versus measured ion current in the MS1 spectrum normalized to protein length. The Pearson correlation is 0.92. On average, the predicted protein concentration is approximately 2-fold different from the reported protein concentration. (B) Histogram of concentration for all identified proteins regressed from normalized MS1 ion current. Median concentration of measured proteins is approximately 30 nM. (C) Estimated concentration for subunits of stable complexes is similar. For the APC/C, we additionally distinguished between subunits that were reported to be dimeric (square) or monomeric (triangle) within the complex. Although our accuracy is not good enough to separate the two populations, the estimated concentrations for dimeric subunits tend to be higher than those for monomeric subunits. (D) Concentrations for enzymes of a metabolic pathway can vary widely. For each metabolic pathway, the predicted concentrations of its members are plotted (based on the Kyoto Encyclopedia of Genes and Genomes).|
|Figure 5. mRNA and Protein Abundance. (A) Histogram of mRNA levels in the egg. mRNA for which the protein was also detected is colored blue. Orange indicates that only mRNA was detected. The median mRNA concentration is approximately 1,000-fold lower than the median protein abundance. Although we see only a weak correlation between mRNA and protein abundance (0.32 Pearson correlation), the lower the mRNA concentration, the less likely we are to detect the corresponding protein. (B) mRNA and protein were matched via assigned gene symbols. MS is able to identify approximately 60% of all gene symbols for which we could detect mRNA. The proteins that we cannot detect via MS are enriched for transcription factors, proteins involved in differentiation, and transmembrane proteins. Conversely, for approximately 350 gene symbols, we could identify only proteins, but not mRNA. This group is highly enriched for blood plasma and liver proteins, which were likely endocytosed during oocyte growth.|
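
The database perturbations described in the Figure 1 caption (removing proteins, adding nonsense sequences, fragmenting proteins) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the database is assumed to be a plain dictionary of name to sequence, and "randomized" is assumed here to mean shuffling the residues of each foreign protein.

```python
import random


def halve_database(proteins, rng):
    """Keep a random half of the entries (the 'Half DB' idea)."""
    names = sorted(proteins)
    kept = rng.sample(names, len(names) // 2)
    return {name: proteins[name] for name in kept}


def fragment_database(proteins, rng):
    """Split every protein at a random internal position into two entries."""
    fragmented = {}
    for name, seq in proteins.items():
        if len(seq) < 2:
            # Too short to split; keep the entry unchanged.
            fragmented[name] = seq
            continue
        cut = rng.randint(1, len(seq) - 1)  # both pieces are non-empty
        fragmented[name + "_a"] = seq[:cut]
        fragmented[name + "_b"] = seq[cut:]
    return fragmented


def add_nonsense(proteins, foreign, rng):
    """Append residue-shuffled foreign sequences (assumed randomization)."""
    combined = dict(proteins)
    for name, seq in foreign.items():
        residues = list(seq)
        rng.shuffle(residues)
        combined["nonsense_" + name] = "".join(residues)
    return combined
```

Passing a seeded `random.Random` instance makes each perturbed database reproducible, so the same spectra can be searched against each variant and the identification counts compared.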
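
The abundance estimate in Figure 4A (published concentrations versus length-normalized MS1 ion current) can be illustrated with a simple calibration fit. The sketch below assumes an ordinary least-squares fit in log10 space, which the caption does not specify; all function names are hypothetical, and it does not reproduce the authors' actual regression.

```python
import math


def normalize_signal(ion_current, protein_length):
    """MS1 ion current normalized to protein length, as in the caption."""
    return ion_current / protein_length


def fit_log_log(signals, concentrations):
    """Least-squares line through (log10 signal, log10 concentration)."""
    xs = [math.log10(s) for s in signals]
    ys = [math.log10(c) for c in concentrations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_concentration(signal, slope, intercept):
    """Map a normalized signal to a concentration using the fitted line."""
    return 10 ** (intercept + slope * math.log10(signal))


def mean_fold_error(signals, concentrations, slope, intercept):
    """Average fold difference between predicted and reported values."""
    log_errors = [
        abs(math.log10(predict_concentration(s, slope, intercept)) - math.log10(c))
        for s, c in zip(signals, concentrations)
    ]
    return 10 ** (sum(log_errors) / len(log_errors))
```

In this framing, a `mean_fold_error` near 2 on the calibration proteins would correspond to the "approximately 2-fold" agreement reported in the caption, and `predict_concentration` would then be applied to every identified protein to produce a histogram like Figure 4B.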