After the unexpected presidential election result, many political pundits were quick to criticize prediction models for being “wrong” because almost all calculated probabilities for the eventual winner were well below 50%. In fact, Nate Silver’s estimate of ~30% sounds about right to me given the tiny margin of victory in several necessary states. Nevertheless, the concept of abstracting the electoral outcome as one flip of a loaded coin raises these questions: (1) What exactly does a probability represent? and (2) How is it computed?
In a nutshell, estimated probabilities of non-trivial predictions are best viewed as mental tools for human observers to quantify ambiguity by incorporating meta-data. This is because direct data may be incomplete or unavailable. (Gamblers would love to get detailed physiological data from race horses before placing a bet, but they make do with historical performance etc. that is meta to that day’s race.) As such, probability models are inherently imprecise and subjective.
In other words, the moment any hard data (delta-mass, peak-counts) are converted into a probability, information is irreversibly lost while modeling artifacts are introduced. For example, choosing between binomial and hypergeometric distributions, which can yield near-identical probabilities, increases informational entropy compared to not using a probability model at all.
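As a rough illustration of how two different models can give near-identical numbers from the same counts, the sketch below evaluates both probability mass functions with Python's standard library. The population size, match count, and draw size are made-up values chosen only for illustration, and the function names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent draws, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeometric_pmf(k, n, K, N):
    """P(exactly k successes in n draws without replacement
    from a population of N items containing K successes)."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Hypothetical numbers: 10,000 items, 2,000 of them "successes", 20 draws, 5 hits.
N, K, n, k = 10_000, 2_000, 20, 5

p_binomial = binomial_pmf(k, n, K / N)
p_hypergeometric = hypergeometric_pmf(k, n, K, N)

print(f"binomial:       {p_binomial:.6f}")
print(f"hypergeometric: {p_hypergeometric:.6f}")
print(f"difference:     {abs(p_binomial - p_hypergeometric):.2e}")
```

When the number of draws is small relative to the population, the two results differ only slightly, so the reported probability alone does not reveal which model produced it.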
For mass spectrometry data, where the inter-relation of charge/mass peaks clearly violates the “IID” (independent and identically distributed) assumption fundamental to probability theory, probability estimation serves mainly to ease human interpretation, not to uncover intrinsic data patterns. Therefore, robust and precise analysis requires using probability models as late as practical in a data analysis workflow. For mainstream search-engine-based proteomics workflows, this means probability models should only be used after the peptide identification step. Any earlier and peptide identification becomes less precise.
Like many disruptive technologies, current first-generation “Proteomics 1.0” is built on academic research proofs-of-concept that excel in innovation but lack robustness. The technology is capable of very cool things but requires specialized expertise, and even then mileage may vary.
To make clinical impact, however, “Proteomics 2.0” needs to become boringly dependable (robust), useful (sensitive enough for lower abundance peptides/proteins), and verifiable (important IDs and quantitation can be traced back to raw data if needed).
No conventional workflow offers the trifecta of being robust, sensitive, and verifiable. The closest for peptide ID remains Sequest+target/decoy+TPP, still the standard in many statistics-savvy labs and part of SORCERER(tm) for a decade, which is robust and verifiable but not sensitive. No quantitation workflow can be considered robust in our view.
The SorcererScore(tm) methodology is built from the ground up specifically to provide ID and quantitation with the full trifecta — robust, sensitive, and verifiable — for “Precision Proteomics 2.0”, starting with precision peptide ID and SILAC quantitation. Luckily we were able to build on the existing robust and verifiable foundation.
The secret? For IDs, we make the post-search filter (i.e. PeptideProphet) rigorously precise by using minimally processed raw mass/charge data, instead of subjective derived parameters like search scores and probabilities.
For quantitation, SorcererScore starts by data-mining intact-peptide mass/charge data (“MS1”) to extract peptide quantitation information anonymously. These are then cross-matched to peptide IDs derived from fragment data (“MS2”), thereby catching errors from either quantitation or identification to provide robust results. In contrast, research prototype workflows typically use peptide IDs to drive quantitation, which is simpler to implement but propagates errors instead of attenuating them.
A deeper understanding of probability and statistics has become a requirement as bio-science becomes an information science. As noted earlier, proteomics is an atypical science in that data is used both to auto-generate hypotheses (peptide IDs) and to auto-test these same hypotheses. So the “science” (i.e. creating and testing hypotheses) is necessarily done through bioinformatics with computers. Therefore it behooves researchers to competently judge the quality of semi-automated analyses, as opposed to outsourcing all the science to automation or being fooled by faulty analytics.
What a probability represents
Say we estimate two probabilities at 50%: (1) A fair coin toss comes up Heads. (2) My team wins the Super Bowl.
Though both are probabilities, they represent different concepts. The first characterizes the “object” (coin) while the second characterizes the “subject” (observer), or more precisely, the subject’s interpretation of the object (Super Bowl). Notably, the latter is a subjective prediction because each observer can have a different interpretation of the same object.
In other words, a probability can in principle be 100% objective, 100% subjective, or something in between. In practice, except for simple cases like a coin toss, real-world calculated probabilities are mostly subjective, model-dependent, and reflect the biases of the observer/modeler.
This inherent subjectivity means software users should not blindly accept proteomic probabilities without knowing what they represent. In our analysis, the differing design and use of subjective probability models, including flawed estimation of error statistics, is one of the main causes of the irreproducibility that plagues conventional proteomics.
Different ways to compute a “probability”
Different “probabilities” are computed differently and require widely differing effort. In proteomics we find conventional true/false probabilities (e.g. PeptideProphet), p-values (calculated theoretical probability of surpassing a threshold randomly), and Bayesian probabilities (a “degree of confidence” score). Only the first are considered true probabilities. Conceptually it may be useful to think of ‘P’ as a true probability if (1-P) is a meaningful probability of the opposite. This would exclude both p-values and Bayesian probabilities.
To estimate the empirical probability of a loaded coin, we can flip it 10 times for a first robust estimate. This may be an acceptable trade-off between precision and effort.
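However, precision improves only with the square root of the number of flips, because the standard error of the estimate is proportional to 1/sqrt(n). So 10x more precision takes 100x more flips (1,000 instead of 10), and 100x more precision takes 10,000x more (100,000). Simple but tedious, requiring orders of magnitude more effort. Fancy math won’t help if precision is needed, but automation would.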
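To make the effort-versus-precision trade-off concrete, here is a minimal simulation sketch. It assumes the standard binomial model, in which the standard error of the estimated Heads probability is sqrt(p(1-p)/n). The coin's bias of 0.6, the seed, and the function names are hypothetical choices for illustration only.

```python
import math
import random

def standard_error(p, n):
    """Standard error of the estimated Heads probability after n flips
    of a coin whose true Heads probability is p."""
    return math.sqrt(p * (1.0 - p) / n)

def estimate_heads_probability(n_flips, true_p, rng):
    """Flip a simulated loaded coin n_flips times and return the fraction of Heads."""
    heads = sum(1 for _ in range(n_flips) if rng.random() < true_p)
    return heads / n_flips

rng = random.Random(42)  # fixed seed so repeated runs give the same flips
TRUE_P = 0.6             # hypothetical bias, unknown to the "observer"

for n in (10, 1_000, 100_000):
    estimate = estimate_heads_probability(n, TRUE_P, rng)
    print(f"{n:>7} flips: estimate = {estimate:.4f}, "
          f"standard error ~ {standard_error(TRUE_P, n):.4f}")
```

Each 100x increase in flips shrinks the standard error by a factor of 10, so the extra precision has to be paid for in brute-force effort.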
Going the opposite way, one can hack malleable Bayesian statistics to reduce effort. Wikipedia explains Bayesian probabilities of a prediction as a “degree of confidence” of an “orderly opinion” to be incrementally revised by data. As an extreme illustration, we can guesstimate some prior probability, say, p1 = 0.5678, then incorporate one coin flip (a Head, scored as 1.0) by simple averaging to derive the posterior probability of p2 = (0.5678 + 1.0) / 2 = 0.7839. Just one flip and done! But you get what you pay for.
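The averaging in this illustration happens to match a textbook Beta-Bernoulli update in which the prior carries the weight of exactly one pseudo-flip. That reading is our assumption, not something stated above, and the `prior_weight` parameter and function name below are hypothetical. The sketch is an illustration only, not a description of any particular proteomics tool.

```python
def posterior_mean(prior_mean, prior_weight, heads, flips):
    """Posterior mean of the Heads probability under a Beta prior.

    The prior is Beta(a, b) with a = prior_mean * prior_weight and
    b = (1 - prior_mean) * prior_weight, so prior_weight acts like the
    number of imaginary flips the prior opinion is worth.
    """
    a = prior_mean * prior_weight + heads
    b = (1.0 - prior_mean) * prior_weight + (flips - heads)
    return a / (a + b)

# Prior worth one pseudo-flip, then one real flip that lands Heads:
p2 = posterior_mean(0.5678, prior_weight=1.0, heads=1, flips=1)
print(f"p2 = {p2:.4f}")

# The same single flip moves a heavier prior much less:
p2_heavy = posterior_mean(0.5678, prior_weight=10.0, heads=1, flips=1)
print(f"p2 with prior_weight=10: {p2_heavy:.4f}")
```

With so little data, the result is driven almost entirely by the chosen prior mean and prior weight, which is where the modeler's subjectivity enters.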
To be sure, Bayesian analysis was a major advance in statistical thinking. However, its flexibility and complexity make models both tricky for the author to get right and difficult for users to analyze. As well, Bayesian “degrees of confidence” scores are too easily confused with true probabilities, leading to subtly wrong answers. The result is that many who publish such models may not fully understand them, while users find them difficult to validate.
To summarize, it can take 100x to 1000x more effort to derive precise truths than informed opinions. This is why a high-end SORCERER can crunch for days while a PC program takes minutes on the same data. And why SORCERER results are often boringly robust.
In practice, the “prior probability” can serve as a fudge factor to tune a Bayesian model to over-fit specific datasets for publication and common benchmarks. Such software is expected to excel in specific situations but yield subpar results otherwise. This explains many labs’ observation that the same software can unpredictably produce good or semi-random results.
Simple test for robustness
The above arguments may seem theoretical, but there’s nothing abstract about sending flawed results to your research customers.
To be sure, run this simple test on your software: Search any dataset against a randomized protein sequence database. (If you have a SORCERER Storm account, you can download one from there. Or email Sales@SageNResearch.com.)
Robust software would report no IDs; some (like TPP) may even error out due to a failure to find any IDs. This is the correct behavior.
In contrast, non-robust software would report many IDs at low error rate, which indicates a flaky methodology and faulty error estimate.
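If you prefer to build a randomized database yourself, one simple approach is to shuffle the residues within each protein sequence of an existing FASTA file. The sketch below does that. It is only one possible construction (this article does not specify how the downloadable randomized database is made), and the function names and the `RANDOM_` header prefix are hypothetical.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def randomize_sequence(sequence, rng):
    """Return the same residues in a random order (length and composition preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def randomize_fasta(lines, seed=0, prefix="RANDOM_"):
    """Return FASTA lines in which every protein sequence has been shuffled."""
    rng = random.Random(seed)
    output = []
    for header, sequence in read_fasta(lines):
        output.append(f">{prefix}{header}")
        output.append(randomize_sequence(sequence, rng))
    return output

# Tiny made-up example; real use would read the lines from a FASTA file.
example = [">sp|TEST1 demo", "MKTAYIAKQR", "QISFVKSHFS", ">sp|TEST2 demo", "ACDEFGHIKL"]
for line in randomize_fasta(example, seed=1):
    print(line)
```

Searching your dataset against the shuffled file, with otherwise unchanged settings, gives the robustness check described above.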
Jump into Precision Proteomics with SORCERER Storm
Modern proteomics should be a robust, precision analytical tool for cell biology research. Instead, it is hampered by imprecise or non-robust software. Sage-N Research set out to fix this once and for all.
The root problem is that many scientists know little about their primary data and become overly dependent on third party software. The ultimate solution is not better push-button software (canned software only works for experiments someone has done), but a better data platform that allows scientists to interact with their data. And to rapidly prototype custom algorithms. Basically an “Excel on steroids”.
Believe it or not, SorcererScore solved the precision peptide ID problem. And precision SILAC quantitation is in internal testing.
Both capabilities are scripts rapidly developed and optimized on our GEMYNI software platform. The same platform scales from the entry-level SORCERER Storm cloud account to our high-end SORCERER Pro physical system.
Believe it or not, the Precision Proteomics revolution is here!
To get a jump on the new high-precision paradigm, try our SORCERER Storm (visit http://www.SageNResearch.com/). Or email Terri at Sales@SageNResearch.com for information.