How 99.99% Confidence IDs Enable Clinical Proteomics
by David.Chiang and Pat.Chu (at) SageNResearch.com
Clinical proteomics is surprisingly simple if we think about it cleanly:
Identify individual peptides — particularly low-abundance peptides — with extreme precision, then use protein sequence info to infer protein IDs and relative quantities.
That’s the revolutionary idea behind SorcererScore(tm) [Chiang, 2016] — simple in concept but computationally difficult.
The clinical benefit is clear from Figures 1 and 2 for SorcererScore: the number of high-confidence protein and peptide IDs stays relatively stable across confidence thresholds (99% to 99.99%), which makes experimental validation practical. Because every stage uses simple arithmetic, any protein ID and quantitation can be readily traced to its constituent peptide IDs and quantitations, and in turn back to the raw data, so critical results can be manually verified against the raw spectra.
Figure 1: #Protein IDs vs. Min PSM Probability with and without SorcererScore
(Analyzed using ScaffoldBatch on SORCERER for one typical anonymous dataset, allowing single-peptide protein ID.)
Figure 2: #PSMs vs. Min PSM Probability with and without SorcererScore
(Analyzed using ScaffoldBatch on SORCERER for one typical anonymous dataset.)
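As a rough illustration of the simple-arithmetic traceability described above, here is a small sketch of our own in Python. It is not production SorcererScore code, and the field names (scan_id, intensity, probability) are assumptions: it simply sums high-confidence peptide quantities into a protein quantity while retaining the scan identifiers needed to trace every term back to a raw spectrum.

    # Illustrative sketch only; field names are assumptions, not the
    # SorcererScore data model.
    from dataclasses import dataclass

    @dataclass
    class PSM:
        scan_id: int        # pointer back to the raw spectrum
        peptide: str        # matched peptide sequence
        intensity: float    # peptide-level quantitation value
        probability: float  # PSM confidence, e.g. 0.9999

    def protein_quantity(psms, min_prob=0.99):
        """Sum peptide quantities from high-confidence PSMs only."""
        accepted = [p for p in psms if p.probability >= min_prob]
        total = sum(p.intensity for p in accepted)
        # Every term in the sum carries a scan_id, so the protein-level number
        # can be traced back to individual spectra for manual verification.
        audit_trail = [(p.scan_id, p.peptide, p.intensity) for p in accepted]
        return total, audit_trail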
Of course, there can be ambiguous PSMs (<99% confidence), akin to blurry spots on an otherwise clean X-ray, that may require semi-manual interpretation. The key is that any distortions arise from the data itself, not from complex models or meta-data.
In contrast, first-generation “inside-out” data analysis starts by filtering search results with population-wide statistical models (FDR) and lots of meta-data (search scores, peptide attributes) instead of with only hard delta-mass data. This means precision for individual spectra is lost at the very first filtering step. In the “Standard” (without SorcererScore) curves of Figures 1 and 2, it is unclear what cutoff should be used as a confidence threshold to drive experiments.
Here we explain the two counterintuitive innovations in SorcererScore that make Precision Proteomics possible.
Filter PSMs with only hard data for robust reproducible IDs
Before search engines were invented, every mass spectrometry researcher understood that a candidate peptide ID (formally, a peptide-spectrum match or PSM) is likely correct if and only if it has:
1. Small intact peptide ion delta-mass (‘dmass’)
2. Small average fragment ion delta-mass (‘dfragm’)
3. Many matched fragment ions (‘peakcount’)
Clearly any PSM from any search engine can be distilled to these three essential physical parameters, which can be mapped to a point in 3D space.
For mathematically well-behaved search engines like the Yates-Eng XCorr (Eng et al., 1994), rooted in linear algebra, these points form well-defined clusters of “correct” and “random” PSMs. This spatial clustering can be exploited to separate the two sub-populations.
In a nutshell, SorcererScore maps each Yates-Eng XCorr PSM into 4D space (adding one parameter), where “correct” and “random” PSM clusters are separated by a hyper-plane.
The S-Score for each PSM is then defined as its distance to the separating hyper-plane.
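To make the geometry concrete, here is a minimal sketch of the idea as described above. It is not the SorcererScore implementation: the fourth feature (XCorr here) and the least-squares discriminant are assumptions chosen purely for illustration.

    # Hedged sketch: score a PSM by its signed distance to a hyperplane that
    # separates "correct" from "random" PSM clusters in feature space.
    # The fourth feature and the fitting method are assumptions.
    import numpy as np

    def psm_features(psm):
        """Map a PSM to a point: |dmass|, |dfragm|, peakcount, plus one more parameter."""
        return np.array([abs(psm["dmass"]), abs(psm["dfragm"]),
                         psm["peakcount"], psm["xcorr"]])

    def fit_hyperplane(correct_pts, random_pts):
        """Fit a separating hyperplane (w, b) between two labelled clusters of
        feature vectors, using a plain least-squares linear discriminant."""
        X = np.vstack([correct_pts, random_pts])
        y = np.concatenate([np.ones(len(correct_pts)), -np.ones(len(random_pts))])
        Xb = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
        coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        w, b = coef[:-1], coef[-1]
        norm = np.linalg.norm(w)
        return w / norm, b / norm

    def s_score_like(psm, w, b):
        """Signed perpendicular distance of the PSM's feature point to the hyperplane."""
        return float(np.dot(w, psm_features(psm)) + b)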
In contrast to traditional approaches based on statistical models using meta-data, our geometry-based approach using hard physical data is inherently as precise, robust, and reproducible as the raw mass spec data.
This is obvious in hindsight but counterintuitive to long-time bio-science researchers.
Traditional bio-science is “data-scarce” and requires additional assumptions and soft meta-data to reach conclusions.
In contrast, mass spectrometry is more tech-like in being “data-rich”, where the problem is conflicting and/or ambiguous data from noise, not from insufficient data. As such we need math and computers to filter out noise, not complex models to add meta-data, to reach conclusions.
“Purify” PSMs with ultra-wide mass-tolerant search
Chemical purification requires sufficient solvent to remove impurities. In digital purification, mathematical transformations applied by powerful computers are used to clean large data-sets. For example, a satellite image may be cleaned by filtering out high-frequency noise using the Fast Fourier Transform.
Therefore, CPU cycles may be viewed as “digital solvent” for data purification. The dirtier the data, the more CPU cycles needed.
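To make the FFT example concrete, here is a tiny one-dimensional sketch in NumPy (a satellite image would be handled the same way in two dimensions):

    # Toy "digital purification": discard high-frequency components of a noisy
    # signal with the Fast Fourier Transform, then invert the transform.
    import numpy as np

    def fft_denoise(signal, keep_fraction=0.1):
        spectrum = np.fft.rfft(signal)
        cutoff = int(len(spectrum) * keep_fraction)
        spectrum[cutoff:] = 0.0          # remove high-frequency "impurities"
        return np.fft.irfft(spectrum, n=len(signal))

    # Example: a slow sine wave buried in random noise.
    t = np.linspace(0.0, 1.0, 1024)
    noisy = np.sin(2 * np.pi * 3 * t) + 0.5 * np.random.randn(t.size)
    clean = fft_denoise(noisy)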
As noted above, high-confidence peptide ID is the Achilles heel of clinical proteomics. Separating out random-ID “impurities” using only hard data means relying only on the delta-masses derived from intact and fragment mass/charge measurements.
There is only one model-free way to increase the power of delta-mass: increase the search mass tolerance then filter with a narrow delta-mass. Like dipping a dirty spoon into a bathtub vs. a glass, the extra solvent spreads random IDs over a wider range.
Mathematically, widening the tolerance increases the average delta-mass of random PSMs, so a subsequent narrow delta-mass filter eliminates them.
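The two-step idea can be sketched as follows. The candidate structure and field names are made up for illustration, and a real pipeline would also retain candidates whose delta-mass matches a known modification:

    # Sketch: search with a very wide precursor tolerance, then keep only PSMs
    # whose intact delta-mass falls inside a narrow window.  Random matches
    # spread across the wide window, so few of them survive the narrow filter.

    def wide_then_narrow(spectrum_mass, candidates,
                         wide_tol_da=500.0, narrow_tol_ppm=10.0):
        hits = [c for c in candidates
                if abs(spectrum_mass - c["peptide_mass"]) <= wide_tol_da]
        keep = []
        for psm in hits:
            dmass_ppm = 1e6 * (spectrum_mass - psm["peptide_mass"]) / psm["peptide_mass"]
            if abs(dmass_ppm) <= narrow_tol_ppm:
                keep.append(psm)
        return keep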
This simple idea may be counterintuitive to bio-scientists used to data-scarcity who focus on the “signal”. The data-rich paradigm focuses instead on the “noise” in order to eliminate it, not the signal to preserve it.
Therefore, labs restricted to using a PC instead of a powerful SORCERER iDA to pattern-search large dirty datasets are arbitrarily restricting the digital solvent to maybe 1/10th the needed amount. This works fine for cleaning simple noise-free data but not real-world clinical data.
How analysis software quality can vary widely
Experienced researchers know that different first-generation proteomics software packages report different IDs even at a nominal 1% FDR. This means the actual error may be much higher than 1%, and can exceed 10% for overly aggressive software.
How can this happen?
First, the false-discovery rate (FDR) is itself a statistical estimate with a potentially large error bar. This means that, if we keep re-sampling FDR under slightly different conditions, we will eventually get a way-off estimate close to zero. If we iterate specifically to optimize FDR, as some popular proof-of-concept software does, we are in effect automating “p-hacking”: tweaking statistics (e.g. the p-value) to make results look artificially better, a la Mark Twain’s “lies, damned lies, and statistics”.
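A toy simulation of our own (the parameters are arbitrary) makes the error-bar point visible: even when the true error rate is exactly 1%, the decoy-based estimate scatters noticeably from run to run, and that scatter is exactly what selective re-running can exploit.

    # Toy simulation of the spread in a target-decoy FDR estimate.
    import random

    def estimated_fdr(n_psms=2000, true_fdr=0.01, seed=None):
        """Each of n_psms filtered PSMs is a random hit with probability true_fdr,
        and a random hit lands on a decoy with probability 0.5.  Return the usual
        decoy-based estimate: 2 * decoys / total."""
        rng = random.Random(seed)
        decoys = sum(1 for _ in range(n_psms)
                     if rng.random() < true_fdr and rng.random() < 0.5)
        return 2.0 * decoys / n_psms

    estimates = [estimated_fdr(seed=s) for s in range(100)]
    print(min(estimates), max(estimates))   # the same 1% "truth" yields a wide spread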
P-hacking is an accepted part of publishing novel research ideas, but not for real-world clinical research.
Broadly speaking, the 80/20 rule suggests that essentially all software agrees on the easy 80% of PSMs, whether correct or random IDs. The difference in software quality lies in the vital 20% containing low-abundance PSMs. In any case, casual users are unlikely to discern an algorithm’s robustness from this small subset, yet this is where deep, valuable insights reside.
Robust algorithms like SorcererScore seek to resolve these with only hard data, which requires asymptotically more computing.
This is consistent with other big-data fields that use simple, robust math applied by powerful servers to uncover hidden insights, i.e. data-mining. Proteomics is unique among big-data fields in lacking deep math expertise and in depending on PCs instead of powerful servers, which would explain its relative lack of clinical success.
In contrast, aggressive non-robust software might use meta-data like “sibling peptides” to boost the scores of marginal PSMs assigned to already-identified proteins. The criminal-justice analogy is considering a suspect more guilty simply because relatives are already in jail, which is meta-data relative to physical evidence like fingerprints or DNA.
This has the effect of improving already “good” protein IDs while boosting the number of peptide IDs, a popular quality metric among beginners.
On the flip side, this approach has two undesirable effects for clinical research.
First, it suppresses low-abundance proteins, which are expected to be disproportionately single-PSM protein IDs. Second, mis-assigning a PSM to a protein distorts its quantitation estimate, because an unrelated value is averaged in.
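A tiny worked example with invented numbers shows the second effect: averaging one unrelated intensity into a two-peptide protein more than triples its apparent quantity.

    # Invented numbers, purely to show the arithmetic of the distortion.
    correct_psm_intensities = [1.0e5, 1.2e5]   # true peptides of a scarce protein
    misassigned_intensity   = 9.0e5            # unrelated PSM "boosted" onto it

    honest_estimate  = sum(correct_psm_intensities) / len(correct_psm_intensities)
    boosted_estimate = (sum(correct_psm_intensities) + misassigned_intensity) / 3

    print(honest_estimate)    # 110000.0
    print(boosted_estimate)   # ~373333.3, more than 3x too high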
The key point: there really is no free lunch. And certainly not in mature fields like math and computing, where deep experts are plentiful outside proteomics.
Quality has to be baked in, not tested out. True validation requires validating the insides of an algorithm to ensure it is rigorous by design. Reading its “ingredients” is a first step to telling apart the good stuff vs. the “junk food” of the software world.
Every year for more than a decade, new software has appeared purporting to address the latest challenges, only for subtle new problems to emerge, sometimes years later. Researchers should probably not expect the latest crop of software to be much different in terms of solving some problems while introducing new ones.
Commentary: Revolutionary success through timely preparation
Success isn’t hard if we clearly see facts without clutter. But intelligent people can be their own worst enemy by over-thinking things. They might try to be too clever, or try too hard to get “a deal”, only to miss a once-in-a-lifetime opportunity.
To succeed, just deliver the goods!
What’s tricky, though, is that the game changed. Nowadays “goods” increasingly means robust, validation-ready analyses for translational medicine, not virtual reality from cheapo proof-of-concept software for peer-review.
Like junk food to pro athletes, it’s best to wean from the latter before it ruins our performance.
Narrowly smart people project their expertise into unrelated fields. One famous real estate developer claimed to know more about war than the generals. A couple of chemists bragged of being experts in data science and computers because they used a PC in grad school and took a programming class. Such wishful thinking hinders real progress.
Clinical proteomics, and medical research in general, has silently reached an inflection point. The 80/20 rule predicts resources will concentrate in the 20% who deliver reliable, deep protein analyses. Robust data analysis and tech support from the Silicon Valley engineers of Sage-N Research, who live and breathe math and computing, can prepare researchers for success beyond their dreams.
Time is of the essence because a tech revolution can accelerate unpredictably quickly. Once the explosion starts, it’s already too late to prepare. There will never be a more stark difference between success and its opposite for proteomics-trained scientists.
You can choose between our low-priced SORCERER Storm cloud account or the high-performance SORCERER Pro physical iDA system. (A high-end “private cloud” version of SORCERER Storm is also available for an annual license fee.)
For previous Technical Posts, please visit our website at: https://www.SageNResearch.com/.
Chiang D (2016) How to Identify Low-Abundance Modified Peptides with Proteomics Mass Spectrometry. MOJ Proteomics Bioinform 4(5): 00133. DOI: 10.15406/mojpb.2016.04.00133
Eng JK, McCormack AL, Yates JR (1994) An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J Am Soc Mass Spectrom 5(11): 976-989.