For the first time, proteomics can live up to its 2002 Nobel Prize and begin revolutionizing medicine!
Most labs have all the pieces except for competent analytics. We’ve blogged our original work developing the theory and application of precision low-abundance proteomics. Using only elementary mathematics, we proved that its peptide ID step requires all 3 components: (1) a mass-tolerant search, (2) a cross-correlation search engine, and (3) a post-search filter that uses only hard mass data.
Even with state-of-the-art technology (SORCERER) this requires significant computing power beyond PCs.
Conversely, every fast PC search software package, popular with most labs, violates at least one of the 3 requirements. As such, these tools are provably imprecise at best. Here we explain the theory behind this.
Historically, clever theory skeptics stubbornly attempted the undoable, trying to build perpetual motion machines or turn lead into gold. Elementary mathematical analysis suggests that clever bio-scientists improvising fast, accurate PC-based analytics without the 3 requirements are similarly doomed.
In a nutshell, proteomic analytics suffers from a disconnect between reality and perception. There is nothing trivial about pattern-matching billions of noisy mass measurements against protein subsequences to infer biochemistry. Yet many labs expect to hire a couple of programmers or bioinformaticians to singlehandedly solve complex riddles with little time and inadequate tools and support.
When success becomes unattainable, the publish-or-perish culture produces hacked analytics that look good in a publication but quietly miscalculate under non-ideal conditions. Labs mistake experimental academic software (typically a proof-of-concept to promote as-yet-unproven ideas) for robust analytics for real-world experiments, and get imprecise or fanciful results.
The current productivity crisis presents a breakout opportunity for savvy labs that can do precision low-abundance proteomics to deliver the goods. This is now uniquely possible with SorcererScore(tm) technology.
The strange and tricky science of proteomics
Proteomics’ unusual scientific paradigm confuses researchers and hides problems. Unlike most sciences, where data is used only to test hypotheses, here data is used both to auto-create hypotheses (candidate peptide IDs from search engines) and to test those same hypotheses, which risks circular logic. Also unlike most sciences, the error estimate (false discovery rate, or FDR) doesn’t even consider data in its calculation, which makes it hackable. Taken together, it becomes dangerously easy for sloppy analytics to build “castles in the air”: self-consistent results with little connection to reality.
For example, when the search mass tolerance is set too narrow (i.e. not mass-tolerant, say <10 ppm), the hypothesis-testing step becomes almost meaningless. Most random IDs pass right through because they become indistinguishable from correct IDs using hard mass data. And a hacked FDR doesn’t blow up when these random IDs inflate the total count.
Counterintuitively, the only way to fix this is to widen the search tolerance, almost to the limits of your compute power, to increase the filtering power of delta-mass. Even for hacked FDR workflows, this improves quality by reducing the random IDs that pass through, resulting in fewer IDs at a lower true FDR. But casual users of weak tools may misinterpret fewer IDs as a worse result. (One caveat: mass-tolerant search requires a way to pick out low-ranked correct IDs, which is a key part of SorcererScore.)
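To make the delta-mass intuition concrete, here is a toy simulation (not SorcererScore’s actual algorithm; the masses, tolerances, and candidate counts are illustrative assumptions). Incorrect candidate IDs are modeled with delta-masses spread uniformly across the search window; a hard-mass post-filter then keeps only IDs within instrument accuracy.

```python
import random

random.seed(42)

INSTRUMENT_ACCURACY_PPM = 10.0   # hard mass data: what the instrument resolves

def passes_mass_filter(delta_ppm):
    """Post-search filter on hard mass data: keep IDs whose observed-vs-
    theoretical mass difference falls within instrument accuracy."""
    return abs(delta_ppm) <= INSTRUMENT_ACCURACY_PPM

def simulate(search_tolerance_ppm, n_random_ids=100_000):
    """Model random (incorrect) candidates as uniformly spread across the
    search window; return the fraction surviving the hard-mass filter."""
    survivors = sum(
        passes_mass_filter(random.uniform(-search_tolerance_ppm,
                                          search_tolerance_ppm))
        for _ in range(n_random_ids)
    )
    return survivors / n_random_ids

# Narrow search: every random candidate already sits inside the accuracy
# window, so the post-filter removes nothing (fraction = 1.0).
print(simulate(search_tolerance_ppm=10))

# Wide (mass-tolerant) search: the same filter now rejects ~99.9% of
# random candidates, because delta-mass has regained filtering power.
print(simulate(search_tolerance_ppm=10_000))
```

The wider the search window, the more discriminating the (unchanged) hard-mass filter becomes, which is the whole point of requirement (1) plus requirement (3).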
Such is the current state of proteomics: Most labs standardized on low-priced or free academic PC software that generates good-looking results which later prove to be semi-random, and they seem perplexed as to why. Now it’s a mystery no more.
Astute readers may start to see the theoretical basis for why the 3 required components form, more or less, the only possible framework for precision, low-abundance proteomics.
Although conceptually simple, this methodology requires an optimized computing appliance (SORCERER) or a large cluster (20x to >40x more CPUs for equivalent performance). For instance, one customer is investigating a potential breakthrough discovered only with a 1000 amu search, which took a SORCERER-2 system about 10 days. It could be tweaked to be barely practical for the cloud-based SORCERER Storm (~10% of the compute power) but is largely impossible for desktop PCs.
How to cross-validate false-positives and FDR
Validating software means checking both true-positives (peptides in the sample are reported) and false-positives (peptides not in the sample are not reported). Anecdotally, many labs only validate true positives while presuming the software’s self-reported FDR is correct.
The following is a simple framework to cross-check any software’s FDR with real-world data.
1) Do a standard target-decoy search and sanity-check true-positives.
2) Remove all identified and similar proteins from the sequence database, then search again.
3) Repeat #2 if there are still many high-confidence identified peptides and proteins.
The principle is to create a sequence database with no matches by iteratively removing any proteins that do match. Searching against this non-matching database reveals false-positive susceptibility.
For robust analytics, confident IDs would quickly drop to become insignificant after one or two iterations. In contrast, non-robust analytics would report brand-new IDs at a low FDR after every iteration. Labs can learn a lot about analytics in general with this test.
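The iterative check above can be sketched in code. This is a toy stand-in, not a real workflow: the “search engine” here simply matches spectra (short sequences) against any protein containing them, whereas a real lab would rerun its actual search engine at each iteration.

```python
def toy_search(spectra, database):
    """Stand-in search engine: return {spectrum: matched_protein_name}
    for every spectrum found as a substring of some protein."""
    hits = {}
    for s in spectra:
        for name, seq in database.items():
            if s in seq:
                hits[s] = name
                break
    return hits

def cross_check(spectra, database, max_rounds=5):
    """Iteratively remove every identified protein and search again.
    Robust analytics should report (almost) nothing within a round or
    two; non-robust analytics keeps 'finding' brand-new IDs."""
    db = dict(database)  # work on a copy
    for round_no in range(1, max_rounds + 1):
        hits = toy_search(spectra, db)
        print(f"round {round_no}: {len(hits)} confident IDs")
        if not hits:
            break
        for name in set(hits.values()):
            del db[name]
    return db

# Hypothetical example data (protein names and sequences are made up).
database = {
    "P1": "MKWVTFISLLFLFSSAYS",
    "P2": "GIVEQCCTSICSLYQLENYCN",
    "P3": "FVNQHLCGSHLVEALYLVCG",
}
spectra = ["TFISL", "CCTSI", "NOTINANYPROTEIN"]
cross_check(spectra, database)
```

Here round 1 identifies P1 and P2, and round 2 finds nothing against the pruned database, which is the behavior a robust workflow should show.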
Robust precision relies on hard mass data
All analytics infers results from both hard data (mass) and soft meta-data (search scores, protein sequences, etc.). The latter embeds subjective variability, for example in what information to consider and at what relative significance. Therefore robust, reproducible analytics strives to prioritize hard mass data.
In contrast, academic proof-of-concept analytics may be expected to mix-and-match, perhaps even favoring meta-data in a pinch, to boost performance on the published datasets (“over-fitting”).
Look, soft meta-data can be very informative, sometimes more than hard data, but just not predictably so.
Your mother can just look at you to diagnose most ailments, perhaps better and certainly quicker than your MD. The problem is, she can be catastrophically wrong, something your MD interpreting hard diagnostics data is trained to avoid.
Search engine scores are soft meta-data because human inventors decide which spectral features are important and how they combine into a numerical score. This subjectivity is evident because no two search engines ever agree on peptide IDs regardless of the quality or accuracy of the mass spec dataset.
Conceptually, a hard/soft split of 90/10 is more rigorous than a 10/90 split.
As an analogy, a student’s final grade based on a test/attendance split of 90/10 is a more rigorous indicator of competence than a 10/90 split. Test score is a hard indicator of competence, while attendance is only weakly correlated to competence. Relevant for peptide ID, it becomes easier to inflate significance with a higher soft split or with “softer” meta-data.
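The grading analogy reduces to simple weighted-average arithmetic (the scores below are made up for illustration):

```python
def final_grade(test_score, attendance_score, test_weight, attendance_weight):
    """Weighted blend of a hard indicator (test) and a soft one (attendance)."""
    total = test_weight + attendance_weight
    return (test_weight * test_score + attendance_weight * attendance_score) / total

# A weak student with perfect attendance: test 40, attendance 100.
print(final_grade(40, 100, 90, 10))  # 46.0 -- fails, as the test score says
print(final_grade(40, 100, 10, 90))  # 94.0 -- soft weighting inflates to an A
```

The same student jumps from an F to an A purely by shifting weight from the hard indicator to the soft one, which is exactly how a soft-heavy peptide-ID score can inflate significance.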
Popular dogma is fundamentally wrong
The single biggest misunderstanding — unfortunately reinforced by flawed intuition, recommendation from respected scientists, and seminal papers — is the idea that accurate data can be searched with an ultra-narrow mass tolerance (say <10 ppm). In fact, the complete opposite is necessary for precision proteomics.
It’s not easy to dispel ingrained dogma. It’s also probably very disturbing, because it casts a shadow over years of proteomics research using narrow searches. Sorry about this, but that’s science. If the brilliant Stephen Hawking had to backtrack on his signature theory involving a black hole’s event horizon, no dogma is safe regardless of stature.
On the positive side, this is a rare leapfrog opportunity before the rest of the field wakes up to it.
Fundamentally, a narrow mass tolerance reduces hard data content, so results rely almost entirely on soft meta-data. The narrower the tolerance, the lower the hard/soft split.
As noted, that means the results can be unpredictably correct or incorrect depending on how well the dataset matches the subjective assumptions. That the soft analytics worked well on the published data has little bearing on whether it will work on different data. We believe this is consistent with labs’ own experiences.
Narrow mass tolerance means reliance on soft meta-data
The only hard data from tandem mass spectrometry are masses (m/z) of precursor (MS1) and fragment ions (MS2). That’s it. Absolutely everything else is soft meta-data.
Consider this thought experiment: Say we narrow the mass tolerance to exactly the data accuracy. Common intuition suggests this is good because all correct IDs pass the filter while the ultra-narrow window blocks almost all random IDs.
But the operative word is “almost”. The problem is that some unknown number of random IDs will always get through, and they will be completely mass-indistinguishable from correct IDs.
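A back-of-envelope calculation shows why the window is never empty. All the numbers below are illustrative assumptions (candidate count, tolerance, mass range), not measurements, and the uniform-density model is a simplification:

```python
def expected_random_passes(n_candidates, tolerance_ppm, mass_range_ppm):
    """Expected number of random candidates landing inside a +/-tolerance
    window, assuming candidate masses spread uniformly over the range."""
    window = 2 * tolerance_ppm
    return n_candidates * window / mass_range_ppm

# Suppose a large database with modifications yields ~1e6 random candidate
# peptides spread uniformly over a 1000 ppm mass range, searched at 10 ppm.
print(expected_random_passes(1e6, tolerance_ppm=10, mass_range_ppm=1000))
# 20000.0 -- a "tight" 10 ppm window still admits thousands of random IDs
```

However the numbers are tuned, the expected count scales linearly with the candidate pool, so it never reaches zero; and every survivor is, by construction, mass-indistinguishable from a correct ID.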
The literature shows two approaches to finesse this problem. The first is to further tighten the search space (beyond the narrow tolerance) by searching only smallish databases or limiting modifications. The second is to misuse a piece of academic software (a non-robust proof-of-concept for a clever paper) that “percolates” ID hypotheses with more soft meta-data. Neither can replace the missing hard mass information.
Previously we explained how SorcererScore delivers precision low-abundance proteomics, with its 3 required components.
Here we explain the flip side: how existing fast search workflows using narrow mass tolerance inherently yield imprecise if not semi-random results. The theory shows the problem is fundamental and cannot be fixed without incorporating the 3 requirements.
In our 15 years in this field, this is the first time proteomics has become a robust, precise, sensitive technology for protein research.
Done right, medical research transcends politics.
Budget-conscious labs can start with the low-priced cloud version (SORCERER Storm) starting at $5K with tech support. Researchers with SORCERER experience can forgo tech support and start at $3K. Labs can sign up to lock in current rates before a planned price increase commensurate with increased functionality.
Contact Terri at Sales@SageNResearch.com for information.