How Bad Analytics Hijacked Proteomics: A Theoretical Analysis
The fixable reason why proteomics has little clinical impact — it’s big data mis-analyzed by small PCs using faulty analytics.
Here we explain how proteomics has been hijacked by a fundamentally flawed, self-reinforcing fantasy — that a sufficiently clever PC program can analyze huge content-rich mass spec datasets in minutes with 99%+ accuracy. Just as dirty samples need sufficient chemistry to clean, dirty deep data (where breakthroughs hide) need a lot of computation that greatly exceeds a PC’s capacity.
In a nutshell, any analysis from a fast PC program is either uncompetitively shallow or deceptively semi-random, because insufficient computation imparts minimal information content. Note that this is a perpetual moving target: as PCs increase in power, so do servers, while data complexity continues to increase, so PC software is always at least an order of magnitude behind what server systems can do.
The silver lining: With the field spinning in circles due to faulty analytics, researchers skilled at competent deep proteomics can dominate with few rivals. This is a once-in-a-lifetime opportunity for foundational protein research at the dawn of the digital biomedical revolution.
Data analysis is always the Achilles heel in any “real” scientific discipline, and more so for modern mass spectrometry data with overwhelming volume and bottomless depth. To make proteomics less intimidating, PC programs designed to be fast, cheap and easy — basically demo software — are marketed to novices for simple experiments. But unsuspecting labs using them for real-world research get burned by irreproducible results.
Rigorous analytics is fundamental to precise physical fields like subatomic physics and proteomics. Since scientists can’t directly “see” what they study but only indirectly through precise measurements, the integrity of the science is 100% determined by the integrity of the mathematics. Vulnerabilities can only be revealed by studying the equations, not by simple sanity-check experiments. Most labs only do the latter and put too much trust in peer review to verify the math.
We can now mathematically explain how popular PC programs achieve otherworldly performance (an order of magnitude faster while reporting more IDs) versus robust if imprecise server workflows like classical TPP/PeptideProphet. Basically, the software fudges the statistics by systematically and significantly underestimating the FDR.
Two wrongs don’t make a right, but each can make the other less detectable to casual users. This is the one-two punch: (1) narrow search tolerance, which speeds up search while making wrong IDs harder to discriminate with delta-mass, and (2) an over-aggressive FDR optimizer that, when given close-together right and wrong IDs, tends to report ambiguous IDs as correct. If a true FDR of 10-20% can be misreported as 1% (i.e. 80%+ of IDs are indeed correct), human nature gives everything the benefit of the doubt. It’s the same mental trick used by psychics to fool gullible victims by telling them 9 provable truths for every unprovable lie about the afterlife. In short, the false-discovery rate (FDR) has evolved from a useful quality metric to a fake number that makes bad analytics look great.
Let’s say a quick narrow search yields exactly 10K correct IDs. (Note that each PSM must be either right or wrong, so the number of correct peptide IDs is an exact number.) A robust optimizer might report 10.1K IDs at 1% FDR, but a flaky one might report 11K or 12K IDs also at 1% FDR. By definition, no robust result can come close to the latter’s artificially boosted output. A costly deeper search may yield 10.3K correct IDs, including 300 lower abundance peptides (perhaps one is a breakthrough), but superficially it can’t compete against the flaky result.
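The arithmetic behind this example can be sketched in a few lines. This is only an illustration of the numbers above, not any vendor's algorithm; the function name true_fdr is hypothetical, and it assumes every one of the 10K correct IDs is included in each reported list.

```python
def true_fdr(reported_ids: int, correct_ids: int) -> float:
    """Fraction of reported IDs that are wrong.

    Assumption (ours, for illustration): all correct IDs are among the
    reported IDs, so every extra reported ID is a false discovery.
    """
    if reported_ids <= 0:
        raise ValueError("reported_ids must be positive")
    wrong_ids = max(reported_ids - correct_ids, 0)
    return wrong_ids / reported_ids


CORRECT_IDS = 10_000  # the "exactly 10K correct IDs" from the text

for reported in (10_100, 11_000, 12_000):
    rate = true_fdr(reported, CORRECT_IDS)
    print(f"{reported} IDs reported at a claimed 1% FDR: true FDR = {rate:.1%}")
```

Under that assumption, the 10.1K report really is at about 1%, while the 11K and 12K reports carry roughly 9% and 17% wrong IDs despite the same claimed 1% FDR.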
Most labs “validate” an algorithm by pre-assuming its FDR is correct (!), in part because few know to cross-check it. This allows even the flakiest algorithms to get past peer review and propagate like weeds, destroying the credibility of proteomics technology.
To play devil’s advocate, some may argue, “What’s wrong with getting 90% right answers with cheap/free software?” But that confuses quantity with relevance. Suppose an infection researcher orders a proteomic analysis to see what pathogens are in diseased tissue. The report comes back with a list of 95% mostly correct but irrelevant human proteins and 5% mostly wrong pathogen proteins. Who wants that? By saving money on software and IT, proteomics labs produce lots of correct but irrelevant information, and little that is pertinent to the research question.
One of the challenges for professional analytics companies is that researchers have become so overly dependent on numerical surrogates from software they seem to forget even the basics. Mass spectrometry isn’t a deep concept — the presumed moiety is wrong if delta-masses are excessive. There is nothing deeper than that.
For the record, a candidate peptide ID is likely correct if and only if there are many matched fragment ions, and predicted masses (precursor and fragments) closely match measured ones, period. There literally isn’t any more quantitative information than delta-masses. Yet too many researchers have been trained to think only in terms of search scores (subjective metrics) and FDR, and ironically don’t trust delta-masses.
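As a minimal sketch of what "closely match" means in practice, delta-mass is commonly expressed in parts per million (ppm) of the predicted mass. The function names and the 10 ppm default below are illustrative assumptions, not values taken from this article or from any particular search engine.

```python
def ppm_error(measured: float, theoretical: float) -> float:
    """Signed delta-mass in parts per million, relative to the predicted mass."""
    return (measured - theoretical) / theoretical * 1e6


def within_tolerance(measured: float, theoretical: float,
                     tol_ppm: float = 10.0) -> bool:
    """True if the measured mass is within tol_ppm of the predicted mass.

    The 10 ppm default is an assumed, illustrative tolerance.
    """
    return abs(ppm_error(measured, theoretical)) <= tol_ppm


# A predicted mass of 1000.0 Da measured as 1000.005 Da is 5 ppm off.
print(ppm_error(1000.005, 1000.0))
print(within_tolerance(1000.005, 1000.0))  # inside a 10 ppm window
print(within_tolerance(1000.05, 1000.0))   # 50 ppm off, outside the window
```

The same check applies to the precursor and to each fragment ion; a candidate with many matched fragments, all within tolerance, is the kind of ID the paragraph above describes.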
Look at it this way. Weight data is conceptually simple. It’s so simple that other scientists might laugh if someone insists on using one model of accurate scale along with analysis software designed for weight data from that scale. Actually, molecular mass is only slightly more complex due to isotopes, so that a monoisotopic mass M might manifest as (M+1) or (M+2). And mass/charge ratios (m/z) from mass spectrometry are only slightly more complex than that because the charge can be +2 or +3 or sometimes +4 or higher. This gives rise to m/z variations of M/2, (M+2)/3, (M+3)/2, etc. In other words, conceptually speaking, mass spectrometry data are mainly tricky because of arithmetic variations. You fundamentally don’t need or want overly fancy mathematics (Bayesian analysis, machine learning) to unwind arithmetic transformations, because they blur out inherent precision. Instead you want to stick with arithmetic transforms that preserve it. That’s the guiding principle behind SorcererScore™ for precision proteomics.
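To make the "arithmetic variations" concrete, here is a small sketch of the transform between a monoisotopic mass and its observed m/z. It assumes protonated positive ions, so unlike the simplified M/2 and (M+2)/3 shorthand above, it adds one proton mass per charge; the function names are hypothetical and this is not the SorcererScore implementation.

```python
PROTON_MASS = 1.007276      # Da, mass added per positive charge (protonation)
ISOTOPE_SPACING = 1.003355  # Da, approximate 13C - 12C mass difference


def mz(mono_mass: float, charge: int, isotope: int = 0) -> float:
    """Observed m/z for a neutral monoisotopic mass at a given charge and
    isotope peak (0 = monoisotopic, 1 = M+1, 2 = M+2)."""
    return (mono_mass + isotope * ISOTOPE_SPACING + charge * PROTON_MASS) / charge


def neutral_mass(mz_value: float, charge: int, isotope: int = 0) -> float:
    """Invert mz(): recover the neutral monoisotopic mass by plain arithmetic."""
    return mz_value * charge - charge * PROTON_MASS - isotope * ISOTOPE_SPACING


M = 1500.0  # example peptide mass in Da (made up for illustration)
for z in (2, 3, 4):
    for k in (0, 1, 2):
        observed = mz(M, z, k)
        print(f"z=+{z} isotope=M+{k}: m/z={observed:.4f} "
              f"-> recovered M={neutral_mass(observed, z, k):.4f}")
```

Because the forward and inverse transforms are simple arithmetic, the recovered mass is as precise as the measurement itself, which is the point of the paragraph above.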
Current mass spectrometers have a dynamic range of perhaps 6 orders of magnitude, which is much less than that of peptide abundance in real-world bio-samples. As well, short-duration peptides are hit-or-miss in terms of being captured in either an MS1 or MS2 scan. This means the dataset is extremely content-rich and captures only a tiny fraction of peptides, with signal-to-noise that varies from great to extremely poor.
To appreciate the data analysis complexity, imagine having thousands of separate jigsaw puzzles (proteins) mixed into one pile. Let’s say there are 10M total theoretical pieces (peptides), of which only 1M are available (MS1 peaks), of which only 100K retain their possibly fading picture (MS2 spectra). Given a book of solutions (‘Fasta’ protein sequences), the analysis software must go backward to reconstruct the original puzzles (protein IDs) to the extent possible, first by pattern matching as many pictured pieces as possible (peptide IDs) to the solution book. The problem is further complicated by signal-to-noise and search variations (PTMs or post-translational modifications).
Such complex, voluminous data with “layers” of different signal-to-noise is somewhat new to bioscience, but not to subatomic physics and other big-data fields. It is unhelpful to ask a canned program to “tell us everything” (you get mostly obvious irrelevant answers). Instead, such data are typically probed with an open-ended, semi-interactive approach (“data mining”) using overnight scripts on powerful servers. Scripts allow different questions to be asked.
Why servers? In data mining, you need the most powerful and efficient computers you can afford to probe deeper into noisy data. The total computation applied to one task may be modeled as “TC = Speed*Efficiency*Time”, where Speed represents the relative hardware speed (a PC is 1.0), Efficiency is a multiplier for software optimization like indexing, and Time is the number of hours. An unoptimized PC program running for an hour yields TC=1.0. In comparison, a computer with 4x the cores using indexing for 5x the efficiency running for 10 hours yields TC=200, or 200x more total computing. Unlike PCs, servers have hardware, OS and software designed specifically for robust 24/7 automation. Server CPUs will always offer more cores and reliability than PC CPUs. Therefore, optimized server systems like SORCERER can always apply orders of magnitude more computing than PCs, because both improve in unison from the same Moore’s Law.
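The TC model above is simple enough to write down directly. This sketch just reproduces the two cases from the text; the function name total_computation is ours, and the model itself is the article's rough approximation rather than a benchmark.

```python
def total_computation(speed: float, efficiency: float, hours: float) -> float:
    """TC = Speed * Efficiency * Time, with a baseline PC defined as speed 1.0."""
    return speed * efficiency * hours


# Unoptimized PC program running for one hour.
pc = total_computation(speed=1.0, efficiency=1.0, hours=1.0)

# 4x the cores, 5x efficiency from indexing, running for 10 hours.
server = total_computation(speed=4.0, efficiency=5.0, hours=10.0)

print(f"PC TC = {pc}, server TC = {server}, ratio = {server / pc:.0f}x")
```

Since the three factors multiply, modest gains in each one compound into the 200x difference quoted above.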
So what does it all mean? Like mining for 5-carat diamonds with hand tools, the naive approach to shopping for the “best” PC program (in terms of fast, cheap, and easy) for real-world proteomics research has been and will always be futile. Another lab using efficient data mining servers can always direct 200x more computing to do deeper, more reproducible analysis. There is no natural market for undependable shallow analyses of abundant proteins using $1M instruments, questionably justified by saving maybe $100K in software costs. (Total IT cost is typically 15% of total budget in diverse data-intensive fields.) The cost may sound high to those new to advanced computing, but no one can name a successful big-data field dominated by amateur PC software.
Look, if all data analysis could be pre-encapsulated into a push-button program, future Crick/Watsons and Einsteins could earn their Nobels more easily by letting junior programmers pre-solve all their data riddles. The fallacy is the chicken-and-egg problem: we can’t code insights we don’t yet have. Even fancy AI techniques (think statistical tool PCA on steroids) simply repackage the complexity somewhere else.
We are not saying we have all the answers, only that we understand the question better than most. And we uniquely have the most advanced platform and technology (SorcererScore) that supports the theoretical foundation for all Precision Proteomics.
In deep sea marine biology, scientists and engineers work together to explore the murky abyss with submersibles. Biologists focus on studying new creatures, not calculating steel thickness. Engineers focus on building a specialized tool, not encroaching on biology. Marine biologists who try to do both are unlikely to succeed in either science or engineering.
In deep proteomics, scientists and engineers work together to explore deep data. Mass spec informatics turns out to require different mathematics than other big-data fields. Proteomics scientists focus on early cancer and disease biomarkers, not developing and maintaining complex software systems. System engineers focus on building productive data mining tools that accelerate scientific discovery without dabbling in the science itself. As history shows, most labs that try to master both science and computing don’t succeed.
A precision revolution in proteomics has quietly started. With wide availability of accurate mass spectrometers and continuing advances in sample prep, the value-added bottleneck shifts to deep data analysis. This signifies a maturation of technology (“inflection point”) that creates unprecedented opportunity and peril for institutions, companies, and people involved.
These are tumultuous times for medical research. Most people pull back hoping for better times. Strategically, the best time to leapfrog is to invest when others retrench, and to do great science when others settle. This is especially true when everyone will soon learn how robust proteomics should be done.
We are doing our part to advance science by making introductory precision proteomics affordable. Contact us for more information.