Proteomics in clinical research is like selling fish in Tokyo. There is unlimited demand for fine sashimi, but not for tuna salad. Customers pay top dollar for essential flavors, not for heavy mayonnaise. Uninformed restaurants might believe preparing raw fish requires no skill, hire inexperienced youngsters instead of sushi masters, and then wonder why customers don’t show. In fact, sushi mastery is hard precisely because of the restricted degrees of freedom: no cooking, no added ingredients.
Translational clinical research has unlimited demand for Precision Proteomics, but not for imprecise results from ad hoc statistical blending of hard data and soft meta-data.
For two decades, proteomics had to be statistical because data accuracy was poor. It borrowed heavily from genomics, an inherently statistical field, so imprecision was baked into conventional bioinformatics.
After high-accuracy mass spectrometry became prevalent, proteomics should no longer be a statistical science. Instead, the question is whether it is a mathematical or a computational problem. In other words, is the main challenge defining the best equations, or doing timely calculations (i.e., in days, not months)?
From our original work rebuilding its mathematical foundation, we found that, for precise peptide identification, the math must be simple arithmetic applied directly to raw mass/charge (m/z) data, with no complex models. The critical need is instead powerful computing for ultra-wide, sensitive searches, a direct consequence of the restricted degrees of freedom that come from using neither complex models nor soft meta-data.
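To make the “simple arithmetic” concrete, here is a minimal Python sketch of the two textbook calculations at the heart of delta-mass work: converting an observed precursor m/z and charge state into a neutral mass, and expressing the mass error against a candidate peptide in parts per million (ppm). The function names and example numbers are ours for illustration, not SorcererScore internals.

```python
PROTON_MASS = 1.007276  # Da; mass of a proton, for [M+zH]z+ precursor ions

def neutral_mass(mz: float, charge: int) -> float:
    """Neutral (uncharged) mass from an observed m/z and charge state."""
    return charge * mz - charge * PROTON_MASS

def delta_mass_ppm(observed_mass: float, candidate_mass: float) -> float:
    """Mass error of a candidate peptide relative to the observed mass, in ppm."""
    return (observed_mass - candidate_mass) / candidate_mass * 1e6

# Example: a 2+ precursor at m/z 785.8421 vs. a candidate peptide of 1569.6685 Da
obs = neutral_mass(785.8421, 2)
print(f"observed neutral mass: {obs:.4f} Da")
print(f"delta-mass: {delta_mass_ppm(obs, 1569.6685):+.2f} ppm")
```

Nothing statistical enters these calculations; the results are as exact as the instrument’s m/z readings.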
As such, SorcererScore™ provides the framework for the Precision Proteomics Revolution.
This new paradigm may be viewed as a “biochemical x-ray” for peptides, where SorcererScore “develops” an “image” (an arithmetic transformation of the m/z data). As in x-ray diagnosis, peptide-ID hypotheses are classified as “yes”, “no”, or “maybe” depending on the raw quality of the m/z data. Because simple arithmetic, not models, is used, skilled interpreters can uncover deeper insight by manipulating the arithmetic (data-mining).
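As a purely hypothetical illustration of such three-way triage, the sketch below classifies a peptide-ID hypothesis from a single score; the score scale and cutoff values are invented for this example and are not SorcererScore parameters.

```python
def triage(score: float, yes_cutoff: float = 0.9, no_cutoff: float = 0.2) -> str:
    """Hypothetical three-way call on a peptide-ID hypothesis.

    The cutoffs are placeholders: scores above yes_cutoff are confident
    'yes' calls, scores below no_cutoff are confident 'no' calls, and
    everything in between is flagged 'maybe' for a human interpreter.
    """
    if score >= yes_cutoff:
        return "yes"
    if score <= no_cutoff:
        return "no"
    return "maybe"

for s in (0.95, 0.55, 0.10):
    print(f"score {s:.2f} -> {triage(s)}")
```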
Two important implications follow.
First, any lab attempting sensitive precision proteomics with PCs is wasting its time. (It is impractical to do computation-intensive science with less computing power than a photography studio running Photoshop.)
Second, unlike probabilistic results from a statistical science, precise peptide IDs can be validated experimentally.
This is a major game-changer because Precision Proteomics can immediately impact clinical research in ways that elude conventional proteomics.
(Note this post focuses on peptide ID because it is both the foundation of proteomics and its most problem-prone area. Precision peptide quantitation, using arithmetic on m/z and intensity, will be covered in a later blog post.)
Why reliance on delta-mass means a wide mass search
To understand this critical but counter-intuitive insight, imagine you have thousands of fingerprints from a Las Vegas crime scene, plus solid information that everyone responsible lives within one mile of downtown. There are also meta-data, such as crime statistics on race, gender, age, and so on.
The first way to identify suspects: run a fingerprint search within a few miles of downtown, but with a score that also weighs in the meta-data, and then filter down to those within one mile.
Here the results depend on both objective and subjective factors, which makes them subjective and imprecise.
The second way: run a fingerprint search across all of Las Vegas, then filter to within one mile. For higher confidence, search all of Nevada, or even the entire US, then filter to one mile.
Here, the results are robust and reproducible because the hypothesis filtering uses only one hard parameter: distance. The results are as precise as the raw data allow.
Akin to rinsing a dirty spoon in a bathtub versus in a cup, the wide search range spreads out random matches, so few of them land within the filter range: if random matches scatter roughly uniformly across the search window, then the fraction falling inside a fixed narrow filter window shrinks in proportion as the search widens. The wider the search, the lower the error. This is the only way a standard search-then-filter methodology can leverage a single hard parameter like delta-mass, as the simulation below illustrates.
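The following back-of-the-envelope simulation makes the bathtub-versus-cup intuition quantitative, under the simplifying assumption that an incorrect (random) top match lands roughly uniformly anywhere in the precursor-mass search window. All window widths and tolerances here are illustrative numbers, not SorcererScore settings.

```python
import random

random.seed(1)

FILTER_DA = 0.02   # narrow delta-mass filter: keep matches with |delta| <= 0.02 Da
TRIALS = 100_000   # simulated random top matches per search width

# Widen the search window (half-width, in Da) and watch the survival rate fall.
for search_da in (0.05, 1.0, 10.0, 100.0, 500.0):
    survivors = sum(
        abs(random.uniform(-search_da, search_da)) <= FILTER_DA
        for _ in range(TRIALS)
    )
    print(f"search +/-{search_da:>6.2f} Da: "
          f"{survivors / TRIALS:.4%} of random matches pass the filter")
```

The survival rate tracks the ratio of filter width to search width (roughly 40% at ±0.05 Da but only about 0.004% at ±500 Da), which is why widening the search directly lowers the error among the filtered IDs.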
(For completeness, note that skillful incorporation of meta-data — the art of data science — can increase IDs, but it should be done as a separate later step, and never blended with hard data in the same step.)
“Fast, good, or cheap: Pick two”
This old saying from software development, a popular form of the project-management “Triple Constraint”, applies to software products as well.
This implies that “fast” software is either good or inexpensive but not both.
For example, in photo editing, you can choose between a powerful Photoshop server and an inexpensive iPhone app. Even fast-and-good Internet software that is “free” is financed by tens of billions of dollars in advertising, ultimately paid by you-know-who.
Proteomics is no exception.
Fast-and-inexpensive PC software only looks “good” because of faulty error estimation. Its popularity explains how proteomics acquired a reputation for irreproducibility.
In contrast, SORCERERs are professionally engineered to be “good”. One can now choose “fast” (SORCERER Pro iDA system) or “inexpensive” (SORCERER Storm cloud account).
Both use the same SORCERER GEMYNI platform for rapidly building novel algorithms for leading-edge clinical research. In many cases, Sage-N Research can supply basic scripts at no charge, which our tech support can help users customize.