For two decades, traditional proteomics has been held back by two subtle deficiencies. First, its results are imprecise even from high-accuracy data. Second, it struggles to find important lower-abundance proteins. For years no one knew why. Solving these problems would unlock significant financial support, with projected 19% CAGR growth by one estimate.
That riddle is now solved, which has important implications for every proteomics lab. Now that precision proteomics is possible, traditional proteomics will start to fade away because, obviously, there is no market for imprecise characterization of abundant proteins. As we explain, precision proteomics is fundamentally an analytics challenge. To be sure, some labs will procrastinate. Others may question our claim. In any case, savvy researchers will want to get ahead of the curve, whether to try to poke holes, plan its adoption, or simply keep abreast of a paradigm-shifting technology.
The new SORCERER Storm(tm) product offers a low-cost, cloud-based platform with SorcererScore(tm) technology for precision proteomics. Accurate-mass data from virtually any vendor’s tandem mass spectrometers can be uploaded (as mzXML) to be searched and analyzed. Results can be viewed online with TPP or RStudio, or downloaded as a summary CSV file for MS Excel.
In other words, most scientists can hit the ground running with SORCERER Storm.
For TPP (PeptideProphet etc.) users, the base case requires only running the script after the search, with no other changes needed. It typically “just works”, increasing IDs and true-vs-false separation (see Figure 1), with no extra effort beyond possibly optimizing search parameters.
Figure 1: TPP/PeptideProphet results before (a) and after (b) applying SorcererScore script, showing improved True-vs-False separation. Many obviously incorrect IDs are also omitted.
Uniquely, the latest SX script will also create an Excel-compatible CSV file of all high-confidence peptide IDs with label-free (MS1) quantitation. Each peptide ID hypothesis is mapped to a triad of physical parameters that makes up the S-score for post-search filtering: (1) delta-mass, (2) average fragment delta-mass, and (3) matched fragment peak-count.
Critically, this triad is the heart of SorcererScore, the discovery that each peptide ID hypothesis can be mapped to physical parameters (x,y,z) in a 3D data-cube — independent of any search engine — that reveals distinct ‘true’ and ‘false’ populations from purely physical characteristics! This becomes visually clear by creating an interactive 3D plot within SORCERER Storm, specifically using R (RStudio) with the ‘rgl’ library on the CSV file.
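As a concrete illustration, the triad for a single peptide ID hypothesis might be computed as sketched below. This is a minimal sketch only: the precursor and fragment masses are made-up values, and a real workflow would read them from the SORCERER Storm CSV file.

```python
def ppm_error(observed, theoretical):
    """Mass error in parts-per-million."""
    return (observed - theoretical) / theoretical * 1e6

# Hypothetical precursor masses for one peptide ID hypothesis.
precursor_obs, precursor_theo = 1479.7886, 1479.7871

# Matched fragment ions as (observed m/z, theoretical m/z) pairs.
fragments = [(175.1190, 175.1189),
             (322.1880, 322.1874),
             (435.2720, 435.2715)]

# The triad: one (x, y, z) point in the 3D data-cube.
delta_mass = ppm_error(precursor_obs, precursor_theo)         # x: precursor delta-mass
avg_frag_delta = sum(abs(ppm_error(o, t))                     # y: avg fragment delta-mass
                     for o, t in fragments) / len(fragments)
peak_count = len(fragments)                                   # z: matched peak-count
```

Plotting many such (x, y, z) points, for example in RStudio with the ‘rgl’ library as described above, is what makes the distinct true and false populations visually apparent.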
The search engine-independent nature of the triad solves both deficiencies of traditional proteomics. To our knowledge, every traditional proteomic analytics tool uses the search engine score to derive the true-vs-false discriminant score, hence introducing model artifacts and suppressing low-abundance peptides.
The CSV file represents the power and simplicity of SorcererScore. It summarizes all the fundamental information of the experiment at the peptide level: sequence ID with modifications along with MS1 quantitation. Although our field is called proteomics, it can be considered “peptidomics” since all the quantitative data are at the peptide level. The CSV file captures almost all that is knowable regarding peptides. Any knowable protein information is statistically inferred from it.
In advanced labs large and small, bioinformaticians can access all key analytics in transparent scripts written in Linux, MUSE (based on the Lua language), and R, and all key text files in structured directories. That most score and quantitation scripts are just a few pages of code underscores the concise power of our platform.
It has long been known that certain workflows give better answers for certain datasets, but no one knew why (until now). From a reproducibility perspective this is concerning. Many researchers tweak search conditions to improve results, usually to get the most IDs at 1% FDR. More recently, aggressively priced software lets the user enter a desired FDR and automatically loops to maximize the ID count. It is now clear that this is simply automated p-hacking, which yields attractive but semi-random IDs at an artificially low FDR.
What every scientist needs to know is that FDR is a statistical estimate that can be modeled as a random variable with a probability distribution. Therefore, looping to achieve 1% FDR is basically running a random number generator until something below 1% shows up. The true FDR can be 10% or much higher. It is algorithmically tricky to reduce the true FDR, but trivial to hack the FDR estimate to look good, especially in our complex “big data” field where many labs lack deep expertise in the mathematical sciences.
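The “random number generator” effect can be demonstrated with a toy simulation. Here the true error rate is fixed at 5%, yet repeatedly re-estimating the FDR (as a parameter-tweaking loop effectively does) soon produces an estimate below 1%. All numbers are illustrative assumptions, not real search statistics.

```python
import random

random.seed(42)
TRUE_FDR = 0.05   # the actual error rate, assumed for illustration
N_IDS = 50        # IDs per "search"; small sets give noisy estimates

def one_fdr_estimate():
    """One noisy FDR estimate: the fraction of wrong IDs in one sample."""
    wrong = sum(random.random() < TRUE_FDR for _ in range(N_IDS))
    return wrong / N_IDS

# Automated "p-hacking": keep re-running until the estimate looks good.
attempts, est = 0, 1.0
while est >= 0.01 and attempts < 100_000:
    attempts += 1
    est = one_fdr_estimate()

# The loop eventually reports <1% FDR even though the true FDR is 5%.
```

The loop stops on a lucky draw, not on a genuinely better result, which is exactly why the reported 1% can hide a much larger true error.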
The field had become unnecessarily complex mathematically, which allowed faulty analytics to hide in convoluted, hackable equations. We believe much p-hacking is inadvertent, stemming from incomplete understanding. For example, many researchers seem unaware that the decoys used to estimate error (FDR ≈ #decoys/#targets) are simply proxies for wrong target IDs: this FDR reflects the error that remains after all decoys are removed. This is why you do not get zero error simply by omitting the decoys; the wrong target IDs they proxy remain in the results.
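A small worked example, with made-up counts, shows why removing the decoys does not remove the error they estimate:

```python
# Hypothetical filtered search results: each hit is (label, actually_correct).
# Decoys are wrong by construction; they act as rough 1:1 proxies for the
# wrong target IDs hiding among the results.
hits = ([("target", True)] * 95      # correct target IDs
        + [("target", False)] * 5    # wrong target IDs (unknown to the analyst)
        + [("decoy", False)] * 5)    # decoy hits, proxies for the wrong targets

n_targets = sum(1 for label, _ in hits if label == "target")   # 100
n_decoys = sum(1 for label, _ in hits if label == "decoy")     # 5
fdr_estimate = n_decoys / n_targets                            # ~5%

# Omitting the decoys does not touch the 5 wrong target IDs they proxy:
reported = [(label, ok) for label, ok in hits if label == "target"]
remaining_errors = sum(1 for _, ok in reported if not ok)      # still 5
```

The reported list is decoy-free yet still 5% wrong, which is precisely what the decoy count was estimating.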
Like others, we once tried to combine multiple search engine scores to improve accuracy, including adopting a Support Vector Machine approach which showed early promise. The outcome improved but not enough. (The SVM paper triggered many ideas and hence was successful as a publication, but the described approach actually doesn’t work if made rigorous.)
Our first breakthrough came about 2.5 years ago when we understood the real problem — that the search engine score was too “polluted” by modeling artifacts impossible to undo by combining with other scores. (That’s inherent in its primary role as an efficient gross filter.) As well, from first principles, important low-abundance peptides have low similarity scores due to poor signal-to-noise. This means a similarity score-based post-search filter would suppress them — the opposite of what makes proteomics valuable!
Embarrassingly, it took almost two years in deep hiatus to go from understanding the problem to discovering more or less the one single solution, what we call SorcererScore. During this time we built the SORCERER GEMYNI software platform for ourselves to do efficient data-mining on our multi-core servers. We stared at many graphs and explored many dead ends.
It turns out, obvious in hindsight, that tandem mass spectrometry yields only two types of quantitative data: precursor and fragment delta-mass. That’s it. Quantitatively, there is no other information. This is therefore essentially the only admissible information for deducing true-vs-false in precision proteomics.
In contrast, traditional analytics incorporated everything but the kitchen sink (multiple scores, w-ions, number of sibling peptides, and lots of other meta-data) that looks good on paper but ruins precision. This reflects the flawed intuition that “more information is always better”, whereas in fact incorporating ambiguous information reduces precision. It’s like getting a second medical opinion from your untrained in-laws.
Instead, the only right way is to squeeze every ounce of information out of the delta-masses, i.e. the above triad. Frankly, this doesn’t leave many degrees of freedom for a competing solution. Although more complex formulas are possible, a simple weighted average works quite well and is easily understood. The only modeling then is choosing the weighting coefficients, which can be set manually (better over the long term) or automatically.
Our standard S-score includes a fourth term, the logarithm of the score-rank, that elevates highly ranked peptide ID hypotheses, based on strong empirical evidence that score rank (but not the scores themselves) is well-correlated with correctness. As an analogy, knowing that a student is highly ranked in a huge school is more informative than her raw GPA.
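Such a weighted form might look like the sketch below. The weights, signs, and exact functional form here are placeholders chosen for illustration, not the actual coefficients used by SorcererScore.

```python
import math

def s_score(delta_mass_ppm, avg_frag_delta_ppm, matched_peaks, score_rank,
            weights=(1.0, 1.0, 0.5, 1.0)):
    """Illustrative S-score: a weighted combination of the triad plus a
    log score-rank term (rank 1 contributes nothing, since log(1) = 0)."""
    w1, w2, w3, w4 = weights
    return (-w1 * abs(delta_mass_ppm)      # small precursor error is good
            - w2 * avg_frag_delta_ppm      # small fragment error is good
            + w3 * matched_peaks           # more matched fragments is good
            - w4 * math.log(score_rank))   # highly ranked hypotheses rise

confident_id = s_score(0.8, 2.1, 14, 1)   # small errors, many peaks, rank 1
dubious_id = s_score(9.5, 7.8, 4, 5)      # large errors, few peaks, rank 5
```

With any reasonable positive weights, a hypothesis with small delta-masses, many matched peaks, and top rank scores well above one with large errors and a low rank.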
In summary, shotgun proteomics fundamentally asks three questions at the peptide level: (1) What peptide sequences are there? (2) How are they modified? (3) How much is there?
SorcererScore answers the first two questions together by identifying modified peptides (we now believe PTM localization should not be a separate problem), and the last as a straightforward area-under-curve (AUC) calculation. Peptide ID is by far the hardest problem, now fundamentally solved just this year. In comparison, quantitation is conceptually simple, with its main complexity in book-keeping, for example in linking a certain “lump” in the MS1 chromatogram to an MS2 scan or in deducing SILAC pairs. Historically, quantitation inaccuracy arose from incorrect peptide IDs, not complex AUC miscalculations.
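To show how conceptually simple the quantitation step is, label-free MS1 quantitation reduces to integrating the precursor’s chromatographic peak. The retention times and intensities below are invented for illustration.

```python
# Extracted-ion chromatogram for one peptide's precursor:
times = [30.0, 30.1, 30.2, 30.3, 30.4, 30.5]          # retention time (min)
intensities = [0.0, 2.0e5, 8.0e5, 7.0e5, 1.5e5, 0.0]  # ion intensity

# Area under the curve by the trapezoidal rule.
auc = sum((times[i + 1] - times[i]) * (intensities[i] + intensities[i + 1]) / 2
          for i in range(len(times) - 1))
# auc is the label-free abundance for this peptide
```

The hard part in practice is the book-keeping, deciding which MS1 “lump” belongs to which MS2 scan, not the integration itself.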
After two years of secret development, we are introducing our precision-focused SORCERER platform into a fundamentally different medical research market. Talented researchers who understand the new success game can leapfrog everybody.
Solve productivity, not just cut costs
Life has been hard for medical researchers, with public funding getting tighter with no end in sight. The problem isn’t that medical research is less needed. (The greying rich are dying to fund research that helps them, so to speak.) Rather, it’s that the traditional business model of medical research is broken, but institutions seem stuck. The problem is productivity, which requires strategic investment, yet most institutions are simply cutting costs. Cutting back on quality software, a tiny percentage of a fully funded total budget, is penny-wise and pound-foolish.
Everyone knows success in the information world means investing in software to drive productivity. Yet proteomics labs insist on cheap or free PC software, many developed by academics in a hurry, to quickly process huge ultra-complex datasets into simple summaries. There are political and financial factors for this, but the net effect is poor productivity which is “lose-lose” for everyone at a time when the field needs “win-win”.
One problem is that to many decision makers, data analysis means “software”, which is one monolithic black-box easily doable with enough warm-bodied programmers. In fact, proteomic analytics starts with “what” to calculate (math), then “how” to compute it efficiently (computer science), and finally coding (programming). Programming is the easiest part. For precision peptide ID, the math took us a decade to solve and the computing for sensitive search required years of engineering. Luckily, given accurate peptide IDs, quantitation is mainly a straightforward book-keeping exercise.
Back when data was simpler, everyone knew quality science meant hypotheses tested against data interpreted by the experimental scientist himself. He might use MS Excel and write a macro to do some specialized calculations, or adopt somebody else’s macro with minor tweaks. Without a platform like Excel, he might have written a simple PC program instead, but that takes more effort for the same ‘value’.
In other words, he maximizes productivity by splitting analysis “software” into two pieces — “app” and “platform” — with opposing value-vs-effort tradeoffs (see table). With a productive platform, he delivers high value with low effort, while for him to develop the platform would be prohibitively difficult with little value. (Platform development is not for amateurs.)
Table 1: Separating “Apps” from “Platform” increases productivity
For the complex data from mass spectrometry, the same model applies but at a higher scale. A capable platform requires robust Linux servers that present all data as huge text files split into structured directories, allowing scripts (basically macros on steroids) to manipulate them. Trained bioinformaticians can customize the sample scripts or develop novel ones.
In other words, the SORCERER GEMYNI platform (same from cloud to physical iDA) frees labs to data-mine their valuable datasets with reduced effort.
For the first time in proteomics, labs large and small have an integrated, turnkey server-class platform, scalable from the small virtual iDA on the cloud to large physical systems, that does most of what they need out-of-the-box and allows them to efficiently develop scripts for advanced needs. (To give credit where credit is due, we integrate the Trans-Proteomic Pipeline/TPP and incorporate parts of ProteoWizard, both of which are of high quality.)
The precision and productivity of SORCERER will revolutionize proteomics. The low-cost of SORCERER Storm on the cloud makes it accessible to every lab for the first time.
A win-win proposition
Smart people tend to over-think everything. We sometimes confuse ourselves with too much meta-data when the right answer is simple.
Here’s the deal: If you care about quality proteomics and don’t expect a free lunch, I want you to sign up for our SORCERER Storm cloud solution, starting at only $3K.
I am basically asking you to crowd-source our collective success in proteomics. It invests in the cause and directly benefits your research, regardless of your current or even future workflow as I will explain.
Why us, why now
In any win-win business transaction there are 3 fundamental questions: Why you? Why me? Why now?
From my point of view, I love what I do. I enjoy solving complex riddles with math and computing, developing novel technology, and helping others advance medicine. If enough researchers sign on, I get to keep doing what I’m doing and more. That’s the selfish part.
Software platform development is deceptively costly, with high fixed R&D costs that must be spread over the user base. The more users, the lower the price per user. For consumer cloud software with millions of users (Gmail, Dropbox), a roughly $20M R&D cost divided among perhaps a million users may come to only $30/user with overhead. It turns out your contact info is worth about $40 to marketers, so IT companies can profitably give you free access by dollarizing your personal info for Nike and other advertisers.
This model doesn’t apply to our niche, however, both because the field is small and because there is little value to advertisers. Therefore, multi-$M R&D costs must be divided among the user base, not advertisers.
Price is based on a projected user base. With enough subscribers, we can deliver more capabilities without raising prices. Most importantly at this juncture, we can show investors that proteomics is a growing market, which will unlock millions of dollars of investment now sitting on the sidelines. New investment will allow us to open regional tech centers, hire some of your students, and help you invent new technologies and medical products.
In other words, your collective modest contribution primes the pump to attract more capital in a positive feedback loop.
Many unfamiliar with the software business model (the drug model is similar) assume that not buying a product due to price forces the price to drop like in a consumer market. In fact, it drives the price up and discourages further R&D, the opposite of what’s needed to advance a data-intensive field. Importantly, a quality platform helps, rather than competes against, technologists by letting you deliver new algorithms productively.
The time is now because of simultaneous disruptions in the market (healthcare) and technology (SorcererScore).
Look, we will obviously continue to build a strong patent portfolio around our uniquely valuable SorcererScore technology.
However, we sincerely hope enough proteomic researchers sign up for SORCERER Storm, so our fiduciary duty focuses outward to help you and the field, not just IP protection.
What’s in it for you
If you can put aside your healthy skepticism for a moment, your decision is simply this:
The future of precision proteomics is here. Are you ready to at least explore the possibility of change, or double-down on status quo?
Whether you are politically committed to a particular workflow or open to new and better options, there is no downside to being among the first to learn it. As with the Excel analogy, even if you’re committed to some linear regression program, it doesn’t hurt to use a separate platform to double-check results or explore something beyond that program’s functionality.
At a minimum, it lets you take off blinders and maybe rose-colored glasses when you can finally see your mass spec data in its full physical glory. If you’re like some of our early access users, the new panoramic point-of-view will blow your mind!
Absent a fatal flaw in the simple theory explained above, traditional analytics works only on abundant peptides, period, which limits its value. Rigorous workflows like TPP tell you truthful statistics about those peptides. Fanciful PC programs add semi-random ones on top.
Just as chemistry discredited alchemists’ attempts to turn lead into gold, mathematics shows how traditional workflows are inherently imprecise and unable to identify low abundance peptides.
SORCERER Storm lets you clearly see this for yourself.
For product information or to sign up, see http://sagenresearch.com/index.php/platform-solutions/sorcerer-storm/ . For questions, please contact Terri at Sales@SageNResearch.com .