SorcererScore: Science of Fragment Mass for Low-Abundance Peptides
The deep proteomics revolution is here!
Deep proteomics is a high-resolution “biochemical X-ray” for low-abundance proteins in cells. It will revolutionize early detection of cancer and infection, and medical research in general.
Figure 1: Scatterplot of “PeakCount vs. Average Fragment Delta-Mass” of top-200 peptide ID hypotheses with non-decoys (green) and decoys (black).
SorcererScore™ brings proteomics to its tipping point by making deep proteomics possible for the first time. Researchers skilled at deep data analysis will benefit most. No matter how accurate the data, deep insights come from interpreting ambiguous data beyond the reach of fully-automated workflows.
Here we illustrate the theory and practice of deep data analysis of fragment mass data.
Proteomics 2.0 starts now
Identifying low- vs. high-abundance peptides is a night-and-day difference requiring different paradigms. Abundant peptides can be readily identified by inexpensive PC programs employing non-scientific shortcuts. This is primarily an instrumentation game.
In contrast, low-abundance modified peptides (LAMPs), the foundation of deep proteomics, require rigorous adherence to the scientific method to avoid irreproducible “p-hacking” results. Those with very low signal-to-noise require human expertise using an interactive visualization platform — just like any other scientific field. LAMP analysis is a game of analytics not instrumentation.
Availability of deep proteomics is a profound game-changer. For mass spec labs, it becomes a must-have to remain relevant. (No one wants abundant-only analyses.) Savvy labs now understand high-accuracy mass spectrometers are mostly interchangeable molecular detectors. Like PCs, mass spectrometers are only as good as the platform and value-added applications.
For Sage-N Research, SorcererScore represents the culmination of our founding objective, which was to make a lasting contribution to medicine through math and computing. Academia excels at ideas, but robust integration is the domain of professional companies — probably why you bought, not built, your mass spectrometer. Our contribution is to integrate critical ideas into one productive system, the SORCERER iDA.
Say we have a pile of peptide ID hypotheses generated by the search engine. How do we discriminate the relatively few correct IDs from the incorrect ones? Most researchers are accustomed to relying on search “similarity” scores, which by definition are wrong for LAMPs (see this post). Instead we need a new paradigm.
Starting from scientific first principles, SorcererScore distills each ID hypothesis to the following 5 parameters:
Average fragment delta-mass
Fragment matched peak-count
Decoy tag (i.e. implicit and/or explicit decoy)
The first 3 are derived directly from primary data (i.e. raw mass). The last 2 are meta-data that exploit statistical patterns from the auto-generated nature of ID hypotheses. Details are in the white paper (Chiang, 2016).
All correct-vs-incorrect discrimination is determined by geometric clustering of these points (vectors) in multi-dimensional space. In short, SorcererScore mathematically maps these into a compact 3D data-cube where a plane can divide ‘correct’ from ‘incorrect’ clusters (see this post). A figure-of-merit (“S-score”) can be assigned to each ID hypothesis point based on its distance to the plane.
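As a minimal geometric sketch of this idea, the S-score of a point can be computed as its signed distance to the partition plane. The plane coefficients and the sample points below are purely illustrative, not the actual SorcererScore mapping:

```python
import numpy as np

def s_score(points, normal, offset):
    """Signed distance from each 3D point to the plane normal . x = offset.

    Points on the 'correct' side of the plane receive positive scores;
    points on the 'incorrect' side receive negative scores.
    """
    normal = np.asarray(normal, dtype=float)
    pts = np.asarray(points, dtype=float)
    return (pts @ normal - offset) / np.linalg.norm(normal)

# Illustrative only: two ID-hypothesis points in a hypothetical 3D data-cube
scores = s_score([[0.9, 0.8, 0.0], [0.2, 0.1, 1.0]],
                 normal=[1.0, 1.0, -1.0], offset=1.0)
```

The sign of each score classifies the point; its magnitude measures how far it sits from the decision boundary, i.e. how confident the classification is.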
Such is the scientifically rigorous, hypothesis-driven SorcererScore paradigm:
‘N’ tandem mass spectra are derived from a digested protein (i.e. peptide) mixture.
A sensitive search engine maps each to 200 top-scoring peptide ID hypotheses.
The ‘200N’ ID hypotheses are mapped to the SorcererScore data-cube, with a partition plane, and each is assigned an S-score based on its distance to that plane.
The ‘200N’ are enriched to roughly one best ID hypothesis (empirically ~0.8*N total).
A subset of these is selected based on the desired FDR (false discovery rate).
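The enrichment and FDR-selection steps above can be sketched in a few lines. The field names (`spectrum_id`, `s_score`, `is_decoy`) are hypothetical placeholders, and the sweep below is a simplified stand-in for the actual SorcererScore selection:

```python
def enrich_and_filter(hypotheses, fdr_target=0.01):
    """Keep the best-scoring hypothesis per spectrum, then threshold at a target FDR.

    `hypotheses` is an iterable of dicts with illustrative keys:
    spectrum_id, s_score, is_decoy. FDR at a cutoff is estimated by
    target-decoy counting: decoys / targets above the cutoff.
    """
    # Enrich the 200N hypotheses down to ~one best-scored ID per spectrum
    best = {}
    for h in hypotheses:
        sid = h["spectrum_id"]
        if sid not in best or h["s_score"] > best[sid]["s_score"]:
            best[sid] = h

    # Sweep S-score cutoffs from high to low; stop when the estimated FDR
    # of the accepted set would exceed the target
    ranked = sorted(best.values(), key=lambda h: h["s_score"], reverse=True)
    accepted, decoys = [], 0
    for h in ranked:
        decoys += h["is_decoy"]
        targets = len(accepted) + 1 - decoys
        if targets > 0 and decoys / targets > fdr_target:
            break
        accepted.append(h)
    return accepted
```

This is a sketch under stated assumptions, not the shipped algorithm; it shows only the shape of the enrich-then-filter pipeline.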
One can envision a layered pyramid of calculations that starts with raw mass spectral data on the bottom. The next layer is the peptide ID hypotheses, which are educated guesses by the search engine. Next is the distilled 5-parameter vectors. Finally, there are S-scores, which are used to assign ‘correct’ and ‘incorrect’ to the bulk of ID hypotheses.
The few remaining ones with ambiguous S-scores may require semi-manual analysis. For example, the researcher may choose to dig deeper into individual ones by tracing back to the 5 parameters, and perhaps all the way to the original spectrum if needed. Here, expert knowledge, say on characteristic ions or sequence-dependent behavior, can help resolve ID hypotheses that coded algorithms cannot.
Committed scientists appreciate that they, not some algorithm, are ultimately in control of data-testing ID hypotheses using primary mass data. Deep data analysis can never be far from raw primary data. A capable data platform makes that possible.
We believe this SorcererScore methodology is more or less the only possible hypothesis-driven methodology in deep proteomics. This is because hypotheses in mass spectrometry must be directly testable with mass data, particularly of fragment ions.
Solving the riddle of fragment mass analysis
Accurate fragment mass turns out to be central to LAMP identification, but it’s a double-edged sword. Unlike precursor mass, which is one number, dozens of predicted fragment masses must be matched among hundreds of generic measured peaks that can be signal or background noise. Any individual match can be random noise, but multiple matches in the aggregate can increase signal-to-noise by using mass accuracy. The challenge is how.
The best answer turns out to be one of the simplest: We compute two values — (1) number of matched peaks (‘PeakCount’) within +/- 0.6 amu and (2) average fragment delta-mass (‘dFragMass’). The trick is to recognize they are not independent parameters, but must be considered together. The data show why.
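A minimal sketch of computing these two values follows. The matching rule (nearest observed peak per predicted fragment, within the ±0.6 amu window) is an illustrative assumption; the actual SorcererScore matching logic is detailed in the white paper:

```python
def fragment_match(predicted, observed, tol=0.6):
    """Match predicted fragment masses against observed peaks within +/- tol amu.

    Returns (peak_count, avg_delta_mass): the number of predicted fragments
    with an observed peak inside the tolerance window ('PeakCount'), and the
    average delta-mass (observed - predicted) over those matches ('dFragMass').
    """
    deltas = []
    for p in predicted:
        # Nearest observed peak to this predicted fragment mass
        nearest = min(observed, key=lambda o: abs(o - p))
        if abs(nearest - p) <= tol:
            deltas.append(nearest - p)
    peak_count = len(deltas)
    avg_delta = sum(deltas) / peak_count if peak_count else float("nan")
    return peak_count, avg_delta
```

A correct ID tends to yield a high PeakCount with deltas clustered near zero, while a random ID yields few matches with deltas scattered across the window — which is exactly the joint structure the scatterplots below exploit.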
Consider the dataset and search conditions in this previous post. Figure 1 above is a scatterplot of “PeakCount vs. dFragMass” for all top-200 peptide ID hypotheses, each colored green (non-decoy) or black (decoy). Per target-decoy methodology, random background is 50/50 green/black while correct IDs are mostly green. Jitter is added to PeakCount to enhance visualization of integer values. General separation of ‘correct’ (almost all green) and ‘incorrect’ (mixed black/green) is clear.
Figure 2: Scatterplot of “PeakCount vs. Average Fragment Delta-Mass” for top-1 peptide ID hypotheses.
To see the underlying structure of the data, we subset to the top-1 ID hypotheses in figure 2. Separate ‘correct’ and ‘incorrect’ clusters suggest a partition line (dotted), from which a score can be derived for each point based on its distance to the line.
Figure 3 is a zoom-in view of figure 1, showing the same sloped partitioning of figure 2.
One can visualize S-scores being computed for each of the top-200 phospho-peptide ID hypotheses, which contain many same-backbone replicates. These raw ID hypotheses are enriched to roughly one best S-scored ID for each spectrum. The bulk can be auto-classified as ‘correct’ or ‘incorrect’. Those in the transition region can be resolved using additional information (e.g. precursor mass) or semi-manual analysis.
Two key points. First, we can accept or reject the bulk of peptide ID hypotheses using only geometric patterns, requiring no sophisticated mathematics or special training.
Second, the correct-vs-incorrect boundary is clearly sloped, i.e. not vertical. This suggests that using fragment mass in isolation, for example as a fragment mass tolerance to narrow the search, is fundamentally flawed.
Wrong way to use fragment mass
Proteomics has been hurt by simplistic answers to hard analytics questions that set back progress. We now understand the mathematics of the irreproducibility that has plagued our field.
Common (flawed) intuition sees fragment accuracy as a mass tolerance parameter to narrow the search space. Sure enough, if you tighten mass tolerances of both fragment and precursor, and search against a small protein database containing mainly viable peptides, then all the peptide ID hypotheses — including inevitable random results — look superficially right! By definition, they are all viable peptides that satisfy both tight mass tolerances.
However, the false discovery rate (FDR) can betray their randomness. A rigorous post-search filter (e.g. SorcererScore) would show these correct-looking ID hypotheses are mostly wrong. But a popular non-rigorous post-search algorithm loops on more than a dozen secondary parameters to find a low-FDR subset. This is the textbook definition of p-hacking, or since we don’t use p-values, “FDR-hacking”.
In other words, the popular advice of tightening both search mass tolerances, and then using a FDR-hacking post-search filter algorithm, guarantees plenty of correct-looking IDs at low FDR independent of data quality. Such semi-random results are of course irreproducible.
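Target-decoy counting is what exposes this randomness: each accepted decoy stands in for roughly one false target hit, so a mostly-random result set yields an estimated FDR near 100%. A one-line sketch (the counts below are made up for illustration):

```python
def target_decoy_fdr(n_target, n_decoy):
    """Estimate FDR from accepted target/decoy counts.

    Under target-decoy methodology, each accepted decoy hit stands in for
    approximately one false target hit, so estimated FDR ~= n_decoy / n_target.
    """
    return n_decoy / n_target if n_target else 0.0

# Illustrative: 480 decoys accepted alongside 500 targets -> estimated FDR 96%,
# i.e. the "correct-looking" set is essentially random.
```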
Compare this to the non-rigorous search algorithm first published by Walt Disney in “Cinderella”, where the prince also misused a physical metric (shoe-size) to drive the search to identify a mysterious visitor.
We now know that a scientifically trained prince would have used meta-information — gender, age range, height, etc. — to drive a loose search to enrich to a handful of candidate females, who are then post-search filtered with the shoe size. Clearly, quantitative hard data must be reserved for the post-search filter.
The prince’s non-rigorous algorithm — “Marry the first girl who fits the shoe” — can only yield a unique right answer by “finessing” conditions, say in a really small town with few girls. It famously worked for Cinderella because it’s a fairy tale. In science, fairy tales occur in demo labs where controlled conditions can make any algorithm work.
So look at your proteomics workflow right now, in the context of peptide ID hypotheses being auto-generated by the search engine.
If the post-search filter uses scores and other non-mass data, it is soft-science. Only if it uses raw fragment and precursor mass data as in SorcererScore can it be hard-science.
If it uses these primary data to formulate hypotheses — i.e. as tight mass tolerances — rather than test them, then it’s not really science at all. This unfortunately describes the “demo-quality” analytics used in published experiments in recent years.
Implications of proteomics as a hard science
For two decades proteomics was a soft science. Low data accuracy was the initial cause. Even when accurate data became available, the legacy soft-science workflow kept it so. A soft-science workflow can never identify LAMPs and hence is useless for high-value research.
We can mathematically prove that a hard-science workflow must: (1) use a cross-correlation search engine and (2) have a primary post-search filter with only hard mass data, particularly of fragment ions. Empirically, fragment mass analysis must include the matched peak-count.
In other words, any hard-science workflow must fundamentally use our patent-pending SorcererScore technology. And for practical computing — i.e. not requiring 50x more CPUs — the cross-correlation search part requires our patented partial-index methodology.
Until we formally license our technology, researchers can only get it from Sage-N Research in a SORCERER integrated data appliance. As a public service, we will work with interested labs in a data-analysis-as-service project on a limited basis.
For soft-science workflows, there are many choices for search engines, post-search filters, and workflows. They are available on PCs, servers, and perhaps cloud computing. These all exploit degrees of freedom from imprecision. But imprecision kills high-value research.
To join the deep proteomics revolution, please contact Terri (Sales@SageNResearch.com) for information or a price quote. Prices must necessarily increase commensurate with value. As appreciation for past support, we will do our best to extend price breaks to current and former SORCERER users who act now.
Chiang (2016). How to Identify Low-Abundance Modified Peptides with Proteomics Mass Spectrometry. Sage-N Research, Inc. White Paper SorcWP101A, 2016 May 5.