SorcererScore™ for Identifying Spliced Peptides in Immunology
“Data analysis”, often trivialized by those who don’t understand deep data, can range from trivial to extremely difficult. Analyzing mass spectrometry (MS) data is fundamentally solving math puzzles with computers. Solving a mostly-filled Sudoku can be easy (even for an amateur-written PC program), but a gigantic version with millions of possibly ambiguous entries (representing data with added noise) requires skilled detective work. Solving hard MS puzzles can have a multimillion-dollar impact by accelerating biomarker and clinical discovery.
Immunology is a hot clinical area involving tricky peptide analysis. A recent paper (Liepe et al., October 2016) used a complex bioinformatics workflow to discover that proteasome-generated spliced peptides (PGSPs) in HLA class I ligands are more common than previously thought. With robust bioinformatics, immuno-peptide analysis can be an MS killer app.
We illustrate how to use the SORCERER GEMYNI platform to explore improvements on the paper’s ideas, particularly to increase sensitivity, robustness, and automation. Interested customers can use our prototype scripts and tech support to assemble a semi-custom immuno-peptidomics workflow.
Why immuno-peptide ID is hard
Conventional workflows can identify abundant peptides with known sequences, but immuno-peptides may be neither abundant nor known. SorcererScore™ already handles low-abundance peptides. Here we discuss using it for immuno-peptides, whose sequences may be unknown.
There are two ways to infer a sequence that is not in an existing sequence database: a “bottom-up” approach that deduces it directly from fragment ion m/z values (“de novo sequencing”), or a “top-down” approach that builds a new sequence database for a search engine.
De novo sequencing, an area of interest a decade ago, proved to be sensitive to extraneous noise peaks that cause false positives. We believe this is fundamental and cannot be fixed algorithmically, at least as a standalone methodology.
To synthesize the search database, Liepe et al. used known proteasome splicing rules. For example, PGSPs are spliced from either forward or reversed subsequences of their original peptides of certain lengths. The paper starts with subsequences from the SwissProt database that are spliced in silico, then sectioned into lengths of 9 to 12 to form a huge theoretical sequence database. To fit into a manageable file size, only sequences whose precursor masses are relevant to the mass spec data are included.
The paper pioneered the use of a synthetic sequence database to allow standard search engine workflows. However, subtle biases can arise from the assumptions used to build it. The published methodology does not consider modifications and is not conducive to automation.
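The database-building idea described above can be sketched in a few lines of Python. This is a simplified illustration, not the paper's actual pipeline: the function names, the 25-residue maximum gap between fragments, and the 10 ppm tolerance are all assumptions chosen for the example, and only splicing within a single protein is considered.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added to the sum of residue masses


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def spliced_candidates(protein, min_len=9, max_len=12, max_gap=25):
    """Yield peptides made of two non-adjacent fragments of one protein.

    Fragment A lies upstream of fragment B with 1..max_gap residues between
    them (max_gap is a hypothetical parameter). Both the forward order (A+B)
    and the reversed order (B+A) are emitted, each peptide only once.
    """
    n = len(protein)
    seen = set()
    for total in range(min_len, max_len + 1):
        for len_a in range(1, total):
            len_b = total - len_a
            for i in range(n - len_a + 1):
                a = protein[i:i + len_a]
                a_end = i + len_a
                last_j = min(a_end + max_gap, n - len_b)
                for j in range(a_end + 1, last_j + 1):
                    b = protein[j:j + len_b]
                    for pep in (a + b, b + a):
                        if pep not in seen:
                            seen.add(pep)
                            yield pep


def filter_by_precursor(peptides, precursor_masses, tol_ppm=10.0):
    """Keep only peptides whose mass matches an observed precursor mass."""
    kept = []
    for pep in peptides:
        m = peptide_mass(pep)
        if any(abs(m - pm) <= pm * tol_ppm * 1e-6 for pm in precursor_masses):
            kept.append(pep)
    return kept


# Toy usage with a made-up 20-residue "protein" and one observed mass.
protein = "ACDEFGHIKLMNPQRSTVWY"
candidates = set(spliced_candidates(protein))
observed = [peptide_mass("ACDEFHIKL")]
database = filter_by_precursor(candidates, observed)
```

Even for this 20-residue toy sequence the unfiltered candidate set is large, which is why restricting it to the precursor masses actually observed keeps the file size manageable.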
In summary, pure bottom-up is probably inherently non-robust, while pure top-down is sensitive to assumptions in building the synthetic search database.
Middle-out with SorcererScore
After several weeks of prototyping scripts on GEMYNI, we found the best approach to be the happy medium: We build a synthetic sequence database. But instead of using pure splicing assumptions, we use partial fragment ion information (just like de novo).
The gist is this: A fragment mass spectrum contains b-ions and y-ions corresponding to the left and right subsequences. The key is to find at least one pair of overlapping b- and y-ion sequences that, when merged, has the same precursor mass as the original spectrum. All such merged sequences are included in the synthetic peptide sequence database.
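The merge step can be illustrated with a minimal Python sketch. It assumes the partial b-ion (left) and y-ion (right) sequences have already been obtained as plain strings; the function names, the minimum overlap of 2 residues, and the 0.02 Da tolerance are hypothetical choices for the example rather than SorcererScore parameters.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def merge_overlapping(left, right, min_overlap=2):
    """Yield merged sequences where the end of `left` equals the start of `right`."""
    max_k = min(len(left), len(right))
    for k in range(min_overlap, max_k + 1):
        if left[-k:] == right[:k]:
            yield left + right[k:]


def candidates_for_spectrum(b_seqs, y_seqs, precursor_mass, tol_da=0.02):
    """Merge every b/y pair and keep merges that match the precursor mass."""
    found = set()
    for b in b_seqs:
        for y in y_seqs:
            for merged in merge_overlapping(b, y):
                if abs(peptide_mass(merged) - precursor_mass) <= tol_da:
                    found.add(merged)
    return found


# Toy usage: one correct pair and one pair that overlaps but has the wrong mass.
precursor = peptide_mass("ACDEFHIKL")
hits = candidates_for_spectrum(["ACDEFH"], ["FHIKL", "EFHIKW"], precursor)
```

The precursor-mass check is what discards the second pair: it overlaps cleanly but merges to a sequence with a different total mass.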
To find overlapping b-/y-sequences, we do a narrow cross-correlation search of up to 20 sub-sections of each mass spectrum, which increases the chances of getting at least one good pair. For this particular search, we can use a standard database like SwissProt plus its reversed sequences (since reversed sequences can be spliced). Preliminary analysis shows that the ‘XCorr’ search scores of valid b/y pairs are not necessarily the highest, but tend to be about half of the highest score.
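A rough Python sketch of the two preparation steps follows. It rests on two assumptions that the text does not spell out: that a “sub-section” is a contiguous, equal-width m/z window, and that reversed entries are simply appended to the database under a prefixed identifier. The function names and the `rev_` prefix are hypothetical.

```python
def with_reversed(records, tag="rev_"):
    """Return a copy of {id: sequence} with a reversed copy of every entry added."""
    out = dict(records)
    for pid, seq in records.items():
        out[tag + pid] = seq[::-1]
    return out


def split_into_windows(peaks, n_windows=20):
    """Split (mz, intensity) peaks into up to n_windows equal-width m/z windows.

    Empty windows are dropped, so fewer than n_windows lists may be returned.
    """
    if not peaks:
        return []
    lo = min(mz for mz, _ in peaks)
    hi = max(mz for mz, _ in peaks)
    width = (hi - lo) / n_windows or 1.0  # avoid zero width for a single m/z
    windows = [[] for _ in range(n_windows)]
    for mz, inten in peaks:
        idx = min(int((mz - lo) / width), n_windows - 1)
        windows[idx].append((mz, inten))
    return [w for w in windows if w]


# Toy usage with a made-up protein entry and four made-up peaks.
search_db = with_reversed({"P1": "ACD"})
peaks = [(100.0, 1.0), (150.0, 2.0), (200.0, 3.0), (300.0, 4.0)]
sections = split_into_windows(peaks, n_windows=4)
```

Each window would then be searched separately against the combined forward and reversed database, so a spectrum has up to 20 chances to produce a usable b/y pair.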
The central combinatorics trick is the idea that, even though the entire PGSP sequence is not in SwissProt, it is highly likely that most subsequences are.
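A toy Python illustration of this point, using a made-up protein and a made-up spliced peptide (the function name and the choice of 3-residue subsequences are assumptions for the example):

```python
def subsequence_coverage(peptide, database, k=3):
    """Count how many length-k subsequences of `peptide` occur in any database entry."""
    kmers = [peptide[i:i + k] for i in range(len(peptide) - k + 1)]
    present = sum(1 for kmer in kmers if any(kmer in seq for seq in database))
    return present, len(kmers)


# "ACDEFHIKL" joins "ACDEF" and "HIKL", skipping the "G" between them.
database = ["ACDEFGHIKLMNPQRSTVWY"]
spliced = "ACDEFHIKL"
whole_peptide_found = any(spliced in seq for seq in database)
present, total = subsequence_coverage(spliced, database)
```

Here the full spliced peptide is absent from the database, yet 5 of its 7 three-residue subsequences are present; only the two that span the splice junction are missing.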
The evidence to date suggests this simple middle-out approach works well. We build on Liepe et al while increasing sensitivity, robustness, and automation.
At this point, we await interested customers to continue this work with real data.
Power of the GEMYNI platform
The Precision Proteomics Revolution is here! The new paradigm focuses on high-value translational research. It also marks the end of one-size-fits-all bioinformatics.
We previously discussed hybrid mixtures. Here we discuss our brief (about 5 weeks) research on immuno-peptidomics. These are just a few applications that are difficult if not impossible for most labs without SORCERER GEMYNI. With it, rapid algorithm prototyping is possible in days to weeks.
Bioinformatics is the foundation of Precision Proteomics.
Juliane Liepe, Fabio Marino, John Sidney, Anita Jeko, Daniel E. Bunting, Alessandro Sette, Peter M. Kloetzel, Michael P. H. Stumpf, Albert J. R. Heck and Michele Mishto. “A large fraction of HLA class I ligands are proteasome-generated spliced peptides.” Science 354 (6310), 354-358 (October 20, 2016). doi: 10.1126/science.aaf4384