Low-Abundance Peptides Are Everywhere in Proteomic MS1 Scans

Proteomics mass spectrometry holds unlimited potential for translational “bench-to-bedside” medical research, but until now it has lacked three must-haves: robustness, sensitivity, and verifiability. Clinical impact requires dependable analysis of low-abundance peptides/proteins, with results that can be traced directly back to raw data. In contrast, most labs process data with low-priced or academic proof-of-concept software that lacks some if not all of these requirements.

Here we address low-abundance and/or modified peptides (LAMPs). Specifically, we illustrate how to use the SORCERER GEMYNI data-mining platform to find, characterize, and quantify LAMPs in the intact peptide (“MS1”) mass spec data.

Our internal studies suggest: (1) LAMPs are distributed throughout MS1 data, (2) many exist for a split-second and are captured in a single MS1 scan, and (3) those exhibiting two isotopic peaks allow estimation of charge, intact mass, and rough peptide length. If SILAC light/heavy pairs are also identified, their number of ‘K’ and ‘R’ amino acid residues can also be inferred.

If you want something done right, do it yourself. A true “platform” lets you customize visualization and manipulation, which is critical for leading-edge research. No one but the scientists themselves can comprehensively analyze their own data.

Data-mining mass spec data is like analyzing a tiny drop of pond water with a powerful microscope — the deeper you look the more you see. In contrast, using a canned PC program is like being told by someone else what’s in your sample. You don’t learn as much and have no direct way to verify results.

Chromatography and mass spectrometry remain tricky yet doable, but the X-Factor is in data analysis as noted by the title of Scott Patterson’s 2003 paper (“Data Analysis – the Achilles heel of proteomics”, Nature Biotech)[1].

Basics of MS1 data

In shotgun proteomics, the first stage of a tandem mass spectrometer measures the mass/charge (m/z) ratio of peptides from an enzyme-digested protein mixture. These “MS1” data comprise triples of (scan#, m/z, intensity), each of which may also carry the chromatographic retention time associated with the scan#.

Contrary to common preconception, MS1 data are quite easy to analyze, at least in principle, but complex in practice because of overwhelming volume and overlapping peaks.

Raw MS1 data points from MSV000079761, a SILAC yeast mzXML file available from ProteomeCommons[2]:

 scan  r-time    m/z     intensity
 ----- ------- -------- ------------
 18484 7120.63 448.7656      0.0000
 18484 7120.63 448.7692  28047.1641
 18484 7120.63 448.7728 175944.5938
 18484 7120.63 448.7764 420176.2812
 18484 7120.63 448.7800 632575.6875
 18484 7120.63 448.7836 661858.8750
 18484 7120.63 448.7871 484415.1875
 18484 7120.63 448.7907 230600.1250
 18484 7120.63 448.7943  53462.0352
 18484 7120.63 448.7979   6120.7920
 18484 7120.63 448.8015      0.0000
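
The rows above are simple enough to read with a few lines of script. The following is a minimal sketch, not the SORCERER tooling itself: the function name `parse_rows` is hypothetical, and the intensity-weighted centroid is shown only as one common alternative to taking the most intense data point as the peak center.

```python
# Rows copied from the listing above: scan, r-time, m/z, intensity.
RAW = """\
18484 7120.63 448.7656      0.0000
18484 7120.63 448.7692  28047.1641
18484 7120.63 448.7728 175944.5938
18484 7120.63 448.7764 420176.2812
18484 7120.63 448.7800 632575.6875
18484 7120.63 448.7836 661858.8750
18484 7120.63 448.7871 484415.1875
18484 7120.63 448.7907 230600.1250
18484 7120.63 448.7943  53462.0352
18484 7120.63 448.7979   6120.7920
18484 7120.63 448.8015      0.0000
"""


def parse_rows(text):
    """Turn whitespace-separated text into (scan, rtime, mz, intensity) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        rows.append((int(parts[0]), float(parts[1]), float(parts[2]), float(parts[3])))
    return rows


rows = parse_rows(RAW)
apex = max(rows, key=lambda r: r[3])                # most intense data point
total = sum(r[3] for r in rows)
centroid = sum(r[2] * r[3] for r in rows) / total   # intensity-weighted mean m/z

print(f"apex m/z = {apex[2]:.4f}, centroid m/z = {centroid:.4f}")
```

For this peak the apex and the centroid differ only in the third decimal place, so either could serve as the peak center for the rough estimates that follow.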

Figure 1: Intensity vs. m/z for two peaks

Figure 1 plots the above abundant peak (centered at m/z=448.7836) plus another (m/z=449.2833). The delta of 0.4997 m/z, close to the 0.5 m/z spacing expected between isotopic peaks of a doubly charged ion, suggests the peaks are isotopes of charge +2.
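
The charge estimate can be written as a one-line calculation. This is a minimal sketch under two assumptions: that the two peaks are adjacent isotopes, and that the isotope spacing is about 1.00335 Da (the approximate 13C/12C mass difference). The function name `estimate_charge` is hypothetical.

```python
C13_C12_DELTA = 1.00335  # approximate 13C - 12C mass difference in Da (assumption)


def estimate_charge(mz_mono, mz_next):
    """Estimate charge from the m/z spacing of two adjacent isotopic peaks."""
    spacing = mz_next - mz_mono
    if spacing <= 0:
        raise ValueError("mz_next must be larger than mz_mono")
    return round(C13_C12_DELTA / spacing)


charge = estimate_charge(448.7836, 449.2833)
print(charge)  # 2
```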

This means the intact mass (expressed as the singly protonated MH+ ion) is roughly 2*(448.7836 – 1.00728) + 1.00728, or 896.5599 amu.
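
The same arithmetic as a small script, assuming a proton mass of about 1.00728 Da. The function names are hypothetical; the neutral mass is simply the singly protonated value minus one proton.

```python
PROTON = 1.00728  # approximate proton mass in Da (assumption)


def neutral_mass(mz, charge):
    """Neutral peptide mass implied by an m/z value and a charge state."""
    return charge * (mz - PROTON)


def singly_protonated_mass(mz, charge):
    """Mass of the same peptide carrying a single proton (MH+)."""
    return neutral_mass(mz, charge) + PROTON


mh = singly_protonated_mass(448.7836, 2)
print(f"{mh:.4f}")  # 896.5599
```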

Their peak ratio is 237K/662K = 0.36. A useful rule of thumb is that this ratio is roughly equal to the number of carbons divided by 80[3], especially if the peptide doesn’t contain sulfur. So there may be roughly 0.36*80 or 29 carbons. If we assume 4.9 carbons/residue on average, then the peptide is roughly a 6-mer.
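
A sketch of this estimate, taking the divide-by-80 rule of thumb and the 4.9 carbons/residue average from the text as given. The function names are hypothetical, and the result is only as good as those two constants.

```python
CARBONS_PER_UNIT_RATIO = 80  # rule of thumb from the text
CARBONS_PER_RESIDUE = 4.9    # average assumed in the text


def estimate_carbons(mono_intensity, next_intensity):
    """Rough carbon count from the ratio of the second to the first isotopic peak."""
    return CARBONS_PER_UNIT_RATIO * next_intensity / mono_intensity


def estimate_length(carbons):
    """Rough peptide length in residues from a carbon count."""
    return carbons / CARBONS_PER_RESIDUE


carbons = estimate_carbons(662e3, 237e3)
length = estimate_length(carbons)
print(round(carbons), round(length))  # 29 6
```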

To recap, with just two isotopic peaks from a single MS1 scan, we can estimate the charge, intact mass, and rough peptide length. This information can potentially be used as a form of “peptide mass fingerprinting” (PMF) to supplement a standard proteomics workflow, for example to double-check protein quantitation.

Distribution of MS1 peaks

Using “log10(Intensity)” instead of “Intensity” is easier when working with LAMPs. Figure 2 shows the log(Intensity) version of Figure 1.

Figure 2: Log(Intensity) vs. m/z

Figure 3 plots the distribution of “Log10(Intensity)” for about 100 MS1 scans, which is asymmetric and Gaussian-like. Since we expect the number of peptides with Log(Intensity)<4.5 to be far higher than shown, we hypothesize that the distribution is shaped by both instrumentation limits (left side) and the distribution of peptides in the sample (right side).
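
A distribution like this can be tallied with a short script. The sketch below is illustrative only: the function name and the 0.5-wide bins are our assumptions, and the input is just the eleven data points listed earlier rather than the roughly 100 scans behind Figure 3.

```python
import math
from collections import Counter


def log_intensity_histogram(intensities, bin_width=0.5):
    """Count log10(intensity) values per bin; zero intensities are skipped."""
    hist = Counter()
    for value in intensities:
        if value <= 0:
            continue  # log10 is undefined for empty data points
        bin_start = math.floor(math.log10(value) / bin_width) * bin_width
        hist[bin_start] += 1
    return hist


# Intensities from the single-scan listing earlier in the article.
intensities = [0.0, 28047.1641, 175944.5938, 420176.2812, 632575.6875,
               661858.8750, 484415.1875, 230600.1250, 53462.0352, 6120.7920, 0.0]

hist = log_intensity_histogram(intensities)
for bin_start in sorted(hist):
    print(f"{bin_start:4.1f} {'#' * hist[bin_start]}")
```

Run over many scans, the same tally would give the kind of curve described above, with the left side thinning out as the instrument stops recording weak signals.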

Figure 3: Distribution of Log(Intensity) from roughly 100 MS1 scans

Here, for practical reasons, we define LAMPs in terms of instrumentation capability instead of absolute quantity, since we are interested in sensitive analysis, not precise terminology. In other words, a peptide can be difficult for mass spectrometry to capture because of low abundance or chemical modification.

The significance of the distribution is that we expect robust m/z and peak intensity only for log(Intensity) > 4.5. Below this point, accuracy may be compromised, or the peak may not be captured at all. We can test this by looking at data.

Map view of (r-time, m/z, logInt) triples

Instead of plotting on continuous axes of r-time and m/z (for example with heat maps), we find it useful to put “log(Intensity)” into a giant table where rows and columns are m/z and r-time, respectively. (To avoid the decimal point, the printed value is 10*log(Int), so “45” means log(Int)=4.5.)
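
A minimal sketch of how such a table could be built from (r-time, m/z, intensity) triples. The function names are hypothetical, and rounding 10*log(Int) to the nearest integer is our assumption about how the printed values are derived. The input here is the single scan listed earlier, so the output has only one column.

```python
import math


def build_map(triples):
    """Map (mz, rtime) -> rounded 10*log10(intensity), skipping empty points."""
    cells = {}
    for rtime, mz, intensity in triples:
        if intensity > 0:
            cells[(mz, rtime)] = round(10 * math.log10(intensity))
    mzs = sorted({mz for mz, _ in cells}, reverse=True)  # rows, high m/z first
    rts = sorted({rt for _, rt in cells})                # columns, early first
    return mzs, rts, cells


def render(mzs, rts, cells):
    """Format the map as text, leaving blanks where no peak was recorded."""
    lines = []
    for mz in mzs:
        row = " ".join(f"{cells[(mz, rt)]:2d}" if (mz, rt) in cells else "  "
                       for rt in rts)
        lines.append(f"{mz:9.4f}: {row}".rstrip())
    return "\n".join(lines)


# Data points from scan 18484 (r-time 7120.63) in the earlier listing.
scan = [(448.7692, 28047.1641), (448.7728, 175944.5938), (448.7764, 420176.2812),
        (448.7800, 632575.6875), (448.7836, 661858.8750), (448.7871, 484415.1875),
        (448.7907, 230600.1250), (448.7943, 53462.0352), (448.7979, 6120.7920)]
triples = [(7120.63, mz, inten) for mz, inten in scan]

mzs, rts, cells = build_map(triples)
print(render(mzs, rts, cells))
```

Feeding in every scan in a retention-time window would add one column per scan, which is the layout used in the maps that follow.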

For example, the following map shows 4 abundant peaks spanning 23 seconds that represent SILAC light/heavy pairs and clearly suggest a +2 charged peptide. A 5 m/z light/heavy separation with +2 charge (a mass shift of about 10 amu) suggests a single R in the peptide. Their SILAC quantitation ratio is approximately 1, which is expected for most SILAC light/heavy pairs.
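
The K/R inference can be sketched as a small search over label counts. The article does not state which heavy labels were used, so the mass shifts below (about +8.0142 Da per heavy K and +10.0083 Da per heavy R) are an assumption, as are the function name, the tolerance, and the maximum count.

```python
# Assumed heavy-label mass shifts in Da (Lys8 / Arg10). The label set is an
# assumption; substitute the values for the labels actually used.
HEAVY_K = 8.0142
HEAVY_R = 10.0083


def infer_kr_counts(mz_light, mz_heavy, charge, tol=0.02, max_count=3):
    """List (number of K, number of R) combinations matching the mass shift."""
    shift = (mz_heavy - mz_light) * charge
    matches = []
    for n_k in range(max_count + 1):
        for n_r in range(max_count + 1):
            if n_k + n_r == 0:
                continue  # an unlabeled peptide has no light/heavy pair
            if abs(n_k * HEAVY_K + n_r * HEAVY_R - shift) <= tol:
                matches.append((n_k, n_r))
    return matches


# Light and heavy monoisotopic peak centers from the map, at charge +2.
print(infer_kr_counts(448.7836, 453.7857, 2))  # [(0, 1)]
```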

                     rt=7107.64 to 7130.73 (9 scans)
                      -----------------------------
 454.3011[ 9666](  2):             42
 454.2975[ 9665](  8):    42 48 48 49 48 45 42
 454.2938[ 9664](  7):    47 52 52 53 52 50 46
 454.2901[ 9663](  7):    49 54 54 54 53 51 48     HEAVY SILAC (mono+0.5m/z)
 454.2865[ 9662](  8):    49 54 54 54 54 51 47     <--- Center at 454.2865
 454.2828[ 9660]( 11):    48 52 52 52 52 49 44
 454.2792[ 9658]( 12):    44 49 48 48 48 45
 454.2755[ 9656]( 10):       41       38
 .
 .
 .
 453.8003[ 9589](  6):    39 40 43 39 42 41
 453.7967[ 9588](  7):    48 50 52 51 51 49 45 41
 453.7930[ 9586](  8):    52 55 56 56 55 53 50 47
 453.7893[ 9584](  7):    54 57 58 58 57 55 52     HEAVY SILAC (monoisotopic)
 453.7857[ 9583](  7):    55 57 59 58 57 55 53     <--- Center at 453.7857
 453.7820[ 9580]( 10): 48 54 56 57 57 56 54 51
 453.7784[ 9579]( 15): 45 50 53 54 54 53 50 47
 453.7747[ 9577]( 32): 41 42 47 48 46 45 45 41
 453.7711[ 9576]( 21):                   44
 453.7674[ 9574]( 17):                   41
 .
 .
 .
 449.2977[ 8982](  2):                35 38
 449.2941[ 8981](  7):    40 47 46 47 48 46 40 36
 449.2905[ 8979](  8):    47 51 51 51 53 51 48 44
 449.2869[ 8978](  8):    49 53 54 53 55 53 51 46  LIGHT SILAC (mono+0.5m/z)
 449.2833[ 8977](  9):    50 54 54 54 55 53 51 46  <--- Center at 449.2833
 449.2797[ 8976]( 13):    49 52 53 52 54 52 50 44
 449.2761[ 8975]( 15):    45 49 50 49 50 48 46
 449.2725[ 8974]( 20):       43 42    41 42
 .
 .
 .
 448.7979[ 8926](  5):       37 32 38 37 39
 448.7943[ 8925](  7):    42 47 48 47 49 46 43
 448.7907[ 8924](  8):    49 52 54 54 54 53 50 46
 448.7871[ 8922](  8):    52 55 58 57 57 56 53 50  LIGHT SILAC (monoisotopic)
 448.7836[ 8921](  7):    54 56 59 58 59 57 54 51  <--- Center at 448.7836
 448.7800[ 8919](  8):    54 56 58 58 58 57 54 51
 448.7764[ 8917]( 10):    52 54 57 56 56 55 52 49
 448.7728[ 8916]( 11):    48 49 53 52 52 51 48 46
 448.7692[ 8915](  4):             44

Map showing LAMPs

When we map out a slightly larger slice of the (rtime, m/z) space, we see a multitude of single-scan runs of typically five 40-something values. (Enlarge Figure 4 PDF image for details.) This 5 m/z wide slice includes astonishing complexity.

The fact that these runs are consistent with peptides in both value and shape, and that occasional isotopic pairs are present, suggests these are the elusive LAMPs.
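
One way to pull such runs out of a map is sketched below. It is not the SORCERER implementation: the function names, the 40 to 49 value window, and the minimum run length of 4 are our assumptions, and the example columns are made-up values for illustration only, not data from the file.

```python
def find_runs(column, min_len=4, lo=40, hi=49):
    """Return (start, end) index pairs of consecutive cells with lo <= value <= hi.

    `column` holds one scan's map values in m/z order, with None for blanks.
    """
    runs, start = [], None
    for i, value in enumerate(list(column) + [None]):  # sentinel closes the last run
        inside = value is not None and lo <= value <= hi
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


def is_single_scan(run, prev_column, next_column):
    """True if the same m/z rows are blank in both neighboring scans."""
    start, end = run
    neighbors = prev_column[start:end + 1] + next_column[start:end + 1]
    return all(value is None for value in neighbors)


# Hypothetical columns for three consecutive scans (illustrative values only).
prev_col = [None] * 16
this_col = [None, None, 41, 44, 46, 44, 41, None, None, 43, None, 54, 57, 58, 57, 54]
next_col = [None] * 16

runs = find_runs(this_col)
print(runs)  # [(2, 6)]
print([is_single_scan(run, prev_col, next_col) for run in runs])  # [True]
```

The abundant peak at the end of the example column is ignored because its values fall outside the 40-something window, and the isolated 43 is ignored because it is too short to have a peak shape.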

We are unaware of prior reports of their existence in the MS1 data.

Indeed, LAMPs are generally invisible to conventional proteomics workflows geared toward abundant peptides. Their fleeting nature means they don’t appear in both MS1 and MS2 scans. In addition, many workflows (e.g. SILAC) specifically average multiple scans, which dilutes any signal present in only one scan.

Figure 4: Peak map in PDF form showing many likely LAMPs

Conclusion

We hoped to show that LAMPs are key to translational research success, that they are everywhere, and that they are at least conceptually simple to analyze.

LAMP analysis is fundamentally a data-mining problem, whereby the deeper one digs the more one finds. The paradigm involves asking a question, coding a script to collect the data, then running it overnight to see if the hypothesis holds true.

Unlike PCs, server systems comprise server-class hardware, operating system, and software subsystems designed for high-reliability 24/7 operation. The SORCERER platform partitions data files (MS1, MS2) and search results into structured directories, with a library of tools to run scripts within each subdirectory.

For many applications, Sage-N Research may be able to supply prototype scripts at no cost. The same scripts can run on anything from the entry-level SORCERER Storm cloud account to the SORCERER Pro iDA system.

For information or questions, please contact Terri at Sales@SageNResearch.com.

References:

[1] Patterson, Scott D. “Data analysis – the Achilles heel of proteomics.” Nature Biotechnology 21.3 (Mar 2003): 221-2.

[2] mzXML file by Nuno Bandeira of GNPS/UCSD and downloaded from ProteomeCommons.

[3] You can check this rule of thumb with http://prospector.ucsf.edu/prospector/cgi-bin/mssearch.cgi. Rule derived by binomial expansion assuming only carbon-13 isotope distribution.