Mass spectrometry protein identification software


















For each spectrum, the method first determines the baseline intensity and divides each peak's intensity by the baseline to obtain a normalised intensity.

The normalised peak intensities are discretised into four levels: no signal, low signal, medium signal, and strong signal. The continuous intensities are used for the ion matching and the final score calculation, while the discrete intensities are used for the de novo sequencing-based tag inference. The method removes low-signal peaks with a sliding window, discarding all but the top several peaks within each window.
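The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not NovoDB's implementation: the level thresholds, the window width, and the number of peaks kept per window are hypothetical values chosen for the example, since the text does not state them.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the normalised scale; the source does not give them.
LEVEL_THRESHOLDS = (1.0, 2.0, 5.0)
LEVEL_NAMES = ("no signal", "low signal", "medium signal", "strong signal")


def normalise(intensities, baseline):
    """Divide each raw intensity by the spectrum's baseline intensity."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return [i / baseline for i in intensities]


def discretise(norm_intensity, thresholds=LEVEL_THRESHOLDS):
    """Map a normalised intensity to a level index 0..3 (see LEVEL_NAMES)."""
    return bisect_right(thresholds, norm_intensity)


def top_peaks_per_window(peaks, window=100.0, keep=3):
    """Keep a peak only if fewer than `keep` peaks within +/- window/2
    of its m/z are stronger. `peaks` is a list of (mz, intensity)."""
    kept = []
    for mz, inten in peaks:
        neighbours = [i for m, i in peaks if abs(m - mz) <= window / 2]
        stronger = sum(1 for i in neighbours if i > inten)
        if stronger < keep:
            kept.append((mz, inten))
    return kept
```

The continuous output of `normalise` would feed the later scoring stages, while `discretise` yields the four-level values used for tag inference.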

Because different regions of the spectrum have different characteristics, our method organises peaks into five regions based on the mass-to-charge ratio and uses this information in the sequence tag inference.

The second stage infers a number of peptide sequence tags directly from the spectrum. Instead of inferring short sequence tags, which usually leads to misidentifications [ 18 ], NovoDB applies a more sophisticated algorithm to dynamically infer longer peptide sequences in a data-driven fashion.

This is achieved by incorporating a hybrid de novo sequencing approach which integrates a Bayesian Network probability model with a dynamic programming algorithm to infer the most probable tags [ 19 ].

The sequence tag inference stage consists of three major steps. Given a preprocessed spectrum S , NovoDB builds the spectrum graph, adding an edge between two vertices whenever their mass difference approximates the residue mass of an amino acid, or a residue mass shifted by an offset that arises from ion degradation. Since the most intense peaks tend to be b- and y-ions, the spectrum graph has vertices for both interpretations of each peak.

A vertex for an empty peptide and a vertex for the intact peptide are also added. Our method extends the Bayesian Network model used by PepNovo to calculate the probability of observing each vertex of the constructed spectrum graph.
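A rough sketch of the spectrum graph construction is given below. It assumes singly charged peaks, uses only a five-residue subset of the monoisotopic residue masses, and ignores the degradation offsets mentioned above; the function name, the tolerance, and the rounding used to merge duplicate vertices are choices made for this example.

```python
# Monoisotopic residue masses in Da (subset, for brevity).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "P": 97.05276, "V": 99.06841}
PROTON, WATER = 1.00728, 18.01056


def build_spectrum_graph(peak_mzs, parent_mass, tol=0.02):
    """Return (vertices, edges).

    vertices: sorted prefix residue masses, including 0.0 (empty peptide)
              and parent_mass - WATER (intact peptide).
    edges:    (i, j, residue) with i < j when the mass difference between
              vertices i and j matches a residue mass within `tol`.
    `parent_mass` is the neutral peptide mass.
    """
    residue_total = parent_mass - WATER
    masses = {0.0, round(residue_total, 4)}
    for mz in peak_mzs:
        as_b = mz - PROTON                            # peak read as a b-ion
        as_y = residue_total - (mz - WATER - PROTON)  # peak read as a y-ion
        for m in (as_b, as_y):
            if 0.0 < m < residue_total:
                masses.add(round(m, 4))
    vertices = sorted(masses)
    edges = []
    for i, low in enumerate(vertices):
        for j in range(i + 1, len(vertices)):
            diff = vertices[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, residue))
    return vertices, edges
```

Because every peak contributes both a b-ion and a y-ion vertex, a correct path and its mirror image can both appear in the graph, which is why path finding has to avoid using both interpretations of one peak.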

The details can be found in [ 19 ]. Each vertex of the network contains a conditional probability table given the values of its parent vertices. The probability tables are trained by using the large-scale Seattle dataset [ 20 ]. Each vertex is scored by comparing one hypothesis that the peak is a real fragment ion to the other hypothesis that the match is random.

It is calculated by the likelihood ratio:

$$\frac{P_{\mathrm{real}}(t \mid m_j, S)}{P_{\mathrm{random}}(t \mid m_j, S)} \qquad (1)$$

Under the hypothesis that the mass matches are random events, the value of the denominator in (1), namely $P_{\mathrm{random}}(t \mid m_j, S)$, can be calculated as the product of the probabilities of observing individual peaks at their mass positions.

For each vertex v of V , w v denotes the assigned intensities of v 's parents. Because each vertex v depends only on its parents, the probability of observing the ion fragment intensities of t given that the possible cleavage occurred at mass m j in S can be calculated as follows:

$$P_{\mathrm{real}}(t \mid m_j, S) = \prod_{v \in V} P(v \mid w_v)$$
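To make the two hypotheses concrete, the sketch below computes a toy likelihood ratio. It assumes the standard Bayesian-network factorisation, in which the probability under the "real" hypothesis is the product over vertices of each vertex's conditional probability given its parents. The data layout (`assignment`, `parents`, `cpt`) and all names are invented for illustration and are not taken from NovoDB or PepNovo.

```python
import math


def p_random(peak_probs):
    """Product of the probabilities of observing each peak by chance."""
    return math.prod(peak_probs)


def p_real(assignment, parents, cpt):
    """Product over network vertices of P(level of v | levels of v's parents).

    assignment: vertex -> discrete intensity level
    parents:    vertex -> tuple of parent vertices (missing means no parents)
    cpt:        vertex -> {(parent_levels, level): probability}
    """
    prob = 1.0
    for v, level in assignment.items():
        w_v = tuple(assignment[p] for p in parents.get(v, ()))
        prob *= cpt[v][(w_v, level)]
    return prob


def likelihood_ratio(assignment, parents, cpt, peak_probs):
    """Ratio of the 'real fragment' probability to the 'random match' one."""
    return p_real(assignment, parents, cpt) / p_random(peak_probs)
```

A ratio well above 1 favours the hypothesis that the peaks are real fragment ions of a cleavage at that mass.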

One advantage of the model is that P real can distinguish likely combinations of ions and ion degradations from unlikely ones.

NovoDB finds several top-ranking asymmetric paths as the most probable peptide sequences. The method employs the dynamic programming algorithm proposed in [ 21 ] to obtain a set of highly scored peptide sequences by exploring the sub-optimal space of the spectrum graph. There are two reasons for considering sub-optimal paths rather than the optimal path alone. Firstly, a number of vertices on the optimal path may be false positives, because many intense peaks commonly derive from interference.

Secondly, vertices representing the real fragment ions may not always have the highest score and thus will not be included in the optimal path.
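The idea of keeping several high-scoring paths instead of only the best one can be illustrated with a simple k-best dynamic programme over a directed acyclic graph. This is a simplified stand-in, not the algorithm of [ 21 ]: it ignores the asymmetry constraint and assumes vertices are indexed in increasing mass order, so every edge goes from a lower to a higher index.

```python
import heapq


def k_best_paths(n_vertices, edges, scores, k=3):
    """Return up to k (score, sequence) pairs, best first, for paths from
    vertex 0 to vertex n_vertices - 1.

    edges:  list of (i, j, label) with i < j
    scores: per-vertex score, summed along a path
    """
    incoming = {}
    for i, j, label in edges:
        incoming.setdefault(j, []).append((i, label))

    # best[j] holds the k highest-scoring partial paths that end at vertex j.
    best = [[] for _ in range(n_vertices)]
    best[0] = [(scores[0], "")]
    for j in range(1, n_vertices):
        candidates = [
            (s + scores[j], seq + label)
            for i, label in incoming.get(j, [])
            for s, seq in best[i]
        ]
        best[j] = heapq.nlargest(k, candidates)
    return best[-1]
```

Keeping the k best partial paths at each vertex is enough to recover the k best complete paths, because any prefix of a top-k path must itself be among the top k paths to its end vertex.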

Real fragment ions commonly have low intensities or cannot be detected at all. The highly similar segments of the sub-optimal sequences correspond to the fragment ions that are likely to be correctly identified, while the ambiguous segments are where the ions are hardly distinguishable from baseline noise.

Given these characteristics, the most likely peptide sequence tags are extracted by adapting a dynamic programming-based algorithm similar to ClustalW [ 22 ]. In this case, the introduced "gaps" between the sub-optimal peptide sequences correspond to the ambiguous sections of the tandem mass spectrum.
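As an illustration of the last step only, the sketch below takes sub-optimal sequences that have already been aligned (with `-` marking gaps) and extracts the segments on which all of them agree. The alignment itself is assumed to have been done beforehand; the function name and the minimum tag length are hypothetical.

```python
def consensus_tags(aligned, min_len=3, gap="-"):
    """Return maximal runs of alignment columns on which every sequence
    agrees and none has a gap, keeping runs of at least `min_len`."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must share one length")
    tags, current = [], []
    for column in zip(*aligned):
        if gap not in column and len(set(column)) == 1:
            current.append(column[0])
        else:
            if len(current) >= min_len:
                tags.append("".join(current))
            current = []
    if len(current) >= min_len:
        tags.append("".join(current))
    return tags
```

For example, `consensus_tags(["PEPTIDEK", "PEPT-DEK", "PEPTLDEK"])` yields the two agreed segments on either side of the ambiguous fifth column.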

Thus, it is able to dynamically generate sequence tags longer than 3 amino acid residues.

After the sequence tags are obtained, the next stage is to query a database for matches. This is important firstly because the information provided by the database can fill the gaps that de novo sequencing leaves. Secondly, the sequences directly inferred from the spectrum may not be sufficient to uniquely identify a protein.

Thirdly, even if the sequence of a novel protein is not present in the database, homologous proteins may already have been discovered, and they provide crucial information for validating and understanding the novel protein. The algorithm produces error-tolerant scores and does not require long, identical sequences to produce a confident protein hit. The sequence tag query algorithm identifies all high-scoring pairs of regions with high local sequence similarity, namely between an individual peptide's sequences in the query and a protein's sequences from the database.

Scores for the two pairs of isobaric amino acid residues, glutamine and lysine, and leucine and isoleucine, are replaced by their average values. The specificity of trypsin is taken into account by reserving the K symbol for the C-terminal lysine and by introducing a new value, averaged between arginine and lysine, to represent a cleavage site preceding the peptide sequence.
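The local similarity search can be illustrated with a small Smith-Waterman scorer. This is a simplified sketch under stated assumptions: a toy match/mismatch score stands in for the substitution matrix, isobaric residues are treated as identical rather than given averaged matrix entries, and the trypsin-specific symbols are omitted. All names and penalty values are illustrative.

```python
# Collapse each isobaric pair onto one symbol (a simplification of averaging).
ISOBARIC = {"I": "L", "K": "Q"}


def toy_score(a, b, match=2, mismatch=-1):
    """Toy substitution score that does not penalise isobaric swaps."""
    same = ISOBARIC.get(a, a) == ISOBARIC.get(b, b)
    return match if same else mismatch


def local_alignment_score(query, target, score=toy_score, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(target) + 1)
    best = 0
    for a in query:
        curr = [0]
        for j, b in enumerate(target, start=1):
            cell = max(
                0,                            # start a new local alignment
                prev[j - 1] + score(a, b),    # align a with b
                prev[j] + gap,                # gap in target
                curr[j - 1] + gap,            # gap in query
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best
```

Because the score never drops below zero, a tag can match a region of a protein without being penalised for the unrelated sequence around it, which is what makes the query tolerant of incomplete tags.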

Undefined amino acid residues are introduced with zero scores in order to increase the similarity score when peptide sequence tags are incomplete or contain errors. NovoDB ranks the reported peptide hits by similarity score S s and constrains the total number of query hits.

The ion matching stage is based on the dot product between the observed ions and the theoretical ions generated in silico from the sequence database.

Firstly, each peptide hit from the database query is theoretically fragmented, and a vector of peaks P is created to represent all possible fragment ions. In total, 9 types of ions are modelled. Secondly, the experimental peaks are aligned to the theoretical peaks, and the unmatched peaks are excluded from the analysis.

The experimental ion series is represented as a vector I , where the value of I i corresponds to the intensity of an observed fragment ion, or 0 if no fragment ion is observed. The correlation score S c is calculated in a similar way to X! Tandem. The ion matching score as given assumes an underlying hypergeometric distribution for a valid match.
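The alignment and dot-product steps can be sketched as follows. This is a minimal illustration that stops at the plain dot product; any further factors that enter S c are not reproduced here. The matching tolerance and the function names are assumptions made for the example.

```python
def align_peaks(theoretical_mzs, observed, tol=0.5):
    """Build the vector I: for each theoretical m/z, the intensity of the
    strongest observed peak within `tol`, or 0.0 if none matches.
    Observed peaks that match no theoretical ion are ignored."""
    vector = []
    for t in theoretical_mzs:
        matches = [inten for mz, inten in observed if abs(mz - t) <= tol]
        vector.append(max(matches, default=0.0))
    return vector


def dot_product(observed_vec, theoretical_vec):
    """Sum of element-wise products of the two ion vectors."""
    if len(observed_vec) != len(theoretical_vec):
        raise ValueError("vectors must have the same length")
    return sum(i * p for i, p in zip(observed_vec, theoretical_vec))
```

With a theoretical vector of ones, the dot product reduces to the summed intensity of the matched peaks.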

This model has been shown to be very effective [ 24 ]. The ion matching score is calculated for every candidate protein returned by the database query, and the protein with the highest score is considered to be correct. The delta score D is also calculated; it measures how good the best identification score, S max , is relative to the second best, S 2nd . S c , S s and the inferred sequence tags are also reported in the final output.

To evaluate the performance of our method, we use the raw spectra from two large-scale datasets as a benchmark: (1) the Aurum dataset [ 25 ] and (2) the CPTAC dataset [ 26 ] from Clinical Proteomic Technologies Assessment for Cancer.

The Aurum dataset is generated from a mixture of known human proteins. We compare NovoDB with 3 other widely used algorithms: (1) the de novo sequencing method PepNovo, (2) the database search method X! Tandem, and (3) the sequence tag method GutenTag. PepNovo is one of the most widely used de novo sequencing methods. GutenTag has been used as a benchmark for evaluating sequence tag-based methods [ 17 ]. We use two different performance criteria. Firstly, we compare the tag inference results of NovoDB with those of PepNovo by using the sequence inference accuracy.

It is defined as the ratio of the number of correctly identified amino acid residues to the total number of identified residues of a peptide. Secondly, we evaluate how many peptides can be correctly identified by X! Tandem, GutenTag, and NovoDB. As expected, NovoDB has much better accuracy in identifying longer peptide sequences.
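The sequence inference accuracy defined above is a simple ratio, sketched below. The position-by-position comparison against a reference sequence is an assumption made for this example; the text does not say how a residue is judged to be correct.

```python
def sequence_inference_accuracy(predicted, reference):
    """Correctly identified residues divided by all identified residues.

    A predicted residue counts as correct when it equals the reference
    residue at the same position (an assumption of this sketch)."""
    if not predicted:
        return 0.0
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(predicted)
```

For instance, predicting `PEPSIDE` for the reference `PEPTIDE` gets 6 of 7 residues right.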

When inferring 5 amino acid residues, NovoDB achieves slightly better accuracy than PepNovo; when inferring 10 residues, however, NovoDB doubles the accuracy. This is very important because the length of sequence tags greatly affects the final identification accuracy. It is therefore clear that a balance has to be kept: although in theory it is better to incorporate longer tags, integrating sequence tags more than 8 residues long may become detrimental.

Comparing X! Tandem, GutenTag, and NovoDB on the Aurum dataset, X! Tandem marginally outperforms GutenTag. This may be due to the complicated spectra of the Aurum dataset. Because only short tags are targeted, it may become difficult for GutenTag to accurately generate a series of non-conflicting sequence tags.

The CPTAC dataset, by contrast, was generated on more advanced instruments, so its spectra have higher mass accuracy. This enables GutenTag to infer short tags more accurately. NovoDB performs significantly better than both methods, identifying more peptides than X! Tandem at the same FDR. This shows that integrated de novo sequencing has a strong effect on the final results, especially when the spectra are of good quality. It also indicates that short sequence tags may not be sufficient when the sample contains a large number of proteins and the spectra are of higher complexity.

Many pragmatic preprocessing, peptide-scoring, validation, and protein inference algorithms have been incorporated. To speed up the searching process, a toolbox for indexing protein databases is developed for high-throughput applications and all modules are implemented under a new architecture designed for large-scale parallel and distributed searching. An experiment on a public dataset shows that pFind 2. It is also demonstrated that this version of pFind 2.

Finally, Byonic's Glycopeptide Search allows the user to identify glycopeptides without prior knowledge of glycan masses or glycosylation sites.

Byonic is the name of a software package for peptide and protein identification by tandem mass spectrometry.


