Introduction
Stratifying biological samples with high-dimensional data, such as those derived from mass spectrometry-based proteomics, has become a promising strategy for addressing biological questions and for classifying samples according to phenotype. In this review, we discuss computational aspects of processing Multidimensional Protein Identification Technology data with a class of algorithms widely used in the machine learning community, such as support vector machines. Specifically, after a short presentation of the input data structure, we focus on the properties and capabilities of feature selection and classification models, and point to useful tools that assist scientists in these computations. Finally, we conclude by outlining new inference strategies which, coupled with improvements in mass spectrometry instruments and methods, may shape the future of this field.

Conclusion
In this review we have given a well-defined overview of a method that, by combining high-throughput proteomic data and machine learning algorithms, allows the stratification of biological samples. Besides their importance for diagnostic or prognostic purposes, these procedures are also useful for identifying meaningful expression patterns. The method therefore represents a valid tool for investigating both clinical and biological aspects.

Introduction
Recent developments in analytical techniques such as mass spectrometry (MS) have created the opportunity to measure proteomes on a large scale, providing a representative snapshot of cells and/or tissues associated with different phenotypes. In this context, new MS instruments can reach limits of detection down to the attomole level and a dynamic range of 1 × 10⁶ [1]. As a consequence, MS has become essential for proteomic research and, owing to its discovery power, has already been introduced as a tool for clinical applications.
In fact, one of the main aims in this field is to use relevant biomarkers to improve current methods of diagnosis (e.g. healthy–diseased), to select appropriate therapeutic approaches and to monitor their effectiveness [2]. Building an inference model able to discriminate biological samples that share some characteristics (such as m/z ions, peptides or proteins) is a common problem in many areas of the life sciences, including proteomics. In the last few years, a variety of algorithms have been designed for this purpose. In many of these studies, authors applied support vector machines (SVMs) [3] to experimental data generated mainly by analysing body fluids with MALDI (matrix-assisted laser desorption/ionisation) and SELDI (surface-enhanced laser desorption/ionisation) technologies, while very few investigated data obtained by liquid chromatography coupled to MS [4]. A number of publications have reported biomarker patterns with diagnostic sensitivities and specificities approaching 100%. Although these results suggest a prominent role in disease diagnosis, realising the potential of MS-based proteomics in the clinic requires that additional issues, such as reproducibility and standardisation of methods, be addressed [5].

Regardless of the analytical methodology used to generate proteomic data, inference on biological sample discrimination revolves around two main problems: feature selection and classification (Figure 1). For each of them, scientists can apply a wide range of algorithms, so there is no unique route to an adequate inference model. As a consequence, which strategy works best is still an open issue.
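The two problems named above, feature selection followed by classification, and the sensitivity and specificity used to report results, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abundance values are invented toy numbers (not real proteomic data), the function names are hypothetical, the feature score is a simple class-mean gap divided by the within-class spread, and the classifier is a linear SVM trained by plain sub-gradient descent on the regularised hinge loss. It is not the procedure used in any of the cited studies.

```python
from statistics import mean, stdev

# Toy data, invented for illustration: rows are samples, columns are four
# hypothetical protein features (e.g. spectral counts).
# Labels: -1 = healthy, +1 = diseased.
X_train = [
    [10, 5, 1, 7], [12, 6, 2, 3], [11, 4, 1, 5],   # healthy
    [2, 5, 9, 6], [3, 6, 8, 4], [1, 4, 10, 5],     # diseased
]
y_train = [-1, -1, -1, 1, 1, 1]
X_test = [[11, 5, 2, 6], [2, 6, 9, 4], [10, 4, 1, 5], [3, 5, 8, 6]]
y_test = [-1, 1, -1, 1]


def rank_features(X, y):
    """Filter-style selection: rank features by class-mean gap over spread."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == -1]
        gap = abs(mean(pos) - mean(neg))
        scores.append(gap / (stdev(pos) + stdev(neg) + 1e-9))
    # Feature indices, most discriminating first.
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def project(X, columns):
    """Keep only the selected feature columns."""
    return [[row[j] for j in columns] for row in X]


def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=100):
    """Sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # L2 shrinkage
            if margin < 1.0:                          # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(X, w, b):
    """Sign of the linear decision function."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
            for xi in X]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Step 1: feature selection (keep the two top-ranked features).
selected = rank_features(X_train, y_train)[:2]
# Step 2: classification on the reduced feature set.
w, b = train_linear_svm(project(X_train, selected), y_train)
predictions = predict(project(X_test, selected), w, b)
# Step 3: evaluation on held-out samples.
sens, spec = sensitivity_specificity(y_test, predictions)
print(selected, predictions, sens, spec)
```

On data this small and this cleanly separated, the resulting sensitivity and specificity carry no evidential weight; the sketch only shows how the steps fit together. In practice an established SVM implementation and a proper validation scheme would replace the hand-written training loop.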
Figure 1: General workflow for sample classification by using high-dimensional proteomic data.

* Corresponding author. Email: pierluigi.mauri@itb.cnr.it
1 Institute for Biomedical Technologies, Council National Research (ITB-CNR), 20090 Segrate, Milan, Italy
2 Department of Informatics, Systems and Communication, University of Milano-Bicocca, 20132 Milan, Italy

To answer this question, some investigators have begun to perform studies for assessing which procedure