Abstract

The development of new spectral analysis methods for biological thin-film detection has generated intense interest in terahertz (THz) spectroscopy and its application in a wide range of fields. In this paper, machine learning methods are applied for the first time to the quantitative characterization of deposited bovine serum albumin (BSA) thin films measured by terahertz time-domain spectroscopy. The spectral data of BSA thin films prepared from solutions with concentrations ranging from 0.5 to 35 mg/ml are analyzed using the support vector regression method to learn the underlying model relating the frequency-domain spectra to the target concentration. The learned model successfully predicts the concentrations of unknown test samples with a coefficient of determination R² = 0.97932. Furthermore, to identify the relevance of each frequency to the concentration, maximal information coefficient statistical analysis is used, and the three most discriminating frequencies in the THz range are identified at 1.2, 1.1 and 0.5 THz, respectively. This indicates that a good prediction of BSA concentration can be achieved using only the top three relevant frequencies. Moreover, the top discriminating frequencies are in good agreement with the frequencies predicted by a long-wavelength elastic vibration model for the BSA protein.

Highlights

  • Over the past decade, simple, label-free, and highly sensitive protein detection techniques have been extensively investigated

  • With 7 measurements made for each sample, 147 transmission spectra, each characterized by 43 sampling points and 1 concentration value, are preprocessed with principal component analysis (PCA) for data denoising and then input to a support vector regression (SVR) model for further investigation

  • The leave-one-out cross-validation (LOOCV) scheme is considered approximately unbiased for estimating the true prediction error of machine learning methods [43]; in this study, LOOCV is first used to evaluate the performance of the SVR model
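The LOOCV evaluation mentioned in the highlights can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: a one-variable least-squares line stands in for the PCA and SVR steps, and the concentrations, noise values, and the single "feature" per sample are synthetic assumptions chosen only to show how each sample is predicted by a model trained on all the others and how R² is then computed.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (a stand-in for SVR)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loocv_predictions(xs, ys):
    """Predict each sample from a model trained on all the other samples."""
    predictions = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic, illustrative data only (not measurements from the paper).
# The range 0.5 to 35 mg/ml follows the abstract; the intermediate values,
# the noise, and the linear feature model are assumptions.
concentrations = [0.5, 5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0]
noise = [0.004, -0.003, 0.002, -0.004, 0.003, -0.002, 0.004, -0.003]
features = [1.0 - 0.02 * c + e for c, e in zip(concentrations, noise)]

predicted = loocv_predictions(features, concentrations)
r2 = r_squared(concentrations, predicted)
print(f"LOOCV R^2 on synthetic data: {r2:.4f}")
```

Because every prediction comes from a model that never saw the held-out sample, the resulting R² estimates generalization rather than goodness of fit on the training data.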


Summary

Introduction

Simple, label-free, and highly sensitive protein detection techniques have been extensively investigated. Machine learning methods are capable of learning the underlying model of experimental data and generalizing well to unknown test data. They suit the requirements of data analysis for laboratory and industrial purposes [12,13,14,15]. A machine learning framework is proposed that successfully learns the relationship between the frequencies and the target concentrations for an exemplar protein (bovine serum albumin, BSA) thin film measured by terahertz spectroscopy. This work presents the first THz time-domain spectroscopy investigation of BSA thin films that uses a support vector regression (SVR) method to learn the function relating frequency to target concentration. Applying the SVR model to THz data allows possible nonlinearities in the detected signals to be taken into account when identifying protein concentrations. The maximal information coefficient (MIC) was applied to identify the frequencies in the THz region that are most discriminating for BSA concentration; these correspond to the fundamental vibration frequencies of a long-wavelength elastic vibration model.
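To make the idea of ranking frequencies by MIC more concrete, the following sketch computes a simplified MIC-style score. It is an approximation under stated assumptions, not the full MINE algorithm and not the authors' code: it only tries equal-frequency grids (the real algorithm also optimizes bin boundaries), it uses the commonly cited grid budget of n^0.6, and the two input series are synthetic. The function names are hypothetical.

```python
import math
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign bin indices by rank so bins hold roughly equal counts.
    Note: tied values may be split across bins in this simple version."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(bx, by):
    """Mutual information (in bits) between two discrete label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def approx_mic(x, y, exponent=0.6):
    """Maximum normalized MI over equal-frequency grids with nx * ny <= n**exponent."""
    budget = len(x) ** exponent
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny)
            )
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


# Synthetic, illustrative series only (not data from the paper).
target = list(range(1, 41))                    # stand-in for concentration
signal = [v ** 2 for v in target]              # nonlinear but fully dependent
scrambled = [(17 * v) % 40 for v in target]    # deterministic shuffle, weak dependence

signal_score = approx_mic(signal, target)
scrambled_score = approx_mic(scrambled, target)
print(f"dependent feature: {signal_score:.3f}, scrambled feature: {scrambled_score:.3f}")
```

Applied per frequency, a score of this kind would rank each sampling point by how strongly its amplitude depends on concentration, including nonlinear dependence that a linear correlation coefficient can miss.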

Sample preparation
THz-TDS measurement
Machine learning methods
Spectrum regression analysis by SVR
Discriminating frequencies identification using MIC
THz transmission spectra determination
LOOCV scheme
Conclusion
