Predicting High-frequency Stock Price Using Machine Learning Technique

Hyunwoo Roh

doi:10.2139/ssrn.3744765

Abstract

This paper addresses the problem of predicting stock prices from high-frequency data using a machine learning approach. We study two things: (1) a comparison of prediction performance among selected function classes, for a given look-back parameter, in terms of the proposed evaluation measures, in the process of finding the best in-sample empirical loss minimizer; and (2) a comparison of those results as the sampling frequency of the financial time series changes, using a set of high-frequency features extracted from Trades and Quotes (TAQ) data. For the TAQ data, feature engineering involves computing 56 features, including market microstructure, statistical, and technical indicator features. Data models are re-estimated at every moving window to obtain each predicted value, which improves prediction accuracy. Algorithmic models, on the other hand, are used without re-estimation for a practical reason: the time needed to train the model is often longer than the sampling interval of the data. In addition, a look-back parameter is introduced to cut off irrelevant, distant historical data. Among the function classes selected for the experiment, PCA regression performs best in terms of mean directional accuracy and simple backtesting, for both the NASDAQ100 index and the TAQ data at the given sampling frequencies (e.g., 3 min, 5 min). Compared with previous studies using NASDAQ100, the results demonstrate that re-estimation and a properly chosen look-back parameter improve prediction performance in terms of the proposed evaluation measures. In terms of maximum drawdown, a measure critical for risk management, DA-RNN produced the smallest value and was therefore the best-performing model on the TAQ data at all sampling frequencies.
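The re-estimation scheme and two of the evaluation measures named above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the way features and targets are aligned, the NumPy-only PCA step, and the particular definitions of mean directional accuracy and maximum drawdown (one common variant of each) are all assumptions made for illustration.

```python
import numpy as np

def rolling_pcr_forecast(X, y, look_back, n_components):
    """Re-estimate a PCA regression on the last `look_back` rows at every step.

    Assumed alignment (hypothetical): y[s] is the target paired with features X[s],
    and all pairs with s < t are observed before the forecast for step t is made.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    preds = np.full(len(y), np.nan)  # no forecast until a full window exists
    for t in range(look_back, len(y)):
        Xw, yw = X[t - look_back:t], y[t - look_back:t]  # data older than look_back is cut off
        mu = Xw.mean(axis=0)
        Xc = Xw - mu
        # Principal directions are the right singular vectors of the centered window.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        V = Vt[:n_components].T
        # Ordinary least squares of the target on the component scores plus an intercept.
        A = np.column_stack([np.ones(look_back), Xc @ V])
        beta, *_ = np.linalg.lstsq(A, yw, rcond=None)
        preds[t] = beta[0] + ((X[t] - mu) @ V) @ beta[1:]
    return preds

def mean_directional_accuracy(actual, predicted):
    """Share of steps where the predicted move has the same sign as the actual move."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # Both moves are measured from the previous actual value.
    actual_move = np.sign(actual[1:] - actual[:-1])
    predicted_move = np.sign(predicted[1:] - actual[:-1])
    return float(np.mean(actual_move == predicted_move))

def maximum_drawdown(equity):
    """Largest peak-to-trough decline of an equity curve, as a fraction of the running peak."""
    equity = np.asarray(equity, dtype=float)
    running_peak = np.maximum.accumulate(equity)
    return float(np.max((running_peak - equity) / running_peak))
```

In this sketch the window slides forward one observation at a time, so each forecast uses only the most recent `look_back` observations. A model that is too slow to refit within one sampling interval would instead be fitted once and reused, as the abstract describes for the algorithmic models.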
We also provide DM statistics, whose null hypothesis is that the predictive accuracies of any two given models do not differ. For the TAQ data at all sampling frequencies, we cannot reject this null hypothesis when comparing PCA regression with the DA-RNN model. Extensive experiments provide insight into how to properly evaluate the prediction performance of the best in-sample empirical loss minimizer on high-frequency time series data.
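As a rough illustration of the model comparison described above, the sketch below computes a test statistic of this kind. It assumes that "DM" refers to the Diebold-Mariano test, that the loss is squared error, and that the statistic is compared against a standard normal distribution. The function name `dm_statistic` and its parameters are hypothetical, and the paper's exact variant may differ.

```python
import math
import numpy as np

def dm_statistic(errors_a, errors_b, horizon=1):
    """Return (statistic, two-sided p-value) for equal predictive accuracy of two models.

    Assumptions of this sketch: squared-error loss, and a long-run variance built
    from autocovariances of the loss differential up to lag horizon - 1.
    """
    e_a = np.asarray(errors_a, dtype=float)
    e_b = np.asarray(errors_b, dtype=float)
    d = e_a ** 2 - e_b ** 2            # loss differential per forecast
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar
    long_run_var = np.dot(dc, dc) / n  # lag-0 autocovariance
    for k in range(1, horizon):
        long_run_var += 2.0 * np.dot(dc[k:], dc[:-k]) / n
    if long_run_var <= 0.0:
        raise ValueError("long-run variance is not positive; statistic is undefined")
    stat = d_bar / math.sqrt(long_run_var / n)
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(stat), float(p_value)
```

A large p-value means the null hypothesis of equal accuracy cannot be rejected, which is the outcome the abstract reports for PCA regression versus DA-RNN on the TAQ data. A positive statistic indicates that the first model has the larger average squared error.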

Full Text