Exploring and Selecting Features to Predict the Next Outcomes of MLB Games.

Shu-Fen Li, Mei-Ling Huang, Yun-Zhi Li

doi:10.3390/e24020288

Abstract

(1) Background and Objective: Major League Baseball (MLB) is one of the most popular international sports events. Many people are very interested in the related activities and are curious about the outcome of the next game. Many factors affect the outcome of a baseball game, so it is very difficult to predict the outcome precisely. At present, the accuracy reported by relevant research for predicting the next game falls between 55% and 62%. (2) Methods: This research collected MLB game data from 2015 to 2019 and organized a total of 30 datasets, one for each team, to predict the outcome of the next game. The prediction methods used include a one-dimensional convolutional neural network (1DCNN) and three machine-learning methods, namely an artificial neural network (ANN), a support vector machine (SVM), and logistic regression (LR). (3) Results: Among the four prediction models, SVM obtains the highest prediction accuracies, 64.25% without feature selection and 65.75% with feature selection; the corresponding best AUCs are 0.6495 and 0.6501. (4) Conclusions: This study used feature selection and an optimized parameter combination to increase the prediction performance to around 65%, which surpasses the prediction accuracies of state-of-the-art works in the literature.

Highlights

Sports events are deeply woven into the lives of the general public
A large amount of game data is open to the public, and many scholars have studied the prediction of game outcomes, player performance, and player value
The 24 variables of the original data were directly fed into the 1DCNN model, and the optimal parameter combination was searched by GridSearchCV
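
The parameter search named in the last highlight can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' code: it swaps the 1DCNN for the SVM mentioned in the abstract to keep the example light, uses synthetic data with 24 input variables in place of a team dataset, and the feature-selection method (SelectKBest with an ANOVA F-score) and the parameter grid are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one team's dataset (assumption): 24 input
# variables and a binary win/loss label.
X, y = make_classification(
    n_samples=400, n_features=24, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling, feature selection, and the classifier live in one pipeline so
# that each cross-validation fold fits them on its own training split only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("svm", SVC()),
])

# Hypothetical grid: number of retained features and SVM hyperparameters.
# k=24 corresponds to using all variables (no feature selection).
param_grid = {
    "select__k": [8, 16, 24],
    "svm__C": [0.1, 1.0, 10.0],
    "svm__kernel": ["linear", "rbf"],
}

# Exhaustive search over the grid with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(best_params, round(test_accuracy, 4))
```

Including `k=24` in the grid lets the search compare the selected-feature variants against the full variable set in one run, which mirrors the with/without feature selection comparison reported in the abstract.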

Summary

Introduction

Sports events are deeply woven into the lives of the general public, and baseball is one of the most popular sports. Major League Baseball (MLB) is the world’s highest-level professional baseball league and has a long history among North American professional sports leagues. A large amount of game data is open to the public, and many scholars have studied the prediction of game outcomes, player performance, and player value. Identifying the key variables that affect the outcome of a game is both fascinating and important. Barnes and Bjarnadóttir [1] collected player data from 1998 to 2014 and used linear regression (LR), random forest (RF), regression trees (RT), and gradient-boosted trees (GBT)

Objectives

Methods

Results

Conclusion