In this paper, a new genetic programming (GP) algorithm for symbolic regression problems is proposed. The algorithm, named statistical genetic programming (SGP), uses statistical information, such as variance, mean and correlation coefficient, to improve GP. To this end, we define a well-structured tree as a tree with the following property: nodes that are closer to the root have a higher correlation with the target. It is shown experimentally that, on average, trees whose structures are closer to a well-structured tree are smaller than other trees. SGP biases the search process toward solutions whose structures are closer to a well-structured tree. For this purpose, it extends the terminal set with some small well-structured subtrees and starts the search in a space that is limited to semi-well-structured trees (i.e., trees with at least one well-structured subtree). Moreover, SGP incorporates new genetic operators, namely correlation-based mutation and correlation-based crossover, which use the correlation between the output of each subtree and the target to improve their functionality. Furthermore, we suggest a variance-based editing operator that reduces the size of the trees. SGP uses the new operators to explore the search space so that it obtains more accurate and smaller solutions in less time. SGP is tested on several symbolic regression benchmarks. The results show that it increases the evolution rate, the accuracy of the solutions, and the generalization ability, and decreases the rate of code growth.
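The well-structured property defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class, the helper names, the use of the absolute Pearson correlation, the non-strict comparison between a node and its children, and the toy data are all assumptions made for illustration, and the paper's exact definition may differ.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical expression-tree node (not taken from the paper)."""
    label: str                                 # variable name or operator symbol
    fn: Optional[Callable[..., float]] = None  # None marks a terminal (input variable)
    children: Tuple["Node", ...] = ()


def evaluate(node: Node, row: Dict[str, float]) -> float:
    """Evaluate the subtree rooted at `node` on one data row."""
    if node.fn is None:
        return row[node.label]
    return node.fn(*(evaluate(child, row) for child in node.children))


def abs_correlation(xs: List[float], ys: List[float]) -> float:
    """Absolute Pearson correlation; returns 0.0 for a constant series (assumption)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def subtree_correlation(node: Node, rows: List[Dict[str, float]],
                        targets: List[float]) -> float:
    """Correlation between a subtree's outputs and the targets."""
    return abs_correlation([evaluate(node, row) for row in rows], targets)


def is_well_structured(node: Node, rows: List[Dict[str, float]],
                       targets: List[float], tol: float = 1e-12) -> bool:
    """True if no child correlates with the target more strongly than its parent."""
    parent_corr = subtree_correlation(node, rows, targets)
    for child in node.children:
        if subtree_correlation(child, rows, targets) > parent_corr + tol:
            return False
        if not is_well_structured(child, rows, targets, tol):
            return False
    return True


# Toy data (made up for this sketch): the target is x1 + x2.
rows = [{"x1": 1.0, "x2": 2.0}, {"x1": 2.0, "x2": 1.0},
        {"x1": 3.0, "x2": 4.0}, {"x1": 4.0, "x2": 3.0}]
targets = [row["x1"] + row["x2"] for row in rows]

x1, x2 = Node("x1"), Node("x2")
add = Node("+", lambda a, b: a + b, (x1, x2))
good = add                                       # root matches the target exactly
bad = Node("*", lambda a, b: a * b, (add, x1))   # child "+" beats the root

print(is_well_structured(good, rows, targets))
print(is_well_structured(bad, rows, targets))
```

In this sketch the tree `good` satisfies the property because its root reproduces the target, while `bad` violates it because its `+` child correlates with the target more strongly than the root does. Recomputing every subtree's outputs at each level keeps the code short but is quadratic in tree depth; caching subtree outputs would be the natural refinement.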