Machine learning techniques are becoming increasingly important in the selection and optimization of therapeutic molecules, as well as in the selection of formulation components and the prediction of long-term stability. Compared to first-principles models, machine learning techniques are easier to implement and can identify correlations that would be hard to describe at a mechanistic level, but they rely strongly on high-quality training data. Here, we evaluate the potential of the "chaos game" representation to provide input data for machine learning models. The chaos game is an algorithm originally developed to generate fractal structures and later applied to the representation of biological sequences, such as genes and proteins. Our results show that combining the chaos game representation with convolutional neural networks achieves accuracy comparable to that of other machine learning approaches, indicating that chaos game representations could be a valid alternative to existing featurization strategies for machine learning models of biological sequences. We implement the chaos game in Python 3.8.10 and use it to produce fractal as well as novel expanding representations of protein sequences. We then feed the resulting images to a convolutional neural network, also built in Python 3.8.10, using the TensorFlow 2.9.1, Keras 2.9.0, and scikit-learn 1.1.1 packages. As a case study, we select a recently published dataset for the antibody emibetuzumab, with the objective of co-optimizing antibody variants for both high affinity and low non-specific binding.
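To make the idea of a chaos game representation concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that the 20 standard amino acids are placed at the vertices of a regular polygon and that each residue moves the current point a fixed fraction of the way toward its vertex. The function names (`polygon_vertices`, `chaos_game_points`, `rasterize`), the contraction `ratio` of 0.5, the alphabet ordering, and the 64 x 64 image size are all illustrative assumptions.

```python
import numpy as np

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def polygon_vertices(n):
    """Return the n vertices of a regular polygon inscribed in the unit circle."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.column_stack((np.cos(angles), np.sin(angles)))


def chaos_game_points(sequence, ratio=0.5):
    """Map a protein sequence to 2D points with the chaos game.

    Starting from the origin, each residue moves the current point a
    fraction `ratio` (assumed value) of the way toward that residue's vertex.
    Residues outside AMINO_ACIDS raise a KeyError.
    """
    vertices = polygon_vertices(len(AMINO_ACIDS))
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    point = np.zeros(2)
    points = []
    for residue in sequence:
        target = vertices[index[residue]]
        point = point + ratio * (target - point)
        points.append(point)
    # reshape keeps a (0, 2) array for an empty sequence
    return np.array(points).reshape(-1, 2)


def rasterize(points, size=64):
    """Count the points falling in each cell of a size x size grid over [-1, 1]^2."""
    image, _, _ = np.histogram2d(
        points[:, 0], points[:, 1], bins=size, range=[[-1, 1], [-1, 1]]
    )
    return image


if __name__ == "__main__":
    pts = chaos_game_points("ACDEFGHIK")
    img = rasterize(pts)
    print(pts.shape, img.shape)
```

An array such as `img` is the kind of single-channel image that could be passed to a convolutional neural network. The expanding representations mentioned above would need a different update rule, which this sketch does not attempt to reproduce.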