Microbial cell factories enable the production of chemicals, offering an alternative to traditional fossil fuel-dependent production. Developing efficient production strains, however, requires finding the optimal expression levels of the production pathway genes. Unlike sequential experimentation, combinatorial optimization captures the relationships between pathway genes and production, albeit at the cost of conducting multiple experiments. Fractional factorial designs followed by linear modeling and statistical analysis reduce the experimental workload while maximizing the information gained during experimentation. Although tools to perform and analyze these designs are available, guidelines for selecting appropriate factorial designs for pathway optimization are missing. In this study, we leverage a kinetic model of a seven-gene pathway to simulate the performance of a full factorial strain library. We compare the full factorial design with resolution V, IV, and III designs and with Plackett-Burman (PB) designs. Additionally, we evaluate the performance of these designs as training sets for a random forest algorithm aimed at identifying the best-producing strains. We also evaluate the robustness of these designs to noise and missing data, traits inherent to biological datasets. We find that while resolution V designs capture most of the information present in full factorial data, they require the construction of a large number of strains. Resolution III and PB designs, on the other hand, fall short in identifying optimal strains and miss relevant information. Moreover, given the small number of experiments required for the optimization of a pathway with seven genes, linear models outperform random forest. Consequently, we propose the use of resolution IV designs followed by linear modeling in Design-Build-Test-Learn (DBTL) cycles targeting the screening of multiple factors. These designs enable the identification of optimal strains and provide valuable guidance for subsequent optimization cycles.
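To make the proposed workflow concrete, the following minimal sketch builds a 16-run, two-level resolution IV design for seven factors and estimates main effects from it. Several parts are assumptions made for illustration and are not taken from the study: the generators (E = ABC, F = BCD, G = ACD, a common textbook choice), the factor labels A to G standing in for the seven genes, and the `toy_titer` response, which is an invented function and not the kinetic model.

```python
from itertools import product

# Assumed generators for a 2^(7-3) resolution IV design (textbook choice,
# not taken from the study): E = ABC, F = BCD, G = ACD.
GENERATORS = {"E": "ABC", "F": "BCD", "G": "ACD"}
BASE = "ABCD"        # factors varied in a full 2^4 factorial
FACTORS = "ABCDEFG"  # hypothetical labels for the seven pathway genes


def fractional_design():
    """Return 16 runs, each a dict mapping factor name to -1 or +1."""
    runs = []
    for levels in product((-1, 1), repeat=len(BASE)):
        run = dict(zip(BASE, levels))
        # Each generated factor is the product of its parent columns.
        for name, parents in GENERATORS.items():
            value = 1
            for parent in parents:
                value *= run[parent]
            run[name] = value
        runs.append(run)
    return runs


def main_effects(runs, responses):
    """Estimate each main effect as mean(y at +1) - mean(y at -1)."""
    half = len(runs) / 2
    return {
        factor: sum(run[factor] * y for run, y in zip(runs, responses)) / half
        for factor in FACTORS
    }


def toy_titer(run):
    """Invented response for illustration only (not the kinetic model)."""
    return 10 + 3 * run["A"] - 2 * run["C"] + 1.5 * run["A"] * run["B"]


runs = fractional_design()
effects = main_effects(runs, [toy_titer(run) for run in runs])
```

In this toy case the A-by-B interaction does not leak into any main-effect estimate, which is the property that distinguishes resolution IV from resolution III designs: main effects are not aliased with two-factor interactions, although two-factor interactions remain aliased with one another.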