Support vector regression (SVR) models were created and used to predict, with high accuracy, the retention times of oligonucleotides separated by gradient ion-pair chromatography. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudo-orthogonal gradient modes and three gradient slopes. The results show that the spread in retention time differs between the two gradient modes, indicating varying degrees of sequence-dependent separation. Peak widths from the experimental dataset were calculated and correlated with the guanine-cytosine content and retention time of the sequence for each gradient slope. These data were used to predict the resolution of the n – 1 impurity for 250 000 random 12- and 16-mer sequences, showing that one of the investigated gradient modes has a much higher probability of exceeding a resolution of 1.5, particularly for the 16-mer sequences. Sequences with a high guanine-cytosine content and a terminal C are more likely to fall short of the critical resolution. The trained SVR models can be used both to identify characteristics of different separation methods and to assist in the choice of method conditions, i.e., to optimize resolution for arbitrary sequences. The methodology presented in this study can be expected to be applicable to predicting retention times of other oligonucleotide synthesis and degradation impurities, provided that enough training data are available.
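As a rough illustration of the resolution screening described above, the following minimal sketch computes the resolution between a full-length sequence and its n – 1 impurity from predicted retention times and peak widths, then reports the fraction of pairs exceeding 1.5. It assumes the common baseline-width definition Rs = 2|t2 − t1| / (w1 + w2); the abstract does not state which width convention was used, and the function names and numerical values are hypothetical.

```python
def resolution(t_main, t_impurity, w_main, w_impurity):
    """Resolution from retention times and baseline peak widths.

    Assumption: Rs = 2 * |t2 - t1| / (w1 + w2). The study may use a
    different width convention (e.g. half-height widths).
    """
    return 2.0 * abs(t_main - t_impurity) / (w_main + w_impurity)


def fraction_resolved(pairs, threshold=1.5):
    """Fraction of (t_main, t_impurity, w_main, w_impurity) tuples
    whose resolution exceeds the threshold."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    resolved = sum(1 for p in pairs if resolution(*p) > threshold)
    return resolved / len(pairs)


# Hypothetical predicted values (minutes), for illustration only.
example_pairs = [
    (10.0, 9.4, 0.3, 0.3),  # well separated
    (10.0, 9.9, 0.3, 0.3),  # poorly separated
]
share = fraction_resolved(example_pairs)
```

In practice the retention times and widths would come from the trained models and the fitted peak-width correlation rather than from fixed numbers.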