Source:
Molecules - Scientific journal (MDPI)
Molecules, Vol. 31, Pages 1566: When Does Machine Learning Add Value over Theory? Predicting API Solubility in Binary Mixtures with COSMO-RS and DOOIT2 Across Diverse and Homogeneous Systems
Molecules doi: 10.3390/molecules31101566
Authors:
Maciej Przybyłek
Tomasz Jeliński
Adrian Drużyński
Piotr Cysewski
Predicting the solubility of active pharmaceutical ingredients (APIs) in binary aqueous-organic mixtures is critical for formulation design, yet remains challenging. Physics-based models such as COSMO-RS provide a solid theoretical foundation but often struggle with non-ideal mixing behavior in complex systems. This study asks a practical question: when does machine learning actually add value beyond established theory? We compared COSMO-RS with DOOIT2 (Dual-Objective Optimization with Iterative Feature Pruning), a hybrid COSMO-RS/machine-learning correction workflow, across two complementary datasets: 85 structurally diverse APIs and related formulation-relevant compounds (10,140 data points) and 37 acid-centered solutes (6,030 data points). The datasets also incorporate newly measured solubilities of lidocaine, benzocaine, and vanillic acid in aqueous 4-formylmorpholine mixtures. DOOIT2 employs rigorous API-out Structured Group K-Fold validation with fold-specific ensemble models, so that every compound is evaluated only by models that never saw it during training, giving a realistic assessment of generalization to unseen compounds. The results are dataset-dependent. For the homogeneous acid series, COSMO-RS already delivers strong predictive performance (RMSD = 0.321, R² = 0.925), and DOOIT2 brings no meaningful improvement (RMSD = 0.310, R² = 0.923). In contrast, for the diverse API set, DOOIT2 reduces RMSD from 0.686 to 0.527 and increases R² from 0.829 to 0.849. Residual analysis indicates that prediction uncertainty is driven primarily by the low-solubility region rather than by a simple monotonic dependence on molecular weight alone. These findings delineate the practical boundaries of machine-learning assistance in solubility prediction and offer clear guidance for formulation scientists.
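The API-out validation idea in the abstract can be illustrated with a minimal sketch. It is not the DOOIT2 implementation: the abstract does not specify how the "Structured" variant of Group K-Fold is built or how RMSD and R² are defined, so the code below assumes a plain group-wise split, the usual root-mean-square deviation, and the coefficient of determination. The constant-bias correction is a deliberately trivial stand-in for the machine-learning step, and all function names and numbers are hypothetical, synthetic illustrations.

```python
import math
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group (API) is on both sides."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Greedy balancing: place the largest groups first, each into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda k: (-len(by_group[k]), str(k))):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_group[g])
    for f in range(n_splits):
        test = sorted(folds[f])
        train = sorted(i for o in range(n_splits) if o != f for i in folds[o])
        yield train, test


def rmsd(y_true, y_pred):
    """Root-mean-square deviation between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def out_of_fold_correction(groups, y_true, baseline, n_splits):
    """Correct baseline predictions using only residuals from other groups."""
    corrected = [None] * len(y_true)
    for train, test in group_k_fold(groups, n_splits):
        # Stand-in for the ML model: a constant bias fitted on training residuals.
        bias = sum(y_true[i] - baseline[i] for i in train) / len(train)
        for i in test:
            corrected[i] = baseline[i] + bias
    return corrected


# Synthetic toy data (NOT from the paper): four hypothetical solutes.
groups = ["api_a"] * 3 + ["api_b"] * 3 + ["api_c"] * 2 + ["api_d"] * 2
y_true = [-1.0, -1.5, -2.0, -3.0, -3.4, -3.9, -0.5, -0.9, -2.5, -2.8]
baseline = [-1.3, -1.9, -2.2, -3.5, -3.8, -4.1, -0.9, -1.1, -2.9, -3.3]

corrected = out_of_fold_correction(groups, y_true, baseline, n_splits=2)
print("baseline :", rmsd(y_true, baseline), r2(y_true, baseline))
print("corrected:", rmsd(y_true, corrected), r2(y_true, corrected))
```

Because every corrected value comes from a fold in which its solute was held out, the reported metrics reflect performance on unseen compounds rather than on unseen data points of already-seen compounds, which is the distinction the API-out scheme is meant to enforce.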