Jiaxi Zhao: Machine Learning for Solubility Estimation: Insights from Data Quality, Model Architecture, and Physics-based Features

Date
11 June 2026, 09:15
Location
Room X, University Main Building (Universitetshuset), Uppsala
Video meeting link
https://uu-se.zoom.us/j/61161363713
Type
Thesis defence
Respondent
Jiaxi Zhao
Opponent
Anne Juppo
Supervisors
Per Larsson, Christel Bergström, Eline Hermans, Christophe Tistaert
Publication
https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-584766

Abstract

Solubility is a critical physicochemical property and a key determinant of drug absorption. Insufficient solubility can lead to poor bioavailability and diminished pharmacological effects. Consequently, early and accurate assessment of solubility is essential in drug discovery and development. However, experimental measurements are time-consuming and costly. To accelerate decision making, reliable in silico models capable of accurate solubility estimation are needed.

Machine learning has become a natural solution because it can efficiently learn predictive functions directly from data. Over the past decades, numerous studies have applied machine learning to solubility prediction, yet achieving accurate predictions remains challenging. Prior research suggests that two major factors contribute to this difficulty: (1) limited quality of available datasets, and (2) the need for improved model architecture and molecular representations.

In this thesis, we leveraged large, high-quality Johnson & Johnson data to address both challenges. First, a procedure was developed to partition the data into six subsets varying in noise level and size. Random forest models trained on these subsets showed that, for a fixed dataset size, high-quality data yielded superior predictive accuracy, whereas larger datasets with analytical variability performed comparably to diverse, smaller but cleaner datasets. However, noise arising from the presence of amorphous solid after the solubility measurement introduced a systematic positive bias that could not be mitigated by increasing dataset size. Insights from this analysis, combined with theoretical pH–solubility equations, enabled the construction of a large intrinsic solubility (S0) dataset. A multi-task graph transformer (GT) model was then developed and achieved state-of-the-art S0 estimation. With the solubility equation and the predicted S0, a complete pH-dependent solubility profile from pH 2 to pH 10 can be derived, providing additional support for scientific decision making. Finally, we assessed the prediction of solubility in fasted-state simulated intestinal fluid (FaSSIF) using molecular-dynamics-derived features. These descriptors provided only modest performance gains when added to classical physicochemical properties. Incorporating transfer learning improved model performance by enhancing predictions in both the low-solubility and high-solubility regions.
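As an illustration of how a pH-dependent profile can follow from a single S0 value, the sketch below applies the standard Henderson-Hasselbalch relation for a monoprotic compound. The abstract does not state which pH–solubility equations the thesis uses, so the equation form, the function names (`log_solubility`, `ph_profile`), and the example values are all assumptions made for illustration. The sketch also ignores the upper limit set by salt solubility.

```python
import math


def log_solubility(log_s0, pka, ph, kind):
    """Return log10 of total solubility at a given pH.

    Assumes the Henderson-Hasselbalch form for a monoprotic compound:
      acid:    S = S0 * (1 + 10**(pH - pKa))
      base:    S = S0 * (1 + 10**(pKa - pH))
      neutral: S = S0
    This is an illustrative assumption, not the thesis's stated method.
    """
    if kind == "acid":
        ionized_ratio = 10 ** (ph - pka)
    elif kind == "base":
        ionized_ratio = 10 ** (pka - ph)
    elif kind == "neutral":
        ionized_ratio = 0.0
    else:
        raise ValueError(f"unknown compound kind: {kind!r}")
    return log_s0 + math.log10(1.0 + ionized_ratio)


def ph_profile(log_s0, pka, kind, ph_min=2, ph_max=10):
    """Return {pH: log10 S} at integer pH values from ph_min to ph_max."""
    return {
        ph: log_solubility(log_s0, pka, ph, kind)
        for ph in range(ph_min, ph_max + 1)
    }


# Hypothetical example: a weak acid with pKa 4 and predicted log S0 of -5.
profile = ph_profile(log_s0=-5.0, pka=4.0, kind="acid")
for ph, log_s in profile.items():
    print(f"pH {ph:2d}: log S = {log_s:.2f}")
```

Under these assumptions, the solubility of a weak acid stays close to S0 well below its pKa and rises by roughly one log unit per pH unit above it; a weak base shows the mirror-image behaviour.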

Overall, this thesis provided practical guidelines for solubility data processing, introduced a GT model that delivered state-of-the-art S0 estimates, and offered insights into using molecular-dynamics-derived features for FaSSIF solubility estimation.
