Data-driven materials discovery is often hindered when target properties are computationally expensive or experimentally demanding to obtain, making conventional large-scale screening impractical. This challenge is particularly acute for metal–organic frameworks (MOFs), whose vast chemical diversity and complex electronic behavior demand both accuracy and data efficiency. Here, we present a unified active learning strategy based on regression tree methods to accelerate the discovery of functional MOFs under scarce, noisy, and imbalanced data conditions.
Using low-dimensional, physically motivated descriptors derived from stoichiometric and geometric features, we construct regression tree–based partitions of the descriptor space to actively select the most diverse and informative samples for electronic-structure evaluation. This new approach, which we name Regression Tree–Active Learning [1], is demonstrated across multiple MOF datasets, where it yields compact training sets that outperform existing active learning strategies in predicting band gaps, adsorption properties, and other key material properties, while exhibiting reduced variance and enhanced robustness to uneven label distributions [2].
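As a rough illustration of how a regression tree partition can drive sample selection, the following Python sketch fits a tree on the labeled set, scores each leaf by its label variance times the fraction of unlabeled candidates it contains, and draws new candidates leaf by leaf. It is a minimal sketch under stated assumptions, not the published algorithm: the function name `select_batch`, the scoring rule, the random within-leaf choice, and the parameter `max_leaf_nodes` are all illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def select_batch(X_lab, y_lab, X_pool, batch_size, max_leaf_nodes=8, seed=0):
    """Return indices of pool points to label next (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Fit a shallow tree on the labeled data; its leaves partition descriptor space.
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=seed)
    tree.fit(X_lab, y_lab)
    lab_leaf = tree.apply(X_lab)    # leaf id of each labeled sample
    pool_leaf = tree.apply(X_pool)  # leaf id of each unlabeled candidate

    # Assumed scoring rule: label variance in the leaf times the share of the
    # pool that falls into it. The small constant keeps every score positive.
    leaves = np.unique(pool_leaf)
    scores = np.empty(len(leaves))
    for i, leaf in enumerate(leaves):
        variance = y_lab[lab_leaf == leaf].var()
        share = np.mean(pool_leaf == leaf)
        scores[i] = variance * share + 1e-12

    # Shuffle the candidates of each leaf so that pop() is a random draw.
    remaining = {
        leaf: list(rng.permutation(np.flatnonzero(pool_leaf == leaf)))
        for leaf in leaves
    }

    chosen = []
    n_select = min(batch_size, len(X_pool))
    while len(chosen) < n_select:
        # Only leaves that still hold unselected candidates can be drawn.
        avail = [i for i, leaf in enumerate(leaves) if remaining[leaf]]
        p = scores[avail] / scores[avail].sum()
        leaf = leaves[rng.choice(avail, p=p)]
        chosen.append(int(remaining[leaf].pop()))
    return chosen
```

In an active learning loop, the returned indices would be sent for electronic-structure evaluation, moved from the pool to the labeled set, and the tree refitted before the next round.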
We further apply this framework to the discovery of spin-crossover (SCO) MOFs, a rare but technologically promising subclass relevant for sensing, spintronics, and gas-related applications. By coupling a new Quantile Regression Tree–Active Learning approach with Random Forest regression and with the new density functional theory calculations needed to predict this property, we accurately identify SCO-active candidates from limited and imperfect training data, recovering over 80% of true positives. This strategy enables the identification of a new set of high-confidence SCO MOFs, demonstrating that complex quantum phenomena can be reliably uncovered through data-efficient, actively guided exploration of large materials spaces [3].
[1] Data Min Knowl Disc (2023);
[2] J. Am. Chem. Soc. 2024, 146, 9, 6134–6144, https://doi.org/10.1021/jacs.3c13687;
[3] npj Comput. Mater., submitted.