Authors: 盛丽博 (Sheng Libo), 阮选敏 (Ruan Xuanmin), et al.
Title: Interpretable XGBoost-SHAP machine learning model for identifying scientific breakthroughs
Journal: Scientometrics, published online December 2025 (SSCI)
Abstract: Identifying scientific breakthroughs is of great significance for research evaluation and policy-making. Thus, it has been the central focus in the realm of science. This study leverages a new dataset of Nobel and Lasker prize-winning publications and employs the eXtreme Gradient Boosting (XGBoost) algorithm to establish a predictive model for scientific breakthroughs. The Input-Process-Output-Outcome (IPOO) framework serves as the fundamental perspective to deconstruct the potential factors associated with breakthroughs into four dimensions: input, process, output, and outcome. We demonstrate that XGBoost achieves the best predictive accuracy among traditional machine learning models, with F1 scores of 0.613 and 0.611 in Dataset 1 and Dataset 2, and AUC values of 0.898 and 0.880, respectively. Large language models (LLMs), used as additional baselines, exhibit higher recall scores on both datasets. In addition, we utilize the SHapley Additive exPlanations (SHAP) approach to enhance the interpretability of our model, enabling a deeper understanding of how features influence the prediction of scientific breakthroughs, which has been overlooked in previous research. This study introduces an explainable machine learning approach for tracing breakthrough research in science with bibliographic information, yielding valuable insights into future research.
Full-text download: Interpretable XGBoost-SHAP machine learning model for identifying scientific breakthroughs
