Virtual screening plays a prominent role in early hit identification in drug discovery. To maximise the chance of success, very large numbers of molecules need to be docked against a target, which is costly in compute time and resources. To streamline the computational process, a virtual screening pipeline was developed in which deep learning or active learning models were trained on a small training set of docking scores. The resulting predictor was then applied to ultra-large libraries to rank molecules for experimental testing.
To exemplify the machine learning (ML)-enabled pipeline, a SARS-CoV-2 target, nsp14, was chosen. nsp14 is a key part of the coronavirus replication machinery, with dual functions that facilitate 1) proof-reading during RNA replication and 2) maturation of the viral genome to enable viral protein synthesis and protection from innate immune defences. Using experimental structures and molecular dynamics simulation snapshots, the ML pipeline was trained on a set of docking results and applied to predict potential binders. Experimental biochemical and biophysical validation of the top 1000 predicted binders was carried out. Several binders were identified, including one that shows robust nsp14 methyltransferase inhibition, demonstrating the applicability of this pipeline in accelerating hit discovery.
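
The surrogate-scoring idea described above (dock a small subset, train a predictor on those scores, then rank the full library with the predictor) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the fingerprints are random bits, `dock` is a synthetic stand-in for a real docking run, and a closed-form ridge regression is swapped in for the deep learning or active learning models used in the pipeline. It shows only the shape of the workflow, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical library: 20,000 molecules as 64-bit binary fingerprints.
n_library, n_bits = 20_000, 64
library = rng.integers(0, 2, size=(n_library, n_bits)).astype(float)

# Stand-in for an expensive docking run (assumption): a hidden linear
# function of the fingerprint plus noise. Lower score = better binder.
hidden_weights = rng.normal(size=n_bits)

def dock(fingerprints):
    noise = rng.normal(scale=0.5, size=len(fingerprints))
    return fingerprints @ hidden_weights + noise

# 1) Dock only a small random subset of the library.
train_idx = rng.choice(n_library, size=1_000, replace=False)
X_train = library[train_idx]
y_train = dock(X_train)

# 2) Fit a ridge-regression surrogate on the docked subset (closed form).
#    For brevity the bias term is regularised along with the other weights.
def fit_ridge(X, y, alpha=1.0):
    X1 = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    A = X1.T @ X1 + alpha * np.eye(X1.shape[1])
    return np.linalg.solve(A, X1.T @ y)

def predict(weights, X):
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return X1 @ weights

weights = fit_ridge(X_train, y_train)

# 3) Score the whole library with the cheap surrogate and rank it.
predicted = predict(weights, library)
top_k = 100
top_idx = np.argsort(predicted)[:top_k]  # lowest predicted scores first
```

Here only 1,000 of the 20,000 molecules are ever passed to `dock`; the remaining molecules are ranked from predicted scores alone, and `top_idx` holds the candidates that would be forwarded for experimental testing. An active learning variant would repeat steps 1 to 3, docking the current top-ranked molecules and adding them to the training set before refitting.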