Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
4
106 Correspondence
3
1 10 5 regina@csail.mit.edu (R.B.),
N2 Training set
5
(104 molecules)
jimjc@mit.edu (J.J.C.)
4
4
1 104
Deep learning
In Brief
2
Model validation
A trained deep neural network predicts
antibiotic activity in molecules that are
Growth
New
[antibiotic] 5
Antibiotics structurally different from known
antibiotics, among which Halicin exhibits
Drug Repurposing Hub
HALICIN
ZINC15 Database efficacy against broad-spectrum
N
∆pH Br
Br O
N+
bacterial infections in mice.
S O-
N N+
O S N OH
Rapidly bactericidal
N N Br Br N
Broad-spectrum
-O N N
S N N+ N+ N
O N O
O OH N O
H2N O
H2N
Bacterial cell death S
O
N Low MIC
N N Broad-spectrum
Acinetobacter baumannii N N
Clostridioides difficile -O OH
N+
O O O
Highlights
d A deep learning model is trained to predict antibiotics based
on structure
02139, USA
4Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
5Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
6Department of Biochemistry and Biomedical Sciences, Michael G. DeGroote Institute for Infectious Disease Research, McMaster University,
688 Cell 180, 688–702, February 20, 2020 ª 2020 Elsevier Inc.
Chemical landscape
107
3 Conventional small
N 2
1 molecule screening
5
4
106
3
1 105
N 2
5 Training set
(104 molecules) Iterative Chemical screening
4 model (upper limit 105 - 106)
re-training 10 4
new bond
vector Machine learning Hit validation
(1 - 3% hit rate)
concat
Predictions &
model validation
sum
Growth
Lead
bonds into 2:
3-2, 4-2
identification
bond 2-1 [antibiotic] & optimization
to a few million molecules, are often prohibitively costly to curate, molecules are represented; traditionally, molecules were
limited in chemical diversity, and fail to reflect the chemistry that represented by their fingerprint vectors, which reflected the
is inherent to antibiotic molecules (Brown et al., 2014). Since the presence or absence of functional groups in the molecule, or
implementation of high-throughput screening in the 1980s, no by descriptors that include computable molecular properties
new clinical antibiotics have been discovered using this method. and require expert knowledge to construct (Mauri et al., 2006;
Novel approaches to antibiotic discovery are needed to in- Moriwaki et al., 2018; Rogers and Hahn, 2010). Even though
crease the rate at which new antibiotics are identified and simul- the mapping from these representations to properties was
taneously decrease the associated cost of early lead discovery. learned automatically, the fingerprints and descriptors them-
Given recent advancements in machine learning (Camacho selves were designed manually. The innovation of neural
et al., 2018), the field is now ripe for the application of algorithmic network approaches lies in their ability to learn this representa-
solutions for molecular property prediction to identify novel tion automatically, mapping molecules into continuous vectors
structural classes of antibiotics. Indeed, adopting methodolo- that are subsequently used to predict their properties. These
gies that allow early drug discovery to be performed largely in sil- designs result in molecular representations that are highly at-
ico enables the exploration of vast chemical spaces that is tuned to the desired property, yielding gains in property predic-
beyond the reach of current experimental approaches. tion accuracy over manually crafted representations (Yang
The idea of analytical exploration in drug design is not new. et al., 2019b).
Decades of prior work in chemoinformatics have developed While neural network models narrowed the performance
models for molecular property prediction (Mayr et al., 2018; gap between analytical and experimental approaches, a differ-
Wu et al., 2017). However, the accuracy of these models has ence still exists. Here, we demonstrate how the combination of
been insufficient to substantially change the traditional drug in silico predictions and empirical investigations can lead to
discovery pipeline. With recent algorithmic advancements in the discovery of new antibiotics (Figure 1). Our approach con-
modeling neural network-based molecular representations, we sists of three stages. First, we trained a deep neural network
are beginning to have the opportunity to influence the paradigm model to predict growth inhibition of Escherichia coli using a
of drug discovery. An important development relates to how collection of 2,335 molecules. Second, we applied the resulting
1.2
0.8 0.8
Prediction score
0.8 0.6 0.6
D E F
1.2 10 1.2
99 predictions 63 predictions
growth / prediction score
1 1
0.8 1 0.8
OD (600nm)
OD (600nm)
0.6 0.6
G H I
1 0.7
N 0.6
0.8
S
Tanimoto similarity
0.5
O S OD (600nm)
0.6 N N
0.4
S N
O
0.4 0.3
H2N
0.2
0.2
0.1
training set BW25113
0 0
Broad library
0 500 1000 1500 2000 2500 10
-5
10
-4
10
-3 -2
10 10
-1
10
0
10
1
10
2
10
3
halicin
Ranked training set molecules [halicin] μg/ml
those with low predicted toxicity. The compound that satisfied all and Hahn, 2010) and the antibiotic metronidazole (Tanimoto
of these criteria was the c-Jun N-terminal kinase inhibitor similarity 0.21), displayed excellent growth inhibitory activity
SU3327 (De et al., 2009; Jang et al., 2015) (renamed halicin), a against E. coli, achieving a minimum inhibitory concentration
preclinical nitrothiazole under investigation as a treatment for (MIC) of 2 mg/mL (Figure 2I).
diabetes. Excitingly, halicin, which is structurally most similar Importantly, we observed that the prediction rank of halicin
to a family of nitro-containing antiparasitic compounds (Tani- in our model (position 89) was greater than that in four of the
moto similarity 0.37; Figures 2G and 2H; Table S2H) (Rogers other five models tested (positions ranging from 273 to 1987;
CFU/ml
CFU/ml
10 10 6
5
1hr 2hr 10 no drug
5
10 2hr 10 4hr 5 10x amp - 0x hal
10
4 3hr 4 6hr 10x amp - 5x hal
10 10 10
4
4hr 8hr 10x amp - 10x hal
3 3
10 BW25113 10 BW25113 10
3 10x amp - 20x hal
2
nutrient replete 2
nutrient deplete 2 BW25113
10 10 10
10
-2
10
-1
10
0
10
1
10
2
10
3
10
-2
10
-1
10
0
10
1
10
2
10
3 0 2 4 6 8 24
[halicin] μg/ml [halicin] μg/ml Time (hours)
D E F 9
10 1.4 10 M. tuberculosis H37Rv
no drug
1.2 halicin (16 μg/ml)
7
10
halicin MIC (μg/ml)
1
OD (600nm)
CFU/ml
5
0.8 10
1
0.6
3
10
0.4
0.2 10
1
M. tuberculosis H37Rv
0.1 0
-3 -2 -1 0 1 2 3
3* tolC 1* T -1 -Ia -cr 10 10 10 10 10 10 10 0 50 100 150 200
11 R- CA OXA t(2’’) ’)-Ib
W 25 mB∆ MC a n c (6 [halicin] μg/ml Time (hours)
B a aa
∆b
G
1000 Carbapenem resistant Enterobacteriaceae (CRE) panel
halicin MIC (μg/ml)
100
10
0.1
1000 Acinetobacter baumannii multidrug resistance panel
halicin MIC (μg/ml)
100
10
0.1
1000 Pseudomonas aeruginosa multidrug resistance panel
halicin MIC (μg/ml)
100
10
0.1
(D) MIC of halicin against E. coli strains harboring a range of antibiotic-resistance determinants. The mcr-1 gene was expressed in E. coli BW25113. All other
resistance genes were expressed in E. coli BW25113 DbamBDtolC. Experiments were conducted with two biological replicates.
(E) Growth inhibition of M. tuberculosis by halicin. Shown is the mean of three biological replicates. Bars denote standard deviation.
(F) Killing of M. tuberculosis by halicin in 7H9 media at 16 mg/mL (13 MIC). Shown is the mean of three biological replicates. Bars denote standard deviation.
(G) MIC of halicin against 36-strain panels of CRE isolates (green), A. baumannii isolates (red), and P. aeruginosa isolates (blue). Experiments were conducted with
two biological replicates.
See also Figure S2 and Table S3.
10 0.5
OD (600nm)
0.4 8,000
1
10
0.3 pH 5.4 6,000
pH 6.4
0 0.2
10 pH 7.4 halicin (2 μg/ml)
4,000
0.1 pH 8.4 PMB (32 μg/ml) halicin (4 μg/ml)
-1
pH 9.4 DMSO halicin (8 μg/ml)
10 0 2,000
-5 -4 -3 -2 -1 0 1 2 3
0 10 20 30 10 10 10 10 10 10 10 10 10 0 50 100 150
Time (days) [halicin] μg/ml Time (seconds)
B 10
E
10 8
CFU/ml
6
10 4
2
10
2
[halicin] μg/ml
a
b 1
0.5
0.3
c 0.1
0
0 0.3 0.5 1 2 4 8 16 0 0.3 0.5 1 2 4 8 16 0 0.01 0.06 0.25 1
3
[tetracycline] μg/ml [kanamycin] μg/ml [FeCl3] mM
2
d
1
0
-1
-2
e -3
1 2 3 4
Time (hours)
Figure 4. Halicin Dissipates the DpH Component of the Proton Motive Force
(A) Evolution of resistance to halicin (blue) or ciprofloxacin (red) in E. coli after 30 days of passaging in liquid LB media. Cells were passaged every 24 h.
(B) Whole transcriptome hierarchical clustering of relative gene expression of E. coli treated with halicin at 43 MIC for 1 h, 2 h, 3 h, and 4 h. Shown is the mean
transcript abundance of two biological replicates of halicin-treated cells relative to untreated control cells on a log2-fold scale. Genes enriched in cluster b are
involved in locomotion (p1020); genes enriched in cluster c are involved in ribosome structure/function (p1030); and genes enriched in cluster d are involved
in membrane protein complexes (p1015). Clusters a, e, and f are not highly enriched for specific biological functions. In the growth curve, blue represents
untreated cells; red represents halicin-treated cells.
(C) Growth inhibition by halicin against E. coli in pH-adjusted media. Shown is the mean of two biological replicates. Bars denote absolute error.
(D) DiSC3(5) fluorescence in E. coli upon exposure to polymyxin B (PMB), halicin, or DMSO.
(E) Growth inhibition checkerboards of halicin in combination with tetracycline (left), kanamycin (center), and FeCl3 (right). Dark blue represents greater growth.
See also Figure S3 and Table S4.
potency decreased as pH increased, providing evidence that into the extracellular milieu resulting in increased fluorescence.
this compound may be dissipating the DpH component of the Conversely, when DpH is disrupted, cells compensate by
proton motive force (Farha et al., 2013). Consistent with this increasing Dc, resulting in enhanced DiSC3(5) uptake into the
observation, the addition of sodium bicarbonate to the growth cytoplasmic membrane and therefore decreased fluorescence.
medium (Farha et al., 2018) antagonized the action of halicin Here, early-log E. coli cells were washed in buffer and introduced
against E. coli (Figure S3E). to DiSC3(5) to allow fluorescence equilibration. Cells were then
To further ascertain that halicin dissipates the transmembrane introduced to polymyxin B (Figure 4D), which disrupts the cyto-
DpH potential in bacteria, we applied the potentiometric plasmic membrane, causing release of DiSC3(5) from the mem-
fluorophore 3,30 -dipropylthiadicarbocyanine iodide [DiSC3(5)] brane and a corresponding increase in fluorescence. Next, we
(Wu et al., 1999). DiSC3(5) accumulates in the cytoplasmic mem- introduced cells to varying concentrations of halicin, and
brane in response to the Dc component of the proton motive observed an immediate decrease in DiSC3(5) fluorescence in a
force, and self-quenches its own fluorescence. When Dc is dis- dose-dependent manner (Figure 4D), suggesting that halicin
rupted, or the membrane is permeabilized, the probe is released selectively dissipated the DpH component of the proton motive
7
0.5 10 7
10
OD (600nm)
CFU/ml
0.4 10
CFU/g
6
5
2hr 10
0.3 10 4hr
5
4 6hr 10
0.2 10
8hr 4
0.1 10
3
A. baumannii CDC 288 10
A. baumannii CDC 288 2
nutrient deplete 3
0 10 10
-5 -4 -3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 vehicle halicin
[halicin] μg/ml [halicin] μg/ml (0.5% DMSO) (0.5% w/v)
D E F 9
1 10
8
vehicle 10
0.8
7
10
OD (600nm)
metronidazole
0.6 10
6
CFU/g
5
0.4 10
disrupt halicin 10
4 vehicle
colonization infect metronidazole (50 mg/kg)
0.2 3
resistance 10 halicin (15 mg/kg)
C. difficile 630
2
0 10
-5 -4 -3 -2 -1 0 1 2 3 -72 -48 -24 0 hrs 24 144 24 48 72 96 120 144
10 10 10 10 10 10 10 10 10
[halicin] μg/ml ampicillin C. difficile treatment Time after infection (hours)
200 mg/kg infection every 24 hrs
force. Similar DiSC3(5) fluorescence changes were observed in Halicin Displays Efficacy in Murine Models of Infection
S. aureus treated with halicin (Figures S3F and S3G). Moreover, Given that halicin displays broad-spectrum bactericidal activity
halicin displayed antibiotic antagonism and synergy profiles and is not highly susceptible to plasmid-borne antibiotic-resis-
consistent with DpH dissipation. Of note, halicin antagonized tance elements or de novo resistance mutations at high fre-
the activity of tetracycline in E. coli, and synergized with quency, we next asked whether this compound might have
kanamycin (Figure 4E), consistent with previous work showing utility as an antibiotic in vivo. We therefore tested the efficacy
that the uptake of tetracyclines is dependent upon DpH (Yama- of halicin in a murine wound model of A. baumannii infection.
guchi et al., 1991), whereas aminoglycoside uptake is driven On the dorsal surface of neutropenic BALB/c mice, we estab-
largely by Dc (Taber et al., 1987). lished a 2 cm2 wound and infected with 2.5 3 105 CFU of
Interestingly, our observations that halicin induced the expres- A. baumannii strain 288 acquired from the Centers for
sion of iron acquisition genes at sub-lethal concentrations Disease Control and Prevention (CDC). This strain is not suscep-
(Tables S4A–S4C) suggested that this compound complexes tible to clinical antibiotics generally used for treatment of
with iron in solution, thereby dissipating transmembrane DpH A. baumannii, and therefore represents a pan-resistant
potential similarly to other antibacterial ionophores, such as isolate. Importantly, halicin displayed potent growth inhibition
daptomycin (Farha et al., 2013). We note here that daptomycin against this strain in vitro (MIC = 1 mg/mL; Figure 5A) and
resistance via deletion of dsp1 in S. aureus did not confer was able to sterilize A. baumannii 288 cells residing in metabol-
cross-resistance to halicin (Figure S3H). We observed enhanced ically repressed conditions (Figures 5B, S4A, and S4B). After 1 h
potency of halicin against E. coli with increasing concentrations of infection establishment, mice were treated with Glaxal
of environmental Fe3+ (Figure 4E). This is consistent with a Base Moisturizing Cream supplemented with vehicle (0.5%
mechanism of action wherein halicin may bind iron prior to DMSO) or halicin (0.5% w/v). Mice were then treated after 4 h,
membrane association and DpH dissipation. 8 h, 12 h, 20 h, and 24 h of infection, and sacrificed at 25 h
Number of molecules
6 3,260 molecules with PS > 0.8
-1 10 0.95 ZINC000004481415
Prediction score
5 1,070 molecules with PS > 0.9
10 ZINC000019771150
0
4 ZINC000004623615
10 0.9
1 3
ZINC000238901709
10 ZINC000100032716
2 2
10
2.5 1
0.85
10
LogP
3 10
0
0.8
3.5 0 0.2 0.4 0.6 0.8 1
1
91
01
19
28
46
1
73
64
82
37
55
0.
4 Tanimoto score to nearest antibiotic
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
Bin
4.5
E F
5 1 100
ZINC000100032716 ZINC000225434673
>5
MIC (μg/ml)
MIC (μg/ml)
N F
128
-4
N
10 0.1
3* lC 1* T -1 -Ia -cr 3* lC 1* T -1 -Ia -cr
11 B∆to CR- CA OXA t(2’’) ’)-Ib 11 B∆to CR- CA OXA t(2’’) ’)-Ib
S
S
25 an ac(6 25 an ac(6
N
N
S
N+
O
32 8 32 BW ∆bam
M
a BW ∆bam
M
a
O O-
Br
Br O
N+
G H
O- 9 9
N N+ 10 10
-O
Br Br
N
N
N
N
OH
4 2 4 4 4 8 8
N+ N+
N
N
O 10 10
O OH N O
7 7
10 10
HN
O N SH
32 6 CFU/ml 6
CFU/ml
N
10 10
H2N S NH2
S 5
5
10 10
N
4 4
32 32 32 16 10
O S N
10
-O
N+ N
time = 0 time = 0
3 3 time = 4hrs
10 time = 4hrs 10
O
BW25113 2
BW25113
F N+ 2
O- 10 10
N NH2 128 128 64 10
-4
10
-3
10
-2
10
-1
10
0
10
1
10
-2
10
-1
10
0
10
1
10
2
10
3
HN
[ZINC000100032716] μg/ml [ZINC000225434673] μg/ml
O
S I
HN NH
128 primary training set
N Broad library
O WuXi library
H2N
O ZINC15 predictions > 0.9
S
O
N
0.01 0.25 4 0.25 0.06 false predictions
N N true predictions
N N
-O OH
N+
O O O
(C) Prediction scores and Tanimoto nearest neighbor antibiotic scores of the 23 predictions that were empirically tested for growth inhibition. Yellow circles
represent those molecules that displayed detectable growth inhibition of at least one pathogen. Grey circles represent inactive molecules. ZINC numbers of
active molecules are shown on the right.
(D) MIC values (mg/mL) of the eight active predictions from the ZINC15 database against E. coli (EC), S. aureus (SA), K. pneumoniae (KP), A. baumannii (AB), and
P. aeruginosa (PA). Blank regions represent no detectable growth inhibition at 128 mg/mL. Structures are shown in the same order (top to bottom) as their
corresponding ZINC numbers in (C).
(E) MIC of ZINC000100032716 against E. coli strains harboring a range of antibiotic-resistance determinants. The mcr-1 gene was expressed in E. coli BW25113.
All other resistance genes were expressed in E. coli BW25113 DbamBDtolC. Experiments were conducted with two biological replicates. Note the minor increase
in MIC in the presence of aac(60 )-Ib-cr.
(F) Same as (E) except using ZINC000225434673.
(G) Killing of E. coli in LB media in the presence of varying concentrations of ZINC000100032716 after 0 h (blue) and 4 h (red). The initial cell density is 106 CFU/
mL. Shown is the mean of two biological replicates. Bars denote absolute error.
(H) Same as (G) except using ZINC000225434673.
(I) t-SNE of all molecules from the primary training dataset (blue), the Drug Repurposing Hub (red), the WuXi anti-tuberculosis library (green), the ZINC15 mol-
ecules with prediction scores >0.9 (pink), false-positive predictions (gray), and true-positive predictions (yellow).
See also Figure S5 and Tables S5, S6, and S7.
Love, M.I., Huber, W., and Anders, S. (2014). Moderated estimation of fold Stokes, J.M., Lopatkin, A.J., Lobritz, M.A., and Collins, J.J. (2019b). Bacterial
change and dispersion for RNA-seq data with DESeq2. Genome Biol. Metabolism and Antibiotic Efficacy. Cell Metab. 30, 251–259.
15, 550. Surawicz, C.M., Brandt, L.J., Binion, D.G., Ananthakrishnan, A.N., Curry, S.R.,
Gilligan, P.H., McFarland, L.V., Mellow, M., and Zuckerbraun, B.S. (2013).
Manson, M.D., Tedesco, P., Berg, H.C., Harold, F.M., and Van der Drift, C.
Guidelines for diagnosis, treatment, and prevention of Clostridium difficile in-
(1977). A protonmotive force drives bacterial flagella. Proc. Natl. Acad. Sci.
fections. Am. J. Gastroenterol. 108, 478–498, quiz 499.
USA 74, 3060–3064.
Taber, H.W., Mueller, J.P., Miller, P.F., and Arrow, A.S. (1987). Bacterial uptake
Mauri, A., Consonni, V., Pavan, M., and Todeschini, R. (2006). Dragon soft-
of aminoglycoside antibiotics. Microbiol. Rev. 51, 439–457.
ware: an easy approach to molecular descriptor calculations. MATCH Com-
mun. Math. Comput. Chem. 56, 237–248. Tally, F.P., Goldin, B.R., Sullivan, N., Johnston, J., and Gorbach, S.L. (1978).
Antimicrobial activity of metronidazole in anaerobic bacteria. Antimicrob.
Mayr, A., Klambauer, G., Unterthiner, T., Steijaert, M., Wegner, J.K., Ceule- Agents Chemother. 13, 460–465.
mans, H., Clevert, D.A., and Hochreiter, S. (2018). Large-scale comparison
Tommasi, R., Brown, D.G., Walkup, G.K., Manchester, J.I., and Miller, A.A.
of machine learning methods for drug target prediction on ChEMBL. Chem.
(2015). ESKAPEing the labyrinth of antibacterial discovery. Nat. Rev. Drug Dis-
Sci. (Camb.) 9, 5441–5451.
cov. 14, 529–542.
Moriwaki, H., Tian, Y.S., Kawashita, N., and Takagi, T. (2018). Mordred: a mo- Wang, Y., Bryant, S.H., Cheng, T., Wang, J., Gindulyte, A., Shoemaker, B.A.,
lecular descriptor calculator. J. Cheminform. 10, 4. Thiessen, P.A., He, S., and Zhang, J. (2017). PubChem BioAssay: 2017 up-
O’Neill, J. (2014). Antimicrobial Resistance: Tackling a Crisis for date. Nucleic Acids Res. 45 (D1), D955–D963.
the Health and Wealth of Nations (Review on Antimicrobial Resis- Winston, J.A., Thanissery, R., Montgomery, S.A., and Theriot, C.M. (2016). Ce-
tance, 2014). foperazone-treated mouse model of clinically-relevant Clostridium difficile
Ortholand, J.Y., and Ganesan, A. (2004). Natural products and combinatorial strain R20291. J. Vis. Exp. (118), 54850.
chemistry: back to the future. Curr. Opin. Chem. Biol. 8, 271–280. Wright, G.D. (2017). Opportunities for natural products in 21st century anti-
biotic discovery. Nat. Prod. Rep. 34, 694–701.
Paul, K., Erhardt, M., Hirano, T., Blair, D.F., and Hughes, K.T. (2008). Energy
source of flagellar type III secretion. Nature 451, 489–492. Wu, M., Maier, E., Benz, R., and Hancock, R.E. (1999). Mechanism of interac-
tion of different classes of cationic antimicrobial peptides with planar bilayers
Perez, F., Hujer, A.M., Hujer, K.M., Decker, B.K., Rather, P.N., and Bonomo,
and with the cytoplasmic membrane of Escherichia coli. Biochemistry 38,
R.A. (2007). Global challenge of multidrug-resistant Acinetobacter baumannii.
7235–7242.
Antimicrob. Agents Chemother. 51, 3471–3484.
Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S.,
P.E.W. Trusts (2019). Five-year analysis shows continued deficiencies in anti- Leswing, K., and Pande, V. (2017). MoleculeNet: a benchmark for molecular
biotic development. https://www.pewtrusts.org/en/research-and-analysis/ machine learning. Chem. Sci. (Camb.) 9, 513–530.
data-visualizations/2019/five-year-analysis-shows-continued-deficiencies-in-
Yamaguchi, A., Ohmori, H., Kaneko-Ohdera, M., Nomura, T., and Sawai, T.
antibiotic-development.
(1991). Delta pH-dependent accumulation of tetracycline in Escherichia coli.
Robinson, M.D., McCarthy, D.J., and Smyth, G.K. (2010). edgeR: a Bio- Antimicrob. Agents Chemother. 35, 53–56.
conductor package for differential expression analysis of digital gene expres- Yang, J.H., Wright, S.N., Hamblin, M., McCloskey, D., Alcantar, M.A., Schrüb-
sion data. Bioinformatics 26, 139–140. bers, L., Lopatkin, A.J., Satish, S., Nili, A., Palsson, B.O., et al. (2019a). A white-
Rogers, D., and Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. box machine learning approach for revealing antibiotic mechanisms of action.
Inf. Model. 50, 742–754. Cell 177, 1649–1661.
Further information and requests for resources and reagents should be directed to James J. Collins (jimjc@mit.edu). All unique/stable
reagents generated in this study are available from the Lead Contact with a completed Materials Transfer Agreement.
METHODS DETAILS
1. Atom features: atomic number, number of bonds for each atom, formal charge, chirality, number of bonded hydrogens, hy-
bridization, aromaticity, atomic mass.
2. Bond features: bond type (single/double/triple/aromatic), conjugation, ring membership, stereochemistry.
The model applies a series of message passing steps where it aggregates information from neighboring atoms and bonds to
build an understanding of local chemistry. In Chemprop, on each step of message passing, each bond’s featurization is updated
by summing the featurization of neighboring bonds, concatenating the current bond’s featurization with the sum, and then
applying a single neural network layer with non-linear activation. After a fixed number of message-passing steps, the learned
featurizations across the molecule are summed to produce a single featurization for the whole molecule. Finally, this featurization
is fed through a feed-forward neural network that outputs a prediction of the property of interest. Since the property of interest in
our application was the binary classification of whether a molecule inhibits the growth of E. coli, the model is trained to output a
number between 0 and 1, which represents its prediction about whether the input molecule is growth inhibitory.
In addition to the basic D-MPNN architecture described above, we employed three model optimizations (Yang et al., 2019b):
Hyperparameter optimization
The performance of machine learning models is known to depend critically on the choice of hyperparameters, such as the size of
the neural network layers, which control how and what the model is able to learn. We used the Bayesian hyperparameter optimization
scheme, with 20 iterations of optimization to improve the hyperparameters of our model (see table below). Baysian hyperparameter
Ensembling
Another standard machine learning technique used to improve performance is ensembling, where several copies of the same model
architecture with different random initial weights are trained and their predictions are averaged. We used an ensemble of 20 models,
with each model trained on a different random split of the data (Dietterich, 2000).
Our initial training dataset consisted of 2,335 molecules, with 120 compounds (5.14%) showing growth inhibitory activity against
E. coli, as defined by endpoint OD600 < 0.2. We performed predictions on the Drug Repurposing Hub, consisting of 6,111 unique mol-
ecules; the WuXi anti-tuberculosis library, consisting of 9,997 unique molecules; and tranches of the ZINC15 database. The ZINC15
tranches that we used for molecular predictions were selected based on their likelihood to contain antibiotic-like molecules; these
tranches included: ‘AA’, ‘AB’, ‘BA’, ‘BB’, ‘CA’, ‘CB’, ‘CD’, ‘DA’, ‘DB’, ‘EA’, ‘EB’, ‘FA’, ‘FB’, ‘GA’, ‘GB’, ‘HA’, ‘HB’, ‘IA’, ‘IB’, ‘JA’,
‘JB’, ‘JC’, ‘JD’, ‘KA’, ‘KB’, ‘KC’, ‘KD’, ‘KE’, ‘KF’, ‘KG’, ‘KH’, ‘KI’, ‘KJ’, and ‘KK’, constituting a dataset of 107,349,233 unique molecules.
Our experimental procedure consisted of four phases: (1a) a training phase to evaluate the optimized but non-ensembled model
and (1b) training the ensemble of optimized models; (2) a prediction phase; (3) a retraining phase; and (4) a final prediction phase. We
began by evaluating our model on the training set of 2,335 molecules using all optimizations except for ensembling, in order to deter-
mine the best performance of a single model. Here, we randomly split the dataset into 80% training data, 10% validation data, and
10% test data. We trained our model on the training data for 30 epochs, where an epoch is defined as a single pass through all of the
training data, and we evaluated it on the validation data at the end of each epoch. After training was complete, we used the model
parameters that performed best on the validation data and tested the model with those parameters on the test data. We repeated
this procedure with 20 different random splits of the data and averaged the results. After we were satisfied with model performance,
we conducted predictions on new datasets. Since we wanted to maximize the amount of training data and were no longer interested
in measuring performance on the test set, we trained new models on the training data from each of 20 random splits, each with 90%
training data, 10% validation data, and no test data.
The ensemble consisting of 20 models is the model that was applied first to the Drug Repurposing Hub, and then the WuXi anti-
tuberculosis library. After empirically testing the highest and lowest predicted molecules from these libraries for growth inhibition
against E. coli, we included all these data into our original training sets to create a new training set. The updated training set contained
2,911 unique molecules, with 232 (7.97%) showing growth inhibitory activity. We next used our retrained model to make predictions
on the aforementioned subset of the ZINC15 database. We selected all molecules with a prediction score > 0.7, which resulted in
6,820 compounds. All molecules selected for curation were subsequently cross-referenced with SciFinder to ensure that these
were not clinical antibiotics.
We lastly compared the prediction outputs of our augmented D-MPNN with a D-MPNN without RDKit features; a feedforward
DNN model with the same depth as our D-MPNN model with hyperparameter optimization using RDKit features only; the same
DNN instead using Morgan fingerprints (radius 2) as the molecular representation; and RF and SVM models using the same Morgan
fingerprint representations. We used the scikit-learn implementation of a random forest classifier with all of the default parameters
except for the number of trees, where we used 500 instead of 10. When making predictions, we output the growth inhibition
probability for each molecule according to the random forest, which is the proportion of trees in the model that predict a 1 for
that molecule. Similarly, we used the scikit-learn implementation of a support vector machine with all of the default parameters.
When making predictions, we output the signed distance between the Morgan fingerprint of the molecule and the separating
hyperplane that is learned by the SVM. This number represents how much the model predicts a molecule is antibacterial, with large
positive distances meaning most likely antibacterial and large negative distances meaning most likely not. Although the signed
distance is not a probability, it can still be used to rank the molecules according to how likely they are to be antibacterial.
To predict the toxicity of molecules for possible in vivo applications, we trained a Chemprop model on the ClinTox dataset.
This dataset consisted of 1,478 molecules, each with two binary properties: (a) clinical trial toxicity and (b) FDA-approval status. Of
these 1,478 molecules, 94 (6.36%) had clinical toxicity and 1,366 (92.42%) were FDA approved. Using the same methodology as
described in phase (1) of our experimental procedure, the Chemprop model was trained on both properties simultaneously and learned
a single molecular representation that was used by the feed-forward neural network layers to predict toxicity. We utilized the same
RDKit features as in our other models, except for that the ClinTox model was an ensemble of five models and used the following
optimal hyperparameters: message-passing steps = 6; neural network hidden size = 2200; number of feed-forward layers = 3; and
dropout probability = 0.15. This ensemble of models was subsequently used to make toxicity predictions on our candidate molecules.
Chemical screening
E. coli BW25113 was grown overnight in 3 mL Luria-Bertani (LB) medium and diluted 1/10,000 into fresh LB. 99 ml of cells was added
to each well of a 96-well flat-bottom plate (Corning) using a multichannel pipette. Next, 1 ml of a 5 mM stock of each molecule from an
FDA-approved drug library supplemented with a natural product library (2,560 molecules total; MicroSource Discovery Systems)
was added, in duplicate, using an Agilent Bravo liquid handler. The final screening concentration was 50 mM. Plates were then incu-
bated in sealed plastic bags at 37 C without shaking for 16 hr, and subsequently read at 600 nm using a SpectraMax M3 plate reader
(Molecular Devices) to quantify cell growth. Plate data were normalized based on the interquartile mean of each plate.
Mutant generation
For serial passage evolution, E. coli BW25113 was grown overnight in 3 mL LB medium and diluted 1/10,000 into fresh LB. Cells were
grown in 96-well flat-bottom plates (Corning), in the presence of varying concentrations of halicin (or ciprofloxacin) at two-fold
serial dilutions, in final volumes of 100 ml. Plates were incubated at 37 C without shaking for 24 hr, at which time they were read
at 600 nm using a SpectraMax M3 plate reader. After 24 hr, cells that grew in the presence of the highest concentration of halicin
AB5044 TAGCCGGGCAGATGCCCGGCAAGAGAGAATTACACTTCGGTTAAGGTGATATTCCGGGGATCCGTCGACC
AB5045
ACCTTGTAATCTGCTGGCACGCAAAATTACTTTCACATGGAGTCTTTATGTGTAGGCTGGAGCTGCTTCG
AB5046
tgcaaaataatatgcaccacgacggcggtcagaaaaataa
AB5047
gaagcgttacttcgcgatctgatcaacgattcgtggaatc
RNA sequencing
Cells were grown overnight in 3 mL LB medium and diluted 1/10,000 into 50 mL fresh LB. When cultures reached 107 CFU/ml, hal-
icin was added at 0.25x MIC (0.5 mg/ml), 1x MIC (2 mg/ml), or 4x MIC (8 mg/ml) and cells were incubated for the noted durations. After
incubation, cells were harvested via centrifugation at 15,000 x g for 3 min at 4 C, and RNA was purified using the Zymo Direct-zol 96-
well RNA purification kit (R2056). Briefly, 107-108 CFU pellets were lysed in 500 ml hot Trizol reagent (Life Technologies). 200 ml
chloroform (Millipore Sigma) was added, and samples were centrifuged at 15,000 x g for 3 min at 4 C. 200 ml of the aqueous phase
was added to 200 ml anhydrous ethanol (Millipore Sigma), and RNA was purified using a Zymo-spin plate as per the manufacturer’s
instructions. After purification, Illumina cDNA libraries were generated using a modified version of the RNAtag-seq protocol (Shishkin
et al., 2015). Briefly, 500 ng – 1 mg of total RNA was fragmented, depleted of genomic DNA, dephosphorylated, and ligated to DNA
adapters carrying 50 -AN8-30 barcodes of known sequence with a 50 phosphate and a 30 blocking group. Barcoded RNAs were pooled
and depleted of rRNA using the RiboZero rRNA depletion kit (Epicenter). Pools of barcoded RNAs were converted to Illumina cDNA
libraries in two main steps: (1) reverse transcription of the RNA using a primer designed to the constant region of the barcoded
adaptor with addition of an adaptor to the 30 end of the cDNA by template switching using SMARTScribe (Clontech), as previously
described (Zhu et al., 2001); and (2) PCR amplification using primers whose 50 ends target the constant regions of the 30 or 50 adaptors
and whose 30 ends contain the full Illumina P5 or P7 sequences. cDNA libraries were sequenced on the Illumina NextSeq 500 platform
to generate paired end reads. Following sequencing, reads from each sample in a pool were demultiplexed based on their associated
barcode sequence. Up to one mismatch in the barcode was allowed, provided it did not make assignment of the read to a different
barcode possible. Barcode sequences were removed from the first read, as were terminal G’s from the second read that may have
been added by SMARTScribe during template switching. Next, reads were aligned to the E. coli MG1655 genome (NC_000913.3)
using BWA (Li and Durbin, 2009) and read counts were assigned to genes and other genomic features. Differential expression
analysis was conducted with DESeq2 (Love et al., 2014) and/or edgeR (Robinson et al., 2010). To verify coverage, visualization
of raw sequencing data and coverage plots in the context of genome sequences and gene annotations was conducted using
GenomeView (Abeel et al., 2012). To determine biological response of cells as a function of halicin exposure, we performed hierar-
chical clustering of the gene expression profiles using the clustergram function in MATLAB 2016a. We selected the Euclidean dis-
tance as the metric to define the pairwise distance between observations, which measures a straight-line distance between two
points. The use of Euclidian distance has been considered as the most appropriate to cluster log-ratio data (D’haeseleer, 2005).
With a metric defined, we next selected the average linkage as the clustering method. The average linkage uses the algorithm called
unweighted pair group method with arithmetic mean (UPGMA), which is the most popular and preferred algorithm for hierarchical
data clustering (Jaskowiak et al., 2014; Loewenstein et al., 2008). UPGMA uses the mean similarity across all cluster data points
to combine the nearest two clusters into a higher-level cluster. UPGMA assumes there is a constant rate of change among species
DiSC3(5) assays
S. aureus USA300 and E. coli MC1061 were streaked onto LB agar and grown overnight at 37 C. Single colonies were picked and
used to inoculate 50 mL LB in 250 mL baffled flasks, which were incubated for 3.5 hr in a 37 C incubator shaking at 250 rpm. Cultures
were pelleted at 4000 x g for 15 min and washed three times in buffer. For E. coli, the buffer was 5 mM HEPES with 20 mM glucose (pH
7.2). For S. aureus, the buffer was 50 mM HEPES with 300 mM KCl and 0.1% glucose (pH 7.2). Both cell densities were normalized to
OD6000.1, loaded with 1 mM DiSC3(5) dye (3,30 -dipropylthiadicarbocyanine iodide), and left to rest for 10 min in the dark for probe
fluorescence to stabilize. Fluorescence was measured in a cuvette-based fluorometer with stirring (Photon Technology International)
at 620 nm excitation and 670 nm emission wavelengths. A time-course acquisition was performed, with compounds injected after
60 s of equilibration to measure increases or decreases in fluorescence. For E. coli, polymyxin B was used as a control to monitor Dc
dissipation. For S. aureus, valinomycin was used as a Dc control and nigiricin was used as a DpH control. Upon addition of antibiotic,
fluorescence was read continuously for 3 min and at an endpoint of 4 hr.
Code availability
Chemprop code is available at: https://github.com/swansonk14/chemprop.
ADDITIONAL RESOURCES
A B
1.6 1.6
1.4
1.4
1.2
Relative growth
Relative growth replicate 2 1.2 1
0.8
1
0.6
0.8 0.4
0.2
0.6
0
0 500 1000 1500 2000 2500
0.4
Plate index
0.2
0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6
Relative growth replicate 1
Figure S1. Primary Screening and Initial Model Training, Related to Figure 2.
(A) Primary screening data for growth inhibition of E. coli by 2,560 molecules within the FDA-approved drug library supplemented with a natural product collection.
Red are growth inhibitory molecules; blue are non-growth inhibitory molecules. (B) Rank-ordered de-duplicated screening data containing 2,335 molecules.
Shown is the mean of two biological replicates. Red are growth inhibitory molecules; blue are non-growth inhibitory molecules.
A B C
10 10 9
10 10 10
9 9 8
10 10 10
8 8
10 10 7
10
7 7
10 10 6
CFU/ml
CFU/ml
CFU/ml
6 6
10
10 1hr 10 1hr 5
2hr
5 2hr 5 2hr 10 4hr
10 10
3hr 3hr 4 6hr
10
4
10
4 10
4hr 4hr 8hr
3
10
3 BW25113 10
3 BW25113 10 BW25113
2
nutrient replete 2
nutrient replete 2
nutrient deplete
10 10 10
-2 -1 0 1 2 3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[halicin] μg/ml [halicin] μg/ml [halicin] μg/ml
D E F
9 9 9
10 10 10
8 8 8
10 10 10
7 7 7
10 10 10
6 6 6
CFU/ml
CFU/ml
CFU/ml
10 10 10
5
2hr 5
2hr 5
2hr
10 4hr 10 4hr 10 4hr
4 6hr 4 6hr 4 6hr
10 10 10
8hr 8hr 8hr
3 3 3
10 BW25113 10 BW25113 10 BW25113
2
nutrient deplete 2
nutrient deplete 2
nutrient deplete
10 10 10
-2 -1 0 1 2 3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[halicin] μg/ml [ampicillin] μg/ml [ampicillin] μg/ml
G H I
9 10 10
10 10 10
8 9 9
10 10 10
8 8
7 10 10
10
7 7
6 10 10
CFU/ml
CFU/ml
CFU/ml
10 6 6
5
2hr 10 1hr 10 1hr
10 4hr 5 2hr 5 2hr
10 10
4 6hr 3hr 3hr
10 10
4
10
4
8hr 4hr 4hr
3
10 BW25113 10
3 BW25113 10
3 BW25113
2
nutrient deplete 2
nutrient replete 2
nutrient replete
10 10 10
-2 -1 0 1 2 3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[ampicillin] μg/ml [ampicillin] μg/ml [ampicillin] μg/ml
J K L
10 3 amp gent
10 10 OXA-1 ant(2”)-Ia 0.7
chlor
9
10 2 CAT 0.6
10 colistin
8
10 MCR-1 0.5
1
10
OD (600nm)
7
MIC (μg/ml)
10
CFU/ml
0
0.4
6
10 1hr 10
5 2hr 0.3
10 -1 levo
3hr 10 aac-(6’)-Ib-cr
4 0.2
10 4hr -2
3 BW25113 10 0.1 NfsA/B
10
2
nutrient replete -3 BW25113
10 10 0
10
-2
10
-1
10
0
10
1
10
2
10
3 WT R WT R WT R WT R WT R 10
-5 -4 -3 -2
10 10 10 10
-1
10
0
10
1
10
2
10
3
0.6
0.5
OD (600nm)
0.4
0.3
0.2
0.1 NfsA/B
BW25113
0
-5 -4 -3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10
[nitrofurantoin] μg/ml
wildtype wildtype
no drug
B 10
C 10
D
10 10
CFU/ml
CFU/ml
6 6
10 10 0.3
2 2
10 10
OD (600nm)
0.2
pH 5.4
pH 6.4
0.1
pH 7.4
pH 8.4
pH 9.4
0
-5 -4 -3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10
3 3 [halicin] μg/ml
2 2 E
0.7
1 1
0 0 0.6
-1 -1 0.5
OD (600nm)
-2 -2 0.4
-3 -3
0.3
0.2
0.1 LB + 25 mM NaHCO3
LB
0
-5 -4 -3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10
[halicin] μg/ml
1 2 3 4 1 2 3 4
Time (hours) Time (hours)
F G
5 5
10 6,000 10
valinomycin nigericin
DiSC3(5) fluorescence (RFU)
DiSC3(5) fluorescence (RFU)
DiSC3(5) fluorescence (RFU)
valinomycin
5,000
10
4
nigericin 4,000 DMSO 10
4
DMSO halicin
3,000
halicin
3 3
10 2,000 10
0 50 100 150 0 50 100 150
n
in
SO
l
l
l
/m
/m
/m
/m
i
ric
yc
M
μg
μg
om
ge
D
H
ni
5
25
lin
0.
0.
va
0.5 0.5
0.4 0.4
OD (600nm)
OD (600nm)
0.3 0.3
0.2 0.2
CFU/ml
10 10 2hr
2hr
10
5
10
5 4hr
4hr
6hr
4 6hr 4
10 10 8hr
8hr
3 3 A. baumannii 288
10 A. baumannii 288 10 nutrient deplete
2
nutrient deplete 2
10 10
-2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10
[halicin] μg/ml [halicin] μg/ml
Figure S4. Activity of Halicin against A. baumannii CDC 288, Related to Figure 5
(A) Killing of A. baumannii in PBS in the presence of varying concentrations of halicin after 2 hr (blue), 4 hr (cyan), 6 hr (green), and 8 hr (red). The initial cell density is
107 CFU/ml. Shown is the mean of two biological replicates. Bars denote absolute error. (B) Same as (A), with initial cell density 106 CFU/ml.
A B C
1 1.2 1.2
0.8 1 1
Prediction score
0.8 0.8
OD (600nm)
OD (600nm)
0.6
0.6 0.6
0.4
0.4 0.4
0.2 0.2 0.2
top 200 bottom 100
0 0 0
0 4000 8000 12000 0 50 100 150 200 0 20 40 60 80 100
Ranked molecules Ranked molecules Ranked molecules
D E F
1 1 1
OD (600nm)
OD (600nm)
0.6 0.6 0.6
E. coli
S. aureus
0.4 0.4 0.4 K. pneumoniae
A. baumannii
0.2 0.2 0.2 P. aeruginosa
0 0 0
-2 -1 0 1 2 3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[ZINC000098210492] μg/ml [ZINC000001735150] μg/ml [ZINC000225434673] μg/ml
G H I
1 1 1
OD (600nm)
0.6 0.6 0.6
0 0 0
-2 -1 0 1 2 3 -2 -1 0 1 2 3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[ZINC000004481415] μg/ml [ZINC000019771150] μg/ml [ZINC000004623615] μg/ml
J K L
1 1 0.6
0.5
0.8 0.8
0.4
OD (600nm)
OD (600nm)
OD (600nm)
0.6 0.6
0.3
0.4 0.4
0.2
BW25113
0.2 0.2 0.1 BW25113 gyrA S83A
0 0 0
-2 -1 0 1 2 3 -5 -4 -3 -2 -1 0 1 2 3 -5 -4 -3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
[ZINC000238901709] μg/ml [ZINC000100032716] μg/ml [ZINC000100032716] μg/ml
M
0.6
0.5
0.4
OD (600nm)
0.3
0.2
BW25113
0.1 BW25113 gyrA S83A
0
-5 -4 -3 -2 -1 0 1 2 3
10 10 10 10 10 10 10 10 10
[ciprofloxacin] μg/ml
Figure S5. Model Predictions from the WuXi Anti-tuberculosis Library and the ZINC15 Database, Related to Figure 6
(A) Rank-ordered prediction scores of WuXi anti-tuberculosis library molecules. Note the overall low prediction scores. (B) The top 200 predictions from the data
shown in (A) were curated for empirical testing for growth inhibition of E. coli. None were growth inhibitory, in agreement with their low prediction scores. Shown is
the mean of two biological replicates. (C) The bottom 100 predictions from the data shown in (A) were curated for empirical testing for growth inhibition of E. coli.
None were growth inhibitory, in agreement with their low prediction scores. Shown is the mean of two biological replicates. (D-K) Growth inhibition by eight
empirically validated ZINC15 predictions against E. coli (blue), S. aureus (green), K. pneumoniae (purple), A. baumannii (pink), and P. aeruginosa (red) in LB media.
Shown is the mean of two biological replicates. Bars denote absolute error. (L) Growth inhibition by ZINC000100032716 against E. coli BW25113 (blue) or a
ciprofloxacin-resistant gyrA S83A mutant of BW25113 (red). Shown is the mean of two biological replicates. Bars denote absolute error. (M) Same as (L) except
using ciprofloxacin. Note the 4-fold smaller change in MIC with ZINC000100032716 between the gyrA mutant and wild-type E. coli relative to ciprofloxacin.