Molecular Water solubility (LogS) prediction by Machine Learning Method

Water solubility of compounds significantly affect its druggability, absorption and distribution property, such as oral bioavailability, intestinal absorption and BBB penetration. Typically, a low solubility goes along with a bad absorption and therefore the general aim is to avoid poorly soluble compounds. For convenient, water solubility (mol/Liter) are converted to logarithm value as LogS.

There are two major methods to predict LogS, atom contribution method and machine learning based method. The atom contribution method predict solubility via an increment system by adding atom contributions depending on their atom types. The machine learning method uses 2D or 3D features generated from molecular structures to fit a regression model for prediction.

The atom contribution method requires solid domain knowledge of cheminformatics, while machine learning method can use out-of-box cheminformatic toolkit to generate features for fitting models. Sounds easy, right? 😉

Here, we use python with rdkit and sklearn to predict LogS trained from a public dataset of water solubility

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import os
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.externals import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score, make_scorer

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Descriptors, ReducedGraphs
from rdkit.Avalon.pyAvalonTools import GetAvalonFP
from rdkit.ML.Descriptors import MoleculeDescriptors
from rdkit.Chem.EState import Fingerprinter
from rdkit.Chem import Descriptors

from sklearn.feature_selection import mutual_info_regression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
def fingerprint(mol, fptype="MACCSKeys", radius=2, bits=1024):
npfp = np.zeros((1,))
if fptype == "MACCSKeys":
DataStructs.ConvertToNumpyArray(AllChem.GetMACCSKeysFingerprint(mol), npfp)
elif fptype == "Avalon":
DataStructs.ConvertToNumpyArray(GetAvalonFP(mol), npfp)
elif fptype == "ECFP":
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(mol, radius, bits), npfp)
elif fptype == "ErG":
npfp = ReducedGraphs.GetErGFingerprint(mol)
elif fptype == "Estate":
npfp = Fingerprinter.FingerprintMol(mol)[0]
else:
raise TypeError()
return npfp

def descriptors(mol):
calc=MoleculeDescriptors.MolecularDescriptorCalculator([x[0] for x in Descriptors._descList])
ds = np.asarray(calc.CalcDescriptors(mol))
return ds

Reading and Preprocessing

1
2
3
4
5
6
7
8
9
smis = []
logs = []
with open("LogS.txt", "r") as f:
reader = csv.reader(f, delimiter=" ")
for row in reader:
smis.append(row[1])
logs.append(float(row[2]))

print(smis[:5], logs[:5])
['CC(N)=O', 'CNN', 'CC(O)=O', 'C1CCCN1', 'NC([NH]O)=O'] [1.58, 1.34, 1.22, 1.15, 1.12]
1
2
3
%%time
mols = [Chem.MolFromSmiles(smi) for smi in smis]
feature = [np.append(fingerprint(mol), descriptors(mol)) for mol in mols]
CPU times: user 14.4 s, sys: 109 ms, total: 14.5 s
Wall time: 14.7 s
1
2
3
4
5
6
7
8
9
10
%%time
# random seed was set as 2 for reproduction
np.random.seed(2)
X = np.array(feature)
y = np.array(logs)

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=2)
minfo = mutual_info_regression(X_train, y_train)
CPU times: user 5.52 s, sys: 62.5 ms, total: 5.58 s
Wall time: 5.59 s

Model Selection

Basic Linear Regression model

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(X_test)
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

#plt.xlim((-12,2))
#plt.ylim((-12,2))
plt.title("LinerRegression Prediction")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, model.predict(X_train),
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, model.predict(X_test),
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.959263340625798
Train MAE score: 0.3102
Test set R^2:  0.06907362861596777
Test MAE score: 0.6906

png

SVM Regression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
from sklearn.svm import SVR

model = SVR()
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(X_test)
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

plt.xlim((-12,2))
plt.ylim((-12,2))
plt.title("SVM Regression Prediction")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, model.predict(X_train),
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, model.predict(X_test),
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.5843761663378015
Train MAE score: 0.7394
Test set R^2:  0.09632083916211343
Test MAE score: 1.4035

png

SVM Regression (SVR) model shows very bad prediction results, because we haven’t normalize features into (-1, 1). Data normalization can promote the performance in common machine learning problems, and speed up the coverage of gradient descent algorithm. We use StandardScaler, a rescaling method, to scale features to (-1, 1) range. After scaling, SVR works perfectly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
stds = StandardScaler()
stds.fit(X_train)

model = SVR()
model.fit(stds.transform(X_train), y_train)

y_pred_train = model.predict(stds.transform(X_train))
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(stds.transform(X_test))
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

plt.xlim((-12,2))
plt.ylim((-12,2))
plt.title("SVM Regression Prediction with Scaler")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, y_pred_train,
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, y_pred_test,
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.9529119367162064
Train MAE score: 0.2789
Test set R^2:  0.8969447643834331
Test MAE score: 0.4498

png

KNN Regression

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
stds = StandardScaler()
stds.fit(X_train)

model = KNeighborsRegressor()
model.fit(stds.transform(X_train), y_train)

y_pred_train = model.predict(stds.transform(X_train))
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(stds.transform(X_test))
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

plt.xlim((-12,2))
plt.ylim((-12,2))
plt.title("KNN-Regression Prediction with Scaler")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, y_pred_train,
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, y_pred_test,
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.8662481382255657
Train MAE score: 0.5540
Test set R^2:  0.7600020560030265
Test MAE score: 0.7197

png

Random Forest Regression

Random Forest (RF) method is an ensemble method, which is an ensemble of Decision Tree models, thus we call it “Forest”. Feature normalization is not needed for RF, because RF didn’t compare magnitude of different features and it didn’t use gradient descent algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

y_true, y_pred = y_test, model.predict(X_test)

y_pred_train = model.predict(X_train)
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(X_test)
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

plt.xlim((-12,2))
plt.ylim((-12,2))
plt.title("Random Forest Regression Prediction")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, y_pred_train,
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, y_pred_test,
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.9826342744246852
Train MAE score: 0.1889
Test set R^2:  0.8906418060478993
Test MAE score: 0.4818

png

Neural Network Perception

Neural Network (NN) is a hot-topic after alpha-go defated the world champine, it can also used to build regression model. Here we will build a shallow NN to predict LogS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
from keras.models import Sequential
from keras.layers import Dense

stds = StandardScaler()
stds.fit(X_train)

model = Sequential()
model.add(Dense(units = 128, input_dim = X.shape[1], activation='relu'))
model.add(Dense(units = 1))

# model.summary()

model.compile(loss = 'mae',
optimizer = 'adam',
metrics=['accuracy'])
history = model.fit(stds.transform(X_train), y_train, epochs = 100, batch_size = 32,
validation_data = (stds.transform(X_test), y_test), verbose=0)


y_pred_train = model.predict(stds.transform(X_train))
print("Train set R^2: ", r2_score(y_train, y_pred_train))
print("Train MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(stds.transform(X_test))
print("Test set R^2: ", r2_score(y_test, y_pred_test))
print("Test MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))


loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = len(loss)
plt.xlim((0, 50))
plt.ylim((0, 1))
plt.plot(range(epochs), loss, marker = '.', label = 'loss')
plt.plot(range(epochs), val_loss, marker = '.', label = 'val_loss')
plt.legend(loc = 'best')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()



#plt.xlim((-12,2))
#plt.ylim((-12,2))
plt.title("Neural Network Prediction with Scaler")
plt.xlabel("Real LogS")
plt.ylabel("Predicted LogS")
plt.grid()
plt.scatter(y_train, y_pred_train,
color="blue", alpha=0.8, label="train")
plt.scatter(y_test, y_pred_test,
color="lightgreen", alpha=0.8, label="test")
plt.legend(loc = 'best')
plt.show()
Train set R^2:  0.9901072859319796
Train MAE score: 0.1366
Test set R^2:  -0.11753402040517735
Test MAE score: 0.5345

png

png

Obviously, NN predicted an abnormal value, why? Let’s dig out this abnormal compounds.

1
2
3
4
5
6
7
8
9
from rdkit.Chem.Draw import IPythonConsole
v = X_test[np.argwhere(y_pred_test == y_pred_test.max())[0,0]]
for smi, s in zip(smis, X):
if np.all(s == v):
abnorm_smi = smi
print(abnorm_smi)
m = Chem.MolFromSmiles(abnorm_smi)
print(Descriptors.MolWt(m))
m
CC(CC=CC=CC=CC=CC(OC4OC(C)C(O)C(N)C4O)CC(C(C(O)=O)C(O)C3)OC3(O)CC(O)CC(O2)C2C=C1)OC1=O
665.7330000000004

png

1
2
3
4
5
6
7
8
9
10
11
12
13
mws = [Descriptors.MolWt(mol) for mol in mols]

plt.xlabel("Molecular Weight")
plt.ylabel("No. of Compounds")
plt.title("Distribution of Molecular Weight")
plt.hist(mws, color="gray", alpha=0.8)
plt.show()

plt.xlabel("Molecular LogS")
plt.ylabel("No. of Compounds")
plt.title("Distribution of Molecular LogS")
plt.hist(logs, color="green", alpha=0.2)
plt.show()

png

png

This molecular is a macrocycle and has many hydrophilic functional groups. These hydrophilic groups contribute too much positive features and covered negative features of hydrophobic backbone. And of course, only 5 compounds have molecular weight greater than 500. Large compounds and macrocycles are unbalanced samples in this dataset, and this model is overfitted on small molecules.

This illustrate that it is not wise to predict LogS of large compounds and macrocycles by model trained on this dataset. On the other hand, this model is good at predicting LogS for small organic molecules.

Search for Best type of features

There are many kind of molecular finger prints and descriptors, which one or which kind of them is vital for LogS prediction?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# different features
for fptype in ["MACCSKeys", "ErG", "Avalon", "ECFP", "Descriptor", "MACCSKeys+Descriptors"]:
if fptype == "Descriptor":
feature = [descriptors(mol) for mol in mols]
elif fptype == "MACCSKeys+Descriptors":
feature = [np.append(fingerprint(mol, "MACCSKeys"), descriptors(mol)) for mol in mols]
else:
feature = [fingerprint(mol, fptype) for mol in mols]

# random seed was set as 2 for reproduction
np.random.seed(2)

X = np.array(feature)
y = np.array(logs)

X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=2)


stds = StandardScaler()
stds.fit(X_train)

model = SVR()
model.fit(stds.transform(X_train), y_train)

print("Feature Type:\t%s" % fptype)
y_pred_train = model.predict(stds.transform(X_train))
print("\tTrain set R^2: ", r2_score(y_train, y_pred_train))
print("\tTrain MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(stds.transform(X_test))
print("\tTest set R^2: ", r2_score(y_test, y_pred_test))
print("\tTest MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))

print("-"*40,"\n\n")

Feature Type:	MACCSKeys
	Train set R^2:  0.8348432537253556
	Train MAE score: 0.5340
	Test set R^2:  0.7269931078788134
	Test MAE score: 0.7887
---------------------------------------- 


Feature Type:	ErG
	Train set R^2:  0.43533956678068464
	Train MAE score: 1.0801
	Test set R^2:  0.41443974442003917
	Test MAE score: 1.1513
---------------------------------------- 


Feature Type:	Avalon
	Train set R^2:  0.8411296883499888
	Train MAE score: 0.5065
	Test set R^2:  0.7501108238770297
	Test MAE score: 0.7341
---------------------------------------- 


Feature Type:	ECFP
	Train set R^2:  0.7829458742941775
	Train MAE score: 0.5726
	Test set R^2:  0.5447519853586031
	Test MAE score: 1.0134
---------------------------------------- 


Feature Type:	Descriptor
	Train set R^2:  0.9429008343024807
	Train MAE score: 0.3142
	Test set R^2:  0.892167479543334
	Test MAE score: 0.4646
---------------------------------------- 


Feature Type:	MACCSKeys+Descriptors
	Train set R^2:  0.9529119367162064
	Train MAE score: 0.2789
	Test set R^2:  0.8969447643834331
	Test MAE score: 0.4498
---------------------------------------- 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# different feature selection cutoff
minfo = mutual_info_regression(X_train, y_train)
for cutoff in (0., 0.01, 0.05, 0.1):
print("Feature left at cutoff %.2f:\t%d feature" % (cutoff, np.sum(minfo > cutoff)))
stds = StandardScaler()
stds.fit(X_train[:, minfo > cutoff])

model = SVR()
model.fit(stds.transform(X_train[:, minfo > cutoff]), y_train)

y_pred_train = model.predict(stds.transform(X_train[:, minfo > cutoff]))
print("\tTrain set R^2: ", r2_score(y_train, y_pred_train))
print("\tTrain MAE score: %.4f" % mean_absolute_error(y_train, y_pred_train))

y_pred_test = model.predict(stds.transform(X_test[:, minfo > cutoff]))
print("\tTest set R^2: ", r2_score(y_test, y_pred_test))
print("\tTest MAE score: %.4f" % mean_absolute_error(y_test, y_pred_test))
print("-"*40, "\n\n")
Feature left at cutoff 0.00:	300 feature
	Train set R^2:  0.9547396313014773
	Train MAE score: 0.2688
	Test set R^2:  0.9018933828417203
	Test MAE score: 0.4386
---------------------------------------- 


Feature left at cutoff 0.01:	232 feature
	Train set R^2:  0.9560197972695562
	Train MAE score: 0.2641
	Test set R^2:  0.9144409662106348
	Test MAE score: 0.4106
---------------------------------------- 


Feature left at cutoff 0.05:	107 feature
	Train set R^2:  0.9388813570437281
	Train MAE score: 0.3228
	Test set R^2:  0.905214588528981
	Test MAE score: 0.4471
---------------------------------------- 


Feature left at cutoff 0.10:	73 feature
	Train set R^2:  0.9306199986641003
	Train MAE score: 0.3609
	Test set R^2:  0.8990763861247755
	Test MAE score: 0.4719
---------------------------------------- 

Conclusion

In this article, I used different models, features and feature selection cutoff to build a state-of-art LogS prediction model on a public dataset. On this dataset, my SVR-0.01 model (R^2 0.92, MAE:0.41) shows best performance on test set. Roughly compared to other blog and ALOGPS, this model shows best performance, but I still need an external validation set to estimate its generalization. Since most compounds in this dataset are soluble (-8 < LogS < 2) small compounds (50 < MW < 400), this model is very suitable to estimate drug-like moleculars LogS.

Reference

[1] Kerasで化合物の溶解度予測(ニューラルネットワーク,回帰分析)

[2] Wash that gold- Modelling solubility with Molecular fingerprints by Esben Jannik Bjerrum

[3] Chainer: クラス分類による溶解度の予測 – Cheminformist3