Optimising a Neural Network for Solubility Prediction
This project trains a long short-term memory (LSTM) recurrent neural network on a dataset of calculated log(P) values for a variety of organic molecules. SMILES strings are tokenised, trimmed, and padded before being used to train the model. Bayesian optimisation is used to tune the hyperparameters so as to achieve the best balance between the mean squared error on the training and test datasets. The model achieves reasonable accuracy, but tends toward larger errors when less common structural features are present (e.g. hypervalent iodine compounds or some atypical phosphorus-containing species, such as deprotonated alkylphosphanes).
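
The tokenise, trim, and pad step can be illustrated with a minimal sketch. It is not the project's actual preprocessing code: the regular expression, the function names (`tokenise`, `build_vocab`, `encode`), the reserved `PAD` and `UNK` indices, and the choice to trim and pad at the end of the sequence are all assumptions made for illustration.

```python
import re

# Assumed tokenisation scheme: bracket atoms such as [NH4+] are kept whole,
# then two-letter halogens, then two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

PAD = 0  # index reserved for padding (assumption)
UNK = 1  # index reserved for tokens absent from the vocabulary (assumption)


def tokenise(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the dataset to an integer index starting at 2."""
    tokens = sorted({tok for s in smiles_list for tok in tokenise(s)})
    return {tok: i + 2 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token indices."""
    ids = [vocab.get(tok, UNK) for tok in tokenise(smiles)]
    ids = ids[:max_len]                           # trim sequences that are too long
    return ids + [PAD] * (max_len - len(ids))     # pad sequences that are too short


if __name__ == "__main__":
    data = ["CCO", "CC(=O)Cl", "C[NH3+]"]
    vocab = build_vocab(data)
    for s in data:
        print(s, tokenise(s), encode(s, vocab, max_len=8))
```

Every encoded sequence has the same length, so a batch can be stacked into a single integer array and passed to an embedding layer ahead of the LSTM. Tokens that did not appear when the vocabulary was built map to `UNK`, which may be one reason rare structural motifs are predicted less accurately.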

