Modelling Chemistry


Tshitoyan et al. have a paper out that uses NLP, specifically word2vec, to grope towards predictive chemistry. They analyze a corpus of the abstracts of a few million papers from the previous century, with a vocabulary of half a million words. They embed the words in a 200-dimensional vector space and, as usual, use the dot product as the measure of similarity between embeddings. This yields the conventional analogies (king − man + woman ≈ queen, or more pertinently ferromagnetic − NiFe + IrMn ≈ antiferromagnetic).
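
To make the vector-arithmetic mechanic concrete, here is a minimal sketch using gensim's Word2Vec, not the authors' own pipeline. The toy corpus and its tokens (NiFe, IrMn, Bi2Te3) are stand-ins for the millions of tokenized abstracts they use, so the nearest neighbours here are essentially noise; the point is only the 200-dimensional skip-gram setup and the analogy query.

```python
from gensim.models import Word2Vec  # gensim >= 4.0 parameter names

# Stand-in corpus: in the paper this is a few million tokenized abstracts,
# with chemical formulae kept as single tokens.
corpus = [
    ["NiFe", "is", "ferromagnetic", "at", "room", "temperature"],
    ["IrMn", "is", "an", "antiferromagnetic", "alloy"],
    ["Bi2Te3", "is", "a", "well", "known", "thermoelectric", "material"],
]

# 200-dimensional skip-gram embeddings, matching the settings described above.
model = Word2Vec(
    sentences=corpus,
    vector_size=200,  # embedding dimension
    sg=1,             # 1 = skip-gram, 0 = continuous bag of words
    window=8,
    min_count=1,      # keep every token in this toy example
    workers=1,
)

# Analogy by vector arithmetic: king - man + woman ~ queen, or here
# ferromagnetic - NiFe + IrMn ~ antiferromagnetic.
print(model.wv.most_similar(positive=["ferromagnetic", "IrMn"],
                            negative=["NiFe"], topn=3))
```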

Then they use the technique to predict properties of compounds for thermoelectric, photoelectric and other behaviour with surprising success. They try this with the elements as well as the compounds, but there they falter in places. As they note in the Supplementary materials hydrogen turns out to be all wrong. I attach Extended Fig 1 of their attempt at a periodic table, illustrating both the successes and otherwise.
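
My reading of the prediction step is a ranking of candidate compounds by cosine similarity between each compound's embedding and a property word such as "thermoelectric". The sketch below shows that idea only, continuing from the `model` trained in the previous sketch and using a made-up candidate list; it is not their exact scoring pipeline.

```python
# Continuing from the `model` trained in the sketch above: rank candidate
# compound tokens by cosine similarity to a property word. The candidate
# list here is invented purely for illustration.
candidates = ["Bi2Te3", "NiFe", "IrMn"]
target = "thermoelectric"

scores = {
    c: float(model.wv.similarity(c, target))
    for c in candidates
    if c in model.wv
}
for compound, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{compound}\t{score:+.3f}")
```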

They found that skip-gram worked best, after trying continuous bag of words and GloVe. Do read the whole paper; the approach seems promising in spite of the problems. The paper is at https://doi.org/10.1038/s41586-019-1335-8

Categories: DataScience