Supervised machine learning for source allocation of per- and polyfluoroalkyl substances (PFAS) in environmental samples

By Tohren C.G. Kibbey, Rafal Jabrzemski, and Denis M. O’Carroll
March 31, 2020
DOI: 10.1016/j.chemosphere.2020.126593

Environmental contamination by per- and polyfluoroalkyl substances (PFAS) is widespread because of both their decades of use and their persistence in the environment. These factors can make it difficult to identify the source of contamination in a sample, because in many cases the contamination may date from decades ago or may originate from any of a number of candidate sources. Forensic source allocation is important for delineating plumes, and may also provide insights into the environmental behavior of specific PFAS components. This paper describes work conducted to explore the use of supervised machine learning classifiers for allocating the source of PFAS contamination based on patterns identified in component concentrations. A dataset containing PFAS component concentrations in 1197 environmental water samples was assembled from data from sites around the world. The dataset was split evenly into training and test datasets, and the 598-sample training dataset was used to train four machine learning classifiers: three conventional classifiers (Extra Trees, Support-Vector Machines, K-Neighbors) and one multilayer perceptron feedforward deep neural network. Of the methods tested, the deep neural network and Extra Trees exhibited particularly high performance at classifying samples from a range of sources. The fact that these methods function on completely different principles and yet provide similar predictions supports the hypothesis that patterns exist in PFAS water sample data that can allow forensic source allocation. The results support the idea that supervised machine learning may have substantial promise as a tool for forensic source allocation.
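The workflow summarized in the abstract (assemble labeled component concentrations, split them evenly into training and test sets, train a classifier, and score it on the held-out half) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two source fingerprints, the three-component profiles, the noise model, and the hand-written k-nearest-neighbors classifier are hypothetical and are not the paper's data, features, or implementation.

```python
import math
import random
from collections import Counter

# Hypothetical source fingerprints: relative fractions of three PFAS
# components. These values are invented for illustration only.
FINGERPRINTS = {
    "source_A": [0.6, 0.3, 0.1],
    "source_B": [0.1, 0.3, 0.6],
}


def to_fractions(conc):
    """Normalize concentrations to fractions of the total, removing dilution."""
    total = sum(conc)
    return [c / total for c in conc]


def make_split(n_per_source, seed=0):
    """Generate synthetic samples and split them evenly, per source, into train/test."""
    rng = random.Random(seed)
    train, test = [], []
    for label, profile in FINGERPRINTS.items():
        for i in range(n_per_source):
            # Total concentration varies over two orders of magnitude (dilution).
            total = 10 ** rng.uniform(1, 3)
            # Each component gets independent multiplicative noise.
            conc = [total * f * rng.uniform(0.8, 1.25) for f in profile]
            sample = (to_fractions(conc), label)
            # Alternate samples between the two halves.
            (train if i % 2 == 0 else test).append(sample)
    return train, test


def knn_predict(train, features, k=3):
    """Return the majority label among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], features))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


train, test = make_split(n_per_source=40)
correct = sum(knn_predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

Because the synthetic fingerprints are deliberately well separated, this toy problem is far easier than real environmental data, where sources overlap and samples may be mixtures. The sketch shows only the shape of the train/test procedure, not the performance to expect in practice.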
