Source allocation of per- and polyfluoroalkyl substances (PFAS) with supervised machine learning: Classification performance and the role of feature selection in an expanded dataset

By Tohren C G Kibbey, Rafal Jabrzemski, and Denis M O'Carroll
May 24, 2021
DOI: 10.1016/j.chemosphere.2021.130124

This work explores the use of supervised machine learning as a tool for identifying the source of per- and polyfluoroalkyl substances (PFAS) in water samples on the basis of detected component concentrations. Specifically, the work focuses on distinguishing PFAS used in aqueous film-forming foam (AFFF) fire suppression applications from PFAS from other sources. Because many sites contaminated with legacy PFOS-based AFFF formulations are dominated by perfluorinated sulfonates, it can be tempting to naïvely classify any sample dominated by perfluorinated sulfonates as being of AFFF origin. However, a large fraction of samples do not follow this pattern, including some of the most important cases, such as legacy PFOS-based AFFF far from its source. Although PFAS composition can vary substantially at a site as a result of mobility differences between components and other factors, the hypothesis driving the work is that compositional patterns created in the environment can be recognized across different sites by machine learning and used for source allocation. This work builds on earlier preliminary work by the authors that used a small dataset. The present study is based on a much larger 8040-sample dataset and explores different preprocessing approaches, as well as how feature selection affects classification performance. The results strongly support the idea that supervised machine learning based on composition can identify patterns that distinguish PFAS sources. The results also provide new insights into the selection of classifiers and features for source identification based on PFAS sample composition.
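
The general idea of composition-based source classification can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not name the classifiers, preprocessing steps, or features used, so a simple nearest-centroid classifier stands in here, the preprocessing is assumed to be normalization of concentrations to fractions of the total, and the component list and all concentration values are made-up toy numbers for illustration only.

```python
import math

# Hypothetical component list (an assumption; the study's features are not given here).
COMPONENTS = ["PFBS", "PFHxS", "PFOS", "PFHxA", "PFOA"]

def to_fractions(concs):
    """Assumed preprocessing: scale concentrations so the components sum to 1."""
    total = sum(concs)
    if total == 0:
        return [0.0] * len(concs)
    return [c / total for c in concs]

def fit(samples, labels):
    """Compute one mean composition (centroid) per source label."""
    by_label = {}
    for concs, label in zip(samples, labels):
        by_label.setdefault(label, []).append(to_fractions(concs))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }

def predict(model, concs):
    """Assign the label whose centroid is closest in Euclidean distance."""
    x = to_fractions(concs)
    return min(model, key=lambda label: math.dist(x, model[label]))

# Toy training data (invented values, same order as COMPONENTS).
train_samples = [
    [5, 40, 120, 10, 8],
    [8, 60, 90, 15, 12],
    [2, 3, 10, 30, 60],
    [1, 5, 8, 45, 50],
]
train_labels = ["AFFF", "AFFF", "other", "other"]

model = fit(train_samples, train_labels)
sulfonate_rich = predict(model, [10, 80, 200, 20, 15])
carboxylate_rich = predict(model, [1, 2, 5, 40, 55])
```

Because the classifier works on fractions rather than raw concentrations, a dilute sample and a concentrated sample with the same relative composition receive the same label. Feature selection, in this framing, would amount to choosing which entries of `COMPONENTS` are passed to `fit` and `predict`.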

