Pikeperch, perch and bream are among the most traded and valued fish species in North-Eastern Europe. It is therefore necessary to distinguish fish from different lakes and coastal sea regions, both to ensure good traceability of products in the fish market and to protect consumers and fish stocks. Untargeted metabolomics using nuclear magnetic resonance (NMR) spectroscopy is a suitable tool for this purpose. It is an established method for determining various properties of biological systems, such as health, origin and type. Statistical methods including principal component analysis (PCA) and linear discriminant analysis (LDA) are typically applied to NMR data to correlate spectra with a particular research question. Herein we examine fish from three closely related water bodies and demonstrate that traditional statistical analysis (PCA and LDA) of fish NMR spectra cannot reliably determine the water body from which a particular fish originates. In contrast, determining the fish species is possible. We proceed to show that machine learning methods perform better and that a combination of statistical analysis (LDA) and random forest (RF), a supervised machine learning technique, allows reliable determination of the originating water body while also being tolerant to seasonal variations. This is an improvement over prior work, which has dealt with more clearly distinguished origins of fish. Exceptional accuracy was achieved in assigning fish to their origin, even in a scenario where two of the water bodies are connected by a river through which the fish are known to migrate. Because determining the origin of fish is important for environmental protection, we recommend developing this approach further as the basis of a robust tool for environmental and other monitoring purposes.
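As a rough illustration of the kind of workflow described above, the following sketch chains LDA (as a supervised dimensionality reduction step) with a random forest classifier using scikit-learn. It is not the authors' pipeline: the data are synthetic stand-ins for binned NMR spectra, and the number of bins, the sample counts, the noise level and all hyperparameters are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic placeholder data (NOT real spectra): three hypothetical water
# bodies, 60 fish each, 50 spectral bins per fish.
rng = np.random.default_rng(0)
n_per_class, n_bins = 60, 50
centers = rng.normal(0.0, 1.0, size=(3, n_bins))
X = np.vstack(
    [c + rng.normal(0.0, 2.0, size=(n_per_class, n_bins)) for c in centers]
)
y = np.repeat([0, 1, 2], n_per_class)  # label = water body of origin

# LDA projects the spectra onto at most (n_classes - 1) = 2 discriminant axes;
# the random forest then classifies in that reduced space.
model = make_pipeline(
    LinearDiscriminantAnalysis(n_components=2),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

# Stratified 5-fold cross-validation; LDA is refit inside each fold, so the
# held-out fish never influence the projection.
scores = cross_val_score(model, X, y, cv=5)
print("mean cross-validated accuracy:", scores.mean())
```

Wrapping both steps in one pipeline keeps the LDA projection inside the cross-validation loop, which avoids leaking label information from the test folds into the reported accuracy.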