A data mining approach to evaluate suitability of dissolved oxygen sensor observations for lake metabolism analysis

Kohji Muraoka,Paul Hanson,Eibe Frank,Kenneth Chiu,Meilan Jiang,David Hamilton

doi:10.1002/lom3.10283

Abstract

Despite rapid growth in continuous monitoring of dissolved oxygen for lake metabolism studies, the current best practice still relies on visual assessment and manual data filtering of sensor observations by experienced scientists in order to achieve meaningful results. This time-consuming approach is fraught with potential for inconsistency and individual subjectivity. An automated method to assure the quality of data for the purpose of metabolism modeling is clearly needed to obtain consistent results representative of collective expertise. We used a hybrid approach of expert panel and data mining for data filtration. Symbolic Aggregate approXimation (SAX) treats discretized numerical time series segments as symbolic indications, creating a series of strings which are literally comparable to human words and sentences. This conversion allows established text mining techniques, such as classification methods, to be applied to time series data. Half‐hourly surface dissolved oxygen data from 18 global lakes were used to create day‐long segments of the original time series data. Three hundred sets of 1‐d measurements were provided to a group of seven anonymous experts, experienced in manual filtering of oxygen data for metabolism modeling studies. The collective results were treated as expert panel decisions, and were used to rank the data by confidence level for use in metabolism calculations. While considerable variation occurred in the way the experts perceived the quality of the data, the model provides an objective and quantitative assessment method. The program output will assist the decision making process in determining whether data should be used for metabolism calculations. An R version of the program is available for download.
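
To make the SAX conversion described in the abstract concrete, the following is a minimal Python sketch of the textbook SAX procedure: z-normalize a segment, average it into equal-width pieces, and map each average to a letter using breakpoints that split a standard normal distribution into equally probable bins. It is not the authors' R program; the function name `sax_word` and its parameters are hypothetical, and the sketch assumes the segment length is a multiple of the word length (for example, 48 half-hourly values in one day).

```python
import bisect
from statistics import NormalDist, mean, pstdev


def sax_word(values, word_length, alphabet_size):
    """Convert one numeric segment (e.g. one day of DO readings) to a SAX string.

    Hypothetical helper for illustration; assumes len(values) is a
    multiple of word_length and alphabet_size is between 2 and 26.
    """
    n = len(values)
    if n == 0 or n % word_length != 0:
        raise ValueError("segment length must be a non-zero multiple of word_length")
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")

    # Step 1: z-normalize so that the shape, not the absolute level, is encoded.
    mu, sigma = mean(values), pstdev(values)
    z = [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

    # Step 2: piecewise aggregate approximation (mean of each equal-width chunk).
    step = n // word_length
    paa = [mean(z[i * step:(i + 1) * step]) for i in range(word_length)]

    # Step 3: breakpoints that cut N(0, 1) into equally probable bins.
    cuts = [NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

    # Step 4: map each chunk mean to the letter of the bin it falls in.
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[bisect.bisect_left(cuts, p)] for p in paa)


if __name__ == "__main__":
    # A steadily rising 48-point segment, encoded with 6 letters over 6 chunks.
    print(sax_word(list(range(48)), word_length=6, alphabet_size=6))
```

Once each day is reduced to a short string like this, days can be compared, counted, and classified with ordinary text-mining tools.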

Highlights

Considering the substantial size of the parent data used, the parent data patterns in small Symbolic Aggregate approXimation (SAX) parameters are thought to include all idealized dissolved oxygen (DO) curves driven by biological activities, and those theoretical patterns that did not appear in the parent datasets are primarily “noisy.” This coverage decreases as the number of possible SAX strings increases.
The lowest coverage is found in SAX(6,6), where 4% of the available sequences appeared in the parent data.
The expert survey results tended to confirm that the removal of data was predominantly due to the influence of non-biological processes.
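
The coverage figures in the highlights can be read as the share of all theoretically possible SAX strings that actually occur in the parent data. The sketch below illustrates that arithmetic under this assumption; the function names are hypothetical, and the example words are made up purely to exercise the code, not taken from the study. With an alphabet of 6 symbols and a word length of 6, there are 6^6 = 46,656 possible strings.

```python
def possible_sax_strings(word_length, alphabet_size):
    """Number of distinct SAX strings for a given word length and alphabet size."""
    return alphabet_size ** word_length


def sax_coverage(observed_words, word_length, alphabet_size):
    """Fraction of possible SAX strings that appear at least once.

    Hypothetical helper: duplicates are counted once, and words of the
    wrong length are ignored.
    """
    distinct = {w for w in observed_words if len(w) == word_length}
    return len(distinct) / possible_sax_strings(word_length, alphabet_size)


if __name__ == "__main__":
    print(possible_sax_strings(6, 6))
    # Toy input: three observations, two distinct words, four possible strings.
    print(sax_coverage(["ab", "ab", "ba"], word_length=2, alphabet_size=2))
```

Because the number of possible strings grows exponentially with word length, a fixed set of observed days covers a shrinking share of the string space as the SAX parameters increase.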

Summary

Introduction

Metabolism models in lakes typically assume that a change in free-water dissolved oxygen (DO) through time is driven primarily by the balance between photosynthesis (or primary production) and mineralization of organic carbon (often called “respiration” for simplicity), as well as equilibration of DO with the atmosphere (Staehr et al. 2010). When these three processes are dominant, diel DO patterns will be nearly sinusoidal, with increases during daylight due to primary production exceeding respiration and decreases at night due to respiration. Increasing dimensionality (information), which is inherent in increased sampling frequency from sensors, decreases the performance of similarity- or distance-based discovery algorithms (e.g., it becomes more difficult to build a robust model; Aggarwal et al. 2001; Zimek et al. 2012). This can be circumvented by removing some data or compressing the amount of information processed (Cannata et al. 2011) or by representing data in a simpler form (Keogh et al. 2001). Techniques to accurately define “suitable data” have not been generalized, but any method needs to be robust and repeatable.
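
The three-process balance described at the start of this section is commonly written as a single mass-balance equation. The form below is a generic sketch rather than the specific model used in this study, and the symbols GPP (gross primary production), R (respiration), and F (net atmospheric exchange, taken as positive into the lake) are notation assumed here for illustration.

```latex
\frac{d\,\mathrm{DO}}{dt} = \mathrm{GPP} - R + F
```

At night GPP is zero, so the right-hand side reduces to F - R and DO typically declines, which gives the falling limb of the near-sinusoidal diel pattern.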

Objectives

Methods

Results

Discussion

Conclusion