Applying unsupervised sample clustering to improve the quality indicators of multilevel data processing models
The problem of improving the quality indicators of data processing models by segmenting data samples is considered. A multi-level data processing architecture is proposed that determines the current properties of the data in each segment and assigns the best models according to the achieved quality indicators. A formal description of the architecture is given. The proposed solution is aimed at reducing the cost of retraining models when data properties are transformed. Experimental studies carried out on a number of data sets show an improvement in processing quality indicators. The model can be considered an improvement of ensemble methods for processing data samples.
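A minimal sketch of the segmentation-and-assignment idea described above (not the authors' implementation): the sample is clustered without supervision, and for every cluster the candidate classifier with the best validation accuracy is kept. The clusterer, the candidate models, and the metric are illustrative assumptions; `X` and `y` are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def fit_per_segment(X, y, n_segments=4):
    """Cluster the training data, then keep the best candidate model per cluster."""
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
    clusterer = KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit(X_tr)
    fallback = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # used for degenerate segments
    seg_tr, seg_va = clusterer.labels_, clusterer.predict(X_va)
    best = {}
    for seg in range(n_segments):
        tr, va = seg_tr == seg, seg_va == seg
        if np.unique(y_tr[tr]).size < 2 or not va.any():
            best[seg] = fallback          # too few classes or no validation points in this segment
            continue
        candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
        fitted = [m.fit(X_tr[tr], y_tr[tr]) for m in candidates]
        best[seg] = max(fitted, key=lambda m: accuracy_score(y_va[va], m.predict(X_va[va])))
    return clusterer, best

def predict_per_segment(clusterer, best, X):
    """Route every row to the model assigned to its segment."""
    segs = clusterer.predict(X)
    return np.array([best[s].predict(row.reshape(1, -1))[0] for s, row in zip(segs, X)])
```

The fallback model keeps the sketch usable when a cluster receives too few training points; how degenerate segments are handled in the actual architecture is not specified here.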
- Research Article
7
- 10.28991/esj-2024-08-01-025
- Feb 1, 2024
- Emerging Science Journal
This paper presents a solution for building and implementing data processing models and experimentally evaluates new possibilities for improving ensemble methods based on multilevel data processing models. This study proposes a model to reduce the cost of retraining models when data properties are transformed. The research objective is to improve the quality indicators of machine learning models when solving classification problems. The novelty is a method that uses a multilevel architecture of data processing models to determine the current data properties in segments at different levels and assign the algorithms with the best quality indicators. This method differs from known ones by using several model levels that analyze data properties and assign the best models to individual data segments during training. The improvement consists of using unsupervised clustering of data samples: the resulting clusters are separate subsamples to which the best machine-learning models and algorithms are assigned. Experimental values of quality indicators for different classifiers were obtained on the whole sample and on individual segments. The findings show that unsupervised clustering within multilevel models can significantly improve the quality indicators of "weak" classifiers. The quality indicators of individual classifiers improve as the number of data clusters is increased up to a certain threshold. The results are applicable to classification when developing machine learning models and methods. The proposed method improved classification quality indicators by 2–9% due to segmentation and the assignment of models with the best quality indicators in individual segments.
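A hypothetical sweep over the number of clusters k, illustrating the reported effect that per-segment quality grows up to a threshold. The dataset, the "weak" GaussianNB classifier, and the macro F1 metric are placeholders, not the paper's experimental setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def f1_for_k(X, y, k):
    """Macro F1 of a per-cluster GaussianNB ensemble for a given cluster count k."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
    seg_tr, seg_te = km.labels_, km.predict(X_te)
    y_hat = np.empty_like(y_te)
    for seg in range(k):
        tr, te = seg_tr == seg, seg_te == seg
        if not te.any():
            continue
        if np.unique(y_tr[tr]).size < 2:
            tr = np.ones_like(tr, dtype=bool)   # fall back to the full training set for degenerate segments
        y_hat[te] = GaussianNB().fit(X_tr[tr], y_tr[tr]).predict(X_te[te])
    return f1_score(y_te, y_hat, average="macro")

# scores = {k: f1_for_k(X, y, k) for k in range(1, 11)}  # quality typically plateaus past some threshold k
```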
- Research Article
3
- 10.1097/eja.0000000000001469
- Mar 1, 2021
- European Journal of Anaesthesiology
The effect of different methods for data sampling and data processing on the results of comparative statistical analyses in method comparison studies of continuous arterial blood pressure (AP) monitoring systems remains unknown. We sought to investigate the effect of different methods for data sampling and data processing on the results of statistical analyses in method comparison studies of continuous AP monitoring systems. Prospective observational study. University Medical Center Hamburg-Eppendorf, Hamburg, Germany, from April to October 2019. 49 patients scheduled for neurosurgery with AP measurement using a radial artery catheter. We assessed the agreement between continuous noninvasive finger cuff-derived (CNAP Monitor 500; CNSystems Medizintechnik, Graz, Austria) and invasive AP measurements in a prospective method comparison study in patients having neurosurgery using all beat-to-beat AP measurements (Method_all), 10-s averages (Method_avg), one 30-min period of 10-s averages (Method_30), Method_30 with additional offset subtraction (Method_30off), and 10 30-s periods without (Method_iso) or with (Method_iso-zero) application of the zero zone. The agreement was analysed using Bland-Altman and error grid analysis. For mean AP, the mean of the differences (95% limits of agreement) was 9.0 (-12.9 to 30.9) mmHg for Method_all, 9.2 (-12.5 to 30.9) mmHg for Method_avg, 6.5 (-9.3 to 22.2) mmHg for Method_30, 0.5 (-9.5 to 10.5) mmHg for Method_30off, 4.9 (-6.0 to 15.7) mmHg for Method_iso, and 3.4 (-5.9 to 12.7) mmHg for Method_iso-zero. Similar trends were found for systolic and diastolic AP. Results of error grid analysis were also influenced by using different methods for data sampling and data processing. Data sampling and data processing substantially impact the results of comparative statistics in method comparison studies of continuous AP monitoring systems. Depending on the method used for data sampling and data processing, the performance of an AP test method may be considered clinically acceptable or unacceptable.
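A minimal Bland-Altman sketch (not the study's analysis code) showing how bias and 95% limits of agreement can be computed for beat-to-beat pairs versus 10-s averaged pairs. Here `ref` and `test` are assumed to be synchronized 1 Hz mean-AP series in mmHg; the averaging window is an assumption.

```python
import numpy as np

def bland_altman(ref, test):
    """Return bias, lower and upper 95% limits of agreement for paired measurements."""
    diff = np.asarray(test, dtype=float) - np.asarray(ref, dtype=float)
    bias = diff.mean()
    half_width = 1.96 * diff.std(ddof=1)
    return bias, bias - half_width, bias + half_width

def block_average(x, window=10):
    """Non-overlapping block means, e.g. 10-s averages of a 1 Hz series."""
    x = np.asarray(x, dtype=float)
    n = len(x) // window * window
    return x[:n].reshape(-1, window).mean(axis=1)

# beat_to_beat = bland_altman(ref, test)                                   # analogue of Method_all
# averaged    = bland_altman(block_average(ref), block_average(test))      # analogue of Method_avg
```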
- Conference Article
2
- 10.1109/icpects56089.2022.10047766
- Dec 8, 2022
Big data processing and analysis rely on shared-nothing computer clusters. Cluster computing relies heavily on data partitioning and sampling to boost the speed and scalability of large data computations. In this study, we provide a thorough review of sampling and data partitioning approaches applicable to big data processing and analysis. We first cover the fundamentals of data partitioning, including the differences between range partitioning, hash partitioning, and random partitioning. We then discuss standard data sampling techniques such as stratified sampling, reservoir sampling, and simple random sampling, as well as approaches suitable for clusters. Our proposal is to take both data partitioning and sampling into consideration simultaneously while processing large data sets in parallel environments.
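A toy illustration (not taken from the paper) of the three horizontal partitioning schemes named above; the record keys, partition counts, and boundary values are arbitrary assumptions.

```python
import hashlib
import random
import numpy as np

def range_partition(keys, boundaries):
    """boundaries: sorted upper bounds; each key goes to the first range it fits into."""
    return np.searchsorted(boundaries, np.asarray(keys), side="left")

def hash_partition(keys, n_parts):
    """Deterministic assignment by hashing the key modulo the number of partitions."""
    return [int(hashlib.md5(str(k).encode()).hexdigest(), 16) % n_parts for k in keys]

def random_partition(n_records, n_parts, seed=0):
    """Uniformly random assignment of records to partitions."""
    rng = random.Random(seed)
    return [rng.randrange(n_parts) for _ in range(n_records)]
```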
- Research Article
7
- 10.28991/esj-2023-07-03-03
- May 3, 2023
- Emerging Science Journal
This research aims to improve quality indicators in solving classification and regression problems based on the adaptive selection of various machine learning models on separate data samples from local segments. The proposed method combines different models and machine learning algorithms on individual subsamples in regression and classification problems by calculating quality indicators and selecting the best models on local sample segments. Detecting changes in the data and in time sequences makes it possible to form samples in which the data have different properties (for example, variance, sample fraction, data span, and others). Data segmentation is used to search for trend change points in a time series and to provide analytical information. The experiments were performed on real data samples and yielded experimental values of the loss function for various classifiers on individual segments and on the entire sample. In terms of practical novelty, the obtained results can be used to increase quality indicators in classification and regression problem solutions while developing models and machine learning methods. The proposed method makes it possible to increase classification quality indicators (F-measure, Accuracy, AUC) and forecasting quality (RMSE) by 1%–8% on average due to segmentation and the assignment of models with the best performance in individual segments.
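One possible change-point sketch for forming such segments: a simple mean-shift detector that flags points where the means of adjacent windows diverge. It is a generic stand-in, not the detection algorithm used in the paper, and the window and threshold values are assumptions.

```python
import numpy as np

def change_points(series, window=50, threshold=2.0):
    """Flag indices where the mean of the next window differs from the previous one
    by more than `threshold` pooled standard deviations."""
    x = np.asarray(series, dtype=float)
    points = []
    for t in range(window, len(x) - window):
        left, right = x[t - window:t], x[t:t + window]
        pooled_std = np.std(np.concatenate([left, right])) + 1e-12
        if abs(right.mean() - left.mean()) / pooled_std > threshold:
            points.append(t)   # nearby detections can be merged in post-processing
    return points

# The detected boundaries delimit subsamples with different properties
# (variance, span, class balance) on which separate models can be selected.
```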
- Research Article
1
- 10.11591/ijeecs.v29.i3.pp1466-1472
- Mar 1, 2023
- Indonesian Journal of Electrical Engineering and Computer Science
The identification of abnormal situations in information and telecommunication systems is considered, based on the analysis of statistical information from network traffic packets. A method for identifying anomalous situations based on data sample segmentation is proposed. The method is aimed at using classification algorithms that have the best quality indicators on individual data segments. The proposed method will be useful for monitoring information security systems. The method registers factors that affect changes in the properties of target variables. Impact detection makes it possible to form data samples depending on current and expected situations. Using the NSL-KDD dataset as an example, the data set was divided into subsets, taking into account the influence of the factors on the value ranges. The processing of factors is shown using the change point detection function on the time series. With its use, the data sample was divided into a finite number of non-intersecting measurable subsets. The results of Accuracy, Precision, F-Measure, and Recall for various classifiers are shown. The proposed method increases classification quality indicators under the continuously changing operating conditions of telecommunication systems.
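A hedged sketch of the per-segment evaluation reported above; the segment labels, the binary attack-vs-normal target, and the fitted models are assumptions about an NSL-KDD preprocessing step, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def segment_report(models, X_test, y_test, segment_ids):
    """Compute Accuracy, Precision, Recall and F-measure for each fitted model on each segment.
    `models` maps names to already-fitted classifiers; `segment_ids` is a NumPy array of segment labels."""
    segment_ids = np.asarray(segment_ids)
    rows = []
    for seg in sorted(set(segment_ids.tolist())):
        mask = segment_ids == seg
        for name, model in models.items():
            y_hat = model.predict(X_test[mask])
            rows.append({
                "segment": seg, "model": name,
                "accuracy":  accuracy_score(y_test[mask], y_hat),
                "precision": precision_score(y_test[mask], y_hat, zero_division=0),
                "recall":    recall_score(y_test[mask], y_hat, zero_division=0),
                "f_measure": f1_score(y_test[mask], y_hat, zero_division=0),
            })
    return rows  # per segment, keep the model with the best F-measure
```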
- Research Article
- 10.15622/ia.22.3.1
- May 22, 2023
- Информатика и автоматизация
There is a constant need to create methods for improving the quality indicators of information processing. In most practical cases, the ranges of target variables and predictors are formed under the influence of external and internal factors. Phenomena such as concept drift cause a model to lose its completeness and accuracy over time. The purpose of the work is to improve the quality of data sample processing based on multi-level models for classification and regression problems. A two-level data processing architecture is proposed. At the lower level, incoming information flows and sequences are analyzed and the classification or regression tasks are solved. At the upper level, the samples are divided into segments, the current data properties in the subsamples are determined, and the most suitable lower-level models are assigned according to the achieved quality indicators. A formal description of the two-level architecture is given. To improve the quality indicators for solving classification and regression problems, preliminary processing of the data sample is carried out, the models' quality indicators are calculated, and the classifiers with the best results are determined. The proposed solution makes it possible to implement continuously learning data processing systems. It is aimed at reducing the time spent on retraining models when data properties are transformed. Experimental studies were carried out on several datasets. Numerical experiments have shown that the proposed solution improves processing quality indicators. The model can be considered an improvement of ensemble methods for processing information flows. Training a single classifier, rather than a group of complex classification models, reduces computational costs.
- Research Article
- 10.1158/1538-7445.am2021-2280
- Jul 1, 2021
- Cancer Research
The successful application of Next Generation Sequencing (NGS) to drug discovery requires systems to manage and document each step of the sequencing process, from sample receipt through data generation and data processing. We combined Benchling™, a solution for tracking NGS lab processes, with FONDA (Framework Of Next generation sequencing Data Analysis), an internally developed data processing platform, to support multiple types of NGS data generation and processing. Benchling combines a digital notebook and a laboratory information management system (LIMS). The system documents and automates steps in the NGS process including: sample registration, nucleic acid extraction, library construction, flow cell construction, sequencer sample sheet generation and BCL2FASTQ conversion. This enables wet lab scientists to easily retrieve an appropriate protocol for each sample and sequencing library type. We connected our sequencers to Benchling in order to monitor each sequencing run and to keep track of the quality of NGS data. In addition, it generates an "analysis-ready sample sheet" (containing project and study information, the location of FASTQ files, sample species and library type) and uploads it into designated S3 buckets for data processing. Benchling dashboards provide overviews of NGS sample preparation, data generation and quality control. In summary, Benchling interconnects the original sample, the labels, the barcodes, the cDNA/DNA, the library, and all the QC results. We process NGS data using pipelines implemented in FONDA on a dockerized Amazon Web Services cloud platform. Analyses can be configured automatically from information exported by Benchling or launched manually. After data processing is completed, output files such as gene expression counts or variant calls are deposited into project-specific folders, ready for secondary analysis. In the current FONDA version (as of Nov 2020), we have developed pipelines for single cell multi-omics (CITE-seq and single-cell immune profiling) and bulk RNA-seq. The modular design of FONDA facilitates the development, updating, and extension of pipelines to new sequencing technologies. In summary, Benchling and FONDA enable high quality sample and NGS data flows from the lab for target identification, understanding mechanism of action, patient stratification and biomarker discovery. Availability and implementation: FONDA is implemented in Java and released under the Apache License 2.0. FONDA can be downloaded from GitHub at https://github.com/epam/fonda. Citation Format: Chandra Sekhar Pedamallu, Joon Sang Lee, Shu Yan, Adalis Maisonet, Aleksandr Sidoruk, Tengui Chen, Yulia Kamyshova, Mariia Zueva, Mark Magid, Quan Wan, Jeffrey Thompson, Valerie Zebrouck, Immanuel Gadaczek, Mikhail Alperovich, Brian McNatt, Alexei Protopopov, Donald Jackson, Jack Pollard. A comprehensive sample tracking and data processing workflow for next generation sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2021; 2021 Apr 10-15 and May 17-21. Philadelphia (PA): AACR; Cancer Res 2021;81(13_Suppl):Abstract nr 2280.
- Research Article
- 10.1093/ehjci/jeae333.059
- Jan 29, 2025
- European Heart Journal - Cardiovascular Imaging
Background: Until now, no method has been available for the automatic extraction of indicators of image quality and view standardization from echocardiographic recordings. Thus, description of data quality has been indirect, often through descriptions of operator experience and the methodology used. Peak systolic global longitudinal strain (GLS) is a sensitive measure of left ventricular (LV) function, still hampered by random variability due to suboptimal standardization of recordings and variability in data processing by the operators. Data on how acquisitions and data processing influence GLS and its variability are scarce. Objectives: The aim was to study the importance of image acquisition and data processing by experienced operators for GLS in a large healthy population by automatically extracting quality indicators using novel deep learning based image analysis software and vendor-specific metadata. Secondly, we aimed to study how these characteristics influenced the reference ranges and GLS variability. Methods: Participants from a large echocardiographic study were included. Echocardiography was performed according to current recommendations. Two experienced operators and two expert cardiologists read and re-read all GLS recordings (1,412 paired analyses). Acquisition and data processing characteristics were extracted using deep learning and GLS software metadata, providing specific data on the position of landmarks in the view as well as the rotation and tilt of the cut-plane relative to the best possible standard. From these measurements several quality indicators were provided (Table 1). Results: Mean age of the 1,412 participants was 58±12 years and 56% were women. Averaged apical LV foreshortening in the recordings was ≤2 mm and the view-specific recordings were well standardized, with high alignment with the preferred cut-plane for tilt and rotation across the three apical views (Figure 1). Most quality indicators influenced GLS: absolute GLS was lower with longer LVs (-0.2% per 1 cm, p=0.003) and with longer distance from the transducer to the LV apex (-0.3% per 1 cm, p=0.02) in the recordings, and also with wider regions of interest during data processing (-0.7% per 1 mm, p<0.001). Absolute GLS was 0.7% higher per mm of systolic foreshortening of the apical region of interest, and higher when more knots were placed during initialization of the ROI (0.1% per knot, p=0.004). In test-retest analyses, the results for agreement between operators followed the data presented above. Conclusions: Novel quality indicators were successfully extracted using novel DL software and strain-specific metadata. Even though acquisitions and data processing were well standardized, several quality indicators from both steps influenced GLS. Deep learning software and data processing metadata may provide quality indicators of echocardiographic databases that are important for interpreting study outcomes and may motivate researchers to optimize echocardiographic acquisitions and analyses.
- Research Article
217
- 10.26599/bdma.2019.9020015
- Jun 1, 2020
- Big Data Mining and Analytics
Computer clusters with the shared-nothing architecture are the major computing platforms for big data processing and analysis. In cluster computing, data partitioning and sampling are two fundamental strategies to speed up the computation of big data and increase scalability. In this paper, we present a comprehensive survey of the methods and techniques of data partitioning and sampling with respect to big data processing and analysis. We start with an overview of the mainstream big data frameworks on Hadoop clusters. The basic methods of data partitioning are then discussed including three classical horizontal partitioning schemes: range, hash, and random partitioning. Data partitioning on Hadoop clusters is also discussed with a summary of new strategies for big data partitioning, including the new Random Sample Partition (RSP) distributed model. The classical methods of data sampling are then investigated, including simple random sampling, stratified sampling, and reservoir sampling. Two common methods of big data sampling on computing clusters are also discussed: record-level sampling and block-level sampling. Record-level sampling is not as efficient as block-level sampling on big distributed data. On the other hand, block-level sampling on data blocks generated with the classical data partitioning methods does not necessarily produce good representative samples for approximate computing of big data. In this survey, we also summarize the prevailing strategies and related work on sampling-based approximation on Hadoop clusters. We believe that data partitioning and sampling should be considered together to build approximate cluster computing frameworks that are reliable in both the computational and statistical respects.
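A toy contrast (not the survey's code) between record-level reservoir sampling and block-level sampling over pre-partitioned blocks; the stream contents and block layout are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of k records from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(stream):
        if i < k:
            sample.append(record)
        else:
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample

def block_level_sample(blocks, n_blocks, seed=0):
    """Pick whole blocks at random; representative only if blocks themselves are
    random partitions of the data (as in the RSP model), unlike range/hash partitions."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(blocks)), n_blocks)
    return [record for b in chosen for record in blocks[b]]
```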
- Research Article
12
- 10.1109/tits.2022.3155689
- Oct 1, 2022
- IEEE Transactions on Intelligent Transportation Systems
Accurately mapping raw global positioning system (GPS) trajectories to the road network is the basis for studying applications of trajectory data. This study proposes a novel off-line map matching algorithm based on road network topology to address the low execution efficiency and poor matching accuracy of the selective look-ahead map matching (SLAMM) algorithm. First, the noise points of the trajectory data are removed by data preprocessing. Second, the algorithm searches for critical samples in the trajectory data and segments the data accordingly. Then, the adjacent road segments around the transition node corresponding to each critical sample are selected as candidate arcs. Finally, the segmented trajectory data are matched to the road network by constructing an error ellipse. The algorithm fully considers the topology of the road network and the characteristics of high-frequency trajectory data. The experimental results, obtained by matching Beijing trajectory data to an actual road network, show that the proposed algorithm is more efficient and robust than other map matching algorithms for high-frequency trajectories.
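A highly simplified stand-in for the critical-sample segmentation step (the paper's actual criterion and the error-ellipse matching are not reproduced here): the trajectory is split wherever the heading changes by more than a threshold angle, which is an assumption made only for illustration.

```python
import math

def segment_trajectory(points, angle_deg=30.0):
    """points: list of (lon, lat) tuples; returns a list of (start, end) index ranges."""
    def heading(a, b):
        return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))
    cuts = [0]
    for i in range(1, len(points) - 1):
        turn = abs(heading(points[i - 1], points[i]) - heading(points[i], points[i + 1]))
        turn = min(turn, 360.0 - turn)          # account for angle wrap-around
        if turn > angle_deg:
            cuts.append(i)                      # treat this point as a candidate critical sample
    cuts.append(len(points) - 1)
    return list(zip(cuts[:-1], cuts[1:]))       # each pair delimits one trajectory segment
```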
- Research Article
12
- 10.3390/electronics9060925
- Jun 2, 2020
- Electronics
Research on electroencephalography (EEG) signals and their data analysis has drawn much attention in recent years. Data mining techniques have been extensively applied as efficient solutions for non-invasive brain–computer interface (BCI) research. Previous research has indicated that human brains produce recognizable EEG signals associated with specific activities. This paper proposes an optimized data sampling model to identify the status of the human brain and further discover brain activity patterns. The sampling methods used in the proposed model include the segmented EEG graph using piecewise linear approximation (SEGPA) method, which incorporates optimized data sampling methods, and an EEG-based weighted network for EEG data analysis, which can be used for machinery control. The data sampling and segmentation techniques combine normal distribution approximation (NDA), Poisson distribution approximation (PDA), and related sampling methods. This research also proposes an efficient method for recognizing human thinking and brain signals with entropy-based frequent patterns (FPs). The obtained recognition system provides a foundation that could be useful in machinery or robot control. The experimental results indicate that the NDA–PDA segments, with less than 10% of the original data size, can achieve 98% accuracy compared with the original data sets. The FP method identifies more than 12 common patterns for EEG data analysis based on the optimized sampling methods.
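A generic piecewise-linear approximation sketch (not the SEGPA method itself): each segment is extended greedily until the maximum deviation from the straight line between its endpoints exceeds an assumed error bound.

```python
import numpy as np

def pla_segments(signal, max_error=5.0):
    """Greedy piecewise-linear segmentation: returns (start, end) index pairs whose
    interiors are approximated by the line joining signal[start] and signal[end]."""
    x = np.asarray(signal, dtype=float)
    segments, start = [], 0
    for end in range(2, len(x) + 1):
        seg = x[start:end]
        line = np.linspace(seg[0], seg[-1], len(seg))
        if np.max(np.abs(seg - line)) > max_error:
            segments.append((start, end - 1))   # close the segment just before the violation
            start = end - 1
    segments.append((start, len(x) - 1))
    return segments
```

Downstream, each segment can be represented by its two endpoints, which is how a piecewise-linear scheme can reduce the data volume to a small fraction of the raw signal.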
- Research Article
4
- 10.31799/1684-8853-2021-3-29-38
- Jun 29, 2021
- Информационно-управляющие системы
Introduction: The application of machine learning methods involves the collection and processing of data coming from recording elements in offline mode. Most models are trained on historical data and then used for forecasting, classification, the search for influencing factors or impacts, and state analysis. In the long run, the data value ranges can change, affecting the quality of the classification algorithms and leading to situations where models must be constantly retrained or readjusted to the input data. Purpose: Development of a technique to improve the quality of machine learning algorithms in a dynamically changing, non-stationary environment where the data distribution can change over time. Methods: Splitting (segmentation) of a data set based on information about the factors affecting the ranges of target variables. Results: A data segmentation technique has been proposed, based on taking into account the factors that affect changes in the data value ranges. Impact detection makes it possible to form samples based on the current and expected situations. Using the PowerSupply dataset as an example, the data set is split into subsets considering the effects of factors on the value ranges. The external factors and impacts are formalized using production rules. The processing of the factors using a membership (indicator) function is shown. The data sample is divided into a finite number of non-intersecting measurable subsets. Experimental values of the neural network loss function are shown for the proposed technique on the selected dataset. Quality indicators (Accuracy, AUC, F-measure) of classification for various classifiers are presented. Practical relevance: The results can be used in the development of classification models and machine learning methods. The proposed technique can improve classification quality under dynamically changing operating conditions.
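A sketch of the indicator-function split described above: disjoint, measurable subsets are defined by value ranges of an external factor. The single factor column and the boundary values are hypothetical and are not taken from the PowerSupply experiment.

```python
import numpy as np

def indicator(factor_values, low, high):
    """Membership (indicator) function of the half-open interval [low, high)."""
    f = np.asarray(factor_values, dtype=float)
    return (f >= low) & (f < high)

def split_by_factor(X, y, factor_values, boundaries):
    """boundaries such as [0, 8, 16, 24] define non-intersecting subsets covering the factor's range."""
    subsets = []
    for low, high in zip(boundaries[:-1], boundaries[1:]):
        mask = indicator(factor_values, low, high)
        subsets.append((X[mask], y[mask]))
    return subsets  # a separate classifier can then be trained or selected on each subset
```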
- Research Article
- 10.1255/tosf.149
- May 27, 2022
- TOS Forum
The sampling-intensive mining and environmental industries share a common need for representative data but differ in their motivations for accomplishing this objective. The desire to obtain representative sample data for "commodities" in the mining industry is driven by anticipated economic gain and the exploitation of natural resources. The desire to obtain representative sample data for "contaminants" in the environmental industry is driven by anticipated social gain and the protection of natural resources. In terms of obtaining reliably representative data, motivation driven by economic gain has thus far been the clear winner. Theory of Sampling concepts are well established and tested in the mining industry. The environmental industry, in contrast, has traditionally been plagued by scientifically unsound sampling practices and data that are not reliably representative of conditions in the field. This has significant implications for topics ranging from the efficient identification and remediation of contaminated industrial lands to the accurate assessment of risk to human health and the environment. This paper explores the nature and cause of this dichotomy and presents a methodical approach for applying Theory of Sampling concepts to environmental testing of soil, water and air. Much of the problem is tied to a general recognition of compositional and distributional heterogeneity in contaminated media, but a lack of awareness of methods to control it or of the magnitude of the potential error. As a result, published regulatory guidance has focused on classical sampling and statistical methods appropriate for testing of "finite element" media. A lone exception is the testing of indoor air, where concepts of "Decision Units" and sampling methods more appropriate for testing of "infinite element" media have long been employed to control and represent heterogeneity. The solutions are, in hindsight, relatively simple. Pushback from affected parties and even scientists and environmental agencies can be significant, however. This is primarily due to a lack of training of environmental professionals in the Theory of Sampling and the common absence of clear evidence of erroneous or misleading sample data in the field. Reluctance to change is also tied in some cases to implications regarding liability for past and ongoing projects. The need for more reliable, efficient and science-based methods to assess and address risk posed by environmental contamination is clear, however. Progress will be made by countries like China that are beginning to tackle legacies of early development and are able to learn from the successes as well as the mistakes of countries that have been addressing environmental contamination for several decades. Training of environmental workers, as well as pressure from liability-savvy responsible parties, attorneys and financial institutions, will continue to force the industry to evolve, to the benefit of the environment as well as stakeholders on all sides.
- Research Article
3
- 10.1007/s00779-014-0830-z
- Sep 25, 2014
- Personal and Ubiquitous Computing
Mobile and wearable sensors are increasingly permeating our lives, and information gathered from them can provide unprecedented insights into diverse aspects of human behaviour. Analysis of human behaviour is of special interest in health care, as there exists a dual relationship between behaviour and health. On one hand, our health is influenced by our behaviour, including physical activity levels, amount of social activity, and work–life balance amongst others, while on the other hand, symptoms of various disorders are manifested as behaviour changes. This is especially prominent for mental disorders [11]. Therefore, human behaviour understanding has significant value for health care, from the point of view of both maintaining good health and helping in the diagnosis of diseases. While the link between various aspects of behaviour and health has been explored in clinical settings, the use of technology to automatically measure behaviour is still in its infancy. Considering the enormous potential of automatic behaviour understanding in health care, this Theme Issue explores the link between automatic understanding of human behaviour and how it can inform the decisions of a range of stakeholders in the health ecosystem. Sensing modalities, data processing methods, and behaviour capturing techniques that facilitate this exploration received a particular focus in the contents of this Theme Issue. As such, the authors in [8] present an automated behaviour analysis system consisting of a sensor network set up in a home setting. Experiments performed showed how sensor readings can be used to automatically detect anomalous behaviour. This anomalous behaviour can be a sign of health changes in the user, and automatic detection could offer the possibility for intervention if required. In the same theme of detecting anomalous behaviour, the authors in [5] propose an activity recognition system based on the Markov logic network. The performance and use of the method in dementia care is demonstrated by applying it to a dataset recorded in a smart home environment. Results indicate that the hierarchical approach presented has higher recognition accuracy and a faster response time than existing approaches. As one of the first steps in detecting activities, segmentation of data is typically required. In this regard, the paper in [9] presents an approach that enables segmentation of continuous sensor data in real time. The proposed dynamic segmentation is based on a two-layer strategy of sensor correlation and time correlation manipulation. The methodology was validated utilising two different datasets recorded in smart home settings. Performance measurement of machine learning methods for understanding human behaviour was considered in [1]. The authors evaluated the performance of two machine learning methods on five real-world datasets.
- Research Article
8
- 10.1016/0022-2364(88)90277-6
- Jun 1, 1988
- Journal of Magnetic Resonance (1969)
Use of CLEAN in conjunction with selective data sampling for 2D NMR experiments