Set Of Data Points Research Articles

A total of 17,817 compounds were collected from the Bradley open melting point data set, containing the elements C, H, O, N, F, S, Cl, Br, and I. An extended atom-based and bond-based group contribution descriptor was proposed to represent these compounds. It consists of a one-dimensional descriptor based on the molecular formula, a two-dimensional group contribution descriptor based on atoms and bonds, and a structural feature descriptor. Random forest (RF), partial least squares (PLS), and deep learning (DL) methods were used to build models for predicting melting points, and the models were evaluated by the correlation coefficient (R), mean absolute error (MAE), and root-mean-square error (RMSE). The best results were obtained with the random forest model: out-of-bag (OOB) cross-validation of the training set gave R = 0.8977, MAE = 29.57 °C, RMSE = 40.34 °C; prediction of the test set gave R = 0.8982, MAE = 29.68 °C, RMSE = 40.63 °C. These results are better than the corresponding results reported in the literature for a subset of this data set. The established model was also used to predict an external data set of 74 compounds taken from another literature source, giving R = 0.8946, MAE = 24.51 °C, RMSE = 34.19 °C, which is significantly better than the corresponding results in that source. When the descriptor proposed in this study is combined with RDKit descriptors, which contain information such as charge and electronegativity, the results improve further: OOB cross-validation of the training set gives R = 0.9013, MAE = 29.25 °C, RMSE = 39.76 °C; the test set gives R = 0.9017, MAE = 29.34 °C, RMSE = 40.07 °C.
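The three evaluation metrics named in this abstract can be computed as in the minimal sketch below. It assumes R is the Pearson correlation between observed and predicted values, which the abstract does not state explicitly. The function name `regression_metrics` and the sample numbers are hypothetical and are not taken from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R, MAE, RMSE) for paired observed and predicted values.

    R is taken to be the Pearson correlation coefficient (an assumption).
    """
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    # Pearson correlation: covariance over the product of the spreads.
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("R is undefined when either series is constant")
    r = cov / math.sqrt(var_t * var_p)

    # Mean absolute error and root-mean-square error of the residuals.
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return r, mae, rmse


# Illustrative melting points in °C (made-up values, not from the data set).
observed = [10.0, 50.0, 120.0, 200.0]
predicted = [20.0, 40.0, 130.0, 190.0]
r, mae, rmse = regression_metrics(observed, predicted)
print(f"R = {r:.4f}, MAE = {mae:.2f} °C, RMSE = {rmse:.2f} °C")
```

Note that R is dimensionless, while MAE and RMSE carry the unit of the predicted quantity, here °C.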

The problem of partitioning large datasets of flat points is considered. Known as the centroid-based clustering problem, it is mainly addressed by the k-means algorithm and its modifications. Because k-means performs worse on large datasets, particularly when the dataset shape is stretched, the goal is to study whether centroid-based clustering can be improved for such cases. On non-sparse datasets it is quite noticeable that the clusters produced by k-means resemble a honeycomb. This is natural for rectangular-shaped datasets because hexagonal cells make efficient use of space, so that the sum of the within-cluster squared Euclidean distances to the centroids approaches its minimum. Therefore, it is suggested to apply successively a lattice of rectangular clusters and a lattice of hexagonal clusters, consisting of stretched rectangles and regular hexagons, respectively. The initial centroids are then calculated by averaging the points within the respective hexagons, and these centroids are used as initial seeds to start the k-means algorithm. This ensures faster and more accurate convergence, with an expected speedup of at least 1.7 to 2.1 times and an accuracy gain of 0.7 to 0.9 %. The lattice of rectangular clusters, applied first, gives a rather rough but effective partition that optionally allows further clustering to run on parallel processor cores. The lattice of hexagonal clusters, applied to every rectangle, yields initial centroids very quickly. Such centroids are far closer to the solution than the initial centroids in the k-means++ algorithm. Another approach to the k-means update, in which initial centroids are selected separately within the hexagons of each rectangle, can be used as well. It is faster than selecting initial centroids across all hexagons but is less accurate: the speedup is 9 to 11 times, with a possible accuracy loss of 0.3 %. However, this approach may outperform the k-means algorithm. The speedup increases as both lattices become denser and the dataset becomes larger, reaching 30 to 50 times.
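The idea of seeding k-means with lattice-cell averages can be illustrated with the simplified sketch below. It is not the authors' method: it swaps the two-stage rectangular and hexagonal lattices for a single rectangular grid, averages the points in each occupied cell, and passes those averages to plain Lloyd iterations. All names (`grid_seeds`, `kmeans`, `sse`), the grid size, and the synthetic data are assumptions made for illustration.

```python
import random


def grid_seeds(points, nx, ny):
    """Average the points in each occupied cell of an nx-by-ny grid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    # Cell widths; fall back to 1.0 if the data have zero extent on an axis.
    wx = (max(xs) - x0) / nx or 1.0
    wy = (max(ys) - y0) / ny or 1.0
    cells = {}
    for x, y in points:
        i = min(int((x - x0) / wx), nx - 1)  # clamp the upper boundary
        j = min(int((y - y0) / wy), ny - 1)
        sx, sy, n = cells.get((i, j), (0.0, 0.0, 0))
        cells[(i, j)] = (sx + x, sy + y, n + 1)
    return [(sx / n, sy / n) for sx, sy, n in cells.values()]


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean)."""
    x, y = point
    return min(range(len(centroids)),
               key=lambda c: (x - centroids[c][0]) ** 2
               + (y - centroids[c][1]) ** 2)


def kmeans(points, centroids, max_iter=50):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        sums = [[0.0, 0.0, 0] for _ in centroids]
        for x, y in points:
            k = nearest((x, y), centroids)
            sums[k][0] += x
            sums[k][1] += y
            sums[k][2] += 1
        # An empty cluster keeps its previous centroid.
        new = [(s[0] / s[2], s[1] / s[2]) if s[2] else centroids[c]
               for c, s in enumerate(sums)]
        if new == centroids:
            break
        centroids = new
    return centroids


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    total = 0.0
    for x, y in points:
        cx, cy = centroids[nearest((x, y), centroids)]
        total += (x - cx) ** 2 + (y - cy) ** 2
    return total


# Synthetic stretched dataset: uniform points on a 4-by-1 rectangle.
rng = random.Random(0)
points = [(rng.uniform(0.0, 4.0), rng.uniform(0.0, 1.0)) for _ in range(600)]

seeds = grid_seeds(points, nx=4, ny=2)  # one seed per occupied cell
final = kmeans(points, seeds)
print(len(seeds), sse(points, seeds), sse(points, final))
```

Because the seeds already lie near the centres of evenly populated cells, the Lloyd iterations only need to refine them, which is the intuition behind the speedup described in the abstract. The sketch makes no claim about reproducing the reported speedup or accuracy figures.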

Articles published on Set Of Data Points

Extended atom-based and bond-based group contribution descriptor and its application to melting point prediction of energetic compounds

Improving The Accuracy Of Digital Elevation Model Using Hopfield Neural Network With Additional Elevation Point Dataset

Precision of sinewave amplitude estimation in the presence of additive noise and quantization error

Development of Machine Learning Algorithms for Riverside Land Cover Classification Using Synthetic Aperture Radar Satellite Imagery and Terrain Data

Pose estimation for swimmers in video surveillance

Defining Clusters by Topology Warping Features, an Interpretable Data Clustering Method

Enhancement of inverse-distance-weighting 2D interpolation using accelerated decline

Λ hypernuclear potentials beyond linear density dependence

Smoothed separable nonnegative matrix factorization

A global time series dataset to facilitate forest greenhouse gas reporting

Research on Key Point Detection Method of Mechanical Arm

Betti Number for Point Sets

Uncertain characterization of reservoir fluids due to brittleness of equation of state regression

IDENTIFIABLE BOUNDED COMPONENT ANALYSIS VIA MINIMUM VOLUME ENCLOSING PARALLELOTOPE.

Sustainability Reporting: A Financial Reporting Perspective

Speedup of the k-Means Algorithm for Partitioning Large Datasets of Flat Points by a Preliminary Partition and Selecting Initial Centroids

Automatic Schelling Point Detection From Meshes.

Qualitative inverse problems: mapping data to the features of trajectories and parameter values of an ODE model

A [formula omitted] nearest neighbour ensemble via extended neighbourhood rule and feature subsets

Building Machine Learning Small Molecule Melting Points and Solubility Models Using CCDC Melting Points Dataset.
