Abstract

Large archives and digital sky surveys with dimensions of bytes currently exist, while in the near future they will reach sizes of the order of . Numerical simulations are also producing comparable volumes of information. Data mining tools are needed for information extraction from such large datasets. In this work, we propose a multidimensional indexing method, based on a static R-tree data structure, to efficiently query and mine large astrophysical datasets. We follow a top-down construction method, called VAMSplit, which recursively splits the dataset on a near median element along the dimension with maximum variance. The obtained index partitions the dataset into nonoverlapping bounding boxes, with volumes proportional to the local data density. Finally, we show an application of this method for the detection of point sources from a gamma-ray photon list.
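The recursive split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `vam_split`, `bounding_box`, and `leaf_capacity` are hypothetical, the split is placed at the exact median index rather than at a tuned "near median" position, and only the leaf level of the tree is built.

```python
from statistics import pvariance


def bounding_box(points):
    """Smallest axis-aligned box (lower corner, upper corner) enclosing the points."""
    dims = range(len(points[0]))
    lower = tuple(min(p[d] for p in points) for d in dims)
    upper = tuple(max(p[d] for p in points) for d in dims)
    return lower, upper


def vam_split(points, leaf_capacity=4):
    """Recursively split a non-empty point list; return a list of (bbox, points) leaves.

    Assumption: the split index is the plain median, and recursion stops once
    a subset fits in one leaf of `leaf_capacity` points (a hypothetical parameter).
    """
    if len(points) <= leaf_capacity:
        return [(bounding_box(points), points)]
    dims = range(len(points[0]))
    # Choose the dimension along which the data have maximum variance.
    split_dim = max(dims, key=lambda d: pvariance([p[d] for p in points]))
    # Sort along that dimension and cut at the median element.
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2
    return (vam_split(ordered[:mid], leaf_capacity)
            + vam_split(ordered[mid:], leaf_capacity))


# Two well-separated clumps along x end up in two separate boxes.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 1), (1, 1), (10, 1), (11, 1)]
leaves = vam_split(points, leaf_capacity=4)
```

Because each leaf holds roughly the same number of points, boxes in dense regions come out small and boxes in sparse regions come out large, which is the density-dependent behavior the abstract refers to.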

Highlights

  • At present, several projects for the multiwavelength observation of the universe are underway, for example, Sloan Digital Sky Survey (SDSS), GALEX, POSS2, DENIS, and so forth [1]

  • We propose a point source detection algorithm based on kernel methods [15], and in particular on the one-class support vector machines (SVMs) [16]

  • The one-class SVM algorithm estimates the support of a multidimensional distribution, that is, a binary function such that most of the data will live in the region where the function is nonzero
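For reference, the standard one-class SVM formulation can be written as below. This is the textbook form of the method and is given here as an assumption about the variant meant; the symbols $\nu$, $\rho$, $\xi_i$, $\Phi$, and $\ell$ follow common usage and are not taken from this page.

```latex
\min_{w,\,\xi,\,\rho}\;\; \frac{1}{2}\lVert w\rVert^{2}
  + \frac{1}{\nu \ell}\sum_{i=1}^{\ell}\xi_i - \rho
\quad\text{subject to}\quad
  w\cdot\Phi(x_i) \ge \rho - \xi_i,\qquad \xi_i \ge 0,

f(x) = \operatorname{sgn}\bigl(w\cdot\Phi(x) - \rho\bigr)
```

Here $\ell$ is the number of training points, $\Phi$ is the feature map induced by the kernel, and $\nu \in (0, 1]$ bounds the fraction of points allowed to fall outside the estimated support.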


Summary

INTRODUCTION

Several projects for the multiwavelength observation of the universe are underway, for example, SDSS, GALEX, POSS2, DENIS, and so forth [1]. Typical queries required by this kind of analysis are the following: (i) point queries, to find all objects overlapping the query point; (ii) range queries, to find all objects having at least one point in common with a query window; and (iii) nearest-neighbor queries, to find all objects at minimum distance from the query object. Another important operation is the spatial join, which in the astrophysical field is needed to search multiple source catalogs and cross-identify sources from different wavebands. These multidimensional (spatial) data tend to be large (sky maps can reach sizes of terabytes), so they must be held on secondary storage, and there is no total ordering on spatial objects that preserves spatial proximity [4]. This characteristic makes it difficult to use traditional indexing methods, such as B+-trees or linear hashing.
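To make the three query types concrete, the sketch below defines them over a flat list of axis-aligned bounding boxes, each stored as a `(lower, upper)` pair of coordinate tuples. It is a brute-force illustration of the query semantics only, not an indexed search: all function names are hypothetical, and a tree index would prune most boxes instead of scanning them all.

```python
import math


def box_contains(box, point):
    """True if the point lies inside the box (boundaries included)."""
    lower, upper = box
    return all(lo <= x <= hi for lo, x, hi in zip(lower, point, upper))


def boxes_intersect(a, b):
    """True if boxes a and b share at least one point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a_lo, a_hi, b_lo, b_hi))


def min_distance(box, point):
    """Euclidean distance from the point to the nearest point of the box."""
    lower, upper = box
    return math.sqrt(sum(max(lo - x, 0, x - hi) ** 2
                         for lo, x, hi in zip(lower, point, upper)))


def point_query(boxes, point):
    """(i) All objects overlapping the query point."""
    return [b for b in boxes if box_contains(b, point)]


def range_query(boxes, window):
    """(ii) All objects sharing at least one point with the query window."""
    return [b for b in boxes if boxes_intersect(b, window)]


def nearest_neighbor(boxes, point):
    """(iii) The object at minimum distance from the query point."""
    return min(boxes, key=lambda b: min_distance(b, point))


boxes = [((0, 0), (1, 1)), ((10, 0), (11, 1))]
hits = point_query(boxes, (0.5, 0.5))
in_window = range_query(boxes, ((0.5, 0.5), (10.5, 2)))
closest = nearest_neighbor(boxes, (8, 0.5))
```

The same three predicates are what a hierarchical index evaluates at each node, descending only into children whose bounding boxes pass the test.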

AN OPTIMIZED R-TREE
Determination of the tree topology
The split strategy
TESTS ON A PHOTON DATASET
NEIGHBORHOOD AND “WEAK” ADJACENCY
A STRATEGY FOR THE DETECTION OF POINT SOURCES
One-class SVM
Scaling one-class method with the optimized R-tree
Tests on the anticenter region
CONCLUSIONS