Per Segment Plane Sweep Line Segment Intersection on the GPU

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Polygon overlay operations are used for various purposes, such as GIS searches and queries, VLSI, and the basic geometric operations of intersection, union, and difference. Recent research articles have presented algorithms that use the GPU to perform line segment intersection for geometric operations. We present two parallel algorithms implemented on the GPU that focus on the active-list portion of the traditional serial plane sweep algorithm. The first algorithm uses a single block of threads to simulate the active-list data structure in hardware; it is slow due to GPU thread-block size limitations and synchronization points but demonstrates favorable time complexity. The second algorithm uses dynamic parallelism to remove synchronization and scales to utilize the available hardware of a single GPU. We perform experiments on both synthetic and real-world datasets. The presented results show improved execution time and lower memory usage compared to recent algorithms, with speedups of up to 38.8× over the serial sweep line algorithm on real-world data.
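As a point of reference for the active-list idea the abstract builds on, here is a minimal serial sketch in Python. It is illustrative only, not the paper's GPU implementation: all names are hypothetical, and for brevity each newly activated segment is tested against every segment currently on the active list (a simplification of the Bentley-Ottmann neighbor-only checks), with only proper crossings reported.

```python
def segments_intersect(s1, s2):
    """Return True if two segments properly cross (orientation test)."""
    (ax, ay), (bx, by) = s1
    (cx, cy), (dx, dy) = s2

    def orient(px, py, qx, qy, rx, ry):
        v = (qx - px) * (ry - py) - (qy - py) * (rx - px)
        return (v > 0) - (v < 0)

    o1 = orient(ax, ay, bx, by, cx, cy)
    o2 = orient(ax, ay, bx, by, dx, dy)
    o3 = orient(cx, cy, dx, dy, ax, ay)
    o4 = orient(cx, cy, dx, dy, bx, by)
    return o1 != o2 and o3 != o4  # proper crossings only, for brevity

def sweep_intersections(segments):
    # Normalize each segment so its first endpoint is the leftmost one.
    segs = [tuple(sorted(s)) for s in segments]
    events = []  # (x, is_right_endpoint, segment_index)
    for i, (p, q) in enumerate(segs):
        events.append((p[0], 0, i))  # left endpoint: activate
        events.append((q[0], 1, i))  # right endpoint: deactivate
    events.sort()

    active, found = set(), []
    for _, is_right, i in events:
        if is_right:
            active.discard(i)
        else:
            # Test the newly activated segment against the active list.
            for j in active:
                if segments_intersect(segs[i], segs[j]):
                    found.append(tuple(sorted((i, j))))
            active.add(i)
    return found
```

The per-segment test against the active list is the part the paper maps onto GPU threads; in this serial sketch it simply runs as an inner loop.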

Similar Papers
  • Research Article
  • 10.1128/msystems.00408-25
Impact of study design, contamination, and data characteristics on results and interpretation of microbiome studies.
  • Aug 6, 2025
  • mSystems
  • Jose Agudelo + 1 more

Advances in high-throughput molecular techniques have enabled microbiome studies in low-biomass environments, which pose unique challenges due to contamination risks. While best-practice guidelines can reduce contamination by over 90%, the impact of residual contamination and data set variability on statistical outcomes remains understudied. Here, we quantitatively assessed how study design factors influence microbiome analyses using simulated and real-world data sets. Alpha diversity was affected by sample number and community dissimilarity, but not by the number of unique taxa. Beta diversity was influenced primarily by unique taxa and group dissimilarity, with a marginal effect of sample number. The number of differentially abundant taxa depended on the number of unique taxa but was also influenced by sample number, depending on the algorithm. Notably, contamination had a marginal impact on weighted beta diversity but altered the number of differentially abundant taxa when at least 10 contaminants were present, with a greater effect as contamination increased. Findings closely mirrored results from seven real-world low-biomass data sets. Overall, group dissimilarity and the number of unique taxa were the primary drivers of statistical outcomes. The DESeq2 algorithm outperformed ANCOM-BC when exposed to stochastically distributed contamination, but algorithms were equivocal under contamination weighted toward one group. In all cases, the rate of false positives in differential abundance analyses was <15%. Importantly, in both simulated and real-world data, contamination rarely impacts whether microbiome differences were detected but did affect the number of differentially abundant taxa. Thus, when validated protocols with internal negative controls are used, residual contamination minimally impacts statistical outcomes.
IMPORTANCE: Microbiome studies in low-biomass environments face challenges due to contamination. However, even after implementing strict contamination prevention, control, and analysis measures, the impact of residual contamination on the validity of statistical outcomes in such studies remains a topic of ongoing discussion. Our analyses reveal that key drivers of microbiome study outcomes are group dissimilarity and the number of unique taxa, while contamination has minimal impact on statistical outcomes, primarily limited to the number of differentially abundant taxa detected. A common approach to contamination control involves removing taxa based on published contaminant lists. However, our analysis shows that these lists are highly inconsistent across studies, limiting reliability. Instead, our results support the use of internal negative controls as the most robust means of identifying and mitigating contamination. Collectively, data show that low-biomass microbiome studies have reduced power to detect differences between groups. However, when differences are observed, they are unlikely to be contamination-driven. By prioritizing validated protocols that prevent, assess, and eliminate contaminants through the use of internal negative controls, researchers can minimize the impact of contamination and improve the reliability of results.

  • Conference Article
  • Cited by 1
  • 10.1109/ipdpsw.2015.104
On-demand Connection Management for OpenSHMEM and OpenSHMEM+MPI
  • May 1, 2015
  • Sourav Chakraborty + 4 more

Partitioned Global Address Space (PGAS) programming models like OpenSHMEM and hybrid models like OpenSHMEM+MPI can deliver high performance and improved programmability. However, current implementations of OpenSHMEM assume a fully-connected process model, which affects their performance and scalability. We address this critical issue by designing on-demand connection management support for OpenSHMEM, which significantly improves startup performance and reduces resource usage. We further enhance OpenSHMEM startup performance by utilizing non-blocking out-of-band communication APIs. We evaluate our designs using a set of microbenchmarks and applications and observe a 30-fold reduction in OpenSHMEM initialization time and an 8.3-fold improvement in the execution time of a Hello World application at 8,192 processes. In particular, when sufficient work can be overlapped, we show that the use of non-blocking out-of-band communication APIs allows for a constant initialization cost of OpenSHMEM jobs at different core counts. We also obtain up to a 90% reduction in the number of network endpoints and up to a 35% improvement in application execution time with the NAS Parallel Benchmarks.

  • Research Article
  • Cited by 9
  • 10.1186/s12864-022-08540-6
PhyloMissForest: a random forest framework to construct phylogenetic trees with missing data
  • May 18, 2022
  • BMC Genomics
  • Diogo Pinheiro + 2 more

Background: In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of additional applications in science. One of the most challenging problems that arise when building phylogenetic trees is the presence of missing biological data. More specifically, the possibility of inferring wrong phylogenetic trees increases proportionally to the amount of missing values in the input data. Although there are methods proposed to deal with this issue, their applicability and accuracy are often restricted by different constraints. Results: We propose a framework, called PhyloMissForest, to impute missing entries in phylogenetic distance matrices and infer accurate evolutionary relationships. PhyloMissForest is built upon a random forest structure that infers the missing entries of the input data based on its known parts. PhyloMissForest contributes a robust and configurable framework that incorporates multiple search strategies and machine learning, complemented by phylogenetic techniques, to provide a more accurate inference of lost phylogenetic distances. We evaluate our framework by examining three real-world datasets, two DNA-based sequence alignments and one containing amino acid data, and two additional instances with simulated DNA data. Moreover, we follow a design-of-experiments methodology to define the hyperparameter values of our algorithm, which is a concise method, preferable in comparison to the well-known exhaustive parameter search. By varying the percentages of missing data from 5% to 60%, we generally outperform the state-of-the-art alternative imputation techniques in the tests conducted on real DNA data. In addition, significant improvements in execution time are observed for the amino acid instance. The results observed on simulated data also denote the attainment of improved imputations when dealing with large percentages of missing data. Conclusions: By merging multiple search strategies, machine learning, and phylogenetic techniques, PhyloMissForest provides a highly customizable and robust framework for phylogenetic missing data imputation, with significant topological accuracy and effective speedups over the state of the art.

  • Book Chapter
  • Cited by 4
  • 10.1007/978-981-10-0129-1_5
Faster Load Flow Analysis
  • Jan 1, 2016
  • Rahul Saxena + 2 more

Over the past few decades, load flow algorithms for radial distribution networks have been an area of interest for researchers, which has led to improvements in the approach and results for the problem. Different procedures and algorithms have been followed in pursuit of performance enhancement in terms of simplicity of implementation, execution time, and memory space requirements. The implementation of a load flow algorithm using the CUDA parallel programming architecture for a radial distribution network is discussed. The computations involved in the serial algorithm for load current, branch impedances, etc., have been parallelized using the CUDA programming model. The end result is an improvement in the execution time of the algorithm as compared to its running time on the CPU. Finally, a comparison has been drawn between the serial and parallel approaches, where an improvement in execution time has been shown over the functions involved in the computations.

  • Conference Article
  • Cited by 20
  • 10.5555/314613.315048
Theory and practice of I/O-efficient algorithms for multidimensional batched searching problems
  • Jan 1, 1998
  • Lars Arge + 4 more

We describe a powerful framework for designing efficient batch algorithms for certain large-scale dynamic problems that must be solved using external memory. The class of problems we consider, which we call colorable external-decomposable problems, includes rectangle intersection, orthogonal line segment intersection, range searching, and point location. We are particularly interested in these problems in two and higher dimensions. They have numerous applications in geographic information systems (GIS), spatial databases, and VLSI and CAD design. We present simplified algorithms for problems previously solved by more complicated approaches (such as rectangle intersection), and we present efficient algorithms for problems not previously solved in an efficient way (such as point location and higher-dimensional versions of range searching and rectangle intersection). I/O communication is the bottleneck in many large-scale applications such as those arising in VLSI and CAD design, spatial databases, and geographic information systems (GIS). In this paper we consider I/O-efficient algorithms for batched searching problems. We consider both batched static and batched dynamic problems, and use the correspondence which often exists between a d-dimensional static problem and a (d−1)-dimensional dynamic problem to obtain a number of new d-dimensional algorithms. We give experimental results concerning the running time for our approach applied to the red-blue rectangle intersection problem, which is a key component of the extremely important database operation spatial join. Our algorithm scales well with the problem size, and for large problem sizes it greatly outperforms the well-known sweepline approach. One prominent example of the problems we consider is the rectangle intersection problem, which is a key component in VLSI design rule checking [31] and in the extremely important database operation spatial join [34]. We illustrate the practical significance of our algorithms by comparing the empirical performance of our algorithm for this problem with the well-known sweepline algorithm developed for internal memory.
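The red-blue rectangle intersection problem mentioned above has a simple internal-memory sweepline formulation, sketched below in Python. This is a hedged toy version (all names hypothetical), not the paper's I/O-efficient external-memory algorithm: sweep left to right over x-events, keep the active rectangles of each color, and when a rectangle opens, report every active rectangle of the other color whose y-interval overlaps it.

```python
def red_blue_intersections(red, blue):
    # Rectangles are (x1, x2, y1, y2) with x1 <= x2 and y1 <= y2.
    events = []
    for color, rects in (("r", red), ("b", blue)):
        for i, (x1, x2, y1, y2) in enumerate(rects):
            events.append((x1, 0, color, i))  # open (sorts before close at equal x)
            events.append((x2, 1, color, i))  # close
    events.sort()

    active = {"r": set(), "b": set()}
    rects = {"r": red, "b": blue}
    out = []
    for _, is_close, color, i in events:
        if is_close:
            active[color].discard(i)
            continue
        other = "b" if color == "r" else "r"
        _, _, y1, y2 = rects[color][i]
        for j in active[other]:
            _, _, oy1, oy2 = rects[other][j]
            if y1 <= oy2 and oy1 <= y2:  # y-intervals overlap
                out.append((i, j) if color == "r" else (j, i))
        active[color].add(i)
    return sorted(out)  # pairs as (red_index, blue_index)
```

Because both rectangles are active at the moment the later one opens, x-overlap is implicit in the sweep order and only the y-intervals need an explicit check.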

  • Conference Article
  • Cited by 5
  • 10.1109/arith.1985.6158944
A fast algorithm for the symmetric eigenvalue problem
  • Jun 1, 1985
  • J J Dongarra + 1 more

The symmetric eigenvalue problem is one of the most fundamental problems of computational mathematics. It arises in many applications and therefore represents an important area for algorithmic research. It is also one of the first eigenvalue problems for which reliable methods were obtained. It would be surprising, therefore, if a new method were to be found that would offer a significant improvement in execution time over the fundamental algorithms available in standard software packages such as EISPACK [7]. However, it is reasonable to expect that eigenvalue calculations might be accelerated through the use of parallel algorithms for the parallel computers that are emerging. We present such an algorithm in this paper. The algorithm is able to exploit parallelism at all levels of the computation and is well suited to a variety of architectures. However, a pleasant bonus of this research is that the parallel algorithm, even when run in serial mode, is significantly faster than the best sequential algorithm on large problems, and is effective on moderate-size (order ≥ 30) problems when run in serial mode.

  • Research Article
  • Cited by 3
  • 10.1007/s11554-018-0830-8
Exploiting architectural features of a computer vision platform towards reducing memory stalls
  • Oct 9, 2018
  • Journal of Real-Time Image Processing
  • Naveed Ul Mustafa + 3 more

Computer vision applications are becoming more and more popular in embedded systems such as drones, robots, tablets, and mobile devices. These applications are both compute and memory intensive, with memory bound stalls (MBS) making a significant part of their execution time. For maximum reduction in memory stalls, compilers need to consider architectural details of a platform and utilize its hardware components efficiently. In this paper, we propose a compiler optimization for a vision-processing system through classification of memory references to reduce MBS. As the proposed optimization is based on the architectural features of a specific platform, i.e., Myriad 2, it can only be applied to other platforms having similar architectural features. The optimization consists of two steps: affinity analysis and affinity-aware instruction scheduling. We suggest two different approaches for affinity analysis, i.e., source code annotation and automated analysis. We use LLVM compiler infrastructure for implementation of the proposed optimization. Application of annotation-based approach on a memory-intensive program shows a reduction in stall cycles by 67.44%, leading to 25.61% improvement in execution time. We use 11 different image-processing benchmarks for evaluation of automated analysis approach. Experimental results show that classification of memory references reduces stall cycles, on average, by 69.83%. As all benchmarks are both compute and memory intensive, we achieve improvement in execution time by up to 30%, with a modest average of 5.79%.

  • Conference Article
  • Cited by 1
  • 10.1109/dsaa49011.2020.00046
MRSweep: Distributed In-Memory Sweep-line for Scalable Object Intersection Problems
  • Oct 1, 2020
  • Tilemachos Pechlivanoglou + 2 more

Several data mining and machine learning problems can be reduced to the computational geometry problem of finding intersections of a set of geometric objects, such as intersections of line segments or rectangles/boxes. Currently, the state-of-the-art approach for addressing such intersection problems in Euclidean space is collectively known as the sweep-line or plane sweep algorithm, and has been utilized in a variety of application domains, including databases, gaming, and transportation, to name a few. The idea behind sweep line is to employ a conceptual line that is swept or moved across the plane, stopping at intersection points. However, to report all K intersections among any N objects, the standard sweep line algorithm (based on the Bentley-Ottmann algorithm) has a time complexity of O((N + K)logN) and therefore cannot scale to a very large number of objects and cases where there are many intersections. In this paper, we propose MRSWEEP and MRSWEEP-D, two sophisticated and highly scalable algorithms for the parallelization of sweep-line and its variants. We provide algorithmic details of fully distributed in-memory versions of the proposed algorithms using the MapReduce programming paradigm in the Apache Spark cluster environment. A theoretical analysis of the proposed algorithms is presented, as well as a thorough experimental evaluation that provides evidence of the algorithms' scalability in varying levels of problem complexity. We make source code and datasets available to support the reproducibility of the results.
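The spatial-partitioning idea behind distributed sweep-line approaches like this one can be sketched in a few lines of plain Python. This is an illustrative assumption about the general pattern, not MRSweep's actual MapReduce implementation: cut the x-axis into strips, replicate each segment into every strip its x-range overlaps (the "map" step), find intersections within each strip independently (here by brute force for brevity; each strip could run on a separate worker), and dedupe pairs reported by more than one strip.

```python
from itertools import combinations

def _orient(p, q, r):
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def _properly_cross(s, t):
    a, b = s
    c, d = t
    return (_orient(a, b, c) != _orient(a, b, d) and
            _orient(c, d, a) != _orient(c, d, b))

def strip_partition_intersections(segments, num_strips):
    xs = [x for seg in segments for (x, _) in seg]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / num_strips or 1.0  # guard against a degenerate range
    # Map phase: replicate each segment into every strip it overlaps.
    strips = [[] for _ in range(num_strips)]
    for i, ((x1, y1), (x2, y2)) in enumerate(segments):
        left, right = min(x1, x2), max(x1, x2)
        first = min(int((left - lo) / width), num_strips - 1)
        last = min(int((right - lo) / width), num_strips - 1)
        for k in range(first, last + 1):
            strips[k].append(i)
    # Reduce phase: strips are independent, so these loops parallelize;
    # a set dedupes pairs discovered in more than one strip.
    found = set()
    for strip in strips:
        for i, j in combinations(strip, 2):
            if _properly_cross(segments[i], segments[j]):
                found.add((min(i, j), max(i, j)))
    return sorted(found)
```

The replication cost of segments spanning many strips is the usual trade-off in this style of partitioning, which is where the variants and load-balancing analysis in papers like this one come in.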

  • Conference Article
  • Cited by 6
  • 10.1145/1450095.1450120
Control flow optimization in loops using interval analysis
  • Oct 19, 2008
  • Mohammad Ali Ghodrat + 2 more

We present a novel loop transformation technique, particularly well suited for optimizing embedded compilers, where an increase in compilation time is acceptable in exchange for a significant performance increase. The transformation technique optimizes loops containing nested conditional blocks. Specifically, the transformation takes advantage of the fact that the Boolean value of the conditional expression, determining the true/false paths, can be statically analyzed using a novel interval analysis technique that can evaluate conditional expressions in the general polynomial form. Results from interval analysis combined with loop dependency information are used to partition the iteration space of the nested loop. In such cases, the loop nest is decomposed so as to eliminate the conditional test, thus substantially reducing the execution time. Our technique completely eliminates the conditional from the loops (unlike previous techniques), thus further facilitating the application of other optimizations and improving the overall speedup. Applying the proposed transformation technique on loop kernels taken from Mediabench, SPEC-2000, mpeg4, qsdpcm, and gimp, on average we measured a 75% (1.75X) improvement in execution time when running on a SPARC processor, a 336% (4.36X) improvement when running on an Intel Core Duo processor, and a 198.9% (2.98X) improvement when running on a PowerPC G5 processor.
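A toy Python illustration of the transformation this abstract describes (not the paper's interval-analysis machinery; the functions and the polynomial condition are hypothetical): when analysis proves the loop's conditional holds exactly on a sub-range of the iteration space, the loop splits into condition-free loops over the partitioned ranges.

```python
import math

def original(a, k):
    out = []
    for i in range(len(a)):
        if i * i < k:            # conditional test evaluated every iteration
            out.append(a[i] * 2)
        else:
            out.append(a[i] + 1)
    return out

def transformed(a, k):
    # For non-negative integers, i*i < k holds exactly for
    # i <= isqrt(k - 1), so the iteration space splits in two.
    split = 0 if k <= 0 else min(len(a), math.isqrt(k - 1) + 1)
    out = [a[i] * 2 for i in range(split)]           # true-path loop, branch-free
    out += [a[i] + 1 for i in range(split, len(a))]  # false-path loop, branch-free
    return out
```

Both functions compute the same result, but the transformed version hoists the polynomial condition out of the loop body entirely, which is what enables the downstream optimizations the abstract mentions.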

  • Research Article
  • Cited by 7
  • 10.3390/s23104606
From Lab to Real World: Assessing the Effectiveness of Human Activity Recognition and Optimization through Personalization.
  • May 9, 2023
  • Sensors
  • Marija Stojchevska + 4 more

Human activity recognition (HAR) algorithms today are designed and evaluated on data collected in controlled settings, providing limited insights into their performance in real-world situations with noisy and missing sensor data and natural human activities. We present a real-world HAR open dataset compiled from a wristband equipped with a triaxial accelerometer. During data collection, participants had autonomy in their daily life activities, and the process remained unobserved and uncontrolled. A general convolutional neural network model was trained on this dataset, achieving a mean balanced accuracy (MBA) of 80%. Personalizing the general model through transfer learning can yield comparable and even superior results using fewer data, with the MBA improving to 85%. To emphasize the issue of insufficient real-world training data, we conducted training of the model using the public MHEALTH dataset, resulting in 100% MBA. However, upon evaluating the MHEALTH-trained model on our real-world dataset, the MBA drops to 62%. After personalizing the model with real-world data, an improvement of 17% in the MBA is achieved. This paper showcases the potential of transfer learning to make HAR models trained in different contexts (lab vs. real-world) and on different participants perform well for new individuals with limited real-world labeled data available.

  • Conference Article
  • 10.1109/esmarta56775.2022.9935508
DEADC: Density Extending Algorithm for Data Clustering
  • Oct 25, 2022
  • Mahmoud Abdullah Al-Mojahed + 1 more

Data clustering is a very active research area in machine learning and knowledge discovery. Generating clusters of different densities is a challenging task. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a density-based clustering algorithm that has problems discovering clusters of varied density, since it uses a fixed radius, and its quadratic time complexity makes it difficult to apply in real applications with large datasets. In this paper, a Density Extending Algorithm for Data Clustering (DEADC) is proposed to cluster datasets with different densities, sizes, and noise with better accuracy and less execution time. DEADC uses a dynamic radius variable called ϵ_Extended, based on statistical analysis, that assigns a regional density value for each cluster by extending the data points with the ϵ_Extended neighborhood. DEADC implicitly needs to compute the empirical density for each created cluster, leading to linear time complexity. Experimental results showed the effectiveness of DEADC for identifying clusters with varied densities from synthetic and real-world datasets, with a significant improvement in clustering accuracy and execution time.

  • Conference Article
  • Cited by 29
  • 10.1145/1629395.1629422
Instruction cache locking inside a binary rewriter
  • Oct 11, 2009
  • Kapil Anand + 1 more

Cache memories in embedded systems play an important role in reducing the execution time of the applications. Various kinds of extensions have been added to cache hardware to enable software involvement in replacement decisions, thus improving the run-time over a purely hardware-managed cache. Novel embedded systems, like Intel's Xscale and ARM Cortex processors provide the facility of locking one or more lines in cache - this feature is called cache locking. This paper presents the first method in the literature for instruction-cache locking that is able to reduce the average-case run-time of the program. We devise a cost-benefit model to discover the memory addresses which should be locked in the cache. We implement our scheme inside a binary rewriter, thus widening the applicability of our scheme to binaries compiled using any compiler. Results obtained on a suite of MiBench and MediaBench benchmarks show up to 25% improvement in the instruction-cache miss rate on average and up to 13.5% improvement in the execution time on average for applications having instruction accesses as a bottleneck, depending on the cache configuration. The improvement in execution time is as high as 23.5% for some benchmarks.

  • Book Chapter
  • 10.1007/978-3-540-39920-9_10
Tailoring Software Pipelining for Effective Exploitation of Zero Overhead Loop Buffer
  • Jan 1, 2003
  • Gang-Ryung Uh

A Zero Overhead Loop Buffer (ZOLB) is an architectural feature that is commonly found in DSPs (Digital Signal Processors). This buffer can be viewed as a compiler (or program) managed cache that can hold a limited number of instructions, which will be executed a specified number of times without incurring any loop overhead. Preliminary versions of the research, which exploit a ZOLB, report significant improvement in execution time with a minimal code size increase [UH99,UH00]. This paper extends the previous compiler efforts to further exploit a ZOLB by employing a new software pipelining methodology. The proposed techniques choose complex instructions, which capitalize on instruction level parallelism across loop iteration boundaries. Unlike the traditional pipelining techniques, the proposed pipelining strategy is tightly coupled with instruction selection so that it can perform register renaming and/or proactively generate additional instruction(s) on the fly to discover more loop parallelism on the ZOLB. This framework reports additional significant improvements in execution time with modest code size increases for various signal processing applications on the DSP16000.

  • Conference Article
  • 10.1109/elnano.2015.7146936
Possibility of the pipelining technique application in a space/spatial-frequency filter implementation based on the local frequency estimation
  • Apr 1, 2015
  • Veselin N Ivanovic + 3 more

The possibility of applying the pipelining technique in a local frequency estimation-based system for the estimation of nonstationary two-dimensional (2D) FM signals, initially observed in [1], [2], but completely developed, implemented (in FPGA), and adapted (extended to the highly nonstationary signals case) in [3], is considered here. The applied technique allows the implemented filter to overlap, in execution, unconditional steps performed in neighboring space/spatial-frequency (S/SF) instants and, therefore, to significantly improve execution time. In this way, an improvement in execution time corresponding to one clock cycle (CLK) per S/SF point is achieved, which means that the improvement per S/SF point can reach 50% in a great part of the S/SF space. The achieved improvement in execution time per S/SF point is presented on the example of filtering a multicomponent nonstationary noisy signal.

  • Research Article
  • 10.1016/j.compbiomed.2025.110198
A framework to create, evaluate and select synthetic datasets for survival prediction in oncology.
  • Jun 1, 2025
  • Computers in biology and medicine
  • A T Christoforou + 14 more

