Articles published on Parallel Implementation
6878 Search results
- Research Article
- 10.29140/jct.v2n1.103473
- Apr 23, 2026
- JALTCALL Trends
- Olga Li + 1 more
Two instructors at the same Japanese university compared two parallel implementations of ChatGPT-assisted draft revision within identical course structures. The results showed how teacher backgrounds, beliefs, and experience function in shaping pedagogical realities and teacher attitudes toward generative artificial intelligence (GenAI) integration in English as a Foreign Language (EFL) classrooms. The study proposes that professional dialogue between instructors with different beliefs and experiences may be productive for effective AI implementation.
- Research Article
- 10.1021/acs.jctc.6c00116
- Apr 7, 2026
- Journal of chemical theory and computation
- Phillip S Thomas + 8 more
StochasticGW is a code for computing accurate quasi-particle (QP) energies of molecules and material systems in the GW approximation. StochasticGW utilizes the stochastic Resolution of the Identity (sROI) technique to enable a massively parallel implementation with computational costs that scale semilinearly with system size, allowing the method to access systems with tens of thousands of electrons. We introduce a new implementation, StochasticGW-GPU, for which the main bottleneck steps have been ported to GPUs and give substantial performance improvements over previous versions of the code. We showcase the new code by computing band gaps of hydrogenated silicon clusters (Si_xH_y) containing up to 10,001 atoms and 35,144 electrons, and we obtain individual QP energies with a statistical precision of better than ±0.03 eV with times-to-solution of less than 1 h.
- Research Article
- 10.3390/metabo16040245
- Apr 4, 2026
- Metabolites
- Robert M Flight + 2 more
Background: Almost all correlation measures currently available are unable to directly handle missing values. Typically, missing values are either ignored completely by removing them or are imputed and used in the calculation of the correlation coefficient. In either case, the correlation value will be impacted based on the perspective that the missing data represents no useful information. However, missing values occur in real datasets for a variety of reasons. In metabolomics datasets a major reason for missing values is that a specific measurable phenomenon falls below the detection limits of the analytical instrumentation (left-censored values). These missing data are not missing at random, but represent potentially useful information by virtue of their "missingness" at one end of the data distribution. Methods: To include this information due to left-censored missingness, we propose the information-content-informed Kendall-tau (ICI-Kt) methodology. We develop a statistical test and then show that most missing values in metabolomics datasets are the result of left-censorship. Next, we show how left-censored missing values can be included within the definition of the Kendall-tau correlation coefficient, and how that inclusion leads to an interpretation of information being added to the correlation. We also implement calculations for additional measures of theoretical maxima and pairwise completeness that add further layers of information interpretation in the methodology. Results: Using both simulated and over 700 experimental data sets from the Metabolomics Workbench, we demonstrate that the ICI-Kt methodology allows for the inclusion of left-censored missing data values as interpretable information, enabling both improved determination of outlier samples and improved feature-feature network construction. 
Conclusions: We provide explicitly parallel implementations in both R and Python that allow fast calculations of all the variables used when applying the ICI-Kt methodology on large numbers of samples. The ICI-Kt methods are available as an R package and Python module on GitHub.
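The idea of counting left-censored missing values as information can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' ICI-Kt package: it computes a plain Kendall tau-a, and the function name `kendall_tau_left_censored` and the choice to represent missing values as `None` and rank them below every observed value are hypothetical.

```python
def kendall_tau_left_censored(x, y):
    """Kendall tau-a in which None is ranked below every observed value.

    Assumption for illustration: a missing value means "below the
    detection limit", so it is treated as smaller than all observed values.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    def fill(values):
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("at least one observed value is required")
        floor = min(observed) - 1.0  # any value below the observed minimum works
        return [floor if v is None else v for v in values]

    xs, ys = fill(x), fill(y)
    n = len(xs)
    if n < 2:
        raise ValueError("at least two samples are required")

    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
            # sign == 0 is a tie (for example, two missing values); it counts for neither

    return (concordant - discordant) / (n * (n - 1) / 2)


# One sample is missing in both variables; it still contributes concordant pairs.
x = [None, 1.0, 2.0, 3.0]
y = [None, 2.0, 1.0, 4.0]
tau = kendall_tau_left_censored(x, y)
```

Each pair comparison is independent of the others, so the double loop is the part that a parallel implementation would split across workers.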
- Research Article
- 10.1088/1361-6560/ae5586
- Apr 2, 2026
- Physics in Medicine & Biology
- Christopher B C Dean + 5 more
Objective. To systematically assess the accuracy and computational performance of a newly proposed stochastic differential equation (SDE)-based model for proton beam dose calculation by benchmarking it against Geant4 in a set of simplified but increasingly challenging phantom geometries. Approach. Building on previous work in Crossley et al (2025 Proc. R. Soc. A 481 20240687), where energy deposition from a proton beam was modelled using an SDE framework, we implemented the model using standard approximations to interaction cross sections and mean excitation energies, enabling straightforward adaptation to new materials and configurations. The model was benchmarked against Geant4 in homogeneous, longitudinally heterogeneous and laterally heterogeneous phantoms, for assessment of depth-dose behaviour, lateral transport and impact of material heterogeneities. Main results. Across all phantom configurations and beam energies, the SDE model reproduced the main depth-dose characteristics predicted by Geant4, with proton range agreement within 0.2 mm for 100 MeV beams and within 0.6 mm for 150 MeV beams. Voxel-wise comparisons yielded gamma pass rates exceeding 95% for all cases under strict 2%/0.5 mm criteria with a 1% dose threshold. Differences between the two approaches were spatially localised and primarily associated with regions of steep dose gradients or material heterogeneities, while overall lateral beam dispersion was well reproduced. In terms of computational performance, the SDE model achieved speed-up factors of approximately 2.5-3 relative to single-threaded Geant4, consistently across different Geant4 physics lists. Significance. These results demonstrate that the SDE-based approach can reproduce key dosimetric features predicted by high-fidelity Monte Carlo simulations with good accuracy while already offering a moderate reduction in computational cost. 
Owing to its formulation, the method is naturally amenable to parallel and GPU-accelerated implementations, suggesting potential for substantial further speed improvements. This makes the approach a promising candidate for fast dose calculations in proton therapy.
- Research Article
- 10.1002/cpe.70715
- Apr 1, 2026
- Concurrency and Computation: Practice and Experience
- Michel B Cordeiro + 1 more
Reduction is an operation that combines all the elements of a collection by applying a binary operation, such as sum, maximum, or minimum, to obtain a single resulting value. This paper investigates implementation strategies for both segmented and non‐segmented reduction on GPUs. Existing techniques for segmented reduction often show consistent performance but are relatively inefficient in absolute terms. These techniques are frequently optimized for specific workloads and, as a result, may exhibit degraded performance with certain input data, especially when segments have varying sizes. The algorithm presented in this paper employs six different strategies for handling segments of varying sizes and automatically selects the most suitable one at runtime based on segment size. To ensure portability and consistent performance across GPU architectures, we developed a tuning algorithm that automatically optimizes for current or future GPUs. Experiments conducted on GPUs with diverse architectures demonstrate the effectiveness of this tuning approach and validate the algorithms' consistent performance. Overall, the proposed method achieves over a 135× speedup when processing a large number of small segments, and up to a 98× speedup for a small number of large segments, depending on the distribution, compared to segmented reduction algorithms in state‐of‐the‐art GPU libraries. This paper also explores strategies to accelerate non‐segmented reduction, resulting in up to a 1.72× improvement compared to other parallel GPU implementations.
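As a reference for the two operations named in this abstract, here is a minimal sequential sketch in Python. It is not the paper's GPU algorithm; the function name `segmented_reduce` and the offsets-array layout (segment i spans `offsets[i]` up to `offsets[i + 1]`) are assumptions chosen for illustration.

```python
from functools import reduce
import operator


def segmented_reduce(values, offsets, op=operator.add):
    """Reduce each segment values[offsets[i]:offsets[i + 1]] with a binary op."""
    results = []
    for start, end in zip(offsets, offsets[1:]):
        segment = values[start:end]
        if not segment:
            # Without an identity element, an empty segment has no defined result.
            raise ValueError("empty segment")
        results.append(reduce(op, segment))
    return results


values = [3, 1, 4, 1, 5, 9, 2, 6]
offsets = [0, 2, 3, 8]  # three segments of sizes 2, 1 and 5

segment_sums = segmented_reduce(values, offsets)         # one sum per segment
segment_maxima = segmented_reduce(values, offsets, max)  # one maximum per segment
total = reduce(operator.add, values)                     # non-segmented reduction
```

The segments here have different sizes on purpose: that variation is the case the abstract identifies as difficult for implementations tuned to a single workload.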
- Research Article
- 10.1109/tevc.2025.3560217
- Apr 1, 2026
- IEEE Transactions on Evolutionary Computation
- Amanda S Dufek + 3 more
Bilevel optimization is encountered in many relevant real-world applications. The main feature of this type of problem is that an upper-level optimization problem is constrained by a nested lower-level optimization problem. Because of this nested structure, bilevel problems are usually computationally expensive to solve. Differential Evolution has demonstrated promising results in solving bilevel problems of relatively small scales. As the problem scale increases, the decision space becomes intrinsically larger, requiring a growing number of function evaluations for the method to work properly. In this context, heavy parallelization and high-performance computing techniques are indispensable to enable the resolution of more complex and challenging optimization problems. Hence, we propose a hierarchical many-threaded parallel Differential Evolution approach for bilevel problems, where both levels are parallelized. The computational experiments demonstrate that the parallel implementation ran 44 to 2559 times faster than the sequential version on a well-known scalable SMD benchmark test problem when executed on an NVIDIA A100 GPU. The findings indicate that the algorithm’s convergence is strongly influenced by the number of both upper- and lower-level generations. Moreover, the success of experiments with large-scale problems is closely linked to the choice of small population sizes.
- Research Article
- 10.1371/journal.pone.0343262
- Mar 27, 2026
- PloS one
- Tram Nguyen + 3 more
This paper proposes a parallel hybrid metaheuristic, named PH-SHOWOA, that integrates the Spotted Hyena Optimizer (SHO) and the Whale Optimization Algorithm (WOA) to solve the Vehicle Routing Problem with Simultaneous Pickup and Delivery and Time Windows (VRPSPDTW). The proposed method leverages the strength of both algorithms: SHO primarily supports population-level diversification, while WOA focuses on best-guided intensification. An adaptive probability control mechanism dynamically regulates the interaction between these two search behaviours during the optimization process. To further enhance robustness and mitigate premature convergence, the framework incorporates simulated-annealing-based acceptance, periodic local search, and population diversification strategies. A parallel implementation enables concurrent solution updates and local refinements, improving computational efficiency on medium-scale instances. The VRPSPDTW is formulated using a hierarchical lexicographic objective that prioritizes minimizing the number of vehicles, followed by total travel distance. Extensive experiments on 65 well-known benchmark instances demonstrate that PH-SHOWOA consistently outperforms standalone SHO and WOA, achieving an average reduction in total distance of over 10%. Compared with advanced algorithms such as Co-GA, MA-FIRD, and ACO-DR, PH-SHOWOA exhibits competitive and often superior performance. Notably, it achieves the lowest total distance on several Rdp and Cdp instances and performs well in centralized-demand scenarios. Furthermore, comprehensive non-parametric statistical tests are conducted to verify the effectiveness and robustness of the proposed method.
- Research Article
- 10.22399/ijcesen.5080
- Mar 26, 2026
- International Journal of Computational and Experimental Science and Engineering
- Navya Reddy Kunta
Enterprise software organizations face mounting pressure to accelerate deployment cycles while maintaining comprehensive quality assurance standards that protect business operations and customer trust. Traditional sequential testing approaches within continuous integration and continuous deployment pipelines create critical bottlenecks that constrain software delivery velocity and force difficult trade-offs between test coverage breadth and feedback speed. This article examines the implementation of parallel and distributed testing architectures leveraging Selenium Grid and Docker containerization to address these challenges in large-scale enterprise environments. The distributed framework employs hub-node topology coordinating test execution across containerized browser nodes with intelligent load balancing algorithms and dynamic scaling mechanisms. Performance evaluation demonstrates substantial execution time reductions enabling transformation from extended overnight testing cycles to rapid feedback loops compatible with continuous integration practices. Deployment frequency increases directly attributable to reduced feedback cycle duration enable authentic continuous delivery practices where individual features deploy independently upon completion. Cost-benefit analysis reveals optimal parallelization configurations balancing performance improvements against infrastructure expenses and resource utilization efficiency. Scalability measurements confirm sub-linear execution time growth as test suites expand organically, indicating sustainable accommodation of increasing quality coverage requirements. Reliability metrics demonstrate operational stability comparable to serial execution approaches while maintaining high availability essential for production pipeline integration. 
Strategic implications extend throughout software development lifecycle management, enabling shift-left quality practices and sophisticated release management capabilities including progressive rollouts and rapid experimentation. The documented architectural patterns, implementation guidance, and empirical performance characteristics provide actionable frameworks for organizations seeking to resolve tensions between comprehensive quality assurance and competitive delivery velocity in demanding enterprise contexts.
- Research Article
- 10.1002/cpe.70673
- Mar 24, 2026
- Concurrency and Computation: Practice and Experience
- Gabriel Hévr + 2 more
This paper presents PPQSort (Pattern‐defeating Parallel Quicksort), a new parallel quicksort algorithm that provides high performance and ease of use. PPQSort uses C++ threads for parallelization, achieving efficient sorting without external libraries and allowing seamless integration across different computing environments. This paper describes novel quicksort optimizations, including branchless partitioning and their scalable parallel implementation. PPQSort is compared with existing parallel quicksort algorithms on two different CPU architectures (AMD EPYC, ARM A64FX) and with 7 different synthetic input data distributions. An analysis of its cache behavior is also provided. The results of the experimental evaluation demonstrate that PPQSort is fast and robust, consistently outperforming the fastest available parallel quicksort‐based implementations for all tested inputs.
- Research Article
- 10.5194/gmd-19-2279-2026
- Mar 19, 2026
- Geoscientific Model Development
- Kun Liu + 11 more
Although the Singular Value Decomposition-three Dimensional Ensemble Variational (SVD-3DEnVar) data assimilation scheme has been applied successfully in real case simulations with comprehensive numerical weather prediction models, its computational efficiency still falls short of the demands of operational numerical prediction. The main limitations lie in the generation of three-dimensional perturbations and the implementation of parallel calculations. To bridge this gap towards operational readiness, this study introduces key computational optimizations: a new three-dimensional perturbation field generation scheme that supports multi-process parallelism and can directly generate any specified grid, and an efficient parallel implementation scheme tailored for the local patch assimilation in the SVD-3DEnVar scheme. Results from an Observing System Simulation Experiment (OSSE) based on the Tropical Regional Atmospheric Model System (TRAMS) indicate that after computational efficiency optimization, the time required to generate a 3D perturbation field was reduced from 22 min to 2.2 s, while the runtime of the assimilation process decreased from costly serial execution to 1700 min under single-node parallel execution (with 64 cores), and further to less than 15 min (using 150 nodes in parallel). Finally, we conducted an assimilation experiment using actual observational data of sea surface wind fields to preliminarily validate the reasonableness of the assimilation results from the optimized SVD-3DEnVar scheme.
- Research Article
- 10.33788/rcis.92.6
- Mar 15, 2026
- Revista de Cercetare si Interventie Sociala
- Anna Borysova + 1 more
The growing number of humanitarian crises in the world is highlighting the need to find new models of assistance that can ensure not only physical survival, but also long-term sustainability and integration of affected communities. The Ukrainian context, where the war has created multidimensional challenges to public well-being, requires special attention. The purpose of the study is to examine the conceptual foundations of a multi-level approach to humanitarian interventions and analyze the logic of their application in crises. The methodological basis of the work is a combination of conceptual analysis, interdisciplinary approaches and field assessment data, supplemented by international standards (Sphere Association, HNO, MSNA). As a result, it was found that a multi-level approach allows systematizing humanitarian activities according to the logic of a gradual transition from life support to life realization, reducing the risk of duplication of efforts and forming the basis for social sustainability. It has been shown that taking into account the physical, social, psychological and cultural dimensions of well-being ensures greater targeting of assistance and promotes the development of social capital. The practical significance of the work lies in the possibility of using the proposed model to assess the state of communities and plan targeted interventions that meet current needs. It is proved that the parallel implementation of psychosocial support enhances the effectiveness of material assistance and forms the basis for long-term development. The novelty of the study lies in the adaptation of the multi-level logic to the Ukrainian context, which allows combining international experience with local specifics. A limitation of the study is the insufficient representativeness of field data and the need for long-term observations. 
Further research should focus on developing mechanisms for coordinating humanitarian actors and integrating quantitative and qualitative indicators into monitoring systems.
- Research Article
- 10.3390/appliedmath6030043
- Mar 9, 2026
- AppliedMath
- Mudassir Shams + 1 more
Nonlinear systems with multiple roots arise frequently in biomedical and engineering models, yet their reliable numerical solution remains a challenging task. Many classical methods suffer from sensitivity to initial guesses, reduced convergence rates, and loss of accuracy in the presence of multiple or clustered solutions. In addition, the exploitation of parallelism to improve robustness and computational efficiency has received limited attention. In this work, we propose a high-accuracy parallel numerical framework of fourth-order convergence for the simultaneous approximation of all solutions of nonlinear systems with multiple roots. The proposed scheme is derivative-free and structurally decoupled, enabling efficient parallel implementation and robust convergence even when reliable initial approximations are unavailable. The effectiveness of the method is demonstrated on representative biomedical engineering models, including a glucose–insulin–glucagon regulatory network and a multi-compartment pharmacokinetic system, both exhibiting strong nonlinearity and multistability. Numerical experiments confirm stable convergence toward distinct solution clusters, machine-level accuracy, reduced residual norms, and improved computational performance when compared with existing approaches. These results indicate that the proposed framework provides a reliable and efficient alternative for solving nonlinear systems with multiple roots in complex applied settings.
- Research Article
- 10.1109/tap.2025.3646254
- Mar 1, 2026
- IEEE Transactions on Antennas and Propagation
- Linxi Wang + 2 more
This paper proposes a time-domain parallel method that integrates the hybrid implicit-explicit finite-difference time-domain (HIE-FDTD) method with time-domain physical optics (TDPO), implemented using the message passing interface (MPI) library. Compared to conventional FDTD/TDPO hybrid approaches, the proposed method offers two main advantages. First, the HIE-FDTD algorithm, as a weakly conditionally stable implicit scheme, enables the use of significantly larger time steps by relaxing the Courant–Friedrichs–Lewy (CFL) constraint, and a sparse sampling strategy combined with a sliding-window mechanism is employed in the near-to-far-field extrapolation process to support long-time transient simulations. Second, the method supports distributed parallelization across multiple processors, substantially reducing computational time. Several numerical examples are presented to validate the accuracy and efficiency of the proposed approach. The simulation results show good agreement with those obtained by traditional methods, and the parallel implementation consistently achieves approximately 80% efficiency across varying processor counts.
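For readers unfamiliar with the efficiency figure quoted here, parallel efficiency is conventionally defined as speedup divided by processor count. The sketch below uses that standard definition; the timings are hypothetical numbers chosen only to produce 80%, not measurements from the paper, and the helper name `parallel_efficiency` is an assumption.

```python
def parallel_efficiency(t_serial, t_parallel, processors):
    """Return speedup divided by processor count (1.0 means ideal scaling)."""
    if t_parallel <= 0 or processors <= 0:
        raise ValueError("time and processor count must be positive")
    speedup = t_serial / t_parallel
    return speedup / processors


# Hypothetical wall-clock times in seconds, keyed by processor count.
t_serial = 100.0
runs = {2: 62.5, 4: 31.25, 8: 15.625}

efficiencies = {p: parallel_efficiency(t_serial, t, p) for p, t in runs.items()}
```

An efficiency that stays near the same value as the processor count grows, as in this made-up data, is what "consistently achieves approximately 80% efficiency" describes.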
- Research Article
- 10.1016/j.array.2026.100701
- Mar 1, 2026
- Array
- Héctor Migallón + 3 more
This paper presents an efficient approach to toolpath generation tailored to the needs of Industry 5.0, with a focus on turning lathe machining. The study addresses the challenge of rapidly and accurately generating helical toolpaths in personalized manufacturing, where traditional sequential methods often become computational bottlenecks. To overcome this limitation, we propose efficient parallel implementations of the Virtual Digitizing (VD) algorithm, specifically designed to accelerate the computation of machining trajectories on both multicore and manycore architectures. The multicore implementation achieves notable speedups, especially when execution is properly tuned. The manycore strategy explores both asynchronous (coarse-grained) and synchronous (fine-grained) execution models. In the asynchronous method, independent trajectory computations are assigned to separate CUDA threads, whereas the synchronous method further parallelizes the internal processing of each trajectory point, providing finer computational granularity. Experimental evaluations conducted on authentic industrial shoe last models reveal notable gains in computational efficiency. The manycore implementation achieves up to 70x acceleration on low-end GPUs, over 80x on high-range devices and over 270x on state-of-the-art GPU devices when compared to their respective CPU-based computations. Although the synchronous method introduces additional complexity, it delivers the best performance on powerful GPU platforms, whereas the asynchronous method is better suited for resource-constrained systems. Therefore, the study concludes that the optimal parallelization strategy depends on the available hardware.
- Research Article
- 10.18421/tem151-08
- Feb 27, 2026
- TEM Journal
- Radostin Rafailov + 1 more
Steganography plays a vital role in information security as it enables the covert transmission of sensitive data. Among various steganographic techniques, least significant bit (LSB) steganography is one of the most widely employed methods for embedding hidden information in digital images. This paper introduces a novel parallel color-specific LSB steganographic technique designed for 24-bit *.BMP images. The proposed method selectively embeds data into pixels that match a specific color within a defined tolerance, leveraging parallel processing to enhance computational efficiency and reduce encoding and decoding times. Experimental evaluations, including histogram analysis and peak signal-to-noise ratio (PSNR) calculations, confirm that the proposed algorithm maintains high imperceptibility and ensures minimal distortion to the cover image. Furthermore, the parallel implementation significantly reduces execution time, especially when utilizing multiple CPU cores, making the method highly scalable and suitable for large datasets. The results demonstrate the potential of the proposed approach for secure, high-performance data embedding in real-world steganographic applications.
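The color-selective embedding described in this abstract can be illustrated with a small sequential sketch. This is not the paper's algorithm: the function names, the choice of the blue channel as carrier, and the decision to ignore the blue LSB when testing for a color match (so that embedding never changes which pixels match) are all assumptions made for the example. Pixels are plain `(r, g, b)` tuples rather than a parsed BMP file.

```python
def _matches(pixel, target, tol):
    """True if pixel is within tol of target on every channel.

    The blue LSB is masked out so that writing a payload bit into it
    cannot change the outcome of this test (an assumption of this sketch).
    """
    r, g, b = pixel
    tr, tg, tb = target
    return (abs(r - tr) <= tol
            and abs(g - tg) <= tol
            and abs((b & ~1) - (tb & ~1)) <= tol)


def embed(pixels, bits, target, tol):
    """Write bits into the blue LSB of matching pixels, in order."""
    out = []
    remaining = iter(bits)
    pending = next(remaining, None)
    for pixel in pixels:
        if pending is not None and _matches(pixel, target, tol):
            r, g, b = pixel
            out.append((r, g, (b & ~1) | pending))
            pending = next(remaining, None)
        else:
            out.append(pixel)
    if pending is not None:
        raise ValueError("not enough matching pixels for the payload")
    return out


def extract(pixels, n_bits, target, tol):
    """Read n_bits back from the blue LSB of matching pixels, in order."""
    bits = [p[2] & 1 for p in pixels if _matches(p, target, tol)]
    return bits[:n_bits]


cover = [(200, 10, 10), (0, 0, 0), (201, 12, 11), (199, 9, 8), (50, 50, 50)]
payload = [1, 0, 1]
stego = embed(cover, payload, target=(200, 10, 10), tol=2)
recovered = extract(stego, len(payload), target=(200, 10, 10), tol=2)
```

In this sketch the match test for each pixel does not depend on any other pixel, which is the kind of per-pixel work that can be divided among CPU cores.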
- Research Article
- 10.1007/s00211-026-01531-9
- Feb 26, 2026
- Numerische Mathematik
- Oded Schwartz + 3 more
Alternative basis matrix multiplication algorithms are the fastest matrix multiplication algorithms in practice to date. However, are they numerically stable? We obtain the first numerical error bound for alternative basis matrix multiplication algorithms. Particularly, we derive a novel fast matrix multiplication algorithm with a 2-by-2 base case that simultaneously attains the optimal leading coefficient for arithmetic costs, and achieves an improved asymptotic error bound. We further show that arithmetic costs and error bounds of alternative basis algorithms can be simultaneously optimized. We provide high-performance parallel implementations of our algorithms with benchmarks showing that our algorithm is on par with the best in class for speed, and the best in class for stability. Finally, we show that diagonal scaling stability improvement techniques for fast matrix multiplication are as effective for alternative basis algorithms, both theoretically and empirically. These findings promote the use of alternative basis matrix multiplication algorithms in practical applications.
- Research Article
- 10.1063/5.0303206
- Feb 21, 2026
- The Journal of chemical physics
- Thies Romig + 5 more
The multi-configurational Ehrenfest (MCE) method offers a promising trajectory-based alternative to exact quantum dynamics for simulating non-adiabatic transitions, combining quantum and semiclassical elements through a basis of independent mean-field trajectories. A critical challenge in MCE lies in the optimal selection of initial trajectory conditions. In this work, we systematically evaluate different strategies for generating such initial conditions, including physical approaches based on Wigner and compressed Wigner distributions, as well as regular geometric constructions such as single- and multi-layered spherical and cubic grids. Among them, the cubic grid with a unit spacing in dimensionless coordinates emerges as a nearly universal and system-independent choice. We further demonstrate that, in systems with multiple conical intersections, trajectory cloning improves performance, but only after achieving convergence with respect to trajectory number using a well-constructed basis. Overall, MCE proves particularly advantageous for problems involving many coupled electronic states and offers flexibility for parallel implementation, making it especially relevant for the simulation of broadband excitation processes in attosecond science.
- Research Article
- 10.3390/a19020145
- Feb 10, 2026
- Algorithms
- Zhirui Niu + 5 more
Polygon Boolean operations are widely used in integrated circuit (IC) layout processing tasks such as design rule checking (DRC) and optical proximity correction (OPC). Single-threaded Boolean algorithms cannot meet the efficiency demand of modern IC layouts, necessitating parallel algorithms for acceleration. However, existing parallel algorithms exhibit unsatisfactory parallel speedups and limited scalability, which typically stem from an inefficient merging phase that uses generic Boolean OR operations and redundantly reprocesses all edges of polygons on grid boundaries. To solve these problems, we propose PolygonTailor, a novel parallel algorithm for polygon Boolean operations that employs a data-parallel strategy and a new merging approach performing incremental XOR operations solely on edges along grid boundaries, eliminating redundant computations in previous methods. This innovation reduces the grid-merging time by 1–2 orders of magnitude. Compared with the parallel implementation from a commercial layout processing tool, PolygonTailor is on average 5.08× faster and up to 14.36× faster for OR operations that generate highly complex polygons.
- Research Article
- 10.3390/axioms15020121
- Feb 6, 2026
- Axioms
- Sanmin Wang + 1 more
Let $\Phi_N(X,Y)$ be the $N$-th classical modular polynomial and let $Z_0(N)=\{(X,Y)\in\mathbb{C}^2 \mid \Phi_N(X,Y)=0\}$ be the plane model of the modular curve $X_0(N)$. We present an explicit procedure that, for a prime $\ell$, enumerates all non-cuspidal singular points of $Z_0(\ell)$ over $\mathbb{C}$ and outputs the corresponding pairs of distinct points on $X_0(\ell)$ mapping to each node. The method relies on the arithmetic (CM) classification of self-intersections of the map $X_0(\ell)\to Z_0(\ell)$ and on effective computations of proper ideal classes in imaginary quadratic orders. We also provide a complete and self-contained exposition of Kara’s proof of the automorphism-group equality $\mathrm{Aut}(E)=\mathrm{Aut}(E')$ in the self-intersection setting, making explicit where Kolyvagin’s conductor lemma is used essentially. Finally, we discuss termination, correctness, and practical complexity issues, and we report computational evidence for larger primes using a parallel implementation; in particular, for $\ell=389$, we obtained 151,288 output pairs in 151,017 seconds on a 56-core machine.
- Research Article
- 10.17587/mau.27.97-105
- Feb 6, 2026
- Mekhatronika, Avtomatizatsiya, Upravlenie
- S V Sokolov + 1 more
Currently, the integration of satellite navigation systems (SNS) and correlation-extremal navigation systems (CENS) for unmanned vehicles (UVs) is implemented based on principles of separate or, at best, weakly coupled integration, where their measurements are processed by different navigation algorithms (stochastic filters) with subsequent correction of inevitable discrepancies using various optimization methods. This approach is characterized by both high computational costs due to the need for parallel implementation of SNS and CENS measurement processing algorithms and subsequent optimization problem solving, as well as critical dependence of positioning accuracy on increasing levels of radio measurement interference. In this regard, a solution is proposed to improve the positioning accuracy of UVs based on the principle of tightly coupled integration, which involves representing the UV’s coordinate vector and the terrain elevation of the underlying surface as a single navigation vector, estimated by a common stochastic filter. Such measurement processing, in addition to significantly reducing computational costs, ensures robust and high-precision estimation of UV navigation parameters under conditions of intense interference of both natural and artificial origin. The results of a numerical experiment illustrating the effectiveness of the proposed approach are presented.