Articles published on High Speedup
378 Search results
- Research Article
- 10.1007/s10509-025-04529-1
- Dec 1, 2025
- Astrophysics and Space Science
- Charbel Mamlankou + 1 more
An optimized analytical-numerical method for Kepler’s equation with near-machine precision and high computational speedup
- Research Article
2
- 10.1007/s42496-025-00269-1
- Jun 12, 2025
- Aerotecnica Missili & Spazio
- Giulio Malinverno + 1 more
Abstract This article reviews the current state of the art of quantum computing applied to computational fluid dynamics and discusses possible applications of quantum computing technologies in the area of thermofluid dynamics. Quantum computers are devices that use quantum mechanical phenomena such as superposition, entanglement and interference to perform calculations, with a theoretically high speed-up compared with traditional computing solutions, especially for computer science applications but also for engineering applications such as structural optimization, the solution of linear systems and the analysis of complex dynamics. Besides considering the hardware implementation of this type of device, this article proposes a simplified taxonomy of the technologies that can currently be envisioned for solving thermofluid dynamics problems, identifying three main approaches: the traditional algorithmic or circuit-based approach (using several algorithms dedicated to the solution of partial differential equations as well as algorithms dedicated to optimization and search), the analog approach (with the development of direct simulations given the analogy between quantum mechanical systems, like the Schrödinger flow or the Dirac Majorana formulation, and fluid problems, like the inviscid flow or the Lattice Boltzmann model), and applications based on machine learning techniques. The article discusses practical examples that highlight the flexibility of the methods as well as the intrinsic limitations that hinder their application to many industrial problems, i.e., the simplifications required to manage physical non-linearities or the absence of a general-purpose algorithm, and indicates, besides the intrinsic properties of each method, the technology readiness level (TRL) of this type of approach and the required level of modelling.
- Research Article
- 10.1109/tc.2025.3547139
- Jun 1, 2025
- IEEE Transactions on Computers
- Xuhang Wang + 6 more
In the realm of video understanding tasks, Video Transformer models (VidT) have recently exhibited impressive accuracy improvements in numerous edge devices. However, their deployment poses significant computational challenges for hardware. To address this, pruning has emerged as a promising approach to reduce computation and memory requirements by eliminating unimportant elements from the attention matrix. Unfortunately, existing pruning algorithms face a limitation in that they only optimize one of the two key modules on VidT's critical path: linear projection or self-attention. Because battery power in edge devices varies, the video resolution they generate also changes, so both the linear projection and the self-attention stages can become bottlenecks, and the existing approaches lack generality. Accordingly, we establish a Run-Through Sparse Attention (RTSA) framework that simultaneously sparsifies and accelerates both stages. On the algorithm side, unlike current methodologies that conduct sparse linear projection by exploring redundancy within each frame, we extract the extra redundancy that naturally exists between frames. Moreover, for sparse self-attention, existing pruning algorithms often provide sparsity patterns that are either too coarse-grained or too fine-grained, so they struggle to achieve high sparsity, low accuracy loss, and high speedup at the same time, resulting in either compromised accuracy or reduced efficiency. Thus, we prune the attention matrix at a medium granularity: the sub-vector. The sub-vectors are generated by isolating each column of the attention matrix. On the hardware side, we observe that the use of distinct computational units for sparse linear projection and self-attention results in pipeline imbalances because of the bottleneck transformation between the two stages.
To effectively eliminate pipeline stalls, we design an RTSA architecture that supports sequential execution of both sparse linear projection and self-attention. To achieve this, we devised an atomic vector-scalar product computation underpinning all calculations in sparse linear projection and self-attention, as well as a spatial array architecture with augmented processing elements (PEs) tailored for the vector-scalar product. Experiments on VidT models show that RTSA can save 2.71× to 5.32× ideal computation with <1% accuracy loss, achieving 105×, 56.8×, 3.59×, and 3.31× speedup compared to CPU, GPU, and the state-of-the-art ViT accelerators ViTCoD and HeatViT, respectively.
- Research Article
3
- 10.1093/sysbio/syaf043
- May 30, 2025
- Systematic Biology
- Anastasis Togkousidis + 2 more
Maximum likelihood–based phylogenetic inference constitutes a challenging optimization problem. Given a set of aligned input sequences, phylogenetic inference tools strive to determine the tree topology, the branch lengths, and the evolutionary model parameters that maximize the phylogenetic likelihood function. However, there exist compelling reasons to not push optimization to its limits, by means of early, yet adequate stopping criteria. Because input sequences are typically subject to stochastic and systematic noise, caution is warranted to prevent overoptimization and the risk of overfitting the model to noisy data. To address this, we integrate the Kishino–Hasegawa (KH) test into RAxML-NG as a reliable and fast-to-compute Early Stopping criterion to effectively limit excessive and compute-intensive overoptimization. Initially, we introduce a simplified heuristic tree search strategy in RAxML-NG (sRAxML-NG) as an underlying method for Early Stopping. Subsequently, we use the KH test in combination with sRAxML-NG to statistically assess the significance of differences between intermediate trees prior to and after major optimization steps. The tree search terminates early when improvements are statistically insignificant. We also propose an extension to the standard KH test that allows to correct for multiple testing, which maintains accuracy while achieving even higher speedups. For benchmarking, we use 300 large representative empirical data sets from TreeBASE. For 98% of the DNA data sets, all Early Stopping methods we introduce infer trees that are statistically equivalent to those inferred from RAxML-NG v1.2. For AA data sets, the fraction of data sets where sRAxML-NG, KH, and the KH-multiple testing versions infer statistically equivalent trees is 96%, 95%, and 92%, respectively. In conjunction with sRAxML-NG, the average speedup achieved by the KH-multiple testing version is 5× for DNA and 3.9× for protein data sets compared with RAxML-NG v1.2. 
We implemented our stopping criteria in RAxML-NG, which is available under GNU GPL at https://github.com/togkousa/raxml-ng/tree/stopping-criteria.
- Research Article
7
- 10.1145/3691636
- Nov 18, 2024
- ACM Transactions on Reconfigurable Technology and Systems
- Enhao Tang + 7 more
Field-programmable gate arrays (FPGAs) are an ideal candidate for accelerating graph neural networks (GNNs). However, the FPGA redeployment process is time-consuming when updating or switching between diverse GNN models across different applications. Existing GNN processors eliminate the need for FPGA redeployment when switching between different GNN models. However, adapting matrix multiplication types by switching processing units decreases hardware utilization. In addition, the bandwidth of DDR limits further improvements in hardware performance. This article proposes a highly flexible FPGA-based overlay processor for GNN accelerations. Graph-OPU provides excellent flexibility and programmability for users, as the executable code of GNN models is automatically compiled and reloaded without requiring FPGA redeployment. First, we customize the compiler and instruction sets for the inference process of different GNN models. Second, we customize the datapath and optimize the data format in the microarchitecture to fully leverage the advantages of high bandwidth memory (HBM). Third, we design a unified matrix multiplication to handle both sparse-dense matrix multiplication (SpMM) and general matrix multiplication (GEMM), enhancing Graph-OPU performance. During Graph-OPU execution, the computational units are shared between SpMM and GEMM instead of being switched, which improves the hardware utilization. Finally, we implement a hardware prototype on the Xilinx Alveo U50 and test the mainstream GNN models using various datasets. Experimental results show that Graph-OPU achieves up to 1,654× and 63× speedup, as well as up to 5,305× and 422× energy efficiency boosts, compared to implementations on CPU and GPU, respectively. Graph-OPU outperforms state-of-the-art (SOTA) end-to-end overlay accelerators for GNN, reducing latency by an average of 1.36× and improving energy efficiency by 1.41× on average.
Moreover, Graph-OPU exhibits an average 1.45× speed improvement in end-to-end latency over the SOTA GNN processor. Graph-OPU represents an in-depth study of an FPGA-based overlay processor for GNNs, offering high flexibility, speedup, and energy efficiency.
- Research Article
19
- 10.1016/j.watres.2024.122396
- Sep 11, 2024
- Water Research
- Alexander Garzón + 3 more
Storm water systems (SWSs) are essential infrastructure providing multiple services including environmental protection and flood prevention. Typically, utility companies rely on computer simulators to properly design, operate, and manage SWSs. However, multiple applications in SWSs are highly time-consuming. Researchers have resorted to cheaper-to-run models, i.e., metamodels, as alternatives to computationally expensive models. With the recent surge in artificial intelligence applications, machine learning has become a key approach for metamodelling urban water networks. Specifically, deep learning methods, such as feed-forward neural networks, have gained importance in this context. However, these methods require generating a sufficiently large database of examples and training their internal parameters. Both processes defeat the purpose of using a metamodel, i.e., saving time. To overcome this issue, this research focuses on the application of inductive biases and transfer learning for creating SWS metamodels which require less data and retain high performance when used elsewhere. In particular, this study proposes an auto-regressive graph neural network metamodel of the Storm Water Management Model (SWMM) from the Environmental Protection Agency (EPA) for estimating hydraulic heads. The results indicate that the proposed metamodel requires a smaller number of examples to reach high accuracy and speed-up, in comparison to fully connected neural networks. Furthermore, the metamodel shows transferability as it can be used to predict hydraulic heads with high accuracy on unseen parts of the network. This work presents a novel approach that benefits both urban drainage practitioners and water network modeling researchers. The proposed metamodel can help practitioners with the planning, operation, and maintenance of their systems by offering an efficient metamodel of SWMM for computationally intensive tasks like optimization and Monte Carlo analyses.
Researchers can leverage the current metamodel’s structure for developing new surrogate model architectures tailored to their specific needs or start paving the way for more general foundation metamodels of urban drainage systems.
- Research Article
- 10.21015/vtm.v12i1.1847
- Jun 30, 2024
- VFAST Transactions on Mathematics
- Shakeel Ahmed Kamboh + 5 more
The ideas of parallelism for large-scale problems or problems with dense meshes have gained much attention in the last few decades. The key goal of applying parallelization is to reduce the computational time. In this paper, 2D finite difference mesh partitioning schemes and their effect on the performance of a parallel numerical solution are evaluated. The main objective was to investigate which mesh partitioning schemes give less computational time and high speedup. For testing and implementation purposes, a 2D electrostatics Poisson’s equation with Dirichlet and Neumann boundary conditions, applied on a 2D cross section of an electrohydrodynamic (EHD) planar ion-drag micropump, is used to simulate the electric potential and electric field on a parallel system. The performance of seven different mesh partitioning schemes (PS) in terms of computational time, speedup, efficiency and communication cost was evaluated. Among the seven partitioning schemes, PS-3 (two-way or tile partitioning) was found to be the best scheme for the parallel numerical simulation of the problem. Moreover, the parallel algorithm remains more efficient on \(P=2\) to \(P=8\) workers, while for \(P>8\) the efficiency of the algorithm may drop because of the high communication time.
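The speedup and efficiency metrics named in this abstract are usually defined as S(P) = T_1 / T_P and E(P) = S(P) / P, where T_P is the wall-clock time on P workers. The following is a minimal sketch of those standard definitions; the function names and the timing values are hypothetical and are not taken from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Speedup S(P) = T_1 / T_P."""
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Parallel efficiency E(P) = S(P) / P."""
    return speedup(t_serial, t_parallel) / workers


# Hypothetical wall-clock times in seconds, keyed by worker count P.
# These numbers are illustrative only; they are not results from the paper.
timings = {1: 120.0, 2: 64.0, 4: 36.0, 8: 24.0, 16: 20.0}

for p, t in timings.items():
    s = speedup(timings[1], t)
    e = efficiency(timings[1], t, p)
    print(f"P={p:2d}  speedup={s:.2f}  efficiency={e:.2f}")
```

With timings shaped like these, speedup keeps growing with P while efficiency falls, which is the kind of behavior the abstract attributes to communication time beyond eight workers.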
- Research Article
4
- 10.1016/j.jfluidstructs.2024.104156
- Jun 20, 2024
- Journal of Fluids and Structures
- Azzeddine Tiba + 4 more
Non-intrusive reduced order models for partitioned fluid–structure interactions
- Research Article
2
- 10.1016/j.asr.2024.04.056
- May 3, 2024
- Advances in Space Research
- Carlos Rubio + 3 more
Efficient computation of the geopotential gradient is essential for numerical propagators, particularly in scenarios involving low Earth orbits. Conventional geopotential calculations are based on spherical harmonics series, which become computationally demanding as the degree/order increases. This computational burden can be mitigated by means of parallelized algorithms. Additionally, certain situations lend themselves to high parallelization, such as the propagation of space debris catalogs, satellite mega-constellations, or the dispersion of particles resulting from a space collision event. This paper introduces an optimized Graphics Processing Unit (GPU) implementation designed to facilitate extensive parallelization in the geopotential gradient calculation. The formulation developed in this study is not specific to any GPU. However, to illustrate the low-level optimizations necessary for an efficient implementation, we selected the Compute Unified Device Architecture (CUDA) as the dominant and de facto standard in parallel computing. Nevertheless, most of the concepts and optimizations presented in this paper are also valid for other GPU architectures. Built upon the spherical harmonic expansion using the Cunningham formulation, which is well-suited for GPU computations, our implementation offers several variants with different tradeoffs between speed and accuracy. Besides GPU double precision, we introduced a mixed precision arithmetic –a hybrid between single and double precision– that exploits GPU capabilities with a low penalty in accuracy. The proposed algorithm was implemented as a software reusable module, and its performance was evaluated against GMAT, GODOT, and Orekit astrodynamic codes. The algorithm’s accuracy in double precision is comparable to such codes. The mixed precision version showed enough accuracy for LEO satellite propagation, with around 1 m difference in four days. 
Testing across different CUDA architectures revealed very high speed-up factors compared to a single CPU, reaching a speed-up of 645 for the mixed precision variant and 450 for the double precision one in the propagation of about 3200 objects with a geopotential of degree/order 126 × 126 using an A100 GPU device.
- Research Article
3
- 10.1109/tc.2024.3365937
- May 1, 2024
- IEEE Transactions on Computers
- Chao Chen + 2 more
Recently, experiment-driven machine-learning (ML) based configuration tuning for in-memory data analytics frameworks such as Apache Spark has become popular because it can achieve high speedups. However, experiment-driven ML-based approaches naturally need a large number of iterations, and each iteration generates a configuration with a probabilistic strategy and executes the program on a real cluster with that configuration. It therefore takes a long time to optimize the performance of an in-memory data analytics program, which hinders these approaches from being widely used in practice. To address this issue, we propose a novel and simple approach dubbed Terminating-It-Early (TIE) to reduce the time needed to perform the experiment executions while achieving speedups similar to those obtained by experiment-driven ML-based approaches. The key idea is that, during the search for the optimal configuration, i.e., the one that produces the shortest execution time for a program, we terminate an experiment execution with a trial configuration as soon as we find that its execution time is longer than a predefined threshold (e.g., the shortest execution time thus far). In contrast, traditional experiment-driven ML-based approaches always run all experiment executions to completion. We employ 19 Apache Spark programs running on a physical cluster as well as a virtual cluster to evaluate TIE. We compare the tuning time used to find the optimal configuration of a program and the optimized execution time of a program obtained by TIE against those obtained by CherryPick and a reinforcement learning (RL) based approach.
The experimental results show that on physical machines, TIE reduces the tuning time used by CherryPick and the RL-based approach by factors of 2.39× and 1.68× on average, respectively. On virtual machines, the corresponding factors are 2.79× and 1.71×. Moreover, the average optimized execution time of the 19 programs tuned by TIE is slightly shorter than those tuned by CherryPick and the RL-based approach.
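The early-termination idea described in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `tune_with_early_termination` is hypothetical, the trial configurations are visited in a fixed order rather than proposed by an ML strategy, the threshold is simply the shortest execution time seen so far, and the runtimes are made-up numbers rather than measurements.

```python
from typing import Dict, Optional, Tuple


def tune_with_early_termination(
    true_runtimes: Dict[str, float],
) -> Tuple[Optional[str], float, float]:
    """Simulate tuning where each trial run is stopped at the best time so far.

    Returns (best configuration, its runtime, total time spent on trials).
    """
    best_cfg: Optional[str] = None
    best_time = float("inf")
    spent = 0.0
    for cfg, runtime in true_runtimes.items():
        # A trial is killed once it has run as long as the current best,
        # so it never costs more than best_time.
        elapsed = min(runtime, best_time)
        spent += elapsed
        if runtime < best_time:
            best_cfg, best_time = cfg, runtime
    return best_cfg, best_time, spent


# Hypothetical runtimes in seconds for four trial configurations.
runtimes = {"a": 50.0, "b": 80.0, "c": 30.0, "d": 45.0}
best, best_t, spent = tune_with_early_termination(runtimes)
full = sum(runtimes.values())
print(best, best_t, spent, full)
```

In this toy run the same best configuration is found, but the trials for "b" and "d" are cut short, so the simulated tuning time is 160 s instead of the 205 s needed to run every trial to completion.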
- Research Article
7
- 10.1145/3632743
- Mar 29, 2024
- ACM Transactions on Software Engineering and Methodology
- Wensheng Tang + 6 more
Value-flow analysis is a fundamental technique in program analysis, benefiting various clients, such as memory corruption detection and taint analysis. However, existing efforts suffer from a low potential speedup, which leads to a deficiency in scalability. In this work, we present a parallel algorithm, Octopus, to collect path conditions for realizable paths efficiently. Octopus builds on the realizability decomposition to collect the intraprocedural path conditions of different functions simultaneously on demand and to obtain realizable path conditions by concatenation, which achieves a high potential speedup in parallelization. We implement Octopus as a tool and evaluate it over 15 real-world programs. The experiment shows that Octopus significantly outperforms the state-of-the-art algorithms. In particular, it detects NULL-pointer-dereference bugs for the project llvm with 6.3 MLoC within 6.9 minutes under the 40-thread setting. We also state and prove several theorems to demonstrate the soundness, completeness, and high potential speedup of Octopus. Our empirical and theoretical results demonstrate the great potential of Octopus in supporting various program analysis clients. The implementation has been officially deployed at Ant Group, scaling the nightly code scan for massive FinTech applications.
- Research Article
7
- 10.1145/3639305
- Mar 12, 2024
- Proceedings of the ACM on Management of Data
- Matteo Brucato + 4 more
Modern database systems offer index-tuning advisors that automatically identify a set of indexes to improve workload performance. Advisors leverage the optimizer's what-if API to optimize a query for a hypothetical index configuration. Because what-if calls constitute a major bottleneck of index tuning, existing techniques, such as workload compression, help reduce the number of what-if calls to speed up tuning. Unfortunately, even with small workloads and few what-if calls, tuning can still take hours due to the complexity of the queries (e.g., the number of joins, filters, group-by and order-by clauses), which increases their optimization time. This paper introduces workload reduction, a new complementary technique aimed at expediting index tuning by decreasing individual what-if call time without significantly affecting the quality of index tuning. We present an efficient workload reduction algorithm, called Wred, which rewrites each query in the original workload to eliminate column and table expressions unlikely to benefit from indexes, thereby accelerating what-if calls. We study its complexity and ability to maintain high index quality. We perform an extensive evaluation over industry benchmarks and real-world customer workloads, which shows that Wred results in a 3x median speedup in tuning efficiency over an industrial-strength state-of-the-art index advisor, with only a 3.7% median loss in improvement---where improvement is the total workload cost as estimated by the query optimizer---and results in up to 24.7x speedup with 1.8% improvement loss. Furthermore, combining Wred and Isum (a state-of-the-art workload compression technique for index tuning) results in higher speedups than either of the two techniques alone, with 10.5x median speedup and 5% median improvement loss.
- Research Article
- 10.1088/1742-6596/2648/1/012036
- Dec 1, 2023
- Journal of Physics: Conference Series
- A El Hokayem + 3 more
The increase of the energy efficiency of the building stock is a national priority, especially in consideration of the recent energy crisis and of the 2050 decarbonisation goals of the European Union. Public buildings, in particular, are expected to lead the way and give examples of best practices and solutions for energy savings, some of them related to optimized controls of the building systems. In this context, however, many public administrations lack detailed technical competences, time, and resources to define and implement the best energy efficiency measures and controls for public buildings. To address these needs, fast and accurate simplified models for buildings energy simulation appear to be a promising solution. In this context, this research aims at evaluating the applicability and reliability of an approach previously proposed by the authors, i.e., the shoeboxing algorithm, for the study of operation control strategies. The analysis has been conducted on a public kindergarten, in Bolzano, Italy, comparing simplified and detailed building energy models. Results have shown a fairly good level of accuracy of the algorithm and consistency of energy savings, with a remarkably high simulation speedup.
- Research Article
5
- 10.1109/tpds.2023.3322037
- Dec 1, 2023
- IEEE Transactions on Parallel and Distributed Systems
- Vasilios Kelefouras + 1 more
In this article, a new method is provided for accelerating the execution of convolution layers in Deep Neural Networks. This research work provides the theoretical background to efficiently design and implement the convolution layers on x86/x64 CPUs, based on the target layer parameters, quantization level and hardware architecture. The proposed work is general and can be applied to other processor families too, e.g., Arm. The proposed work achieves high speedup values over the state of the art, which is the Intel oneDNN library, by applying compiler optimizations, such as vectorization, register blocking and loop tiling, in a more efficient way. This is achieved by developing an analytical modelling approach for finding the optimization parameters. A thorough experimental evaluation has been applied on two Intel CPU platforms, for DenseNet-121, ResNet-50 and SqueezeNet (including 112 different convolution layers), and for both FP32 and int8 input/output tensors (quantization). The experimental results show that the convolution layers of the aforementioned models are executed from 1.1× up to 7.2× faster.
- Research Article
13
- 10.1016/j.cma.2023.116467
- Oct 10, 2023
- Computer Methods in Applied Mechanics and Engineering
- Theron Guo + 2 more
In recent years, there has been a growing interest in understanding complex microstructures and their effect on macroscopic properties. In general, it is difficult to derive an effective constitutive law for such microstructures with reasonable accuracy and meaningful parameters. One numerical approach to bridge the scales is computational homogenization, in which a microscopic problem is solved at every macroscopic point, essentially replacing the effective constitutive model. Such approaches are, however, computationally expensive and typically infeasible in multi-query contexts such as optimization and material design. To render these analyses tractable, surrogate models that can accurately approximate and accelerate the microscopic problem over a large design space of shapes, material and loading parameters are required. In this work, we develop a reduced order model based on Proper Orthogonal Decomposition (POD), Empirical Cubature Method (ECM) and a geometrical transformation method with the following key features: (i) large shape variations of the microstructure are captured, (ii) only relatively small amounts of training data are necessary, and (iii) highly non-linear history-dependent behaviors are treated. The proposed framework is tested and examined in two numerical examples, involving two scales and large geometrical variations. In both cases, high speed-ups and accuracies are achieved while observing good extrapolation behavior.
- Research Article
1
- 10.1080/10618562.2024.2347335
- Sep 14, 2023
- International Journal of Computational Fluid Dynamics
- Haitao Dong + 2 more
In this paper, we construct a large time step wave-adding scheme (LTS-WA) for unsteady Euler equations in compressible flows with a much simpler strategy in the wave-adding procedure, named LTS-WAS. The new method does not need to calculate coefficients of contribution and also add the waves according to their types and families, which makes the scheme easy for coding and optimisation. The new scheme has a high speedup ratio, and its formulation is the same in scalar and system cases. The scheme is extended to 3D compressible flows using the dimension split method. Numerical experiments show that the new scheme maintains the advantages of the original LTS-WA scheme at large CFL numbers, and has a higher efficiency.
- Research Article
31
- 10.1109/tpami.2023.3268415
- Sep 1, 2023
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Sidharth Maheshwari + 6 more
Inference at the edge using embedded machine learning models is associated with challenging trade-offs between resource metrics, such as energy and memory footprint, and performance metrics, such as computation time and accuracy. In this work, we go beyond the conventional Neural Network based approaches to explore the Tsetlin Machine (TM), an emerging machine learning algorithm that uses learning automata to create propositional logic for classification. We use algorithm-hardware co-design to propose a novel methodology for training and inference of TM. The methodology, called REDRESS, comprises independent TM training and inference techniques to reduce the memory footprint of the resulting automata to target low and ultra-low power applications. The array of Tsetlin Automata (TA) holds learned information in binary form as bits 0 and 1, called excludes and includes, respectively. REDRESS proposes a lossless TA compression method, called include-encoding, that stores only the information associated with includes to achieve over 99% compression. This is enabled by a novel, computationally minimal training procedure, called Tsetlin Automata Re-profiling, which improves the accuracy and increases the sparsity of TA to reduce the number of includes and, hence, the memory footprint. Finally, REDRESS includes an inherently bit-parallel inference algorithm that operates on the optimally trained TA in the compressed domain, without requiring decompression during runtime, to obtain high speedups when compared with the state-of-the-art Binary Neural Network (BNN) models. In this work, we demonstrate that, using the REDRESS approach, TM outperforms BNN models on all design metrics for five benchmark datasets, viz. MNIST, CIFAR2, KWS6, Fashion-MNIST and Kuzushiji-MNIST. When implemented on an STM32F746G-DISCO microcontroller, REDRESS obtained speedups and energy savings ranging from 5× to 5700× compared with different BNN models.
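The idea of storing only the includes and evaluating directly on that compressed form can be sketched as follows. This is a generic sparse-index illustration, not the actual REDRESS include-encoding format, which the abstract does not specify; the function names and the example bit vectors are hypothetical. A clause is treated here as a conjunction of its included literals.

```python
from typing import List


def encode_includes(ta_bits: List[int]) -> List[int]:
    """Keep only the positions of include bits (1s); excludes (0s) are implicit."""
    return [i for i, bit in enumerate(ta_bits) if bit == 1]


def decode_includes(includes: List[int], length: int) -> List[int]:
    """Rebuild the full bit vector, showing that the encoding is lossless."""
    bits = [0] * length
    for i in includes:
        bits[i] = 1
    return bits


def clause_output(includes: List[int], literals: List[int]) -> int:
    """Evaluate a conjunctive clause directly on the compressed form.

    Only the included literals are inspected, so no decompression is needed.
    Note: an empty include list yields 1 here; real implementations may
    treat empty clauses differently.
    """
    return int(all(literals[i] == 1 for i in includes))


# Hypothetical automata states for one clause over eight literals.
ta_bits = [0, 0, 1, 0, 0, 0, 1, 0]
includes = encode_includes(ta_bits)
print(includes, clause_output(includes, [1, 0, 1, 1, 0, 0, 1, 0]))
```

The sparser the include vector, the shorter the index list, which is why a training step that reduces the number of includes also reduces the memory footprint of this kind of encoding.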
- Research Article
4
- 10.1016/j.powtec.2023.118811
- Jul 16, 2023
- Powder Technology
- Jizhou Liu + 1 more
An efficient three-dimensional numerical simulation of particle acoustic agglomeration with fine-grained parallelization on graphical processing unit
- Research Article
5
- 10.1093/nar/gkad354
- May 9, 2023
- Nucleic Acids Research
- Michal Wlasnowolski + 3 more
In the current update, we added a feature for analysing changes in spatial distances between promoters and enhancers in chromatin 3D model ensembles. We updated our datasets by the novel in situ CTCF and RNAPII ChIA-PET chromatin loops obtained from the GM12878 cell line mapped to the GRCh38 genome assembly and extended the 1000 Genomes SVs dataset. To handle the new datasets, we applied GPU acceleration for the modelling engine, which gives a speed-up of 30× versus the previous versions. To improve visualisation and data analysis, we embedded the IGV tool for viewing ChIA-PET arcs with additional genes and SVs annotations. For 3D model visualisation, we added a new viewer: NGL, where we provided colouring by gene and enhancer location. The models are downloadable in mmcif and xyz format. The web server is hosted and performs calculations on DGX A100 GPU servers that provide optimal performance with multitasking. 3D-GNOME 3.0 web server provides unique insights into the topological mechanism of human variations at the population scale with high speed-up and is freely available at https://3dgnome.mini.pw.edu.pl/.
- Research Article
17
- 10.3390/app13095821
- May 8, 2023
- Applied Sciences
- Wenbo Zhang + 3 more
Federated learning is currently a popular distributed machine learning solution that often experiences cumbersome communication processes and challenging model convergence in practical edge deployments due to the training nature of its model information interactions. The paper proposes a hierarchical federated learning algorithm called FedDyn to address these challenges. FedDyn uses dynamic weighting to limit the negative effects of local model parameters with high dispersion and to speed up convergence. Additionally, an efficient aggregation-based hierarchical federated learning algorithm is proposed to improve training efficiency. The waiting time is set at the edge layer, enabling edge aggregation within a specified time, while the central server waits for the arrival of all edge aggregation models before integrating them. Dynamic grouping weighted aggregation is implemented during aggregation based on the average obsolescence of local models in various batches. The proposed algorithm is tested on the MNIST and CIFAR-10 datasets and compared with the FedAVG algorithm. The results show that FedDyn can reduce the negative effects of non-independent and identically distributed (non-IID) data on the model and shorten the total training time by 30% under the same accuracy rate compared to FedAVG.