The Need for Speed and Energy Efficiency in Genome Analysis

Sachin Rawat (rawatsachin27@gmail.com), Freelance Science Writer & Journalist, Bangalore, India
GEN Biotechnology, Vol. 2, No. 3, News Features. Published online: 19 Jun 2023. https://doi.org/10.1089/genbio.2023.29100.sra

Abstract

Smarter algorithms and accelerated hardware are eliminating bottlenecks throughout genome analysis pipelines, improving both speed and energy efficiency.

The Summit supercomputer at Oak Ridge National Lab.

In the early 2010s, as a biomedical engineering undergraduate with an interest in computer architecture, Giulia Guidi noticed that many genomic analyses—identifying new variants or screening for biomarker activity—were not feasible without a lot of computing power. Even back then, this was already a massive shift from the early days of genomics. Today, researchers produce orders of magnitude more biomedical and other genomic data. Guidi, now a computer scientist at Cornell University, is developing tools to keep pace with genome data generation.

When producing genomic data was expensive, laboratories generated little of it and could analyze it on their laptops. As genome sequencing technologies became faster and more affordable, the volume of genomic data has exploded to the point that analysis is now the major bottleneck. "How fast a researcher can process genomic data poses an implicit limit to the insights they can gain from it," Guidi told GEN Biotechnology.

Giulia Guidi, Assistant Professor of Computer Science at Cornell University.

And it seems the genomics community as a whole is gaining insufficient insights from its data. Modern DNA sequencers spit out nucleotides at a blazing rate—up to billions of bases per hour.1 Analyzing these sequences, by comparison, moves at a snail's pace; read mapping, for example, is slower by three to four orders of magnitude. This simple math underlies the growing divide between the genomes accessible to researchers and the insights they can derive from them.

"Tertiary analysis—where we combine sequencing and other types of data to assign interpretation and gain new knowledge—can be a bottleneck on both human and compute capacity," said Chris Dwan, former chief information officer at the New York Genome Center and a technical consultant to multiple biotechnology companies.

Emerging developments in energy-efficient algorithms and hardware suggest they could be key to bridging this gap. Much of the progress in sequencing technologies was due to the codevelopment of techniques and algorithms for faster sequencing. It is time to replicate similar collaborations to tackle bottlenecks in genome analysis.

Clever Algorithms Eliminate Wasteful Computation

Biologists perform genome analysis for applications ranging from identifying rare diseases and predicting how drugs interact with different cell types to unlocking the mechanisms needed to design climate-resilient crops. Regardless of the goal, all genome analysis pipelines start with read mapping, which is also usually the most computationally expensive step.

Sequencers spit out billions of short reads, with multiple copies of each read to ensure sufficient coverage.
Mapping all of these reads to a reference genome is computationally slow, as each read must be screened across the length of the genome. Incomplete reference genomes and repetitive sequences further increase the complexity of the task.

Sequencing technologies that produce longer reads, such as PacBio and Oxford Nanopore, make mapping easier because DNA fragment lengths exceed most repetitive regions. However, long-read sequencing has traditionally produced more errors than short-read platforms.

For researchers running analyses on publicly available data sets, read mapping is also complicated by the diversity of sequencers and sequencing protocols. "The metadata for those studies is not always complete," said Serghei Mangul, assistant professor of clinical pharmacy and computational biology at the University of Southern California. "If your goal is to predict the mortality in the patient for a particular disease but datasets for that disease have no metadata, the raw genomic data becomes essentially useless."

Read mapping generally involves three steps: the reads are seeded to the genome by comparing exact matches of shorter sequences, poor-quality matches are filtered out, and the reads are finally aligned to the genome. Computer scientists and bioinformaticians are developing better algorithms to reduce the computation required in each step.

One such shortcut for the alignment step is optimizing the X-drop sequence alignment heuristic, developed by Zheng Zhang nearly two decades ago,2 for use with modern graphics processors.3 For every read, the algorithm maintains a dynamic condition based on the score of the current best alignment. If the score of an alignment drops too far below the best score, the alignment for that read is abandoned, saving a great deal of computation.
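In code, the heuristic is compact. The sketch below is a deliberately simplified, gapless version written for illustration only; the scoring values and the xdrop_extend function name are assumptions, not part of the GPU implementation described in the cited work.

```python
# Simplified, gapless sketch of X-drop seed extension (illustration only;
# the GPU-optimized implementation cited above is far more sophisticated).

MATCH, MISMATCH = 2, -3   # assumed scoring scheme for illustration

def xdrop_extend(read: str, ref_window: str, x_drop: int = 20) -> int:
    """Extend a seed base by base, abandoning the read once the running
    score falls more than x_drop below the best score seen so far."""
    score = best = 0
    for read_base, ref_base in zip(read, ref_window):
        score += MATCH if read_base == ref_base else MISMATCH
        if score > best:
            best = score
        elif best - score > x_drop:
            break   # alignment is decaying; stop spending work on it
    return best

# A read that diverges from the reference is dropped after a few bases.
print(xdrop_extend("ACGTACGTTTTTTTTT", "ACGTACGTACGATCGA", x_drop=6))
```

Because most candidate alignments decay quickly, that early exit is where the bulk of the savings comes from.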
The next step in genomic analysis pipelines is often variant calling, which identifies the differences between the sample genomes and the reference genome. Long reads span the length of structural variants, such as inversions or genomic imbalances, and are better at detecting them than shorter reads. Conversely, long reads fare poorly in detecting single-nucleotide changes owing to their higher error rates.

Serghei Mangul, Assistant Professor of Clinical Pharmacy and Quantitative and Computational Biology at the University of Southern California.

Mark Oldakowski, COO of Bionano Genomics.

The choice of a variant-calling algorithm often depends on the sequencing technology used to generate the reads. For example, a landmark study from Ashley and colleagues at Stanford published last year showed4 that parallel computing over the cloud on data obtained by nanopore sequencing reduced the runtime of a clinical whole genome sequencing (WGS) pipeline to under 8 hours. Every step in the pipeline was optimized for the raw file format produced by nanopore sequencing.

It is easy to imagine why efficient variant calling is particularly critical for clinical applications where a rapid diagnosis matters. But sometimes depth matters, too. "Emerging applications like CAR-T cell therapies require an even higher level of depth. They're looking for rare events in those applications, which would require even more processing time," Mark Oldakowski, chief operating officer at Bionano Genomics, told GEN Biotechnology.

San Diego-based Bionano Genomics has developed an optical genome mapping technology that identifies structural variants in long reads. Its software enables quick identification of rare variants, such as for cancer diagnosis.

Legacy tech companies are also building solutions for accelerated variant calling, and for genome analysis more generally. For instance, NVIDIA offers Parabricks,5 a software suite that brings together popular alignment and variant-calling tools within a highly efficient computational pipeline. Parabricks relies on data-parallel computing and powered the clinical WGS pipeline mentioned earlier. Parallelization is more energy efficient, and recent research has pinned the reason down to thermodynamics.6

Another approach to efficient genome mapping is prealignment filtering. A major reason why pairwise sequence alignment is so inefficient is that most candidate alignments are highly dissimilar and are discarded immediately. Prealignment filtering tackles this wasted computation: cheap calculations estimate the similarity of a read and a candidate location, and the more costly alignment proceeds only if the estimate clears a threshold.
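What such a filter looks like depends on the design, but the pattern is easy to sketch. The snippet below uses a simple k-mer overlap estimate, chosen purely for illustration rather than taken from any specific published filter, to decide which read-candidate pairs are worth a full alignment.

```python
# Sketch of prealignment filtering: a cheap k-mer overlap estimate decides
# whether a read/candidate pair is worth the expensive alignment step.
# Illustration only; real filters (SIMD, FPGA, or in-storage designs) differ.

def kmers(seq: str, k: int = 8) -> set:
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def passes_filter(read: str, candidate: str, k: int = 8,
                  min_overlap: float = 0.5) -> bool:
    """Estimate similarity as the fraction of the read's k-mers found in
    the candidate region; skip alignment when the estimate is too low."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False
    shared = len(read_kmers & kmers(candidate, k))
    return shared / len(read_kmers) >= min_overlap

reads = ["ACGTACGTACGTACGT", "TTTTTTTTTTTTTTTT"]
candidate_region = "ACGTACGTACGTACGTAA"
to_align = [r for r in reads if passes_filter(r, candidate_region)]
print(to_align)   # only the similar read reaches the costly alignment step
```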
Accelerated Hardware

Hardware acceleration is the use of specialized hardware (Fig. 1) rather than general-purpose computing units such as CPUs. Graphics processing units (GPUs) are optimized for graphics-heavy tasks such as gaming or video editing, but computer scientists and chemists discovered that GPUs are also great for machine learning and molecular simulations, respectively. In recent years, computational biologists have likewise started exploring hardware acceleration7 with GPUs and other specialized hardware for genome analysis.

FIG. 1. A processing-in-memory system with 2,560 data processing units.

Beyond the performance gains from faster devices, hardware acceleration also addresses a major source of inefficiency in genome analysis: data movement. Moving large raw sequence files across computing and storage units, between devices, or between a device and the cloud or an on-premise server slows computations.

"That's one bottleneck we're tackling—to get rid of the data movement," said Onur Mutlu, professor of computer science at ETH Zürich. "As much as possible, we are processing data locally, where it resides, and this improves both performance and energy efficiency."

To this end, Mutlu and his colleagues are developing a range of solutions that leverage hardware acceleration for genome analysis. For example, GenStore8 performs computation within the storage space of a solid-state drive (SSD). It exploits the internal parallelism of an SSD to filter out reads that do not need to be aligned. This boosts energy efficiency by eliminating both irrelevant alignments and the need to move data between the computing unit and the SSD.

GPUs, with their thousands of specialized cores, are even better at parallelization, which is why companies such as Bionano Genomics are using them to accelerate genomic pipelines. Speaking on the company's collaboration with NVIDIA, Oldakowski said that "the goal is to provide a bench-top tower that can accelerate the computing by 8–10 times over independently four node servers that go in the rack." Although this directly leads to energy savings, further efficiency gains come from codesigning software that uses those GPU cores most efficiently.

For applications that need to move even larger data sets even faster, field-programmable gate arrays (FPGAs) offer an alternative. These systems are designed for lower latency than GPUs and are so named because they can be programmed in the field or, in simpler terms, configured for specific applications by the user.

Mutlu's team demonstrated that FPGAs can be repurposed to move a significant chunk of genome analysis computation away from the CPU.9 "We have a configurable logic that has high bandwidth/low latency access to this memory architecture. And we offload some genome analysis computations, in this case, pre-alignment filtering, eliminating useless computation as much as possible," Mutlu explained.

For clinical genomics, Dwan added that technologies such as DRAGEN hardware accelerator boards from Illumina will be a game changer. "They give the ability to perform primary and secondary analysis on the instrument—so that the user never even has to worry about reads and alignments. They just get variants."

"Of course, cloud technologies are not news in 2023. The story of centralizing genomic data on the cloud and accelerating analysis by running computations in parallel is more than a decade old at this point," Dwan added.

Uniting Software and Hardware

Many researchers working on energy-efficient genome analysis dabble in both software and hardware acceleration—one can only go so far relying on just one of them. Codesign of faster algorithms and accelerated hardware is critical to narrowing the gap between genome data and insights.

Mutlu said that "if you just use fast algorithms, you may still be moving a lot of data from the memory hierarchy. You may be improving performance a lot, but the energy efficiency may not improve that much."

Conversely, "hardware acceleration without optimizing software is worse, in my opinion, in the sense that you're really accelerating some unoptimized software." Mutlu added that it is important to do the best one can algorithmically before turning to hardware acceleration.

For example, GenStore's performance is driven in large part by its algorithm's ability to quickly identify reads that match the reference genome exactly as well as reads that will not align at all. Across different SSD configurations, the researchers observed energy reductions of nearly 4 times and 30 times for these two filters, respectively. Similarly, the team observed energy reductions of one to two orders of magnitude with FPGAs compared with GPUs, even for intelligent algorithms already optimized for genome analysis.

Big memory computing offers another way to move computation away from storage systems such as hard drives or the cloud. MemVerge, a California-based cloud automation company, enables big memory computing for data-intensive applications, including genomics. Charles Fan, cofounder and CEO of MemVerge, says the technology also protects pipelines from failures. "For any kind of pipeline, you can take snapshots after each stage and allow your application to roll back anytime you want to those running points."
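A minimal sketch of that checkpoint-and-restart pattern, using ordinary on-disk snapshots and made-up stage functions rather than MemVerge's in-memory snapshot machinery, might look like this:

```python
# Stage-level checkpointing for a multi-step pipeline (generic illustration;
# MemVerge snapshots application memory rather than pickled files).
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def run_stage(name, stage_fn, data):
    """Run one stage, snapshot its output, and reuse the snapshot if the
    pipeline is restarted after a failure at a later stage."""
    snapshot = CHECKPOINT_DIR / f"{name}.pkl"
    if snapshot.exists():                        # roll back to this point
        return pickle.loads(snapshot.read_bytes())
    result = stage_fn(data)
    snapshot.write_bytes(pickle.dumps(result))   # snapshot after the stage
    return result

# Hypothetical stages standing in for mapping, variant calling, and so on.
def map_reads(reads):
    return [r.upper() for r in reads]

def call_variants(mapped):
    return [m for m in mapped if "T" in m]

data = run_stage("map_reads", map_reads, ["acgt", "gggg"])
data = run_stage("call_variants", call_variants, data)
print(data)
```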
Big memory computing of this kind provides high-volume, low-cost analyses to genomics laboratories that would otherwise find them cost prohibitive. Fan believes that software–hardware codesign, such as NVIDIA GPUs designed for the task, has the potential to democratize genome analysis.

Presently, large-scale genome analysis is primarily limited to well-funded laboratories and institutions. "We believe that smaller labs can also benefit from genomic technologies. Having a solution that is all-inclusive on their bench not only helps with kind of the whole power dynamics, but it also helps them to get to their research faster," Fan pointed out.

Charles Fan, CEO of MemVerge.

Energy Efficiency Across the Pipeline

Genomic technologies are advancing sustainability in many ways, from alternatives to petroleum-derived materials to reduced methane emissions. It is time for genomics researchers to look inward as well. Large data sets, computationally expensive tools, and the need for high coverage all contribute to the field's mammoth carbon footprint. For example, the de novo assembly of a single human genome produces emissions equivalent to traveling 85 kilometers by car.10

When pipelines are faster, the devices they run on are powered on for shorter periods, directly saving energy. In turn, energy-efficient algorithms and hardware reduce unnecessary computation, improving both performance and speed.

But speed and energy efficiency do not always go together. Parallelization, for example, is faster, but the energy savings from shorter runtimes may not always outweigh the higher cost of running multiple servers or instances. With codesign, however, such trade-offs between energy efficiency and performance can be eliminated. This is also why, even though researchers who use genomic analyses may prioritize either speed or computational cost, those who develop these tools need to actively consider energy efficiency.

Moreover, data-intensive applications are not limited to genomics or biology. Other fields, such as deep learning and astronomy, face similar computational constraints and develop creative solutions to tackle them. A cross-disciplinary attitude and quick adoption of advances made elsewhere will be critical to further improving genome analysis pipelines.

Just like focusing on only software or only hardware, optimizing only one aspect of genome analysis is myopic. In a review published in the Computational and Structural Biotechnology Journal, the authors invoke Amdahl's law as an argument for accelerating all steps of the genome analysis pipeline.1 Amdahl's law posits that optimizing one part of a system yields an overall gain limited by the fraction of time that part is in use: if read mapping accounts for, say, 60% of a pipeline's runtime, even an infinitely fast mapper can cut the total runtime by no more than 60%. For genome analysis, optimizing a single step would simply shift the bottleneck to other parts of the pipeline with little improvement in, say, the total time for a WGS-based diagnosis.

As genome analysis pipelines keep improving, they may eventually hit an upper efficiency limit imposed by the nature of genome sequences, the physics of the devices, or the complexity of the algorithms. For now, it is unclear what that limit is or which factor will prove limiting.

As the history of bioinformatics, and of computing more generally, shows, there is always a way to do things faster. What, then, could be a goal worth chasing for energy-efficient genome analysis? Maybe one day end-to-end genome analysis will again be possible on just a laptop. Meanwhile, better algorithms and accelerated hardware will continue to reveal more about the wealth of biological data that we have already generated and will generate in the future.
