many-core.group

BarraCUDA - an ultra fast short read sequence alignment software using GPUs

Brian Lam

Introduction

With the maturation of next-generation DNA sequencing (NGS) technologies, the throughput of DNA sequence reads has increased from 20 megabases to now 200 gigabases from a single instrument run. It has become more apparent that current bioinformatics pipelines are too computation and I/O intensive to be sustainable in order to cope with the advancement of the technology. General purpose computing on graphics processing units, or simply known as GPGPU, extracts the computing power from hundreds of parallel stream processors within graphics processing cores and provides a cost-effective and energy- efficient alternative to the traditional high-performance computing (HPC) clusters. GPGPU has been proven to improve speed and responsiveness for a wide range of computation-intensive applications including financial simulations as well as life science applications such as sequence alignments and medical imaging.

BarraCUDA is designed with the aim of downsizing the NGS software pipeline from complex and expensive HPC clusters back to cheap and easy-to-manage personal computers. To achieve this, we borrow the processing capability of modern NVIDIA graphics cards through the use of the CUDA application programming interface.

Implementation

Figure 1: BarraCUDA benchmarks

BarraCUDA is largely built on the foundation of BWA which is originally written in C/C++. One major challenge of using GPGPU is the limited amount of memory and bandwidth available and the lack of advanced memory management regime as in normal workstations. With thousands of concurrent threads running at the same time, the workspace available to each thread in BarraCUDA is very small. To tackle this, BarraCUDA utilizes a depth-first search (DFS) approach to perform inexact sequence alignments. The major advantage of using DFS is that it requires much less memory at O(m) to look for alignments compared to O(9^m) in BWA where m is the length of the sequence to be aligned.

In addition to DFS, we also employed several other strategies to further reduce memory usage and enhance memory throughputs:

  • Packing sequences to be aligned into one single continuous block. In BarraCUDA, we pack all the sequences to be aligned into one single block and create a separate index for the data, then we use the index to retrieve the sequence back when needed. By doing this, we could avoid internal fragmentation and thus reduce the memory footprint.
  • Texture Mapping. The reference sequence that the sequencing reads are to be mapped onto, as well as the sequencing reads themselves (in on single block) are all bind to the texture memory in order to take advantage of the texture cache.
  • Coalesce memory access.If possible, all the data units are packed into CUDA-optimized data structure (e.g. uint4) to facilitate coalesce memory access.

Results

To compare the performances between BarraCUDA and its CPU counterparts BWA version 0.5.7, we obtained a set of 11.3 million pairs of sequence reads of 37bp from the 1000 genome project (SRA accession: ERR003014) and performed the alignments with or without gap openings to the human genome (NCBI 36, 3.7 GB).

For gapped alignments, running BWA with 4 cores increased the alignment speed by about 3 folds in both single-end reads (Figure 1A), and paired-end reads (Figure 1B) but when we performed the alignment with 8 processor cores, the speed was not linearly enhanced but was only 29 % quicker than 4 cores, at 4 times the speed of a single core (Figure 1A and 1B). On the other hand, BarraCUDA took 16 m 7 s and 33 m 14 s to perform alignments for single- (Figure 1A) and paired-end reads (Figure 1B) respectively, which was on par with BWA using 4 cores.

For un-gapped alignments as shown in the right panels of Figure 1A and 1B, while BWA exhibited a speed increase by 20 %, and we observed a 2- fold speed-up in BarraCUDA compared to gapped alignment. BarraCUDA was also 13.4 % quicker than BWA with 8 cores in terms of alignments for both single- and paired-end reads, which is equivalent to the speed of 10 processing cores.

Code

BarraCUDA is available on SourceForge.