GMSA: A Data Sharing System for Multiple Sequence Alignment Across Multiple Users

Author(s): Na Bai, Shanjiang Tang*, Ce Yu*, Hao Fu, Chen Wang, Xi Chen.

Journal Name: Current Bioinformatics

Volume 14 , Issue 6 , 2019

Become EABM
Become Reviewer

Graphical Abstract:


Background: In recent years, the rapid growth of biological datasets in Bioinformatics has made the computation of Multiple Sequence Alignment (MSA) become extremely slow. Using the GPU to accelerate MSA has shown to be an effective approach. Moreover, there is a trend that many bioinformatic researchers or institutes setup a shared server for remote users to submit MSA jobs via provided web-pages or tools.

Objective: Given the fact that different MSA jobs submitted by users often process similar datasets, there can be an opportunity for users to share their computation results between each other, which can avoid the redundant computation and thereby reduce the overall computing time. Furthermore, in the heterogeneous CPU/GPU platform, many existing applications assign their computation on GPU devices only, which leads to a waste of the CPU resources. Co-run computation can increase the utilization of computing resources on both CPUs and GPUs by dispatching workloads onto them simultaneously.

Methods: In this paper, we propose an efficient MSA system called GMSA for multi-users on shared heterogeneous CPU/GPU platforms. To accelerate the computation of jobs from multiple users, data sharing is considered in GMSA due to the fact that different MSA jobs often have a percentage of the same data and tasks. Additionally, we also propose a scheduling strategy based on the similarity in datasets or tasks between MSA jobs. Furthermore, co-run computation model is adopted to take full use of both CPUs and GPUs.

Results: We use four protein datasets which were redesigned according to different similarity. We compare GMSA with ClustalW and CUDA-ClustalW in multiple users scenarios. Experiments results showed that GMSA can achieve a speedup of up to 32X.

Conclusion: GMSA is a system designed for accelerating the computation of MSA jobs with shared input datasets on heterogeneous CPU/GPU platforms. In this system, a strategy was proposed and implemented to find the common datasets among jobs submitted by multiple users, and a scheduling algorithm is presented based on it. To utilize the overall resource of both CPU and GPU, GMSA employs the co-run computation model. Results showed that it can speed up the total computation of jobs efficiently.

Keywords: Multiple Sequence Alignment (MSA), multiple users, data sharing, heterogeneous CPU/GPU, co-run, pairwise alignments.

Karadimitriou K, Kraft DH. genetic algorithms and the multiple sequence alignment problem in biology, the second annual Molecular Biology and Biotechnology Conference. 1-7.
Catalyurek U, Stahlberg E, Ferreira R, Kurc T, Saltz J. Improving performance of multiple sequence alignment analysis in multi-client environments. 16th International Parallel and Distributed Processing Symposium (IPDPS 2002). 2002 Apr 15-19; Fort Lauderdale, FL, USA.
Schmidt B. Bioinformatics: High Performance Parallel Computer Architectures. Phy Sol Liq 2010; 25(1): 68-88.
Wang L, Jiang T. On the complexity of multiple sequence alignment. J Comput Biol 1994; 1(4): 337-48.
Sneath PHA, Sokal RR. Numerical taxonomy: The principles and practices of numerical classification In: Freeman WH and Company San Francisco 1973.
Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 1987; 4(4): 406-25.
Thompson JD, Higgins DG, Gibson TJ. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994; 22(22): 4673-80.
Gotoh O. An improved algorithm for matching biological sequences. J Mol Biol 1982; 162(3): 705-8.
Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981; 147(1): 195-7.
Myers EW, Miller W. Optimal alignments in linear space. Comput Appl Biosci 1988; 4(1): 11-7.
Liu Y, Schmidt B, Maskell DL. MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. 20th IEEE International Conference on Application-Specific Systems, Architectures and Processors. 2009 July 7-9; Boston, MA, USA. New York: IEEE Computer Society 2009.
Garcia F, Fernandez J. POSIX thread libraries. Linux J 2000; 2000(70es): 36.
Reinders J. Intel threading building blocks: outfitting C++ for multi-core processor parallelism. J Comput Sci 2007; 11: 65-8.
Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. IEEE Comput Sci Eng 1998; 5(1): 46-55.
Stone JE, Gohara D, Shi G, Open CL, Open CL. A Parallel Programming Standard for Heterogeneous Computing Systems. Comput Sci Eng 2010; 12(3): 66-72.
Hung C-L, Lin Y-S, Lin C-Y, Chung Y-C, Chung Y-F. CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on Multi-GPUs. Comput Biol Chem 2015; 58: 62-8.
Foster I, Zhao Y, Raicu I, Lu S. Cloud computing and grid computing 360-degree compared. Grid Computing Environments Workshop. 2008 Nov 12-16; Austin, TX, USA. 1-10.
Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005; 61(1): 127-36.
Walker JM. The Proteomics Protocols Handbook. In: Humana Press. 1st ed. Clifton. 1988.
Simossis V, Kleinjung J, Heringa J. An overview of multiple sequence alignment. Curr Protoc Bioinformatics 2003; 3(1): 7.
Li KB. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics 2003; 19(12): 1585-6.
Marucci EA, Zafalon GFD, Momente JC, et al. Using threads to overcome synchronization delays in parallel multiple progressive alignment algorithms. Am J Bioinform 2012; 1(1): 50-63.
Cheetham J, Dehne F, Pitre S, Rau-Chaplin A, Taillon PJ. Parallel clustal w for pc clusters. International Conference on Computational Science and Its Applications. 2003 May 18-21; Montreal, Canada. Berlin, Heidelberg: Springer 2003.
Zou Q, Li XB, Jiang WR, Lin ZY, Li GL, Chen K. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2014; 15(4): 637-47.
Zou Q, Hu Q, Guo M, Wang G. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. Bioinformatics 2015; 31(15): 2475-81.
Wan S, Zou Q. HAlign-II: efficient ultra-large multiple sequence alignment and phylogenetic tree reconstruction with distributed and parallel computing. Algorithms Mol Biol 2017; 12(1): 25.
Su W, Liao X, Lu Y, Zou Q, Peng S. Multiple sequence alignment based on a suffix tree and center-star strategy: A Linear method for multiple nucleotide sequence alignment on spark parallel framework. J Comput Biol 2017; 24(12): 1230-42.
Zou Q, Wan S, Zeng X, Ma ZS. Reconstructing evolutionary trees in parallel for massive sequences. BMC Syst Biol 2017; 11(Suppl. 6): 100.
Chen X, Wang C, Tang S, Yu C, Zou Q. CMSA: A heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment. BMC Bioinformatics 2017; 18(1): 315.
Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002; 30(14): 3059-66.
Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000; 302(1): 205-17.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004; 32(5): 1792-7.
Thusoo A, Sarma JS, Jain N, et al. Hive: A warehousing solution over a map-reduce framework. Proceedings VLDB Endowment--- 2009; 2(2): 1626-9.
Nykiel T, Potamias M, Mishra C, Kollios G, Koudas N. MRShare: sharing across multiple queries in MapReduce. Proceedings VLDB Endowment--- 2010; 3(1-2): 494-505.
Ropars T, Lefray A, Kim D, Schiper A. Efficient process replication for MPI applications sharing work between replicas parallel and distributed processing symposium (IPDPS). 2015 May 25-29; Hyderabad, India. 2015; pp. 645-54.
Mehdi NA, Holmes B, Mamat A, Subramaniam SK. Sharing-aware intercloud scheduler for data-intensive jobs. 2012 International Conference on Cloud Computing Technologies, Applications and Management (ICCCTAM). 2012 Dec 8-10; Dubai, United Arab Emirates. 22-6.
Tang S, Yu C, Sun J, et al. EasyPDP: An Efficient parallel dynamic programming runtime system for computational biology. IEEE Trans Parallel Distrib Syst 2012; 23(5): 862-72.
Wang Z, Yang J, Melhem R, Childers B, Zhang Y, Guo M. Simultaneous Multikernel: Fine-grained Sharing of GPGPUs. IEEE Comput Archit Lett 2016; 15(2): 113-6.
Zhang F, Zhai J, Chen W, He B, Zhang S. To Co-Run, or Not To Co-Run: A performance study on integrated architectures. 23rd International symposium on modeling, analysis and simulation of computer and telecommunication systems (MASCOTS). 2015 Oct 5-7; Atlanta, GA, USA. 89-92.
Tang S, He BS, Zhang S, Niu Z. Elastic multi-resource fairness: balancing fairness and efficiency in coupled CPU-GPU architectures. Proceedings of the international conference for high performance computing, networking, storage and analysis. 2017 Nov 12- 17; Denver, CO, USA. 75.

Rights & PermissionsPrintExport Cite as

Article Details

Year: 2019
Page: [504 - 515]
Pages: 12
DOI: 10.2174/1574893614666190111160101
Price: $58

Article Metrics

PDF: 39