1 Introduction
-
Up to our knowledge, this is the first parallel application that can exploit modern High Performance Computing (HPC) multicore clusters to accelerate the search of TRs.
-
MPI-dot2dot reduces the memory requirements of the dot plots in order to be able to analyze large datasets.
-
The biological results provided by MPI-dot2dot are highly accurate, as they are identical to those of Dot2dot.
-
This tool is publicly available to download and use under an open-source license from https://sourceforge.net/projects/mpi-dot2dot.
2 Related work
-
Methods that perform an exhaustive search of all possible TRs. The main drawback of this approach is its high computational cost, which makes it unfeasible in most scenarios, as the number of possible combinations grows exponentially with the length of the motif. For instance, the main algorithm in mreps [18] follows the exhaustive approach to find all the repetitions that fulfill certain mathematical properties, but the results are then processed with an heuristic to provide only those with a biologically relevant representation. This approach is more common when searching specifically for microsatellites [2, 31], as their short length makes the exhaustive search still affordable.
-
Algorithms based on a dictionary. This approach starts from a set of seeds which are later extended into strings that are searched in the sequence. These algorithms are suitable for those scenarios where only a limited list of predefined patterns must be found. Some examples of this type of tools are TROLL [6], STAR [9], and BWtrs [29].
-
ab-nitio algorithms. They are non-exhaustive methods that use advanced mathematical techniques to improve the results of the search. Unlike the previous class, they do not require any previous knowledge about the input sequences. One of the most traditionally employed ab-nitio tool is TRF [4], which models the patterns that are repeated according to their similarity and the frequency of their differences. TRF then uses statistical criteria to select the proper patterns. Other examples of more modern ab-nitio tools are tandemSWAN [5], which is based on local autocorrelation analysis; TRStalker [28], which follows a multiphase algorithm where the candidates are sorted according to a score and only the best ones reach the next phase; MsDetector [13] and ULTRA [27], based on hidden Markov models; TAREAN [26], with a graph-based sequence clustering phase followed by a reconstruction step from the most frequent k-mers; and TR-ESA [14], based on enhanced suffix arrays. Moreover, there even exist methods to find TRs on concrete scenarios such as datasets with noisy long reads [16].
3 Background: Dot2dot
3.1 Dot plots
3.2 Search methodology
3.3 Multithreading approach
4 Implementation
4.1 Reduction in memory requirements for dot plots
4.2 MPI parallelization
-
A block distribution that assigns the same number of sequences to each process. Its main advantage is the simplification of the preprocessing step, as the root process only needs to know the total number of sequences of the input file. However, as it does not take into account the length of the sequences, it can cause workload imbalance in the common scenario where such length is highly variable.
-
A balanced distribution that assigns a similar number of bases among the processes. In this case, the number of sequences per process can be significantly different, but the workload is better balanced than in the previous approach. The main drawback is that it requires the root process to completely read the input file in order to know the length of each sequence.
4.3 Hybrid MPI/OpenMP parallelization
5 Experimental evaluation
5.1 Configuration of the experiments
Organism | Number of sequences | Millions of bases | Max Mem (MB) | |
---|---|---|---|---|
Dot2dot
|
MPI-dot2dot
| |||
Picea Glauca (PG) | 4,464,856 | 21,582 | 8.11 | 3.13 |
Picea Engelmannii (PE) | 2,394,260 | 24,943 | 333.73 | 128.94 |
Pinus Lambertiana (PL) | 4,253,097 | 27,602 | 170.54 | 65.89 |
Ambystoma Mexicanum (AM) | 125,724 | 32,396 | 85,188.98 | 32,913.92 |
5.2 Experimental results
-
The synchronizations required by the inter-sequence multithreaded approach of Dot2dot, where complete sequences are assigned to different threads (see Sect. 3.3), lead to significant performance overhead, especially for the two datasets with more variability in the sequence length (Picea Glauca and Pinus Lambertiana). This overhead is significantly reduced with the intra-sequence multithreaded version of MPI-dot2dot. Consequently, the latter is 2.60 times faster on average, with a maximum speedup of 4.15 for the Pinus Lambertiana genome.
-
The balanced MPI distribution obtains much better results than the block one that assigns the same number of sequences per process. This behavior was expected as the length of the sequences is highly variable in these real genomic datasets. The results of the block distribution are only competitive for the Picea Glauca genome, which presents quite regular lengths, but even in this case they are worse than using the balanced approach.
-
The pure MPI implementation that balances workload distribution achieves better performance than any multithreaded approach. Concretely, it is on average 2.99 and 1.13 times faster than the multithreaded versions of Dot2dot and MPI-dot2dot, respectively, using the 24 cores of one whole node.
-
The hybrid MPI/OpenMP implementation with a correct configuration of processes and threads can further improve performance, but its performance gain is not quite high (on average, 2.72%).
Config. | PG | PE | PL | AM |
---|---|---|---|---|
1P–24Th | 1857 | 1886 | 2214 | 2512 |
2P–12Th | 1831 | 1810 | 1957 | 2444 |
4P–6Th | 1864 | 1688 | 1852 | – |
8P–3Th | 1779 | 1615 | 1794 | – |
12P–2Th | 1746 | 1598 | 1768 | – |
24P–1Th | 1808 | 1625 | 1824 | – |
Dot2dot | MPI-dot2dot | |||||
---|---|---|---|---|---|---|
1 node | 2 nodes | 4 nodes | 8 nodes | 16 nodes | ||
PG | 1h 10m | 29m 6s | 15 m 31s | 8m 37s | 4m 53s | 3m 8s |
PE | 43m | 26m 38s | 14m 1s | 7m 48s | 4m 51s | 3m 26s |
PL | 2h 33m | 29m 28s | 17m 1s | 9m 12s | 5m 23s | 3m 54s |
AM | 12h 57m | 41m 44s | 22m 27s | 13m 14s | 9m 20s | 7m 9s |