Skip to main content
Top
Published in: Journal of Big Data 1/2018

Open Access 01-12-2018 | Research

Kavosh: an effective Map-Reduce-based association rule mining method

Authors: Mohammadhossein Barkhordari, Mahdi Niamanesh

Published in: Journal of Big Data | Issue 1/2018

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The immense amount of data generated on a daily basis by various devices and systems necessitates a change in data analysis methods. As an important part of analytics, data mining methods require a paradigm shift to solve problems because the old methods cannot manage massive data. Association rule mining is a data mining algorithm used to solve various domain problems. Because of the immense volume of data, one-node solutions are no longer useful, and it is necessary to solve problems by using a distributed and shared-nothing architecture such as Map-Reduce. However, when association rule mining is transferred to these architectures, new problems appear. The main problems are lack of data locality and iteration support and process skewness. In this paper, a method is proposed that solves these problems. Kavosh converts data into a unified format that helps nodes perform their tasks independently without the need to exchange data with other nodes. In addition, the proposed method compresses input data to facilitate data management. Another advantage is the lack of process skewness because it is possible to allocate a predefined amount of data to each node. Kavosh omits iterations required for finding frequent itemsets by changing the Map-Reduce architecture. The proposed method is implemented using Hadoop, and the results are compared with open-source products in terms of three aspects: execution time, load balancing and data compression. The results show that Kavosh outperforms other methods in these aspects.

Introduction

With the growth of information, traditional analysis methods must be modified because they cannot handle immense amounts of data. Data mining algorithms are analytics that require a paradigm shift for algorithm execution and changes for deployment over nodes. Data mining algorithms must be modified so they can be executed over scalable and distributable environments. However, modifying data mining algorithms for distributed architecture is not easy. One problem with distributed architecture is data locality. This occurs when the required data for processing do not exist on the processor node.
One of the most prominent methods used to solve big data problems over shared-nothing architecture is Map-Reduce. Map-Reduce [1] is used in open-source solutions such as Hadoop and Spark. There are many products in addition to Hadoop and Spark that can be used to solve data mining problems. However, pure Map-Reduce suffers from the data locality problem and does not support iteration. Because nearly all data mining algorithms are iterative in nature, it is necessary to solve the iteration problem in Map-Reduce to solve data mining problems by using Map-Reduce-based methods.
Association rule mining is a data mining algorithm that requires iteration to find frequent itemsets. Methods such as Apriori [2] and FP-Growth [3] solve association rule mining problems. There are two main parameters for association rule mining: confidence and support. A rule in an association rule is defined as x → y (if x, then y), where x and y are itemsets. The support of a rule is defined as the frequency of an itemset in a database. The confidence of a rule is defined as
$$ Confidence\left( {x \to y} \right) = \frac{{{\text{Support}}\left( {{\text{x}} \cap {\text{y}}} \right)}}{{{\text{Support}}\left( {\text{x}} \right)}} $$
In this paper, Kavosh, which is a method for association rule mining, is proposed. This method is designed for a shared-nothing architecture and can properly solve the association rule mining problem in Map-Reduce.
This method solves the data locality problem completely. Thus, the nodes do not require data from other nodes. By changing and unifying the data format, data compression and data load balancing are supported. In addition, iteration is omitted in the proposed method; therefore, it is well-suited for the Map-Reduce architecture.
The remainder of this paper is structured as follows: In “Related works” section, related works are discussed. In the third section, Kavosh is presented. The fourth section describes an evaluation, and the final section presents the conclusions.
Map-Reduce is used to solve many problems over a shared-nothing architecture. This method is also used to solve association rule mining. There are four main problems with association rule mining on a shared-nothing architecture, namely, lack of support for the following:
  • Data locality,
  • Iteration,
  • Load balancing, and
  • Data compression.
As mentioned above, the data locality problem is a lack of required data on the processor node, which creates data dependency among nodes. This problem causes network congestion and increases the algorithm execution time. Another problem is lack of iteration support. When intermediate results are created by the Reducers, they must be fed as input to the Mappers for another iteration. This problem also increases the execution time because intermediate data must be written on the disk by the Reducers and read by the Mappers to initiate another iteration. The third problem is the load balancing problem. This occurs when the processes are not allocated to the Mappers fairly. This problem reduces the algorithm speed because all nodes must wait for a busy node after job completion. The last problem is lack of support for data compression. Because of the large amount of data, data volume reduction is necessary for more rapid iterations and intermediate result storage.
In some methods, traditional Apriori methods are used in Map-Reduce. Transactions are allocated to Mappers and frequent k-itemsets are extracted from each Mapper before the results are shuffled through combiners and the final k-itemsets are extracted according to support and confidence thresholds. In Oruganti et al. [4], Kovacs et al. [5], Li et al. [6], Mappers and Reducers are used, but in [7, 8], in addition to Mappers and Reducers, Combiners are used for better shuffling and to address performance issues.
In Lin et al [9], three methods are proposed: Single Pass Counting (SPC), Fixed Passes Combined-counting (FPC) and Dynamic Passes Combined-counting (DPC). In Apriori, each iteration must generally wait until the results of all the Reducers from the previous iteration have been generated.
The FiDoop [10] method uses three Map-Reduce phases to generate frequent itemsets. FiDoop claims to support automatic parallelization, load balancing, data distribution, and fault tolerance. ScaDiBino [11] extracts rules with the maximum length and one target field. This method omits iterations. In Yu et al. [12], the Distributed Parallel Apriori (DPA) algorithm is proposed and metadata are stored in the form of Transaction Identifiers (TIDs). In this method, a single scan of the database is required and a balanced workload among nodes is created.
Various proposed methods use FP-Growth on a shared-nothing architecture. In Li et al. [13], parallel FP-Growth is proposed for independently executing a group of tasks on a node. In Bechini et al. [14], MRAC and MRAC+ are proposed, which are Map-Reduce-based and FP-Growth-based methods, respectively, and FP-Growth was modified to overcome performance issues. In Yang et al. [15], a Hadoop-based method is proposed that uses a distributed DH-TRIE frequent pattern algorithm and tries to solve FP-Growth problems for big data. In Tlili et al. [16], a partitioning method is used to achieve load balancing problems for association rule mining. PARMA [17] creates multiple small random samples of the input data and runs a mining algorithm on them. Because it is implemented on small samples, the algorithm can be run in parallel and independently. The results are aggregated to produce the final results. In Yu and Zhou [18], Tidset-based Parallel FP-tree (TPFP-tree) and Balanced Tidset-based Parallel FP-tree (BTP-tree) are proposed. In the proposed method, a transaction identification set (Tidset) is used to provide direct access to transactions instead of a full database scan. In Moens et al. [19], two methods are proposed: Dist-Eclat and BigFIM. Dist-Eclat has three steps: finding the frequent items, generating frequent itemsets of size k and sub-tree mining. BigFIM also has three steps: generating frequent itemsets of size k, finding potential extensions and sub-tree mining. The Sequence-Growth algorithm is designed according to the concepts of the lexicographical sequence tree and the lazy mining pruning strategy and is implemented in the MapReduce framework for a distributed execution [20]. In Liang et al. [21], an algorithm for lexicographic frequent itemset generation is proposed, which claims to find the maximum information from a database regarding frequent itemsets and their respective frequencies in a single database scan.

Methods

The proposed method uses the Map-Reduce architectural structure for association rule mining. The first step in the proposed method is to convert the input data items to a Kavosh format and the second step is to extract rules with variable lengths. In this paper, the number of fields after the “if” conditions are applied is defined as the length of the rule. In the proposed method, data are converted into Kavosh format, which helps nodes execute their processes independently. This format is a unified format and other input data formats are converted to this format. The proposed unified format helps create <key, value> pairs for use in the Map-Reduce method.

• Converting input data items to Kavosh format

In this section, the Kavosh format is defined
Θ = {θ1, θ2, …, θn, ϯ}.
The input data table is denoted as Θ, the columns of the table as θ and the table key as ϯ
θk= {µ1, µ2, ..., µm}.
The distinct values of each column are denoted as µ. In this paper, it is assumed that all θ values are discrete and that a continuous θ must be converted to discrete values.
Based on the above definitions, the Kavosh format is defined as follows:
ϝ = {θ1 → μ1, θ2 → μ2,…, θn → μm}
where ϝ is the Kavosh table and each θp → µq is equal to zero or one. Each input column value is converted to a Boolean column if there are more than two values for the column.
The input data (Θ) are divided into equal segments. For fragmentation, the function ψ is used, which is defined as follows:
Ψ (ϝ, ϯstart, ϯfinish, ηk)
where ηk is the Mapper node to which the data ϝ are allocated from key ϯstart to key ϯfinish. From the previous conversion steps, a Boolean matrix is created on each Mapper. Each field ϝ can be selected as a target field. If a binary position is assigned to each field ϝ (except target fields), each row can be converted into a decimal number. Therefore, a <Key, Value> pair is created in which the Key is the decimal number of each row (Rule-Key) and the Value is the decimal value of the target field. In the next step, the Kavosh triple is created, which is defined as follows:
<Rule-Key, Rule count, Target field count with the specified value>
According to the generated <Rule-Key, Target fields>, the count of each Rule-Key and target field with the specified value is calculated and the Kavosh triple set is created. To convert Mapper data into the Kavosh triple and Rule-Key aggregation, the function Ϯ is used.
Ϯ(ηk, ϼk)
where ϼk is the result of computation of the input data. After calculating ϼk for each Mapper, all ϼk are reduced to ϼReducer by the function £.
£(ϼ1, ϼ2,…, ϼp, ϼReducer, ϧ)
where ϧ is the Reducer node and p is the number of Mappers.

• Rule extraction

In this section, a Map-Key is added to the Rule-Key (ϻ). The Map-Key is used for rule generation, and its length in binary format is equal to the Rule-Key. Here, ϼReducer is sent to the rule generation nodes. Each rule generation node has one or more Map-Keys. The function ϖ allocates ϼReducer to each rule generation node.
where ι is the Map-Key, q is the number of Map-Keys allocated to a rule generation node, and Ϧ is the rule generation node.
ϖ (ϼReducer, ι1, ι2,…, ιq ,Ϧr)
Each Map-Key is concatenated to all Rule-Keys. If the length of the Rule-Key in binary format is equal to L, then there are 2L − 1 Map-Keys. They start from zero to 2L − 1 and are concatenated with the Rule-Keys to create various condition combinations of fields ϝ. The concatenation (ϒ) of the Map-Key with the result of the “logical and” (ϰ) of the Map-Key and the Rule-Key is called MHB_Key, and τ is the MHB_Key.
τ = ϒi, ϰ (ϻ, ιi))
There are two important factors for rule extraction in association rule mining: confidence and support. In this paper, Д (τ) is defined as the frequency of a specified MHB_Key (τ) and ζ (τ, σ, ρ) is defined as the frequency of a specific decimal value (ρ) that is equal to the target field (σ) for a specified MHB_Key (τ). In addition, ς is the total number of generated Kavosh triple sets. Thus, the support ( https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figa_HTML.gif ) is calculated using the following formula:
https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figb_HTML.gif
Moreover, the confidence ( https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figc_HTML.gif ) is calculated using the following formula:
https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figd_HTML.gif
After https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figa_HTML.gif and https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Figc_HTML.gif have been calculated, they are compared with the defined thresholds for the association rules and the final results are produced. The rules are extracted using Д.
https://static-content.springer.com/image/art%3A10.1186%2Fs40537-018-0129-4/MediaObjects/40537_2018_129_Fige_HTML.gif
where α is the support threshold and β is the confidence threshold. Each rule generation node creates its final result separately because the results do not have common rules with other nodes. Figure 1 shows the Kavosh architecture. It consists of three layers: the Mapper layer, the Reducer layer and the rule generation layer.
For clearer illustration, pseudocodes are provided. Table 1 lists the functions used in the code.
Table 1
Pseudocode functions
Function
Description
NOC(TableName)
Returns the number of columns
NOR(TableName)
Returns the number of rows
NODV(TableName, ColumnName/ColumnID)
Returns the number of distinct values in the specified column
NaOC(TableName, k)
Returns the name of the kth column in the specified table
DVC(TableName,ColumnName/ColumnID, k)
Returns the kth distinct value of the specified column
CT(ColumnsArray[],TableName)
Creates a table with an input column array with a specified table name
RVT(TableName, ColumnName/ColumnID,RowID)
Returns the value of the specified table name, column name/column ID, and row ID
EXEC (Query, ResultTable)
Executes a query, creates a results table and puts the results into the results table
The results table can be created in RAM or HDD, according to the node hardware specification.
&
& is a logical operator (And)
I2B(i)
Converts the integer value i to a binary value
The following code shows the binominal conversion.
The following pseudocode shows the Mapper’s functionality:
In the reduce phase, the EXEC() function is repeated to obtain the final intermediate results layer. The following pseudocode shows the Reducer phase:
After the extraction of all the rules with a maximum length in the intermediate results layer (Fig. 4), a mapping field is added to these rules. Each rule is converted to 2n − 1 rules, where n is the number of input data item fields. The new rule value is equal to the concatenation of the mapping field and the result of the logical And between the mapping field and the rule value.
Suppose the input data are as described in Table 2, where New Service is the target field and the data are divided between two Mappers.
Table 2
Sample input information
City
Sex
Income
New service
Mapper
Tehran
Male
High
Yes
1
Tehran
Female
Low
Yes
Yazd
Male
High
No
Yazd
Male
High
Yes
2
Tehran
Female
Low
No
Yazd
Male
High
No
Table 3 shows the data after conversion to a binominal format. Table 4 shows the Mapper results. The rules are now combined based on their keys in the Reducer layer. As shown in Table 5, the rules are combined in this phase, and their rule count and rule summation values are recalculated. Then, the mapping fields are added to the rules. The mapping fields are sorted to prevent future shuffling among the Mappers. As the input item, the field count is equal to three, and seven mapping fields (23 − 1) must be added to the rules. Table 6 shows this stage.
Table 3
Binominal format
Tehran
Sex
High
New service
Mapper
1
1
1
1
1
1
0
0
1
0
1
1
0
0
1
1
1
2
1
0
0
0
0
1
1
0
Table 4
Mapper results
Decimal rule
Target field
Rule count
Rule sum
Mapper
7
1
1
1
1
4
1
1
1
3
0
1
0
3
0
1
0
2
4
1
1
1
3
0
1
0
Table 5
Reducer results
Rule
Rule count
Rule sum
7
1
1
4
2
2
3
3
0
Table 6
Mapping field addition stage
Mapping field
Rule
MHB_Key (Binary)
MHB_Key (Decimal)
Rule count
Rule sum
111
111
111111
63
1
1
111
100
111100
60
2
2
111
011
111011
59
3
0
110
111
110110
54
1
1
110
100
110100
52
2
2
110
011
110010
50
3
0
101
111
101101
45
1
1
101
100
101100
44
2
2
101
011
101001
41
3
0
100
111
100100
36
1
1
100
100
100100
36
2
2
100
011
100000
32
3
0
011
111
011011
27
1
1
011
100
011000
24
2
2
011
011
011011
27
3
0
010
111
010010
18
1
1
010
100
010000
16
2
2
010
011
010010
18
3
0
001
111
001001
9
1
1
001
100
001000
8
2
2
001
011
001001
9
3
0
The rule sum and count are recalculated, and the same key rules are combined, as depicted in Table 7. In the next stage, confidence and support are calculated. Table 8 shows the results of this stage. As mentioned earlier, confidence values must be calculated for two values: zero and one. Finally, the Reduce stage of the rule generation layer filters rules according to thresholds for support and confidence. If support is greater than 5% or confidence is greater than 95%, the Reducer results are as shown in Table 9. As shown in Table 9, some rows have confidence equal to 0%. These rows have confidence equal to 100% for a zero value. In other words, for rule filtering with confidence greater than 95%, it is necessary to consider confidence lower than 5% (100 − 95%). Kavosh can be summarized as follows:
Table 7
Rule sum and count recalculation
MHB_Key (Binary)
MHB_Key (Decimal)
Rule count
Rule sum
111111
63
1
1
111100
60
2
2
111011
59
3
0
110110
54
1
1
110100
52
2
2
110010
50
3
0
101101
45
1
1
101100
44
2
2
101001
41
3
0
100100
36
3
3
100000
32
3
0
011011
27
4
1
011000
24
2
2
010010
18
4
1
010000
16
2
2
001001
9
4
1
001000
8
2
2
Table 8
Support and confidence results
Rule
Support (%)
Confidence (one) (%)
63
2
100
60
5
100
59
7
0
54
2
100
52
5
100
50
7
0
45
2
100
44
5
100
41
7
0
36
7
100
32
7
0
27
10
20
24
5
100
18
10
20
16
5
100
9
10
20
8
5
100
Table 9
Reducer results
Rule
Support (%)
Confidence (one) (%)
59
7
0
50
7
0
41
7
0
36
7
100
32
7
0
  • The input data are converted to a binominal format.
  • The converted data are distributed among the Mappers.
  • The results of the Mappers are sent to the Reducer layer.
  • Mapping fields are added to the results of the Reducer layer, and MHB_Keys are generated.
  • The Reducer results are sent to the Rule generation layer.
  • Confidence and support parameters are applied to the extracted rules to create the final results.

Evaluation

The proposed method is evaluated on the TPC-DS dataset and a real-world dataset. “The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. The benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi user mode and data maintenance performance for a given hardware, operating system, and data processing system configuration under a controlled, complex, multi-user decision support workload. The purpose of TPC benchmarks is to provide relevant, objective performance data to industry users. TPC-DS Version 2 enables emerging technologies, such as Big Data systems, to execute the benchmark” (http://​www.​tpc.​org/​tpcds).
The proposed method can use DBMS on each node independently. PostgreSQL is used as DBMS on each node. PostgreSQL is a powerful, open-source object-relational database system that uses and extends the SQL language, combined with many features that safely store and scale the most complicated data workloads (https://​www.​postgresql.​org/​about/​).
To achieve higher performance, an In-Memory database is used. Redis is an open-source (BSD licensed), in-memory data structure store, which is used as a database, cache and message broker (https://​redis.​io/​).

TPC-DS

To implement the Kavosh method, Hadoop 2.7.3 is used. Ubuntu 16.04 is installed on each node. Parts of Hadoop were modified for Kavosh implementation. As described above, there are three layers of nodes in Kavosh: The first layer consists of the Mapper nodes, the second layer consists of the Reducer node, and the third layer is the rule generation layer. The DataNodes of Hadoop are used for these three layers. A MetaNode is added to the Hadoop NameNode to maintain the type information of each group of nodes. Because of the Kavosh data format and the independence of each node, it is possible to use a database management system for each data node. Therefore, PostgreSQL 9.6.1 is used for the Mapper nodes. A Kavosh converter exists on the Mapper nodes to convert the input data format into the Kavosh format. To achieve faster computation speed, an in-memory database is used for the Reducer node and the rule generation nodes. Redis 3.2.5 is used as an in-memory database on these nodes with AOF disk persistence for the Reducer and disk persistence is disabled for the rule generation nodes. Table 10 shows DBMS for the various node types.
Table 10
DBMS for the various node types
Node type
DBMS
Mapper
PostgreSQL
Reducer
Redis
Rule generation
Redis
Figure 2 shows Kavosh architecture using Hadoop.
For data generation, TPC-DS_Tools_v2.8.0 was used. Four tables (Customer_demographics, Store_sales, Web_sales, and Item) were used for association rule mining. Figures 3 and 4 show the ER-Diagrams.
A database view is created over four mentioned tables. This view contains customer information and items that each customer bought. Figures 5, 6, 7 and 8 show the table structures.
The Kavosh method extracts market basket analysis (MBA) rules for items on various sell channels. The frequent items with a maximum length of twenty are extracted. In other words, twenty related items are extracted. The scale factor is 100 TB and SF = 100,000. Table 11 shows the number of rows in each data table.
Table 11
Number of rows in each data table
Table name
Number of rows
Items
502,000
web_sales
71,999,670,164
Store_sales
287,997,818,084
Customer_Demographics
1,920,800
Tables 12 and 13 show the node specifications for the Kavosh method. A total of 86 nodes are used for evaluation. Seventy nodes were used as the Mapper nodes, one node was used as the Reducer node and fifteen nodes were used as the rule generation nodes. The input data must be converted to the Kavosh format and it requires 6740 s for the Kavosh converter to convert the input data to the Kavosh format. Each Mapper node simultaneously converts approximately 1.5 TB of data.
Table 12
Mapper node specifications
CPU
Intel Core i7-3770 Quad-Core Processor 3.4 GHz
HDD
2 TB
RAM
64 GB
Table 13
Reducer and rule generation node specifications
CPU
Intel Core i7-3770 Quad-Core Processor 3.4 GHz
HDD
500 GB
RAM
512 GB
The Kavosh method is compared with FiDoop [10] and Sequence-Growth [20]. Table 14 shows the node specifications for the FiDoop and Sequence-Growth methods.
Table 14
Nodes for FiDoop and Sequence-Growth
CPU
Intel Core i7-3770 Quad-Core Processor 3.4 GHz
HDD
1.5 TB
RAM
150 GB
The total CPU, RAM and HDD used for Kavosh and other methods are equal.
To evaluate the three methods, frequent items for MBA of web sales and store sales are extracted. The association rule parameters are shown in Table 15.
Table 15
Association rule parameters
Parameter name
Parameter value (%)
Confidence
90
Support
0.05
They are compared with one another in three dimensions: execution time, data compression and load balancing.

Execution time

The method proposed by FiDoop requires three Map-Reduce cycles to generate the final results. The execution time for each cycle is almost the same, and in each cycle, a full database scan is required. Sequence-Growth requires twenty Map-Reduce cycles because it generates each frequent item in a Map-Reduce cycle. However, the execution time decreases each cycle in comparison with the previous cycle because the number of frequent items decreases. Nevertheless, Kavosh requires a Map-Reduce cycle on compressed data and a rule generation phase. According to the data format, all nodes can complete their processes independently. In addition, Kavosh uses an in-memory database in the Reducer and rule generation nodes, which causes a reduction in I/O time. According to the above description, Table 16 shows the execution time for each method.
Table 16
Execution details
Method name
Map (s)
Reduce (s)
Rule generation (s)
FiDoop
1265
1194
0
Sequence-Growth
3500
2890
0
Kavosh
288
265
1159
Figure 9 shows the total execution time for each method.

Compression

In this part of the evaluation, the Kavosh data volume is compared with those of FiDoop and Sequence-Growth. In FiDoop and Sequence-Growth, no compression is performed on the input data, whereas the data are compressed in Kavosh due to the conversion of the input data to the Kavosh data format. The compression rates are different: if fields have fewer discrete values, the compression rate reaches ninety percent, but Kavosh compresses data by an average of approximately 57%. The data compression in the Kavosh data format is due to the conversion of string and number values to the bit data type. In the Kavosh method, no data replication or collocation is required for performance issues. In addition, data compression helps Kavosh retrieve more data in the data retrieval phase. Table 17 shows the memory usage for different methods.
Table 17
Memory usage
Method name
Used table data volume (TB)
FiDoop
39.9
Sequence-Growth
70
Kavosh
70

Load balancing

In this section, the load balancing of various methods is investigated. To calculate the load balance, the length of time during which a processor has a CPU usage of over eighty percent while other CPUs have CPU usages of under thirty percent is calculated. In this paper, this measure is called Balance_Factor. Using this definition, the results in Table 18 were obtained.
Table 18
Hadoop configuration
Method
Balance_Factor (s)
FiDoop
800
Sequence-Growth
3230
Kavosh
15

Real traffic data of a mobile operator

In this section, the proposed method is evaluated using real traffic data of a mobile value-added service (VAS) provider. The evaluation data include more than 60,000,000 subscribers, together with their services. The subscribers used 3000 services and all the subscribers’ activities were considered in the evaluation. The mobile operator needs accurate information about their customers’ favourites to be able to provide appropriate VAS suggestions. Due to the high volume of information, existing association rule algorithms would take a long time to execute and, therefore, are not suitable. In this evaluation, high-dimensional data items were used. If all the combinations of rules were applied, 23000 − 1 mapping fields would have to be added to each rule that is generated in the rule generation layer. Such a large number of rules is neither necessary nor calculable. Based on additional information obtained about the services, a maximum of ten services is required for recommending VAS. Thus, rules with a maximum length of ten must be generated.
The input data format is as shown in Table 19. The first column is the subscriber mobile number and the second column is the service’s ID.
Table 19
Input data format
MobileNumber
ServiceID
09121450111
1041
09121450111
58
09121450111
971
09121450111
119
09123895004
971
09191005069
113
We converted Table 18 to a binominal format. Finally, a <Key,Value> pair is created. The Key is the mobile number, and the value is a bit stream with a length of 2999 (Number of VAS services − 1 (Target Field)). Kavosh was deployed on the hardware nodes with the specifications detailed in “TPC-DS” section.
In the first step, the customer service information table is converted to the Kavosh format. In this phase, one service is considered as a target field. The relationships between this service and other services are extracted as rules. The customer service information table is distributed over 70 nodes. The conversion time is 1200s. Fifty servers are used for the Mapper, one server is used for the Reducer and sixty servers are used for the rule generation layer.
Table 20 presents the parameters that are used for the association rule algorithm.
Table 20
Association rule parameters
Parameter name
Parameter value (%)
Confidence
85
Support
0.1
Table 21 shows execution details for the three methods.
Table 21
Execution details
Method name
Map (s)
Reduce (s)
Rule generation (s)
FiDoop
206
199
0
Sequence-Growth
779
750
0
Kavosh
75
71
119
Figure 10 shows the total execution time for each method.

Compression

Table 22 shows the memory usage for the three methods.
Table 22
Memory usage
Method name
Table data volume used (TB)
FiDoop
19
Sequence-Growth
40
Kavosh
40

Load balancing

Table 23 shows the values of Balance_Factor for the three methods
Table 23
Load balancing
Method
Balance_Factor (s)
FiDoop
683
Sequence-Growth
2800
Kavosh
17

Scalability

In this section Kavosh scalability is evaluated. Different data sizes(TPC-DS) and various number of nodes are investigated and the results are depicted as Fig. 11. The results show that by adding more nodes and scale out data over nodes less time is required for data processing.

Conclusion

In this paper, the Kavosh method is proposed. Kavosh is a Map-Reduce-based association rule mining method that can manage an immense amount of data. This method converts input data into the Kavosh format, which is a unified and compressed format. Using the Kavosh data format, data can be distributed over nodes, and nodes do not require additional data from other nodes. Kavosh also compresses input data and facilitates data management. The main advantage of the proposed method is the omission of iterations for extracting rules. This is a substantial achievement because writing intermediate results to a disk and retrieving them for the next iteration is the most time-consuming portion of association rule mining. Another important achievement of the proposed method is the balanced process for each node. Because of the unified data format, each node is allocated the same number of rows in the homogenous nodes; thus, the process is equally divided among the nodes. By modifying Hadoop, the proposed method was implemented. Kavosh was compared with prominent association rule mining methods and achieved a faster execution time.
Kavosh is not suitable for high-dimensional data because creating all the input field combinations requires many nodes in the data generation layer. However, Kavosh is completely usable for normal data (not high-dimensional data) and can generate rules of a predefined length if other rule lengths are not required.
Data format unification can be applied to problems in other fields. This method can be used in data warehouse like Aras and Atrak methods [22, 23], graph processing [24], integrating multidimensional data sources [25] and specific problems like finding patient similarity [26]. For future works, this method can also be used for interactive query processing, online data mining and stream processing.

Authors’ contributions

All authors contributed equally. All authors read and approved the final manuscript.

Acknowledgements

None.
In submitting an article to any of the journals published by SpringerOpen I certify that: (1) I am authorized by my co-authors to enter into these arrangements. (2) I warrant, on behalf of myself and my co-authors, that: (i) the article is original, has not been formally published in any other peer-reviewed journal, is not under consideration by any other journal and does not infringe any existing copyright or any other third party rights; (ii) I am/we are the sole author(s) of the article and have full authority to enter into this agreement and in granting rights to Springer are not in breach of any other obligation; (iii) the article contains nothing that is unlawful, libellous, or which would, if published, constitute a breach of contract or of confidence or of commitment given to secrecy; (iv)I/we have taken due care to ensure the integrity of the article. To my/our—and currently accepted scientific—knowledge all statements contained in it purporting to be facts are true and any formula or instruction contained in the article will not, if followed accurately, cause any injury, illness or damage to the user. (3) I, and all co-authors, agree that the article, if editorially accepted for publication, shall be licensed under the Creative Commons Attribution License 4.0. If the law requires that the article be published in the public domain, I/we will notify Springer at the time of submission, and in such cases the article shall be released under the Creative Commons 1.0 Public Domain Dedication waiver. For the avoidance of doubt it is stated that “Introduction” and “Related works” sections of this license agreement shall apply and prevail regardless of whether the article is published under Creative Commons Attribution License 4.0 or the Creative Commons 1.0 Public Domain Dedication waiver. (4) I, and all co-authors, agree that, if the article is editorially accepted for publication in Chemistry Central Journal, Chemical and Biological Technologies in Agriculture, Geochemical Transactions, Heritage Science, Journal of Cheminformatics, or Sustainable Chemical Processes, data included in the article shall be made available under the Creative Commons 1.0 Public Domain Dedication waiver, unless otherwise stated. For the avoidance of doubt it is stated that “Introduction”, “Related works” and “Methods” sections of this license agreement shall apply and prevail.

Competing interests

The authors declare that they have no competing interests.

Funding

None.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://​creativecommons.​org/​licenses/​by/​4.​0/​), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Literature
1.
go back to reference Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.CrossRef Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.CrossRef
2.
go back to reference Zaki MJ. Parallel and distributed association mining: a survey. IEEE Concurr. 1999;7(4):14–25.CrossRef Zaki MJ. Parallel and distributed association mining: a survey. IEEE Concurr. 1999;7(4):14–25.CrossRef
3.
go back to reference Pramudiono I, Kitsuregawa M. FP-tax: tree structure based generalized association rule mining. In: Proceedings of the 9th ACM SIGMOD workshop on research issues in data mining and knowledge discovery. New York: ACM; 2004. p. 60–3. Pramudiono I, Kitsuregawa M. FP-tax: tree structure based generalized association rule mining. In: Proceedings of the 9th ACM SIGMOD workshop on research issues in data mining and knowledge discovery. New York: ACM; 2004. p. 60–3.
4.
go back to reference Oruganti S, Ding Q, Tabrizi N. Exploring Hadoop as a platform for distributed association rule mining. In: Future computing 2013 the fifth international conference on future computational technologies and applications; 2013. p. 62–7. Oruganti S, Ding Q, Tabrizi N. Exploring Hadoop as a platform for distributed association rule mining. In: Future computing 2013 the fifth international conference on future computational technologies and applications; 2013. p. 62–7.
5.
go back to reference Kovacs F, Illés J. Frequent itemset mining on hadoop. In: 2013 IEEE 9th international conference on computational cybernetics (ICCC). New York: IEEE; 2013. p. 241–5. Kovacs F, Illés J. Frequent itemset mining on hadoop. In: 2013 IEEE 9th international conference on computational cybernetics (ICCC). New York: IEEE; 2013. p. 241–5.
6.
go back to reference Li N, Zeng L, He Q, Shi Z. Parallel implementation of apriori algorithm based on mapreduce. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel and distributed computing (SNPD). New York: IEEE; 2012. p. 236–41. Li N, Zeng L, He Q, Shi Z. Parallel implementation of apriori algorithm based on mapreduce. In: 2012 13th ACIS international conference on software engineering, artificial intelligence, networking and parallel and distributed computing (SNPD). New York: IEEE; 2012. p. 236–41.
7.
go back to reference Yang XY, Liu Z, Fu Y. MapReduce as a programming model for association rules algorithm on Hadoop. In: 2010 3rd international conference on information sciences and interaction sciences (ICIS). New York: IEEE; 2010. p. 99–102. Yang XY, Liu Z, Fu Y. MapReduce as a programming model for association rules algorithm on Hadoop. In: 2010 3rd international conference on information sciences and interaction sciences (ICIS). New York: IEEE; 2010. p. 99–102.
8.
go back to reference Li L, Zhang M. The strategy of mining association rule based on cloud computing. In: 2011 international conference on business computing and global informatization (BCGIN). New York: IEEE; 2011. p. 475–8. Li L, Zhang M. The strategy of mining association rule based on cloud computing. In: 2011 international conference on business computing and global informatization (BCGIN). New York: IEEE; 2011. p. 475–8.
9.
go back to reference Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. New York: ACM; 2012. p. 76. Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on MapReduce. In: Proceedings of the 6th international conference on ubiquitous information management and communication. New York: ACM; 2012. p. 76.
10.
go back to reference Xun Y, Zhang J, Qin X. Fidoop: parallel mining of frequent itemsets using mapreduce. IEEE Trans Syst Man Cybern Syst. 2016;46(3):313–25.CrossRef Xun Y, Zhang J, Qin X. Fidoop: parallel mining of frequent itemsets using mapreduce. IEEE Trans Syst Man Cybern Syst. 2016;46(3):313–25.CrossRef
11.
go back to reference Barkhordari M, Niamanesh M. ScadiBino: an effective MapReduce-based association rule mining method. In: Proceedings of the sixteenth international conference on electronic commerce. New York: ACM; 2014. p. 1. Barkhordari M, Niamanesh M. ScadiBino: an effective MapReduce-based association rule mining method. In: Proceedings of the sixteenth international conference on electronic commerce. New York: ACM; 2014. p. 1.
12.
go back to reference Yu KM, Zhou J, Hong TP, Zhou JL. A load-balanced distributed parallel mining algorithm. Expert Syst Appl. 2010;37(3):2459–64.MathSciNetCrossRef Yu KM, Zhou J, Hong TP, Zhou JL. A load-balanced distributed parallel mining algorithm. Expert Syst Appl. 2010;37(3):2459–64.MathSciNetCrossRef
13.
go back to reference Li H, Wang Y, Zhang D, Zhang M, Chang EY. Pfp: parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM conference on recommender systems. New York: ACM; 2008. p. 107–14. Li H, Wang Y, Zhang D, Zhang M, Chang EY. Pfp: parallel fp-growth for query recommendation. In: Proceedings of the 2008 ACM conference on recommender systems. New York: ACM; 2008. p. 107–14.
14.
go back to reference Bechini A, Marcelloni F, Segatori A. A MapReduce solution for associative classification of big data. Inf Sci. 2016;332:33–55.CrossRef Bechini A, Marcelloni F, Segatori A. A MapReduce solution for associative classification of big data. Inf Sci. 2016;332:33–55.CrossRef
15.
go back to reference Yang L, Shi Z, Xu LD, Liang F, Kirsh I. DH-TRIE frequent pattern mining on Hadoop using JPA. In: 2011 IEEE international conference on granular computing (GrC). New York: IEEE; 2011. p. 875–8. Yang L, Shi Z, Xu LD, Liang F, Kirsh I. DH-TRIE frequent pattern mining on Hadoop using JPA. In: 2011 IEEE international conference on granular computing (GrC). New York: IEEE; 2011. p. 875–8.
16.
go back to reference Tlili R, Slimani Y. A novel data partitioning approach for association rule mining on grids. Int J Grid Distributed Comput. 2012;5(4):1–20. Tlili R, Slimani Y. A novel data partitioning approach for association rule mining on grids. Int J Grid Distributed Comput. 2012;5(4):1–20.
17.
go back to reference Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. In: Proceedings of the 21st ACM international conference on Information and knowledge management. New York: ACM; 2012. p. 85-94. Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. In: Proceedings of the 21st ACM international conference on Information and knowledge management. New York: ACM; 2012. p. 85-94.
18.
go back to reference Yu KM, Zhou J. Parallel TID-based frequent pattern mining algorithm on a PC Cluster and grid computing system. Expert Syst Appl. 2010;37(3):2486–94.CrossRef Yu KM, Zhou J. Parallel TID-based frequent pattern mining algorithm on a PC Cluster and grid computing system. Expert Syst Appl. 2010;37(3):2486–94.CrossRef
19.
go back to reference Moens S, Aksehirli E, Goethals B. Frequent itemset mining for big data. In: 2013 IEEE international conference on Big Data. New York: IEEE; 2013. p. 111–8. Moens S, Aksehirli E, Goethals B. Frequent itemset mining for big data. In: 2013 IEEE international conference on Big Data. New York: IEEE; 2013. p. 111–8.
20.
go back to reference Liang YH, Wu SY. Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on MapReduce framework. In: 2015 IEEE international congress on Big Data (BigData Congress). New York: IEEE; 2015. p. 393–400. Liang YH, Wu SY. Sequence-growth: A scalable and effective frequent itemset mining algorithm for big data based on MapReduce framework. In: 2015 IEEE international congress on Big Data (BigData Congress). New York: IEEE; 2015. p. 393–400.
21.
go back to reference Chaudhary S, Sharma A, Singh R, Kumar P. Lexicographic logical multi-hashing for frequent itemset mining. In: 2015 international conference on computing, communication and automation (ICCCA). New York: IEEE; 2015. p. 563–8. Chaudhary S, Sharma A, Singh R, Kumar P. Lexicographic logical multi-hashing for frequent itemset mining. In: 2015 international conference on computing, communication and automation (ICCCA). New York: IEEE; 2015. p. 563–8.
22.
go back to reference Barkhordari M, Niamanesh M. Atrak: a MapReduce-based data warehouse for big data. J Supercomput. 2017;73:4596–610.CrossRef Barkhordari M, Niamanesh M. Atrak: a MapReduce-based data warehouse for big data. J Supercomput. 2017;73:4596–610.CrossRef
23.
go back to reference Barkhordari M, Niamanesh M. Aras: a method with uniform distributed dataset to solve data warehouse problems for big data. Int J Distributed Syst Technol (IJDST). 2017;8(2):47–60.CrossRef Barkhordari M, Niamanesh M. Aras: a method with uniform distributed dataset to solve data warehouse problems for big data. Int J Distributed Syst Technol (IJDST). 2017;8(2):47–60.CrossRef
24.
go back to reference Barkhordari M, Niamanesh M. ScaDiGraph: a MapReduce-based method for solving graph problems. J Inform Sci Eng. 2017;33(1):143–58. Barkhordari M, Niamanesh M. ScaDiGraph: a MapReduce-based method for solving graph problems. J Inform Sci Eng. 2017;33(1):143–58.
25.
go back to reference Barkhordari M, Niamanesh M. Arvand: a method to integrate multidimensional data sources into big data analytic structures. J Inf Sci Eng. 2018;34(2):505–18. Barkhordari M, Niamanesh M. Arvand: a method to integrate multidimensional data sources into big data analytic structures. J Inf Sci Eng. 2018;34(2):505–18.
26.
go back to reference Barkhordari M, Niamanesh M. ScaDiPaSi: an effective scalable and distributable MapReduce-based method to find patient similarity on huge healthcare networks. Big Data Res. 2015;2(1):19–27.CrossRef Barkhordari M, Niamanesh M. ScaDiPaSi: an effective scalable and distributable MapReduce-based method to find patient similarity on huge healthcare networks. Big Data Res. 2015;2(1):19–27.CrossRef
Metadata
Title
Kavosh: an effective Map-Reduce-based association rule mining method
Authors
Mohammadhossein Barkhordari
Mahdi Niamanesh
Publication date
01-12-2018
Publisher
Springer International Publishing
Published in
Journal of Big Data / Issue 1/2018
Electronic ISSN: 2196-1115
DOI
https://doi.org/10.1186/s40537-018-0129-4

Other articles of this Issue 1/2018

Journal of Big Data 1/2018 Go to the issue

Premium Partner