Introduction
Related works
-
Complete data locality
-
Network congestion omission
-
Data replication and collocation omission.
Methods
Problem definition
Chabok
-
Selected dimensions{Dimension 1 .Attribute 1, Dimension 2 .Attribute 1, …, Dimension n .Attribute m }
-
Distributive measures{[Distributive measure 1, measure 1 ],[Distributive measure 1, measure 2 ],…,[Distributive measure n, measure m ]}
-
Conditions on dimensions{[Dimension 1 .Attribute 1 ,operator, value], [Dimension 2 .Attribute 1 ,operator, value],…, [Dimension n .Attribute m , operator, value]}
-
Conditions on measures{[measure 1 , operator 1 , value 1 ], [measure 2 , operator 2 , value 2 ],…, [measure n , operator n , value n ]}
-
Conditions on distributive measures{[Distributive measure 1 (measure 1) , operator 1 , value 1 ], [Distributive measure 2 (measure 2 ), operator 2 , value 2 ],…,[Distributive measure n (measure n ), operator n , value n ]}
Formal definitions
Case study
Count | Data size (TB) | |
---|---|---|
Transactions (20 years) | 584,000,000,000 | 100 |
Customers | 44,215,886 | 5 |
Issued cards | 201,441,762 | 2 |
Accounts | 145,258,359 | 8 |
Transaction |
---|
Transaction_ID |
Date |
Time |
Amount |
Transaction_Type |
Account_ID |
Card_Number |
Customer_Number |
City ID |
City |
---|
City_ID |
City |
Province |
Transaction_Type |
---|
Transaction_Type ID |
Transaction_Type_Name |
Account |
---|
Account number |
Open date |
Account type |
Branch |
… |
Card |
---|
Card_Number |
Expire date |
Card_Type |
Branch |
… |
Customer |
---|
Customer_Number |
Name |
Family |
Address |
Sex |
Job |
… |
Date |
---|
Date |
Day |
Month |
Year |
Date |
---|
Time ID |
Hour |
Minute |
Second |
Results
Experiment setup
Experimental platform
Experiment settings
HDD | 4 TB |
---|---|
RAM | 64 GB |
CPU | Dual core 3.6 GHz |
dfs.replication | 3 |
---|---|
mapred.map.tasks | 49 |
mapred.reduce.tasks | 1 |
SPARK_WORKER_CORES | 49 |
Benchmarks
Query 8 | Compute the net profit of stores located in 400 Metropolitan areas with more than 10 preferred customers Qualification substitution parameters □ ZIP.01 = 24128 ZIP.81 = 57834 ZIP.161 = 13354 ZIP.241 = 15734 ZIP.321 = 78668 □ ZIP.02 = 76232 ZIP.82 = 62878 ZIP.162 = 45375 ZIP.242 = 63435 ZIP.322 = 22245 □ ZIP.03 = 65084 ZIP.83 = 49130 ZIP.163 = 40558 ZIP.243 = 25733 ZIP.323 = 15798 … |
Query 19 | Select the top revenue generating products bought by out of zip code customers for a given year, month and manager. Qualification substitution parameters □ MANAGER.01 = 8 □ MONTH.01 = 11 □ YEAR.01 = 1998 |
Query 38 | Display count of customers with purchases from all 3 channels in a given year Qualification substitution parameters □ DMS.01 = 1200 |
Query 41 | How many items do we carry with specific combinations of color, units, size and category Qualification substitution parameters □ MANUFACT.01 = 738 □ SIZE.01 = medium □ SIZE.02 = extra large □ SIZE.03 = N/A □ SIZE.04 = small □ SIZE.05 = petite □ SIZE.06 = large □ UNIT.01 = Ounce □ UNIT.02 = Oz □ UNIT.03 = Bunch □ UNIT.04 = Ton □ UNIT.05 = N/A □ UNIT.06 = Dozen … |
Query 42 | For each item and a specific year and month calculate the sum of the extended sales price of store transactions Qualification substitution parameters □ MONTH.01 = 11 □ YEAR.01 = 2000 |
Query 45 | Report the total web sales for customers in specific zip codes, cities, counties or states, or specific items for a given year and quarter Qualification substitution parameters □ QOY.01 = 2 □ YEAR.01 = 2001 □ GBOBC = ca_city |
Query 51 | Report the total of extended sales price for all items of a specific brand in a specific year and month Qualification substitution parameters □ MONTH.01 = 11 □ YEAR.01 = 2000 |
Query 55 | For a given year, month and store manager calculate the total store sales of any combination all brands Qualification Substitution Parameters □ MANAGER.01 = 28 □ MONTH.01 = 11 □YEAR.01 = 1999 |
Query 82 | Find customers who tend to spend more money (net-paid) on-line than in stores Qualification substitution parameters MANUFACT_ID.01 = 129 MANUFACT_ID.02 = 270 MANUFACT_ID.03 = 821 MANUFACT_ID.04 = 423 INVDATE.01 = 2000-05-25 PRICE.01 = 62 |
Query 84 | List all customers living in a specified city, with an income between 2 values Qualification substitution parameters INCOME.01 = 38128 CITY.01 = Edgewood |
Query 98 | Report on items sold in a given 30 day period, belonging to the specified category Qualification substitution parameters □ YEAR.01 = 1999 □ SDATE.01 = 1999-02-22 □ CATEGORY.01 = Sports □ CATEGORY.02 = Books □ CATEGORY.03 = Home |
Results and analysis
Query execution time
Q8 | Q19 | Q38 | Q41 | Q42 | Q45 | Q52 | Q55 | Q84 | Q98 | |
---|---|---|---|---|---|---|---|---|---|---|
Hive | 750 | 430 | 440 | 559 | 638 | 650 | 680 | 591 | 603 | 493 |
SparkSQL | 91 | 53 | 49 | 62 | 60 | 71 | 91 | 62 | 81 | 48 |
Chabok | 21 | 18 | 17 | 22 | 21 | 27 | 32 | 19 | 29 | 18 |
Load balancing
Method | Balance_Factor (s) |
---|---|
Hive | 1080 |
Spark-SQL | 208 |
Chabok | 22 |
Network congestion
Method | Data size (MB) |
---|---|
Hive | 2241 |
Spark-SQL | 1855 |
Chabok | 204 |