Background
Theoretical analysis
Apache Pig
MapReduce
Apache Tez
Parameters chosen
Parameters | MapReduce | Apache Tez |
---|---|---|
Types of queries | MapReduce supports batch-oriented queries [7] | Apache Tez supports interactive queries |
Usability | MapReduce is the backbone of the Hadoop ecosystem, and Apache Pig relies on this framework | Apache Tez can also serve as the execution framework for Apache Pig, and it is especially useful in interactive scenarios |
Processing model | MapReduce always requires a map phase before the reduce phase | A single map phase can be followed by multiple reduce phases |
Hadoop version | MapReduce is the backbone of Hadoop and is available in all Hadoop versions | Apache Tez is available in Apache Hadoop 2.0 and above |
Response time | Slower, because HDFS is accessed after every map and reduce phase | Faster, because the task is split into fewer jobs and HDFS is accessed less often |
Temporary data storage | Stores temporary data in HDFS after every map and reduce phase [8] | Apache Tez does not write temporary data to HDFS, so it is more efficient |
Usage of Hadoop containers | MapReduce divides the task into more jobs, so more containers are required | Apache Tez reduces this inefficiency by dividing the task into fewer jobs and by reusing existing containers |
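The processing-model and temporary-storage rows can be illustrated with a toy cost model. The sketch below is not taken from either framework. It assumes that every reduce stage in a pipeline forces its own MapReduce job, that each hand-off between consecutive jobs goes through HDFS, and that Apache Tez runs the same pipeline as one job with no intermediate HDFS writes. The names `PlanCost`, `mapreduce_plan` and `tez_plan` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanCost:
    """Hypothetical summary of how an engine would run a pipeline."""
    jobs: int                      # separately scheduled jobs
    intermediate_hdfs_writes: int  # hand-offs between jobs that pass through HDFS


def mapreduce_plan(reduce_stages: int) -> PlanCost:
    """Assumption: each reduce stage needs its own map -> reduce job."""
    if reduce_stages < 1:
        raise ValueError("expected at least one reduce stage")
    # Every job after the first reads what the previous job wrote to HDFS.
    return PlanCost(jobs=reduce_stages,
                    intermediate_hdfs_writes=reduce_stages - 1)


def tez_plan(reduce_stages: int) -> PlanCost:
    """Assumption: one job, a map stage followed by all reduce stages."""
    if reduce_stages < 1:
        raise ValueError("expected at least one reduce stage")
    # Intermediate results are handed between stages without touching HDFS.
    return PlanCost(jobs=1, intermediate_hdfs_writes=0)


if __name__ == "__main__":
    for stages in (1, 2, 3):
        print(stages, mapreduce_plan(stages), tez_plan(stages))
```

Under these assumptions the gap between the two engines grows with the number of reduce stages, which is the qualitative trend the table describes. Real execution plans depend on how Apache Pig compiles the script.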
Experimental evaluation
Name | No of records | No of attributes |
---|---|---|
Geolocation | 8013 | 10 |
Drivermilage | 101 | 2 |
Experimental setup
Experimental results and metrics
Effect on runtime with increase in configuration of cluster nodes
No of jobs
No of containers required
Parameter chosen | MapReduce | Apache Tez |
---|---|---|
No of jobs | 2 jobs | 1 job |
No of containers | 1st job = 2 containers; 2nd job = 3 containers | 1 job = 3 containers |
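The totals implied by the table above can be tallied with a short script. This is a minimal sketch that uses only the per-job counts reported in the table. It assumes the counts are container requests per job and that MapReduce does not share containers between its two jobs. The names `CONTAINERS_PER_JOB` and `summarize` are hypothetical.

```python
# Containers per job, copied from the results table.
CONTAINERS_PER_JOB = {
    "MapReduce": [2, 3],   # 1st job, 2nd job
    "Apache Tez": [3],     # single job
}


def summarize(containers_per_job):
    """Return the number of jobs and total containers for each engine."""
    summary = {}
    for engine, per_job in containers_per_job.items():
        summary[engine] = {
            "jobs": len(per_job),
            # Assumption: no container is shared between jobs.
            "containers": sum(per_job),
        }
    return summary


if __name__ == "__main__":
    for engine, totals in summarize(CONTAINERS_PER_JOB).items():
        print(f"{engine}: {totals['jobs']} job(s), "
              f"{totals['containers']} container(s) requested")
```

Under that assumption, MapReduce requests five containers across its two jobs, while Apache Tez requests three for its single job.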