2019 | Book

Benchmarking, Measuring, and Optimizing

First BenchCouncil International Symposium, Bench 2018, Seattle, WA, USA, December 10-13, 2018, Revised Selected Papers

About this book

This book constitutes the refereed proceedings of the First International Symposium on Benchmarking, Measuring, and Optimization, Bench 2018, held in Seattle, WA, USA, in December 2018.

The 20 full papers presented were carefully reviewed and selected from 51 submissions.
The papers are organized in topical sections named: AI Benchmarking; Cloud; Big Data; Modelling and Prediction; and Algorithm and Implementations.

Table of Contents

Frontmatter
Correction to: MiDBench: Multimodel Industrial Big Data Benchmark
Yijian Cheng, Mengqian Cheng, Hao Ge, Yuhe Guo, Yuanzhe Hao, Xiaoguang Sun, Xiongpai Qin, Wei Lu, Yueguo Chen, Xiaoyong Du

AI Benchmarking

Frontmatter
AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking
Abstract
AI benchmarking provides yardsticks for measuring and evaluating innovative AI algorithms, architectures, and systems. This paper presents AIBench, a joint research and engineering effort on datacenter AI benchmarks, coordinated by BenchCouncil with several academic and industrial partners. The benchmarks are publicly available from http://www.benchcouncil.org/AIBench/index.html. Presently, AIBench covers 16 problem domains (image classification, image generation, text-to-text translation, image-to-text, image-to-image, speech-to-text, face embedding, 3D face recognition, object detection, video prediction, image compression, recommendation, 3D object reconstruction, text summarization, spatial transformer, and learning to rank) and two end-to-end application AI benchmarks. Meanwhile, the AI benchmark suites for high performance computing (HPC), IoT, and edge are also released on the BenchCouncil website. This is by far the most comprehensive AI benchmarking research and engineering effort.
Wanling Gao, Chunjie Luo, Lei Wang, Xingwang Xiong, Jianan Chen, Tianshu Hao, Zihan Jiang, Fanda Fan, Mengjia Du, Yunyou Huang, Fan Zhang, Xu Wen, Chen Zheng, Xiwen He, Jiahui Dai, Hainan Ye, Zheng Cao, Zhen Jia, Kent Zhan, Haoning Tang, Daoyi Zheng, Biwei Xie, Wei Li, Xiaoyu Wang, Jianfeng Zhan
HPC AI500: A Benchmark Suite for HPC AI Systems
Abstract
In recent years, with the trend of applying deep learning (DL) in high performance scientific computing, the unique characteristics of emerging DL workloads in HPC raise great challenges in designing and implementing HPC AI systems. The community needs a new yardstick for evaluating future HPC systems. In this paper, we propose HPC AI500, a benchmark suite for evaluating HPC systems that run scientific DL workloads. Covering the most representative scientific fields, each workload in HPC AI500 is based on real-world scientific DL applications. Currently, we choose 14 scientific DL benchmarks from the perspectives of application scenarios, data sets, and software stacks. We propose a set of metrics for comprehensively evaluating HPC AI systems, considering accuracy and performance as well as power and cost. We provide a scalable reference implementation of HPC AI500. The specification and source code are publicly available from http://www.benchcouncil.org/HPCAI500/index.html. Meanwhile, the AI benchmark suites for datacenter, IoT, and edge are also released on the BenchCouncil website.
Zihan Jiang, Wanling Gao, Lei Wang, Xingwang Xiong, Yuchen Zhang, Xu Wen, Chunjie Luo, Hainan Ye, Xiaoyi Lu, Yunquan Zhang, Shengzhong Feng, Kenli Li, Weijia Xu, Jianfeng Zhan
Edge AIBench: Towards Comprehensive End-to-End Edge Computing Benchmarking
Abstract
In edge computing scenarios, the distribution of data and the collaboration of workloads across different layers raise serious performance, privacy, and security concerns. Edge computing benchmarking must therefore take an end-to-end view, considering all three layers: client-side devices, the edge computing layer, and cloud servers. Unfortunately, previous work ignores this crucial point. This paper presents BenchCouncil's coordinated effort on edge AI benchmarks, named Edge AIBench. In total, Edge AIBench models four typical application scenarios: ICU Patient Monitor, Surveillance Camera, Smart Home, and Autonomous Vehicle, with a focus on data distribution and workload collaboration across the three layers. Edge AIBench is publicly available from http://www.benchcouncil.org/EdgeAIBench/index.html. We also build an edge computing testbed with a federated learning framework to address performance, privacy, and security issues.
Tianshu Hao, Yunyou Huang, Xu Wen, Wanling Gao, Fan Zhang, Chen Zheng, Lei Wang, Hainan Ye, Kai Hwang, Zujie Ren, Jianfeng Zhan
AIoT Bench: Towards Comprehensive Benchmarking Mobile and Embedded Device Intelligence
Abstract
Thanks to increasing amounts of data and compute resources, deep learning has achieved many successes in various domains. Recently, researchers and engineers have been working to bring these intelligent algorithms to mobile and embedded devices. In this paper, we propose a benchmark suite, AIoT Bench, to evaluate the AI ability of mobile and embedded devices. Our benchmark (1) covers different application domains, e.g., image recognition, speech recognition, and natural language processing; (2) covers different platforms, including Android and Raspberry Pi; (3) covers different development frameworks, including TensorFlow and Caffe2; and (4) offers both end-to-end application workloads and micro workloads.
Chunjie Luo, Fan Zhang, Cheng Huang, Xingwang Xiong, Jianan Chen, Lei Wang, Wanling Gao, Hainan Ye, Tong Wu, Runsong Zhou, Jianfeng Zhan
A Survey on Deep Learning Benchmarks: Do We Still Need New Ones?
Abstract
Deep Learning has recently been gaining popularity. From the micro-architecture field to upper-layer end applications, much research has been proposed in the literature to advance the knowledge of Deep Learning, and Deep Learning benchmarking is one such hot spot in the community. Many Deep Learning benchmarks are already available, and new ones keep coming. However, we find few surveys that give an overview of these useful benchmarks, and few discussions of what has been done for Deep Learning benchmarking in the community and what is still missing. To fill this gap, this paper provides a survey of multiple high-impact Deep Learning benchmarks with training and inference support. We share our observations and discussions on these benchmarks. We believe the community still needs more benchmarks to capture different perspectives, along with a way for these benchmarks to converge toward a standard.
Qin Zhang, Li Zha, Jian Lin, Dandan Tu, Mingzhe Li, Fan Liang, Ren Wu, Xiaoyi Lu

Cloud

Frontmatter
Benchmarking VM Startup Time in the Cloud
Abstract
Elasticity is one of the primary reasons for the popularity of cloud computing. However, a frequent problem undermines this advantage: the time an acquired Virtual Machine (VM) takes to become ready for use. This processing time, known as VM startup time, depends on various factors and varies due to space-time trade-offs. Comparing VM startup time across these factors allows users to choose the VM that best fits their preferences. In this paper, we benchmark VM startup time in Amazon EC2, Microsoft Azure, and Google Cloud for factors such as instance type, time of day, instance location, cluster creation, and cluster resizing.
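As an illustration of the kind of measurement involved (a minimal sketch, not the paper's harness), the interval from a launch request until the provider reports the instance running can be timed with the AWS SDK for Python; the AMI ID, region, and instance type below are placeholders:

    import time
    import boto3  # AWS SDK for Python

    ec2 = boto3.client("ec2", region_name="us-west-2")

    start = time.monotonic()
    # Request a single VM; the ImageId is a placeholder AMI.
    resp = ec2.run_instances(ImageId="ami-12345678", InstanceType="t2.micro",
                             MinCount=1, MaxCount=1)
    instance_id = resp["Instances"][0]["InstanceId"]

    # Block until EC2 reports the instance as running, then report the delay.
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    print(f"VM startup time: {time.monotonic() - start:.1f} s")

A fuller harness would repeat this across instance types, regions, and times of day, which are exactly the factors the paper varies.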
Samiha Islam Abrita, Moumita Sarker, Faheem Abrar, Muhammad Abdullah Adnan
An Open Source Cloud-Based NoSQL and NewSQL Database Benchmarking Platform for IoT Data
Abstract
Internet of Things (IoT) is continually expanding, and the information being transmitted through IoT is often in large-scale in both volume and velocity. With its evolution, IoT raises new challenges such as throughput and scalability of software and database working with it. This is the reason that traditional techniques for data management and database operations cannot adopt the new challenges from IoT data. We need an efficient database system that can handle, store, and retrieve continuous, high-speed, and large-volume data, perform various database operations, and generate quick results. Recent developments of database technologies such as NoSQL and NewSQL database provides promising solutions to IoT. This paper proposes an extensible cloud-based open-source benchmarking framework on how these databases could work with IoT data. Using the framework, we compare the performances of VoltDB NewSQL and MongoDB NoSQL database systems on IoT data injection, transactional operations, and analytical operations.
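For a flavor of what an IoT data injection benchmark measures (a sketch, not the framework itself), one could time bulk inserts of synthetic sensor readings into MongoDB with pymongo; the database and collection names are illustrative:

    import time
    from datetime import datetime, timezone
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client.iot_bench.readings  # illustrative database/collection

    # Synthetic IoT readings: device id, timestamp, sensor value.
    batch = [{"device": i % 100,
              "ts": datetime.now(timezone.utc),
              "value": i * 0.1} for i in range(100_000)]

    start = time.monotonic()
    coll.insert_many(batch)
    elapsed = time.monotonic() - start
    print(f"{len(batch) / elapsed:,.0f} inserts/s")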
Arjun Pandya, Chaitanya Kulkarni, Kunal Mali, Jianwu Wang
Scalability Evaluation of Big Data Processing Services in Clouds
Abstract
Currently, many cloud providers deploy their big data processing systems as cloud services, which helps users conveniently manage and process their data in the cloud. Evaluating and comparing the scalability of different providers' big data processing services is an interesting and challenging problem. Most traditional benchmark tools focus on performance evaluation of big data processing systems, such as aggregated throughput and IOPS, but fail to conduct a quantitative analysis of scalability. In this paper, we propose a measurement methodology to quantify the scalability of big data processing services, making the scalability of cloud services comparable. We conduct a group of comparative experiments on AliCloud E-MapReduce and Baidu MRS, and collect their respective scalability characteristics under Hadoop and Spark workloads. The scalability characteristics observed in our work can help cloud users choose the cloud service platform best suited to setting up an optimized big data processing system for their specific goals.
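The abstract does not give the paper's exact formula, but a common way to quantify scalability is scaling efficiency: measured speedup divided by ideal speedup at each cluster size. A minimal sketch (the throughput numbers are made up):

    def scaling_efficiency(throughput: dict[int, float]) -> dict[int, float]:
        """Map node count -> efficiency relative to the smallest cluster.

        throughput: node count -> measured throughput (e.g., records/s).
        Ideal scaling doubles throughput when nodes double; efficiency 1.0
        means perfectly linear scaling.
        """
        base_nodes = min(throughput)
        base = throughput[base_nodes]
        return {n: (t / base) / (n / base_nodes) for n, t in throughput.items()}

    # Example: Spark sort throughput at 2/4/8/16 nodes (illustrative numbers).
    print(scaling_efficiency({2: 100.0, 4: 185.0, 8: 330.0, 16: 540.0}))
    # -> {2: 1.0, 4: 0.925, 8: 0.825, 16: 0.675}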
Xin Zhou, Congfeng Jiang, Yeliang Qiu, Tiantian Fan, Yumei Wang, Liangbin Zhang, Jian Wan, Weisong Shi
PAIE: A Personal Activity Intelligence Estimator in the Cloud
Abstract
Personal Activity Intelligence (PAI) is a recently proposed metric for physical activity tracking that takes into account continuous heart rate and other physical parameters. PAI plays an important role in informing users of the risk of premature cardiovascular disease and helps to promote physical activity. However, computing PAI is too expensive to provide timely feedback, which restricts its practical value for disease warning. In this paper, we present PAIE, a Personal Activity Intelligence Estimator based on massive heart rate data in the cloud. PAIE provides approximate PAI values with a desired level of statistical accuracy, at a much lower cost than computing the exact value. We design the PAI estimation framework in the cloud and propose a novel estimation mechanism that balances efficiency and accuracy. We analyze the PAI algorithm and formulate the statistical foundation that supports block-level stratified sampling, effective estimation of PAI, and error bounding. We experimentally validate our techniques on Storm, and the results demonstrate that PAIE can provide promising physical activity estimates for massive heart rate data in the cloud.
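To illustrate the statistical machinery named above (a generic sketch of block-level stratified sampling with a normal-approximation error bound, not PAIE's actual estimator or the PAI formula itself):

    import math
    import random

    def stratified_estimate(blocks, frac=0.1, z=1.96):
        """Estimate the mean over all samples in `blocks` (each block a list
        of heart-rate-derived values) by sampling a fraction of each block,
        returning the estimate and a ~95% error bound."""
        total = sum(len(b) for b in blocks)
        est, var = 0.0, 0.0
        for block in blocks:
            k = max(2, int(len(block) * frac))  # sample size in this stratum
            sample = random.sample(block, k)
            mean = sum(sample) / k
            s2 = sum((x - mean) ** 2 for x in sample) / (k - 1)
            w = len(block) / total              # stratum weight
            est += w * mean
            var += (w ** 2) * s2 / k            # variance of the estimator
        return est, z * math.sqrt(var)

    # Example: three hour-long blocks of per-minute heart-rate values.
    blocks = [[70 + random.gauss(0, 5) for _ in range(60)] for _ in range(3)]
    print(stratified_estimate(blocks, frac=0.2))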
Yingjie Shi, Fang Du, Yanyan Zhang, Zhi Li, Tao Zhang
DCMIX: Generating Mixed Workloads for the Cloud Data Center
Abstract
Consolidating multiple tenants' workloads on common computing infrastructure is a popular way for cloud data centers to improve system resource utilization. A typical deployment in the modern cloud data center co-locates online services and offline analytics applications. However, co-location inevitably brings competition for system resources, such as CPU and memory, and this competition means the user experience (request latency) of the online services cannot be guaranteed. More and more efforts try to assure the latency requirements of services as well as system resource efficiency, and mixing cloud workloads while quantifying the resulting resource competition is a prerequisite for solving this problem. We propose DCMIX, a benchmark suite of mixed cloud workloads that covers multiple application fields and different latency requirements; mixtures of workloads can be generated by specifying a mixed execution sequence in DCMIX. We also propose the system entropy metric, derived from basic system-level performance monitoring metrics, as a quantitative measure of the disturbance caused by system resource competition. Finally, compared with the Service-Standalone mode (executing only the online service workload), we found that the 99th percentile latency of the service workload under the Mixed mode (workloads executed together) increased 3.5 times, while node resource utilization increased 10 times. This implies that mixed workloads can reflect the co-located deployment scenario of the cloud data center. Furthermore, the system entropy of the mixed deployment mode was 4 times larger than that of the Service-Standalone mode, which implies that system entropy can reflect the disturbance caused by system resource competition. We also found that isolation mechanisms, especially CPU affinity, have some effect for mixed workloads.
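The abstract does not reproduce the paper's precise definition of system entropy. Purely as an illustration of turning monitor metrics into an entropy-style disturbance score (an assumption, not DCMIX's formula), one might compute Shannon entropy over the relative variability of metrics such as CPU, memory, and I/O utilization:

    import math

    def system_entropy(metric_series: dict[str, list[float]]) -> float:
        """Illustrative only: Shannon entropy over the relative variability
        of system-level monitor metrics sampled under load."""
        def stddev(xs):
            m = sum(xs) / len(xs)
            return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

        var = {name: stddev(v) for name, v in metric_series.items()}
        total = sum(var.values()) or 1.0
        probs = [v / total for v in var.values() if v > 0]
        return -sum(p * math.log2(p) for p in probs)

    # Example: CPU, memory, and disk I/O utilization samples under mixed load.
    print(system_entropy({"cpu": [0.2, 0.9, 0.4],
                          "mem": [0.5, 0.55, 0.6],
                          "io":  [0.1, 0.8, 0.3]}))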
Xingwang Xiong, Lei Wang, Wanling Gao, Rui Ren, Ke Liu, Chen Zheng, Yu Wen, Yi Liang
Machine-Learning Based Spark and Hadoop Workload Classification Using Container Performance Patterns
Abstract
Big data Hadoop and Spark applications are deployed on infrastructure managed by resource managers such as Apache YARN, Mesos, and Kubernetes, and run in constructs called containers. These applications often require extensive manual tuning to achieve acceptable levels of performance. While there have been several promising attempts to develop automatic tuning systems, none are currently robust enough to handle realistic workload conditions. Big data workload analysis research performed to date has focused mostly on system-level parameters, such as CPU and memory utilization, rather than higher-level container metrics. In this paper, we present the first detailed experimental analysis of container performance metrics in Hadoop and Spark workloads. We demonstrate that big data workloads show unique patterns of container creation, completion, response time, and relative standard deviation of response time. Based on these observations, we built a machine-learning-based workload classifier with a workload classification accuracy of 83% and a workload change detection accuracy of 74%. Our experimental results are an important step towards developing automatically tuned, fully autonomous cloud infrastructure for big data analytics.
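A minimal sketch of this kind of classifier using scikit-learn (the feature rows, labels, and values below are invented for illustration, not the authors' pipeline):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Each row: [containers created/min, containers completed/min,
    #            mean response time (s), relative stddev of response time]
    X = [[12, 11, 4.2, 0.31], [3, 3, 40.5, 0.08], [25, 20, 2.1, 0.55],
         [4, 4, 38.0, 0.12], [14, 12, 3.9, 0.29], [2, 2, 45.1, 0.10]]
    y = ["spark-sql", "hadoop-sort", "spark-sql",
         "hadoop-sort", "spark-sql", "hadoop-sort"]  # workload labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                              random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))  # classification accuracy on held-out data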
Mikhail Genkin, Frank Dehne, Pablo Navarro, Siyu Zhou
Testing Raft-Replicated Database Systems
Abstract
Replication based on the Raft protocol is essential in modern distributed, highly available database systems. Although Raft is easy to understand and implement, testing a Raft-replicated database system is still a challenging task due to multiple sources of nondeterminism. Conventional testing techniques, such as unit, integration, and stress testing, are ineffective in preventing serious but subtle bugs from reaching production. This paper first abstracts general Raft-replicated database systems and introduces evaluation metrics for them, defined from several aspects including correctness, performance, and scalability. Then, we present test dimensions for the design of test cases, covering various fault types, workloads, and system configurations. Finally, we describe test results for an open-source Raft-replicated database system.
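As a sketch of how such test dimensions can be combined into concrete test cases (the dimension values and the run_case hook are hypothetical):

    from itertools import product

    FAULTS = ["leader-crash", "follower-crash", "network-partition",
              "message-delay", "disk-stall"]
    WORKLOADS = ["read-heavy", "write-heavy", "mixed"]
    CONFIGS = [{"replicas": 3}, {"replicas": 5}]

    # Cartesian product of the dimensions: one case per combination.
    for fault, workload, config in product(FAULTS, WORKLOADS, CONFIGS):
        case = {"fault": fault, "workload": workload, **config}
        # A real harness's run_case(case) would inject the fault while
        # driving the workload, then check correctness (e.g., no lost or
        # inconsistent writes), performance, and recovery behavior.
        print(case)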
Guohao Ding, Weining Qian, Peng Cai, Tianze Pang, Qiong Zhao

Big Data

Frontmatter
Benchmarking for Transaction Processing Database Systems in Big Data Era
Abstract
Benchmarking is essential to the development of database management systems. A benchmark runs a set of well-defined data and workloads on a specific hardware configuration to gather results for its measurements. It is widely used for evaluating new technology and comparing different systems, and thus promotes the progress of database systems. New databases continue to be designed and released for different application requirements, and most state-of-the-art benchmarks are likewise designed for specific types of applications. Based on our experience, however, we argue that, given the characteristics of data and workloads in the big data era, benchmarking transaction processing (TP) databases must address domain-specific needs and reflect the 4V properties (volume, velocity, variety, and veracity). Driven by the critical transaction processing requirements of new applications, we see an explosion of innovative scalable databases and new processing architectures built on traditional databases to handle highly intensive transaction workloads. Such workloads, called SecKill, can saturate traditional database systems; examples include Tmall's "11·11" sale, ticket booking during the China Spring Festival, and stock exchange applications.
In this paper, we first analyze SecKill applications and their implementation logic, and summarize and abstract the business model in detail. Then, we propose a new benchmark called PeakBench for simulating SecKill applications, including workload characteristics definition, workload distribution simulation, and logic implementation. Additionally, we define new evaluation metrics for performance comparison among DBMSs under different implementation architectures, from both micro and macro points of view. Finally, we provide a package of tools for simulation and evaluation.
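To give a feel for what makes SecKill workloads hard (an illustrative sketch, not PeakBench's generator; all rates are invented), request arrivals can be modeled as a steady baseline with an extreme short-lived spike at sale-opening time:

    import random

    def seckill_arrivals(seconds=60, base_rate=100, peak_rate=50_000,
                         peak_start=30, peak_len=5):
        """Yield (second, request count): a steady baseline plus a short
        flash-sale spike, with +/-10% jitter (rates are illustrative)."""
        for t in range(seconds):
            in_peak = peak_start <= t < peak_start + peak_len
            rate = peak_rate if in_peak else base_rate
            yield t, int(rate * random.uniform(0.9, 1.1))

    for t, n in seckill_arrivals():
        print(t, n)

The two-to-three orders of magnitude between baseline and peak is what distinguishes this distribution from the uniform or Zipfian load of conventional TP benchmarks.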
Chunxi Zhang, Yuming Li, Rong Zhang, Weining Qian, Aoying Zhou
UMDISW: A Universal Multi-Domain Intelligent Scientific Workflow Framework for the Whole Life Cycle of Scientific Data
Abstract
Existing scientific data management systems rarely manage scientific data from a whole-life-cycle perspective, yet the value-creating steps defined throughout the cycle essentially constitute a scientific workflow. The scientific workflow systems developed by many organizations meet their own domain-oriented needs well, but from the perspective of scientific data as a whole, a common framework spanning multiple domains is lacking. At the same time, some systems require scientists to understand the underlying internals of the system, which effectively increases their workload and research costs. In this context, this paper proposes a universal multi-domain intelligent scientific workflow framework (UMDISW), which builds a general model usable in multiple domains by defining directed graphs and descriptors, and makes the underlying layers transparent so that scientists can focus on high-level experimental design. On this basis, the paper also uses scientific data as a driving force, incorporating a mechanism that intelligently recommends algorithms into the workflow to reduce the workload of scientific experiments and provide decision support for exploring new scientific discoveries.
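As a sketch of the directed-graph-plus-descriptor idea (the step names and descriptor fields are hypothetical, not UMDISW's schema), a workflow can be a DAG of described steps executed in topological order:

    from graphlib import TopologicalSorter  # Python 3.9+

    # Hypothetical descriptors: step name -> (dependencies, description).
    workflow = {
        "ingest":  (set(),        "load raw instrument data"),
        "clean":   ({"ingest"},   "filter and calibrate"),
        "analyze": ({"clean"},    "run the domain algorithm"),
        "publish": ({"analyze"},  "archive results and metadata"),
    }

    deps = {step: d for step, (d, _) in workflow.items()}
    for step in TopologicalSorter(deps).static_order():
        print(f"running {step}: {workflow[step][1]}")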
Qi Sun, Yue Liu, Wenjie Tian, Yike Guo, Bocheng Li
MiDBench: Multimodel Industrial Big Data Benchmark
Abstract
Driven by decades of increasing industrial data, big data systems have evolved rapidly. The diversity and complexity of industrial applications raise great challenges for companies choosing appropriate big data systems, so benchmarking big data systems has become a research hotspot. Most state-of-the-art benchmarks focus on specific domains or data formats.
This paper presents our efforts on a multimodel industrial big data benchmark, called MiDBench. MiDBench focuses on big data systems in crane assembly, wind turbine monitoring, and simulation results management scenarios, which correspond to bills of materials (a.k.a. BoM), time series, and unstructured data formats, respectively. Currently, we have chosen and developed eleven typical workloads across these three application domains in our benchmark suite, and we generate synthetic data by scaling the sample data. For the sake of fairness, we chose the widely accepted metrics of throughput and response time. Through the above, we have established a credible benchmark suite applicable to high-end manufacturing. Overall, experiment results show that Neo4j (representing graph databases) performs better than Oracle (representing relational databases) for processing BoM data, IoTDB is better than InfluxDB for time series queries and stress tests, and MongoDB performs better than Elasticsearch in the simulation results management domain.
Yijian Cheng, Mengqian Cheng, Hao Ge, Yuhe Guo, Yuanzhe Hao, Xiaoguang Sun, Xiongpai Qin, Wei Lu, Yueguo Chen, Xiaoyong Du

Modelling and Prediction

Frontmatter
Power Characterization of Memory Intensive Applications: Analysis and Implications
Abstract
DRAM is a significant source of server power consumption, especially when the server runs memory-intensive applications. Current power-aware scheduling assumes that DRAM is as energy proportional as other components. However, the non-energy-proportionality of DRAM significantly affects the power and energy consumption of the whole server system when running memory-intensive applications. Thus, good knowledge of server power characteristics under memory-intensive workloads can enable better workload placement and power reduction. In this paper, we investigate the power characteristics of memory-intensive applications on real rack servers of different generations. Through comprehensive analysis we find that (1) server power consumption changes with workload intensity and the number of concurrent execution threads, yet fully utilized memory systems are not the most energy efficient; (2) the installed memory capacity, i.e., the memory capacity per processor core, has a significant impact on an application's performance and server power consumption even when the memory system is not fully utilized; and (3) memory utilization is not always a good indicator of server power consumption when running memory-intensive applications. Our experiments show that hardware configuration, workload type, and the number of concurrently running threads have a significant impact on a server's energy efficiency when running memory-intensive applications. Our findings provide useful insights and guidance to system designers, as well as data center operators, for energy-efficiency-aware job scheduling and power reduction.
Yeliang Qiu, Congfeng Jiang, Tiantian Fan, Yumei Wang, Liangbin Zhang, Jian Wan, Weisong Shi
Multi-USVs Coordinated Detection in Marine Environment with Deep Reinforcement Learning
Abstract
In recent years, with the rapid development of deep reinforcement learning, the technique has attracted more and more attention in military and civilian fields. Compared with ship monitoring and other technical means, unmanned surface vehicles (USVs) have significant advantages in the marine environment and are gradually drawing the attention of academia and marine management departments. However, single-agent reinforcement learning does not fit multi-USV cases well because of the non-stationary environment and complex multi-agent interactions. In order to learn cooperation models among USVs, we propose a multi-USV coordinated detection method based on DDPG, in which an LSTM stores the sequence of states and actions. Moreover, to suit the algorithm, we model the marine environment with every USV treated as an agent. Experiments are conducted in simulation, and the results verify the effectiveness of the proposed method.
Ruiying Li, Rui Wang, Xiaohui Hu, Kai Li, Haichang Li
EC-Bench: Benchmarking Onload and Offload Erasure Coders on Modern Hardware Architectures
Abstract
Various Erasure Coding (EC) schemes based on hardware acceleration have been proposed in the community to leverage the advanced compute capabilities of modern data centers, such as Intel ISA-L onload EC coders and Mellanox InfiniBand offload EC coders. These EC coders can play a vital role in designing next-generation distributed storage systems. Unfortunately, there is no unified and easy way for distributed storage systems researchers and designers to benchmark, measure, and characterize the performance of these different EC coders. In this context, we propose a unified benchmark suite, called EC-Bench, to help users benchmark both onload and offload EC coders on modern hardware architectures. EC-Bench provides both encoding and decoding benchmarks with tunable parameter support. A rich set of metrics, including latency, actual and normalized throughput, CPU utilization, and cache pressure, can be reported through EC-Bench. Evaluations with EC-Bench demonstrate that hardware-optimized offload coders (e.g., Mellanox-EC) have lower demands on CPU and cache than onload coders, and that highly optimized onload coders (e.g., Intel ISA-L) outperform offload coders for most configurations.
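As an illustration of the throughput metrics named above (a harness skeleton with a trivial stand-in coder, not EC-Bench itself; the parity-based normalization is one plausible choice, not necessarily the paper's):

    import time

    def bench_encode(encode, data: bytes, k: int, m: int, iters: int = 10):
        """Time a (k, m) encode and report raw and parity-normalized
        throughput in MiB/s. `encode` is any coder under test."""
        start = time.monotonic()
        for _ in range(iters):
            encode(data, k, m)
        elapsed = time.monotonic() - start
        raw = iters * len(data) / elapsed / 2**20  # input MiB/s
        return raw, raw * (k + m) / k              # account for parity output

    # Trivial stand-in coder: split into k=4 strided chunks plus XOR parity.
    def xor_code(data, k, m):
        chunks = [data[i::k] for i in range(k)]
        parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*chunks))
        return chunks, [parity] * m

    print(bench_encode(xor_code, b"\x00" * 2**20, k=4, m=2))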
Haiyang Shi, Xiaoyi Lu, Dhabaleswar K. Panda

Algorithm and Implementations

Frontmatter
Benchmarking SpMV Methods on Many-Core Platforms
Abstract
Sparse matrix-vector multiplication (SpMV) is an essential kernel in many HPC and data center applications. Meanwhile, emerging many-core hardware provides promising computational power and is widely used for acceleration. Many methods and formats have been proposed to improve the performance of SpMV on many-core platforms. However, there is still a lack of a comprehensive comparison of SpMV methods showing their performance differences on sparse matrices with various sparsity patterns. Moreover, there is still no systematic work bridging the gap between SpMV performance and sparsity pattern.
In this paper, we investigate the performance of 27 SpMV methods with 1500+ sparse matrices on two many-core platforms: Intel Xeon Phi (Knights Landing 7250) and Nvidia GPGPU (Tesla M40). Our work shows that no single SpMV method is optimal for all sparsity patterns, but some methods achieve approximately the best performance on most sparse matrices. We further select 13 features to describe the sparsity pattern and analyze their correlations with the performance of each SpMV method. Our observations should help researchers and practitioners better understand SpMV performance and guide the selection of a suitable SpMV method.
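A minimal example of timing one SpMV method (SciPy's CSR kernel on a random matrix; the size and density are arbitrary, and real comparisons use matrices with structured sparsity patterns):

    import time
    import numpy as np
    import scipy.sparse as sp

    n, density = 100_000, 1e-4
    A = sp.random(n, n, density=density, format="csr", dtype=np.float64)
    x = np.random.rand(n)

    iters = 100
    start = time.monotonic()
    for _ in range(iters):
        y = A @ x                        # CSR sparse matrix-vector multiply
    per_spmv = (time.monotonic() - start) / iters
    gflops = 2 * A.nnz / per_spmv / 1e9  # ~2 flops per stored nonzero
    print(f"{per_spmv * 1e3:.2f} ms per SpMV, {gflops:.2f} GFLOP/s")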
Biwei Xie, Zhen Jia, Yungang Bao
Benchmarking Parallel K-Means Cloud Type Clustering from Satellite Data
Abstract
The study of clouds, i.e., where they occur and what their characteristics are, plays a key role in the understanding of climate change. Clustering is a common machine learning technique used in atmospheric science to classify cloud types. Many parallelism techniques, e.g., MPI, OpenMP, and Spark, can achieve efficient and scalable clustering of large-scale satellite observation data. In order to understand their differences, this paper studies and compares three different approaches to parallel clustering of satellite observation data. Benchmarking experiments with k-means clustering are conducted with three parallelism techniques, namely OpenMP, OpenMP+MPI, and Spark, on an HPC cluster using up to 16 nodes.
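For reference, a single-node baseline of the same computation (scikit-learn's k-means on synthetic stand-in data, not the paper's OpenMP/MPI/Spark codes; the feature names and cluster count are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for satellite observations: one row of features per pixel
    # (e.g., cloud-top pressure and optical thickness, scaled to [0, 1]).
    rng = np.random.default_rng(0)
    X = rng.random((100_000, 2))

    km = KMeans(n_clusters=8, n_init=10, random_state=0)  # 8 cloud regimes
    labels = km.fit_predict(X)
    print(km.cluster_centers_)

Timing this baseline against the distributed implementations at increasing node counts is exactly the comparison the paper benchmarks.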
Carlos Barajas, Pei Guo, Lipi Mukherjee, Susan Hoban, Jianwu Wang, Daeho Jin, Aryya Gangopadhyay, Matthias K. Gobbert
Backmatter
Metadata
Title
Benchmarking, Measuring, and Optimizing
Editors
Chen Zheng
Jianfeng Zhan
Copyright Year
2019
Electronic ISBN
978-3-030-32813-9
Print ISBN
978-3-030-32812-2
DOI
https://doi.org/10.1007/978-3-030-32813-9
