By ‘compute benchmark’ we mean that a reference workload is used to measure the performance of a system with respect to some metric. Early, well-known benchmarks include Whetstone (Curnow and Wichmann, 1976) and Dhrystone (Weicker, 1984). Both are examples of so-called synthetic benchmarks, which are designed to statistically mimic the CPU usage of a ‘target’ set of workloads. By knowing the performance of synthetic benchmarks, users can estimate the performance of their real workloads. Whetstone, for example, mimics typical workloads that were being run at the UK National Physical Laboratory. Both benchmarks repeatedly execute a loop and report results in terms of loop iterations per second. Reporting work done per second in a machine-independent manner makes them suitable for comparing different architectures, overcoming some of the architecture-dependent limitations of MIPS and FLOPS.
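As a minimal sketch of this reporting style (illustrative only, not actual Whetstone or Dhrystone code), a synthetic benchmark of this kind amounts to timing a fixed loop body and dividing the number of iterations completed by the elapsed time:

```python
import time

def synthetic_loop_body(x):
    # Stand-in for the mix of arithmetic, assignments and branches that a
    # synthetic benchmark statistically models; not real Whetstone/Dhrystone code.
    for i in range(1, 100):
        x = (x * i + 7) % 10007
    return x

def iterations_per_second(n_iterations=100_000):
    start = time.perf_counter()
    acc = 1
    for _ in range(n_iterations):
        acc = synthetic_loop_body(acc)
    elapsed = time.perf_counter() - start
    # Work done per second: a machine-independent figure that can be
    # compared across architectures, unlike raw MIPS or FLOPS.
    return n_iterations / elapsed

if __name__ == "__main__":
    print(f"{iterations_per_second():,.0f} loop iterations per second")
```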
Whilst synthetic benchmarks focus on statistical CPU usage, kernel benchmarks focus on the task a ‘target’ workload is trying to accomplish. Typically, they extract loops of code from a target workload which are considered to be of core importance. A well-known example is the Livermore Loops benchmark, developed at Lawrence Livermore National Laboratory, which executes 24 different loops ranging from Monte Carlo search to the calculation of inner products (Weicker, 1990). These loops were chosen as they correspond to common numerical computations in use at the laboratory.
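To make the distinction concrete, the sketch below times a simple inner-product kernel in isolation; this is an illustration of the kernel-benchmark approach rather than one of the original Livermore Loops:

```python
import time

def inner_product(a, b):
    # The extracted 'kernel': the core loop of a target numerical workload.
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def time_kernel(n=1_000_000, repeats=5):
    a = [float(i % 97) for i in range(n)]
    b = [float(i % 89) for i in range(n)]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        inner_product(a, b)
        timings.append(time.perf_counter() - start)
    # Report the best observed time for the kernel alone.
    return min(timings)

if __name__ == "__main__":
    print(f"inner product kernel: {time_kernel():.4f} s")
```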
A more recent example of the kernel benchmark approach is the OpenDwarfs benchmark suite. This is based on the so-called Computational Dwarfs, of which there are now 13. The original definition included, for example, Sparse Linear Algebra and Monte Carlo methods. The OpenDwarfs suite has a benchmark workload for each dwarf. Unfortunately, at the time of writing, all bar one of the benchmarks within the OpenDwarfs suite are classified as beta, and indeed attempts to compile and run the beta benchmarks produced errors.
Both early synthetic and kernel benchmarks suffer from similar problems, one of which is how well they correlate with workloads outside of their chosen target sets. A further problem is that, due to their small size, they fit into the CPU instruction cache. As a consequence, their execution time will not be affected by the access speeds associated with the various levels of the memory hierarchy. However, the performance of real-world workloads is very much dependent upon the memory hierarchy. Further, real-world workloads have more complex instruction dependencies than are found in either synthetic or kernel benchmarks, which leads to hazards in pipelined CPUs. Indeed, such issues have long been recognised, with the author of Dhrystone stating that (EEMBC, 1999):
‘Although the Dhrystone benchmark that I published in 1984 was useful at the time...it cannot claim to be useful for modern workloads and CPUs because it is so short, it fits in on-chip caches, and fails to stress the memory system’
Difficulties in constructing synthetic or kernel benchmarks which stress systems sufficiently to report ‘realistic’ performance have led to a preference for so-called ‘real world’ benchmarks. These are CPU-bound applications in common use, run with realistic inputs. Such a preference is typified by the Standard Performance Evaluation Corporation (SPEC). SPEC is a not-for-profit organisation consisting of leading industry players such as Intel, Red Hat and AWS. SPEC makes available a number of different benchmarking suites, with SPEC CPU 2006 being one of the most well-known and widely used; it is described as a ‘...measure of compute intensive performance across the widest practical range of hardware using workloads developed from real user applications’.
SPEC CPU 2006 is made up of two suites, SPECint 2006 and SPECfp 2006; the former consists of 12 integer benchmarks and the latter of 17 floating point ones. The current version is CPU 2006 v2, last updated in September 2011. SPEC also specifies that, where possible, the input set for any benchmark should consist of at least three files. This helps to ensure that the results obtained are not a feature of one particular input. Looking ahead, when we run the GNU Go benchmark, for example, we do so over an input set containing 500 games.
Individual benchmarks can be run, in which case execution time is reported, or the whole suite can be run, in which case a composite known as the SPECmark is reported. To calculate the SPECmark, first the execution times of all the individual benchmarks are obtained. Next, each is normalised by dividing the corresponding execution time on a reference machine (a Sun Ultra Enterprise 2 workstation with a 296 MHz UltraSPARC II processor) by the measured time. These ratios are known as SPECratios and are dimensionless. Finally, the geometric mean of the SPECratios is calculated to obtain the SPECmark.
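The following sketch of this calculation uses hypothetical execution times rather than real SPEC results:

```python
from math import prod

def specratios(measured, reference):
    # SPECratio for each benchmark: reference time divided by measured time,
    # so larger values indicate a faster system under test.
    return [ref / t for ref, t in zip(reference, measured)]

def specmark(measured, reference):
    ratios = specratios(measured, reference)
    # Composite score: the geometric mean of the dimensionless SPECratios.
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical execution times (seconds) for three benchmarks.
reference_times = [9000.0, 12000.0, 6000.0]   # reference machine
measured_times = [1800.0, 4000.0, 1500.0]     # system under test

print(specmark(measured_times, reference_times))  # ratios 5, 3, 4 -> ~3.9
```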
These composite metrics are not without issues. The SPECmark is a non-linear metric, which means that if machine A has a score twice that of machine B, it does not necessarily follow that machine A will run applications twice as fast as B. This makes interpretation of SPECmarks across a range of machines difficult without further information, such as the SPECratios, from which execution times can be determined. More significantly, the SPECmark is dependent upon the reference machine. Indeed, Gustafson and Snell (1995) state that the SPECmark of the reference machine is 3, indicating that it is 3 times faster than itself! Further, Smith (1988) shows that, by using different reference machines, it is possible for machine A to appear both faster and slower than machine B.
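To illustrate the kind of reversal Smith describes, the sketch below uses hypothetical timings for two programs and combines the normalised ratios with an arithmetic mean, the setting Smith analyses; each machine then appears faster than the other depending on which is chosen as the reference:

```python
def arithmetic_mean_of_ratios(measured, reference):
    # Normalise each program's time against the reference machine, then average.
    ratios = [ref / t for ref, t in zip(reference, measured)]
    return sum(ratios) / len(ratios)

# Hypothetical execution times (seconds) for two programs on machines A and B.
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

# With A as the reference, B appears roughly 5 times faster than A ...
print(arithmetic_mean_of_ratios(times_a, reference=times_a))  # 1.0
print(arithmetic_mean_of_ratios(times_b, reference=times_a))  # 5.05

# ... yet with B as the reference, A appears roughly 5 times faster than B.
print(arithmetic_mean_of_ratios(times_a, reference=times_b))  # 5.05
print(arithmetic_mean_of_ratios(times_b, reference=times_b))  # 1.0
```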
Whilst the SPEC CPU 2006 suite was developed before the launch of Public Infrastructure Clouds, we note the inclusion of the suite in Google’s PerfKitBenchmarker for benchmarking Clouds. Indeed, PerfKitBenchmarker also includes the pgbench benchmark, which we make use of in section 5.6. More recently, SPEC have released the SPEC Cloud IaaS 2016 benchmarking suite, which aims to measure both provisioning and runtime performance, with the latter covering CPU and I/O performance. Interestingly, performance is measured with respect to a group of instances working together to complete a task, rather than the performance of individual instances. Whilst our focus is on individual instances, future work will consider the impact of variation amongst coupled instances, as discussed in section 8.4.6. Finally, as noted in section 3.2, variation in provisioning and general service response does exist; however, as discussed in section 1.4, we are not concerned with these issues.