A Performance Brokerage for Heterogeneous Clouds




3.4 Performance Metrics

The purpose of a performance metric is to summarise the performance of a machine in such a way as to allow its users to infer ‘how well’ their workload will run. By ‘how well’ we typically mean how fast the machine completes a task or tasks of interest. Performance metrics allow useful comparisons between machines to be made, often serving as an important factor in purchasing decisions. Lilja (2008) describes six characteristics that a good performance metric should have, which we summarise as follows:



  1. Linearity: A metric is linear if machine A being rated at twice that of machine B implies that the time taken to complete tasks of interest on A is half of that on B.

  2. Reliability: A metric is said to be reliable if a higher rating on machine A than on machine B implies that A executes tasks of interest faster than B.

  3. Repeatability: Repeated measurements on the same machine, with all else being equal, should return the same result.

  4. Ease of Use: A good metric is easy to measure.

  5. Consistency: The metric is defined in a consistent way across different machines and independently of the underlying architecture.

  6. Independence: A good metric should be developed independently from the needs of any one particular manufacturer.

It is tempting to express the performance of a machine in terms of what it is physically doing, for example by its clock-rate: the number of clock cycles per second. For identical CPUs, a higher clock-rate indicates likely higher performance, although the actual speed-up will depend upon other hardware components such as caches and storage, and so the effect will vary by workload. However, different CPUs do different amounts of work per clock cycle, and so clock-rate makes for a meaningless and potentially misleading comparison between them. In terms of the characteristics of good performance metrics, it is not a reliable metric. The Pentium 4 range, for example, had higher clock rates than the Pentium III models. However, this was achieved with a significant increase in the number of processing stages in the CPU pipeline (pipeline depth), with each stage taking one cycle to complete. In theory, this should have meant that whilst instruction latency was the same, instruction throughput was increased, as more instructions are being processed at any given time. However, pipeline stalls and branch mis-predictions meant that application performance was typically worse.


In short, clock-rate simply defines a unit of time (the clock cycle) during which the CPU carries out some specific actions before moving on to the next set of actions, but it tells us nothing as to what those actions are achieving. As such, clock-rates cannot summarise performance across different types of machines.
A simple early metric was based on the observation that, for most systems, instructions on the same machine took the same time to execute (Lilja, 2008); indeed, on single-cycle instruction machines they always took the same time. The time taken to execute a chosen instruction, typically addition, was used as a basis for comparing performance. However, as machine architecture developed and multi-cycle instruction machines became more prevalent, different instructions took different amounts of time. For example, more complex arithmetic instructions take longer than simple ones, whilst memory access instructions typically take longer still. The execution time of a single instruction therefore cannot reliably determine performance between multi-cycle instruction machines.
Gibson classified instructions on the IBM 704 series into 13 classes, where instructions in each class took the same number of cycles to execute. The classification is known as the Gibson instruction mix (Jain, 1991). Based on observations from a set of workloads prevalent on these systems at the time, he determined a frequency for each class, from which he calculated a weighted average instruction execution time; this weighted average is the metric used for measuring performance. Difficulties with the Gibson instruction mix were noted by Palme (1972), who observed that the average instruction execution time reported only corresponds well to workloads with a similar instruction mix, and so it is not a reliable metric.
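To make the weighted-average calculation concrete, the following is a minimal Python sketch; the instruction classes, frequencies and per-class timings are illustrative placeholders rather than Gibson's published 13-class figures.

    # Sketch of an instruction-mix metric: a frequency-weighted average of
    # per-class instruction execution times. All figures are illustrative.
    instruction_mix = {
        # class: (relative frequency, execution time in microseconds)
        "load/store":      (0.31, 12.0),
        "fixed-point add": (0.16,  6.0),
        "branch":          (0.17,  4.0),
        "floating-point":  (0.12, 20.0),
        "other":           (0.24,  8.0),
    }

    total_freq = sum(freq for freq, _ in instruction_mix.values())
    weighted_avg_us = sum(freq * t for freq, t in instruction_mix.values()) / total_freq

    print(f"Weighted average instruction time: {weighted_avg_us:.2f} microseconds")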
Developments in CPU architecture have proven to be increasingly problematic for the instruction mix approach. With multi-level memory hierarchies, pipelining, and speculative execution, the classification approach breaks down, as the time taken to execute a given instruction is no longer deterministic but a function of CPU state.
Millions of Instructions per Second (MIPS) is purported to be a measure of instruction throughput, and it is possible, from the physical characteristics of a machine, to determine a Theoretical Peak Performance (TPP), defined as the maximum number of instructions that can be executed per second. However, computer systems rarely achieve an advertised TPP due to pipeline stalls, branch mis-predictions, and memory hierarchy latency. Adjusted Peak Performance (APP) adds a weighting to each term in the TPP formula. Whilst these weightings are architecture specific, a commonly used rule of thumb is given by APP = 0.7*TPP (U.S. Department of Commerce, 2006).
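As a minimal sketch, the following computes TPP and the rule-of-thumb APP for a hypothetical machine; the socket and core counts, clock rate and per-cycle operation figure are assumed values for illustration only.

    # Theoretical Peak Performance (TPP) and rule-of-thumb Adjusted Peak
    # Performance (APP) for a hypothetical machine; all figures are assumed.
    sockets = 2
    cores_per_socket = 8
    clock_rate_ghz = 2.5          # billions of cycles per second
    ops_per_cycle_per_core = 8    # e.g. via wide SIMD units (assumed)

    tpp_gops = sockets * cores_per_socket * clock_rate_ghz * ops_per_cycle_per_core
    app_gops = 0.7 * tpp_gops     # the APP = 0.7*TPP rule of thumb cited above

    print(f"TPP: {tpp_gops:.0f} GOPS, APP: {app_gops:.0f} GOPS")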
MIPS can be determined by measuring the number of instructions executed by running workloads; however, it then becomes workload dependent, as different workloads will report different MIPS. More problematically, Dixit (1993) notes ‘All instructions...are not equal’; for example, a CISC instruction will typically do more application-specific work than a RISC instruction. From a MIPS rating alone, all we know is that different machines are executing different quantities of instructions per second; we do not know how much work those instructions are doing. As a consequence, on the basis of a MIPS rating alone it is not possible to determine how quickly different machines will execute the same workload, and so in general MIPS is not a reliable metric.
Another popular instruction throughput measurement, and one commonly used in the scientific computing domain, is Floating Point Operations per Second (FLOPS). However, FLOPS suffers from similar problems to MIPS, primarily stemming from the fact that it is also defined in terms of the number of instructions of a particular type executed per second, without reference to what those instructions are actually doing. As different instructions can do different amounts of work on different machines, FLOPS is also not a reliable metric.
The metrics considered above are all defined in terms of what the machine is physically doing, as opposed to progression towards completing a task. Metrics defined in terms of task progression are generally considered more useful and most informative to a user (Lilja, 2008; Hennessy and Patterson, 2012). Examples of such metrics include:
Program Execution Time (CPU latency): This metric is defined by the elapsed wall clock time from program start to finish. Performance is often defined as the inverse of execution time, and so faster execution times give higher performance scores. When using this metric it is necessary to specify the reference workload being used to measure execution time (a minimal measurement sketch follows this list of metrics).
Work done in a fixed time: In the program execution time metric, the amount of work done is fixed and wall clock time is the variable of interest. This metric reverses the roles: the elapsed time is fixed and the amount of work completed within it is the variable of interest.
Throughput (CPU bandwidth): Throughput is defined as the number of units of work per unit time (usually per second) the CPU can perform. For consistency, the unit of work should be well defined and remain constant. However, there is no commonly accepted definition of a unit of computational work and so this becomes workload dependent.
Quality Improvements per Second (QUIPS): Gustafson and Snell (1995) argue that neither the amount of work nor the time should be fixed, and introduce a benchmark called Hierarchical Integration (HINT), which measures quality improvements per second for a specific numerical integration. Rather than fixing either the time or the quantity of work, the benchmark terminates when the quality of the solution can no longer be improved upon.

Response Time: The above metrics are suitable for batch jobs. For interactive applications or websites, response time (also known as application latency) is a good metric. Hoxmeier and DeCesare (2000) present evidence to show that higher response times lead to lower user satisfaction.
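As an illustration of the task progression metrics above, the following is a minimal Python sketch measuring program execution time and deriving a throughput figure; the workload function and its unit of work are stand-ins for a user's reference workload.

    import time

    UNITS_OF_WORK = 200_000

    def workload(n=UNITS_OF_WORK):
        # Stand-in CPU-bound task; in practice this is the reference workload.
        total = 0
        for i in range(n):
            total += i * i
        return total

    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start    # program execution time (seconds)

    performance = 1.0 / elapsed              # performance as the inverse of execution time
    throughput = UNITS_OF_WORK / elapsed     # units of work completed per second

    print(f"execution time: {elapsed:.4f} s, throughput: {throughput:,.0f} units/s")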
As noted, one of the purposes of a metric is to allow comparisons between machines. However, whilst the task progression metrics above allow us to quantify how fast machines A and B are, they do not tell us how much faster or slower A is compared to B. When comparing relative performance between machines it is common to use the ‘speedup’ which, for a fixed workload, is defined as follows. The speedup of A relative to B, speedup(A, B), is:
speedup(A, B) := execution_time(B) / execution_time(A)    (3.1)
Note that if speedup(A, B) > 1 machine A executes the workload faster than B.
Looking ahead, in chapter 5 we will make use of a metric similar to speedup – the degrade, or slowdown – which we define as the inverse of the speedup. The degrade of A relative to B is:
degrade(A, B) := execution_time(A) / execution_time(B)    (3.2)
In this case, if degrade(A, B) > 1 then machine A executes the workload slower than B. We make use of degrade instead of speedup for ease of exposition, as it allows us to say, for example, that if instance A has a degrade of 1.4 compared to instance B then the workload execution cost of A is 1.4 times that of B. Further, we will typically take as our reference point the best possible performance and consider degrade relative to this.
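A minimal sketch of equations 3.1 and 3.2 follows; the execution times for instances A and B are illustrative values only.

    # Speedup (3.1) and degrade (3.2) for a fixed workload; times are illustrative.
    def speedup(exec_time_a, exec_time_b):
        # speedup(A, B) > 1 means A executes the workload faster than B
        return exec_time_b / exec_time_a

    def degrade(exec_time_a, exec_time_b):
        # degrade(A, B) > 1 means A executes the workload slower than B
        return exec_time_a / exec_time_b

    time_a, time_b = 140.0, 100.0   # seconds on instances A and B (assumed)
    print(f"speedup(A, B) = {speedup(time_a, time_b):.2f}")   # 0.71
    print(f"degrade(A, B) = {degrade(time_a, time_b):.2f}")   # 1.40: A costs 1.4 times as much as B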
A common metric for I/O performance is MB/s, and looking ahead we note that Iozone, which we use to measure storage performance, reports results using this metric. However, EC2 makes use of IOPS when defining storage performance, where IOPS is the number of I/O operations per second. EC2 defines an I/O operation as a read or write up to a particular size limit, which is 256KB for SSD-backed volumes and 1024KB for HDD-backed volumes. Although sequential reads/writes below the per-operation data limit may be aggregated into fewer I/O operations, this is not the case for random access patterns. Estimating the execution cost of an I/O bound workload therefore requires knowledge of the workload’s data access patterns.
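To illustrate why access patterns matter, the sketch below converts a provisioned IOPS figure into an approximate upper bound on MB/s for different request sizes; the IOPS allocation and request sizes are assumed values, and the 256KB per-operation limit corresponds to the SSD case described above.

    # Approximate upper bound on throughput (MB/s) from an IOPS allocation,
    # for different I/O request sizes. All figures are illustrative.
    def max_throughput_mb_s(iops, io_size_kb, op_limit_kb=256):
        # Each counted operation moves at most op_limit_kb of data.
        effective_kb = min(io_size_kb, op_limit_kb)
        return iops * effective_kb / 1024.0

    provisioned_iops = 3000                  # assumed allocation
    for io_size_kb in (4, 64, 256):          # small random vs large sequential requests
        mb_s = max_throughput_mb_s(provisioned_iops, io_size_kb)
        print(f"{io_size_kb:>4} KB requests: ~{mb_s:.0f} MB/s")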
Whilst the focus of this section is on measuring resource performance, we note efforts aimed at measuring other aspects of service performance. Carnegie Mellon University launched the Cloud Services Measurement Initiative Consortium (CSMIC) to create a Service Measurement Index (SMI) for comparing Cloud services (Siegel and Purdue, 2012). SMI defines seven categories: accountability, agility, assurance, financial, security and privacy, performance, and usability, each of which contains one or more attributes. For example, the attributes of the performance category are accuracy, functionality, suitability, interoperability and service response time. However, it is not readily apparent how to measure these objectively; suitability, for example, likely depends on the workload, which will vary by user.
In summary, metrics defined in terms of what a machine is physically doing, such as MIPS, FLOPS and clock-rates, are not reliable, as they cannot distinguish the actual performance of different machines due to differences in the amount of work different instructions complete on different machines. This has led to a preference for so-called task progression metrics such as execution times and throughput. However, task progression metrics require reference workloads (benchmarks) to be chosen. There are a number of pitfalls that need to be avoided when choosing benchmarks, as we discuss next.
