A Performance Brokerage for Heterogeneous Clouds




4.1 Cloud Performance



4.1.1 HPC Focused
Initial (academic) work on Cloud performance typically focused on the suitability of Cloud platforms for running the kinds of tightly coupled parallel codes commonly used in the scientific community. These codes usually use either MPI or PVM to allow cooperating processes to communicate in the completion of a task. Processes running on the same physical machine may communicate via shared memory, whilst those on different machines do so over a network fabric. The performance of these codes, i.e. the time taken to complete the task, depends upon algorithmic design, CPU performance, memory bandwidth, and network latency. Network latency is a significant factor due to the number of small status messages sent between processes to coordinate activities. On-premise HPC systems typically have a low latency network fabric, such as InfiniBand, over which MPI/PVM processes communicate.
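To make this latency sensitivity concrete, the following is a minimal sketch of an MPI ping-pong micro-benchmark of the kind commonly used to estimate inter-process latency. It is illustrative only (not a benchmark used in this work) and assumes the mpi4py bindings and an MPI runtime are installed.

```python
from mpi4py import MPI  # assumes mpi4py and an MPI runtime are available

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

ROUNDS = 1000
buf = bytearray(8)          # small message, similar to a coordination/status message

comm.Barrier()
start = MPI.Wtime()
for _ in range(ROUNDS):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Half the mean round-trip time gives a rough estimate of one-way latency.
    print(f"approx. one-way latency: {elapsed / (2 * ROUNDS) * 1e6:.1f} microseconds")
```

Run with, for example, mpiexec -n 2 python pingpong.py (the filename is hypothetical); the gap between such measurements over InfiniBand and over a standard Cloud network is what makes tightly coupled codes latency sensitive.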
However, these fabrics can be expensive, driving up the costs of on-premise HPC. Yelick et al. (2011) investigated the suitability of Clouds with regard to cost, usability and performance for running US Department of Energy (DOE) scientific workloads; the outcome is known as the Magellan Report. They state that ‘applications in the Cloud…experience significant performance variations’, where the variability is with respect to tightly coupled workloads. They also find Clouds suitable for loosely coupled workloads. In addition, they note that whilst DOE data centres are highly energy efficient, and so are cost efficient with regard to energy consumption, they cannot offer the same rapid elasticity as found on Public Clouds because their capacity is limited.

Arguably, few institutions can afford to run data centres similar to the DOE ones, and so HPC on Cloud is a somewhat different proposition for them. Osterman et al. (2009) investigated the performance suitability of EC2 for running scientific workloads using HPL, an MPI component of the HPC Challenge (HPCC) benchmark suite, whose performance is known to scale linearly with the quantity of resource provided. However, when scaling clusters of instances from the M1 and C1 families linear scaling was not observed. Indeed, as the number of instances grew, observed performance increasingly deviated from the expected linear scaling, which they attributed to the relatively high latency found between instances on EC2. Their main conclusion is that ‘...performance and the reliability of the tested cloud are low’; we note, however, that at the time of their study lower latency options were not available.


Rehr et al. (2010) also show that the performance of MPI codes does not scale linearly with cluster size. Lack of performance scalability is an important consideration, as cost does scale linearly with quantity of resource, making the execution of large scale codes increasingly expensive. This leads Evangelinos and Hill (2008) to recommend EC2 as suitable for ‘small’ HPC systems, useful for testing and developing code. Further, they anticipate that future Cloud platforms may target tightly coupled parallel codes by, for example, offering low latency fabrics similar to those found in (some) on-premise systems, and indeed Microsoft now do this with Azure.
Akioika and Muraoka (2010) and Gupta and Milojicic (2011) report that, whilst Clouds are unsuitable for tightly coupled parallel codes, they are suitable for so-called embarrassingly parallel (EP) tasks, which have little, if any, communication overhead. Indeed, the former authors report linear performance scaling for the EP component of the NAS Parallel Benchmark suite.
Early extant work investigating HPC on EC2 is notable for the choices of instance types – usually either from the M1 or C1 family; for example, in the Magellan report use is made of m1.large instance types (Yelick et al., 2011). However, by 2010 EC2 had introduced a new Cluster Compute instance type, the cc1.4xlarge, followed by the cc2.8xlarge in 2011. These instance types offer 10 Gigabit Ethernet networking, providing a significant increase in bandwidth, and were described as ‘…specifically designed for high-performance computing (HPC) applications and other demanding network-bound applications’. However, Mauch et al. (2013) and Exposito et al. (2013) report that performance scaling issues remain, an indication that latency has a larger effect than bandwidth on the performance of tightly coupled parallel codes.
Hassan et al. (2016) investigate HPC on Azure using A10 instance types, and report performance scaling issues for the various MPI codes used. However, the choice of A10 is somewhat curious, as it (and the A11) differs from the previous generation A8 and A9 only by not having low-latency RDMA (remote direct memory access) network adapters. Re-running these benchmarks on A8/A9 instances, which are designed for tightly coupled codes, would allow for a more ‘like for like’ comparison of Cloud HPC with on-premise HPC, and so a more accurate cost analysis.
Jackson (2010) compared the cost of HPC performance for m1.large instances on EC2 with on-premise systems, and reported that (for tightly coupled codes) EC2 is 6 times slower than a ‘mid-range’ Linux cluster and 20 times slower than a ‘modern’ HPC system. However, the analysis only compares hardware costs with instance rental costs, excluding other significant costs associated with on-premise systems, including energy, cooling, racks and networking equipment, local support and decommissioning. Indeed, Gartner report a total cost of ownership of up to 500% of the original value of a typical desktop or server (Meulen, 2008).
Further, typical HPC clusters found at universities and in other scientific institutions are of a fixed size and hence in a given unit of time can only produce a fixed amount of work; such limitations are harder to encounter on the Cloud due to its vast scale. Growing on-premise cluster capacity is often non-trivial due to space, electricity and cooling requirements, and cannot be done quickly. Consequently, when a cluster is running at full capacity, any additional work must wait for existing work to finish, or for the cluster to grow.
Such considerations prompted Foster (2009) to ask the question: what’s faster, EC2 or a supercomputer? Considering a particular workload (with estimated execution times on EC2), he demonstrated that the estimated probability of completion within a given time was higher on EC2 than on a national supercomputer, due to the stochastic time that jobs spend queuing to run on the latter, whilst on-demand provisioning on the former removes queue time, allowing jobs to start immediately.
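The comparison Foster describes can be sketched with a small Monte Carlo estimate; the run times, the exponential queue-wait model and the deadline below are hypothetical figures chosen purely for illustration, not values from his analysis.

```python
import random

def p_complete_by(deadline_h, run_time_h, queue_wait=lambda: 0.0, trials=100_000):
    """Estimate the probability that run time plus (stochastic) queue wait
    fits within the deadline."""
    hits = sum(run_time_h + queue_wait() <= deadline_h for _ in range(trials))
    return hits / trials

# EC2: slower execution but no queue; supercomputer: faster execution but a
# stochastic queue wait (here exponential with a mean of 20 hours).
ec2 = p_complete_by(deadline_h=24, run_time_h=18)
hpc = p_complete_by(deadline_h=24, run_time_h=6,
                    queue_wait=lambda: random.expovariate(1 / 20))

print(f"P(complete within 24h): EC2 = {ec2:.2f}, supercomputer = {hpc:.2f}")
```

Under these illustrative numbers the slower but queue-free option has the higher probability of meeting the deadline, which is the essence of the argument.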
4.1.2 Performance Variation in Common Workloads
We may expect scientific codes that are typically run in environments with specialist low latency networking to have variable performance on Clouds. However, performance variation affecting a wide range of workloads has been reported. Armbrust et al. (2009) list performance unpredictability as obstacle number 5 to the adoption of Cloud Computing. They use STREAM to measure the memory bandwidth of 75 instances, and report a mean bandwidth of 1355 MB/s and a standard deviation of 52 MB/s, giving a coefficient of variation (the ratio of standard deviation to mean) of ~4%. Whilst this is one of the earlier papers to report variation, the issue remains highly relevant, as performance variation across a wide range of workloads and providers is still being reported, for example by Leitner and Cito (2016).
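As a worked check of the figure quoted above, the coefficient of variation follows directly from the reported mean and standard deviation:

```python
# Coefficient of variation from the STREAM figures reported by Armbrust et al. (2009).
mean_bandwidth = 1355.0   # MB/s
std_bandwidth = 52.0      # MB/s

cv = std_bandwidth / mean_bandwidth
print(f"CV = {cv:.3f} (~{cv:.0%})")   # ~0.038, i.e. roughly 4%
```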
A notable feature of the histogram presented, though not commented upon, is that the results fall into 3 distinct bands, with the best performing band showing a ~40% increase in memory bandwidth compared to the lowest. Based on our results, presented in sections 5.2 – 5.4, we conjecture that the differences between these bands may be explained by differences in the underlying CPU model. However, as they do not report the instance type being measured, we cannot directly compare their results with our own.
Cerotti et al. (2012) demonstrate performance variation on M1 family instance types on EC2 using the DaCapo benchmark suite, which consists of a number of real-world Java workloads with non-trivial memory requirements. The latter means that the memory sub-system is stressed during workload execution, a crucial factor in performance and something a number of benchmarks used to investigate Cloud performance fail to do. The observed variation is readily explained by the heterogeneity of the instance type.
Phillips et al. (2011) discovered a performance ‘anomaly’ when measuring performance with the so-called Dwarf workloads. In particular, they find that the 10 m1.small instances benchmarked fell into 2 performance bands, which they refer to as type-1 and type-2, and that type-1 instances have different CPU models and cache sizes from type-2 instances. They also report that some Dwarf workloads perform better on type-1 instances whilst others perform better on type-2; that is, there is no ‘best’ type for all workloads. Interestingly for our work, they demonstrate correlations between the Dwarf benchmarks and certain workloads, and in follow-up work Engen et al. (2012) suggest that workload performance can be predicted given knowledge of Dwarf performance and these correlations. This is useful for a performance broker, as instance performance can be specified in terms of benchmark workloads from which users can deduce likely workload performance.
Ou et al. (2013) demonstrate performance variation on EC2 instances due to heterogeneity. However, their suggestion that performance variation across instances with the same CPU model is negligible does not correspond to our findings, presented in chapter 5. It is likely that the negligible variation across instances of the same type is a reflection of the benchmark used – UnixBench – whose working set typically fits into the CPU cache and consequently does not stress the memory hierarchy. As we discuss in section 4.2, this assumption leads to an underestimation of the risk involved in a performance improvement strategy they propose.
Farley et al. (2012), using 4 benchmarks from the SPEC CPU 2006 suite, observe that the degree of variation across instances with different CPU models, and across instances with the same CPU model, differs by workload. Further, and in agreement with Phillips et al. (2011), they find that different workloads perform better or worse on different CPU models. In further confirmation of the workload-specific nature of performance variation, Lenk et al. (2011), using a wide selection of different benchmarks, report that different workloads perform better or worse on different architectures, stating ‘... machines running on an Intel Xeon CPU E5430 CPU perform better in most of the benchmarks and AMD Opteron 2218 HE CPUs only in few of the benchmarks’. In light of this, the result of Zhuang et al. (2013), showing that Clouds will suffer from resource starvation as users ‘game’ them by acquiring the ‘best’ instances, would need reappraising.
Regional differences have been reported by Tordsson et al. (2012), who note increased variability in performance in US-based Regions as compared to the EU Region. As they are running CPU bound workloads, we suspect that the variation is likely explained by heterogeneity, but this cannot be confirmed as CPU models were not recorded. Before making a request for an instance it is not possible to know the CPU model that will be obtained; from the perspective of a user it is a random process, and so we can talk of the probability of obtaining an instance with a particular CPU model. Schad et al. (2010) found that, for a given CPU model, the proportion of instances running on it (which can be used as an estimate of the probability of obtaining an instance running on that CPU) varies across different AZs within the US and EU West (Dublin) Regions studied. As a consequence, the workload performance distribution differs by AZ, and so, depending upon the workload a user intends to run, different AZs offer opportunities for better or worse performance.
O’Loughlin and Gillam (2014) report that differences in the per-AZ proportions of particular CPU models obtained are widespread across EC2. For example, for the M1 family, 87% of instances obtained in us-west-1b ran on an Intel Xeon E5507, compared with 44% in us-east-1a and just 2% in eu-west-1a. The simplest explanation for such findings is that different AZs contain different models in different proportions due to their age: as a new AZ is added to an existing Region, the hardware it contains to support the same instance types is likely to differ from that in extant AZs.
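A minimal sketch of how such per-AZ proportions can be estimated is shown below; the sample data are hypothetical, constructed only to mirror the proportions quoted above, and in practice the CPU model of each launched instance would be read from, for example, /proc/cpuinfo.

```python
from collections import Counter

# Hypothetical observations of the CPU model obtained per launched instance in two AZs.
observed = {
    "us-west-1b": ["E5507"] * 87 + ["E5-2650"] * 13,
    "eu-west-1a": ["E5507"] * 2 + ["E5-2650"] * 98,
}

for az, models in observed.items():
    counts = Counter(models)
    total = len(models)
    # The empirical proportion serves as an estimate of the probability of
    # obtaining each CPU model in this AZ.
    estimates = {model: round(count / total, 2) for model, count in counts.items()}
    print(az, estimates)
```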
4.1.3 Noisy Neighbours and Resource Contention
Clouds are (in the main) multi-tenant environments; as a consequence, instances owned by different users may run concurrently on the same physical host, and indeed a given user may have their own instances co-located. In section 3.3 we note that co-located instances may share many different components, including caches and memory bandwidth, which cannot be fairly allocated amongst competing processes. Work at Intel Labs by Tickoo et al. (2009) finds that resource contention amongst co-located instances can cause significant performance degradation; the problem is commonly referred to as the noisy neighbour problem. Further, as discussed in section 3.4, hypervisors schedule CPU time for instances and so CPU cycles can be allocated to instances in a particular proportion. However, this is done on a best-effort basis and does not guarantee that an instance will always receive its assigned allocation. In Xen, for example, I/O operations carried out on behalf of an instance are not accounted for, and so I/O bound workloads can ‘steal’ cycles.
A further source of contention, invisible to instances, exists in Intel CPUs that support hyper-threading. When enabled, a single CPU core holds the state of 2 execution threads simultaneously, thus increasing the number of instructions that can be retired per cycle. However, the two threads executing on the same core contend for its execution units, potentially causing performance degradation (Leng, 2002). We note that the use of hyper-threading on major providers such as EC2 (Amazon Web Services, 2017) and GCE (Google Cloud Platform, 2017) is widespread.
To address concerns over shared resource abuse on Clouds, Intel developed Cache Allocation Technology (CAT), available on the Intel E5 v4 CPU family. CAT allows the allocation of the last level cache (LLC) amongst cores, and hence amongst co-located instances, to be controlled, with the potential for alleviating noisy neighbour problems. These CPU models have recently started to appear on EC2; for example, the E5-2686 v4 (Broadwell) is one of the CPUs on which an M4 may be launched, and so opportunities now exist to investigate the use of CAT for this purpose.
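By way of illustration, the sketch below uses the Linux resctrl interface, through which CAT is exposed on supporting kernels and CPUs, to confine a process to a subset of LLC ways. The group name, cache-way bitmask and PID are hypothetical, and such control would typically only be available at the host or bare-metal level (requiring root), not from within an ordinary guest.

```python
import os

RESCTRL = "/sys/fs/resctrl"                    # assumes resctrl is mounted here
GROUP = os.path.join(RESCTRL, "tenant_a")      # hypothetical allocation group

os.makedirs(GROUP, exist_ok=True)

# Restrict the group to the 4 lowest ways of the L3 cache on cache domain 0.
with open(os.path.join(GROUP, "schemata"), "w") as f:
    f.write("L3:0=f\n")

# Move a (hypothetical) process into the group; its LLC footprint is then
# limited to the ways above, shielding other groups from cache eviction.
with open(os.path.join(GROUP, "tasks"), "w") as f:
    f.write("12345")
```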
The noisy neighbour problem became so problematic on Google’s internal infrastructure that Zhang et al. (2013) developed and implemented a system known as CPI2 to detect and terminate offending instances. Google makes use of Linux containers (Linux Containers, 2017), often referred to as lightweight virtual machines, in order to provide workload isolation and resource allocation, and all of Google’s workloads run in containers. For each of their workloads they can determine the per-CPU-model distribution of cycles per instruction (CPI), and this is used as a proxy for workload performance. If the CPI for a workload is more than one standard deviation away from the mean CPI, that workload is deemed to be performing poorly and a monitoring process is started to identify and terminate the noisy neighbour. CPI2 requires access to a CPU’s Performance Monitoring Unit (PMU), from which CPI data can be obtained. However, Johnson et al. (2012) discuss the various issues with regard to virtualising the PMU, and on EC2 access to the PMU is only available on dedicated instances.
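The detection rule described above can be sketched as follows; the CPI samples and counter values are hypothetical, and the real CPI2 pipeline is considerably more sophisticated than this illustration.

```python
import statistics

def cpi(cycles, instructions):
    """Cycles per instruction, computed from PMU counter readings."""
    return cycles / instructions

def performing_poorly(observed_cpi, historical_cpi):
    """Flag a workload whose CPI is more than one standard deviation above the
    historical mean for this workload/CPU-model pair (higher CPI = slower)."""
    mean = statistics.mean(historical_cpi)
    sd = statistics.stdev(historical_cpi)
    return observed_cpi > mean + sd

history = [0.95, 1.02, 0.98, 1.05, 1.00, 0.97]                     # hypothetical samples
current = cpi(cycles=2_600_000_000, instructions=2_000_000_000)    # CPI = 1.3
print(performing_poorly(current, history))                         # True: investigate neighbours
```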
The steal time metric, however, is available to instances and measures the proportion of time a vCPU spends in the run queue waiting to be serviced by a physical CPU. It can be used to detect when an instance is not receiving its allocated share of CPU resources because cycles are being ‘stolen’ by one or more noisy neighbours. Netflix, a large user of EC2, monitors this metric, and whenever a certain threshold is reached the instance is terminated and a new one acquired. Lloyd et al. (2017) have developed a tool called cpuSteal, which aims to simplify the collection and use of the steal time metric in order to identify instances with noisy neighbours.
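A minimal sketch of collecting this metric on a Linux instance follows; the sampling interval and the 5% threshold are illustrative choices, not values used by Netflix or by cpuSteal.

```python
import time

def cpu_times():
    """Aggregate CPU time counters from /proc/stat; on a Linux guest the
    eighth value on the 'cpu' line is steal time."""
    with open("/proc/stat") as f:
        values = [int(v) for v in f.readline().split()[1:]]
    return sum(values), values[7]          # (total time, steal time)

def steal_percentage(interval_s=5.0):
    total_0, steal_0 = cpu_times()
    time.sleep(interval_s)
    total_1, steal_1 = cpu_times()
    return 100.0 * (steal_1 - steal_0) / max(total_1 - total_0, 1)

if steal_percentage() > 5.0:               # hypothetical threshold
    print("High steal time: likely noisy neighbour, consider replacing this instance")
```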
4.1.4 Performance Summary and Conclusions
In a survey of extant performance work on Clouds, Leitner and Cito (2016) formulate various hypotheses regarding the performance of instances. These hypotheses are grouped into 3 categories: (1) performance predictability, by which they mean variation across different instances of the same type; (2) variability within instances, by which they mean per-instance temporal variation; and (3) Regional and temporal factors, by which they mean temporal factors common to all instances. They test these hypotheses by benchmarking a range of instance types across multiple providers, and we summarise their conclusions below:

  • Performance Predictability: Instances of the same type demonstrate performance variation, and this is true for both CPU and I/O bound workloads. In the former case, performance is primarily determined by CPU model, whilst in the latter case there is no observed association between CPU model and performance. However, we note that Ou et al. (2013) report that the local disk performance of M1 instances on EC2 varies by CPU model, with the E5645 performing 15% better on average than the next best CPU model.

  • Per Instance Temporal Variation: The performance of an instance may vary over time. They find that CPU bound workloads vary less than I/O bound ones.

  • Regional and Temporal Variation: The performance of CPU bound workloads can vary by region (Cloud location), due to the differences in available hardware supporting the various instance types. For I/O bound workloads no such regional variation is reported. Further, they find no evidence to support temporal differences affecting a large proportion of instances. That is, they find no evidence of predictable busy periods when performance drops.

Interestingly, whilst EC2 has been extensively benchmarked, Leitner and Cito (2016) also benchmark Azure, and report a significant degree of heterogeneity resulting in performance variation. However, they state that in general heterogeneity appears less prevalent on Clouds than reported in earlier work. As an example they cite the fact that instances from the M3 class on EC2 all obtained the same CPU model, an Intel Xeon E5-2670, in their study. However, we note that the M3 family is now advertised as heterogeneous, as instances may be obtained on this model or on the next generation E5-2670 v2. Further, the cited example of homogeneity on GCE can be explained by the fact that at the time of their study it was in beta, whereas at the time of writing instances from both its Standard and High CPU families may launch on 1 of 4 different CPU models, depending upon location. Finally, we note that the EC2 M4 family, successor to the M3, is also heterogeneous, as instances can launch on either an E5-2676 v3 or an E5-2686 v4 CPU.


As noted, the performance of any particular instance may vary over time, with CPU bound workloads showing less variation than I/O bound ones. However, there have been significant recent changes in the disk technologies available to instances, with SSD drives now commonplace, and EC2 now offer a range of I/O performance assurances, specified in terms of input/output operations per second (IOPS).
Despite the reporting of performance variation, and its implications for workload execution times and costs, much extant work on Cloud costs fails to consider it. For example, Khajeh-Hosseini et al. (2012) model the cost of Cloud use but implicitly assume no variation. Similarly, Truong and Dustdar (2010) note that costs are a major consideration for the scientific community when running workloads on Clouds, but assume constant performance across identical instances in their cost model. Unsurprisingly, use of the Cloud for scientific workloads has received much attention, particularly the scheduling of so-called Bags of Tasks (BoT), defined as sets of independent (and often pre-emptible) batch jobs. A typical problem seeks to schedule in such a manner as to minimise the makespan (the time to execute all tasks in the bag) subject to budgetary or time constraints, or both. However, Cloud performance is typically modelled as homogeneous and constant; that is, all instances of the same type are assumed to execute tasks at the same constant rate – for example, both HoseinyFarahabady et al. (2013) and Oprescu (2010) make these assumptions.
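A minimal sketch of the cost and makespan model implicit in such work is given below: every instance of a type is assumed to execute tasks at the same constant rate, so makespan and cost follow directly from the task count, instance count, per-task time and hourly price. All figures are hypothetical.

```python
import math

def makespan_and_cost(n_tasks, n_instances, task_time_h, price_per_hour):
    """Makespan and cost under the homogeneous, constant-performance assumption."""
    tasks_per_instance = math.ceil(n_tasks / n_instances)
    makespan_h = tasks_per_instance * task_time_h
    cost = n_instances * math.ceil(makespan_h) * price_per_hour   # hourly billing
    return makespan_h, cost

print(makespan_and_cost(n_tasks=1000, n_instances=50,
                        task_time_h=0.25, price_per_hour=0.10))
# (5.0, 25.0) -- any per-instance performance variation would invalidate both figures.
```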
Notably, the fundamental question of how the performance of an instance varies over time has received little attention, with the exception of Leitner and Cito (2016), who benchmark the performance of 15 instances over a 72 hour period and show that instances have negligible variation; that is, the performance of each instance is essentially constant over the period. Indeed, not only is there negligible variation within instances but also across them. However, their choice of sysbench as a CPU benchmark is problematic, as it suffers from similar problems to UnixBench: its small code and working set do not stress the memory hierarchy. We suspect that instances running real-world benchmarks, of the type described in section 3.6, would likely show more variation over time than they do running sysbench.
Other temporal studies, however, such as Ou et al. (2013), do not track individual instances over a period of time, but instead periodically acquire a set of instances, take a cross-section of measurements and then release them. The lack of reported temporal differences demonstrates a consistency in the magnitude of cross-sectional variation over time.
Based on Leitner and Cito (2016), one would model instance performance over time as constant with negligible variation. Ideally, this is what Cloud performance should be. However, such a model would not account for variation within homogeneous instance types, as reported by Lenk et al. (2011) for example, unless one allowed for differences between instances in their constant performance level. This raises immediate questions: can the ‘constant’ performance level of an instance change over time, and if so with what frequency; and how do we model the differences in the constant performance levels of instances?
Extant work typically reports variation amongst instances with the same CPU model in terms of mean and standard deviation. Use of the mean suggests a degree of symmetry in performance variation, with an equal ‘upside’ and ‘downside’. We know that resource contention on a host can cause performance degradation, potentially severe: as noted, Google implements its CPI2 monitoring tool, Lloyd et al. (2017) develop cpuSteal, and Delimitrou and Kozyrakis (2017) develop a workload scheduler to minimise co-location performance interference. However, it is far from clear that we should expect similar upside deviations. Indeed, best performance is obtained in the absence of any resource contention, and so there is a limit to how good performance can be, whilst contention leads only to worsening performance with no obvious limit beyond which performance cannot degrade. Arguably, then, differences in performance across instances with the same CPU model should be described in terms of a best possible performance and deviations, or degradations, from it. As such there is a need to characterise the whole range of performance; simple summary statistics do not suffice.
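To illustrate the asymmetry argued for here, the sketch below models observed performance as a contention-free best score minus a non-negative degradation. The exponential degradation term and all numbers are illustrative assumptions, not values fitted to data.

```python
import random
import statistics

BEST_SCORE = 100.0                              # hypothetical contention-free benchmark score

def observed_score(mean_degradation=5.0):
    # Degradation is non-negative: there is an upper bound on performance but
    # no comparable bound on how far contention can drag it down.
    return BEST_SCORE - random.expovariate(1 / mean_degradation)

samples = [observed_score() for _ in range(10_000)]
print("best:", round(max(samples), 1))
print("mean:", round(statistics.mean(samples), 1))
print("worst:", round(min(samples), 1))         # long downside tail, no upside
```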
However, Cloud resources can be released at any time, and so there is no requirement to hold onto poorly performing instances. This simple observation forms the basis of so-called performance improvement strategies, which we discuss next.
