A Performance Brokerage for Heterogeneous Clouds




5.1 Methodology



5.1.1 Empirical Methodology
A key methodological question is: How should we design our experiments? A common approach in statistics is to take measurements from a random sample (of some size) of the population and use this to estimate population parameters. However, this approach typically assumes that the properties of the sample do not vary with time. In medical statistics, a cross-sectional study takes a sample from a population and measures a particular value, giving, in essence, a snapshot in time. A cross-section allows for differences, or variation, amongst the sample to be measured.
A longitudinal study involves repeated measurements over time with the objective of understanding how the characteristics of interest of a particular representative of the population vary over time. A panel study is a combination of a cross-sectional and a longitudinal study, repeatedly measuring a cross-section of a random sample over time. We are interested in performance differences between a set of instances at a particular point in time, as well as how the performance of individual instances varies over time. As such, we conduct both cross-sectional and panel studies.
The objective of our benchmarking methodology is to measure the performance of instances in a consistent and repeatable manner, and to do so using commonly accepted workloads. As discussed in section 3.6, there are issues with benchmarks which fail to sufficiently stress systems, as is commonly the case for kernel benchmarks and so-called toy benchmarks. As such, we avoid use of the following: UnixBench, whose CPU component consists of Dhrystone and Whetstone; Sysbench, which simply calculates the number of primes up to a specified number; Livermore Loops; and a number of other small benchmarks. The OpenDwarfs benchmark suite also takes a kernel-based approach, and develops benchmarks based on so-called computational dwarfs. Unfortunately, at the time of writing, all bar one of the benchmarks within the OpenDwarfs suite are classified as beta, and indeed attempts to compile and run the beta benchmarks produced errors.
Concerns regarding developing applications specifically as benchmarks have led to a preference for real-world benchmarks, as exemplified by the SPEC suite. This suite is developed by an industry-led consortium and benchmarks from it are commonly found in performance work, for example Hennessy and Patterson (2012). As such, to measure compute performance, our choice of workloads is based on the SPEC suite. The SPEC CPU suite is distributed in source code form and must be compiled in accordance with stated conditions. This ensures that binaries are built in an identical fashion across systems, preventing system-specific compiler optimisations from being made.
As SPEC requires compilation with specific options, pre-built binaries, as commonly found on Linux and Unix distributions, cannot be used for reporting official SPEC scores. However, using SPEC requires a commercial license, currently $800. For financial reasons, we choose to make use of pre-built binaries where available. To ensure consistency we make use of the same binary when measuring performance variation across a set of instances. In certain cases, SPEC makes changes to the source code, most pertinently for us to bzip2, where they minimise the amount of disk I/O when reading and writing files. In this case, when using bzip2, our benchmark reads files into the disk cache first, so that they are present in RAM, and sends output to /dev/null.
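A minimal sketch of this timing approach is given below; the file name is a placeholder and this is not the exact script used, but it shows the intent: the input is read once so that it sits in the page cache, and the compressed output is discarded rather than written to disk.

```python
import subprocess
import time

def run_bzip2(path="input.iso"):
    """Time a single bzip2 compression with minimal disk I/O (illustrative only)."""
    # Warm the OS page cache by reading the whole file first.
    with open(path, "rb") as f:
        while f.read(16 * 1024 * 1024):
            pass
    start = time.time()
    # bzip2 -c writes the compressed stream to stdout, which we send to /dev/null.
    with open("/dev/null", "wb") as devnull:
        subprocess.run(["bzip2", "-c", path], stdout=devnull, check=True)
    return time.time() - start
```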
To measure the compute performance of instances we choose 6 benchmarks from the SPEC CPU 2006 suite: 4 of which are part of the SPECint suite whilst the other 2 are part of SPECfp, ensuring a mix of integer and floating point benchmarks. These were chosen in part for their ease of implementation, with a number of them already having pre-built binaries as part of the operating system we use. Specific choices are made on the basis of the ‘type’ of computation; for example, massively multi-player games driven by AI engines are commonly hosted on Clouds and so we include the game GO. Similarly, image and video rendering are common activities on Clouds and so we include POV-Ray, a photo-realistic image rendering tool.
We detail these below.

  • Sa-learn (SpamAssassin, n.d.): Sa-learn is a naïve Bayesian classifier used in the Spam identification program SpamAssassin, and classifies mail as either SPAM or HAM. For input (training set) we use SPAM and HAM datasets, which consist of thousands of emails, available from the SpamAssassin public corpus. Sa-learn is single threaded and cannot make use of multiple vCPUs. The benchmark involves timing how long it takes to learn a training set.

  • bzip2/pbzip2 (Seward, 2017): Bzip2 is a compression utility using the Burrows-Wheeler transform. For input, we use an ISO file consisting of hundreds of different binary and text files, whereas the SPEC input file is a tar file consisting of multiple text and binary files. In addition to standard bzip2, we also make use of a multi-threaded version of bzip2, called pbzip2, which automatically detects and uses the number of vCPUs present. The benchmark measures the time taken to compress a set of input files.

  • GO (GNU, 2006): The GNU GO program is a computer version of the game GO. Given an input file, the program analyses the game position and determines the likely winner. For input, we use 500 freely available GO game files (available from www.u-go.net). GO is single threaded and cannot make use of multiple vCPUs. The benchmark consists of measuring how long GO takes to analyse a set of game positions.

  • Hmmer (HMMER, n.d.): Hmmer uses Hidden Markov Models (HMM) to search for patterns in protein sequence databases. The database used is uniprot_sprot.fasta with workload fn3.hmm. Hmmer is a parallel program and automatically detects and uses the number of vCPUs present. The benchmark consists of measuring the time taken to perform specified searches against the database.

  • POV-Ray (POV-Ray, n.d.): The Persistence of Vision Ray Tracer, POV-Ray, produces photo-realistic images from scene description files. For input we use a number of the advanced scene descriptions provided in the distribution. This includes the benchmark scene as used by SPEC. We use the latest version of POV-Ray, 3.7, which is a parallel program that automatically detects the number of vCPUs present. The benchmark consists of measuring the time taken to produce images from a set of scene description files.

  • NAMD (University of Illinois, 2017): NAMD is a molecular dynamics code used for simulating large bio-molecular systems. For input we use files included in the NAMD distribution that are the same as those used by SPEC. NAMD is a parallel code and automatically detects and uses the number of vCPUs present. The benchmark is the execution time for the simulation with the provided input files.

To calculate the performance of an instance, with respect to a given workload, we execute the workload 3 times and take the average. To ensure consistency in our benchmarking, we built an Amazon Machine Image (AMI) which includes all of the benchmarks above. We started with a 64-bit AMI from Canonical, with the Ubuntu 12.04 (LTS) OS installed as a base image, and installed our required workloads, as well as custom execution scripts to start benchmarks, and to record and retrieve benchmarking data. Unless otherwise stated, all instances benchmarked were launched from this AMI. The AMI was built in US-East and then exported to other Regions, and so the same image has different AMI-ids in different Regions.
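The sketch below illustrates the shape of such an execution script (the command shown is a placeholder rather than the exact invocation used): each workload is run three times and the mean wall-clock time is recorded.

```python
import statistics
import subprocess
import time

def time_workload(cmd):
    """Run one workload to completion and return its wall-clock time in seconds."""
    start = time.time()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return time.time() - start

def benchmark(cmd, runs=3):
    """Execute the workload `runs` times and return the mean execution time."""
    return statistics.mean(time_workload(cmd) for _ in range(runs))

# Example with a placeholder command:
# mean_seconds = benchmark(["bzip2", "-c", "input.iso"])
```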


Whilst an AMI allows us to use the same image across EC2, when measuring other Clouds we had to run a script to install our benchmarks and collect results. This highlights some of the friction users experience when using different Clouds, and goes some way to explaining the recent interest in application containers such as Docker.
In addition to compute performance, we also measure memory bandwidth, I/O performance and ‘overall’ system performance for some instances. To do so we made use of the following well-known and commonly used benchmarks:

  • STREAM (McCalpin, n.d.): Standard memory bandwidth benchmark, which offers 3 different vector computations to run. Following Intel (Raman, 2013), we use the triad computation, which is a compound version of the other 2 computations and consists of a scalar multiplication of a vector followed by a vector addition, i.e. a(i) = b(i) + q*c(i). STREAM reports results in MB/s.

  • Iozone read and write (iozone, 2016): We use Iozone to measure I/O performance, which is defined as the rate at which we can read or write data to disk and is measured in MB/s. Note that read and write speeds are typically different, and also depend upon disk access patterns. In our benchmarks, we use random access patterns. Further, Iozone reports max, min and average throughput, and we record the average throughput.

  • Pgbench (PostgreSQL, 2017): The Postgres benchmark, pgbench, is modelled on the well-known and widely used TPC-B benchmark (TPC, 2017) from the Transaction Processing Performance Council (TPC). The benchmark measures the performance of the database with respect to a fixed workload. The time taken to complete the workload is measured and the resulting metric is transactions per second (TPS). The amount of work each client must perform is 100,000 transactions. We choose to set the number of clients equal to the number of cores on an instance, and so on an instance with 4 vCPUs 400,000 transactions are performed, as sketched below.
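A hedged sketch of this scaling rule is given below; the database name is illustrative, and pgbench's -c and -t flags set the number of clients and the number of transactions per client respectively.

```python
import multiprocessing
import subprocess

# Set the number of clients equal to the number of vCPUs; each client
# performs 100,000 transactions, so a 4 vCPU instance performs 400,000 in total.
vcpus = multiprocessing.cpu_count()
subprocess.run(
    ["pgbench", "-c", str(vcpus), "-t", "100000", "bench"],  # 'bench' is a hypothetical database name
    check=True,
)
```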

Due to financial considerations, we use instances from the spot market. This gives access to the same resources that both reserved and on-demand instances are sold from, at reduced prices. As discussed in section 2.2, such instances can be reclaimed by EC2 when needed, and indeed this did happen to some of the instances in our experiments. For this reason, sample numbers may not be neatly rounded, as these instances did not finish and report.


5.1.2 Statistical Methodology
Having collected performance measurements, we then need to determine and apply appropriate statistical techniques to analyse them. The techniques applied will depend in part on the purpose of the investigation, that is, what we are trying to discover, as well as on the nature of the data itself. In our first hypothesis, we suggested that performance differences (variation) exist between instances of the same type and that this is true for many instance types on different Clouds. We are interested, therefore, in the whole range of possible performance. When collecting a random sample it is commonplace to compute the sample mean and standard deviation, which are measures of central location and of dispersion around it, and indeed the sample mean is an unbiased estimator of the population mean (i.e. the expected value of a random variable). Skew is a measure of the lack of symmetry in data, whilst Westfall (2014) shows that kurtosis, which until recently was assumed to be a measure of the ‘peak’ of a distribution, is in fact a measure of tail weight, and that excess kurtosis indicates a ‘…propensity to produce outliers’. By a propensity to produce outliers we understand that large deviations from the mean, measured in terms of multiples of standard deviations, will happen more frequently than we would expect if the data were normally distributed.
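To make the point concrete, the short sketch below, using simulated data, shows how a sample containing occasional large deviations has a much higher excess kurtosis than a comparable normal sample, even though both are centred on the same mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal = rng.normal(100, 5, size=1000)                          # well-behaved sample
heavy = np.concatenate([normal, rng.normal(100, 50, size=20)])  # add a few large deviations

for name, x in [("normal", normal), ("outlier-prone", heavy)]:
    # stats.kurtosis reports excess kurtosis, which is 0 for a normal distribution.
    print(f"{name}: skew={stats.skew(x):.2f}, excess kurtosis={stats.kurtosis(x):.2f}")
```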
However, the extremities of the data are frequently ignored in a standard analysis, and indeed outliers, typically defined as being more than 3 standard deviations from the mean, are sometimes ‘cleaned’ from the data altogether. For our purposes, however, the extremities are important. The best performing instance (with minimum execution time) will incur the least cost when running a fixed workload, whilst the worst performing instance will incur the highest cost. Neither of these values should be considered an ‘outlier’ as both convey important information. The worst performing instance shows that the worst case workload cost is at least this high, with no guarantee that costs won’t be higher. Similarly, the best performing instance will have the lowest execution costs for a fixed workload. Deviations from this value show the extra workload costs incurred by instances whose performance is degraded with respect to the best possible. As such, these deviations are equally important, if not more so, than deviations, or spread, around the mean. Further, we note that the histogram of results presented by Armbrust et al. (2009) is multi-modal, i.e. has multiple peaks, and in such cases the mean, and the spread around it, can be misleading as it is not necessarily representative of the data. As such, when analysing a cross-section we choose the most appropriate statistics, which will be expressed either in terms of percentiles and the differences between them, or in terms of the mean and standard deviation. Beyer et al. (2016) note that Google Site Reliability Engineering recommends the use of percentiles for system performance reporting, as the data is often irregular and multi-modal, citing the mean as being misleading, and noting that the extremities are of interest as this is where SLA violations occur.
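A minimal sketch of the kind of percentile-based summary we have in mind is given below, using made-up execution times; note that the minimum and maximum are reported rather than discarded as outliers.

```python
import numpy as np

# Illustrative execution times (seconds) for instances of the same type.
times = np.array([98.9, 99.3, 100.4, 102.1, 103.2, 104.7, 126.8, 131.5])

summary = {
    "min (best case)": times.min(),
    "5th percentile": np.percentile(times, 5),
    "median": np.percentile(times, 50),
    "95th percentile": np.percentile(times, 95),
    "max (worst case)": times.max(),
    "mean": times.mean(),
    "std dev": times.std(ddof=1),
}
for name, value in summary.items():
    print(f"{name}: {value:.1f}s")
```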
A longitudinal study collects data at regular intervals over a period of time, and the resultant set of data is known as a time series. In a time series the ordering of the data is important, and frequently correlation exists between a time series and a lagged version of itself, known as autocorrelation. As such, the analysis differs from that applied to an independent random sample, and in particular attention is paid to a property of time series known as stationarity. The statistical properties of a stationary time series, such as the mean and variance, do not differ over time. In a non-stationary time series either the mean or the variance differs at different points in time, and indeed both may. Non-stationary time series arise in a number of ways, such as a trend up or down, so that mean performance varies, or a constant mean with time-varying variance. A locally-stationary time series is one where we have a stationary period followed by an ‘instantaneous’ jump to a new stationary period, with a change in either the mean or the variance.
From a performance perspective stationarity is important as it implies consistent performance, with a mean and variance that do not themselves vary over time. We employ the Augmented Dickey-Fuller test (with a null hypothesis of non-stationarity) to detect stationary series, as well as using time-plots, which visually reveal properties of the time series.
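The sketch below, with simulated series standing in for per-instance benchmark times, shows how the test is applied: the null hypothesis is non-stationarity (a unit root), so a small p-value is evidence that the series is stationary.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
stationary = 100 + rng.normal(0, 2, size=200)   # constant mean and variance
trending = stationary + 0.1 * np.arange(200)    # non-stationary: the mean drifts upwards

for name, series in [("stationary", stationary), ("trending", trending)]:
    stat, pvalue = adfuller(series)[:2]
    print(f"{name}: ADF statistic={stat:.2f}, p-value={pvalue:.3f}")
```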
