7.2 Critique of Chapter 5
7.2.1 Focus on EC2
In chapter 5 we conducted performance measurement experiments aimed at establishing various hypotheses regarding variation. Our focus was primarily on EC2, with one small scale experiment each on GoGrid and Rackspace respectively. We use this source of data to model instance performance, and we make the generalisation that instances from all providers on our hypothetical Cloud exchange will have the same characteristics. This does raise questions as to the suitability of choices made.
Our decision to focus on EC2 was primarily due to its prominence in the performance literature, and as such results from experiments on it are readily comparable to extant work. In a survey of performance work Leitner and Cito (2016) note that variation is reported on in other Cloud platforms, such as GCE and Azure. Indeed, from our single experiment on GoGrid and Rackspace we find heterogeneous instances and performance variation due to it on both Clouds. In addition, we choose to focus on EC2 due to its geographical spread and range of instance types, and we are able to measure performance in different global locations, and with respect to a multitude of different instance types. This helps to ensure we are not observing a localised problem only. Further, from a pragmatic point of view, we note the ease with which we can repeatedly execute the same benchmarks across EC2, which is vital to ensure consistency and reliability of results
7.2.2 Benchmark Choice
A notable feature of typical performance related work is the relative lack of discussion with regards to suitable benchmarks with which to measure performance. Arguably, this has led to some questionable choices being made; in particular we note use of UnixBench and Sysbench. As discussed in section 3.5, neither of these are capable of sufficiently stressing the memory hierarchy, and they are generally considered unsuitable to measure the performance of modern systems. Indeed, their use has led to erroneous conclusions such as homogeneous instances types having negligible variation, despite long standing evidence of the noisy neighbour effect.
Issues with previous generations of benchmarks has led to a preference for use of real-world workloads, and guided by Lilja (2008) and Hennessy and Patterson (2012) we implement some workloads found in the SPEC CPU 2006 suite. We also note the relatively widespread use of SPEC suite within the literature, indeed, a search on IEEE Xplore for “SPEC CPU” returns 179 results. By comparison, a search for UnixBench, Sysbench, OpenDwarfs returns 6, 6 and 2 respectively.
However, all benchmarks suites will eventually be less useful as systems develop, unless they are updated, and we note the release of SPEC CPU 2017 in 07/2017. Of the 6 choices made from the 2006 suite, bzip2 and GNU GO have now been replaced with a new compression algorithm and new implementation of GO: xz (Tukaani, n.d.) and Leela (Sjeng, n.d.) respectively. However, our other choices remain but with updated workloads. We are confident that our original choice is still suitable for the purpose of investigating variation.
7.2.3 Number of Benchmarks Used
We make use of 6 CPU bound benchmarks in our experiments. Whilst the number is not large, it is in line with that found in extant work, and we note it is sufficient to provide evidence to support all performance hypotheses posed in chapter 5. Indeed, as results from section 5.4 demonstrate, it is remarkable how few benchmarks are required before workload specificity is observed. However, arguably more (CPU bound) benchmarking data is required in order to accept the characterisation of them presented in section 6.4. We recall the characterisation: per CPU results are highly peaked close to best possible with a long tail, and performance as a whole is multi-modal when heterogeneity is present.
Multi-modal is simply due to variation, and we find no evidence, either in our work or in the extant literature, of different CPU models with indistinguishable performance. Further, with regard to per CPU shape, highly peaked close to best possible indicates that, in the main, instances are provided with, and can make use of, a consistent quantity of resource. However, degradation (deviations away from best possible) starts to occur when there is resource contention. The long tail indicates that it is possible for severe degradation to occur. Other per CPU distributions are possible, but would be a sign of more unreliable performance.
7.2.4 Choice of Instance Types
Finally, a number of the instance types reported on in section 5.2 through to 5.4 are now considered previous generation by AWS. Indeed, M1, M2 and C1 families are not available in recently launched Regions, although still available within the Regions benchmarked. However, the lesson from these instance types is that heterogeneity introduces variation.
7.3 Critique of Chapter 6
In this section we critique the work presented in chapter 6, and we start with an overview of the system as a whole, and discuss the assumptions made in our model.
7.3.1 System Overview and Assumptions
Our model consists of the two marketplaces: A primary marketplace and a secondary performance marketplace. The primary marketplace consists of a commodity exchange (CeX), together with providers who sell instances and brokers who buy, as detailed on Figure 20 below.
Figure : The primary marketplace, which consists of a commodity exchange (CeX), operating a CDA and which providers submit ask prices and brokers submit bid prices.
The CeX specifies available instance types allowing different providers to sell instances on the exchange so long as the specification is met. The CeX does not specify any particular hardware, as doing so is likely to be hard, if not impossible, to verify, and is likely to lead to reduced liquidity. Indeed, it also runs somewhat contrary to current Cloud practice. As such, the CeX is an extension of how providers currently operate, and so our model does not require any changes to current business models used by providers. As a consequence, in the same way that current providers typically supply supposedly identical instances that are, in fact, running on different hardware, and so typically have different performance properties, we will also encounter this on the CeX.
We limit the degree of heterogeneity found across the CeX through use of the NUM_CPUS parameter. When this is set to NUM_CPUS = 1 for example, all providers on the CeX sell instances running on the same hardware. Our model assumes that increasing heterogeneity will lead to increased performance variation, this is of course not necessarily true in all cases, but is common observation.
On standard commodity exchanges, buy at prices for some quantity and sell at prices for a quantity are submitted, known as bids and asks. The bid size is then ordered from highest bid price to lowest, whilst the ask side is ordered from lowest ask price to highest. Orders are placed onto the order, and are cleared by matching the best bid with the best asks – assuming an overlap in prices. Matching occurs periodically is the resultant system is known as continuous double auction (CDA). As noted, the majority of the world’s commodity exchanges use a CDA. In the literature, it is common to use a simplified model of a CDA. In particular, orders are executed immediately, unlike on real exchange which support a variety of orders. This simplifies the model as no CDA state needs to be maintained over time. Further, it is commonplace to only generate one side of the order book.
In our model we only generate an order book on the CeX whenever the performance broker requires new instances. We generate the ask side of the book by having each market maker submit and randomly generated ask price, and some randomly generated number of other providers also submit randomly generated asks. The performance broker submits a so-called market bid order, and such an order simply accepts the best ask price and must be executed immediately. When generating the order book we do not specify a quantity for the various asks, and simply assume the provider has sufficient resource to satisfy the request. This assumptions means we do not have to consider partial orders, where the same order may be satisfied by multiple providers. Arguably, in the presence of heterogeneity, such an assumption may produce more regular mixing of CPUs then we find in practice, as described in section 5.5.
We believe our model is faithful representation of how a future commodity exchange may be organized, and in particular we introduce no features that require a change in business provider model, or that is not currently found on exchanges.
In addition to the primary marketplace, we also have a secondary marketplace which we refer to as the secondary performance marketplace – as detailed in Figure 21 below.
Figure : The secondary performance marketplace consisting of the performance broker and its clients. The broker obtains instances from the CeX in anticipation of demand. The broker does not know a priori the performance level of the instances that will be obtained. Instance performance is measured with respect to 31 different workloads, and tranches are assigned based on workload performance history. Clients submit orders to the broker specifying (number of instances, workload, tranche required). If a price is agreed then a transaction occurs.
The secondary performance marketplace consists of the performance broker and its clients. An immediate observation is that there is only one broker. Arguably, if selling performance-assured instances is profitable we would expect to see multiple brokers attempting to profit. We discuss this further in section 8.3.6. The broker obtains instances from the CeX in anticipation of demand. We do not assume the broker has a privileged position, and so in particular, does not know a priori the performance level of the instances that will be obtained. When an instance is obtained its performance measured with respect to 31 different workloads, and tranches are defined based on workload performance history, with tranche A denoting the top 25% of performance, tranche B the top 50% and tranche C the top 75%.
Clients submit orders to the broker specifying the number of instances required, workload, and tranche. Client orders are generated from a workload trace released by Google, as discussed in detail in section 6.6.3. As we discuss in section 6.6.2, we feel this trace is suitable for this purpose, as it contains ~ 650,000 jobs submitted by 925 distinct users over a 29 day period. Crucially for our use, each jobs has a priority and we know from Google papers that usage at different priority levels incurs different costs. Arguably, the trace provides an example of performance based pricing, albeit we cannot infer an actual required performance or indeed a price.
When an order has been submitted, the broker randomly generates a price between the cost of the instance from the CeX and the mark up calculated by the performance gain relative to a base level. We also generate a price the client is prepared to pay, based on instance seeking costs. That is, the price a client is prepared to pay is limited by the expected cost of obtaining the required performance level. If the broker price is less than price the client is prepared to pay, a transaction occurs. In addition, we assume that 25% of clients are risk-averse and will accept prices higher than instance seeking costs. Arguably, the Google workload trace together with a mechanism for limiting the price a client will accept provides a plausible model of client behaviour.
The following are the parameters used in the simulation:
-
NUM_CPUS: The number of different CPU models amongst sellers on the CeX
-
Order_rate (λ): The average per minute rate of order arrivals the broker receives.
-
Min_charge: The minimum charging period applied at the CeX.
-
Spread: Spread determines (in part) how ‘close’ the performance of instances running on different CPU models is.
-
Peak: Peak determines the distance from min to median.
-
Tail: Tail is such that tail > peak and determines the distance from median to max.
We considered the following values: NUM_CPUS = 1, 2, 4. Min_charge was either 10 or 60 minutes referred to as GCE and EC2 respectively. The default order rate was 1. We defined a narrow workload as one with a peak of up to a 1.03, and tail of up to 1.1, whilst a wide workload is defined as up to 1.1 for the peak and 1.25 for the tail. Finally, for NUM_CPUS = 2 or 4, we defined a spread of 1.1 and 1.37
A simulation is a model in action, and we discuss next the use of simulation in science.
7.3.2 Scientific Method: Use of Simulation
We investigate the research problem posed in section 1.3 through use of simulation. This is not the only choice available: One approach would be to conduct a survey of potential users of such a service in order to inform its design, from which hypothesis can be formed regarding how the broker should operate and potential profitability, followed by building a real world brokerage offering performance-assured instances to test the hypothesis. Such a process follows the scientific method of observation, hypothesis, prediction, experiment and theory. However, exploring various ways in which the broker may operate before finding when it is profitable, if indeed it is at all, may well be prohibitively expensive and indeed numerous other practical matters are likely to arise.
As we make use of simulation we briefly consider the role of computer simulations within the accepted scientific method. Simulations are variously defined by:
-
Peck (2004): Capture and mimic of real-world systems that, unlike real-world systems, can be acted upon.
-
Reddy (1987): Simulation is a tool used to study the behavior of complex systems which are mathematically intractable.
-
Ord-Smith (1975): Simulation is the technique by which understanding the behaviour of a physical system is obtained by making measurements or observations of the behaviour of a model representing that system.
Computer simulations can be used as an intermediary step between hypothesis/prediction and experiment, in which new ideas can be tested and hypothesis refined before ‘real-world’ experiments are conducted. Peck (2004) describes how simulations ‘fit’ between hypothesis and (real-world) experiment, allowing for the preliminary investigation of the hypothesis, as well as the generation of data we don’t yet have, which allows for the hypothesis to be refined and indeed new hypotheses to be proposed. Other views as to the nature of computer simulations and their relationship to the standard scientific method exist, for example, a computer simulation may be considered simply as a tool for the implementation of numerical methods or as an experiment in its own right; and for a full discussion of the epistemology of computer simulations see for example Winsberg (2009).
Our perspective of the use of simulations in this thesis follows both Summers (2010) and Peck (2004): simulations are used for preliminary investigation, generation of new data and observations from which hypothesis can be constructed for testing in the real-world. As such, the demonstration of a profitable simulated broker leads to the following hypothesis: under the conditions assumed in the simulation a real-world broker may be profitable.
However, as noted in section 1.3, when using simulation it is important that models can be validated and verified, and we recall Easterbrook (2017) succinctly states: Validation: Are we building the right system? Verification: Are we building the system right?
Validation is a process to ensure the model has captured reality, whilst verification is the process to ensure we have correctly implemented our model in code. Both are important to ensure erroneous conclusions are not reached. With regards to improper validation, we note how a commonly used model for pricing Collateral Debt Obligations (CDOs) failed to fully appreciate the extent to which defaults of the underlying obligations may correlate, leading to an underestimation of the risk involved, exacerbating the financial crisis of 2007-2008. Even if a model is deemed valid, incorrect implementation of it into code can lead to erroneous results, as notably demonstrated by Cartlidge and Clamp (2014) who found that the profit reported by Rogers and Cliff (2012) broker was inflated due to the presence of 2 software bugs.
Validation and verification are not always straightforward processes. Indeed, Carson (2002) notes that ‘…no model is ever 100% validated and verified’. Further, he states: ‘…When designing a new system, a completely scientific validation is not possible simply because a real-system does not always exist as a basis for comparison’. This last point is particularly noteworthy as we are building a model of a system that does not exist. How then should we approach the issue of validation and verification?
7.3.3 Verification
Verification is generally considered somewhat more straightforward then validation, and techniques are often grounded in standard software engineering practices, including:
-
Walkthroughs: A walkthrough consists of ‘walking’ somebody through the code explaining how it implements various parts of the model.
-
Simplified Models: A simplified model is one where the model has been reduced to its simplest case. If the output of the simulation ‘appears’ unreasonable then there is almost certainly a problem, whilst reasonableness of output can allow for increasing levels of complexity to be tested for.
-
Deterministic Models: If the model has stochastic inputs then it can first be reduced to a deterministic one by replacing the stochastic input with a fixed known one. A deterministic model will produce the same output each time which we can often estimate, or indeed calculate exactly, in advance.
Given the issues with using extant simulators, as discussed in 4.6, we chose to write our own simulation in Python, allowing us to focus on the specifics of our problem, rather than extending either CloudSim or CReST. We note this is the approach followed by Rogers and Cliff (2012) who also implement their brokerage in Python. Arguably, implementing our model in code we develop entirely ourselves rather than extending that developed by others, should make using walkthrough, simplification and deterministic verification straightforward: Code should mimic the model closely, making for ease of comparison of the code by someone with knowledge of the model.
Further, we are not constrained to include any unnecessary features, and so we can reduce to simplest models and also use deterministic data in place of stochastic input data if required. The latter technique was particularly useful as the model has a large number of stochastic inputs. For example, setting the number of instances requested in all orders to 1, and the number of orders received per minute to 1, then over a 48 hour period the broker can sell at most 2880 instances. Further, setting the duration of all rental times to 60 minutes in all jobs means that broker has sold at most 172,800 minutes. Finally, fixing the price at the CeX for all instances to 1 unit of currency per minute, and setting the broker prices at a 5% markup, for example, provides an upper bound on total broker profit at 8640. If a simulation is run with inputs fixed at these values but gives a profit in excess of 8640 this indicates a problem. Another useful example of using deterministic input was to acquire a fixed number of instances at the start of the simulation and keep all of them throughout the next 48 simulated hours. That is, no instances in the pool were terminated and no additions made. This allows for checking of how broker costs are calculated.
Being able to simplify the model also aided development as it allows for complexity to be added once existing code has been satisfactorily verified. For example, the model was initially developed in the context of one provider only, before a CeX marketplace was implemented with multiple providers. We initially implemented multiple providers by having them all sell at the same price, and randomly selecting one to trade with. Once this had been verified we move to a CDA where sellers randomly generate asks for submission onto the order book.
Finally, we note that developing our code means we are not subject to any unresolved bugs. However, either route offers room for our own bugs.
7.3.4 Validation
Naylor and Finger (1967) suggest the following commonly used techniques for validation of computer simulations:
-
Build a model that has high face validity: A model is said to have high face validity if it appears to be an accurate representation of a real-world system to an expert in the field. During this process we can input data into the model and check for reasonable output.
-
Validation of Model Assumptions: Validating model assumptions falls into 2 categories: structural assumptions and data assumptions. Structural assumptions are with regard to how things operate, for example, assumptions regarding how the marketplace from which the broker will obtain instances works. Structural assumptions should reflect reality and Peck (2014) stresses the importance of the model being an accurate representation of reality, but notes care must be taken with how much detail to include. Data assumptions refer to assumptions made regarding the data used to build the model.
-
Compare the model input-output transformations to corresponding input-output transformations for the real system: It is here that we can further understand why Carson (2002) notes that full validation may not be possible when a corresponding real-world system does not exist. Further, even when they do, it may not be easy, or indeed even possible, to manipulate, for example, how one would one manipulate the formation of a planet in order to validate a model of planet formation?
The model presented in chapter 6 is the same as presented in O’Loughlin and Gillam (2017) “A Performance Brokerage for Heterogeneous Clouds”. Future Generation Computer Systems (FGCS), but with one addition: a mechanism to allow the client to decide whether or not to accept the broker’s quoted price. Previously, we simply calculated the minimum price the broker would have to set in order to break even, and assumed this is accepted by the client. However, all other structural elements remain the same. Given that the paper has been through the peer review process arguably the model has high face validity.
Further, the simulation behaves in a manner we would expect, providing additional evidence for high face validity as: (1) increasing demand increases profit; (2) increasing variation increases profit, and small variation lead to a loss; (3) gaming of the system leads to a fall in profit; (4) low minimum charges lead to a loss; (5) even when profitable the margin is small.
We strive for validity of structural and data assumptions as follows:
CeX: We implement a simplified version of a CDA, which is used in the majority of the world’s financial exchanges, as well as being we widely studied.
Sellers: Clouds are largely opaque, and it is difficult to determine much beyond identifying some basic hardware details of instances and measuring their performance. However, from section 5.5 we do know that allocation of hardware to instances can be somewhat irregular. We therefore implement a simple model of sellers: they are potentially heterogeneous, they allocate hardware to instances in an irregular manner and they provide instances from which the broker can determine a CPU model and measure performance
Clients: We make use of a Google workload trace to create a population of clients. The workload trace is an example of real-world users with performance needs in a Cloud like system. However, we do assume the population of users find the public benchmarks as offered by the broker useful, which we discuss in more detail in 8.3.4.
Instance Performance: We extensively benchmark EC2 in order to obtain sufficient data from which we can accurately model instance performance.
Of course we cannot say for certain that no mistakes are present. As with all scientific results this requires independent verification and replication and we hope others will find our work useful enough to want to do so.
7.3.5 Use of Benchmarks
The purpose of any benchmark is to help users summarise the performance of a machine so that that they can estimate the likely performance of their workloads on it. The broker offers a set of public benchmarks, and clients specify a performance requirement with respect to one of them. But to know which benchmark to choose and what performance tranche to request, requires the client to know how well the various benchmarks correlate to the workload they intend to run. A client will incur costs when renting instances from the broker for a given workload/tranche in order to run their own workloads, simply to determine degree of correlation. Indeed, the client may find that none of the benchmarks on offer correlate well. The broker may be able to alleviate overheads for prospective clients by offering opportunistic use of available instances in the pool. Or indeed, the broker may absorb the cost. However, we note that we haven’t factored additional costs such as these into our model, or indeed transactional costs which are typically present on a commodity exchange, and we further note the small margins the broker operates at.
The viability of the broker when offering public benchmarks is dependent upon being able to find a set of benchmarks which will attract sufficient demand. The advice from SPEC is perhaps germane here: ‘A standardised benchmark can serve as a useful reference point, but SPEC does not claim that any standardised benchmark can replace benchmarking your own actual application…’. Instead of offering public benchmarks (or perhaps in addition to), the broker can operate by requiring a client to upload/register a private benchmark for their exclusive use. Whenever a client requests an instance, it is with respect to one or more of its private benchmarks.
A private benchmark scheme should alleviate correlation issues that public benchmark schemes are likely to suffer from. However, use of them introduces new difficulties for the broker. In our formulation the broker periodically measures the performance of all available instances with respect to each benchmark, allowing the broker to generate a global performance history for each benchmark, from which tranche points are determined. By updating global history, as well as removing old data, the broker can update prices so as to reflect recent available performance. Further, this allows for an on-demand performance service as instances are already benchmarked prior to being allocated to clients. Benchmarks for which there is limited demand can be dropped from the publically available set, ensuring the broker is not incurring costs in benchmarking instances for which there is limited demand.
In a private benchmark scheme, the broker is faced with the question of how frequently to update history. For users with periodically bursty needs, the broker will want to avoid benchmarking instances in periods when the user is not ‘in the market’. Further, there is also the problem of time required to measure instances with respect to a large set of benchmarks. Indeed, in a population of 925 clients, all of whom are ‘in the market’, how far in advanced do instance need to be obtained in order to measure with respect to all private benchmarks? Clearly, a private benchmark scheme introduces new considerations for a broker.
Finally, when using public benchmarks there is potential for users to game the system. Suppose a number of the benchmarks have strong positive correlation, and so for a client they equally well predict performance of their workloads. In order to achieve the lowest price from the broker, the client will request a tranche with respect to the workload that has least variation. Suppose instead, there are workloads with strong negative correlations. In this case a client can request an instance in low performing tranche, and so pay lowest prices, and yet due to its strong negative correlation obtain one with a high tranche performance level with respect to some workload. To avoid these issues, the broker should aim for a mix of workloads with small or no correlation.
7.3.6 Market Considerations
In section 6.7 we model the number of orders arriving each minute as Po(λ) where λ is specified as a parameter of the simulation, with each order being for up to 20 instances. For the same value of λ, different runs of the simulation produces different levels of demand. This addresses an issue identified by Clamp and Cartlidge (2013) in the Rogers and Cliff (2012) broker, namely lack of stochastic variation in the demand pattern. However, whilst we have stochastic variation, expected demand is predictable, and yet market conditions tend to vary, particularly over a longer period. Use of non-homogeneous Poisson process Po(λ(t)) would allow for time varying mean, and so we can introduce, for example, periods of increasing demand up to a peak, followed by a decline. Such a demand pattern may perhaps be experienced over consecutive weekend for certain services. Further analysis will reveal the sensitivity of the broker to changes in mean.
In our model the broker obtains instances from a primary marketplace, the CeX, for sale on a secondary marketplace, the latter simply consisting of the performance broker and its clients. The performance broker quotes prices to its client which are either accepted or rejected. However, the broker is the only seller in the secondary market. If performance variation creates a viable market we would expect to see multiple sellers, and this will create price pressure as they compete with each other. The largest profit margin reported in section 6.10 was 4%, and so any additional downward pressure on this may mean that the broker is simply not viable. Cartlidge and Clamp (2014) showed that it was the introduction of the ARMA marketplace, with multiple buyers and sellers selling reserved instance months, which closed the opportunity identified by Rogers and Cliff (2012). Further analysis will reveal if the performance broker is to suffer a similar fate.
Dostları ilə paylaş: |