The broker considered in chapter 6 offers performance-assured instances. For the client, past performance provides an assurance of future performance but not a guarantee; indeed, as the broker neither owns the infrastructure nor is assumed to have any privileged relationship with providers, they cannot guarantee future performance. There is risk on both sides: the user may pay for an instance whose subsequent performance degrades below the minimum tranche level requested, whilst the broker may sell an instance whose performance subsequently improves beyond the price it is being sold for. For some users this may represent an unacceptable risk, as there is presently no guaranteed worst case for performance. An SLA-based approach would see the broker provide a guarantee of future performance, with a penalty paid should the guarantee not be met.
An SLA is an agreement between a service provider and a service user that typically provides a description of the service, the level/quality of service the user can expect, rights and responsibilities, contact information and, potentially, penalty clauses in the event the service is either unavailable or becomes non-performant. A key performance indicator (KPI) is a metric by which the quality of service (QoS) is to be measured, and a Service Level Objective (SLO) is a particular value of the KPI used to define the expected QoS. Typical Cloud SLAs use regional availability as a KPI, with an SLO of, for example, 99.95% on EC2 (Amazon Web Services, 2017). However, SLOs for the performance of resources are typically absent.
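For concreteness, an availability SLO of 99.95% implies only a small permitted downtime budget. A quick illustrative calculation over a 30-day month (this is arithmetic on the SLO figure, not an AWS-published number):

```python
# Downtime budget implied by a 99.95% availability SLO over a 30-day
# month (illustrative arithmetic only).
slo = 0.9995
minutes_in_month = 30 * 24 * 60                 # 43,200 minutes
allowed_downtime = (1 - slo) * minutes_in_month
print(round(allowed_downtime, 1))               # 21.6 minutes
```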
In order for SLAs to be useful, SLO violations must be detected. Clark et al. (2011) describe active monitoring as the process of performing specific measurements at specified intervals for this purpose. Emeakaroha et al. (2012) note that extant work on Cloud SLAs does not typically consider how violations are detected, and they propose a framework called LoM2HiS which aims to map low-level metrics to high-level SLOs. Similarly, as part of the RESERVOIR project, Clayman et al. (2012) describe combining raw data collected from monitoring ‘service clouds’ into KPIs. However, there is a lack of convincing examples showing that high-level metrics may be determined from low-level metrics; indeed, the only example presented defines the ‘high level’ availability in terms of ‘low level’ downtime. We do, however, note that CPI2, a project at Google, monitors cycles per instruction in order to detect likely poor workload performance.
A performance broker can add workload-specific performance guarantees for instances. From the discussion in section 3.5 we know that the most useful and informative metrics are based on task progression, such as execution times or work done, rather than machine characteristics. We therefore suppose the broker offers SLAs where the KPIs are specified in terms of execution times, with the SLO specifying a particular execution time. The SLA will also define the frequency of measurement, as well as how many measurements below the SLO are needed for an SLA violation to occur; for example, an SLA may define a violation as having occurred when performance is below the SLO for more than 5% of measurements in a specified duration.
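A violation rule of this shape is simple to encode. The sketch below assumes a per-measurement execution-time SLO and the 5% threshold from the example; the function name and parameters are illustrative, not part of any proposed system:

```python
# Sketch of SLA violation detection against an execution-time SLO.
# A measurement "breaches" when the task ran slower than the SLO time.

def sla_violated(execution_times, slo_seconds, max_breach_fraction=0.05):
    """Return True if more than max_breach_fraction of measurements
    exceed the SLO execution time (i.e. performance is below the SLO)."""
    breaches = sum(1 for t in execution_times if t > slo_seconds)
    return breaches / len(execution_times) > max_breach_fraction

# Example: 100 measurements, 7 of them slower than a 10-second SLO.
times = [9.0] * 93 + [12.0] * 7
print(sla_violated(times, slo_seconds=10.0))  # True: 7% > 5%
```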
Typically, an SLA violation will result in a penalty charge of some kind, with service providers such as EC2 offering credits for future service use. Aljoumah et al. (2015) state that an ‘ideal’ Cloud SLA should include penalties in the case of the SLA not being met. The question then is: how should the broker price this liability? Whilst it is typical for proposed SLAs to include a price (Marilly et al., 2002; Haq et al., 2010; Patel et al., 2009), only Li (2012), as far as we are aware, addresses the pricing of liability. He makes use of synthetic Collateralised Debt Obligations (CDOs), which are made up of Credit Default Swaps (CDS) - insurance-type contracts that pay out in the event of a credit default, but which, unlike standard insurance, can be purchased without owning the underlying debt. With a synthetic CDO, one side of the arrangement agrees to make regular payments in the hope of receiving a pay-off in the event of a credit event, whilst the other side receives regular payments but is liable for pay-outs.
Providing performance-assured instances at the point of sale alleviates the performance variation risk clients face when obtaining instances directly from the CeX. An SLA, however, alleviates the risk of future performance degrading beyond a minimum level, and one would reasonably expect to pay a fee for this. How should the broker price SLAs?
Paying to alleviate future risk is commonplace and forms the basis of the insurance industry, which collected over 5 trillion dollars in premiums worldwide in 2016 (Benfield, 2016). Insurance is defined as: ‘…a mechanism for reducing the adverse financial impact of random events that prevent the fulfilment of reasonable expectations’. Arguably, offering workload-specific performance-based SLAs fulfils this definition: random events, such as the unseen resource-consuming actions of neighbours, impact the performance of an instance and may do so to such a degree that a client cannot fulfil their expectation (of the instance) in terms of work delivered. The financial impact is, at a minimum, the price being paid for the instance. Considering SLAs as akin to insurance policies, we next consider a basic model for insurance pricing which may be applied to SLA pricing.
An insurance premium is the amount an insuree pays to an insurer in order to receive protection against risk. The insurer receives regular fixed premiums but is liable for random pay-outs to insurees. Insurance pricing is the determination of the level of premiums required in order to ensure that the insurer can meet expected payouts and make a profit. So-called collective risk models are typically used to determine insurance premiums.
Rotar (2015) describes a ‘simple but relatively complete model of insurance with many clients’ as follows. Let Xi denote the random pay-out to the ith client, where 1 ≤ i ≤ n (the random variables Xi are often called the severities). Further assume that the Xi are independently and identically distributed (i.i.d.), in which case the insurees are known as a homogeneous group. We let Sn denote the aggregate pay-out (also known as the aggregate loss):
Sn := X1 + X2 + … + Xn. (7.1)
The premium, ci > 0, charged to insuree i is:
ci := E[Xi] + ε. (7.2)
As the Xi are i.i.d., E[Xi ] is the same for all clients and so the premium c = ci is the same for all i. We write c = m + ε where m := E[Xi ].
The term ε > 0 is referred to as the security loading. The premium being charged is greater than the expected pay-out and so insurees are entering into a loss-making game. Thomas (2016) provides a mathematical (but not psychological) explanation, based on the use of utility functions, as to why loss making games may be preferable to some people. The profit of the insurer is a random variable:
profit := n*c - Sn. (7.3)
The probability of not suffering a loss, i.e. Pr(profit ≥ 0), satisfies:
Pr(profit ≥ 0) ≥ 1 - Pr(|Sn/n - m| > ε). (7.4)
As E[Xi] = m, by the Weak Law of Large Numbers (LLN) we have, for arbitrarily small ε > 0, Pr(|Sn/n - m| > ε) → 0 as n → ∞. Consequently:
Pr(profit ≥ 0) → 1 as n → ∞ for arbitrarily small ε > 0. (7.5)
There are two interpretations of this conclusion. First, for a fixed ε > 0 the insurer will not (or rather, is increasingly unlikely to) suffer a loss so long as they can find a sufficient number of insurees. Second, as the number of insurees increases, the premium the insurer charges becomes ‘closer’ to the expected pay-out the insuree receives; this shows how risk redistribution relies on the number of participants. The expected pay-out is often referred to as the fair actuarial price; however, setting ε = 0, i.e. charging only the fair actuarial price, leads to a loss with probability close to ½. Rotar (2015) notes that how close the premium c should be to the expected pay-out is ‘...one of the main objects of study in Actuarial Modeling...’.
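The pooling effect can be seen numerically. The sketch below computes Pr(profit ≥ 0) exactly for a two-point pay-out distribution: each insuree triggers a pay-out of 1440 units with probability 0.01, and the premium is m + ε with an assumed loading ε = 5. All parameter values are illustrative assumptions, not values from the text.

```python
from math import comb

# Exact Pr(profit >= 0) for n i.i.d. two-point risks: each insuree
# triggers a pay-out of `payout` with probability `p_fail`, else 0.
# With premium c = m + eps, profit = n*c - S_n is non-negative exactly
# when at most floor(n*c / payout) insurees trigger pay-outs.
def prob_no_loss(n, payout=1440.0, p_fail=0.01, eps=5.0):
    c = payout * p_fail + eps            # premium per insuree (m + eps)
    max_failures = int(n * c // payout)  # largest k with k*payout <= n*c
    return sum(comb(n, k) * p_fail**k * (1 - p_fail)**(n - k)
               for k in range(max_failures + 1))

for n in (100, 500, 2000):
    print(n, round(prob_no_loss(n), 3))
```

Because pay-outs are discrete, the probability is not perfectly monotone for small n with a fixed loading, but the pooling trend is clearly visible as n grows.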
Rotar (2015) determines an approximation for the premium c, payable by each of n insurees, required to ensure the probability of profit is at least β = 0.95. The approximation is given by:
c ≈ m + (1.64*σ)/√n, where Var[Xi] = σ². (7.6)
Note that this is derived using the Central Limit Theorem (CLT), which states that the standardised sum (Sn - n*m)/(σ*√n) is approximately standard normal for sufficiently large n, irrespective of the distribution of the Xi. The approximation for c is ‘valid’ only up to the degree of the approximation by the normal.
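The approximation is straightforward to compute. A minimal helper, using z = 1.645 for β = 0.95 (which the 1.64 in eq. 7.6 rounds):

```python
from math import sqrt

# CLT-based premium approximation: c ≈ m + z*sigma/sqrt(n), where z is
# the standard normal quantile for the target no-loss probability beta
# (z ≈ 1.645 for beta = 0.95).
def clt_premium(m, sigma, n, z=1.645):
    """Per-insuree premium so that Pr(profit >= 0) is approximately beta."""
    return m + z * sigma / sqrt(n)
```

Note how the security loading term z*sigma/sqrt(n) shrinks as the pool grows: quadrupling n halves the loading.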
As an example of how insurance models could be applied to the pricing of SLAs, consider the following. Suppose the broker sells SLAs on 100 instances to 100 clients for a given workload in a particular tranche for a 24 hour period. Suppose we define an SLA violation as performance being under the requested tranche level for at least 5% of measurements. Based on past history, the broker has estimated a probability of 0.01 of SLA failure. Further, when an SLA failure occurs, the broker pays a full refund, which we assume to be 1440 units of currency based on a charge of one unit of currency per minute.
What premium should the broker charge per instance so that, with a probability of 95%, the broker does not make a loss? As before, we let Xi denote the random pay-out to the ith client and we assume that the Xi are i.i.d. Now, Xi = 1440 with a probability of 0.01 or Xi = 0 with a probability of 0.99. In order to apply eq. 7.6 we need to know m = E[Xi] and σ² = Var[Xi]. As Xi is 1440 times a Bernoulli random variable we have: m = 1440*0.01 = 14.4 and σ = 1440*√(0.01*0.99) ≈ 143.28. And so, as an approximation, c ≈ 14.4 + 1.64*143.28/10 ≈ 37.9 units of currency.
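This arithmetic can be verified directly; a minimal check with the same parameters:

```python
from math import sqrt

# Numeric check of the worked example: full refund of 1440 units, SLA
# failure probability 0.01, n = 100 instances, z = 1.64 as in eq. 7.6.
payout, p_fail, n, z = 1440.0, 0.01, 100, 1.64

m = payout * p_fail                            # expected pay-out E[Xi]
sigma = payout * sqrt(p_fail * (1 - p_fail))   # standard deviation of Xi
c = m + z * sigma / sqrt(n)                    # premium per instance
print(round(m, 2), round(sigma, 2), round(c, 1))  # 14.4 143.28 37.9
```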
In more complex models for the aggregate pay-out/loss, the number of claims is a stochastic process, typically either a homogeneous or non-homogeneous Poisson process, as these are commonly used for modelling arrival rates (in this case, claims for pay-outs). This leads to so-called ruin models, defined as follows:
R(t) := u + c*t - S(t), (7.7)
where u is an initial surplus, c*t is the aggregate premium collected up to time t, S(t) is the pay-out/loss process, and R(t) is referred to as the surplus or reserve process. When R(t) < 0 the insurer is said to be in ruin, and a key consideration of ruin theory is the probability of ruin over a finite time horizon.
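A ruin process of this form can be simulated. The sketch below uses unit time steps with Poisson claim arrivals; all parameter values (premium rate, claim rate, claim size, horizon) are illustrative assumptions, not values from the text.

```python
import random
from math import exp

def poisson_sample(lam):
    """Knuth's method for sampling Poisson(lam); adequate for small lam."""
    threshold = exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def ruin_probability(u, premium_rate, claim_rate, claim_size, horizon,
                     trials=4000):
    """Estimate Pr(R(t) < 0 for some t <= horizon), where
    R(t) = u + premium_rate*t - S(t), checked at unit time steps."""
    ruined = 0
    for _ in range(trials):
        surplus = u
        for _t in range(horizon):
            surplus += premium_rate                           # premiums in
            surplus -= claim_size * poisson_sample(claim_rate)  # claims out
            if surplus < 0:
                ruined += 1
                break
    return ruined / trials

random.seed(7)
print(ruin_probability(0.0, 20.0, 0.01, 1440.0, 100))
print(ruin_probability(2000.0, 20.0, 0.01, 1440.0, 100))
```

As expected, a larger initial surplus u substantially lowers the finite-horizon ruin probability.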
However, collective risk models, and typical extensions of them, assume independence amongst the risks. But what if all of the broker’s instances under an SLA were attached to the same network storage, which suffered a hardware degradation? Indeed, what if all instances covered by an SLA are co-located on the same host? There would appear to be clear potential for correlated risks, which we discuss next.
A key assumption in many models of insurance is that risks are independent: by the law of large numbers, as the number of insured risks grows, the variation in the expected loss decreases. This decrease in volatility is what makes selling insurance feasible and is often referred to as risk redistribution or risk pooling. Independent risks do not move/respond to the same external shock. However, Seifert et al. (2013) note that correlated risks abound; flood risks, for example, are correlated. In this case the pricing of insurance premiums must account for the correlated risk, and Kunreuther et al. (2009) note that premiums may then considerably exceed an insuree’s expected loss.
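The contrast between pooling independent and correlated risks can be made concrete. For n two-point risks each paying L with probability p, the per-insuree standard deviation of the aggregate loss shrinks as 1/√n when risks are independent, but stays constant under a common shock that triggers all risks at once. Parameter values below are illustrative:

```python
from math import sqrt

# Per-insuree sd of the aggregate loss S_n for n two-point risks, each
# paying L with probability p. Independent: Var(S_n) = n*L^2*p*(1-p).
# Common shock (all fail together): S_n = n*L with probability p, so
# Var(S_n) = n^2*L^2*p*(1-p).
def per_insuree_sd(n, L=1440.0, p=0.01, correlated=False):
    unit_sd = L * sqrt(p * (1 - p))   # sd of a single risk, ~143.28
    if correlated:
        return unit_sd                # sd(S_n)/n = (n*unit_sd)/n
    return unit_sd / sqrt(n)          # sd(S_n)/n = (sqrt(n)*unit_sd)/n

for n in (10, 100, 1000):
    print(n, round(per_insuree_sd(n), 2),
          round(per_insuree_sd(n, correlated=True), 2))
```

Under the common shock, adding insurees does nothing to reduce volatility, which is why correlated risks command premiums well above the expected loss.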
Correlated risks also exist in financial markets. At the height of the financial crisis, the world’s biggest insurance company, American International Group (AIG), required a bail-out of $85 billion to prevent it from failing. One of the causes of AIG’s problems was the selling of credit default swaps offering protection against default events on collateralised debt obligations (CDOs). By 2007 CDOs predominantly contained mortgages, and the collapse of the sub-prime mortgage market in the United States led to large-scale defaults in the CDO market, in turn leading to correlated pay-outs.
The broker is potentially also exposed to correlated risks, namely, that the performance of instances within the pool is correlated, and so moves in response to the same external stimulus. Such correlations could lead to multiple SLAs failing at the same time. There are a number of ways in which correlated performance could arise. Instances from the same provider may well have infrastructure in common, such as network segments or network storage, and this can lead to performance dependencies; however, identifying this may be difficult due to the opaque nature of Clouds. Similarly, instances co-located on the same host will also have performance dependencies. Indeed, we have already demonstrated in section 4.9 how resource contention from co-locating neighbours can affect performance.
As the broker does not own infrastructure it is not possible to eliminate correlated risks. However, the broker may be able to reduce them through diversification.
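One hedged sketch of why diversification helps: suppose the broker splits its SLAs evenly across k providers, and (as a simplifying assumption) each provider acts as a common shock for its own instances while providers fail independently. The per-insuree loss volatility then falls by a factor of √k. The model and parameters below are illustrative, not a claim about any real provider:

```python
from math import sqrt

# Diversification sketch (assumed model): n SLAs split evenly across k
# independent providers; within a provider all instances fail together
# (a common shock), but providers fail independently. The per-insuree
# sd of the aggregate loss then falls as 1/sqrt(k).
def diversified_sd(n, k, L=1440.0, p=0.01):
    """Per-insuree sd of aggregate loss with n insurees in k equal,
    mutually independent, internally fully-correlated groups."""
    group_size = n / k
    group_sd = group_size * L * sqrt(p * (1 - p))  # fully correlated group
    total_sd = sqrt(k) * group_sd                  # k independent groups
    return total_sd / n                            # = L*sqrt(p(1-p))/sqrt(k)

for k in (1, 4, 16):
    print(k, round(diversified_sd(1600, k), 2))
```

With k = 1 the broker is fully exposed to a single common shock; spreading the same pool over more independent providers recovers part of the pooling benefit that correlation destroys.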