The research problem we address in this thesis is:
To what extent can a performance broker profitably address performance variation in commodity Infrastructure Cloud marketplaces through offering performance-assured instances?
Is this problem of sufficient scale, impact, relevancy and permanency to justify an in-depth investigation as presented in this thesis?
Judging by the rapid expansion of major providers such as AWS, Google and Microsoft, there has been significant uptake of Cloud services. Indeed, a recent Gartner report suggests that organisations are increasingly adopting a Cloud-first approach to IT provisioning (Pettey, 2017), whilst a European Commission report suggests that cloud computing may ‘…contribute up to €250 billion to EU GDP in 2020 and 3.8 million jobs’ (Bradshaw, 2012). As we have seen in chapter 3, performance variation of supposedly identical instances sold at the same price has been widely reported, across a range of different benchmarks. As such, it is reasonable to suggest that the problem has the potential to affect a sizeable number of real-world users.
However, whilst the number of users affected is potentially large, it does not necessarily follow that the impact is large. The impact of performance variation depends upon the degree to which it is present and upon the tolerance of users to variations in cost and completion time. Infrequent users of the Cloud may not even notice variation, particularly if they are completing a different type of work each time, whilst for other users even the worst-case cost and completion time may be insignificant compared with the value of the work being produced. However, we would expect that as variation grows the number of users tolerant of it will diminish, and we note variation of up to 100% amongst C1 instances on EC2: at that level a job completing in 10 hours on the best-performing instance may take 20 hours on the worst, doubling both completion time and, under per-hour billing, cost. Arguably, the number of users prepared to tolerate differences of the magnitude reported in chapter 5 is likely to be small.
Providers have an incentive to minimise performance variation in order to avoid a reputation for running an inconsistent and unreliable service. Minimising variation within a homogeneous instance type requires new hardware solutions, because numerous shared components cannot be allocated fairly, as discussed in detail in section 3.3. Until this is resolved there is potential for degradation due to noisy neighbours. However, it is heterogeneity that is the major cause of performance variation. Undoubtedly, providers strive to minimise heterogeneity within a given instance type; however, as we have noted numerous times in this thesis, instance types typically start as homogeneous but invariably become heterogeneous. Evidence to date suggests that maintaining homogeneity for popular instance types is difficult. Indeed, changes on GCE from its initial beta launch in 2013 to date are instructive: at launch the HighCPU instance type was homogeneous. As GCE expanded the type became heterogeneous, although individual zones within regions remained homogeneous, and so depending upon zone availability it was possible to always obtain the same CPU model. Presently, zones are heterogeneous, and whilst a CPU model may be specified in a request, allocation is subject to availability. We note that HighCPU instances are presently provisioned from 5 different CPU models across GCE. There appears to be a permanency to the problem of maintaining homogeneity.
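By way of illustration, the degree of CPU heterogeneity within an instance type can be gauged simply by recording the backing processor of every instance obtained. The following is a minimal sketch, assuming a Linux guest on which the provider exposes the host CPU model through /proc/cpuinfo (as is typically the case); repeating it across a fleet of supposedly identical instances reveals how many distinct models are in use.

    # Minimal sketch: report the CPU model backing a Linux instance by
    # parsing /proc/cpuinfo. Running this across many supposedly identical
    # instances indicates how many distinct models back the instance type.
    def cpu_model(path="/proc/cpuinfo"):
        with open(path) as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
        return "unknown"

    if __name__ == "__main__":
        print(cpu_model())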
Issues due to heterogeneity are likely to arise in new systems built either atop or from extant Clouds. For example, equivalent instances in Cloud of Clouds and federated Clouds are highly likely to be heterogeneous. Indeed, work on Ultra Large Scale Systems (Northrop et al., 2006) identifies heterogeneity as one of their defining characteristics. Further, new services such as CaaS and FaaS, built atop Clouds that exhibit performance variation, will almost certainly suffer the same problems. We note recent work by Billock (2017) demonstrating performance variation on FaaS services from AWS, Google and Azure, although the causes are unknown; indeed, determining such information for this type of service is likely to be difficult, if possible at all. We do, however, suggest one possible cause: the so-called cold start problem, where some of the observed variation is due to the time taken to initialise the execution environment before the workload can run.
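To make the cold start effect concrete, the sketch below, which assumes a hypothetical HTTP-triggered function at FUNC_URL rather than any particular provider's API, times a burst of invocations; the first call will typically include initialisation overhead, whilst subsequent calls are served by an already-warm environment, so comparing the two gives a rough indication of how much of the observed variation a cold start could account for.

    # Minimal sketch: FUNC_URL is a hypothetical, placeholder endpoint.
    # The first invocation typically incurs cold-start overhead; later
    # invocations are served warm, so the difference approximates it.
    import time
    import urllib.request

    FUNC_URL = "https://example.com/hypothetical-function"

    def invoke(url):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        return time.perf_counter() - start

    if __name__ == "__main__":
        latencies = [invoke(FUNC_URL) for _ in range(10)]
        warm = sorted(latencies[1:])
        print(f"first call (likely cold): {latencies[0]:.3f}s")
        print(f"median of remaining (warm): {warm[len(warm) // 2]:.3f}s")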
Figure: Performance variation of FaaS services across AWS, Google Cloud and Azure (Billock, 2017). Source: http://blog.backand.com/wp-content/uploads/2017/08/10K-calls-across-FaaS-providers.png. We note positive skew, high peaks and long tails.
We immediately notice the similarity with the histograms presented in sections 5.2 through 5.4. As our results in those sections focused on EC2, we present the AWS Lambda results below, in which we observe a highly peaked, multi-modal distribution with positive skew and long tails, as may be indicative of the use of heterogeneous hardware; a brief sketch of how such distribution shape might be quantified follows the figure.
Figure: The AWS Lambda histogram is multi-modal; we note the similarity with the histograms presented in sections 5.2 – 5.4 showing performance variation across EC2 instances (Billock, 2017). Source: http://blog.backand.com/wp-content/uploads/2017/08/AWS-Lambda-performance-histogram.png.
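The shape of such a histogram can be summarised numerically: positive skewness corresponds to the long right tail, and high excess kurtosis to the sharp peak. The sketch below, using illustrative synthetic timings rather than Billock's or our own measurements, computes both statistics for a two-component mixture standing in for a multi-modal latency distribution.

    # Minimal sketch with synthetic, illustrative data only.
    # Skewness is the third standardised moment; excess kurtosis is the
    # fourth standardised moment minus 3 (0 for a normal distribution).
    import numpy as np

    def skew_kurtosis(samples):
        x = np.asarray(samples, dtype=float)
        z = (x - x.mean()) / x.std()
        return (z ** 3).mean(), (z ** 4).mean() - 3.0

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Two-component mixture standing in for a multi-modal latency distribution.
        timings = np.concatenate([rng.normal(1.0, 0.05, 800),
                                  rng.normal(1.5, 0.10, 200)])
        skewness, excess_kurtosis = skew_kurtosis(timings)
        print(f"skewness={skewness:.2f}, excess kurtosis={excess_kurtosis:.2f}")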
The development of the Internet of Things (IoT) envisages significant numbers of mobile devices connected to the Internet; however, due to their limited computational capacity, Aazam et al. (2015) suggest a so-called Cloud of Things model, where Cloud capacity is more dispersed than currently found and devices can find and off-load work onto the nearest suitable edge location. It is likely that containers or function services will be key to enabling this. Arguably, off-loaded computations are likely to be latency sensitive, raising questions as to the contract between an application and a remote C/FaaS provider.
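One simple realisation of 'find the nearest suitable edge location' is for a device to probe a set of candidate endpoints and off-load to whichever responds fastest. The sketch below assumes a hypothetical list of candidate edge URLs and is intended only to illustrate why the latency contract with a remote C/FaaS provider matters, not to represent any particular off-loading protocol.

    # Minimal sketch: CANDIDATES are hypothetical edge endpoints.
    # Each is probed once and the lowest round-trip time wins.
    import time
    import urllib.request

    CANDIDATES = [
        "https://edge-a.example.com/ping",
        "https://edge-b.example.com/ping",
    ]

    def rtt(url, timeout=2.0):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
            return time.perf_counter() - start
        except OSError:
            return float("inf")  # unreachable endpoints are never chosen

    def nearest_edge(candidates):
        return min(candidates, key=rtt)

    if __name__ == "__main__":
        print("off-loading to:", nearest_edge(CANDIDATES))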
From the discussion in this section, the research problem chosen is arguably of sufficient scale, impact, relevancy and permanency to justify the investigation into the causes of performance variation on Cloud systems, and into potential solutions to it, conducted in this thesis.