26. Appendix
Modern servers contain an array of hardware components which may affect the performance of the workloads run on them, perhaps the most crucial being memory and the Central Processing Unit (CPU). Gregg (2012) describes the role of the CPU as ‘...driving all software...’, and Stokes (2010) states that at its most simple a CPU takes ‘...a stream of instructions (code) and stream of data as input, and it produces a stream of data as output’.
Instructions and data are stored in memory, which is logically organised as a sequence of addressable locations of the same size. A bus connects memory and the CPU, allowing for the transfer of instructions and data. Memory is random access, meaning that any location can be accessed without first having to access any others, and as such is referred to with the abbreviation RAM. The storing of instructions in RAM is known as the stored-program concept and is a key contribution of the Von Neumann architecture (1945), one of the earliest models of computer design. In this model instructions and data are stored in the same physical RAM, whilst in the Harvard architecture they are stored in separate memories. The stored-program concept is derived from earlier work by Turing on the limits of computability, in particular the notion of a Universal Turing Machine (UTM), which can simulate the operation of any other Turing Machine (TM). A TM has a unique binary encoding, and a UTM simulates the action of the TM on data by taking the encoding plus the data as input.
A high-level black-box description of the operation of a CPU is as follows: an instruction is fetched from RAM, decoded and then executed. This fetch/decode/execute cycle, known as the instruction cycle, is then repeated, and in this way instructions are executed in a sequential manner. CPU instructions have a binary representation (opcode), and decoding means determining what the instruction is from its opcode. In addition, decoding will also determine whether any operands, i.e. data to be acted upon, need to be fetched from RAM. Execution of an instruction means carrying out a well-defined sequence of actions.
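To make the instruction cycle concrete, the following is a minimal sketch in C of a toy accumulator machine, using a four-instruction encoding invented purely for illustration (it does not correspond to any real ISA): the loop fetches an instruction from ‘RAM’, decodes its opcode and operand address, and executes it.

/* Toy fetch/decode/execute loop; the instruction encoding is hypothetical. */
#include <stdio.h>
#include <stdint.h>

enum { OP_LOAD = 0, OP_ADD = 1, OP_STORE = 2, OP_HALT = 3 };

int main(void) {
    /* "RAM": low addresses hold instructions, higher addresses hold data.
     * Each instruction is encoded as (opcode << 8) | operand address.    */
    uint16_t ram[16] = {
        (OP_LOAD  << 8) | 10,   /* acc = ram[10]       */
        (OP_ADD   << 8) | 11,   /* acc = acc + ram[11] */
        (OP_STORE << 8) | 12,   /* ram[12] = acc       */
        (OP_HALT  << 8) | 0,
        [10] = 7, [11] = 35
    };
    uint16_t pc  = 0;  /* program counter: address of next instruction */
    uint16_t acc = 0;  /* accumulator register                         */

    for (;;) {
        uint16_t instr  = ram[pc++];     /* fetch                    */
        uint8_t  opcode = instr >> 8;    /* decode: which operation? */
        uint8_t  addr   = instr & 0xFF;  /* decode: which operand?   */
        if (opcode == OP_HALT) break;    /* execute                  */
        else if (opcode == OP_LOAD)  acc = ram[addr];
        else if (opcode == OP_ADD)   acc += ram[addr];
        else if (opcode == OP_STORE) ram[addr] = acc;
    }
    printf("ram[12] = %u\n", (unsigned)ram[12]);   /* prints 42 */
    return 0;
}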
The set of instructions that a CPU implements is known as its Instruction Set Architecture (ISA). Hennessy and Patterson (2012) define a CPU’s ISA as the ‘…programmer-visible instructions…’, which serve as the ‘…boundary between hardware and software…’. The Von Neumann model specified 21 different instructions. Modern CPUs typically have significantly more instructions available to them than the Von Neumann machine, and different CPU models may well have different ISAs. The ISA of a so-called Reduced Instruction Set Computer (RISC) consists of a small number of low-level, general-purpose instructions, with higher-level operations expressed as sequences of low-level ones. By contrast, the ISA of a so-called Complex Instruction Set Computer (CISC) will contain a large set of both low- and high-level instructions. For example, a CISC ISA typically contains a square root instruction whilst a RISC one does not. Whilst a RISC ISA will require the execution of more than one instruction to perform a square root, this does not necessarily mean it will take longer, as measured by wall-time, than taking a square root on a CISC machine.
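The square root example can be sketched as follows: on a RISC-style ISA the operation is built from a short sequence of simple multiplies, adds and divides (here via Newton’s method), whereas on a CISC ISA the compiler may emit a single square-root instruction for the library call (for example SQRTSD on x86-64). This is an illustration of the idea rather than of any particular compiler’s output.

/* sqrt(x) from simple operations versus the single-instruction form.
 * Compile with the maths library, e.g.: cc sqrt_demo.c -lm            */
#include <stdio.h>
#include <math.h>

static double sqrt_newton(double x) {
    double guess = x > 1.0 ? x / 2.0 : 1.0;  /* initial estimate       */
    for (int i = 0; i < 20; i++)             /* refine using multiply, */
        guess = 0.5 * (guess + x / guess);   /* add and divide only    */
    return guess;
}

int main(void) {
    double x = 2.0;
    printf("Newton : %.15f\n", sqrt_newton(x));
    printf("sqrt() : %.15f\n", sqrt(x));     /* may compile to one instruction */
    return 0;
}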
Applications are typically written using high-level programming languages which have to be compiled (translated) into a specific binary encoding suitable for execution on a particular ISA. If CPUs have different ISAs then the same program must be compiled once for each ISA on which it will run. The most prevalent ISA family on servers used by major Cloud providers is x86. This is a family of ISAs which maintains backwards compatibility, so that programs compiled for earlier members of the x86 family can run on later ones, whilst still allowing new features or extensions to be added. Indeed, the development of virtualisation extensions allowing for ‘trap and emulate’ hypervisors to be run on (certain) x86 ISAs stimulated the development of Cloud Computing, a topic we discuss further in the next section.
The prevalence of the x86 ISA family no doubt eases use of the Cloud, as users do not have to concern themselves with low-level hardware differences. However, it makes taking advantage of particular extensions, which can improve performance, difficult to do in a predictable manner unless the particular CPU model being optimised for is guaranteed to be obtained. Further, new generations of x86 frequently implement the ISA in a manner that makes them more performant than previous ones, and so performance enhancements may accrue to users as hardware is updated and replaced.
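One common way of coping with this uncertainty is to detect extensions at run time and fall back to a generic code path when they are absent. The sketch below assumes a GCC or Clang toolchain on x86, whose __builtin_cpu_supports builtin reports whether the running CPU implements a named extension such as AVX2.

/* Run-time detection of an x86 ISA extension (GCC/Clang specific). */
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();                       /* initialise CPU feature data */
    if (__builtin_cpu_supports("avx2"))
        printf("AVX2 available: use the vectorised code path\n");
    else
        printf("AVX2 not available: fall back to the generic code path\n");
    return 0;
}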
Notably, the Intel x86 ISAs are CISC, but these instructions are then translated into a sequence of RISC ones before being executed. That is, the programmer-visible architecture is CISC but the underlying implementation is RISC. The reason for the translation is to improve application performance. However, in order to understand this better we must look inside the ‘black-box’.
CPU Internals
Modern CPUs have a multitude of functional units, such as a Floating Point Unit (FPU), which executes floating point operations; an Arithmetic and Logic Unit (ALU), which executes integer and logical operations such as addition and comparison respectively; a Dispatch Unit, which assigns an instruction to one of the execution units once it has been decoded; a Branch Unit, which is responsible for instructions which may alter the instruction flow; and the Control Unit (CU), which is responsible for coordinating overall CPU operation. In addition, the CPU will have a number of high-speed memory locations known as registers, which are used for a variety of purposes, including the staging of instructions and data between RAM and the functional units and the temporary storage of intermediate results. Certain registers have a well-defined purpose for which they are used exclusively; for example, the Program Counter (PC) contains the memory address of the next instruction to be executed.
It is common to describe a CPU in terms of the CU and the datapath, with the latter simply being all other components required to execute an instruction. The CU is responsible for the flow of instructions and data through the datapath. To allow for this flow of data and instructions, functional units and registers are connected by a bus. Register size is typically set to the width of the datapath, and is commonly referred to as the word size of the CPU. The most common word size found on servers used by providers is 64-bit. Word size is often described as the natural ‘unit’ of CPU data and frequently determines other aspects of the hardware. For example, the largest unit of data that can be transferred from RAM to the CPU in one operation is typically the word size, whilst the highest memory address is frequently limited to the highest address that can be expressed in one word.
In order to co-ordinate and synchronise the various units within it, a clock inside the CPU periodically sends a signal, the frequency of which is known as the clock-rate or clock-speed. A clock-cycle is defined as the amount of time between consecutive signals. Early computers were typically implemented as single-cycle machines, meaning that the clock-speed was set so that any instruction cycle would complete within one clock-cycle. Whilst this has the advantage of simplicity in terms of implementation, it does mean that all instructions execute at the rate of the slowest one. For example, instructions which write results back to memory take longer than those that do not, and yet both will take one clock-cycle to complete. With all else being equal, increasing the clock-rate leads to an improvement in performance.
A multi-cycle machine aims to improve performance by splitting the processing of an instruction over multiple stages, with an instruction acted upon by a particular functional unit within each stage. The clock-cycle is set equal to the time taken by the slowest functional unit to complete its stage, allowing an instruction cycle to span one or more clock-cycles. As such, different instructions may take different numbers of clock-cycles to execute, depending upon what the CPU needs to do in order to process them. Again, with all else being equal, increasing the clock-rate leads to an improvement in performance.
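The relationship between clock-rate and performance can be made concrete with the standard relation CPU time = instruction count × average cycles per instruction / clock-rate. The figures below are invented purely for illustration.

/* Worked example: CPU time = instructions x CPI / clock-rate. */
#include <stdio.h>

int main(void) {
    double instructions = 1e9;    /* hypothetical instruction count          */
    double cpi          = 1.5;    /* hypothetical average cycles/instruction */
    double clock_hz     = 3.0e9;  /* 3 GHz clock-rate                        */

    printf("At 3 GHz: %.3f s\n", instructions * cpi / clock_hz);   /* 0.500 s */
    /* All else being equal, a higher clock-rate reduces the time: */
    printf("At 4 GHz: %.3f s\n", instructions * cpi / 4.0e9);      /* 0.375 s */
    return 0;
}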
In a multi-cycle machine the CPU can only process one instruction at a time, and so some functional units are idle whilst others are busy. Ideally a CPU should keep all units busy all the time. CPU pipelines allow for multiple instructions to be processed concurrently as follows: the CPU fetches an instruction from RAM and passes it to the decoding unit. Whilst the original instruction is being decoded a new instruction can be fetched, so two instructions are being processed within the CPU. Once the original instruction has been decoded it can be dispatched to an execution unit, and so the second instruction can progress from fetch to the decode unit, allowing a third instruction to be fetched.
The clock-cycle of a pipeline is set to the longest time taken by a functional unit to complete its stage. In order to keep instructions flowing through the pipeline there should not be a great disparity in the time taken to process different instructions. As a result, the ISA of a pipelined CPU is typically RISC; for example SPARC, PowerPC and MIPS are RISC pipeline designs, whilst, as noted, x86 is CISC but implemented as RISC. Simply put, in a CISC ISA some instructions do too much. The number of stages in a pipeline is referred to as the pipeline depth, with the following 5 stages commonly referred to as the classic/canonical RISC pipeline design (a sketch of the resulting throughput gain follows the list):
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Instruction Execute (EX)
- Memory Access (MEM)
- Write Back (WB)
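The throughput gain of the 5-stage pipeline can be sketched as follows, under the idealised assumption of no stalls: a non-pipelined CPU takes 5 cycles for every instruction, whereas a full pipeline completes one instruction per cycle once it has been filled.

/* Idealised cycle counts with and without a 5-stage pipeline. */
#include <stdio.h>

int main(void) {
    const long stages = 5;          /* IF, ID, EX, MEM, WB            */
    const long n      = 1000000;    /* hypothetical instruction count */

    long non_pipelined = n * stages;        /* each instruction runs alone */
    long pipelined     = stages + (n - 1);  /* fill once, then 1 per cycle */

    printf("non-pipelined: %ld cycles\n", non_pipelined);
    printf("pipelined:     %ld cycles\n", pipelined);
    printf("speed-up:      %.2fx\n", (double)non_pipelined / pipelined);
    return 0;
}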
Pipeline depth can be increased by the use of ever more functional units, each with a specific purpose. There is a corresponding increase in clock-rate, or equivalently, a decrease in the clock-cycle. However, this does not necessarily speed up the processing of an individual instruction, but rather aims to increase instruction throughput: in every clock-cycle an instruction (potentially) completes, whereas in a non-pipelined CPU the instruction currently being processed must complete before processing of the next one can begin.
However, pipelines are not without issue. The clock-cycle in a pipelined CPU is typically set to the time taken by the slowest unit, but this is determined under ideal conditions. In practice a pipeline stall occurs whenever an instruction within a unit takes more than one cycle to progress, and this is often the case when operands are being fetched from memory or results are being written back to memory. Worse still, whenever a new instruction is fetched the CPU must decide what the next instruction to be executed will be. As branching can cause non-linear instruction flow, the CPU must make a prediction as to the next instruction. When a branch mis-prediction occurs the whole pipeline contains instructions that are not required and so must be ‘thrown away’. Depending upon the depth, it will take a number of clock-cycles to fill the pipeline back up.
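The cost of mis-prediction can be observed with a simple experiment: the branch in the loop below is effectively random for unsorted data but becomes highly predictable once the data is sorted, so the same work typically runs faster on sorted input. This is an illustrative sketch; the effect depends on the CPU and the compiler (which may emit branchless code at higher optimisation levels).

/* Branch predictability demonstration: same loop, sorted vs unsorted input. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static long long sum_over_threshold(const int *v, size_t n) {
    long long sum = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] >= 128)            /* hard to predict on random data */
            sum += v[i];
    return sum;
}

int main(void) {
    size_t n = 10000000;
    int *v = malloc(n * sizeof *v);
    if (!v) return 1;
    for (size_t i = 0; i < n; i++) v[i] = rand() % 256;

    clock_t t0 = clock();
    long long s1 = sum_over_threshold(v, n);
    double unsorted_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    qsort(v, n, sizeof *v, cmp);    /* make the branch predictable */

    t0 = clock();
    long long s2 = sum_over_threshold(v, n);
    double sorted_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("unsorted: %.3fs  sorted: %.3fs  (sums %lld, %lld)\n",
           unsorted_s, sorted_s, s1, s2);
    free(v);
    return 0;
}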
Branch mis-prediction is an example of a so-called control hazard, but other hazards exist. For example, contention occurs when different instructions compete for the same resource, whilst a data hazard refers to a later instruction requiring data from an earlier one that is still being processed. Pipelines rely on instruction level parallelism (ILP), that is, the degree to which instructions can be processed concurrently. We discuss parallelism further in the next section.
Improving Performance through Parallelisation
The original Von Neumann machine is a single processor machine, and improving its performance requires either (1) reducing the clock-cycle needed to complete the same amount of work or (2) completing more work in the same clock-cycle. The former approach becomes limited by heat generation, whilst for a given clock-cycle the latter approach is limited by the amount of parallelism within an instruction stream. To overcome these limitations Symmetric Multi-Processor (SMP) machines were developed, which, similar to pipelines, attempt to improve throughput by exploiting parallelism, in this case across multiple instruction streams rather than within a single one.
An SMP machine is an extension of the basic Von Neumann design, and is defined as a tightly-coupled system of multiple homogeneous processors. In particular, in an SMP machine the processors share the same memory space. SMP systems are either multi-socket, multi-core, or indeed frequently both. Multi-socket refers to multiple independent CPUs, each on its own die, whilst multi-core refers to a CPU that has multiple functional units organised into ‘cores’ in such a way that each core is effectively an independent CPU. However, cores on the same CPU do share some components; for example, they share the same interface to the memory bus. Further, they also share caches, which are small, high-speed memory stores on the die. We discuss caches in more detail later.
From the perspective of the Operating System each core is independent and is considered and treated as a processor in its own right. As such, when discussing the compute capacity of a machine it has become common to talk about cores rather than CPUs. SMP machines improve performance by being able to execute multiple independent instruction streams against multiple independent data streams. Note, however, that SMP machines do not necessarily reduce application latency unless the application itself can take advantage of the increased core count.
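A standard way of quantifying this last point (not taken from the text above) is Amdahl’s law: if only a fraction p of a program can make use of additional cores, the overall speed-up on n cores is 1 / ((1 - p) + p/n), which approaches a hard limit of 1/(1 - p) no matter how many cores are added.

/* Amdahl's-law sketch: speed-up for a hypothetical 80% parallel program. */
#include <stdio.h>

int main(void) {
    double p = 0.80;                       /* hypothetical parallel fraction */
    int cores[] = {1, 2, 4, 8, 16, 64};

    for (int i = 0; i < 6; i++) {
        int n = cores[i];
        double speedup = 1.0 / ((1.0 - p) + p / n);
        printf("%2d cores -> %.2fx speed-up\n", n, speedup);
    }
    return 0;                              /* the limit here is 5x (1/0.2)   */
}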
A further level of parallelism may exist within a core. As noted, pipelines are efficient to the degree to which ILP exists within the instruction stream being executed. Simultaneous multi-threading (SMT), referred to as hyper-threading on Intel CPUs, attempts to increase throughput by increasing parallelism amongst the instructions being executed. In particular, by adding additional registers, and some additional functional units, a core can hold the state of two threads of execution at the same time. This allows the core to rapidly switch between the instruction streams of the two threads. However, there is no increase in the number of execution units (ALU or FPU), and so hyper-threading can lead to contention as the execution threads compete for the same resources. Indeed, it is not possible to ‘fairly’ allocate access to execution units amongst competing threads, which can lead to performance issues. Leng (2002) reports that the effect of hyper-threading on performance depends upon the workload, and provides an example of a performance degradation of ~50% with hyper-threading enabled as compared to disabled. Finally, a super-scalar CPU is one where a core contains multiple execution units, allowing multiple instructions to be issued and executed in the same clock-cycle.
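Note that the OS sees each hardware thread as a separate logical processor. The following sketch, assuming a POSIX system, simply reports that count; with 2-way SMT enabled it will typically be twice the number of physical cores, even though execution units are shared between the two hardware threads on each core.

/* Report logical processors visible to the OS (POSIX). */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);   /* online logical CPUs */
    printf("Logical processors visible to the OS: %ld\n", logical);
    return 0;
}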
Using parallel systems to improve the time taken to solve problems is not a new idea, and indeed Flynn (Duncan, 1990) provides the following taxonomy of computer systems in terms of the parallelism found in their instruction and data streams:
- Single Instruction Stream, Single Data Stream (SISD): These systems sequentially execute one stream of instructions which operates on one stream of data.
- Single Instruction Stream, Multiple Data Stream (SIMD): A single instruction stream is executed against multiple data streams. In practice, this means there are multiple processors, each with its own data stream; however, the Control Unit dispatches the same instruction from the single stream to all processors. An example of a SIMD system is a GPU.
- Multiple Instruction Stream, Single Data Stream (MISD): Multiple instruction streams are executed against a single data stream. Hennessy and Patterson (2012) note that no commercial example of a MISD computer system has been built.
- Multiple Instruction Stream, Multiple Data Stream (MIMD): These systems contain multiple processors, each of which fetches its own instruction stream for execution against its own data stream.
Physical servers, as used by Cloud providers, are classified as MIMD systems, as they typically have multiple processors, each of which typically has multiple cores. GPUs, by contrast, are SIMD systems, as each core executes the same instruction but against an independent stream of data. This type of system is not suitable for all problems: in particular, branching of code on one processor (not matched by the others) results in the other processors waiting for instructions whilst the control unit fetches and dispatches the instruction required by the branch.
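SIMD execution is also available within a single x86 core via vector extensions. The sketch below assumes a compiler providing the SSE intrinsics in <immintrin.h>: one ADDPS instruction performs four floating-point additions at once, i.e. a single instruction applied to multiple data elements.

/* SIMD on x86 via SSE intrinsics: one instruction, four additions. */
#include <stdio.h>
#include <immintrin.h>

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);       /* load 4 floats                */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 additions */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);         /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}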
Irrespective of architecture, a performance limitation common to all Von Neumann machines is the speed with which data and instructions can be transferred from RAM to the CPU. This is commonly referred to as the Von Neumann Bottleneck, and we discuss caches next, which are used to alleviate this problem.
Improving Performance through Caches
For modern CPUs fetching instructions and data from RAM is slow and can lead to stalling in pipelines. To alleviate problems due to stalling each core will typically have two level one (L1) caches, one for data and one for instructions. Whilst slower to access than registers, the L1 caches are still significantly faster than RAM. For most workloads the presence of the L1 caches should improve performance, and usually more cache is better. However, the degree of speed-up varies by workload, as different workloads have different memory access patterns, and those having sequential access patterns particularly benefit from caches.
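The effect of access patterns can be sketched as follows: summing a large matrix row by row touches consecutive addresses and so makes good use of the caches, whereas summing it column by column strides across memory and misses far more often. Timings are illustrative and will vary by machine and compiler.

/* Cache-friendly (row-major) vs cache-unfriendly (column-major) traversal. */
#include <stdio.h>
#include <time.h>

#define N 4096

static int m[N][N];                    /* ~64 MiB, far larger than the caches */

int main(void) {
    long long sum = 0;
    clock_t t0;

    t0 = clock();
    for (int i = 0; i < N; i++)        /* row-major: sequential addresses */
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    double row_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < N; j++)        /* column-major: strided addresses */
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    double col_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("row-major: %.3fs  column-major: %.3fs  (sum %lld)\n",
           row_s, col_s, sum);
    return 0;
}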
Whilst each core has its own L1 caches, frequently CPUs will have a level 2 (L2) cache which is shared amongst the cores. The L2 cache is typically larger than a core’s L1 cache; however, as it is further away there is an increased latency in fetching instructions and data from it. Typically, an L3 cache is simply a larger but slower version of an L2 cache. When fetching an instruction or data item, L1 is always checked first; if the item is not present then a cache miss is said to have occurred, in which case L2 is checked next, then L3, and finally, if still not found, the item is fetched from RAM. The various levels of memory, from registers to caches to RAM, are collectively known as the memory hierarchy, and at each stage we have a larger store but one that is slower to access. Whilst cache latencies will vary, Gregg (2012) reports that typically they are of the following order:
Table: Memory hierarchy latency.
Memory | Latency in CPU cycles
L1     | 4
L2     | 12
L3     | 36
RAM    | 100
However, unlike CPU time, which the OS can allocate to specific processes, the OS cannot reserve some portion of a cache for use by a specific process (running application); for this reason cache is referred to as an invisible resource. As such, cache usage is subject to resource contention, and it becomes possible for one process to obtain a disproportionate amount of the resource at the expense of others. The effect of this will differ by workload, depending upon how sensitive to cache usage it is.
When hyper-threading is enabled, as it commonly is amongst Cloud providers, the L1 cache will be shared amongst the competing threads, and so in this case we have two invisible resources: execution units and the L1 cache. Contention can lead to performance degradation as a process cannot access resources as required. Notably, further shared resources with potential for contention exist, as we discuss now.
Additional Shared Components
As Clouds are (in the main) a multi-tenant environment, instances owned by one tenant are likely to be sharing a variety of components with instances owned by other tenants, including: caches, memory bandwidth, execution units when hyper-threading is enabled, the network card and local storage. Resource contention can cause some processes to obtain a disproportionate amount of resource at the expense of others. From the perspective of users of Cloud systems, even when hardware is consistent there is clear potential for performance variation across supposedly identical instances due to resource contention, although we note that recent hardware developments, including cache allocation technologies and SR-IOV, are aimed at addressing these issues. Further, servers within an AZ’s data centre share a variety of components, in particular networking switches and routers, as well as network storage systems. This again leads to potential for performance issues due to resource contention.