4.1 Introduction
4.2 High-Level Architecture of the Infrastructure Optimiser
Step No. | Step title | Description |
---|---|---|
1 | Input collection | Manages the data ingress of inputs and is essentially the same for offline modelling and the online implementation; what changes is the context. Inputs include: • Composition and structure of infrastructure available, i.e. landscape • Composition and structure of the application (e.g. application model and load distributions) • Associated telemetry data, e.g. from testbed or system of interest • Infrastructure provider and service/application provider KPIs |
2 | Modelling and/or selection | Creates/utilises models primarily in the offline mode to produce outputs that are subsequently codified within the online optimiser. (2a) Combines telemetry with an infrastructure landscape and filters as appropriate based on relevant KPIs. The process is the same for offline and online, only the context changes. Offline (2b) creates application load translation models, which map how application load correlates to resources, associated telemetry and/or KPIs. Online this step more simply involves appropriate selection of a subset of “application load to physical capacity mappings” from those modelled offline. Offline (2c) utilises KPIs and Multi-Attribute Utility Theory (MAUT) to formulate Utility Functions for application and infrastructure providers. Online this step selects a subset of utility functions from those previously formulated offline. |
3 | Modelling/selection output | Offline (3a) represents a consolidated infrastructure model, i.e. a testbed specific landscape that feeds the “load translation modelling” process. Online this is the use-case-specific landscape and is fed directly to the algorithm module step 4. Offline (3b) encompasses the complete set of possible “application load to physical capacity mappings” based on the testbed inputs. Online it is an appropriate subset of those modelled offline based on the use case inputs. Offline (3c) is the complete set of possible “Utility Functions”. Online it is an appropriate subset based on the given use case. |
4 | Algorithmic optimisation | This step is illustrated in Fig. 4.2 and takes (3a) and (3b) as inputs. The algorithm subsequently provides several valid solutions, over which the utility functions selected in step (3c) are applied to select a near-optimal option. The output of step 4 is a real-time application placement or a future infrastructure optimisation. |
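The four steps in the table can be sketched end to end as a small online pipeline. This is purely an illustrative toy: every function name, data structure, and value below is hypothetical and not part of RECAP; the real optimiser works on far richer landscape, telemetry, and KPI inputs.

```python
# Hypothetical sketch of the four-step optimiser pipeline (Table above).
# None of these names or structures are RECAP APIs.

def collect_inputs():
    # Step 1: landscape, application model, telemetry, and provider KPIs.
    return {
        "landscape": {"sites": ["MSAN-1", "Metro-1"]},
        "app_model": {"components": ["vm-a", "vm-b"]},
        "telemetry": {"cpu": [0.4, 0.6]},
        "kpis": {"latency_ms": 20},
    }

def select_models(inputs):
    # Steps 2/3 (online): select a use-case-specific subset of the
    # offline-built load translation models and utility functions.
    return {
        "landscape": inputs["landscape"],              # (3a)
        "load_mappings": {"vm-a": 2, "vm-b": 1},       # (3b) load -> capacity units
        "utility": lambda placement: -len(placement),  # (3c) toy utility function
    }

def optimise(models):
    # Step 4: enumerate candidate placements, then rank them by utility.
    sites = models["landscape"]["sites"]
    candidates = [
        {comp: site for comp in models["load_mappings"]} for site in sites
    ]
    return max(candidates, key=models["utility"])

placement = optimise(select_models(collect_inputs()))
print(placement)
```

In this toy, step 4 simply enumerates "all components on one site" candidates and picks the highest-utility one; the real algorithm module explores a far larger solution space, as discussed in Sect. 4.5.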
4.3 Problem Formulation
- Optimally match the requirements of components, e.g. VMs, to physical resources
- Adhere to SLAs negotiated between the infrastructure and application providers
- Determine when to instantiate the virtual components relative to the required capacity
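The first two objectives amount to a feasibility test over candidate placements. The sketch below checks a placement against host capacities and a simple latency SLA; all hosts, VM requirements, and numeric values are invented for illustration and are not drawn from the formal model.

```python
# Illustrative feasibility check for the matching problem: capacities and
# SLA values are invented examples, not part of the RECAP formulation.

hosts = {"h1": {"cpu": 8, "mem_gb": 32}, "h2": {"cpu": 4, "mem_gb": 16}}
vms = {"vm-a": {"cpu": 4, "mem_gb": 16}, "vm-b": {"cpu": 4, "mem_gb": 8}}
sla_max_latency_ms = 20
link_latency_ms = {("h1", "h2"): 5}

def feasible(placement):
    # Capacity: aggregate demand per host must not exceed host capacity.
    used = {h: {"cpu": 0, "mem_gb": 0} for h in hosts}
    for vm, host in placement.items():
        for r in ("cpu", "mem_gb"):
            used[host][r] += vms[vm][r]
    if any(used[h][r] > hosts[h][r] for h in hosts for r in ("cpu", "mem_gb")):
        return False
    # SLA: inter-host latency between the two components must stay in bounds.
    ha, hb = placement["vm-a"], placement["vm-b"]
    lat = 0 if ha == hb else link_latency_ms[tuple(sorted((ha, hb)))]
    return lat <= sla_max_latency_ms

ok = feasible({"vm-a": "h1", "vm-b": "h2"})      # within capacity and SLA
bad = feasible({"vm-a": "h2", "vm-b": "h2"})     # h2 CPU over-committed
print(ok, bad)
```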
4.3.1 Infrastructure
4.3.2 Application
4.3.3 Mapping Applications to Infrastructure
4.3.4 Mapping Constraints
4.3.4.1 Capacity Requirement Constraint
- The aggregated bandwidth of the physical path is the minimum observed throughput of all edges in the physical path. Mathematically, \( b_g = \min_{e \in g} \tau_e \)
- The aggregated latency of the physical path is the summation of the propagation latencies of all edges in the physical path. Mathematically, \( l_g = \sum_{e \in g} l_e \)
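The two aggregation rules translate directly into code: bandwidth is the bottleneck (minimum) over the edges of path \(g\), and latency is additive. The path and the per-edge values \(\tau_e\) and \(l_e\) below are example data.

```python
# Direct translation of the two path-aggregation rules above.
# Edge throughputs tau_e and latencies l_e are example values.

path = [("a", "b"), ("b", "c"), ("c", "d")]                  # edges e in g
tau = {("a", "b"): 10.0, ("b", "c"): 4.0, ("c", "d"): 8.0}   # throughput, Gbit/s
lat = {("a", "b"): 1.5, ("b", "c"): 0.5, ("c", "d"): 2.0}    # latency, ms

b_g = min(tau[e] for e in path)   # b_g = min_{e in g} tau_e
l_g = sum(lat[e] for e in path)   # l_g = sum_{e in g} l_e
print(b_g, l_g)  # 4.0 4.0
```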
4.3.4.2 Compositional Constraint
4.3.4.3 Service Level Agreement Constraint
4.3.4.4 Infrastructure-Policies Constraint
4.4 Models that Inform Infrastructure Optimisation Decisions
4.4.1 Infrastructure Contextualisation Models
- The definition of resource sites (e.g. MSAN and Metro), i.e. information pertaining to individual physical infrastructure elements, their physical attributes, and configurations, the communication links between them, and the properties inherent to these links.
- The definition of inter-site network bandwidth and latency, i.e. the available network bandwidth and latency values of the physical communication links between resource sites.
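These two definitions suggest a simple sites-plus-links data model. The sketch below is one possible encoding under assumed field names (`tier`, `cpu_cores`, `bandwidth_gbps`, and so on); it is not a RECAP schema, and the attribute values are invented.

```python
# A minimal, hypothetical landscape model: resource sites with physical
# attributes, and inter-site links with bandwidth/latency properties.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    tier: str          # e.g. "MSAN" or "Metro"
    cpu_cores: int
    mem_gb: int

@dataclass
class Link:
    a: str             # endpoint site names
    b: str
    bandwidth_gbps: float
    latency_ms: float

landscape = {
    "sites": [Site("msan-1", "MSAN", 16, 64),
              Site("metro-1", "Metro", 256, 1024)],
    "links": [Link("msan-1", "metro-1", 10.0, 2.5)],
}

# Example query over the model: total compute capacity per tier.
by_tier = {}
for s in landscape["sites"]:
    by_tier[s.tier] = by_tier.get(s.tier, 0) + s.cpu_cores
print(by_tier)  # {'MSAN': 16, 'Metro': 256}
```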
4.4.2 Load Translation Models
- Infrastructure Information: It is important to understand the testbed/experimentation set-up and its limitations, especially those relating to the heterogeneity of available server configurations. This helps define the information on the infrastructure, e.g. physical machine, virtual machine, and container configurations, the total number of machines, and their connections.
- Associated Telemetry Data: A telemetry agent is initialised to collect data over multiple domains, e.g. compute, network, storage, and memory. All incoming data from these metrics are aggregated and normalised based on the domain context and studied as time-series data.
- Virtual Machine Configurations: The application to be placed on these machines needs to be understood: how many instances and types of instances can be run, whether they are deployed as service chains or disjoint VMs/containers, and the various configurations of these VMs/containers. This information is typically provided via the application optimiser.
- End-User Metadata: The end-user behaviour to be emulated is determined from the use case definitions and validation scenarios; this specifies how users will access the applications, how many users will be initialised, how the number of users will increase/decrease, and how different user behaviours will be simulated in the testbed.
- Data Wrangling: The collected data is isolated and labelled appropriately according to experimentally relevant timestamps.
- Data Filtering: Files are filtered and integrated as appropriate for comparison.
- Data Visualisation and Analysis: Visualisation and analysis are completed for each defined experiment. The appropriate metrics are defined and calculated per machine/VM/container, as well as per device name, e.g. for network interfaces and storage devices attached to the physical machine. The results are then summarised and visualised as a comparison of the average usage for compute, memory, network, and storage resources.
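The wrangling, filtering, and analysis steps above can be illustrated with a toy telemetry pipeline: label raw samples by an experiment window, then summarise average usage per resource domain. All timestamps, domain names, and values below are invented for illustration.

```python
# Toy version of the wrangling/filtering/analysis steps: invented samples,
# not real RECAP telemetry.

samples = [
    {"t": 0,  "domain": "cpu", "value": 0.25},
    {"t": 5,  "domain": "cpu", "value": 0.75},
    {"t": 5,  "domain": "net", "value": 0.20},
    {"t": 12, "domain": "cpu", "value": 0.90},  # outside the experiment window
]
experiment_window = (0, 10)  # experimentally relevant timestamps

# Wrangling: isolate and label only the samples inside the window.
labelled = [dict(s, experiment="exp-1") for s in samples
            if experiment_window[0] <= s["t"] <= experiment_window[1]]

# Analysis: average usage per domain, for cross-experiment comparison.
def averages(rows):
    out = {}
    for r in rows:
        out.setdefault(r["domain"], []).append(r["value"])
    return {d: sum(v) / len(v) for d, v in out.items()}

summary = averages(labelled)
print(summary)  # {'cpu': 0.5, 'net': 0.2}
```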
4.5 Algorithmic Approach to Optimal Selection
4.5.1 Utility Functions
- Gross Profit: This objective includes calculated revenue and expenditure costs for resource capacity allocated, cost of associated SLAs and SLA violations, and other costs of concern to the infrastructure provider.
- Service Distribution: This objective includes the analysis of available capacity of resources after the application deployment.
- Throughput: This objective includes the quantification of observed throughput over all physical links in the physical network, along with the analysis of dropped packets.
- Latency: This objective includes the quantification of observed latency over all edges in the physical network, along with the analysis of packet delays.
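A common MAUT formulation combines such objectives additively: each objective is normalised to a per-attribute utility in [0, 1] and weighted by provider preference. The sketch below shows that additive form; the specific weights and scores are invented and do not come from the RECAP use cases.

```python
# Additive MAUT sketch: U = sum_i w_i * u_i with weights summing to 1.
# Scores and weights are illustrative, not RECAP values.

def maut_utility(scores, weights):
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(weights[k] * scores[k] for k in weights)

# Normalised per-objective scores for one candidate placement (1.0 = best).
scores = {"gross_profit": 0.8, "service_distribution": 0.6,
          "throughput": 0.9, "latency": 0.7}
# Infrastructure-provider preference weights over the four objectives.
weights = {"gross_profit": 0.4, "service_distribution": 0.1,
           "throughput": 0.25, "latency": 0.25}

utility = maut_utility(scores, weights)
print(utility)  # ≈ 0.78
```

With different weight vectors, the same candidate set can yield different "near-optimal" selections, which is how the infrastructure and application providers' distinct KPIs are balanced in step (3c).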
4.5.2 Algorithms for Infrastructure Optimisation and Application Placement
- Mutation is used to introduce new candidates in the population by spontaneously changing one gene. In RECAP, this means that multiple mappings within a possible placement solution are swapped to create a new possible placement solution.
- Cross-over creates a mix of the available genes in the population by merging the genes from two candidates. In RECAP, this means that two mappings of different service components are taken from two different possible placement solutions and combined to create a new possible placement.
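The two operators can be sketched for placements represented as component-to-site mappings. This is an illustrative toy, not the RECAP implementation: the site names are invented, and the random seed is fixed only for reproducibility.

```python
# Toy mutation and cross-over over placement solutions, where a solution is
# a dict mapping service components to sites. Illustrative only.

import random

rng = random.Random(42)

def mutate(placement):
    # Mutation: swap the sites of two randomly chosen component mappings,
    # producing a new candidate placement.
    child = dict(placement)
    a, b = rng.sample(list(child), 2)
    child[a], child[b] = child[b], child[a]
    return child

def crossover(p1, p2):
    # Cross-over: take each component's mapping from one of the two parents.
    return {c: (p1[c] if rng.random() < 0.5 else p2[c]) for c in p1}

p1 = {"vm-a": "msan-1", "vm-b": "metro-1", "vm-c": "core-1"}
p2 = {"vm-a": "core-1", "vm-b": "msan-1", "vm-c": "metro-1"}

child = crossover(p1, p2)   # inherits each mapping from p1 or p2
mutant = mutate(p1)         # p1 with two mappings swapped
print(child, mutant)
```

Note that a raw offspring may violate the constraints of Sect. 4.3.4; in practice each new candidate must be checked (or repaired) for feasibility before it enters the population.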