Phase Aware Performance Modeling for Cloud Workloads

Arnamoy Bhattacharyya

PhD Thesis, University of Toronto, Toronto, April 2020

Abstract

Cloud computing is gaining enormous popularity every day. But with the growing demand for cloud computing systems comes the challenge of efficiently managing the vast number of workloads that run on cloud infrastructure. Quick detection of failures in any component of this complex system is essential for ensuring that users of cloud services get the best possible performance for their workloads. Achieving this goal requires an accurate method that can learn the performance of workloads and predict their future performance. Traditional approaches for characterizing workloads and learning their performance depend on the vast amount of monitored resource profiles, on application log data collected at regular intervals from the cloud system, or on a combination of both. But these approaches suffer from (1) faulty predictions, (2) inaccurate models, (3) storage overhead, and (4) throughput degradation due to the amount and complexity of generated data. In this thesis, we provide methodologies for accurately learning workload performance on-the-fly. Our method is fully automatic and lightweight, does not require access to application source code, and can learn workload performance starting from zero knowledge, becoming more sophisticated as more data are collected for the system. This eliminates the storage cost and the analysis complexity of huge amounts of offline profile data. Instead of modeling in a human-agnostic manner, we assign the heavy and complex work to the machines, and the human is brought into the loop to provide minimal feedback so that the machine can enhance its learning over time, in a lifelong manner. We provide a wide range of solutions depending on how much information is supplied by the user of the workload as well as the workload characteristics, and we compare and contrast each method.
Our methods rely on integrating resource usage profiles with periodically collected thread dumps. We leverage several popular online machine learning methods to build and update our models. We show use cases of our generated workload performance models in anomaly detection, for example to detect real-world hardware faults and software bugs.

Manuscript

Bibtex