
Abstract

I/O efficiency is crucial to productivity in scientific computing, but the growing complexity of HPC systems and applications complicates efforts to understand and optimize I/O behavior at scale. Data-driven machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed.

We analyze four years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.

Index Terms—High performance computing, I/O, storage, machine learning

I. INTRODUCTION

As scientific applications push to leverage ever more capable computational platforms, there is a critical need to identify and address bottlenecks of all types. For many applications, the I/O subsystem is often a major source of performance bottlenecks, and it is common for applications to attain only a small fraction of the peak I/O rates [1]. These performance problems can severely limit the scalability of applications and are difficult to detect, diagnose, and fix. Data-driven machine learning-based models of I/O throughput can help practitioners understand application bottlenecks (e.g., [2]–[7]), and have the potential to automate I/O tuning and other tasks. However, current machine learning-based I/O models are not robust enough for production use [6]. A thorough investigation of why these models underperform when deployed on high performance computing (HPC) systems will provide key insights and guidance on how to address their shortcomings. The goal of our study is to help machine learning (ML)-driven I/O modeling techniques make the transition from theory to practice.

There are several reasons why machine learning-based I/O models underperform when deployed: poor modeling choices [2], [7], concept drift in the data [5], and weak generalization [6], among others. I/O models are often opaque, and there is no established methodology for diagnosing the root cause of model errors. In this work, we present a taxonomy of ML-based I/O modeling errors, as shown in Figure 1. Through this taxonomy, we show that I/O throughput prediction errors can be separated and quantified into five error classes: inadequate (1) application and (2) system models, (3) novel application or system behaviors, (4) I/O contention and (5) inherent noise. For each class, we present data-driven litmus tests that estimate the portion of modeling error caused by that class. The taxonomy enables independent study of each source of error and prescribes appropriate ML techniques to tackle the underlying sources of error.

Our contributions in this work are as follows:

  1. We introduce a taxonomy of ML-based I/O throughput modeling errors which consists of five classes of errors.

  2. We show that the choice of ML model algorithm, scaling the model size, and tuning hyperparameters cannot reduce all potential errors. We present two litmus tests that quantify error due to poor application and system modeling.

  3. We present a litmus test that estimates what portion of error is caused by rare jobs with previously unseen behavior, and apply uncertainty quantification methods to classify those jobs as out-of-distribution jobs.

  4. We present a method for quantifying the impact of I/O contention and noise on I/O throughput, which (1) defines a fundamental limit in how accurate ML models can become, and (2) gives HPC system users and administrators a practical estimate of the I/O throughput variance they should expect. We show that underlying system noise is the dominant source of errors, and not poor modeling or lack of application or system data.

  5. We present a framework for how the proposed taxonomy is practically applied to new systems and evaluate it on two leadership-class supercomputers: Argonne Leadership Computing Facility (ALCF) Theta and National Energy Research Scientific Computing Center (NERSC) Cori.

II. RELATED WORK

In recent years, automating HPC I/O system analysis through ML has received significant attention, with two prominent directions: (1) workload clustering to better understand groups of HPC jobs and automate handling of whole groups, and (2) I/O subsystem modeling to predict HPC job I/O time, I/O throughput, optimal scheduling, etc.

Clustering HPC job logs has been explored in [2], [8], [9] with the goal of better understanding workload distribution, scaling I/O expert effort more efficiently, and revealing hidden trends and I/O workload patterns. ML-based modeling has been used for predicting I/O time [4], I/O throughput [2], [7], and optimal filesystem configuration [10], [11], as well as for building black boxes of I/O subsystems in order to apply ML model interpretation techniques [2]. While there have been some attempts at creating analytical models of I/O subsystems [12], most attempts are data-driven and rely on HPC system logs to create models of I/O [1], [2], [4], [7], [13]. Although the challenges of developing accurate machine learning models are well known, the nature of the domain requires special consideration: I/O subsystems have to service multiple competing jobs, their configuration evolves over time, they have periods of increased variability, they experience occasional hardware faults, etc. [14]–[16]. Diagnosing this I/O variability, where the performance of a job depends on factors external to the job itself, has been extensively studied [3], [4], [14], [16], [17]. Finally, the deployment of I/O models has been shown to require special consideration as these models often significantly underperform on new applications [5], [6].

While different sources of model error have been studied individually, no prior work characterizes the relative impact of different sources of error on model accuracy.

III. MODELING HPC APPLICATIONS AND SYSTEMS

The behavior of an HPC system is governed by both complex rules and inherent noise. By formalizing the system as a mathematical function (or, more generally, a stochastic process) with its inputs and outputs, the process may be decomposed into smaller components more amenable to analysis.

The I/O throughput of a system running specific sets of applications may be treated as a data-generating process from which I/O throughput measurements are drawn. While building a perfect model of an HPC system may not be possible, it is useful to understand the inputs to the ‘true’ process and the process’s functional properties. The theoretical model of the process must include all causes that might affect a real HPC system, such as: how well a job uses the system, hardware and software configurations over the life of the system, resource contention between concurrent jobs, inherent application-specific and system noise, application-specific noise sensitivity, etc. Although many of these causes are not directly observable, since they work at short time scales or below the instruction set architecture where gaining low-level insight is not possible, their effects are cumulative and the system is affected by them. ML models of the system must take these causes into account or suffer modeling errors.

To model system behavior, we adopt the system modeling formulation from [4], expressing the relationship between an HPC job j and its I/O throughput on the system ϕ(j) as:

ϕ(j) = f(j, ζ, ω)    (1)

Here, j represents HPC job behavior (e.g., I/O volume and access patterns, distribution of POSIX operations, etc.), ζ represents system state (e.g., file system health, system configuration, node availability, etc.) and system behavior (e.g., the behavior of other applications co-located with the modeled application during its run, contention from resource sharing, etc.) at a given time. ω represents the randomness acting on the system. The system ζ can be further decomposed as:

ζ = ζg(t) + ζl(t, j)    (2)

The component ζg(t) represents the global system impact on all jobs running on the system (e.g., a service degradation that equally impacts all jobs) and is only a function of time t. The component ζl(t, j) represents the local system impact on the I/O throughput of job j caused by resource contention and interactions with other jobs running on the system. Contrary to the ζg(t) component, ζl(t, j) is job-specific and depends on the behavior of the current set of applications running on the system and their location relative to j, the sensitivity of j to resource contention and noise, etc. Without loss of generality, the I/O throughput from Equation 1 can be expressed as:

ϕ(j) = fa(j) + fg(j, ζg(t)) + fl(j, ζl(t, j)) + fn(j, ζ, ω)    (3)


Here, fa(j) represents the I/O throughput of a job j on an idealized system where j is alone on the system, the system does not change over time, and there is no resource contention. fg(j, ζg(t)) represents how the evolving configuration of the system (hardware provisioning, software updates, etc.) affects a job’s I/O throughput. The fl(j, ζl(t, j)) component represents the per-job impact of resource contention and j’s I/O noise sensitivity. Finally, fn(j, ζ, ω) represents the impact of inherent system noise (e.g., dropped packets) on the job.

A. Modeling assumptions

The task of modeling a system’s I/O throughput involves predicting the behavior of the system when tasked with executing a job from some application on some data. Modeling I/O throughput requires modeling both the HPC system and the jobs running on it. Machine learning models used in this work attempt to learn the true function ϕ by mapping observable features of the job j and the system ζ to measured I/O throughputs ϕ(j). A model m(jo, ζo) is tasked with predicting throughput ϕ(j), where jo ⊆ j and ζo ⊆ ζ are the observable job and system features.

When designing ML models, the choice of model architecture and model inputs is based on implicit assumptions about the process that generates the data. When incorrect assumptions are made about the domain, the model will suffer from errors that cannot be fixed within that modeling framework, e.g., through hyperparameter tuning or further data collection. We investigate four common assumptions about the HPC domain, shown by the branches in Figure 1.

All data is in-distribution: a common assumption that ML practitioners make is that all model errors are the product of insufficiently trained models, inadequate model architectures, or missing discriminative features. However, some jobs in the dataset may be Out of Distribution (OoD), that is, they may be collected at a different time or environment, or through a different process. The model may underperform on OoD jobs due to the lack of similar jobs in the training set and not due to lack of insight (features) into the job. The cause of the problem is epistemic uncertainty (EU): the model suffers from reducible uncertainty, i.e., lack of knowledge, since a broader training set would make the OoD jobs in-distribution (ID). In the HPC domain, epistemic uncertainty is present in cases of rarely run or novel jobs or uncommon system states. Without considering the possibility that a portion of the error is a product of epistemic uncertainty, practitioners may put effort into tuning models instead of collecting more underrepresented jobs. Referring to Equation 1, this assumption fails when the deployment-time jd and ζd are drawn from a different distribution than the training-time jt and ζt.

Noise is absent: all systems have some inherent noise that cannot be modeled and will impact predictions. Aleatory uncertainty (AU) refers to irreducible uncertainty which stems from inherent noise or lack of insight into jobs on the system. Modeling errors due to aleatory uncertainty are different from epistemic uncertainty because collecting more jobs may not reduce AU, and these errors may be fundamentally unfixable. Understanding and characterizing a system’s inherent I/O noise is necessary both to quantify ML model uncertainty and because the amount of noise in the data has a strong effect on the optimal choice of ML model. HPC I/O domain experts note that certain systems do have significantly higher or lower I/O noise [18], [19], but I/O modeling works rarely attempt to quantify ML model uncertainty [20]. The assumption that noise is not present in the dataset can be expressed as follows: the practitioner assumes that the data-generating process ϕ has the form ϕ(j) = f(j, ζ) instead of ϕ(j) = f(j, ζ, ω), i.e., that the inherent noise impact is zero: fn(j, ζ, ω) = 0.

Sampling is independent: running a job on a system can be viewed as sampling the combination of application behavior and system state and measuring I/O throughput. Most I/O modeling works implicitly assume that multiple samples taken at the same time are independent of each other. The system is modeled as equally affecting all jobs running on it, that is, the placement of different jobs on nodes, the interactions between neighboring jobs, network contention, etc. do not affect the job. This assumption can then be expressed as: the process has the form of ϕ(j) = f(j, ζg(t)), not ϕ(j) = f(j, ζ, ω), i.e., that the resource contention impact is zero: fl(j, ζl(t, j)) = 0.

Process is stationary: a common assumption ML practitioners make is that the data-generating process is stationary, and that the same job run at different times achieves the same I/O throughput. As hardware fails, as new nodes are provisioned, and shared libraries get updates, the system evolves over time. The stationarity assumption is therefore incorrect, and ignoring it, e.g., by not exposing the ML model to when a job is run, may cause hard-to-diagnose errors. This assumption implies sampling independence and absence of noise, and can be expressed using the system modeling formulation as: ϕ(j) = f(j) and fg(j, ζg(t)) = 0.

IV. CLASSIFYING I/O THROUGHPUT PREDICTION ERRORS

No matter the problem to which machine learning is applied, a systematic characterization of the sources of errors is crucial to improve model accuracy. While there is no substitute for ‘looking at the data’ to understand the root cause of the problem, this approach does not scale for large datasets. We seek a systematic way to understand the barriers to greater accuracy and improve ML models applied to systems data.

While the work presented here can be generalized beyond I/O to, e.g., compute or network modeling, we study I/O because I/O bottlenecks are more difficult to diagnose than compute bottlenecks, and because I/O has a coarser temporal granularity allowing software to observe I/O subsystems without the need for, e.g., hardware performance counters or binary instrumentation. The key questions we ask in this work are: What are the impediments to the successful application of learning algorithms in understanding I/O? Should ML practitioners focus on acquiring more data on HPC applications or the HPC system? How much of the error stems from poor ML model architectures? How much of the error can be attributed to the dynamic nature of the system and the interactions between concurrent jobs? How much of the performance variation is caused by the system? What fraction of jobs exhibit truly novel I/O behavior compared to jobs observed thus far? At what point are the applications too novel, so much so that users should no longer trust the predictions of the I/O model? We now describe five error classes and dive deeper into error attribution in Sections VI, VII, IX and VIII.

The lack of application and system observability, the interaction between running jobs, the inherent system noise, and the novel or rare applications prevent ML models from fully capturing system behavior, causing errors. We define the I/O throughput prediction error of a model m in a job j as:

e(m, j) = m(jo, ζo) − ϕ(j)    (4)


Following the ϕ(j) terms from Eq. 3 and including the out-of-distribution error, the error can be broken down as follows:

e(m, j) = eapp + esystem + eood + econtention + enoise    (5)
Here, the application modeling error eapp is caused by a poor model fit of application behavior (the fa(j) component), the global system error esystem is caused by poor predictions of global system impact (the fg(j, ζg(t)) component), the out-of-distribution error eood is caused by weak model generalization on novel applications or system states, the contention error econtention is caused by poor predictions of job interactions (the fl(j, ζl(t, j)) component), and the noise error enoise is caused by the inability of any model to predict inherent noise (the fn(j, ζ, ω) component). These five classes of errors are shown as leaf nodes at the bottom of Figure 1. While attributing cumulative job error to each class may be difficult on a per-job basis, we will show that estimating each component across a whole dataset is possible.

A. I/O Model Error Taxonomy and Litmus Tests

We adopt the term litmus test to mean a test that evaluates the presence, amount, or ratio of a certain quantity. In the following sections, we introduce four litmus tests that split the error from Equation 5 into five separate classes. The error classes in Equation 5 must be estimated in the order shown in the bottom row of Figure 1 due to the specifics of individual litmus tests. For example, before the effect of aleatory and epistemic uncertainty can be separated, a good model must be found [21]. Similarly, before global and local system modeling errors can be separated, OoD jobs must be identified.

Application modeling errors: ML models can have varying expressivity and may not always have the correct structure or enough parameters to fit the available data. Models whose structure or training prevents them from learning the shape of the data-generating process are said to suffer from approximation errors. Approximation errors cannot be classified as epistemic or aleatory in nature because no new features or jobs are necessary to remove this error. To estimate AU and EU in the dataset, methods such as AutoDEUQ [21] first require that an appropriate model architecture is found and trained, placing approximation errors as the first branch of the taxonomy.

Approximation errors are further divided into application and system modeling errors. Application modeling errors are caused by poor predictions of application behavior which can be fixed through hyperparameter searches or better model architectures. The first column of Figure 1 illustrates the impact of application modeling errors with an example hyperparameter search over two XGBoost parameters on the Theta dataset (introduced in the next section). The best configuration found by the grid search has 32 trees with a depth of 21, while the default XGBoost configuration uses 100 trees of depth 6.

System modeling errors: system behavior changes over time due to transient or long-term changes such as file system metadata issues, failing components, new provisions, etc. [22]. A model that is only aware of application behavior, but not of system state, implicitly assumes that the process is stationary. It will be forced to learn the average system response to I/O patterns, and will suffer greater prediction errors during periods when system behavior is perturbed. System modeling errors occur due to poor (or complete lack of) modeling of the global system component ζg(t). To illustrate this class of errors, the second experiment in Figure 1 shows the per-week average error of two models trained to predict job I/O throughput. The blue model can be written as m(jo), i.e., it is only exposed to observable application behavior jo. The orange model can be written as m(jo, t), i.e., it also knows the job start time t. During service degradations, the blue model has long periods of biased errors while the orange model does not, since it knows when the degradations happen.
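As a small sketch of how such a comparison can be computed, assuming per-job signed log errors and job start timestamps are available in a pandas DataFrame (the column names job_start_time and signed_log_error are hypothetical), the per-week average error of a model can be obtained as below; a sustained non-zero weekly mean indicates unmodeled global system state such as a service degradation.

import pandas as pd

def weekly_mean_error(df: pd.DataFrame) -> pd.Series:
    # Average signed prediction error per calendar week.
    week = pd.to_datetime(df["job_start_time"], unit="s").dt.to_period("W")
    return df.groupby(week)["signed_log_error"].mean()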

Generalization errors: ML models should perform well on data drawn from the same distribution from which their training set was collected. When exposed to samples highly dissimilar from their training set, the same models tend to make mispredictions. These samples are called ‘out-of-distribution’ (OoD) because they come from new, shifted distributions, or the training set does not have full coverage of the sample space. While models that generalize (perform well on OoD data) may exist, mispredictions on OoD samples are not always the fault of the model, and in those cases the only recourse is to (1) detect and exclude samples suspected as out-of-distribution, (2) seek an expanded training set covering those regions, or (3) apply domain-specific knowledge. In order not to pollute other classes of errors, samples that show high epistemic uncertainty must be detected and their error counted towards generalization errors before other errors are estimated. As an example, the third column of Figure 1 shows model error before (green) and after (red) deployment, with the error significantly rising when the model is evaluated on data collected outside the training time span.

Contention and resource sharing errors: a diverse and variable number of applications compete for compute, networking, and I/O bandwidth on HPC systems and interact with each other through these shared resources [17], [23]. Although the global system state will impact all jobs equally, the impact of resource sharing is specific to pairs of jobs that are interacting and is harder to observe and model. Prediction errors that occur due to lack of visibility into job interactions are called contention errors and are shown in the fourth column of Figure 1. Here, the I/O throughputs of a number of identical runs (same code and data) of different applications illustrate that some applications are more sensitive to contention than others, even when accounting for global system state.

Inherent noise errors: while hard to measure, contention and resource sharing errors can be potentially removed through greater insight into the system and workloads. What fundamentally cannot be removed are inherent noise errors: errors due to random behavior by the system (e.g., dropped packets, randomness introduced through scheduling, etc.). Inherent noise is problematic both because ML models are bound to make errors on samples affected by noise and because noisy samples may impede model training. The fifth column of Figure 1 shows the I/O throughput and start time differences between pairs of identical jobs. The leftmost column contains identical jobs that ran at exactly the same time, which often experience 5% or more difference in I/O throughput.

V. DATASETS AND EXPERIMENTAL SETUP

This work is evaluated on two datasets, one collected from the Argonne Leadership Computing Facility (ALCF) Theta supercomputer in the period from the beginning of 2017 to the end of 2020, and one collected from the National Energy Research Scientific Computing Center (NERSC) Cori supercomputer in the period from the beginning of 2018 to the end of 2019. Theta collects Darshan [24] and Cobalt logs, and the Theta dataset consists of about 100K jobs with an I/O volume larger than 1GiB, while Cori collects Darshan and Lustre Monitoring Tools (LMT) logs, and the Cori dataset consists of 1.1M jobs larger than 1GiB.

Darshan is an HPC I/O characterization tool that collects HPC job I/O access patterns on both POSIX and MPI-IO levels, and serves as our main insight into application behavior. It collects POSIX aggregate job-level data, e.g., the total number of bytes transferred, accesses made, read/write ratios, unique or shared files opened, distribution of accesses per access size, etc. MPI-IO is a library built on top of POSIX that offers higher-level primitives for performing I/O operations and can potentially offer the model greater insight into application semantics and behavior. Darshan collects MPI-IO information for jobs that use it, and all MPI-IO operations are also visible on the POSIX level. Darshan also collects the number of processes ran, which is typically equal to or greater than the number of cores allocated to a job, but Darshan does not currently measure a job’s core count. This information is however available in Cobalt scheduler logs, which contain the number of nodes and cores assigned to a job, job start and end times, job placement, etc. Of the two systems observed in this work, only Theta stores Cobalt scheduler logs. LMT collects I/O subsystem information such as storage server load and file system utilization, and serves as our main insight into the I/O subsystem state as it changes over time. Every 5 seconds, LMT records the state of the Lustre file system’s object storage servers (OSS), object storage targets (OST), metadata servers (MDS), and metadata targets (MDT). Some of the features collected are OSS and MDS CPU and memory utilization, number of bytes transferred to and from the OSTs, file system fullness, number of metadata operations (e.g., open, close, mkdir, etc.) performed by the metadata targets, etc.

Because LMT logs are collected independently from jobs running on the system, during a data preprocessing phase each job’s Darshan log is matched with all LMT measurements collected between the job’s start and stop time. LMT separately logs each OSS, OST, MDS, and MDT I/O node state, but since a job is served by an arbitrary number of these I/O nodes, only the minimum, maximum, mean, and standard deviation of the collected features are exposed to the ML model. Overall, models have access to 48 Darshan POSIX, 48 Darshan MPI-IO, 37 LMT, and 5 Cobalt features. All logs are sanitized and pre-processed according to [2]. During sanitization, jobs with missing features or illegitimate values (e.g., an I/O throughput of zero) are removed. During preprocessing, bounded features (e.g., percentage features such as the I/O read/write rate) are not modified, while unbounded features are either scaled first by taking a log10 and then applying min-max normalization, or alternatively these features are converted to artificial features (e.g., read and write access count features are converted to a bounded read/write ratio feature and an unbounded total access count feature). The final features exposed to the models are reported in Table I. The code and the dataset for this work are provided in the appendix.
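As an illustration of this preprocessing step, the Python sketch below (using pandas) matches a job's Darshan record to the LMT samples taken during its runtime and applies the log10 plus min-max scaling described above. The column names (start_time, end_time, oss_cpu, etc.) are hypothetical placeholders, not the actual log schema.

import numpy as np
import pandas as pd

def match_lmt(job: pd.Series, lmt: pd.DataFrame) -> pd.Series:
    # Keep only LMT samples recorded while the job was running.
    window = lmt[(lmt["timestamp"] >= job["start_time"]) &
                 (lmt["timestamp"] <= job["end_time"])]
    stats = {}
    for col in ["oss_cpu", "mds_cpu", "fs_fullness"]:  # illustrative LMT features
        stats[f"{col}_min"] = window[col].min()
        stats[f"{col}_max"] = window[col].max()
        stats[f"{col}_mean"] = window[col].mean()
        stats[f"{col}_std"] = window[col].std()
    return pd.Series(stats)

def scale_unbounded(x: pd.Series) -> pd.Series:
    # log10 followed by min-max normalization, as applied to unbounded features.
    logged = np.log10(x.clip(lower=1))
    return (logged - logged.min()) / (logged.max() - logged.min())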

The ML models in this work are trained using supervised learning on the task of predicting the I/O throughput of individual HPC jobs. The model error is defined as:

ei = log(ŷi / yi)

where yi and ŷi are the i-th job’s measured and predicted I/O throughputs. Because log(x) = −log(1/x), if a model overestimates or underestimates the I/O throughput by the same relative amount, the absolute error remains the same. We report errors as percentages, where, e.g., a -25% error specifies that the model underestimated the real I/O throughput by 25%. Some figures however show the absolute error when model bias is not important. While models try to minimize mean error, we report median values since some of the distributions have heavy tails that make mean estimates unreliable.
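As a small worked example of this symmetry, assuming the error is the log-ratio of predicted to measured throughput, a 25% overestimate (ŷ = 1.25y) and the reciprocal underestimate (ŷ = y/1.25) produce the same absolute error:

import numpy as np

def log_error(y_true: float, y_pred: float) -> float:
    # Signed log-ratio error: positive when the model overestimates throughput.
    return np.log10(y_pred / y_true)

print(abs(log_error(100.0, 125.0)))  # ~0.097, overestimate by a factor of 1.25
print(abs(log_error(100.0, 80.0)))   # ~0.097, underestimate by the same factor (80 = 100 / 1.25)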

VI. APPLICATION MODELING ERRORS

When an ML practitioner is tasked with a classification or a regression problem, the first model they evaluate will likely under-perform on the task, due to e.g., inadequate data preprocessing, architecture, or hyperparameters. Therefore, the model will suffer from approximation errors, which can be removed by tuning the model hyperparameters or finding more appropriate domain-specific ML model architectures. Since the choice of model architecture and parameters typically has a dominant effect on model error, approximation errors must be resolved before more subtle classes of errors become a limiting factor in improving model performance.

Approximation errors can be split into errors caused by poor modeling of the available data (i.e., applications), and into errors caused by implicit assumptions about the domain (e.g., that the I/O behavior of a system does not change over time). In this section we analyze application modeling errors, and in Section VII we analyze system modeling errors.

This section asks the following questions: do I/O models build faithful representations of application behavior? What are the limits of I/O application modeling? In practice, do I/O models faithfully learn application behavior? Can I/O application modeling benefit from extra hyperparameter finetuning or new application features?

A. Estimating limits of application modeling

Here we develop an application modeling error litmus test which separates the application modeling error eapp from the other four error classes in Equation 5. To do so, we seek a ‘golden model’ (GM) that predicts I/O throughput as accurately as possible given the observable application behavior. The application modeling error of a practical ML model is then estimated by comparing its error rate with that of the golden model.

To build this ‘golden model’, we rely on a property of synthetic datasets where the data-generating process can be freely and repeatedly sampled. When analyzing HPC logs, it is common to see records of the same application run multiple times on the same data, or data of the same format. For example, system benchmarks such as IOR [25] may be run periodically to evaluate file system health and overall performance. We call these sets of repeated jobs ‘duplicate jobs’. Pairs of jobs are duplicates if they belong to the same application and all of their observable application features are identical, typically because the application was run with the same configuration and input data. Because jobs from the same set of duplicates appear identical to an ML model, the model cannot distinguish between them. Given a training set that only contains sets of duplicate jobs, the highest possible accuracy can be achieved by mapping jobs from each individual set of duplicates to the set’s mean I/O throughput. A model that does not learn to predict a set’s mean value is said to suffer from application-modeling error.

By restricting the training set to only sets of duplicates, a golden model with a median absolute error e^g can be built for which e^g_app = 0. This golden model performs only memorization and does not generalize at all, but is nonetheless useful for comparison against real ML models. Any practical model with a median error e^p can then estimate its application modeling error e^p_app on the restricted training set by comparing against the golden model e^g as e^p_app = e^p − e^g. Since duplicate sets can have as few as two jobs, I/O throughput estimates for duplicate sets are biased, and the golden model (GM) may appear to perform better on small sets than on large sets. By applying Bessel’s correction [26], this effect is mitigated, and the litmus test is administered as:

Litmus Test 1 (application modeling): restrict the dataset to sets of duplicate jobs, build a golden model that predicts each set’s (bias-corrected) mean I/O throughput, and estimate a practical model’s application modeling error as e^p_app = e^p − e^g.

Assuming that duplicate jobs are drawn from the same distribution of applications as the rest of the dataset, the golden model median absolute error represents the lower bound on median absolute error a model can achieve on the whole dataset. Note that different applications may have different distributions of duplicate I/O throughputs, as shown in the fourth column of Figure 1. For this litmus test to be accurate, a large sample of applications representative of the HPC system workload must be acquired. When applied to Theta, 19010 duplicates (23.5% of the dataset) over 3509 sets show a median absolute error of 10.01%. Cori has 504920 duplicates (54%) in 77390 sets with a median absolute error of 14.15%. If the litmus test is applied correctly, practical ML models may approach the golden model’s error but cannot surpass it.
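A minimal sketch of this litmus test is shown below, assuming jobs live in a pandas DataFrame with a measured io_throughput column and that jobs sharing identical observable features form a duplicate set (column names are illustrative). The leave-one-out mean is used here as a simple stand-in for the bias correction discussed above.

import numpy as np
import pandas as pd

def golden_model_error(df: pd.DataFrame, feature_cols: list) -> float:
    # Median absolute log error of predicting each duplicate set's mean throughput.
    errors = []
    for _, group in df.groupby(feature_cols):
        if len(group) < 2:
            continue  # only sets of duplicates are used
        log_tput = np.log10(group["io_throughput"].to_numpy())
        for i in range(len(log_tput)):
            others = np.delete(log_tput, i)  # leave-one-out estimate of the set mean
            errors.append(abs(log_tput[i] - others.mean()))
    return float(np.median(errors))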

B. Minimizing application modeling error

The next question is whether ML models can practically reach the error lower bound estimate e^g. Several I/O modeling works have explored different types of ML models: linear regression [2], decision trees [27], gradient boosting machines [2], [27], [28], Gaussian processes [4], neural networks [5], etc. Here, we explore two types of models: XGBoost [29], an implementation of gradient boosting machines, and feedforward neural networks. These model types are chosen for their accuracy and previous success in I/O modeling.

Neither type of model achieves ideal performance ‘out of the box’. XGBoost model performance can be improved through hyperparameter tuning, e.g., by exploring different (1) numbers of decision trees, (2) their depth, (3) the features each tree is exposed to, and (4) the part of the dataset each tree is exposed to. Neural networks are more complex, since they require tuning hyperparameters (learning rate, weight decay, dropout, etc.), while also exploring different architectures (number of layers, their size, type, and connectivity). In the case of XGBoost, we exhaustively explore the four hyperparameters listed above, for a total of 8046 XGBoost models. In the case of neural networks, exhaustive exploration is not feasible due to state space explosion, so we use AgEBO [30], a Network Architecture Search (NAS) method that trains populations of neural networks and updates each subsequent generation’s hyperparameters and architectures through Bayesian logic.
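A condensed sketch of such an exhaustive XGBoost search over the four hyperparameters is given below; the grid values are illustrative (the full search in this work trains 8046 configurations), the target y is assumed to be the log I/O throughput, and the median absolute error on a held-out split is used as the selection criterion.

from itertools import product
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

def grid_search(X, y):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    best_err, best_cfg = np.inf, None
    for n_trees, depth, colsample, subsample in product(
            [32, 100, 256], [6, 12, 21], [0.5, 1.0], [0.5, 1.0]):
        model = xgb.XGBRegressor(n_estimators=n_trees, max_depth=depth,
                                 colsample_bytree=colsample, subsample=subsample)
        model.fit(X_tr, y_tr)
        # Median absolute error on the validation split (y is log throughput).
        err = np.median(np.abs(model.predict(X_val) - y_val))
        if err < best_err:
            best_err, best_cfg = err, (n_trees, depth, colsample, subsample)
    return best_err, best_cfg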

The leftmost column of Figure 1 shows a heatmap of an XGBoost exhaustive search over two parameters on the Theta dataset, with the other two parameters (% of columns and rows revealed to the trees) selected from the best possible result found. The best performing model has an error of 10.51%, close to the predicted bound of 10.01%. The Cori search arrives at a similar configuration with an error of 14.92%. In the case of neural networks, Figure 2 shows a scatter plot of test set errors of 10 generations of neural networks on the Cori system, with 30 networks per generation. The networks are evolved using a separate validation set to prevent leakage of the test set into the model parameters. Networks approach the estimated error limit, and the best result achieves a median absolute error of 14.3%. After extensive tuning, both neural networks and XGBoost models asymptotically approach the estimated limit in model accuracy. Despite the 300 trained neural networks, NAS does little to improve models, since only 6 out of 300 different models improve on previous results (gold stars in Figure 2). This suggests that both types of ML models are impeded by the same barrier and that the architecture and the tuning of models are not the fundamental issue in achieving better accuracy, i.e., that the source of error lies elsewhere.

C. Increasing visibility into applications

While hyperparameter and architecture searches approach but do not surpass the litmus test’s estimated lower bound on error, this is not conclusive evidence that all application modeling error has been removed and that error stems from other sources. Possibly, there exist missing application features that might further reduce errors. We explore two such sets of features: MPI-IO logs and Cobalt scheduler logs.

Figure 3 shows the absolute error distribution of hyperparameter-tuned models trained on three Theta datasets: POSIX, POSIX + MPI-IO, and POSIX + Cobalt (Cori is excluded because of the lack of Cobalt logs). None of the dataset enrichments help reduce error, corroborating the conclusion that poor application modeling is not a source of error for these models, and further insight into applications will not help. Note that this absence of evidence does not imply evidence of absence, i.e., it does not prove that there exist no features that may help improve predictions. However, this experiment does present a best-effort attempt at exposing novel features, and the model’s predictions stay within predicted limits.

Adding Cobalt logs does reduce the error on the training set, and ablation studies show that the job start and end time features are the cause. Once timing features are present in the dataset, no two jobs are duplicates due to small timing variations. While previously the ML model was not able to overfit the dataset due to the existence of duplicates, this is no longer the case, and the ML model can differentiate and memorize each individual sample. In [2], the authors remove timing features for a similar reason: ML models can learn Darshan’s implementation of I/O throughput calculation and make good predictions without observing job behavior.

VII. GLOBAL SYSTEM MODELING ERRORS

The second part of the approximation error in the taxonomy is the global system modeling error. This error refers to I/O climate and I/O weather effects [22] that affect all jobs running on the system, and corresponds to the second component in Equation 3. While global and local system impact on job performance have complex and overlapping effects, factorizing system impact into impact applied to all jobs versus the impact that is dependent on pairs of concurrent jobs is useful for modeling purposes. The main difference between the two is that modeling local system impact requires modeling relationships between all pairs of concurrent jobs, while modeling global system impacts requires modeling only a single but pervasive influence. In other words, global system impact modeling is insensitive to the number of concurrent jobs running on the system, and can be seen as a form of lossy compression of system state and contention impact on jobs.

We now ask: How does I/O contention impact job I/O throughput prediction? What are the limits of global system modeling? Can I/O models approach this limit? What I/O subsystem features can help improve I/O throughput predictions?

A. Estimating limits of global system modeling

Global system impact ζg(t) on job j from Equation 3 can be formalized as some function ζg(t) = g(J(t)) where J is the set of jobs running at time t. Since jobs have a start and end time, given a dataset with a dense enough sampling of J, g(J(t)) can be calculated for every point in time. During periods of time where e.g., the file system is suffering a service degradation, all jobs on the system will be impacted with varying severity. A model of the system does not need to understand how and why the degradation happened, it only needs to know degradation start and end times, and how different types of jobs were impacted. This time-based model is useless for predicting future performance, and its only utility is in evaluating how much of the degradation can be described as purely a function of time. A deployed model does not have insight into the future and will still need to observe the system.

To evaluate the global system impact, a golden model that exhibits no global modeling error is developed, against which other, ‘real’ ML models can be compared. Since the global system impact ζg(t) only depends on time t and may ignore the set of all jobs J, only application behavior j and the job start time feature are exposed to the golden model.

Both the real and golden models have optimized hyperparameters and should have eapp = 0, but only the golden model has esystem = 0 (assuming enough data to memorize ζg(t) is available). The litmus test compares these two models to determine e^p_system = e^p − e^g. Here, a golden model is an XGBoost model fine-tuned on a validation set and evaluated on a test set. Assuming that the golden model is exposed to enough jobs throughout the lifetime of the system, it will learn the impact of ζg(t) even without having access to the underlying system features causing that impact. This golden model is used in the following litmus test:

Litmus Test 2 (global system modeling): train a golden model m(jo, t) that observes application behavior and the job start time, and estimate a practical model’s global system modeling error as e^p_system = e^p − e^g.

If the litmus test is applied correctly, the golden model only suffers from the last three classes of errors: poor generalization, local system impact, and inherent noise. Note that the litmus test is applied on the whole dataset, and not just duplicates, because the less numerous duplicate jobs do not cover the whole lifetime of the system well. In Figure 4 we evaluate a baseline model (blue) and a model enriched with the job start time (orange). Adding a start time feature has a large impact on error: on Cori, the error drops 40%, from 16.49% down to 10.02%, while on Theta the error drops by 30.8%. To obtain this higher accuracy on the POSIX+time dataset, a far larger model is needed, i.e., one that can remember the I/O weather throughout the lifetime of the system.

Note that the timestamp feature fed to the golden model serves no purpose at deployment time, since the ML model cannot learn the state of the system as it is happening. This golden model is useful to retrospectively analyze past states and validate that deployment-time models are not suffering from system modeling errors.

B. Improving modeling through I/O visibility

With an estimate of the minimal error achievable assuming perfect application and global system modeling, we investigate whether I/O subsystem logs can help models approach this limit. Since Theta does not collect I/O subsystem logs, we analyze Cori, which collects both application and I/O logs. Figure 4 shows the XGBoost performance of three models: a baseline where eapp = 0 (blue), the litmus test’s golden model where also esystem = 0 (orange), and a Lustre-enriched model (green). Cori’s median absolute error is reduced by 40%, from 16.49% down to 9.96%. The Lustre-enriched results are surprisingly close to the litmus test’s predictions, and suggest that predictions cannot be improved through further I/O insight since the litmus test’s prediction is reached.
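A sketch of this comparison under the stated assumptions is given below: the same XGBoost model is trained on application features only, on application features plus the job start time, and on application features plus the LMT (Lustre) summary features, and the median absolute errors on a held-out split are compared. The column names (io_throughput_log, job_start_time, etc.) are illustrative.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

def compare_feature_sets(df, app_cols, lmt_cols, target="io_throughput_log"):
    variants = {"application only": app_cols,
                "application + start time": app_cols + ["job_start_time"],
                "application + Lustre": app_cols + lmt_cols}
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)
    errors = {}
    for name, cols in variants.items():
        model = xgb.XGBRegressor(n_estimators=256, max_depth=12)
        model.fit(train_df[cols], train_df[target])
        preds = model.predict(test_df[cols])
        # Median absolute error in log space for this feature set.
        errors[name] = float(np.median(np.abs(preds - test_df[target])))
    return errors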

VIII. GENERALIZATION ERRORS

The remaining three classes of error are caused by lack of data and not poor modeling, as the top branch of the taxonomy shows. While I/O contention and inherent noise errors are examples of aleatory uncertainty and are caused by lack of insight into specific jobs, generalization errors stem from epistemic uncertainty, i.e., the lack of other logged jobs around a specific job of interest. To motivate this section, in the third graph of Figure 1 we show the error distribution of a model trained on data from January 2018 to July 2019. When evaluated on held-out data from the same period, the median absolute error is low (green line). Once the model is deployed and evaluated on the data collected after the training period (July 2019 and after), the median error spikes up (red line).

A. Estimating generalization error

Estimating the amount of out-of-distribution error eood is important because any unaccounted OoD error will be classified as noise or contention. This will make systems that run a lot of novel jobs appear to be more noisy than they truly are. Because OoD and ID jobs likely have a similar amount of I/O and contention noise, false positives (ID jobs classified as OoD) are preferable to false negatives, since false negatives contribute to overestimating I/O noise. To estimate the impact of out-of-distribution jobs on error eood, we aim to quantify how much of the error is epistemic and how much is aleatory in nature, as shown in Figure 1 (upper right). The leading paradigm for uncertainty quantification works by training an ensemble of models and evaluating all of the models on the test set. If the models make the same error, the sample has high aleatory uncertainty, but if the models disagree, the sample has high epistemic uncertainty [31]. The intuition is that predictions on out-of-distribution samples will vary significantly on the basis of the model architecture, whereas predictions on ID but noisy samples will agree and exhibit the same bias. Since this method relies on the ensemble having high model diversity, several works have explored increasing diversity through different model hyperparameters [32], different architectures [33], or both [21]. We choose to use AutoDEUQ [21], a method that evolves an ensemble of neural network models and jointly optimizes both the architecture and hyperparameters of the models. While in theory any type of machine learning model can be used for the model population, neural networks are attractive due to their high hyperparameter count, diverse architectures found in practice, and high generalization capability. Additionally, AutoDEUQ's Neural Architecture Search (NAS) is compatible with the NAS search from Section VI, reducing the computational load of applying the taxonomy. Note that in order for AutoDEUQ to correctly split error into eood vs. econtention + enoise, all application and system modeling errors eapp and esystem must first be removed. Therefore, the function of the NAS is two-fold in this litmus test: (1) eliminate application and system modeling errors, and (2) create a diverse model population.

Figure 5 shows the distribution of epistemic (EU) and aleatory uncertainties (AU) of the Theta and Cori test sets. For both systems, aleatory uncertainty is significantly higher than epistemic uncertainty. Furthermore, all jobs seem to have AU larger than about 0.05, hinting at the inherent noise present in the system. The inverse cumulative distributions on the margins (red) show what percentage of the total error is caused by AU / EU below that value. For example, for both systems 50% of all error is caused by jobs with EU below 0.04, while in the case of AU, 50% of error is below AU=0.25. The low total EU is expected since the test set was drawn from the same distribution as the training set, and increases on the 2020 set (omitted due to space concerns).
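The decomposition itself can be sketched with a plain deep-ensemble formulation (a simplification in the spirit of the approach described above, not the actual AutoDEUQ implementation): each ensemble member predicts a mean and a variance, and the law of total variance splits predictive uncertainty into an aleatory part (the average predicted variance) and an epistemic part (the disagreement between member means).

```python
# Simplified deep-ensemble uncertainty decomposition; inputs are hypothetical.
import numpy as np

def decompose_uncertainty(member_means, member_vars):
    """member_means, member_vars: arrays of shape (n_models, n_jobs)."""
    aleatory = member_vars.mean(axis=0)    # average predicted noise per job
    epistemic = member_means.var(axis=0)   # disagreement between the members
    return aleatory, epistemic

# Toy example: five ensemble members, three jobs. The third job's predictions
# disagree strongly, so it shows high epistemic uncertainty.
means = np.array([[1.0, 2.0, 3.0],
                  [1.1, 2.0, 2.2],
                  [0.9, 2.1, 3.8],
                  [1.0, 1.9, 2.5],
                  [1.0, 2.0, 3.5]])
variances = np.full_like(means, 0.05)
au, eu = decompose_uncertainty(means, variances)
print("AU:", au, "EU:", eu)
```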


Epistemic uncertainty does not directly translate into the out-of-distribution error eood from Equation 5. When a sample is truly OoD, it may not be possible to separate aleatory and epistemic uncertainty, since a good estimate of AU requires dense sampling around the job of interest. Therefore, we choose to attribute all errors of a sample marked as out-of-distribution to eood. This error attribution requires classifying every test set sample as either in- or out-of-distribution, but since EU estimates are continuous values, an EU threshold that separates OoD and ID samples is required. Although this threshold is specific to the dataset and may require tuning, the quick drop or 'shoulder' in the inverse cumulative error graph around EU=0.1 in Figure 5 makes the choice of an EU threshold robust. A litmus test that estimates the error due to out-of-distribution samples therefore consists of the following steps: train a diverse ensemble (with application and system modeling errors already removed), estimate the epistemic uncertainty of each test job, classify jobs whose EU exceeds the threshold as out-of-distribution, and attribute their entire error to eood.
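A minimal sketch of the thresholding step is shown below: jobs whose epistemic uncertainty exceeds the chosen threshold are marked as OoD, and the share of total error they carry is attributed to eood. The arrays and the example threshold of 0.24 are placeholders.

```python
# Hedged sketch of the OoD litmus test's final step; inputs are placeholders.
import numpy as np

def ood_error_share(abs_errors, epistemic_u, eu_threshold=0.24):
    abs_errors = np.asarray(abs_errors, dtype=float)
    epistemic_u = np.asarray(epistemic_u, dtype=float)
    ood = epistemic_u > eu_threshold
    share_of_jobs = ood.mean()                       # fraction flagged as OoD
    share_of_error = abs_errors[ood].sum() / abs_errors.sum()
    return ood, share_of_jobs, share_of_error

# abs_errors: per-job |predicted - measured| throughput; epistemic_u: per-job
# EU from the ensemble decomposition sketched above.
```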


On Theta, for an EU threshold of 0.24, 0.7% of the samples are classified as OoD but constitute 2.4% of the errors, while on Cori 2.1% of the error is removed at the same EU threshold. In other words, the selected jobs have a 3× larger average error than random samples. By visualizing the high-dimensional job features using the Gauge tool [8] and interactively exploring the types of jobs that do get removed, we confirm that OoD-classified jobs are typically rare or novel applications.


IX. I/O CONTENTION AND INHERENT NOISE ERRORS

With the ability to estimate the amount of application and system modeling error, as well as detect outlier jobs, the leftover error is caused by system contention or inherent noise. Both of these error classes are caused by aleatory uncertainty, since the model lacks deeper insight into jobs or the system, as opposed to the OoD case where the model lacks samples. While, e.g., application error was explainable in terms of broad application behavior (e.g., this application is slow because it frequently writes to shared files, but the model fails to learn this effect), the impact of contention and noise on I/O throughput is caused by lower-level, transient effects. Though it may be possible to observe and log such effects through microarchitectural hardware counters or network switch logs, such logging would require vast amounts of storage per job and may impact performance. The lack of practical logging tools makes the last two error categories typically unobservable. Furthermore, these two classes may only be separated in hindsight, and while I/O noise levels may be constant, the amount of I/O contention on the system is unpredictable for a job that is about to run.


The questions we ask in this section are: how can errors due to noise and contention be separated from errors due to poor modeling or epistemic uncertainty? Is there a fundamental limit to how accurate I/O models can become? What steps are necessary to quantify system I/O variability?


A. Establishing the bounds of I/O modeling

To separate contention and noise impacts from the first three classes of error, we develop a litmus test based on the test from Section VI. There, by observing sets of duplicates, the error of a golden model e^g was estimated, where e^g_app = 0. Comparing real models against this ideal model allows for calculating a real model's eapp. That litmus test works by 'holding constant' the application behavior j within a set of duplicates, i.e., by preventing any input variance from reaching the model. The noise and contention litmus test introduced here seeks to hold constant not only the application behavior, but also the global system impact and the impact of poor generalization. We design a litmus test that enforces a stronger requirement on duplicate sets, where pairs of jobs are duplicates only if they have both the same application behavior j and the same start time t. The test assumes that identical jobs run at the same time are exposed to the same global system impact ζg(t), but not necessarily the same local impact. The litmus test therefore estimates the sum of contention and noise error for a golden model, where only concurrent duplicates are observed and both the application behavior j and the global system behavior ζg(t) are held static for each duplicate set.
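A hedged sketch of how concurrent-duplicate sets could be formed is given below: jobs are grouped by a hash of their application-behavior features together with their start time, and the spread of measured I/O throughput within each set estimates the contention plus noise error of the golden model. The column names, the hashing shortcut, and the exact-start-time grouping (a small tolerance may be needed in practice) are assumptions.

```python
# Hypothetical construction of concurrent-duplicate sets; columns and the
# exact-equality grouping are illustrative assumptions.
import numpy as np
import pandas as pd

jobs = pd.read_parquet("darshan_jobs.parquet")
feature_cols = [c for c in jobs.columns if c.startswith("POSIX_")]

# Jobs with identical application-behavior features get the same key.
jobs["behavior_key"] = pd.util.hash_pandas_object(jobs[feature_cols], index=False)

# Concurrent duplicates: same behavior AND same start time, set size >= 2.
concurrent = jobs.groupby(["behavior_key", "start_time"]).filter(lambda g: len(g) >= 2)

def within_set_spread(g):
    mean = g["io_throughput"].mean()
    return np.abs(g["io_throughput"] - mean) / mean

spread = concurrent.groupby(["behavior_key", "start_time"],
                            group_keys=False).apply(within_set_spread)
```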



In the fifth column of Figure 1 we show the distribution of I/O throughput differences ∆ϕ and timing differences ∆t between all pairs of Cori duplicate jobs, weighted so that large duplicate sets are not overrepresented. The vertical strip on the left contains Cori duplicate jobs that ran simultaneously, largely because they were batched together. These jobs share j and ζg, but may differ in ζl and ω. Due to the denser sampling in the 1-minute-to-1-hour range, it is not immediately apparent how the I/O difference changes between duplicates run at the same time and duplicates run with a small delay. By grouping duplicates from different ∆t ranges and independently scaling them, a better understanding of duplicate I/O throughput distributions across timescales can be gained, as shown in Figure 6 (Theta shown, Cori omitted due to lack of space). For both systems, the distributions on the right contain jobs run over large periods of time where the global system impact ζg might have changed, explaining the asymmetric shape of some of them. The left-most distributions are similar, since variance only stems from contention ζl and noise ω. While some distributions (e.g., the 10^5 to 10^6 second range) show complex multimodal behavior, all of the distributions seem to contain the initial zero-second (0s to 1s) distribution.
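The weighting used above can be sketched as follows: every pair of jobs within a duplicate set contributes one (∆t, ∆ϕ) point, weighted by the inverse of the number of pairs in its set so that very large sets do not dominate the distributions. The helper below operates on hypothetical per-set DataFrames with start_time and io_throughput columns.

```python
# Hedged sketch of the pairwise (dt, dphi) computation with per-set weighting;
# column names are assumptions.
from itertools import combinations
import pandas as pd

def duplicate_pair_differences(duplicate_sets):
    """duplicate_sets: iterable of DataFrames, one per duplicate set."""
    rows = []
    for dset in duplicate_sets:
        pairs = list(combinations(dset.itertuples(index=False), 2))
        if not pairs:
            continue
        weight = 1.0 / len(pairs)                   # down-weight large sets
        for a, b in pairs:
            dt = abs((a.start_time - b.start_time).total_seconds())
            dphi = abs(a.io_throughput - b.io_throughput) / max(
                a.io_throughput, b.io_throughput)
            rows.append({"dt_s": dt, "dphi": dphi, "weight": weight})
    return pd.DataFrame(rows)

# pairs = duplicate_pair_differences(per_set_frames)   # hypothetical input
# concurrent_bucket = pairs[pairs.dt_s < 1.0]          # the 0s-to-1s bucket
```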


By fitting a normal distribution to the ∆t = 0 distribution (0s to 1s) in Figure 6, we can both (1) learn the lower limit on total modeling error and (2) learn the system's I/O noise level, i.e., how much I/O throughput variance jobs running on the system should expect. However, upon closer inspection, the ∆t = 0 distribution does not follow a normal distribution.

This is surprising since, if the noise follows some (not necessarily normal) stationary distribution, is independent over time, and its effects are cumulative, then by the central limit theorem the total noise impact should be normally distributed. The answer lies in how the concurrent (∆t = 0) duplicates are sampled. In general, duplicate sets contain between 2 and hundreds of thousands of identical jobs. However, in duplicate sets with identical start times on Theta, 70% of the sets have only two identical jobs, and 96% have 6 jobs or fewer, with similar results on Cori. The issue stems from how errors are calculated for small (sub-30 sample) duplicate sets: when only a small number of jobs exist in the set, the mean I/O throughput of the set is biased by the sampling, i.e., the estimated mean is closer to the samples than the real mean is. This causes the set's I/O throughput variance to shrink, and the duplicate error estimate is therefore reduced as well. Student's t-distribution describes this effect: when the true mean of a distribution is known, error calculations follow a normal distribution; when the true mean is not known, the biased mean estimate makes the error follow the t-distribution. As the set size increases, the t-distribution approaches a normal distribution. However, naively taking the variance of the t-distribution produces a biased sample variance σ², which can be de-biased by applying Bessel's correction, σ̄² = (n / (n − 1)) σ².
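The correction itself is a one-liner, shown below for a tiny two-job duplicate set. The throughput values are made up; the function is equivalent to numpy's variance with ddof=1.

```python
# Bessel's correction for small duplicate sets: the naive (population) variance
# divides by n and underestimates the spread; rescaling by n/(n-1) de-biases it.
import numpy as np

def debiased_set_variance(throughputs):
    x = np.asarray(throughputs, dtype=float)
    n = len(x)
    biased = np.mean((x - x.mean()) ** 2)        # divides by n, biased low
    return biased * n / (n - 1)                  # same as np.var(x, ddof=1)

# Illustrative two-job concurrent duplicate set (throughputs in GB/s, made up):
print(debiased_set_variance([4.8, 5.2]))         # 0.08, vs. biased value 0.04
```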



With de-biasing in place, we estimate the I/O noise variance of the two systems. The results show that a job running on Theta can expect an I/O throughput within ±5.71% of the predicted value 68% of the time, or within ±10.56% 95% of the time. For Cori, these values are ±7.21% and ±14.99%, respectively. This is a fundamental barrier not just to I/O model improvement, but to predictable system usage in general. Although some insight into contention can be gained through low-level logging tools, noise cannot be overcome. I/O practitioners can use this litmus test to evaluate the noise levels of their systems, and ML practitioners should reconsider how they evaluate models, since some systems may simply be harder to model.
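Intervals of this kind can be reproduced in spirit with a few lines: take the absolute relative differences of concurrent duplicates (or synthetic noise with the estimated standard deviation) and read off the empirical 68th and 95th percentiles rather than assuming exact normality. The synthetic data below uses a 5.7% standard deviation purely for illustration.

```python
# Hedged sketch: summarizing I/O noise as expected throughput intervals.
import numpy as np

def expected_io_intervals(relative_diffs):
    d = np.abs(np.asarray(relative_diffs, dtype=float))
    return np.percentile(d, 68), np.percentile(d, 95)

# Illustrative use with synthetic, roughly normal noise of 5.7% std. dev.:
rng = np.random.default_rng(0)
print(expected_io_intervals(rng.normal(0.0, 0.057, size=10_000)))
```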


X. APPLYING THE TAXONOMY

We now illustrate how the proposed taxonomy can be used in practice. Figure 7 shows the steps a modeler can follow to evaluate the taxonomy on a new system.

Step 1: The modeler splits the available data into training and test sets, then trains and evaluates a baseline machine learning model on the task of predicting I/O throughput. This model does not have to be fine-tuned, as the taxonomy will reveal the main sources of error and approximately how much of the error is attributable to model quality.

Step 2.1: The modeler estimates application modeling errors by finding duplicate jobs and evaluating the mean predictor's performance on every set of duplicates. Assuming that the distribution of duplicate HPC jobs is representative of the whole population of jobs, this step provides a lower bound on the application modeling error.

Step 2.2: By contrasting the baseline model error (Step 1) with the estimated application modeling error, the modeler can estimate the percentage of error attributable to poor modeling. The modeler performs a hyperparameter or network architecture search and arrives at a good model close to the bound.

Step 3.1: The modeler estimates system modeling errors by exposing the job start time feature to a golden model. This step requires that the modeler has developed a well-performing model in Step 2.2, i.e., one that achieves close to the estimated ideal performance. The test set error of this model serves as an estimate of the application + system modeling lower bound.

Step 3.2: The modeler explores adding sources of system data to improve the performance of the baseline model up to the estimated limit of application and system modeling.

Step 4: The modeler identifies out-of-distribution samples using AutoDEUQ, calculates the OoD error that stems from these samples, and removes them from the dataset.

Step 5: The modeler estimates the error attributable to contention and noise, as well as the I/O variance of the system. This estimate is made by observing the I/O throughput differences between sets of concurrent duplicates, i.e., duplicate jobs run at around the same time.


In Figure 8 we show the average baseline model error (inner pink circle segment) of both the ANL Theta and NERSC Cori systems, and how that error is broken down into different classes of error. We do not focus on the cumulative (total) error value of the two systems; instead, we focus on attributing the baseline model error to the five classes of errors in the taxonomy (middle circle segments of the pie chart), and on the percentage of error that can be removed through improved application and system modeling (outer segments of the pie chart). The inner blue section of the two pie charts represents the estimated application modeling error, as arrived at in Step 2.1. The outer blue section represents how much of the error can be fixed through hyperparameter exploration, as explored in Step 2.2. The inner green section represents the estimated system modeling error, derived in Step 3.1. Note that the total percentage of system modeling error is relatively small on both systems; i.e., I/O contention, filesystem health, hardware faults, etc., do not have a dominant impact on I/O throughput. The outer green circle segment represents the percentage of error that can be fixed by including system logs (LMT logs in our case), as described in Step 3.2. Only the Cori pie chart has this segment, as Theta does not collect LMT logs. On Cori, the inclusion of LMT logs helps remove most of the system modeling errors, reinforcing the conclusion that including other logs (e.g., topology, networking) may not help to significantly reduce errors. The inner red segment represents the percentage of error that can be attributed to out-of-distribution samples of the two systems, as calculated in Step 4. Finally, the yellow circle segment represents the percentage of error that can be attributed to aleatory uncertainty. For both Theta and Cori, this is a rather large amount, pointing to the fact that there exists a lot of innate noise in the behavior of these systems, and setting a relatively high lower bound on ideal model error.


The similarity between the modeling error estimates (Steps 2.1 and 3.1) and the actual updated model performance (Steps 2.2 and 3.2) is surprising and serves as evidence for the quality of the error estimates. However, the estimates of the five error classes do not add up to 100%. The first three error estimates are just that: estimates, derived from a subset of data (duplicate HPC jobs) that does not necessarily follow the same distribution as the rest of the dataset and may be biased. If we add up the estimates, we see that on Theta 32.9% of the error is unexplained, and on Cori 13.5% of the error is unexplained. Cori's lower unexplained error may be due to the fact that we collected some 1.1M jobs on Cori compared to 100K on Theta.


XI. DISCUSSION AND FUTURE WORK

Developing production-ready machine learning models that analyze HPC jobs and predict I/O throughput is difficult: the space of all application behaviors is large, HPC jobs compete for resources, and the system changes over time. To efficiently improve these models, we present a taxonomy of HPC I/O modeling errors that enables independent study of different types of errors, helps quantify their impact, and identifies the most promising avenues for model improvement. Our taxonomy breaks errors into five categories: (1) application and (2) system modeling errors, (3) poor generalization, (4) resource contention, and (5) I/O noise. We present litmus tests that quantify what percentage of model error should be attributed to each class, and show that models improved by using the taxonomy are within a percentage point of an estimated best-case I/O throughput modeling accuracy. We show that a large portion of I/O throughput modeling error is irreducible and stems from I/O variability. We provide tests that quantify the I/O variability and establish an upper bound on how accurate I/O models can become. Our test shows that jobs run on Theta and Cori can expect an I/O throughput standard deviation of 5.7% and 7.2%, respectively.


In future work, we plan to explore why the error classes in Figure 8 do not add up to 100%. Our hypothesis is that a poor duplicate distribution is the source of this discrepancy, and that instead of duplicate jobs, a targeted set of repeated microbenchmarks may better inform the framework introduced in this work. By tuning and executing microbenchmarks representative of the system's application distribution, we hope to build a minimal set of workloads that evaluates system parameters such as the I/O noise level or application parameters such as I/O contention sensitivity. We also plan to explore how transferable this set of benchmarks is, and whether different HPC system workloads can be accurately represented by a set of weighted microbenchmarks.

