Kubernetes Scheduler Plugins

Reference:

https://github.com/kubernetes/kubernetes/tree/master/pkg/scheduler/framework/plugins

Scheduler Framework Plugins

Creating a new in-tree plugin

Read the docs to understand the different extension points within the scheduling framework.

TODO(#95156): flesh this out a bit more.

Adding plugin configuration parameters through KubeSchedulerConfiguration

You can give users the ability to configure parameters in scheduler plugins using KubeSchedulerConfiguration. This section covers how you can add arguments to existing in-tree plugins (example PR). Let’s assume the plugin is called FooPlugin and we want to add an optional integer parameter named barParam.

Defining and registering the struct

First, we need to define a struct type named FooPluginArgs in pkg/scheduler/apis/config/types_pluginargs.go, which is the representation of the configuration parameters that is internal to the scheduler.

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

type FooPluginArgs struct {
	// metav1 is k8s.io/apimachinery/pkg/apis/meta/v1
	metav1.TypeMeta
	BarParam int32
}

Note that we embed k8s.io/apimachinery/pkg/apis/meta/v1.TypeMeta to include API metadata for versioning and persistence. We add the +k8s:deepcopy-gen:interfaces comment to auto-generate a DeepCopy function for the struct.

Similarly, define FooPluginArgs in k8s.io/kube-scheduler/config/{version}/types_pluginargs.go, which is the versioned representation used in the kube-scheduler binary used for deserialization. This time, however, in order to allow implicit default values for arguments, the type of the struct’s fields may be pointers; leaving a parameter unspecified will set the pointer field to its zero value (nil), which can be used to let the framework know that it must fill in the default value. BarParam is of type int32 and let’s say we want a non-zero default value for it:

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

type FooPluginArgs struct {
	metav1.TypeMeta `json:",inline"`
	BarParam *int32 `json:"barParam,omitempty"`
}

For each types_pluginargs.go addition, remember to register the type in the corresponding register.go, which will allow the scheduler to recognize KubeSchedulerConfiguration values at parse-time.

Setting defaults

When a KubeSchedulerConfiguration object is parsed (happens in cmd/kube-scheduler/app/options/options.go), the scheduler will convert from the versioned type to the internal type, filling in the unspecified fields with defaults. Speaking of defaults, define SetDefaults_FooPluginArgs in pkg/scheduler/apis/config/v1beta1/defaults.go as follows:

// v1beta1 refers to k8s.io/kube-scheduler/config/v1beta1.
// pointer refers to k8s.io/utils/pointer.
func SetDefaults_FooPluginArgs(obj *v1beta1.FooPluginArgs) {
	if obj.BarParam == nil {
		obj.BarParam = pointer.Int32Ptr(42)
	}
}
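
The defaulting and conversion steps can be tried outside the Kubernetes tree with stand-in types. The following is a minimal sketch, not the generated code: versionedArgs, internalArgs, setDefaults, and convert are hypothetical names, and in the real tree the conversion function is generated rather than written by hand.

```go
package main

import "fmt"

// versionedArgs stands in for the versioned FooPluginArgs: a pointer field
// lets "unspecified" (nil) be told apart from an explicit zero.
type versionedArgs struct {
	BarParam *int32
}

// internalArgs stands in for the internal FooPluginArgs, which holds plain values.
type internalArgs struct {
	BarParam int32
}

// setDefaults fills in unspecified fields; 42 is the example default.
func setDefaults(obj *versionedArgs) {
	if obj.BarParam == nil {
		v := int32(42)
		obj.BarParam = &v
	}
}

// convert copies the defaulted versioned value into the internal type.
// It assumes setDefaults has already run, so BarParam is non-nil.
func convert(in *versionedArgs) internalArgs {
	return internalArgs{BarParam: *in.BarParam}
}

func main() {
	unset := &versionedArgs{}
	setDefaults(unset)
	fmt.Println(convert(unset).BarParam) // omitted value receives the default

	zero := int32(0)
	explicit := &versionedArgs{BarParam: &zero}
	setDefaults(explicit)
	fmt.Println(convert(explicit).BarParam) // explicit zero is preserved
}
```

The pointer is what makes an explicit 0 distinguishable from an omitted field; with a plain int32 both cases would look the same to the defaulting function.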

Validating configuration at runtime

Next, we need to define validators to make sure the user’s configuration and your default values are valid. To do this, add something like this in pkg/scheduler/apis/config/validation/validation_pluginargs.go:

// From here on, FooPluginArgs refers to the internal type defined in
// pkg/scheduler, not the versioned kube-scheduler type. We're dealing with
// post-default values.
func ValidateFooPluginArgs(args config.FooPluginArgs) error {
	// Reject values outside [0, 100]; the two conditions must be joined
	// with ||, because no value can be both below 0 and above 100.
	if args.BarParam < 0 || args.BarParam > 100 {
		return fmt.Errorf("must be in the range [0, 100]")
	}
	return nil
}

Code generation

We have defined everything necessary to run code generation now. Remember to commit all your changes (not sure why this is needed) and do a make clean first. Then:

$ cd $GOPATH/src/k8s.io/kubernetes
$ git add -A && git commit
$ make clean
$ ./hack/update-codegen.sh
$ make generated_files

This should automatically generate code to deep copy objects, convert between different struct types, convert pointer types to raw types, and set defaults.

Testing

After code generation, go back and write tests for all of the changes you made in the previous section:

  • pkg/scheduler/apis/config/v1beta1/defaults_test.go to unit test the defaults.
  • pkg/scheduler/apis/config/validation/validation_pluginargs_test.go to unit test the validator.
  • pkg/scheduler/apis/config/scheme/scheme_test.go to test the whole pipeline using a KubeSchedulerConfiguration definition.
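
As an illustration of what a validator test might cover, here is a minimal table-driven sketch. It is written as a standalone program rather than a real _test.go file, and validateBarParam is a hypothetical stand-in for the validator, assuming the accepted range is [0, 100].

```go
package main

import "fmt"

// validateBarParam is a stand-in range validator: it accepts values in
// [0, 100] and rejects everything else.
func validateBarParam(v int32) error {
	if v < 0 || v > 100 {
		return fmt.Errorf("barParam %d must be in the range [0, 100]", v)
	}
	return nil
}

func main() {
	// Each case pairs an input with whether an error is expected.
	cases := []struct {
		name    string
		value   int32
		wantErr bool
	}{
		{"lower bound", 0, false},
		{"upper bound", 100, false},
		{"default", 42, false},
		{"negative", -1, true},
		{"too large", 101, true},
	}
	for _, c := range cases {
		err := validateBarParam(c.value)
		if (err != nil) != c.wantErr {
			fmt.Printf("FAIL %s: got err=%v, wantErr=%v\n", c.name, err, c.wantErr)
			continue
		}
		fmt.Printf("ok   %s\n", c.name)
	}
}
```

Covering both boundaries and one value on each side of the range is what catches a mistaken condition in the validator.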

Receiving the arguments in the plugin

We can now finally receive FooPluginArgs in the plugin code. To do this, modify the plugin’s New method signature like so:

func New(fpArgs runtime.Object, fh framework.FrameworkHandle) (framework.Plugin, error) {
	// config.FooPluginArgs refers to the pkg/scheduler struct type definition.
	args, ok := fpArgs.(*config.FooPluginArgs)
	if !ok {
		return nil, fmt.Errorf("got args of type %T, want *FooPluginArgs", fpArgs)
	}
	if err := validation.ValidateFooPluginArgs(*args); err != nil {
		return nil, err
	}
	// Use args.BarParam as you like, for example by storing it on the
	// plugin. FooPlugin and its barParam field are illustrative names.
	return &FooPlugin{barParam: args.BarParam}, nil
}

Scheduling Framework

FEATURE STATE: Kubernetes v1.15 [alpha]

The scheduling framework is a pluggable architecture for Kubernetes Scheduler that makes scheduler customizations easy. It adds a new set of “plugin” APIs to the existing scheduler. Plugins are compiled into the scheduler. The APIs allow most scheduling features to be implemented as plugins, while keeping the scheduling “core” simple and maintainable. Refer to the design proposal of the scheduling framework for more technical information on the design of the framework.

Framework workflow

The Scheduling Framework defines a few extension points. Scheduler plugins register to be invoked at one or more extension points. Some of these plugins can change the scheduling decisions and some are informational only.

Each attempt to schedule one Pod is split into two phases, the scheduling cycle and the binding cycle.

Scheduling Cycle & Binding Cycle

The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. Together, a scheduling cycle and binding cycle are referred to as a “scheduling context”.

Scheduling cycles are run serially, while binding cycles may run concurrently.

A scheduling or binding cycle can be aborted if the Pod is determined to be unschedulable or if there is an internal error. The Pod will be returned to the queue and retried.
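
The split between serial scheduling cycles and concurrent binding cycles can be pictured with goroutines. This is only a sketch under simplifying assumptions: schedule and run are hypothetical helpers, the round-robin node choice is purely illustrative, and retries on failure are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// schedule stands in for a scheduling cycle: it picks a node for a Pod.
// The round-robin choice is purely illustrative.
func schedule(i int, nodes []string) string {
	return nodes[i%len(nodes)]
}

// run processes Pods one at a time (serial scheduling cycles) and starts a
// goroutine per Pod for the binding cycle, so bindings may overlap.
func run(pods, nodes []string) map[string]string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		bindings = make(map[string]string)
	)
	for i, pod := range pods {
		node := schedule(i, nodes) // scheduling cycle: runs serially
		wg.Add(1)
		go func(pod, node string) { // binding cycle: may run concurrently
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bindings[pod] = node
		}(pod, node)
	}
	wg.Wait()
	return bindings
}

func main() {
	got := run([]string{"pod-a", "pod-b", "pod-c"}, []string{"node-1", "node-2"})
	fmt.Println(len(got), got["pod-a"], got["pod-b"], got["pod-c"])
}
```

The loop never starts choosing a node for the next Pod until the current choice is made, but it does not wait for the previous binding to finish.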

Extension points

The following picture shows the scheduling context of a Pod and the extension points that the scheduling framework exposes. In this picture “Filter” is equivalent to “Predicate” and “Scoring” is equivalent to “Priority function”.

One plugin may register at multiple extension points to perform more complex or stateful tasks.

[Figure: scheduling framework extension points]

QueueSort

These plugins are used to sort Pods in the scheduling queue, which determines which Pod is scheduled first. A queue sort plugin essentially provides a Less(Pod1, Pod2) function that compares two Pods and reports which one should be scheduled first. Only one queue sort plugin may be enabled at a time.

PreFilter

These plugins are used to pre-process info about the Pod, or to check certain conditions that the cluster or the Pod must meet. If a PreFilter plugin returns an error, the scheduling cycle is aborted.

Filter

These plugins are used to filter out nodes that cannot run the Pod. For each node, the scheduler will call filter plugins in their configured order. If any filter plugin marks the node as infeasible, the remaining plugins will not be called for that node. Nodes may be evaluated concurrently.
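
The short-circuit behaviour can be sketched as follows. The node type, filterFunc, and feasible are hypothetical stand-ins, and the two example filters (readiness and free CPU) are assumptions made up for the illustration.

```go
package main

import "fmt"

// node and filterFunc are hypothetical stand-ins for the framework's types.
type node struct {
	Name    string
	FreeCPU int
	Ready   bool
}

type filterFunc func(n node) bool

// feasible runs the filters in their configured order and stops at the first
// one that rejects the node; calls counts how many filters actually ran.
func feasible(n node, filters []filterFunc) (ok bool, calls int) {
	for _, f := range filters {
		calls++
		if !f(n) {
			return false, calls
		}
	}
	return true, calls
}

func main() {
	filters := []filterFunc{
		func(n node) bool { return n.Ready },
		func(n node) bool { return n.FreeCPU >= 2 },
	}
	for _, n := range []node{{"node-1", 4, true}, {"node-2", 8, false}, {"node-3", 1, true}} {
		ok, calls := feasible(n, filters)
		fmt.Println(n.Name, ok, calls)
	}
}
```

For node-2 only the first filter runs, because the node is already marked infeasible before the CPU check is reached.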

PostFilter

These plugins are called after the Filter phase, but only when no feasible nodes were found for the Pod. Plugins are called in their configured order. If any PostFilter plugin marks the node as Schedulable, the remaining plugins will not be called. A typical PostFilter implementation is preemption, which tries to make the Pod schedulable by preempting other Pods.

PreScore

These plugins are used to perform “pre-scoring” work, which generates a sharable state for Score plugins to use. If a PreScore plugin returns an error, the scheduling cycle is aborted.

Score

These plugins are used to rank nodes that have passed the filtering phase. The scheduler will call each scoring plugin for each node. There will be a well defined range of integers representing the minimum and maximum scores. After the NormalizeScore phase, the scheduler will combine node scores from all plugins according to the configured plugin weights.
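
The weighted combination can be sketched as a plain sum of score times weight per node. This is an assumption about the arithmetic made for illustration; combine, the plugin names, and the numbers are all hypothetical.

```go
package main

import "fmt"

// combine sums, for each node, every plugin's score multiplied by that
// plugin's weight. scores maps plugin name -> node name -> score.
func combine(scores map[string]map[string]int64, weights map[string]int64) map[string]int64 {
	total := make(map[string]int64)
	for plugin, perNode := range scores {
		for node, s := range perNode {
			total[node] += s * weights[plugin]
		}
	}
	return total
}

func main() {
	scores := map[string]map[string]int64{
		"PluginA": {"node-1": 100, "node-2": 50},
		"PluginB": {"node-1": 10, "node-2": 80},
	}
	weights := map[string]int64{"PluginA": 1, "PluginB": 2}
	total := combine(scores, weights)
	fmt.Println(total["node-1"], total["node-2"]) // 120 210
}
```

Here node-2 wins even though PluginA prefers node-1, because PluginB carries twice the weight.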

NormalizeScore

These plugins are used to modify scores before the scheduler computes a final ranking of Nodes. A plugin that registers for this extension point will be called with the Score results from the same plugin. This is called once per plugin per scheduling cycle.

For example, suppose a plugin BlinkingLightScorer ranks Nodes based on how many blinking lights they have.

// getBlinkingLightCount is assumed to return (int, error).
func ScoreNode(_ *v1.Pod, n *v1.Node) (int, error) {
	return getBlinkingLightCount(n)
}

However, the maximum count of blinking lights may be small compared to NodeScoreMax. To fix this, BlinkingLightScorer should also register for this extension point.

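func NormalizeScores(scores map[string]int) {
	highest := 0
	for _, score := range scores {
		highest = max(highest, score)
	}
	if highest == 0 {
		// Every node scored zero; there is nothing to scale, and dividing
		// by zero would panic.
		return
	}
	// Scale so that the best node receives NodeScoreMax.
	for node, score := range scores {
		scores[node] = score * NodeScoreMax / highest
	}
}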

If any NormalizeScore plugin returns an error, the scheduling cycle is aborted.

Note: Plugins wishing to perform “pre-reserve” work should use the NormalizeScore extension point.

Reserve

A plugin that implements the Reserve extension has two methods, namely Reserve and Unreserve, that back two informational scheduling phases called Reserve and Unreserve, respectively. Plugins which maintain runtime state (aka “stateful plugins”) should use these phases to be notified by the scheduler when resources on a node are being reserved and unreserved for a given Pod.

The Reserve phase happens before the scheduler actually binds a Pod to its designated node. It exists to prevent race conditions while the scheduler waits for the bind to succeed. The Reserve method of each Reserve plugin may succeed or fail; if one Reserve method call fails, subsequent plugins are not executed and the Reserve phase is considered to have failed. If the Reserve method of all plugins succeeds, the Reserve phase is considered to be successful and the rest of the scheduling cycle and the binding cycle are executed.

The Unreserve phase is triggered if the Reserve phase or a later phase fails. When this happens, the Unreserve method of all Reserve plugins will be executed in the reverse order of Reserve method calls. This phase exists to clean up the state associated with the reserved Pod.
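
The ordering rules can be sketched as below. This is one reading of the description, not the framework's code: reservePlugin and runReserve are hypothetical, and the sketch assumes Unreserve is called on every configured plugin, including those whose Reserve never ran.

```go
package main

import (
	"errors"
	"fmt"
)

// reservePlugin is a hypothetical, simplified form of a Reserve plugin.
type reservePlugin struct {
	name string
	fail bool
}

// Reserve records the call and fails if the plugin is configured to fail.
func (p reservePlugin) Reserve(log *[]string) error {
	*log = append(*log, "reserve "+p.name)
	if p.fail {
		return errors.New(p.name + ": reserve failed")
	}
	return nil
}

// Unreserve records the call; it never fails.
func (p reservePlugin) Unreserve(log *[]string) {
	*log = append(*log, "unreserve "+p.name)
}

// runReserve calls Reserve in configured order. On the first failure it
// stops and calls Unreserve on every plugin in reverse order.
func runReserve(plugins []reservePlugin) (log []string, err error) {
	for _, p := range plugins {
		if err = p.Reserve(&log); err != nil {
			for i := len(plugins) - 1; i >= 0; i-- {
				plugins[i].Unreserve(&log)
			}
			return log, err
		}
	}
	return log, nil
}

func main() {
	log, err := runReserve([]reservePlugin{{"a", false}, {"b", true}, {"c", false}})
	fmt.Println(log, err)
}
```

In this run plugin c is unreserved although its Reserve method was never called, which shows why an Unreserve implementation has to tolerate being called without a matching reservation.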

Caution: The implementation of the Unreserve method in Reserve plugins must be idempotent and may not fail.

Permit

Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay the binding to the candidate node. A permit plugin can do one of the three things:

  1. approve
    Once all Permit plugins approve a Pod, it is sent for binding.
  2. deny
    If any Permit plugin denies a Pod, it is returned to the scheduling queue. This will trigger the Unreserve phase in Reserve plugins.
  3. wait (with a timeout)
    If a Permit plugin returns “wait”, then the Pod is kept in an internal “waiting” Pods list, and the binding cycle of this Pod starts but directly blocks until it gets approved. If a timeout occurs, wait becomes deny and the Pod is returned to the scheduling queue, triggering the Unreserve phase in Reserve plugins.

Note: While any plugin can access the list of “waiting” Pods and approve them (see FrameworkHandle), we expect only the permit plugins to approve binding of reserved Pods that are in “waiting” state. Once a Pod is approved, it is sent to the PreBind phase.

PreBind

These plugins are used to perform any work required before a Pod is bound. For example, a pre-bind plugin may provision a network volume and mount it on the target node before allowing the Pod to run there.

If any PreBind plugin returns an error, the Pod is rejected and returned to the scheduling queue.

Bind

These plugins are used to bind a Pod to a Node. Bind plugins will not be called until all PreBind plugins have completed. Each bind plugin is called in the configured order. A bind plugin may choose whether or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the remaining bind plugins are skipped.
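
The "first plugin that handles the Pod wins" rule can be sketched as a simple loop. bindPlugin and runBind are hypothetical names, and the two example plugins are made up for the illustration.

```go
package main

import "fmt"

// bindPlugin is a hypothetical, simplified bind plugin: handles reports
// whether the plugin chooses to bind the given Pod.
type bindPlugin struct {
	name    string
	handles func(pod string) bool
}

// runBind calls bind plugins in their configured order and returns the name
// of the first one that handles the Pod; the remaining plugins are skipped.
func runBind(pod string, plugins []bindPlugin) (string, bool) {
	for _, p := range plugins {
		if p.handles(pod) {
			return p.name, true
		}
	}
	return "", false
}

func main() {
	plugins := []bindPlugin{
		{"GPUBinder", func(pod string) bool { return pod == "gpu-job" }},
		{"FallbackBinder", func(string) bool { return true }},
	}
	for _, pod := range []string{"gpu-job", "web"} {
		name, ok := runBind(pod, plugins)
		fmt.Println(pod, name, ok)
	}
}
```

Placing a catch-all plugin last gives every Pod a binder while still letting a more specific plugin take precedence.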

PostBind

This is an informational extension point. Post-bind plugins are called after a Pod is successfully bound. This is the end of a binding cycle, and can be used to clean up associated resources.

Plugin API

There are two steps to the plugin API. First, plugins must register and get configured, then they use the extension point interfaces. Extension point interfaces have the following form.

// Plugin is the parent type for all scheduling framework plugins.
type Plugin interface {
	Name() string
}

// v1 refers to k8s.io/api/core/v1.
type QueueSortPlugin interface {
	Plugin
	Less(*v1.Pod, *v1.Pod) bool
}

type PreFilterPlugin interface {
	Plugin
	PreFilter(context.Context, *framework.CycleState, *v1.Pod) error
}

// ...

Plugin configuration

You can enable or disable plugins in the scheduler configuration. If you are using Kubernetes v1.18 or later, most scheduling plugins are in use and enabled by default.

In addition to default plugins, you can also implement your own scheduling plugins and get them configured along with default plugins. You can visit scheduler-plugins for more details.

If you are using Kubernetes v1.18 or later, you can configure a set of plugins as a scheduler profile and then define multiple profiles to fit various kinds of workload. Learn more at multiple profiles.


Kubernetes Scheduler Plugins
https://fulequn.github.io/2020/11/Article202011213/
Author: Fulequn
Published: November 21, 2020