The Dask Operator for Kubernetes is experimental, so bug reports are appreciated!
What is the operator?
The Dask Operator is a small service that runs on your Kubernetes cluster and allows you to create and manage your Dask clusters as Kubernetes resources. Clusters can be created either via the Kubernetes API with kubectl or via the Python API with the experimental KubeCluster.
To install the operator you need to apply two things: some custom resource definitions that allow us to describe Dask resources, and the operator itself, which is a small Python application that watches the Kubernetes API for events related to our custom resources and creates other resources such as Pods and Services to run the cluster.
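As a rough illustration of what "describing a Dask cluster as a Kubernetes resource" looks like, the sketch below builds a DaskCluster manifest as a plain Python dict and prints it as JSON, which kubectl accepts alongside YAML. The API group `kubernetes.dask.org/v1`, the field layout, the image, the ports and the `$(DASK_WORKER_NAME)` variable are assumptions about the installed custom resource definitions, and the helper name `dask_cluster_manifest` is hypothetical, so check everything against the version you have installed.

```python
import json


def dask_cluster_manifest(name, image="ghcr.io/dask/dask:latest", workers=2):
    """Build a DaskCluster custom resource as a plain dict (assumed schema)."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",  # assumed API group/version
        "kind": "DaskCluster",
        "metadata": {"name": name},
        "spec": {
            "scheduler": {
                # Pod spec for the scheduler process.
                "spec": {
                    "containers": [{
                        "name": "scheduler",
                        "image": image,
                        "args": ["dask-scheduler"],
                        "ports": [
                            {"name": "tcp-comm", "containerPort": 8786, "protocol": "TCP"},
                            {"name": "http-dashboard", "containerPort": 8787, "protocol": "TCP"},
                        ],
                    }]
                },
                # Service that workers and clients use to reach the scheduler.
                "service": {
                    "type": "ClusterIP",
                    "selector": {
                        "dask.org/cluster-name": name,
                        "dask.org/component": "scheduler",
                    },
                    "ports": [
                        {"name": "tcp-comm", "protocol": "TCP", "port": 8786, "targetPort": "tcp-comm"},
                        {"name": "http-dashboard", "protocol": "TCP", "port": 8787, "targetPort": "http-dashboard"},
                    ],
                },
            },
            "worker": {
                "replicas": workers,
                # Pod spec shared by every worker in the default group.
                "spec": {
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        # DASK_WORKER_NAME is assumed to be set by the operator.
                        "args": ["dask-worker", "--name", "$(DASK_WORKER_NAME)"],
                    }]
                },
            },
        },
    }


manifest = dask_cluster_manifest("example")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and passing it to `kubectl apply -f` would be one way to submit it once the operator is installed.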
What resources does the operator manage?
The operator manages a hierarchy of resources: some custom resources that represent Dask primitives like clusters and worker groups, and native Kubernetes resources such as pods and services that run the cluster processes and facilitate communication.
A DaskWorkerGroup represents a homogeneous group of workers that can be scaled. The resource is similar to a native Kubernetes Deployment in that it manages a group of workers with some intelligence around the Pod lifecycle. A worker group must be attached to a DaskCluster resource in order to function.
The DaskCluster custom resource creates a Dask cluster by creating a scheduler Service and a default DaskWorkerGroup, which in turn creates worker Pods. Workers connect to the scheduler via the scheduler Service, and that Service can also be exposed to the user in order to connect clients and perform work.
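The hierarchy described above can be pictured with a small toy model. This is not the operator's code: the function `expand_cluster` and the naming scheme for child resources are hypothetical and only illustrate which resource owns which.

```python
def expand_cluster(cluster_name, worker_replicas):
    """Toy model of the resource hierarchy; names are illustrative only."""
    group_name = f"{cluster_name}-default"
    # Each entry is (kind, name, owner); the cluster itself has no owner.
    resources = [
        ("DaskCluster", cluster_name, None),
        ("Service", f"{cluster_name}-scheduler", cluster_name),
        ("DaskWorkerGroup", group_name, cluster_name),
    ]
    # The default worker group, not the cluster, owns the worker Pods.
    for i in range(worker_replicas):
        resources.append(("Pod", f"{group_name}-worker-{i}", group_name))
    return resources


for kind, name, owner in expand_cluster("example", 2):
    print(f"{kind:16} {name:28} owner={owner}")
```

Scaling the worker group in this model only changes the number of Pod entries; the cluster and the scheduler Service stay as they are.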
The operator also has support for creating additional worker groups. These are extra groups of workers that have different configuration settings and can be scaled separately. You can then use resource annotations to schedule different tasks to different groups.
For example, you may wish to have a smaller pool of workers that have more memory for memory-intensive tasks, or GPUs for compute-intensive tasks.
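A minimal sketch of such an extra group, assuming the same `kubernetes.dask.org/v1` schema as the DaskCluster resource: it builds a DaskWorkerGroup manifest for a high-memory pool attached to an existing cluster. The helper `worker_group_manifest`, the group suffix, the `32Gi` memory figure and the `HIGHMEM` resource label are all hypothetical choices made for illustration.

```python
import json


def worker_group_manifest(cluster, suffix, replicas, memory, resource_label):
    """Build a DaskWorkerGroup attached to an existing cluster (assumed schema)."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",  # assumed API group/version
        "kind": "DaskWorkerGroup",
        "metadata": {"name": f"{cluster}-{suffix}"},
        "spec": {
            # Name of the DaskCluster resource this group belongs to.
            "cluster": cluster,
            "worker": {
                "replicas": replicas,
                "spec": {
                    "containers": [{
                        "name": "worker",
                        "image": "ghcr.io/dask/dask:latest",  # assumed image
                        "args": [
                            "dask-worker",
                            "--name", "$(DASK_WORKER_NAME)",
                            # Advertise an abstract resource so annotated
                            # tasks can be restricted to this group.
                            "--resources", f"{resource_label}=1",
                        ],
                        # Kubernetes-level memory request and limit for each Pod.
                        "resources": {
                            "requests": {"memory": memory},
                            "limits": {"memory": memory},
                        },
                    }]
                },
            },
        },
    }


highmem = worker_group_manifest(
    "example", "highmem", replicas=2, memory="32Gi", resource_label="HIGHMEM"
)
print(json.dumps(highmem, indent=2))
```

On the client side, wrapping a computation in `dask.annotate(resources={"HIGHMEM": 1})` should then restrict those tasks to workers that advertise the matching resource.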
A DaskJob is a batch-style resource that creates a Pod to perform some specific task from start to finish alongside a DaskCluster that can be leveraged to perform the work.
Once the job Pod runs to completion, the cluster is removed automatically to save resources. This is great for workflows like training a distributed machine learning model with Dask.
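As a sketch of how the two halves of a DaskJob fit together, the manifest below pairs a job Pod spec with an embedded cluster spec. The schema (`spec.job` and `spec.cluster`), the image, and the assumption that the job Pod can reach the scheduler with a bare `Client()` are not confirmed by the text above, and the helper `dask_job_manifest` is hypothetical.

```python
import json

IMAGE = "ghcr.io/dask/dask:latest"  # assumed image


def dask_job_manifest(name, command, workers=2):
    """Build a DaskJob resource: one job Pod plus a temporary cluster (assumed schema)."""
    return {
        "apiVersion": "kubernetes.dask.org/v1",  # assumed API group/version
        "kind": "DaskJob",
        "metadata": {"name": name},
        "spec": {
            # The Pod that runs the task from start to finish.
            "job": {
                "spec": {
                    "containers": [{"name": "job", "image": IMAGE, "args": command}]
                }
            },
            # The cluster that exists only for the lifetime of the job.
            "cluster": {
                "spec": {
                    "scheduler": {
                        "spec": {
                            "containers": [{
                                "name": "scheduler",
                                "image": IMAGE,
                                "args": ["dask-scheduler"],
                                "ports": [
                                    {"name": "tcp-comm", "containerPort": 8786, "protocol": "TCP"}
                                ],
                            }]
                        },
                        "service": {
                            "type": "ClusterIP",
                            "selector": {
                                "dask.org/cluster-name": name,
                                "dask.org/component": "scheduler",
                            },
                            "ports": [
                                {"name": "tcp-comm", "protocol": "TCP", "port": 8786, "targetPort": "tcp-comm"}
                            ],
                        },
                    },
                    "worker": {
                        "replicas": workers,
                        "spec": {
                            "containers": [{
                                "name": "worker",
                                "image": IMAGE,
                                "args": ["dask-worker", "--name", "$(DASK_WORKER_NAME)"],
                            }]
                        },
                    },
                }
            },
        },
    }


# Placeholder job body; a real job would do its work through the client.
script = "from dask.distributed import Client; client = Client(); print(client)"
job = dask_job_manifest("example-job", ["python", "-c", script])
print(json.dumps(job, indent=2))
```

Keeping the cluster definition inside the job resource is what lets the operator tear the cluster down as soon as the job Pod finishes.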