Deploying Apache Airflow on a Kubernetes Cluster


These days, data-driven companies typically have a huge number of workflows and tasks running in their environments: automated processes that support the daily operations of most of their departments, ranging from simple file transfers to complex ETL workloads or infrastructure provisioning. To gain control and visibility over what is going on in the environments where these processes run, a coordinating mechanism is needed, usually called a scheduler.

 

One such scheduler is Apache Airflow, an open-source platform that lets companies programmatically author, schedule and monitor their workflows using Python, and that integrates with the best-known cloud and on-premise systems for data storage and data processing. On top of this, it offers an integrated web UI where users can create, manage and observe workflows and their completion status, ensuring observability and reliability.

 

In today’s technological landscape, where resources are precious and often spread thinly across different elements of an enterprise architecture, Airflow also offers scalability and dynamic pipeline generation: it can run on top of Kubernetes clusters and automatically spin up its workers inside Kubernetes containers. This enables users to create Airflow workers and executors dynamically, whenever and wherever more power is needed, optimising the utilisation of available resources (and the associated costs!).

 

In this blog series, we will dive deep into Airflow: first, we will show you how to create the essential Kubernetes resources to be able to deploy Apache Airflow on two nodes of the Kubernetes cluster (the installation of the K8s cluster is not in the scope of this article, but if you need help with that, you can check out this blog post!); then, in the second part of the series, we will develop an Airflow DAG file (workflow) and deploy it on the previously installed Airflow service on top of Kubernetes.

 

Overview

 

As mentioned above, the objective of this article is to demonstrate how to deploy Airflow on a K8s cluster. To do so, we will need to create and initialise a set of auxiliary resources using YAML configuration files.

 

The first items we need to create are the core Kubernetes resources: these define how and where Airflow should store its data (PersistentVolume, PersistentVolumeClaim and StorageClass resources), and how to assign an identity – and the required permissions – to the deployed Airflow service (ServiceAccount, Role, and RoleBinding resources).

 

Then we will need a Postgres database: Airflow uses an external database to store metadata about running workflows and their tasks, so we will also show how to deploy Postgres on top of the same Kubernetes cluster where we want to run Airflow. First, we will need to create a Configmap, to configure the database username and password; then, a Persistent Volume and a Claim, to store the data on a physical disk; finally, a Postgres service resource, to expose the database connection to the Airflow Scheduler.

 

Finally, we will be able to deploy the actual Airflow components: the Airflow scheduler and the Airflow webserver. These are the two main components of the Airflow platform. The scheduler runs the task instances in the order defined by what is called a DAG (Directed Acyclic Graph), and the webserver provides the web UI to monitor all the scheduled workflows.

 

In the picture below, all the main blocks of the deployment are shown:

 

[Figure: main building blocks of the deployment]

 

 

Creating Template Files

 

So, as we said, the first step in setting up our Airflow environment on Kubernetes is to prepare the YAML files for the deployment of the required resources.

 

Usually, when creating such files, we can download templates from the Helm charts repository using the Helm package manager. Once downloaded, the files can be adapted to our specific needs, in order to reflect the desired application deployment.

 

To do so for our Airflow instance, we can use the following commands to define the Airflow repository, download the Helm chart locally, and finally create the template YAML files.

 

helm repo add apache-airflow https://airflow.apache.org 
helm repo update apache-airflow 
mkdir airflow-yamls 
cd airflow-yamls 
helm template --output-dir ./ /root/.cache/helm/repository/airflow-1.5.0.tgz 

 

This generates the template files in the current path, organised in a set of files and directories.

 

Note that not all of them are used in this blog post. For example, we can ignore those related to the Redis, Celery and Flower services, which we won’t be using. When all the unnecessary files have been removed from the directory, we end up with only those needed to deploy Airflow, as shown in the picture below. Of course, these are just templates, and they still need to be adapted into valid, complete YAML files.

 

[Figure: remaining template files after removing the unused ones]

 

In the following sections, we will explain how each of them was used to define and create all the required resources for our Airflow environment. The YAML files from the picture above will be referenced below wherever they are used.

 

Preparing the Postgres Deployment

 

As we mentioned above, Airflow needs a metadata database to store all the information about running tasks and workflows. For production-ready deployments we recommend an external database rather than the default SQLite one, so we deployed a PostgreSQL database for demonstration purposes.

 

To do so, the first thing we need is a ConfigMap resource. ConfigMaps are Kubernetes resources which can be used to automatically configure applications running in a container when the Kubernetes Pod starts up. For our deployment, we used a ConfigMap to set the Postgres database name, username, and password.

 

The file we used to create this resource is called postgres-configmap.yaml and it is shown below:

 

apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: postgres-config 
  labels: 
    app: postgres 
data: 
  POSTGRES_DB: postgresdb 
  POSTGRES_USER: admin 
  POSTGRES_PASSWORD: test123. 

 

After the ConfigMap has been defined, it can be created with the command below:

 

kubectl apply -f postgres-configmap.yaml -n airflow 

 

Note that in our example, for simplicity, the credentials are stored in plaintext. This is obviously not the best approach from a security perspective: for a production deployment, sensitive values should be kept in Kubernetes Secrets, ideally backed by an external vault such as HashiCorp Vault or a managed cloud secrets service.
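As an illustration, here is a minimal sketch of how the same values could be supplied as a Secret instead; the file and resource names (postgres-secret.yaml, postgres-secret) are our own choices, not part of the Helm templates:

apiVersion: v1 
kind: Secret 
metadata: 
  name: postgres-secret 
  labels: 
    app: postgres 
type: Opaque 
stringData: 
  POSTGRES_USER: admin 
  POSTGRES_PASSWORD: test123. 

In postgres-deployment.yaml, the envFrom section would then reference this Secret with a secretRef entry instead of (or alongside) the configMapRef, and the Secret itself would ideally be populated from an external vault rather than committed to version control.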

 

Next, we need to create PersistentVolume and PersistentVolumeClaims for the Postgres service.

 

PersistentVolumes are Kubernetes resources which define where and how applications running inside a Kubernetes cluster store their data (the storage type and location). A PersistentVolume can also reference a StorageClass, a resource which describes the class of storage to be used (for example local disks or cloud-provided volumes). In this way, cluster administrators can define where the data will be stored and, when a cloud provider is used, have the storage provisioned automatically.

 

PersistentVolumeClaims, on the other hand, are Kubernetes resources used to logically bind an application to the physical location of the data storage; they can be thought of as an interface between the application and the underlying storage.

 

The YAML file we used to create the PersistentVolume and the PersistentVolumeClaim used by Postgres is called postgres-storage.yaml and is shown below:

 

kind: PersistentVolume 
apiVersion: v1 
metadata: 
  name: postgres-pv-volume 
  labels: 
    type: local 
    app: postgres 
spec: 
  storageClassName: manual 
  capacity: 
    storage: 5Gi 
  accessModes: 
    - ReadWriteMany 
  hostPath: 
    path: "/opt/data" 
--- 
kind: PersistentVolumeClaim 
apiVersion: v1 
metadata: 
  name: postgres-pv-claim 
  labels: 
    app: postgres 
spec: 
  storageClassName: manual 
  accessModes: 
    - ReadWriteMany 
  resources: 
    requests: 
      storage: 5Gi 

 

After the YAML file has been defined, the resources can be created with the command below:

 

kubectl apply -f postgres-storage.yaml -n airflow 

 

Once done, check that they were created successfully:

 

kubectl get pv -n airflow 
kubectl get pvc -n airflow 

 

The output of the commands above should look like the picture below:

 

[Figure: output of kubectl get pv and kubectl get pvc]

 

We can see that the PersistentVolume is called postgres-pv-volume and is bound to the PersistentVolumeClaim postgres-pv-claim, while the Reclaim Policy is set to Retain, which means that when the PersistentVolumeClaim is deleted, the volume and its data are not deleted but retained, so they can be manually reclaimed and reused later.
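The reclaim policy can also be inspected, or changed if needed, directly with kubectl (PersistentVolumes are cluster-scoped, so no namespace flag is required), for example:

kubectl get pv postgres-pv-volume -o jsonpath='{.spec.persistentVolumeReclaimPolicy}' 
kubectl patch pv postgres-pv-volume -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}' 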

 

At this point we can create Deployment and Kubernetes Service resources for our Postgres deployment.

 

A Deployment is a Kubernetes resource used to manage a set of Pods in which an application runs, so that administrators can scale the application, manage ReplicaSets, roll back to previous versions, and so on.

 

The YAML file to deploy the Postgres service is called postgres-deployment.yaml and is shown below. Note that this Postgres deployment uses a pre-built Docker image.

 

apiVersion: apps/v1 
kind: Deployment 
metadata: 
  name: postgres 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      app: postgres 
  template: 
    metadata: 
      labels: 
        app: postgres 
    spec: 
      containers: 
        - name: postgres 
          image: postgres:10.1 
          imagePullPolicy: "IfNotPresent" 
          ports: 
            - containerPort: 5432 
          envFrom: 
            - configMapRef: 
                name: postgres-config 
          volumeMounts: 
            - mountPath: /var/lib/postgresql/data 
              name: postgredb 
      volumes: 
        - name: postgredb 
          persistentVolumeClaim: 
            claimName: postgres-pv-claim 

 

To create the Postgres deployment, we can use the following command:

 

kubectl apply -f postgres-deployment.yaml -n airflow 

 

Once again, check that the deployment has been created successfully:

 

kubectl get pods -n airflow 

 

The output should look like this:

 

[Figure: output of kubectl get pods showing the Postgres Pod]

 

Note how the Postgres Pod postgres-5cb8b67d8f-xzg26 created by the deployment is running, with its single container ready, as shown by the value 1/1 in the READY column.

 

Now that the Postgres deployment is finished, we need to create its service. A service is a Kubernetes resource used to expose an application to other applications through a network port.

 

The YAML file to create a Kubernetes service for the Postgres deployment is postgres-service.yaml, shown below:

 

--- 
apiVersion: v1 
kind: Service 
metadata: 
  name: postgres 
  labels: 
    app: postgres 
spec: 
  type: NodePort 
  ports: 
    - port: 5432 
  selector: 
    app: postgres 

 

As defined above, the Postgres service will listen on port 5432 and it will be accessible from inside the cluster via the same port. This port will be used when configuring the Airflow scheduler. We create the Postgres service with the same command:

 

kubectl apply -f postgres-service.yaml -n airflow 
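Optionally, the service can be tested from inside the cluster before moving on. A quick check (our own addition, using a temporary client Pod based on the same postgres:10.1 image and the credentials from the ConfigMap) looks like this:

kubectl run psql-client --rm -it --restart=Never --image=postgres:10.1 -n airflow -- psql -h postgres -U admin -d postgresdb 

If the psql prompt appears after entering the password, the service name postgres resolves correctly and the database is reachable on port 5432.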

 

Now that we have a Postgres service up and running, we need a database containing all the schemas and tables used by the Airflow scheduler. This can be done by installing the Airflow Python package with its Postgres integration (apache-airflow[postgres]) inside the Postgres container and running the Airflow database initialisation. First, Python must be installed inside the Postgres container. To do so, open the container bash CLI with the following command:

 

kubectl exec --stdin --tty {pod_name} -n airflow -- /bin/bash 

 

Once inside the container bash CLI, install Python and the apache-airflow[postgres] package with the following commands:

 

apt-get update 
apt-get install -y python3.8 python3-pip 
AIRFLOW_VERSION=2.2.4 
PYTHON_VERSION="$(python3 --version | cut -d " " -f 2 | cut -d "." -f 1-2)" 
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt" 
pip3 install "apache-airflow[postgres]==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}" 

 

Now the Airflow metadata database inside Postgres can be initialised. Note that the Airflow CLI needs to be told where the database lives, so we point it at the same connection string that the scheduler will use later, and then run the initialisation:

 

export AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql://admin:test123.@postgres:5432/postgresdb" 
airflow db init 
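To confirm that the initialisation really created the Airflow metadata tables in Postgres (rather than in a local SQLite file), they can be listed from the same container session with psql, using the database and user defined in the ConfigMap:

psql -U admin -d postgresdb -c "\dt" 

The output should include Airflow tables such as dag, dag_run and task_instance.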

 

Preparing Resources for Airflow Deployment

 

As mentioned earlier, in order to deploy Airflow, there are a few Kubernetes resources that need to be created first: these include PersistentVolumes, PersistentVolumeClaims, ConfigMap, Role, RoleBinding, and ServiceAccount, similar to the Postgres service we have seen above.

 

PersistentVolumes and PersistentVolumeClaims are defined in a similar way to the ones for the Postgres service, with one difference: a dedicated StorageClass is created specifically for the Airflow scheduler instead of the manual one used above. In this case we use a StorageClass called fast, which places the volumes on local SSD disks. The created volumes are shown in the picture below and are named dags-pv (for the DAG files) and logs-pv (for the Airflow scheduler logs). To create these resources, we used the files listed below and the same kubectl apply command as for the Postgres service.

 

  • pv-dags.yaml
  • pvc-dags.yaml
  • pv-logs.yaml
  • pvc-logs.yaml
  • airflow-storageclass.yaml

 

[Figure: created PersistentVolumes dags-pv and logs-pv]
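The exact contents of these files depend on the cluster and its disks, but as an indication, a minimal sketch of airflow-storageclass.yaml together with pv-dags.yaml and pvc-dags.yaml (combined here for brevity) could look like the following; the provisioner, host path and sizes are assumptions, and the logs files follow the same pattern with the dags names replaced by logs:

apiVersion: storage.k8s.io/v1 
kind: StorageClass 
metadata: 
  name: fast 
provisioner: kubernetes.io/no-provisioner 
--- 
kind: PersistentVolume 
apiVersion: v1 
metadata: 
  name: dags-pv 
  labels: 
    app: airflow 
spec: 
  storageClassName: fast 
  capacity: 
    storage: 1Gi 
  accessModes: 
    - ReadWriteMany 
  hostPath: 
    path: "/opt/airflow-kubernetes/dags" 
--- 
kind: PersistentVolumeClaim 
apiVersion: v1 
metadata: 
  name: dags-pvc 
  labels: 
    app: airflow 
spec: 
  storageClassName: fast 
  accessModes: 
    - ReadWriteMany 
  resources: 
    requests: 
      storage: 1Gi 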

 

The next resource to create is a ConfigMap containing two values which will be injected into the scheduler container as environment variables when the Pod starts up. This ConfigMap is defined in the YAML file airflow-configmap.yaml and is shown below:

 

apiVersion: v1 
kind: ConfigMap 
metadata: 
  name: airflow-config 
data: 
  sql_alchemy_conn: "postgresql://admin:test123.@postgres:5432/postgresdb" 
  executor: "KubernetesExecutor" 

 

The first value defines the SQLAlchemy connection string, so that the scheduler can connect to the Postgres database; the second defines the scheduler executor. The KubernetesExecutor lets the scheduler request a new worker Pod from the Kubernetes API server for every task that needs to run: the Pod executes the task, reports the result back to the scheduler, and terminates. This is the true power of running Airflow on a Kubernetes cluster! To create the ConfigMap, use the same command as before:

 

kubectl apply -f airflow-configmap.yaml -n airflow 

 

More details about the created ConfigMap can also be obtained by using the describe command, as shown below:

 

kubectl describe configmap airflow-config -n airflow 

 

[Figure: output of kubectl describe configmap airflow-config]

 

In order to give the Airflow scheduler an identity, we also need to create a ServiceAccount and a Role. Applications and services running in the Kubernetes cluster need an identity to send requests to the Kubernetes API server; to control and authorise that access, Roles and ServiceAccounts are created and connected to the applications using RoleBinding resources. This mechanism is called RBAC – Role-Based Access Control – and it can also be used to restrict access to Secrets and ConfigMaps in the Kubernetes cluster.

 

First, we create a ServiceAccount for the Airflow scheduler, defined in scheduler-serviceaccount.yaml as shown below:

 

kind: ServiceAccount 
apiVersion: v1 
metadata: 
  name: airflow-scheduler 
  labels: 
    tier: airflow 
    component: scheduler 
    release: release-name 
    chart: "airflow-1.5.0" 
    heritage: Helm  

 

Then we create a Role. A Role is a Kubernetes resource which defines the permissions that an application or service has against the Kubernetes API server. The Role for our Airflow scheduler is defined in the file called pod-launcher-role.yaml and is shown below:

 

kind: Role 
apiVersion: rbac.authorization.k8s.io/v1 
metadata: 
  name: pod-launcher-role 
  namespace: "airflow" 
  labels: 
    tier: airflow 
    release: release-name 
    chart: "airflow-1.5.0" 
    heritage: Helm 
rules: 
  - apiGroups: [""] 
    resources: ["services", "endpoints", "pods"] 
    verbs: ["get", "list", "create", "delete", "watch", "patch"] 
  - apiGroups: [""] 
    resources: ["pods/logs"] 
    verbs: ["get", "list", "create", "delete", "watch", "patch"] 
  - apiGroups: [""] 
    resources: ["pods/exec"] 
    verbs: ["get", "list", "create", "delete", "watch", "patch"] 
  - apiGroups: [""] 
    resources: ["events"] 
    verbs: ["list"] 

 

After the Role has been defined, it can be bound to the previously created ServiceAccount with the RoleBinding resource defined in pod-launcher-rolebinding.yaml, shown below:

 

kind: RoleBinding 
apiVersion: rbac.authorization.k8s.io/v1 
metadata: 
  namespace: "airflow" 
  name: pod-launcher-rolebinding 
  labels: 
    tier: airflow 
    release: release-name 
    chart: "airflow-1.5.0" 
    heritage: Helm 
roleRef: 
  apiGroup: rbac.authorization.k8s.io 
  kind: Role 
  name: pod-launcher-role 
subjects: 
  - kind: ServiceAccount 
    name: airflow-scheduler 
    namespace: "airflow" 

 

All resources can now be created, following the exact order of the commands below:

 

kubectl apply -f scheduler-serviceaccount.yaml -n airflow 
kubectl apply -f pod-launcher-role.yaml -n airflow 
kubectl apply -f pod-launcher-rolebinding.yaml -n airflow 
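Before moving on, it is worth verifying that the RoleBinding actually grants the scheduler's ServiceAccount the permissions it needs; kubectl can impersonate the ServiceAccount and ask the API server directly (this check is our own addition):

kubectl auth can-i create pods -n airflow --as=system:serviceaccount:airflow:airflow-scheduler 

The command should simply answer yes; if it answers no, recheck the Role and RoleBinding definitions.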

 

Deploying Airflow on a Kubernetes Cluster

 

All the resources are now ready to create an Airflow deployment on top of the Kubernetes cluster. The deployment YAML file for the Airflow scheduler and Airflow webserver is called airflow-deployment.yaml:

 

--- 
kind: Deployment 
apiVersion: apps/v1 
metadata: 
  name: airflow 
  namespace: "airflow" 
spec: 
  replicas: 1 
  selector: 
    matchLabels: 
      deploy: airflow 
      name: airflow 
      component: webserver 
  template: 
    metadata: 
      labels: 
        deploy: airflow 
        name: airflow 
        component: webserver 
    spec: 
      serviceAccountName: airflow-scheduler 
      containers: 
        - name: airflow-scheduler 
          image: 'apache/airflow:2.2.4' 
          imagePullPolicy: IfNotPresent 
          env: 
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN 
              valueFrom: 
                configMapKeyRef: 
                  name: airflow-config 
                  key: sql_alchemy_conn 
            - name: AIRFLOW__CORE__EXECUTOR 
              valueFrom: 
                configMapKeyRef: 
                  name: airflow-config 
                  key: executor 
          volumeMounts: 
            - name: logs-pv 
              mountPath: /opt/airflow-kubernetes/logs 
            - name: dags-pv 
              mountPath: /opt/airflow-kubernetes/dags 
          command: 
            - airflow 
          args: 
            - scheduler 
        - name: airflow-webserver 
          env: 
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN 
              valueFrom: 
                configMapKeyRef: 
                  name: airflow-config 
                  key: sql_alchemy_conn 
            - name: AIRFLOW__CORE__EXECUTOR 
              valueFrom: 
                configMapKeyRef: 
                  name: airflow-config 
                  key: executor 
          image: 'apache/airflow:2.2.4' 
          imagePullPolicy: IfNotPresent 
          ports: 
            - containerPort: 8080 
          command: 
            - airflow 
          args: 
            - webserver 
      restartPolicy: Always 
      volumes: 
        - name: dags-pv 
          persistentVolumeClaim: 
            claimName: dags-pvc 
        - name: logs-pv 
          persistentVolumeClaim: 
            claimName: logs-pvc 

 

This deployment file references all the resources created earlier: the PersistentVolumeClaims, the ServiceAccount, and the ConfigMap.

 

To deploy the Airflow scheduler and webserver, we run the following command:

 

kubectl apply -f airflow-deployment.yaml -n airflow 

 

When the Pod starts up, the containers mount and consume all of these resources. Remember that we have two containers here: one for the Airflow scheduler and one for the Airflow webserver. To check this, we can use the following command:

 

kubectl get pods -n airflow 

 

The output of the above command should look like this:

 

[Figure: output of kubectl get pods showing the Airflow Pod with 2/2 containers ready]

 

In the READY column, we can see that two containers are running, shown by the value 2/2.
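At this point only the scheduler and webserver containers exist. Later, when DAG runs are triggered, the KubernetesExecutor will launch additional short-lived worker Pods alongside this one, one per task instance; they can be watched appearing and terminating with a command like:

kubectl get pods -n airflow --watch 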

 

The Airflow webserver container should be accessible by the user, so as mentioned before, we need to create a Kubernetes service for it. The YAML file to create the webserver service with the defined network port is called airflow-service.yaml and is shown below:

 

--- 
kind: Service 
apiVersion: v1 
metadata: 
  name: webserver-svc 
  namespace: airflow 
spec: 
  type: NodePort 
  selector: 
    deploy: airflow 
    name: airflow 
    component: webserver 
  ports: 
    - name: airflow-ui 
      protocol: TCP 
      port: 8080 
      targetPort: 8080

 

The Airflow webserver is bound to network port 8080, but this is just the internal port, accessible from inside the cluster; since the service is of type NodePort, Kubernetes also exposes it on a port of every cluster node. After creating the webserver service with the first command below, the service and its details can be checked with the describe command:

 

kubectl apply -f airflow-service.yaml -n airflow 
kubectl describe service webserver-svc -n airflow 

 

[Figure: output of kubectl describe service webserver-svc]
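The node port shown in the describe output is assigned by Kubernetes from the NodePort range (30000-32767 by default) unless it is set explicitly. If a predictable port is preferred, the ports section of airflow-service.yaml can optionally be extended as sketched below; the value 30315 here is simply the port our cluster happened to assign:

  ports: 
    - name: airflow-ui 
      protocol: TCP 
      port: 8080 
      targetPort: 8080 
      nodePort: 30315 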

 

Now that Airflow is finally deployed on the Kubernetes cluster and running successfully, we can open the Airflow webserver UI using the local VM IP address and the NodePort assigned to the service (30315 in our case), as shown in the picture above. In the browser, the webserver UI should open, and the login screen should appear like this:

 

[Figure: Airflow webserver login screen]

 

Conclusion

 

In this blog post we have gone through the process of Apache Airflow deployment on top of a Kubernetes cluster running on Ubuntu virtual machines. The same process can be a good starting point to deploy Airflow on bare metal servers and to configure Airflow for safe production usage.

 

As we mentioned in the article, some of the resources, like the ConfigMaps containing sensitive values, were created in plaintext, which is not ideal from a security point of view: Kubernetes Secrets and an RBAC (Role-Based Access Control) approach can be used to secure these resources on the Kubernetes cluster at user level. It is also advisable to employ an external secrets store such as HashiCorp Vault, deployed on-prem or in the cloud, for sensitive values.

 

We hope this first part of the Airflow blog post series has sparked your interest in a secure Airflow deployment in your production environment. In the next part of the series, we will get our hands dirty and show you how to run and monitor Airflow workflows running in a Kubernetes cluster. If you need further support for your Airflow deployment, or if you are interested in possible use cases for this technology, or if you want to see how to integrate it with your existing architecture, do not hesitate to contact us – our experts will be happy to help you!

 


Tomislav N
tomislav.novosel@clearpeaks.com