Enabling Private Custom Python Packages in Databricks

In our previous blog post, we explored how to package reusable, well-tested components and integrate them into Azure DevOps, covering design choices, CI/CD, smart versioning, rigorous testing, and publishing via Azure Artifacts. Building on that foundation, we now shift from how we ship to how we run: making that package a first-class citizen in Databricks by enabling clusters to pull securely and consistently from private PyPI sources (known as Artifact Feeds in Azure DevOps), so that teams have reliable access to internal packages across workspaces and jobs.

The Challenge: Custom Python Packages in Databricks

 

By default, Databricks clusters pull Python packages from the public PyPI index. Whilst this works well for open-source libraries, enterprise environments often require additional controls, such as:

  1. Custom internal packages hosted on private repositories for code reusability and standardisation.
  2. Enhanced security by controlling package sources.
  3. Version control over dependencies across different environments.
  4. Compliance with organisational policies on external package usage.

 

 

The Solution: Init Scripts for Private PyPI Configuration

 

We designed a reliable shell script to configure private PyPI repositories automatically each time a Databricks cluster starts up. By running this process during cluster initialisation, we ensure that all the necessary package sources are available right from the start, streamlining both the setup and ongoing development:

 

#!/bin/bash

# Check for required environment variables
if [[ -z "$AZ_DEVOPS_TOKEN" || -z "$AZ_DEVOPS_FEED_NAME" || -z "$AZ_DEVOPS_ORG_NAME" || -z "$AZ_DEVOPS_PROJECT_NAME" ]]; then
  echo "Required environment variables are missing." >&2
  exit 1
fi

# Backup existing pip.conf if it exists
if [ -f /etc/pip.conf ]; then
  cp /etc/pip.conf /etc/pip.conf.bak
fi

# Write a new pip.conf pointing pip at the private feed as an extra index
printf '[global]\nextra-index-url = https://%s@pkgs.dev.azure.com/%s/%s/_packaging/%s/pypi/simple/\n' \
  "$AZ_DEVOPS_TOKEN" "$AZ_DEVOPS_ORG_NAME" "$AZ_DEVOPS_PROJECT_NAME" "$AZ_DEVOPS_FEED_NAME" > /etc/pip.conf

 

When a Databricks cluster starts with this init script, it runs through the following steps:

  • Validation: The script first validates that all the required environment variables are present.
  • Backup: Any existing pip configuration is safely backed up.
  • Configuration: A new /etc/pip.conf is written with the custom extra index URL.
  • Authentication: The Azure DevOps token is embedded in the URL for seamless authentication.
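With hypothetical placeholder values substituted in, the generated /etc/pip.conf would look something like this (organisation, project, and feed names here are made up for illustration):

```ini
[global]
extra-index-url = https://<your-pat>@pkgs.dev.azure.com/my-org/my-project/_packaging/my-feed/pypi/simple/
```

Because the private feed is registered as an extra index rather than a replacement, pip keeps the public PyPI index as its primary source.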

 

Overall Execution of the Init Script

 

The resulting configuration allows pip to search both the public PyPI and your private Azure DevOps artifacts feed, offering the best of both worlds.
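To sanity-check a configuration like this before baking it into a cluster, you can point pip at a throwaway config file via the PIP_CONFIG_FILE environment variable and inspect what pip resolves; the token and feed values below are made-up placeholders, and this is a local test rather than the cluster's real /etc/pip.conf:

```shell
# Write a sample pip config to a temporary location
cat > /tmp/test-pip.conf <<'EOF'
[global]
extra-index-url = https://example-token@pkgs.dev.azure.com/my-org/my-project/_packaging/my-feed/pypi/simple/
EOF

# Ask pip to print the effective configuration it would use
PIP_CONFIG_FILE=/tmp/test-pip.conf python3 -m pip config list
```

If the extra-index-url shows up in the output, the cluster's pip will consult the private feed.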

 

Since init scripts run during cluster initialisation, they can add to the overall boot time: the heavier the script (network calls, large downloads, complex logic), the longer the startup.

 

Cluster init scripts must reside in workspace-accessible locations. Common options include:

  • Databricks Volumes (UC): Store the script in a managed Volume and reference its path in the cluster (or via a policy). This provides workspace-native paths, consistent access across clusters, and simpler administration.
  • Cloud storage: Keep the script in the organisation’s standard object storage and access it via the native filesystem connector or a mounted path.

 

We host the init script in Unity Catalog–integrated Databricks Volumes, as recommended; its upload is fully automated through Terraform. In line with best practices, all infrastructure resources, including this script, are defined and managed with Terraform to ensure repeatability, version control, and standardised deployments across environments.
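As a sketch of how that automation might look (the resource names, catalog, schema, and Volume paths below are hypothetical and would need to match your environment), the Databricks Terraform provider's databricks_file resource can upload the script to a Unity Catalog Volume, which the cluster then references via an init_scripts block:

```hcl
# Upload the init script to a Unity Catalog Volume
# (catalog "main", schema "ops", volume "init_scripts" are placeholder names)
resource "databricks_file" "pypi_init_script" {
  source = "${path.module}/scripts/configure-private-pypi.sh"
  path   = "/Volumes/main/ops/init_scripts/configure-private-pypi.sh"
}

# Reference the uploaded script from the cluster definition
resource "databricks_cluster" "example" {
  # ... other cluster settings ...

  init_scripts {
    volumes {
      destination = databricks_file.pypi_init_script.path
    }
  }
}
```

Keeping the script and its cluster wiring in the same Terraform configuration means a change to the script is versioned, reviewed, and rolled out like any other infrastructure change.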

 

The script requires four critical environment variables to authenticate and locate the Azure DevOps artifacts feed. For sensitive configurations, we use Azure Key Vault to securely provide these variables to the script:

  • AZ_DEVOPS_TOKEN: Personal Access Token for authentication.
  • AZ_DEVOPS_FEED_NAME: Name of your artifacts feed.
  • AZ_DEVOPS_ORG_NAME: Azure DevOps organisation name.
  • AZ_DEVOPS_PROJECT_NAME: Project containing the artifacts feed.

 

In our Databricks cluster, sensitive configuration values are managed through Azure Key Vault-backed secrets, accessed via the spark_env_vars configuration. The setup uses this structure:

 

spark_env_vars = {
    AZ_DEVOPS_ORG_NAME = "{{secrets/kv-app/common-az-devops-org-name}}"
    AZ_DEVOPS_PROJECT_NAME = "{{secrets/kv-app/common-az-devops-project-name}}"
    AZ_DEVOPS_FEED_NAME = "{{secrets/kv-app/common-az-devops-feed-name}}"
    AZ_DEVOPS_TOKEN = "{{secrets/kv-app/common-az-devops-token}}"
  }

 

Each placeholder follows the Databricks secrets format shown below, where kv-app points to the Azure Key Vault instance and the trailing segment specifies the secret name:

 

{{secrets/<scope>/<secret-name>}}

 

These secrets are resolved at runtime by Databricks, keeping sensitive values out of code repositories. By managing credentials this way, we combine strong security practices with flexibility and ease of use for teams working across diverse environments.

 

During development, we found that editing shell scripts on Windows added Windows-style newlines (\r\n) to our .sh files. As Databricks clusters operate on Linux, these scripts need Unix-style (\n) line endings to execute properly. To address this, we converted our script files to the proper Linux format with this command:

 

# Convert Windows line endings to Unix line endings
dos2unix your-init-script.sh

 

This adjustment ensures our initialisation scripts run smoothly on Databricks clusters without errors caused by incompatible line endings.
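If dos2unix isn't available in your environment, sed can perform the same conversion by stripping the trailing carriage return from each line; /tmp/demo-init.sh below is just a throwaway example file:

```shell
# Create a demo script with Windows (CRLF) line endings
printf 'echo hello\r\necho world\r\n' > /tmp/demo-init.sh

# Strip the carriage return at the end of each line, in place
sed -i 's/\r$//' /tmp/demo-init.sh
```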

Conclusion

 

Setting up private PyPI repositories on Databricks can be straightforward. By using a well-crafted initialisation script and managing authentication tokens securely with Azure Key Vault, you can make custom Python packages seamlessly available across your data platform.

 

This method not only tackles current package management challenges, but also lays the groundwork for more advanced dependency management as your platform evolves.

 

If you’d like to implement this setup or explore how it can be adapted to your specific environment, our team at ClearPeaks can help! We work closely with numerous organisations to design secure, scalable, and maintainable Databricks solutions that accelerate development and streamline operations, so get in touch with us to see how we can help you get started!

 

Saqib T
saqib.tamli@clearpeaks.com