Data Science & AI Workbench enables you to create local copies of repositories so users can access packages from a centralized, on-premises location. This process is called mirroring. You can mirror the full content of a repository, or include only specific packages or types of packages in your mirror. You can also create mirrors in an air-gapped network to improve performance and security.

You can mirror an online repository, or you can use a tarball containing package data to populate a channel in Workbench.

Prerequisites:

It can take several hours to mirror an entire repository, depending on its size.

Creating a conda mirror

The basic steps for creating a conda mirror are:

  1. Prepare your mirror configuration file.

  2. Log in to the Workbench CLI.

  3. If necessary, create a channel in the internal Workbench repository.

  4. Initiate the mirror by running the following command:

    anaconda-mirror-ae5 --file /path/to/<mirror>.yaml

    Append --dry-run to the command to see what actions would be taken by the mirror, without performing actual modifications.

Preparing your mirror configuration file

Create a <mirror>.yaml file that details the configurations for the mirror.

You can name this file whatever you’d like. Anaconda recommends naming it the same as the channel you are mirroring to.

Basic configurations

Define source channel locations, package platforms, and destination or storage details. These settings also let you manage package formats, clean up outdated packages, and test a configuration without applying changes.

| Parameter | Description |
| --- | --- |
| channels | List of URLs for channels you want to mirror from. If a short channel name is supplied, Workbench uses its system-level .condarc file's channel_alias: value to complete the channel URL. |
| platforms | List of platforms you want to mirror packages for. For example, win-64 or linux-64. If no value is supplied, the mirror will include packages for all platforms available on the source channel. |
| dest_channel | The short name for the internal Workbench repository channel you are mirroring to. The rest of the channel URL is automatically completed by anaconda-enterprise-cli for you. |
| dest_site | The web address or path where you want to store your mirrored packages. The specific formatting and necessity of this value depends on the type of destination repository. For more information, see repository-specific configurations. |
| format_policy | Determines how the mirror manages .conda and .tar.bz2 files. prefer-conda or prefer-tarbz2: mirror one package format over the other if both are available for a package; if the preferred type is unavailable, the other file type is still downloaded. only-conda or only-tarbz2: mirror packages that are available in the preferred file format only. transmute-conda or transmute-tarbz2: convert packages to the preferred format as necessary; requires the conda-package-handling package to function. keep-both: mirror both file types for all available packages. Defaults to prefer-conda for repositories that support .conda formatting and only-tarbz2 for those that do not. If your repository does not support .conda formatting, Anaconda recommends installing the conda-package-handling package and using the transmute-tarbz2 option. |
| clean | true / false. If true, removes packages from the destination channel that are not on the source channel when updating. Default: false (to ensure packages are not inadvertently removed). |
| dry_run | true / false. If true, outputs what actions would be taken by the mirror, without performing actual modifications. Default: false. |
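
As a rough illustration of how the basic parameters fit together, the following sketch mirrors a single platform from one public channel and previews the result without changing anything. The channel URL, platform, and destination channel name are assumptions chosen for the example, not required values.

```yaml
# Minimal sketch of a mirror configuration file (values are illustrative).
channels:
  - https://conda.anaconda.org/conda-forge   # source channel URL (assumed for this example)
platforms:
  - linux-64                                 # omit to mirror every platform on the source
dest_channel: conda-forge                    # hypothetical destination channel short name
format_policy: prefer-conda                  # fall back to .tar.bz2 when no .conda file exists
clean: false                                 # keep destination packages missing from the source
dry_run: true                                # report planned actions only; set to false to mirror
```

Starting with dry_run set to true is a cautious way to check which packages would be transferred before committing to a run that can take several hours.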

Filtering configurations

Fine-tune which packages are included in the mirror. Specify the Python or R versions that mirrored packages should be compatible with, include only specific packages, or exclude packages by name or license family.

| Parameter | Description |
| --- | --- |
| python_versions | A comma-separated list of Python versions. Restricts all Python packages and packages that depend on Python to these versions. |
| r_versions | A comma-separated list of R versions. Restricts all R packages and packages that depend on R to these versions. |
| pkg_list | List of package names or valid MatchSpec strings. If supplied, only the specified packages will be mirrored, not their dependencies. Cannot be paired with license_exclude, exclude, or include. |
| license_exclude | List of license families to exclude from the mirror. To see a list of valid license families, use the anaconda-mirror-ae5 --help command. Cannot be paired with pkg_list. |
| exclude | List of package names or valid MatchSpec strings to exclude. Cannot be paired with pkg_list. |
| include | List of package names or valid MatchSpec strings to override the mirror's other filters and include these packages even if they would otherwise be filtered out. Cannot be paired with pkg_list. |

For more information about MatchSpec, see package match specifications.
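
The sketch below shows one way the filtering parameters might be combined. The package names, version numbers, and license families are assumptions for illustration; confirm the valid license families with anaconda-mirror-ae5 --help. The python_versions value is written as a comma-separated string to follow the parameter description, and whether a YAML list is also accepted is not confirmed here.

```yaml
# Filtering sketch (illustrative values; channel and names are assumptions).
channels:
  - https://repo.anaconda.com/pkgs/main
platforms:
  - linux-64
dest_channel: main-filtered        # hypothetical destination channel
python_versions: "3.10,3.11"       # restrict Python-dependent packages to these versions
license_exclude:                   # assumed license family names; verify with --help
  - AGPL
  - GPL3
exclude:
  - "numpy<1.20"                   # MatchSpec string: skip older numpy builds
include:
  - readline                       # kept even if another filter would drop it
# pkg_list is intentionally absent: it cannot be paired with
# license_exclude, exclude, or include.
```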

Advanced configurations

Configure repository authentication, enforce platform restrictions, and manage SSL verification for secure connections.

| Parameter | Description |
| --- | --- |
| username / password | Supplies credentials for repository authentication. For more information, see repository-specific configurations. |
| strict_platforms | true / false. If true, excludes noarch from the mirror. Default: false (all platforms use noarch). |
| max_attempts | Number of retry attempts for failed connections. Default: 5. |
| max_failures | Number of failed transactions before stopping. Default: 100. |
| verify_ssl | true / false. Enables or disables SSL verification. Default: true. |

If Workbench is installed in a proxied environment, see Configuring conda in Workbench for information on setting the NO_PROXY variable.
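
For a network with unreliable connections, the advanced parameters might be tuned along these lines. This is a sketch only: the retry and failure counts are arbitrary example values, and the channel and destination names are assumptions.

```yaml
# Advanced settings sketch (illustrative values).
channels:
  - https://repo.anaconda.com/pkgs/main
dest_channel: anaconda     # hypothetical destination channel
strict_platforms: false    # default: noarch packages are included
max_attempts: 10           # retry failed connections more often than the default of 5
max_failures: 50           # stop sooner than the default of 100 failed transactions
verify_ssl: true           # default: keep SSL verification enabled
```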

Repository-specific configurations

JFrog Artifactory

For Artifactory destinations, the dest_site can be a repository hostname, or a full URL.

If you supply the hostname only, anaconda-mirror interprets the channel path as:

https://<dest_site>/artifactory/<dest_channel>

To authenticate to a JFrog Artifactory repository, use one of the following methods:

  • Configure the username and password values in your .yaml file to contain your credentials. If both values are supplied, they are delivered using basic HTTP authentication. You can substitute an access token for your password if necessary.
  • Configure just the password value in your .yaml file. This is delivered as a bearer token using the Authorization: Bearer header. This must be an access token.
  • Configure your .netrc file to store your username and password for the repository. These values are delivered using basic HTTP authentication.
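
The first authentication method could look like the following sketch. The hostname, channel name, and credential placeholders are hypothetical, and any additional setting that identifies the destination as Artifactory is not covered in this section and is therefore omitted.

```yaml
# Artifactory destination sketch (hostname and names are placeholders).
channels:
  - https://repo.anaconda.com/pkgs/main
dest_site: artifactory.example.com        # hostname only
dest_channel: conda-main                  # hypothetical repository name
# With a hostname-only dest_site, the channel path is interpreted as
# https://artifactory.example.com/artifactory/conda-main
username: "<USERNAME>"                    # supplied with password: basic HTTP authentication
password: "<PASSWORD_OR_ACCESS_TOKEN>"    # supplied alone: sent as a bearer token
```

Leaving both values out of the file and relying on .netrc keeps credentials out of a configuration file that might be shared or committed to version control.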

S3 bucket

For Simple Storage Service (S3) buckets, the channel path is a concatenation of the dest_site and dest_channel values.

For example, if you are mirroring to an S3 bucket and set dest_site to <bucket_name>/full/path/to/, the full channel path is interpreted as:

<bucket_name>/full/path/to/<dest_channel>

Authentication to an S3 source is currently controlled entirely by the environment. For example, you can use the aws CLI tool to configure the target region and authenticate. You may wish to use the AWS_PROFILE environment variable to select among multiple configurations.
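
A destination block for an S3 bucket might be written as follows. The bucket name, path, and channel name are hypothetical, and no credentials appear because authentication comes from the environment.

```yaml
# S3 destination sketch (bucket, path, and channel name are placeholders).
channels:
  - https://repo.anaconda.com/pkgs/main
dest_site: example-bucket/conda/mirrors/   # <bucket_name>/full/path/to/
dest_channel: main                         # appended to dest_site
# Resulting channel path: example-bucket/conda/mirrors/main
```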

Local

Much like the S3 bucket, the local repository channel path consists of a concatenation of the dest_site and dest_channel values.

No authentication is necessary for local repositories.

anaconda-enterprise-cli

The dest_site value defaults to the <SITE_NAME> value established when you configured the Workbench CLI. If you have configured the CLI to access only one site (that is, your Workbench instance), you do not need to specify this value.

Authentication is handled when you log in to the CLI.

Example conda and R mirrors

Here are some example mirror .yaml files you can use to mirror some common repositories:
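
As a rough illustration, the two files sketched below mirror Anaconda's public main channel and R channel into the internal Workbench repository. The file names, platform, and destination channel names are assumptions; adjust them to match your environment.

```yaml
# anaconda.yaml (hypothetical file name): mirror Anaconda's main channel
channels:
  - https://repo.anaconda.com/pkgs/main
platforms:
  - linux-64
dest_channel: anaconda
---
# r.yaml (hypothetical file name): mirror Anaconda's R channel
channels:
  - https://repo.anaconda.com/pkgs/r
platforms:
  - linux-64
dest_channel: r
```

The two documents are separated by --- only for display here; save each one as its own file and pass it to anaconda-mirror-ae5 with --file.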

Mirroring a PyPI repository

The full PyPI mirror size is currently close to 10TB, so ensure that your file storage location has sufficient disk space before proceeding.

Because anaconda-mirror does not handle .pip package formatting, mirrors for PyPI repositories containing such packages are managed by the anaconda-enterprise-cli tool.

The steps follow the same pattern as creating a conda mirror, but the final command differs:

  1. Prepare your mirror configuration file.

  2. Log in to the Workbench CLI.

  3. If necessary, create a channel in the internal Workbench repository.

  4. Initiate the mirror by running the following command:

    anaconda-enterprise-cli mirror pypi --config pypi-mirror.yaml
    

This command loads the packages on https://pypi.org into the user’s account.

Mirrored packages can be viewed at https://<FQDN>/repository/pypi/pypi/simple/, replacing <FQDN> with the fully qualified domain name of your installation of Workbench. (The second pypi in the URL should match the user configuration value described below.)

PyPI configurations:

PyPI mirror .yaml configuration values consist of the following:

| Parameter | Description |
| --- | --- |
| user | The local user under which the PyPI packages are imported. Default: pypi. |
| pkg_list | List of package names to mirror. If supplied, only the specified packages will be mirrored, not their dependencies. Cannot be paired with blocklist or allowlist. |
| allowlist | List of package names to mirror. If supplied, only the specified packages will be mirrored, not their dependencies. Cannot be paired with pkg_list. |
| blocklist | List of package names to skip. Packages listed here are not mirrored. Cannot be paired with pkg_list. |
| latest_only | If true, only the latest package versions are mirrored. Default: false. |
| remote_url | The URL of the PyPI mirror. /pypi is appended to build the XML RPC API URL, /simple for the simple index, and /pypi/{package}/{version}/json for the JSON API. Default: https://pypi.python.org/ |
| xml_rpc_api_url | A custom value for the XML RPC URL. If this value is present, it takes precedence over the URL built using remote_url. Default: null. |
| simple_index_url | A custom value for the simple index URL. If this value is present, it takes precedence over the URL built using remote_url. Default: null. |
| use_xml_rpc | Whether to use the XML RPC API as specified by PEP 381. If this is set to true, the XML RPC API is used to determine which packages to check. Otherwise, the script falls back to the simple index. If the XML RPC fails, the simple index is used. Default: true. |
| use_serial | If set to true, uses the serial number provided by the XML RPC API. Only packages updated since the last serial saved are checked. If this is set to false, all PyPI packages are checked for updates. Default: true. |
| create_org | Creates the mirror user as an organization instead of a regular user account. All superusers are added to the Owners group of the organization. Default: false. |

All mirrored PyPI-like channels are publicly available to pull packages from both inside and outside Workbench (no authentication is required).
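
Because a full PyPI mirror is very large, a first run might be limited to a handful of packages. The following pypi-mirror.yaml is a sketch under that assumption; the package names are examples only, and dependencies must be listed explicitly because they are not mirrored automatically.

```yaml
# pypi-mirror.yaml sketch (package names are illustrative).
user: pypi            # default user; matches the second "pypi" in the repository URL
pkg_list:             # cannot be paired with allowlist or blocklist
  - requests
  - urllib3           # dependencies are not pulled in automatically, so list them
latest_only: true     # mirror only the latest version of each package
```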

Configuring pip

To configure pip to use this new mirror, create pip.conf as follows:

# Replace <WORKBENCH_URL> with the actual URL to your Workbench instance
[global]
index-url=<WORKBENCH_URL>/repository/pypi/pypi/simple/

To configure Workbench sessions and deployments to automatically use this pip.conf, run the following command:

anaconda-enterprise-cli spark-config --config /etc/pip.conf pip.conf

For more specific information on configuring pip, see the official pip documentation.