Anaconda Package Download Data: Updates and Fixes for More Accurate Statistics
Dasha Gurova
Reviewers/Approvers: Jannis Leidel, Nick Cappadona
TL;DR: We’ve rolled out major improvements to our package download statistics, including more accurate download counts and better tracking of .conda artifacts. If you maintain or use conda packages, you now have access to more reliable data about how these packages are being used across the ecosystem.
A Window into the Conda Ecosystem
Since 2017, Anaconda Package Data has been our community’s window into how conda packages are being used. This dataset tracks download statistics from both Anaconda Distribution channels (repo.anaconda.com) and the anaconda.org public repository, helping maintainers and users understand the reach of their packages.
What’s New?
When we first launched this public dataset in 2019, our goal was to provide transparency into package usage. Today’s updates make that data significantly more accurate and comprehensive.
More Accurate Download Counts
We’ve rebuilt our data pipeline from the ground up. By directly processing raw HTTP request data from both repo.anaconda.com and anaconda.org, we now capture a more complete picture of package downloads, including .conda artifacts that we previously missed.
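To illustrate what handling both artifact formats involves, here is a minimal sketch that splits a conda artifact filename into name, version, and build for .tar.bz2 and .conda files alike. The function name and the sample filenames are hypothetical, and this is not the pipeline's actual code; it relies only on the standard conda filename convention of name-version-build followed by the extension.

```python
# The two conda artifact formats: the legacy .tar.bz2 and the newer .conda.
ARTIFACT_EXTENSIONS = (".tar.bz2", ".conda")


def parse_artifact(filename):
    """Split '<name>-<version>-<build><ext>' into its parts.

    Returns None when the filename is not a recognized conda artifact.
    """
    for ext in ARTIFACT_EXTENSIONS:
        if filename.endswith(ext):
            stem = filename[: -len(ext)]
            # Split from the right because package names may contain hyphens.
            parts = stem.rsplit("-", 2)
            if len(parts) != 3:
                return None
            name, version, build = parts
            return {"name": name, "version": version, "build": build, "format": ext}
    return None
```

A parser that only recognized the .tar.bz2 suffix would silently skip every .conda download, which is the kind of gap described above.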
We also discovered and fixed an important issue: download counts for March-May 2024 were accidentally inflated due to CDN clone requests. By properly filtering requests to conda-static.anaconda.org, we’ve eliminated this double-counting.
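The idea behind this fix can be sketched in a few lines. The example below assumes that "filtering" means excluding requests whose host is conda-static.anaconda.org before counting. The record shape (dicts with `host` and `path` keys), the function name, and the sample records are hypothetical and are not taken from the real pipeline.

```python
from collections import Counter

# Host named in this post as the source of CDN clone requests.
CDN_CLONE_HOST = "conda-static.anaconda.org"


def count_downloads(requests):
    """Count requests per artifact path, skipping CDN clone requests.

    `requests` is an iterable of dicts with hypothetical 'host' and 'path' keys.
    """
    counts = Counter()
    for request in requests:
        if request["host"] == CDN_CLONE_HOST:
            continue  # assumed to duplicate a download that is counted elsewhere
        counts[request["path"]] += 1
    return counts


# Made-up sample records, for illustration only.
sample_requests = [
    {"host": "conda.anaconda.org", "path": "/conda-forge/noarch/example-1.0-py_0.conda"},
    {"host": "conda-static.anaconda.org", "path": "/conda-forge/noarch/example-1.0-py_0.conda"},
    {"host": "repo.anaconda.com", "path": "/pkgs/main/noarch/example-1.0-py_0.conda"},
]
download_counts = count_downloads(sample_requests)
```

Without the host check, the first artifact in this sample would be counted twice.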
Channel Coverage
Packages on anaconda.org are organized in channels and can be maintained by the community or a company. The anaconda-package-data dataset covers a wide range of popular conda channels, including:
- anaconda
- bioconda
- conda-forge
- nvidia
- plotly
- pytorch and pytorch-test
- pyviz
- rapidsai and rapidsai-nightly
- rdkit
Technical Improvements
To ensure everyone can continue using their existing tools and workflows, we’ve carefully aligned our new pipeline with the legacy system. This includes maintaining millisecond precision in time columns for hourly data and preserving Pandas properties in the Parquet metadata.
Historical Data Updates
We’ve also taken this opportunity to update our historical data. We’ve replaced the hourly and monthly data from June 2022 through May 2024 with corrected versions, and we’ve published new data covering June 2024 through November 2024. December 2024 data was released on January 2, 2025.
Getting Started with the Dataset
The updated and improved dataset is available at s3://anaconda-package-data in hourly and monthly aggregations. In the anaconda-package-data GitHub repository, you can find a quick start notebook and documentation about the dataset structure and available fields.
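As a starting point, the sketch below builds object paths for the two aggregations. The key layout (conda/hourly/YYYY/MM/YYYY-MM-DD.parquet and conda/monthly/YYYY/YYYY-MM.parquet) is an assumption based on the repository's documentation, and the helper names are hypothetical; check the repository for the authoritative layout before relying on it.

```python
from datetime import date

BUCKET = "s3://anaconda-package-data"


def hourly_path(day):
    """Assumed key for one day of hourly data: conda/hourly/YYYY/MM/YYYY-MM-DD.parquet."""
    return f"{BUCKET}/conda/hourly/{day:%Y}/{day:%m}/{day:%Y-%m-%d}.parquet"


def monthly_path(year, month):
    """Assumed key for one month of data: conda/monthly/YYYY/YYYY-MM.parquet."""
    return f"{BUCKET}/conda/monthly/{year:04d}/{year:04d}-{month:02d}.parquet"
```

With pandas and s3fs installed, a path built this way can be passed to pandas.read_parquet, typically with storage_options={"anon": True}, assuming the bucket permits anonymous reads.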
One thing to keep in mind about the data: conda’s package caching system means that users typically need to download a package only once, even when creating multiple environments. As a result, download counts aren’t a perfect measure of actual package usage, and packages that release updates more frequently naturally show higher download numbers.
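One rough way to account for release frequency is to look at downloads per distinct version instead of raw totals. The sketch below is a simple heuristic, not an official metric. The column names (`pkg_name`, `pkg_version`, `counts`) are assumed from the dataset documentation, and the rows are invented for illustration.

```python
from collections import defaultdict


def downloads_per_version(rows):
    """Return {package: total downloads / number of distinct versions seen}."""
    totals = defaultdict(int)
    versions = defaultdict(set)
    for row in rows:
        totals[row["pkg_name"]] += row["counts"]
        versions[row["pkg_name"]].add(row["pkg_version"])
    return {name: totals[name] / len(versions[name]) for name in totals}


# Invented rows: one package with three releases, one with a single release.
rows = [
    {"pkg_name": "fast-mover", "pkg_version": "1.0", "counts": 100},
    {"pkg_name": "fast-mover", "pkg_version": "1.1", "counts": 100},
    {"pkg_name": "fast-mover", "pkg_version": "1.2", "counts": 100},
    {"pkg_name": "slow-mover", "pkg_version": "2.0", "counts": 150},
]
per_version = downloads_per_version(rows)
```

Here fast-mover has twice the raw downloads of slow-mover (300 versus 150) but fewer downloads per release, which shows how release cadence alone can inflate totals.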
Why Does This Matter?
These improvements make a real difference for package maintainers and the broader conda community. With more accurate download statistics, maintainers can better understand their package adoption and make informed decisions about development priorities. Channel coverage in the dataset provides insights into the broader conda ecosystem, while eliminating double counting ensures these insights are based on reliable data. This update is part of our commitment to supporting the conda ecosystem.