Sharded repodata#

This document provides an overview of how conda implements CEP-16 Sharded Repodata.

Sharded repodata splits repodata.json into per-package shards and an index, repodata_shards.msgpack.zst, that maps package names to shard hashes. A shard contains the repodata for every package with a given name. Because shards are named after a hash of their contents, they can be cached without checking the server for freshness. A shard only needs to change when a package with that name has changed, so only the much smaller index has to be re-fetched often.
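The caching property described above can be sketched in a few lines. This is a toy illustration, not conda's code: `ContentAddressedCache`, `shard_digest`, and the plain-bytes "shards" are hypothetical, and SHA-256 is used only as an example hash.

```python
import hashlib


def shard_digest(shard_bytes: bytes) -> str:
    """Name a shard after a hash of its contents (SHA-256 here, for illustration)."""
    return hashlib.sha256(shard_bytes).hexdigest()


class ContentAddressedCache:
    """Hypothetical in-memory cache keyed by shard hash."""

    def __init__(self) -> None:
        self._by_hash: dict[str, bytes] = {}
        self.fetch_count = 0

    def get(self, digest: str, fetch) -> bytes:
        # A hash hit never needs a freshness check: same name, same bytes.
        if digest not in self._by_hash:
            data = fetch(digest)
            if shard_digest(data) != digest:
                raise ValueError("shard content does not match its hash")
            self._by_hash[digest] = data
            self.fetch_count += 1
        return self._by_hash[digest]


# Toy "server" and index; real shards are not plain byte strings like these.
python_v1 = b"python shard, version 1"
numpy_v1 = b"numpy shard, version 1"
server = {shard_digest(b): b for b in (python_v1, numpy_v1)}
index = {"python": shard_digest(python_v1), "numpy": shard_digest(numpy_v1)}

cache = ContentAddressedCache()
for name in ("python", "numpy", "python"):
    cache.get(index[name], server.__getitem__)

# A new python build changes only python's entry in the re-fetched index.
python_v2 = b"python shard, version 2"
server[shard_digest(python_v2)] = python_v2
new_index = {**index, "python": shard_digest(python_v2)}
changed = [name for name in new_index if new_index[name] != index[name]]
```

Here the third request for python is served from the cache, and after the index is refreshed only the python shard has a new name to fetch.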

Sharded Repodata in conda#

The implementation was originally developed in conda-libmamba-solver and later ported into conda. We wanted a way to implement sharded repodata in Python that was independent of compiled solver code.

We do this by treating all repodata as if it were sharded repodata. Starting with a list of installed packages and to-be-installed packages, we gather all repodata for those packages and look for all package names listed in their dependencies. We repeat the process for every discovered package name that we have not already visited, either fetching repodata shards or examining all artifacts with that package name in monolithic repodata.json. This process gathers all versions of all packages that we might depend on. We do not consider package versions at this stage; that’s the solver’s job.
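The traversal described above is essentially a breadth-first search over package names. The following is a minimal sketch under simplifying assumptions: `CHANNEL` is a made-up in-memory channel, `reachable_subset` and `dependency_names` are hypothetical helpers, and the package name is taken to be the first whitespace-separated token of each `depends` entry.

```python
from collections import deque

# Hypothetical toy channel: package name -> every record with that name.
CHANNEL = {
    "python": [
        {"name": "python", "version": "3.12.0", "depends": ["openssl >=3", "libzlib"]},
        {"name": "python", "version": "3.11.0", "depends": ["openssl", "xz"]},
    ],
    "openssl": [{"name": "openssl", "version": "3.2.0", "depends": ["ca-certificates"]}],
    "libzlib": [{"name": "libzlib", "version": "1.3", "depends": []}],
    "xz": [{"name": "xz", "version": "5.4", "depends": []}],
    "ca-certificates": [
        {"name": "ca-certificates", "version": "2024", "depends": ["__unix"]}
    ],
    "unrelated": [{"name": "unrelated", "version": "1.0", "depends": []}],
}


def dependency_names(record: dict) -> set:
    """Return only the names in a record's dependencies; versions are ignored."""
    return {spec.split()[0] for spec in record.get("depends", [])}


def reachable_subset(channel: dict, root_names: list) -> dict:
    """Collect all records for every package name reachable from root_names."""
    visited = set()
    pending = deque(root_names)
    subset = {}
    while pending:
        name = pending.popleft()
        if name in visited:
            continue
        visited.add(name)
        records = channel.get(name)
        if records is None:
            continue  # not in the channel, e.g. a virtual package
        subset[name] = records
        for record in records:
            pending.extend(dependency_names(record) - visited)
    return subset


subset = reachable_subset(CHANNEL, ["python"])
```

Both python records contribute dependencies, because every version is kept for the solver to choose from, while `unrelated` is never visited.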

As of this writing, conda create -c conda-forge --dry-run python finds 35 package names; the same command for conda finds 137 package names; and for vaex, a dataframe library with a complex dependency tree, it finds 678 package names. That is far fewer than the 31k total packages listed at https://conda-forge.org/, and a manageable number to pre-process in Python before doing a solve with libmamba. As long as we can fetch those packages quickly enough, from cache or from the network, we will save RAM, disk space, bandwidth and time compared to parsing every package on the channel every time.

Threading and concurrency#

To achieve concurrency, our sharded repodata implementation uses the Python threading module. We have two worker threads: one reads shards from the cache and the other fetches them from the network. These threads communicate with each other via the following queues:

  • cache_in_queue: every requested shard goes here first, where the cache worker checks whether we have a valid cache record.

  • cache_miss_queue: every shard not found in the cache is sent to this queue, where the network worker thread downloads it.

  • shard_out_queue: once a shard has been fetched by either the cache or the network worker thread, it is placed here so we can gather all needed shards at the end to build our repodata subset.

sequenceDiagram
    loop
        Main ->> Main: "Fetch" in-memory shard
        Main ->> Cache: Fetch shard
        Cache ->> Network: Cache miss
        Cache ->> Main: Cache hit
        Network ->> Main: Network result
        Main ->> Main: Find new (channel, package) from shard data
    end
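The queue layout above can be sketched with the standard library alone. This is a simplified illustration, not conda's implementation: the worker functions, the `None` shutdown sentinel, and the dictionaries standing in for the cache and the network are all assumptions, and the main thread requests a fixed list of names rather than discovering new ones in a loop.

```python
import queue
import threading

SENTINEL = None  # hypothetical shutdown marker passed through the queues


def cache_worker(cache, cache_in_queue, cache_miss_queue, shard_out_queue):
    """Answer requests from the cache; forward misses to the network worker."""
    while (name := cache_in_queue.get()) is not SENTINEL:
        if name in cache:
            shard_out_queue.put((name, cache[name], "cache"))
        else:
            cache_miss_queue.put(name)
    cache_miss_queue.put(SENTINEL)  # propagate shutdown downstream


def network_worker(remote, cache_miss_queue, shard_out_queue):
    """Download every shard the cache worker could not provide."""
    while (name := cache_miss_queue.get()) is not SENTINEL:
        shard_out_queue.put((name, remote[name], "network"))


# Dictionaries stand in for the on-disk cache and the remote channel.
cache = {"python": {"depends": ["openssl"]}}
remote = {"python": {"depends": ["openssl"]}, "openssl": {"depends": []}}

cache_in_queue = queue.Queue()
cache_miss_queue = queue.Queue()
shard_out_queue = queue.Queue()

threads = [
    threading.Thread(
        target=cache_worker,
        args=(cache, cache_in_queue, cache_miss_queue, shard_out_queue),
    ),
    threading.Thread(
        target=network_worker, args=(remote, cache_miss_queue, shard_out_queue)
    ),
]
for thread in threads:
    thread.start()

requested = ["python", "openssl"]
for name in requested:
    cache_in_queue.put(name)

# The main thread gathers exactly one result per requested shard.
sources = {}
for _ in requested:
    name, shard, source = shard_out_queue.get(timeout=5)
    sources[name] = source

cache_in_queue.put(SENTINEL)
for thread in threads:
    thread.join()
```

Results can arrive in any order, so the main thread keys them by name instead of relying on queue order.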

    

Source code#

The shard handling code is split into conda/_private/shards/shards.py, conda/_private/shards/cache.py, conda/_private/shards/subset.py, conda/_private/shards/typing.py, and conda/_private/shards/misc.py. conda/gateways/shards/ re-exports build_repodata_subset(). When context.repodata_use_shards is enabled, conda/plugins/manager.py injects it into solver backends that accept a build_repodata_subset constructor parameter. Solver plugins such as conda-libmamba-solver pass the injected callable to their index helper, which calls it and converts the resulting repodata to solver objects in memory. If no channel provides sharded repodata, build_repodata_subset() returns None and the solver falls back to classic repodata.json loading.

conda/_private/shards/shards.py#

conda/_private/shards/shards.py provides an interface to treat sharded repodata and monolithic repodata.json in the same way. It checks a channel for sharded repodata, returning an object that implements the ShardLike interface.
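To illustrate the idea of one interface over both kinds of channel, here is a hedged sketch. `ShardSource`, `MonolithicSource`, and `ShardedSource` are hypothetical names invented for this example and do not reproduce the real ShardLike API; the sketch also assumes a shard has the same "packages" and "packages.conda" groups as repodata.json.

```python
from typing import Protocol


class ShardSource(Protocol):
    """Hypothetical common interface; not the real ShardLike definition."""

    def package_names(self) -> set: ...

    def fetch_shard(self, name: str) -> dict: ...


class MonolithicSource:
    """Present a monolithic repodata.json as if it were split by package name."""

    def __init__(self, repodata: dict) -> None:
        self._shards = {}
        for group in ("packages", "packages.conda"):
            for filename, record in repodata.get(group, {}).items():
                shard = self._shards.setdefault(
                    record["name"], {"packages": {}, "packages.conda": {}}
                )
                shard[group][filename] = record

    def package_names(self) -> set:
        return set(self._shards)

    def fetch_shard(self, name: str) -> dict:
        return self._shards[name]


class ShardedSource:
    """Look up a shard hash in the index, then fetch the shard by hash."""

    def __init__(self, index: dict, fetch) -> None:
        self._index = index
        self._fetch = fetch

    def package_names(self) -> set:
        return set(self._index)

    def fetch_shard(self, name: str) -> dict:
        return self._fetch(self._index[name])


def versions(source: ShardSource, name: str) -> list:
    """Works on either source without knowing which kind it is."""
    shard = source.fetch_shard(name)
    return sorted(r["version"] for group in shard.values() for r in group.values())


repodata = {
    "packages": {"zlib-1.2-0.tar.bz2": {"name": "zlib", "version": "1.2"}},
    "packages.conda": {
        "zlib-1.3-0.conda": {"name": "zlib", "version": "1.3"},
        "xz-5.4-0.conda": {"name": "xz", "version": "5.4"},
    },
}
mono = MonolithicSource(repodata)

# Made-up hashes; a real index maps names to content hashes.
shards_by_hash = {"hash-zlib": mono.fetch_shard("zlib"), "hash-xz": mono.fetch_shard("xz")}
sharded = ShardedSource({"zlib": "hash-zlib", "xz": "hash-xz"}, shards_by_hash.__getitem__)
```

Code such as `versions()` can then be written once against the shared interface.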

conda/_private/shards/subset.py#

conda/_private/shards/subset.py accepts a list of ShardLike instances and a list of initial packages, and computes a repodata subset from them. The ShardLike interface simplifies the traversal, because the algorithm mostly does not need to know whether a channel is sharded or monolithic.

conda/_private/shards/cache.py#

conda/_private/shards/cache.py implements a sqlite3 cache used to store individual shards. When traversing shards, the cache is checked before making a network request. The shards cache is a single database for all channels in $CONDA_PREFIX/pkgs/cache/repodata_shards.db.
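A single-database shard cache can be sketched with the standard sqlite3 module. The table layout, the `ShardCache` class, and the example URLs below are assumptions made for illustration and are not taken from conda/_private/shards/cache.py.

```python
import sqlite3
from typing import Optional


class ShardCache:
    """Hypothetical cache: one table for all channels, keyed by shard URL."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS shards ("
            "url TEXT PRIMARY KEY, shard BLOB NOT NULL)"
        )

    def insert(self, url: str, shard_bytes: bytes) -> None:
        # "with conn" commits on success and rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO shards (url, shard) VALUES (?, ?)",
                (url, shard_bytes),
            )

    def retrieve(self, url: str) -> Optional[bytes]:
        row = self.conn.execute(
            "SELECT shard FROM shards WHERE url = ?", (url,)
        ).fetchone()
        return None if row is None else row[0]


shard_cache = ShardCache()  # in memory here; a real cache would use a file path
url_a = "https://example.invalid/channel-a/linux-64/shards/abc123.msgpack.zst"
url_b = "https://example.invalid/channel-b/noarch/shards/def456.msgpack.zst"

shard_cache.insert(url_a, b"compressed shard bytes")
hit = shard_cache.retrieve(url_a)   # cached: no network request needed
miss = shard_cache.retrieve(url_b)  # not cached: the caller would go to the network
```

Keying by full URL lets shards from every channel share one database without colliding.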

The shards index repodata_shards.msgpack.zst is cached in the same way as repodata.json, in individual files in $CONDA_PREFIX/pkgs/cache/ named after URL hashes. A has_<format> key records whether or not a channel provides that format. If has_shards is false, we wait 7 days after last_checked before making another request for repodata_shards.msgpack.zst. The same system remembers whether a channel provides repodata.json.zst, and stores the ETag and Last-Modified values used to refresh the cache.

...
"has_shards": {
    "last_checked": "2025-10-15T17:19:44.408989Z",
    "value": true
},
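The 7-day re-check rule can be expressed as a small function over a state entry shaped like the one above. `should_request_shards_index` is a hypothetical helper written for this example; only the `has_shards`, `last_checked`, and `value` keys come from the snippet.

```python
from datetime import datetime, timedelta, timezone

RECHECK_INTERVAL = timedelta(days=7)


def should_request_shards_index(state: dict, now: datetime) -> bool:
    """Decide whether to ask the channel for repodata_shards.msgpack.zst."""
    entry = state.get("has_shards")
    if entry is None:
        return True  # never checked before
    if entry["value"]:
        return True  # channel is known to provide shards
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    last_checked = datetime.fromisoformat(entry["last_checked"].replace("Z", "+00:00"))
    return now - last_checked >= RECHECK_INTERVAL


no_shards = {
    "has_shards": {"last_checked": "2025-10-15T17:19:44.408989Z", "value": False}
}
three_days_later = datetime(2025, 10, 18, 12, 0, tzinfo=timezone.utc)
eight_days_later = datetime(2025, 10, 23, 18, 0, tzinfo=timezone.utc)

skip = should_request_shards_index(no_shards, three_days_later)
retry = should_request_shards_index(no_shards, eight_days_later)
```

With these inputs `skip` is False and `retry` is True, so a channel without shards is probed again only after the interval has passed.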

conda/_private/shards/typing.py#

conda/_private/shards/typing.py provides type hints for data structures used in sharded repodata, but it is not normative; it only includes fields used by the sharded repodata system.

conda/_private/shards/misc.py#

conda/_private/shards/misc.py provides URL helpers, batching utilities, and connection-pool configuration used by the other shards modules.

tests/shards/#

Tests under tests/shards/ cover the shards-related code in conda/_private/shards/*.py.

Example dependency graph for Python#

This is what Python’s dependencies look like on conda-forge as of this writing.

If sharded repodata is asked to install Python, we look for python in every active channel. The python shard(s) tell us we can fetch bzip2, libffi, ... in parallel, discovering a third layer including icu, ca-certificates, and others. ca-certificates also depends on some virtual packages, but the traversal quickly determines that these packages don’t appear in any channel by checking the repodata_shards.msgpack.zst index. The solver will let us know if these missing packages are a problem, virtual or not.

The first draft of sharded repodata in conda literally generated classic repodata.json with package subsets to load into the solver, but now the solver gets a subset that yields individual package records, so that it can convert each record into solver objects in memory.

The subset gives the solver every possible dependency for a specific request. The transfer and parsing time saved by not processing the full repodata makes up for the time spent generating the subset.

flowchart LR
    python --> bzip2
    python --> libffi
    python --> libzlib
    python --> ncurses
    python --> openssl
    python --> readline
    python --> sqlite
    python --> tk
    python --> tzdata
    python --> xz
    python --> libsqlite
    python --> libcxx
    python --> zlib
    python --> __osx
    python --> liblzma
    python --> libexpat
    python --> libmpdec
    python --> python_abi
    python --> zstd
    python --> _python_rc
    python --> expat
    python --> libiconv
    libsqlite --> icu
    openssl --> ca-certificates
    python_abi --> pypy3.6
    python_abi --> pypy3.7
    python_abi --> pypy3.8
    python_abi --> pypy3.9
    xz --> liblzma-devel
    xz --> xz-gpl-tools
    xz --> xz-tools
    zstd --> lz4-c
    ca-certificates --> __win
    ca-certificates --> __unix