Sharded repodata#
This document provides an overview on how conda implements
CEP-16 Sharded Repodata.
Sharded repodata splits repodata.json into an index mapping package names to
shard hashes in repodata_shards.msgpack.zst. A shard contains repodata for
every package with a given name. Since shards are named after a hash of their
contents, they can be cached without having to check the server for freshness.
Individual shards only need to change when an individual package has changed, so
only the much smaller index has to be re-fetched often.
Source code#
The shard handling code is split into conda/_private/shards/shards.py, conda/_private/shards/cache.py,
conda/_private/shards/subset.py, conda/_private/shards/typing.py, and conda/_private/shards/misc.py in conda/_private/.
conda/gateways/shards/ re-exports build_repodata_subset(). When
context.repodata_use_shards is enabled, conda/plugins/manager.py injects it
into solver backends that accept a build_repodata_subset constructor parameter.
Solver plugins such as conda-libmamba-solver pass the injected callable to
their index helper, which calls it and converts the resulting repodata to solver
objects in memory. If no channel provides sharded repodata,
build_repodata_subset() returns None and the solver falls back to classic
repodata.json loading.
conda/_private/shards/shards.py#
conda/_private/shards/shards.py provides an interface to treat sharded repodata and monolithic
repodata.json in the same way. It checks a channel for sharded repodata,
returning an object that implements the ShardLike interface.
conda/_private/shards/subset.py#
conda/_private/shards/subset.py accepts a list of ShardLike instances and a list of initial
packages to compute a repodata subset. The traversal is simplified thanks to the
ShardLike interface, so the algorithm doesn’t have to worry too much about the
type of each channel.
conda/_private/shards/cache.py#
conda/_private/shards/cache.py implements a sqlite3 cache used to store individual shards.
When traversing shards, the cache is checked before making a network request.
The shards cache is a single database for all channels in
$CONDA_PREFIX/pkgs/cache/repodata_shards.db.
The shards index repodata_shards.msgpack.zst is cached in the same way as
repodata.json, in individual files in $CONDA_PREFIX/pkgs/cache/ named after
URL hashes. A has_<format> remembers if a channel has shards, or not. If
has_shards is false then we wait 7 days after last_checked to make another
request looking for repodata_shards.msgpack.zst. The same system remembers
whether a channel provides repodata.json.zst, and stores ETag and
Last-Modified used to refresh the cache.
...
"has_shards": {
"last_checked": "2025-10-15T17:19:44.408989Z",
"value": true
},
conda/_private/shards/typing.py#
conda/_private/shards/typing.py provides type hints for data structures used in sharded
repodata, but it is not normative; it only includes fields used by the sharded
repodata system.
conda/_private/shards/misc.py#
conda/_private/shards/misc.py provides URL helpers, batching utilities, and
connection-pool configuration used by the other shards modules.
tests/shards/#
Tests under tests/shards/ cover the shards-related code in
conda/_private/shards/*.py.
Example dependency graph for Python#
This is what Python’s dependencies look like on conda-forge as of this writing.
If sharded repodata is asked to install Python, we look for python in every
active channel. The python shard(s) tells us we can fetch bzip2, libffi,
... in parallel, discovering a third layer including icu, ca-certificates,
and others. ca-certificates also depends on some virtual packages, but the
traversal quickly determines that these packages don’t appear in any channel by
checking the repodata_shards.msgpack.zst index. The solver will let us know if
these missing packages are a problem, virtual or no.
The first draft of sharded repodata in conda literally generated classic
repodata.json with package subsets to load into the solver, but now the solver
gets a subset that yields individual package records, so that it can convert
each record into solver objects in memory.
The subset gives the solver every possible dependency for a specific request. The transfer and parsing saved by not processing the full repodata makes up for the time spent generating a subset.
flowchart LR
python --> bzip2
python --> libffi
python --> libzlib
python --> ncurses
python --> openssl
python --> readline
python --> sqlite
python --> tk
python --> tzdata
python --> xz
python --> libsqlite
python --> libcxx
python --> zlib
python --> __osx
python --> liblzma
python --> libexpat
python --> libmpdec
python --> python_abi
python --> zstd
python --> _python_rc
python --> expat
python --> libiconv
libsqlite --> icu
openssl --> ca-certificates
python_abi --> pypy3.6
python_abi --> pypy3.7
python_abi --> pypy3.8
python_abi --> pypy3.9
xz --> liblzma-devel
xz --> xz-gpl-tools
xz --> xz-tools
zstd --> lz4-c
ca-certificates --> __win
ca-certificates --> __unix