Benchmarks
This document benchmarks diffhouse against PyDriller, a popular Python-based repository mining tool. Benchmarks were run with pyperf on GitHub Actions runners running Ubuntu 24.04.
Please note the following caveats:
- Repositories vary greatly in content and structure, so these benchmarks are intended as general performance indicators rather than definitive results.
- Each run executes git clone, so runtimes may fluctuate with network speed, which adds sampling noise.
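To gauge how much of a run's variance comes from the clone step, the clone and the mining phase can be timed separately. The sketch below is illustrative only and is not how the published numbers were produced (those came from pyperf). The function name `time_phases` and its `clone` and `mine` callables are hypothetical placeholders for whatever performs the clone and the extraction.

```python
import statistics
import time
from typing import Callable, Dict


def time_phases(
    clone: Callable[[], None],
    mine: Callable[[], None],
    runs: int = 5,
) -> Dict[str, float]:
    """Time a (hypothetical) clone step and mining step separately.

    Returns the median and standard deviation, in seconds, of each phase.
    """
    if runs < 2:
        raise ValueError("at least two runs are needed to estimate spread")

    clone_times = []
    mine_times = []
    for _ in range(runs):
        start = time.perf_counter()
        clone()  # network-bound phase, expected to be the noisier one
        middle = time.perf_counter()
        mine()  # local phase: commit and diff extraction
        end = time.perf_counter()
        clone_times.append(middle - start)
        mine_times.append(end - middle)

    return {
        "clone_median": statistics.median(clone_times),
        "clone_stdev": statistics.stdev(clone_times),
        "mine_median": statistics.median(mine_times),
        "mine_stdev": statistics.stdev(mine_times),
    }
```

If the clone spread dominates the mining spread, differences between tools are best read from the mining phase alone.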
In all charts, lower is better.
Mid-Sized Repositories [1]
We used the 1,000-commit tween.js repository as a representative case. diffhouse was more than 2x faster overall, and its commit extraction took about 75% less time than PyDriller's.
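The two figures above use different units: a speedup factor ("2x faster") and a relative time reduction ("75% less time"). They are related by speedup = 1 / (1 - reduction). The helper names below are illustrative and not part of either tool.

```python
def time_reduction_to_speedup(reduction: float) -> float:
    """Convert a fractional time reduction (0.75 = 75% less time) to a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def speedup_to_time_reduction(speedup: float) -> float:
    """Convert a speedup factor (2.0 = 2x faster) to a fractional time reduction."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


print(time_reduction_to_speedup(0.75))  # 4.0
print(speedup_to_time_reduction(2.0))   # 0.5
```

So 75% less time in commit extraction corresponds to a 4x speedup for that step, while a 2x overall speedup corresponds to a 50% shorter total runtime.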
Large Codebases [2]
For the repository with ~10k commits, PyDriller slowed down significantly while diffhouse kept a steady pace, resulting in a substantially shorter runtime.
Binary Stores [3]
For repositories with lots of binary content, the gap narrowed; the speedup gained via diffhouse was less than 2x.
Tiny Projects [4]
Small repos finish in a few seconds with either tool; PyDriller is still slower, but only slightly.