Benchmarks

This document benchmarks diffhouse against PyDriller, a popular Python-based repository mining tool. All benchmarks were run with pyperf on GitHub Actions runners using Ubuntu 24.04.

Please note the following caveats:

  • Repositories vary greatly in content and structure, so these benchmarks are intended as general performance indicators rather than definitive results.
  • Each run executes git clone, so runtimes may fluctuate with network speed, which adds sampling noise.
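To gauge how much of a measurement comes from cloning rather than mining, the two phases can be timed separately. The following is a minimal sketch using only the standard library; it is not the pyperf setup used for these benchmarks. The helper `time_phases` and its `setup` and `work` callables are hypothetical names introduced for illustration.

```python
import statistics
import time
from typing import Callable, Dict


def time_phases(
    setup: Callable[[], object],
    work: Callable[[object], object],
    runs: int = 5,
) -> Dict[str, float]:
    """Time a setup phase (e.g. cloning) and a work phase (e.g. mining) separately.

    `setup` and `work` are placeholders supplied by the caller; this helper
    only measures wall-clock time around each call.
    """
    setup_times = []
    work_times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        state = setup()          # e.g. clone the repository
        t1 = time.perf_counter()
        work(state)              # e.g. extract commits from the clone
        t2 = time.perf_counter()
        setup_times.append(t1 - t0)
        work_times.append(t2 - t1)

    def spread(samples):
        # Standard deviation needs at least two samples.
        return statistics.stdev(samples) if len(samples) > 1 else 0.0

    return {
        "setup_mean": statistics.mean(setup_times),
        "setup_stdev": spread(setup_times),
        "work_mean": statistics.mean(work_times),
        "work_stdev": spread(work_times),
    }
```

If `setup_stdev` is large relative to `work_mean`, most of the run-to-run variation is likely coming from the network rather than from the tool under test.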

In all charts, lower is better.

Mid-Sized Repositories [1]

We used the 1,000-commit tween.js repo as a representative case. diffhouse was more than 2x faster overall, and commit extraction took about 75% less time than with PyDriller.

Figure: tweenjs/tween.js benchmark results
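The results above mix two ways of stating a gain: a speedup factor ("2x faster") and a time reduction ("75% less time"). The sketch below converts between them; the function names are hypothetical and the inputs are the rounded figures quoted in the text, not raw measurements.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.75 = 75% less time) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def reduction_from_speedup(speedup: float) -> float:
    """Convert a speedup factor (2.0 = 2x faster) to a fractional time reduction."""
    if speedup <= 0.0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# About 75% less time on commit extraction corresponds to roughly a 4x speedup.
print(speedup_from_reduction(0.75))  # 4.0
# A 2x overall speedup corresponds to 50% less time.
print(reduction_from_speedup(2.0))   # 0.5
```

Read this way, a speedup below 2x means the faster tool still needs more than half of the slower tool's runtime.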

Large Codebases [2]

On scrapy/scrapy, a repo with ~10k commits, PyDriller slowed down significantly while diffhouse kept a good pace, resulting in a major runtime improvement.

Figure: scrapy/scrapy benchmark results

Binary Stores [3]

For repositories with a lot of binary content, the gap narrowed: diffhouse's speedup was less than 2x.

Figure: sqlparser/sqlflow_public benchmark results

Tiny Projects [4]

Small repos finished in a few seconds with either tool; PyDriller was still slower, but only slightly.

Figure: microsoft/Detours benchmark results