User Guide
This guide covers the basic use cases of diffhouse. For a full list of objects, see the API Reference.
Installation
Install diffhouse from PyPI:
pip install diffhouse
Optional Dependencies
If you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:
| Library | Install Command |
|---|---|
| pandas | pip install diffhouse[pandas] |
| Polars | pip install diffhouse[polars] |
Quickstart
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10], c.date, c.author_email)

    if len(r.branches.to_list()) > 100:
        print('🎉')

    df = r.diffs.to_pandas()
To start, create a Repo instance by passing either a Git-hosting URL or a local path as its source argument. Next, use the Repo in a with statement to clone the source into a local, non-persistent location.
Inside the with block, you can access data through the following properties:
| Property | Description | Record Type |
|---|---|---|
| Repo.commits | Commit history of the repository. | Commit |
| Repo.filemods | File modifications across the commit history. | FileMod |
| Repo.diffs | Source code changes across the commit history. | Diff |
| Repo.branches | Branches of the repository. | Branch |
| Repo.tags | Tags of the repository. | Tag |
Querying Results
Data accessors like Repo.commits are Extractor objects and can output their results in various formats:
Looping Through Objects
You can use extractors in a for loop to process objects one by one. Data is extracted on demand, which keeps memory usage low:
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10])
        print(c.author_name)

        # stop iterating as soon as in_main is truthy
        if c.in_main:
            break
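Because extractors can be consumed in a for loop, they should also work with standard-library tools that accept any iterable. The sketch below tallies commits per author e-mail in a single pass. It assumes only that each record exposes the author_email attribute used in the Quickstart; count_by_author is a hypothetical helper, not part of diffhouse, and the SimpleNamespace records are stand-ins for real Commit objects.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_author(commits):
    """Tally commits per author e-mail in one pass over any iterable."""
    return Counter(c.author_email for c in commits)


# Stand-in records for illustration only. With diffhouse, you would pass
# r.commits from inside the `with Repo(...) as r:` block instead.
sample = [
    SimpleNamespace(author_email='a@example.com'),
    SimpleNamespace(author_email='b@example.com'),
    SimpleNamespace(author_email='a@example.com'),
]

counts = count_by_author(sample)
print(counts.most_common(1))  # [('a@example.com', 2)]
```

Since the helper consumes its input lazily, it never needs the full commit history in memory at once.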
iter_dicts() is a for loop alternative that yields dictionaries instead of diffhouse objects. A good use case for this is writing results into a newline-delimited JSON file:
import json

from diffhouse import Repo

with (
    Repo('https://github.com/user/repo') as r,
    open('commits.jsonl', 'w') as f,
):
    for c in r.commits.iter_dicts():
        # write one JSON object per line
        f.write(json.dumps(c) + '\n')
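A file written this way can be read back with the standard library alone. The following is a minimal sketch: read_jsonl is a hypothetical helper rather than a diffhouse function, and the keys in the sample records (commit_hash, author_name) are assumed to mirror the attribute names shown earlier. The demo writes a throwaway file so that it runs without a repository.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Yield one dictionary per non-empty line of a newline-delimited JSON file."""
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Illustrative records; real ones would come from r.commits.iter_dicts().
records = [
    {'commit_hash': 'abc123', 'author_name': 'Ada'},
    {'commit_hash': 'def456', 'author_name': 'Grace'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'commits.jsonl')
    with open(path, 'w', encoding='utf-8') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')

    # Materialize the generator while the temporary file still exists.
    loaded = list(read_jsonl(path))

print(loaded == records)  # True
```

Reading line by line mirrors the writing side: neither direction requires holding the whole result set in memory.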
Converting to Dataframes
pandas and Polars DataFrame APIs are supported out of the box. To convert result sets to dataframes, call the following methods:
- to_pandas() or pd() for pandas
- to_polars() or pl() for Polars
from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    df1 = r.filemods.to_pandas()  # pandas
    df2 = r.diffs.to_polars()  # Polars
Preliminary Filtering
You can filter data along certain dimensions before processing takes place to reduce extraction time and/or network load.
Note
Filters are a work-in-progress feature. Additional options like date and branch filtering are planned for future releases.
Skipping File Downloads
If no blob-level data is needed, pass blobs=False when creating the Repo to skip file downloads during cloning. Note that the following will not be populated:
- files_changed, lines_added and lines_deleted fields of Repo.commits
- Repo.filemods
- Repo.diffs
from diffhouse import Repo

with Repo('https://github.com/user/repo', blobs=False) as r:
    for b in r.branches:
        pass  # business as usual

    r.filemods  # raises FilterError