User Guide

This guide covers the basic use cases of diffhouse. For a full list of objects, see the API Reference.

Installation

Install diffhouse from PyPI:

pip install diffhouse

Optional Dependencies

If you plan to combine diffhouse with pandas or Polars, install the package with their respective extras:

pandas	`pip install diffhouse[pandas]`
Polars	`pip install diffhouse[polars]`

Quickstart

from diffhouse import Repo

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10], c.date, c.author_email)

    if len(r.branches.to_list()) > 100:
        print('🎉')

    df = r.diffs.to_pandas()

To start, create a Repo instance by passing either a Git-hosting URL or a local path as its source argument. Next, use the Repo in a with statement to clone the source into a local, non-persistent location.
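
The idea of a resource that exists only for the duration of a with block can be pictured with a small stand-in. This is a sketch built on the standard library; TempWorkspace is a hypothetical name used for illustration and says nothing about how diffhouse manages its clone internally.

```python
import shutil
import tempfile
from pathlib import Path


class TempWorkspace:
    """Hypothetical stand-in for a resource that lives only inside a with block."""

    def __enter__(self) -> Path:
        # Create a scratch directory when the with block is entered.
        self._path = Path(tempfile.mkdtemp(prefix='workspace-'))
        return self._path

    def __exit__(self, exc_type, exc, tb) -> None:
        # Remove the directory on exit, even if the block raised.
        shutil.rmtree(self._path, ignore_errors=True)


with TempWorkspace() as path:
    existed_inside = path.is_dir()

exists_after = path.exists()
print(existed_inside, exists_after)  # True False
```

The practical consequence is the same for any such resource: work that depends on it belongs inside the with block.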

Inside the with block, you can access data through the following properties:

Property	Description	Record Type
`Repo.commits`	Commit history of the repository.	`Commit`
`Repo.filemods`	File modifications across the commit history.	`FileMod`
`Repo.diffs`	Source code changes across the commit history.	`Diff`
`Repo.branches`	Branches of the repository.	`Branch`
`Repo.tags`	Tags of the repository.	`Tag`

Querying Results

Data accessors like Repo.commits are Extractor objects that can output their results in various formats.

Looping Through Objects

You can use extractors in a for loop to process objects one by one. Data will be extracted on demand for memory efficiency:

with Repo('https://github.com/user/repo') as r:
    for c in r.commits:
        print(c.commit_hash[:10])
        print(c.author_name)

        if c.in_main:
            break
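
Why breaking out of the loop early saves work can be illustrated with a plain Python generator. This is a minimal sketch, assuming only that on-demand extraction behaves like lazy iteration; fake_commits and its fields are hypothetical stand-ins, not part of diffhouse.

```python
from collections.abc import Iterator

produced = []  # records which items were actually generated


def fake_commits(n: int) -> Iterator[dict]:
    """Hypothetical stand-in for an extractor; yields records one at a time."""
    for i in range(n):
        produced.append(i)
        yield {'commit_hash': f'{i:040x}', 'in_main': i == 2}


for c in fake_commits(1_000_000):
    if c['in_main']:
        break

print(len(produced))  # 3
```

Although the generator could yield a million records, only the three consumed before the break are ever created.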

iter_dicts() is an alternative to plain iteration that yields dictionaries instead of diffhouse objects. This is useful, for example, when writing results to a newline-delimited JSON file:

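import json

from diffhouse import Repo

# Parenthesized context managers are officially supported since Python 3.10.
with (
    Repo('https://github.com/user/repo') as r,
    open('commits.jsonl', 'w') as f
):
    for c in r.commits.iter_dicts():
        f.write(json.dumps(c) + '\n')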
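
A newline-delimited JSON file can be read back one line at a time with the standard library alone. The sketch below round-trips a couple of made-up records; the field values are placeholders, and the records only stand in for whatever iter_dicts() yields.

```python
import json
import tempfile
from pathlib import Path

# Made-up records standing in for the dictionaries yielded by iter_dicts().
records = [
    {'commit_hash': 'a' * 40, 'author_name': 'Ada'},
    {'commit_hash': 'b' * 40, 'author_name': 'Bob'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'commits.jsonl'

    # Write one JSON document per line.
    with open(path, 'w', encoding='utf-8') as f:
        for rec in records:
            f.write(json.dumps(rec) + '\n')

    # Read the file back line by line; each line parses independently.
    with open(path, encoding='utf-8') as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```

Because every line is a complete JSON document, the file can be processed as a stream without loading it in full.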

Converting to Dataframes

pandas and Polars DataFrame APIs are supported out of the box. To convert result sets to dataframes, call the following methods:

to_pandas() or pd() for pandas
to_polars() or pl() for Polars

with Repo('https://github.com/user/repo') as r:
    df1 = r.filemods.to_pandas()  # pandas
    df2 = r.diffs.to_polars()  # Polars

Preliminary Filtering

You can filter data along certain dimensions before processing takes place to reduce extraction time and/or network load.

Note

Filters are a work in progress. Additional options like date and branch filtering are planned for future releases.

Skipping File Downloads

If no blob-level data is needed, pass blobs=False when creating the Repo to skip file downloads during cloning. Note that the following will then not be populated:

files_changed, lines_added and lines_deleted fields of Repo.commits
Repo.filemods
Repo.diffs

with Repo('https://github.com/user/repo', blobs=False) as r:
    for b in r.branches:
        pass  # business as usual

    r.filemods  # raises FilterError