
API Reference

diffhouse

Repository mining tool for structuring Git metadata at scale.

Repo

Git repository wrapper for querying mined data.

When used via its load() method or in a with statement, it sets up a temporary clone to retrieve information. Cloning may take a long time, depending on repository size and network speed.

__init__

__init__(
    location: str,
    blobs: bool = False,
    verbose: bool = False,
)

Initialize the repository.

When sourcing from a local path, the blobs = False filter may not be available.

Parameters:

Name | Type | Description | Default
location | str | URL or local path pointing to a git repository. | required
blobs | bool | Whether to load file content and extract associated metadata. | False
verbose | bool | Whether to log progress to stdout. | False

load

load() -> Repo

Load all repository data into memory.

This is a convenience method to access objects without the with statement. For large repositories, consider the streaming methods (stream_commits, stream_changed_files, stream_diffs) instead.

Returns:

Type | Description
Repo | self
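
As a minimal sketch of the eager path described above, the helper below summarizes a repository that has already been loaded. The `from diffhouse import Repo` import path and the helper names `summarize` and `load_and_summarize` are assumptions made for illustration. Only the constructor parameters, load(), and the branches, tags, and commits attributes come from this reference.

```python
from typing import Any, Dict


def summarize(repo: Any) -> Dict[str, int]:
    """Count branches, tags, and commits on a loaded Repo-like object."""
    return {
        "branches": len(repo.branches),
        "tags": len(repo.tags),
        "commits": len(repo.commits),  # documented as requiring load()
    }


def load_and_summarize(location: str) -> Dict[str, int]:
    """Load the repository at `location` into memory and summarize it."""
    # Assumption: Repo is importable from the package root.
    from diffhouse import Repo

    # load() returns the Repo itself, so the call can be chained.
    repo = Repo(location, blobs=False, verbose=False).load()
    return summarize(repo)
```

Because `summarize` only reads attributes, it can be exercised with any stand-in object, while `load_and_summarize` needs the package installed and access to the repository.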

branches

branches: list[str]

Branch names of the repository.

tags

tags: list[str]

Tag names of the repository.

commits

commits: list[Commit]

Commit history of the default branch.

Requires load().

changed_files

changed_files: list[ChangedFile]

Changed files of all default-branch commits.

Requires load() and blobs = True.

diffs

diffs: list[Diff]

Diffs of all default-branch commits.

Requires load() and blobs = True.

stream_commits

stream_commits() -> Iterator[Commit]

Stream the commit history of the default branch.

Requires wrapping the Repo in a with statement.

Yields:

Type | Description
Commit | Commit data.
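
The streaming path might look like the sketch below, which counts merge commits without holding the full history in memory. The import path `from diffhouse import Repo` and the helper names are assumptions. The with statement, stream_commits(), and the is_merge attribute are taken from this reference.

```python
from typing import Any, Iterable, Tuple


def count_merges(commits: Iterable[Any]) -> Tuple[int, int]:
    """Return (total, merges), consuming the iterable one commit at a time."""
    total = merges = 0
    for commit in commits:
        total += 1
        if commit.is_merge:
            merges += 1
    return total, merges


def stream_merge_stats(location: str) -> Tuple[int, int]:
    """Stream the default-branch history of `location` and count merges."""
    # Assumption: Repo is importable from the package root.
    from diffhouse import Repo

    # Streaming is documented as requiring the with statement.
    with Repo(location) as repo:
        return count_merges(repo.stream_commits())
```

The counting happens inside the with block, so the iterator is fully consumed while the repository is still open.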

stream_changed_files

stream_changed_files() -> Iterator[ChangedFile]

Stream the changed files of all default-branch commits.

Requires blobs = True and wrapping the Repo in a with statement.

Yields:

Type | Description
ChangedFile | File change metadata.

stream_diffs

stream_diffs() -> Iterator[Diff]

Stream diffs of all default-branch commits.

Requires blobs = True and wrapping the Repo in a with statement.

Yields:

Type | Description
Diff | Text diffs.
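
One possible use of the diff stream is searching the added lines for a marker such as "TODO". The sketch below assumes the `from diffhouse import Repo` import path, and the helper names are hypothetical. The additions, commit_hash, and path_b fields are the ones documented for Diff.

```python
from typing import Any, Iterable, Iterator, List, Tuple

Match = Tuple[str, str, str]  # (commit_hash, path_b, added line)


def added_lines_containing(diffs: Iterable[Any], needle: str) -> Iterator[Match]:
    """Yield a match for every added line that contains `needle`."""
    for diff in diffs:
        for line in diff.additions:
            if needle in line:
                yield diff.commit_hash, diff.path_b, line


def find_added_todos(location: str) -> List[Match]:
    """Collect all added lines mentioning TODO in the repository at `location`."""
    # Assumption: Repo is importable from the package root.
    from diffhouse import Repo

    # stream_diffs() is documented as requiring blobs = True and a with statement.
    with Repo(location, blobs=True) as repo:
        # Materialize the results before the with block ends.
        return list(added_lines_containing(repo.stream_diffs(), "TODO"))
```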

location

location: str

Location where the repository was cloned from.

Can either be a remote URL or a local file URI based on the original input.

Commit

Commit metadata.

to_dict

to_dict() -> dict

Convert the object to a dictionary.

Returns:

Type | Description
dict | A dictionary representation of the commit.
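
Since to_dict() returns a plain dictionary, records can be exported with standard-library tools. The sketch below writes one JSON object per line. It assumes the dictionary values are JSON-serializable, which the documented field types (str, bool, int, list, None) suggest but this reference does not state. The helper name `write_jsonl` is hypothetical.

```python
import json
from typing import Any, Iterable, TextIO


def write_jsonl(records: Iterable[Any], out: TextIO) -> int:
    """Write each record's to_dict() output as one JSON line; return the count."""
    count = 0
    for record in records:
        out.write(json.dumps(record.to_dict()) + "\n")
        count += 1
    return count
```

The function accepts any iterable, so it works with a loaded list as well as with a stream of records.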

commit_hash

commit_hash: str

Full hash of the commit.

is_merge

is_merge: bool

Whether the commit is a merge commit.

parents

parents: list[str]

List of parent commit hashes.

author_name

author_name: str

Author name.

author_email

author_email: str

Author email.

author_date

author_date: str

Date and time the change was originally authored.

Formatted as an ISO 8601 datetime string (YYYY-MM-DDTHH:MM:SS±HH:MM).

committer_name

committer_name: str

Committer name.

committer_email

committer_email: str

Committer email.

committer_date

committer_date: str

Date and time the commit was created. This may be later than author_date if the commit was amended, rebased, or cherry-picked.

Formatted as an ISO 8601 datetime string (YYYY-MM-DDTHH:MM:SS±HH:MM).
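
Both date fields are strings, so they need parsing before any arithmetic. A small sketch using the standard library's `datetime.fromisoformat`, which accepts the format shown above. The helper name `commit_delay` and the sample values are illustrative only.

```python
from datetime import datetime, timedelta


def commit_delay(author_date: str, committer_date: str) -> timedelta:
    """Time elapsed between authoring a change and committing it."""
    authored = datetime.fromisoformat(author_date)
    committed = datetime.fromisoformat(committer_date)
    # Both values carry a UTC offset, so subtraction accounts for time zones.
    return committed - authored


delay = commit_delay("2024-03-01T12:30:00+02:00", "2024-03-01T11:00:00+00:00")
print(delay)  # 0:30:00
```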

message_subject

message_subject: str

Commit message subject.

message_body

message_body: str

Commit message body.

files_changed

files_changed: int | None

Number of files changed in the commit.

Available if blobs = True.

lines_added

lines_added: int | None

Number of lines inserted in the commit.

Available if blobs = True.

lines_deleted

lines_deleted: int | None

Number of lines deleted in the commit.

Available if blobs = True.
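
Because the three counters above are typed `int | None`, code that aggregates them should handle the missing case explicitly. A minimal sketch, where the helper name `churn` is hypothetical and the assumption is that the fields are None when blobs = False:

```python
from typing import Any, Optional


def churn(commit: Any) -> Optional[int]:
    """Lines added plus lines deleted, or None when line stats are unavailable."""
    if commit.lines_added is None or commit.lines_deleted is None:
        return None
    return commit.lines_added + commit.lines_deleted
```

Returning None instead of 0 keeps "no data" distinguishable from a commit that changed no lines.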

ChangedFile

Snapshot of a file that was modified in a specific commit.

to_dict

to_dict() -> dict

Convert the object to a dictionary.

Returns:

Type | Description
dict | A dictionary representation of the changed file.

commit_hash

commit_hash: str

Full hash of the commit.

path_a

path_a: str

Path to the file before applying the commit's changes.

path_b

path_b: str

Path to the file after applying the commit's changes.

Differs from path_a for renames and copies.

changed_file_id

changed_file_id: str

Unique record identifier hashed from commit_hash, path_a, and path_b.

change_type

change_type: str

Single-letter code representing the change type.

Most commonly one of A (added), C (copied), D (deleted), M (modified) or R (renamed). See git-status for all possible values.
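
The codes listed above can be tallied with a `collections.Counter`. In this sketch the mapping `CHANGE_LABELS` and the helper `tally_change_types` are hypothetical names. Codes outside the five common ones are reported unchanged.

```python
from collections import Counter
from typing import Any, Iterable

# Labels for the common codes named above.
CHANGE_LABELS = {
    "A": "added",
    "C": "copied",
    "D": "deleted",
    "M": "modified",
    "R": "renamed",
}


def tally_change_types(files: Iterable[Any]) -> Counter:
    """Count changed files per change type, using readable labels where known."""
    return Counter(CHANGE_LABELS.get(f.change_type, f.change_type) for f in files)
```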

similarity

similarity: int

Similarity index between the two file versions.

0-100 for renames and copies, 100 otherwise.

lines_added

lines_added: int

Number of lines added to the file in the commit.

lines_deleted

lines_deleted: int

Number of lines deleted from the file in the commit.

Diff

Changes made to a hunk of code in a specific commit.

to_dict

to_dict() -> dict

Convert the object to a dictionary.

Returns:

Type | Description
dict | A dictionary representation of the diff.

commit_hash

commit_hash: str

Full hash of the commit.

path_a

path_a: str

Path to the file before applying the commit's changes.

path_b

path_b: str

Path to the file after applying the commit's changes.

Differs from path_a for renames and copies.

changed_file_id

changed_file_id: str

Hash of commit_hash, path_a, and path_b.

Use it to match with a ChangedFile.
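
The matching described above amounts to a join on changed_file_id. A minimal sketch with a hypothetical helper, `group_diffs_by_file`, that pairs each changed file with its hunks:

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple


def group_diffs_by_file(
    changed_files: Iterable[Any], diffs: Iterable[Any]
) -> List[Tuple[Any, List[Any]]]:
    """Pair every changed file with the diffs that share its changed_file_id."""
    hunks: Dict[str, List[Any]] = defaultdict(list)
    for diff in diffs:
        hunks[diff.changed_file_id].append(diff)
    # Files without any matching diff are paired with an empty list.
    return [(f, hunks.get(f.changed_file_id, [])) for f in changed_files]
```

Indexing the diffs first keeps the join linear in the number of records instead of comparing every file against every diff.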

start_a

start_a: int

Line number that starts the hunk in file version A.

length_a

length_a: int

Line count of the hunk in file version A.

start_b

start_b: int

Line number that starts the hunk in file version B.

length_b

length_b: int

Line count of the hunk in file version B.
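
The four fields above appear to correspond to the numbers in a unified diff hunk header of the form `@@ -start_a,length_a +start_b,length_b @@`. That mapping is an assumption based on the standard diff format, not something this reference states. Under that assumption, the hypothetical helpers below rebuild the header and compute the last line covered on either side.

```python
from typing import Optional


def hunk_header(start_a: int, length_a: int, start_b: int, length_b: int) -> str:
    """Format the positions as a unified diff hunk header."""
    return f"@@ -{start_a},{length_a} +{start_b},{length_b} @@"


def last_line(start: int, length: int) -> Optional[int]:
    """Last line number covered by one side of a hunk, or None if that side is empty."""
    return start + length - 1 if length > 0 else None


print(hunk_header(10, 3, 10, 5))  # @@ -10,3 +10,5 @@
print(last_line(10, 5))           # 14
```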

lines_added

lines_added: int

Number of lines added.

lines_deleted

lines_deleted: int

Number of lines deleted.

additions

additions: list[str]

Text content of added lines.

deletions

deletions: list[str]

Text content of deleted lines.