API Reference
diffhouse
Repository mining tool for structuring Git metadata at scale.
Repo
Git repository wrapper for querying mined data.
When used via its load() method or in a with statement, it sets up a
temporary clone to retrieve information; this may take a long time,
depending on the repository size and network speed.
__init__
__init__(
location: str,
blobs: bool = False,
verbose: bool = False,
)
Initialize the repository.
When sourcing from a local path, the blobs = False filter
may not be available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| location | str | URL or local path pointing to a git repository. | required |
| blobs | bool | Whether to load file content and extract associated metadata. | False |
| verbose | bool | Whether to log progress to stdout. | False |
load
load() -> Repo
Load all repository data into memory.
This is a convenience method for accessing objects without the with
statement. For large repositories, consider the streaming methods
(stream_commits, stream_changed_files, stream_diffs) instead.
Returns:
| Type | Description |
|---|---|
| Repo | self |
branches
branches: list[str]
Branch names of the repository.
tags
tags: list[str]
Tag names of the repository.
changed_files
changed_files: list[ChangedFile]
Files of all default-branch commits.
Requires load() and blobs = True.
stream_commits
stream_commits() -> Iterator[Commit]
Stream the commit history of the default branch.
Requires wrapping the Repo in a with statement.
Yields:
| Type | Description |
|---|---|
| Commit | Commit data. |
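As a minimal sketch of consuming the stream, the helper below tallies commits per author email. It relies only on the author_email attribute documented for Commit; the top_authors name, the stand-in records, and the `from diffhouse import Repo` import path in the trailing comment are assumptions made for illustration.

```python
from collections import Counter
from types import SimpleNamespace


def top_authors(commits, n=3):
    """Return the n most frequent author emails as (email, count) pairs."""
    counts = Counter(commit.author_email for commit in commits)
    return counts.most_common(n)


# Stand-in records; real ones would be yielded by Repo.stream_commits().
sample = [
    SimpleNamespace(author_email="ann@example.com"),
    SimpleNamespace(author_email="bob@example.com"),
    SimpleNamespace(author_email="ann@example.com"),
]
print(top_authors(sample))

# Assumed wiring with the library (the import path is an assumption):
#   from diffhouse import Repo
#   with Repo("https://github.com/user/repo") as repo:
#       print(top_authors(repo.stream_commits()))
```

Because the helper iterates once and keeps only the counters, it should work on a stream without holding every Commit in memory.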
stream_changed_files
stream_changed_files() -> Iterator[ChangedFile]
Stream the files of all default-branch commits.
Requires blobs = True and wrapping the Repo in a with statement.
Yields:
| Type | Description |
|---|---|
| ChangedFile | File change metadata. |
stream_diffs
stream_diffs() -> Iterator[Diff]
Stream diffs of all default-branch commits.
Requires blobs = True and wrapping the Repo in a with statement.
Yields:
| Type | Description |
|---|---|
| Diff | Text diffs. |
location
location: str
Location where the repository was cloned from.
Either a remote URL or a local file URI, depending on the original input.
Commit
Commit metadata.
to_dict
to_dict() -> dict
Convert the object to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict | A dictionary representation of the commit. |
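One possible use of to_dict() is exporting records as JSON Lines. The sketch below assumes the returned dictionary holds only JSON-serializable values, which seems plausible given the documented field types but is not confirmed here. write_jsonl and FakeCommit are hypothetical names; FakeCommit only mimics the to_dict() method, and its keys are illustrative.

```python
import io
import json


def write_jsonl(records, stream):
    """Write one JSON object per line; return how many records were written."""
    written = 0
    for record in records:
        stream.write(json.dumps(record.to_dict()) + "\n")
        written += 1
    return written


class FakeCommit:
    """Hypothetical stand-in that only mimics to_dict(); not a library class."""

    def __init__(self, commit_hash, message_subject):
        self.commit_hash = commit_hash
        self.message_subject = message_subject

    def to_dict(self):
        # The real key names are assumed to mirror the attribute names.
        return {
            "commit_hash": self.commit_hash,
            "message_subject": self.message_subject,
        }


buffer = io.StringIO()
count = write_jsonl(
    [FakeCommit("a1b2", "Fix typo"), FakeCommit("c3d4", "Add tests")],
    buffer,
)
```

Passing an open file instead of the StringIO buffer would write the same lines to disk.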
commit_hash
commit_hash: str
Full hash of the commit.
is_merge
is_merge: bool
Whether the commit is a merge commit.
parents
parents: list[str]
List of parent commit hashes.
author_name
author_name: str
Author name.
author_email
author_email: str
Author email.
author_date
author_date: str
Original commit date and time.
Formatted as an ISO 8601 datetime string (YYYY-MM-DDTHH:MM:SS±HH:MM).
committer_name
committer_name: str
Committer name.
committer_email
committer_email: str
Committer email.
committer_date
committer_date: str
Actual commit date and time.
Formatted as an ISO 8601 datetime string (YYYY-MM-DDTHH:MM:SS±HH:MM).
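Since author_date and committer_date are ISO 8601 strings with a UTC offset, the standard library can parse them directly. The following sketch computes the delay between authoring and committing; commit_lag is a hypothetical helper and the timestamps are made up.

```python
from datetime import datetime, timedelta


def commit_lag(author_date: str, committer_date: str) -> timedelta:
    """Time between authoring and committing, from ISO 8601 strings."""
    authored = datetime.fromisoformat(author_date)
    committed = datetime.fromisoformat(committer_date)
    # Both values carry an offset, so subtraction accounts for time zones.
    return committed - authored


lag = commit_lag("2024-03-01T10:00:00+02:00", "2024-03-01T09:30:00+00:00")
print(lag)
```

A large lag can hint at rebased or cherry-picked commits, though that interpretation depends on the project's workflow.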
message_subject
message_subject: str
Commit message subject.
message_body
message_body: str
Commit message body.
files_changed
files_changed: int | None
Number of files changed in the commit.
Available if blobs = True.
lines_added
lines_added: int | None
Number of lines inserted in the commit.
Available if blobs = True.
lines_deleted
lines_deleted: int | None
Number of lines deleted in the commit.
Available if blobs = True.
ChangedFile
Snapshot of a file that was modified in a specific commit.
to_dict
to_dict() -> dict
Convert the object to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict | A dictionary representation of the changed file. |
commit_hash
commit_hash: str
Full hash of the commit.
path_a
path_a: str
Path to the file before applying the commit's changes.
path_b
path_b: str
Path to the file after applying the commit's changes.
Differs from path_a for renames and copies.
changed_file_id
changed_file_id: str
Unique record identifier hashed from commit_hash, path_a, and path_b.
change_type
change_type: str
Single-letter code representing the change type.
Most commonly one of A (added), C (copied), D (deleted), M (modified),
or R (renamed). See git-status for all possible values.
similarity
similarity: int
Similarity index between the two file versions.
0-100 for renames and copies, 100 otherwise.
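To illustrate how change_type and similarity combine, this sketch keeps only renames whose content also changed, that is, records with change_type "R" and a similarity below 100. edited_renames is a hypothetical helper and the records are stand-ins with just the attributes it reads.

```python
from types import SimpleNamespace


def edited_renames(files, threshold=100):
    """Keep renamed files whose similarity index is below the threshold."""
    return [
        f for f in files
        if f.change_type == "R" and f.similarity < threshold
    ]


# Stand-in records; real ones would be ChangedFile objects.
files = [
    SimpleNamespace(path_a="a.py", path_b="b.py", change_type="R", similarity=100),
    SimpleNamespace(path_a="old.py", path_b="new.py", change_type="R", similarity=87),
    SimpleNamespace(path_a="x.py", path_b="x.py", change_type="M", similarity=100),
]
for f in edited_renames(files):
    print(f.path_a, "->", f.path_b, f.similarity)
```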
lines_added
lines_added: int
Number of lines added to the file in the commit.
lines_deleted
lines_deleted: int
Number of lines deleted from the file in the commit.
Diff
Changes made to a hunk of code in a specific commit.
to_dict
to_dict() -> dict
Convert the object to a dictionary.
Returns:
| Type | Description |
|---|---|
| dict | A dictionary representation of the diff. |
commit_hash
commit_hash: str
Full hash of the commit.
path_a
path_a: str
Path to the file before applying the commit's changes.
path_b
path_b: str
Path to the file after applying the commit's changes.
Differs from path_a for renames and copies.
changed_file_id
changed_file_id: str
Hash of commit_hash, path_a, and path_b.
Use it to match with a ChangedFile.
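The matching described above can be sketched as a simple join on changed_file_id. diffs_by_file is a hypothetical helper; the stand-in records carry only the attributes it needs, and with the library both inputs would presumably come from a Repo created with blobs = True.

```python
from collections import defaultdict
from types import SimpleNamespace


def diffs_by_file(changed_files, diffs):
    """Map each changed_file_id to a (changed_file, [diffs]) pair."""
    grouped = defaultdict(list)
    for diff in diffs:
        grouped[diff.changed_file_id].append(diff)
    return {
        f.changed_file_id: (f, grouped.get(f.changed_file_id, []))
        for f in changed_files
    }


# Stand-in records; real ones would be ChangedFile and Diff objects.
files = [
    SimpleNamespace(changed_file_id="f1", path_b="a.py"),
    SimpleNamespace(changed_file_id="f2", path_b="b.py"),
]
diffs = [
    SimpleNamespace(changed_file_id="f1", start_b=1),
    SimpleNamespace(changed_file_id="f1", start_b=40),
]
joined = diffs_by_file(files, diffs)
```

Note that this groups all diffs in memory first, so for very large histories a per-commit join may be preferable.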
start_a
start_a: int
Line number that starts the hunk in file version A.
length_a
length_a: int
Line count of the hunk in file version A.
start_b
start_b: int
Line number that starts the hunk in file version B.
length_b
length_b: int
Line count of the hunk in file version B.
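Assuming these four fields correspond to the numbers in a unified diff hunk header, they can be recombined as shown below. hunk_header is a hypothetical helper, and the record is a stand-in for a Diff.

```python
from types import SimpleNamespace


def hunk_header(diff):
    """Format start/length fields as a unified diff hunk header."""
    return (
        f"@@ -{diff.start_a},{diff.length_a} "
        f"+{diff.start_b},{diff.length_b} @@"
    )


# Stand-in record; a real one would be a Diff object.
hunk = SimpleNamespace(start_a=10, length_a=3, start_b=10, length_b=5)
print(hunk_header(hunk))
```

Git abbreviates a count of 1 by omitting it; this sketch always writes the count explicitly, which the unified format also accepts.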
lines_added
lines_added: int
Number of lines added.
lines_deleted
lines_deleted: int
Number of lines deleted.
additions
additions: list[str]
Text content of added lines.
deletions
deletions: list[str]
Text content of deleted lines.