Summarise file activity by time period and structural path

Aggregates file-level observations (e.g. from scan_storage()) into time-based summaries grouped by a deterministic structural path prefix.

Usage

summarise_activity(
  df,
  extensions = c("r", "bak"),
  path_col = "rel_path",
  time_unit = c("week", "month", "day", "year"),
  max_files = 20
)

summarize_activity(
  df,
  extensions = c("r", "bak"),
  path_col = "rel_path",
  time_unit = c("week", "month", "day", "year"),
  max_files = 20
)

Arguments

df

A data.frame representing a filesystem snapshot. Must conform to the canonical schema (see normalise_snapshot_schema()), including:

rel_path
filename
mtime (POSIXct)
extension
optionally git_tracked
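
As a rough illustration of the columns listed above, a minimal snapshot could be built by hand as follows. The paths, timestamps, and Git flags are invented for demonstration, and storing extension in lower case is an assumption; real input would normally come from scan_storage() and normalise_snapshot_schema().

```r
# Illustrative snapshot only: every value below is made up.
df_demo <- data.frame(
  rel_path = c(
    "_packages/iocodelists/R/build.R",
    "_packages/iocodelists/R/utils.R",
    "_packages/iocodelists/tests/test-build.R"
  ),
  mtime = as.POSIXct(
    c("2026-04-20 09:15:00", "2026-04-22 16:40:00", "2026-04-22 17:05:00"),
    tz = "UTC"
  ),
  git_tracked = c(TRUE, FALSE, TRUE),  # optional column
  stringsAsFactors = FALSE
)

# Derive the remaining columns from rel_path.
df_demo$filename  <- basename(df_demo$rel_path)
# Assumption: extensions are stored in lower case, without a leading dot.
df_demo$extension <- tolower(tools::file_ext(df_demo$rel_path))

str(df_demo)
```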

extensions

Character vector of file extensions to include (case-insensitive, without leading dots).
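
Case-insensitive matching of this kind can be sketched by lower-casing both sides before comparing. The helper keep_extensions() is hypothetical and only illustrates the idea; it is not part of the package.

```r
# Hypothetical helper: TRUE where `ext` matches one of `extensions`,
# ignoring case. Extensions are expected without leading dots.
keep_extensions <- function(ext, extensions) {
  tolower(ext) %in% tolower(extensions)
}

keep_extensions(c("R", "bak", "csv", "BAK"), c("r", "bak"))
```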

path_col

Character. Name of the column containing file paths (default: "rel_path").

time_unit

Character. Time unit used to build the period buckets. One of "week", "month", "day", "year".

max_files

Integer. Maximum number of file names listed in file_names per group (default: 20).

Value

A data.frame with one row per (period × group_path), containing:

period: Time bucket identifier (e.g. "2026-17").
group_path: Project-level grouping key derived from the leading components of rel_path, typically the project and its module (e.g. _packages/iocodelists/R).
start: Earliest modification date in the group.
end: Latest modification date in the group.
file_names: Pipe-separated list of filenames, truncated to at most max_files entries.
n_files: Number of file observations in the group.
n_unique_files: Number of distinct files (rel_path) in the group.
untracked: Number of files not tracked by Git (if git_tracked is available in df).
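
To make the columns above concrete, the following sketch summarises a single (period × group_path) group in base R. The helper summarise_group() is hypothetical, and several details are assumptions rather than documented behaviour: the "|" separator without surrounding spaces, silent truncation of file_names, and counting untracked per observation rather than per distinct file.

```r
# Hypothetical helper: summarise one group of file observations.
summarise_group <- function(g, max_files = 20) {
  # Assumption: show distinct filenames, cut off after max_files entries.
  names_shown <- utils::head(unique(g$filename), max_files)
  data.frame(
    start          = as.Date(min(g$mtime), tz = "UTC"),
    end            = as.Date(max(g$mtime), tz = "UTC"),
    file_names     = paste(names_shown, collapse = "|"),
    n_files        = nrow(g),                    # all observations
    n_unique_files = length(unique(g$rel_path)), # distinct files
    # Assumption: counted per observation; NA when git_tracked is absent.
    untracked      = if ("git_tracked" %in% names(g)) {
      sum(!g$git_tracked, na.rm = TRUE)
    } else {
      NA_integer_
    },
    stringsAsFactors = FALSE
  )
}

# Toy group: the same file observed twice, plus one untracked file.
g <- data.frame(
  rel_path = c(
    "_packages/iocodelists/R/build.R",
    "_packages/iocodelists/R/build.R",
    "_packages/iocodelists/R/utils.R"
  ),
  filename = c("build.R", "build.R", "utils.R"),
  mtime = as.POSIXct(
    c("2026-04-20 09:15:00", "2026-04-21 11:00:00", "2026-04-22 16:40:00"),
    tz = "UTC"
  ),
  git_tracked = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

res <- summarise_group(g, max_files = 1)
res
```

In this toy group n_files and n_unique_files differ because one file appears in two observations.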

Details

The function derives:

a time bucket (period) from file modification times (mtime)
a grouping key (group_path) derived from the project and its immediate subdirectory (module), using an internal structural parser

and summarises activity within each (period × group_path) combination.
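
The two derived keys can be approximated with a short sketch. Both helpers are hypothetical stand-ins for the internal parser, not the package's implementation. The sketch assumes ISO 8601 week numbering for weekly periods, illustrative formats for the other units, and a grouping depth of three leading directories (matching the _packages/iocodelists/R example).

```r
# Hypothetical helper: map modification times to a period label.
derive_period <- function(mtime,
                          time_unit = c("week", "month", "day", "year")) {
  time_unit <- match.arg(time_unit)
  fmt <- switch(
    time_unit,
    week  = "%G-%V",     # assumption: ISO 8601 week-based year and week
    month = "%Y-%m",
    day   = "%Y-%m-%d",
    year  = "%Y"
  )
  format(mtime, fmt, tz = "UTC")
}

# Hypothetical helper: keep the leading `depth` directories of each path.
derive_group_path <- function(rel_path, depth = 3L) {
  parts <- strsplit(rel_path, "/", fixed = TRUE)
  vapply(parts, function(p) {
    dirs <- p[-length(p)]  # drop the filename itself
    paste(utils::head(dirs, depth), collapse = "/")
  }, character(1))
}

mtime <- as.POSIXct("2026-04-20 09:15:00", tz = "UTC")
derive_period(mtime, "week")
derive_period(mtime, "month")
derive_group_path("_packages/iocodelists/R/build.R")
```

Because both helpers are pure functions of mtime and rel_path, repeated runs on the same input yield the same keys.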

This provides a reproducible, structure-aware view of observed activity, suitable for exploratory analysis, forensic reconstruction, and audit workflows.

This function operates on observational data:

grouping is structural and deterministic, based on the first components of rel_path, typically corresponding to project and module folders (e.g. R, tests, data-raw)
no assumptions are made about project structure or file roles
identical inputs always produce identical outputs

The group_path is a project–module level projection of rel_path. It is derived by extracting the first components of the path (e.g. _packages/iocodelists/R) and is intended for aggregation and reporting.

The output is intended for analysis and reporting, not for file-level identity or joins. For identity, use rel_path.

Modification times (mtime) are treated as a proxy for activity. They indicate observed changes, not a complete editing history.

The grouping aligns with typical project layouts (e.g. R packages), where the first directory levels correspond to project boundaries and functional modules.

Files under .Trash are excluded by default.
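
The .Trash exclusion mentioned above could be approximated with a path filter like the one below. The helper is_trashed() is hypothetical, and the exact matching rule used by the package may differ; this sketch only matches a path component named exactly .Trash.

```r
# Hypothetical helper: TRUE for paths with a component named ".Trash".
is_trashed <- function(rel_path) {
  grepl("(^|/)\\.Trash(/|$)", rel_path)
}

paths <- c(
  ".Trash/old_script.R",
  "_packages/iocodelists/R/build.R",
  "_packages/.Trash/utils.bak"
)

# Keep only paths that are not under a .Trash directory.
kept <- paths[!is_trashed(paths)]
kept
```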

Examples

if (FALSE) { # \dontrun{
df <- scan_storage("D:/_eviota")

# Weekly overview
summarise_activity(df, time_unit = "week")

# Monthly overview
summarise_activity(df, time_unit = "month")
} # }