Reporting Data Structure

library(fscontext)

Conceptual Model (RiC-aligned, operational)

This package adopts a minimal and operational interpretation of the Records in Contexts (RiC) model for file-level archival tracking, time accounting, and Git integration.

The model is deliberately observational. It records what can be observed in a filesystem at a given point in time, without assuming that the observed state is complete, canonical, or historically exhaustive.

Analytical layers

The package operates across three distinct analytical layers.

1. Observational layer

Raw filesystem observations created by scan_storage().

At this stage the package only records:

observed file occurrences
filesystem metadata
optional content signatures
repository context

No interpretation is performed.

2. Contextual layer

The contextual layer supports the construction of lightweight contextual Record Sets.

These Record Sets are operational and analytical projections derived from filesystem observations, repository structures, curatorial groupings, or analytical heuristics.

They should not automatically be interpreted as authoritative archival arrangements or fonds structures.

Examples:

structural grouping heuristics for later record set creation
grouping by project roots
repository affiliation
analytical path prefixes

The package sharply distinguishes between:

observed filesystem evidence
derived contextual interpretations
analytical reconstruction

3. Analytical reconstruction layer

Analytical outputs may be represented in either of two forms:

contextualised recordset_df objects, where rows still represent contextual Record Set members or operational resources
derived dataset_df analytical summaries, where rows represent aggregates, metrics, or statistical projections rather than individual resources

Core assumptions

Record Resource (operational approximation): a logical file or document inferred from repeated observations of file Instantiations.
Instantiation: a specific occurrence of a file on a given storage at a given observation time (scan_time). A Record Resource may have many Instantiations as it evolves through file versions and is copied across storage points or network repositories.
Record Part: not parsed at this stage, but may later include R functions, document sections, bibliographic entries, or other sub-file structures.
Record Set: an operational or intellectual aggregation of Record Resources constructed from observational evidence, curatorial decisions, repository structure, filesystem organisation, or analytical grouping rules.

This means that the package does not attempt to decide, during scanning, what the “true” or “canonical” file is. It records file occurrences first. Later analytical functions may compare, group, or reconcile those observations.

The package also does not assume that filesystem folders, repositories, or storage locations inherently correspond to Record Sets.

Operational grouping functions may derive structural heuristics from filesystem paths, repository roots, or storage layouts. These derived groupings are analytical projections and should not be interpreted as authoritative archival Record Sets without additional contextual or curatorial interpretation.

Rationale

This model reflects the needs of:

reconstructing work activity, for example for timesheets or audits
aligning local work with Git repositories
identifying uncommitted, duplicated, or distributed work
preserving evidence before file metadata changes further

It deliberately treats operational filesystem resources as the primary observable unit because they are:

consistently visible across operating systems
directly linked to version control systems
stable enough for modification-time and path-based reconstruction
suitable for later comparison by content signature

Higher-level intellectual objects, such as reports, R packages, publications, or software components, may later support Record Set construction or other analytical interpretations, but are not treated as primary Records during scanning.

Canonical variable semantics

To avoid ambiguity across snapshots and analytical outputs, the package uses a fixed naming convention. The distinction between atomic, derived, and aggregated variables is part of the data model.

File-level atomic variables

Each row in a scan represents one observed file occurrence.

Variable	Meaning
`rel_path`	Full relative path from the scan root. This is the primary file-instance identifier within a scan.
`filename`	Basename of the file. This is not unique and must not be used alone as an identifier.
`dir_path`	Directory containing the file, derived from `rel_path`.

rel_path is the safest key for joining file-level data within a snapshot. filename is a weak identity: the same name may occur in many folders and may refer to identical copies or genuinely different versions.
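
The difference between a strong and a weak key can be illustrated with a few lines of base R. The sketch below uses a small hand-made data frame (the paths are invented for illustration) and derives filename and dir_path from rel_path with basename() and dirname(). Whether the package derives them in exactly this way is an assumption.

```r
# Hypothetical file-level rows; only rel_path is given, the rest is derived.
scan <- data.frame(
  rel_path = c("projA/R/utils.R", "projB/R/utils.R", "projA/README.md"),
  stringsAsFactors = FALSE
)
scan$filename <- basename(scan$rel_path)
scan$dir_path <- dirname(scan$rel_path)

# rel_path is unique within the scan, filename is not.
rel_path_is_unique <- anyDuplicated(scan$rel_path) == 0
filename_is_unique <- anyDuplicated(scan$filename) == 0
```

Here the two utils.R rows share a filename but have distinct rel_path values, so a join on filename alone would silently multiply rows.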

Derived structural variables

Some variables are computed from paths for grouping or interpretation.

Variable	Meaning
`dir_path`	Immediate containing directory of a file.
`group_path`	Analytical structural grouping key derived with `path_prefix()`.

group_path is not necessarily a filesystem folder. It is a deterministic analytical prefix used for reporting, activity summaries, and lightweight structural grouping for later record set creation.
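
As a rough illustration of a deterministic analytical prefix, the sketch below keeps the first n directory components of each relative path. The helper prefix_sketch() and its argument n are hypothetical; the arguments and exact behaviour of path_prefix() are not described here and may differ.

```r
# Hypothetical helper: keep the first `n` directory components of a path.
# This is NOT the package's path_prefix(); it only illustrates the idea.
prefix_sketch <- function(rel_path, n = 2) {
  parts <- strsplit(rel_path, "/", fixed = TRUE)
  vapply(parts, function(p) {
    dirs <- p[-length(p)]                       # drop the filename itself
    paste(utils::head(dirs, n), collapse = "/") # "" for files at the scan root
  }, character(1))
}

paths <- c("projA/R/io/read.R", "projA/R/utils.R", "notes.txt")
group_path <- prefix_sketch(paths, n = 2)
```

In this sketch the first two files share the key projA/R even though one of them sits in a deeper folder, which is why such a key need not correspond to an actual filesystem folder.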

Aggregated analytical variables

Aggregated outputs introduce variables that should not be confused with file-level identity.

Variable	Meaning
`file_names`	Concatenated filenames for display in summaries.
`n_files`	Number of file instances in an analytical group.

file_names is for human-readable reporting only. It is not suitable for joins, identity resolution, or duplicate detection.

Naming rules

The following names are reserved for specific meanings:

rel_path: full relative file path
filename: one basename
dir_path: containing directory
group_path: analytical grouping key
file_names: aggregated display string
n_files: count of file instances

The following names are avoided in package outputs because they are ambiguous:

file
files
folder

Provenance

Each scan records minimal provenance metadata, including:

the time of observation (scan_time, also stored as created_at)
the function used to generate the dataset
the package and package version
detected Git repositories, where available

This ensures reproducibility while keeping the data model lightweight.

Future extensions

The model allows later refinement:

splitting files into Record Parts, such as functions or document sections
constructing contextual and semantically enriched Record Sets from repeated observational evidence, repository structures, or analytical grouping heuristics
resolving identity across multiple Instantiations
classifying duplicated files as identical copies or versioned copies
linking file observations to Git commits and repository states

These steps are intentionally deferred. The first responsibility of the package is to preserve a robust and transparent observational base.

Operational Functions and Outputs

This package provides a small set of functions that implement the observational model in a reproducible and auditable way.

Core functions

`scan_storage()`

scan_storage() scans a directory recursively and returns a data.frame where each row represents one observed file occurrence.

Purpose:

create an inventory of accessible files on a storage
capture file-level metadata for audit and reconstruction
compute optional content signatures
detect Git repository context and Git tracking status

Key features:

recursive scan of accessible files
read-only operation
cross-platform filtering of common system artefacts
optional content fingerprinting with quick_sig
detection of Git repositories and file membership

Output:

A data.frame with one row per file occurrence.

Identification variables:

Variable	Meaning
`storage_id`	Identifier of the storage being scanned.
`person_id`	Identifier of the person or operator responsible for the scan.
`full_path`	Absolute path to the observed file.
`rel_path`	Full relative path from the scan root; primary file-instance identifier within the scan.
`filename`	Basename of the file; not unique.
`dir_path`	Directory containing the file, derived from `rel_path`.
`repo_rel_path`	Path relative to the Git repository root, if applicable.
`storage_path_id`	Stable file occurrence identifier, constructed as `storage_id::rel_path`.
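
The construction of storage_path_id can be sketched in base R as follows. The storage identifier and paths are invented for illustration, and the helper lines that split the identifier again are an assumption about how it might be used, not a package function.

```r
storage_id <- "laptop-ssd"                              # hypothetical storage identifier
rel_path   <- c("projA/R/utils.R", "projA/README.md")   # invented relative paths

# Combine storage identity and relative path into one occurrence identifier.
storage_path_id <- paste0(storage_id, "::", rel_path)

# The two parts can be recovered by splitting at the first "::".
recovered_storage <- sub("::.*$", "", storage_path_id)
recovered_path    <- sub("^.*?::", "", storage_path_id, perl = TRUE)
```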

File metadata variables:

Variable	Meaning
`stem`	Filename without extension.
`extension`	Lowercase file extension, without leading dot.
`type`	Filesystem type reported by `fs::file_info()`.
`size`	File size in bytes.
`mtime`	Last modification time.
`ctime`	Metadata change time or creation-like time, depending on platform.
`atime`	Last access time, if available.
`birth_time`	File creation time, if available.
`depth`	Number of path components in `rel_path`.
`links`	Number of hard links, if reported by the filesystem.
`permissions`	Filesystem permissions.
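
Several of these variables can be derived from the path alone. The following base R sketch shows one plausible derivation of stem, extension, and depth. It assumes that depth counts every path component including the filename, which the table above does not state explicitly.

```r
rel_path <- c("projA/R/utils.R", "projA/data/raw/Table.CSV", "Makefile")
filename <- basename(rel_path)

stem      <- tools::file_path_sans_ext(filename)           # filename without extension
extension <- tolower(tools::file_ext(filename))            # lowercase, no leading dot
depth     <- lengths(strsplit(rel_path, "/", fixed = TRUE)) # assumed to include the filename
```

Files without an extension, such as Makefile, receive an empty string as extension in this sketch.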

Content fingerprint variables:

Variable	Meaning
`quick_sig`	Fast, non-cryptographic content signature used for duplicate detection.

Git integration variables:

Variable	Meaning
`repo_root`	Nearest Git repository root, if detected.
`git_tracked`	Whether the file is tracked by Git, if repository information is available.
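
These two variables are enough to separate untracked work inside a repository from files that lie outside any repository. The sketch below assumes that git_tracked is a logical column and that both columns are NA where no repository was detected; the rows are invented for illustration.

```r
scan <- data.frame(
  rel_path    = c("projA/R/utils.R", "projA/scratch/tmp.R", "notes/todo.txt"),
  repo_root   = c("projA", "projA", NA),   # assumed NA when no repository is found
  git_tracked = c(TRUE, FALSE, NA),        # assumed logical, NA when unknown
  stringsAsFactors = FALSE
)

# Files inside a repository that Git does not track.
untracked <- scan[!is.na(scan$repo_root) & scan$git_tracked %in% FALSE, ]

# Files that are not inside any detected repository.
outside <- scan[is.na(scan$repo_root), ]
```

Using %in% FALSE rather than a plain negation keeps rows with a missing git_tracked value out of the untracked set.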

Provenance variables:

Variable	Meaning
`scan_time`	Timestamp of the observation, repeated for each row.

Attributes:

created_at: timestamp of the scan
created_by: function name
package: package name
package_version: package version, if available
repos: detected repositories with remote and branch information
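
Provenance of this kind can be carried as ordinary R attributes on the data.frame. The sketch below attaches and reads such attributes by hand; the attribute values are illustrative, and the way scan_storage() actually sets them is not shown here.

```r
scan_time <- as.POSIXct("2026-05-01 10:40:52", tz = "UTC")

scan <- data.frame(rel_path = "projA/README.md", stringsAsFactors = FALSE)
scan$scan_time <- scan_time                 # repeated for each row

# Dataset-level provenance stored as attributes (illustrative values).
attr(scan, "created_at") <- scan_time
attr(scan, "created_by") <- "scan_storage"
attr(scan, "package")    <- "fscontext"

created_by <- attr(scan, "created_by")
same_time  <- identical(attr(scan, "created_at"), scan$scan_time[1])
```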

`snapshot_storage()`

snapshot_storage() runs scan_storage() and immediately saves the result as an .rds snapshot file.

Purpose:

create persistent, timestamped records of a storage state
preserve file observations before metadata changes further
support later comparison between scans
provide auditable artefacts for reporting and forensic analysis

Behaviour:

executes a scan with the supplied parameters
optionally attaches a human-readable label
saves the scan result as an .rds file
invisibly returns the saved file path
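
The save-and-return behaviour can be sketched with base R alone. The helper save_snapshot_sketch() and its arguments are hypothetical stand-ins, not the package's implementation or signature.

```r
# Hypothetical stand-in for the save step of a snapshot function.
save_snapshot_sketch <- function(scan, dir, filename) {
  path <- file.path(dir, filename)
  saveRDS(scan, path)   # persist the scan result as an .rds file
  invisible(path)       # return the path without printing it
}

scan <- data.frame(rel_path = "projA/README.md", stringsAsFactors = FALSE)
attr(scan, "label") <- "demo scan"          # hypothetical human-readable label

path     <- save_snapshot_sketch(scan, tempdir(), "scan_demo.rds")
restored <- readRDS(path)
```

Because the path is returned invisibly, nothing is printed when the function is called for its side effect, but the result can still be assigned and reused.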

`save_scan()`

save_scan() saves an existing scan result to disk using a structured filename.

Filename structure:

scan_<storage_id>_<label>_<timestamp>_<hash>.rds

The <label> segment is optional.

Example:

scan_l480-ssd_d_drive_full_20260501-104052_1cecc8.rds

Purpose:

ensure uniqueness of snapshot filenames
provide chronological ordering
preserve the storage identity and optional scan scope label

`make_scan_filename()`

make_scan_filename() generates a filesystem-safe, chronologically sortable filename for a scan snapshot.

Key properties:

includes storage_id
includes an optional scope label
includes a normalised timestamp
includes a short deterministic hash
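
A filename with these properties could be assembled roughly as follows. This is a sketch under stated assumptions: the helper scan_filename_sketch() is hypothetical, the sanitising rule is invented, and the hash is an MD5 of the other filename parts, whereas the input and algorithm of the package's own hash are not described here.

```r
# Hypothetical sketch; make_scan_filename() may differ in its details.
scan_filename_sketch <- function(storage_id, scan_time, label = NULL) {
  # Replace anything that is not a letter, digit, or hyphen with "_".
  safe  <- function(x) gsub("[^A-Za-z0-9-]+", "_", tolower(x))
  stamp <- format(scan_time, "%Y%m%d-%H%M%S", tz = "UTC")  # sorts chronologically
  stem  <- paste(
    c("scan", safe(storage_id), if (!is.null(label)) safe(label), stamp),
    collapse = "_"
  )
  # Short deterministic hash: base R only hashes files, so the stem is
  # written to a temporary file and the first six hex digits are kept.
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(charToRaw(stem), tmp)
  hash <- substr(unname(tools::md5sum(tmp)), 1, 6)
  paste0(stem, "_", hash, ".rds")
}

when <- as.POSIXct("2026-05-01 10:40:52", tz = "UTC")
with_label    <- scan_filename_sketch("l480-ssd", when, label = "D drive full")
without_label <- scan_filename_sketch("l480-ssd", when)
```

Placing the timestamp in year-month-day order means that, for a given storage and label, plain alphabetical sorting of the filenames is also chronological.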

`quick_signature()`

quick_signature() computes a fast content signature for a file.

Method:

hashes selected byte blocks from the file
uses a fast, non-cryptographic hash
avoids the cost of full-file hashing for large scans

Purpose:

detect likely identical copies
identify likely versioned copies
support duplicate diagnostics

Limitations:

not cryptographically secure
collisions are possible
intended for operational duplicate detection, not legal proof of integrity
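
The idea of hashing selected byte blocks can be sketched in base R. The helper quick_sig_sketch() is hypothetical: it hashes the file size together with the first and last block of the file, and it uses MD5 via tools::md5sum() as a stand-in because base R offers no fast non-cryptographic hash. Which blocks and which hash quick_signature() actually uses is not specified here.

```r
# Hypothetical sketch of block-based signing; not the package's quick_signature().
quick_sig_sketch <- function(path, block = 4096L) {
  size <- file.size(path)
  con  <- file(path, "rb")
  on.exit(close(con))
  first <- readBin(con, what = "raw", n = block)   # leading block
  seek(con, where = max(0, size - block))          # jump to the trailing block
  last  <- readBin(con, what = "raw", n = block)

  # Hash size + first block + last block through a temporary file.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeBin(c(charToRaw(sprintf("%.0f", size)), first, last), tmp)
  unname(tools::md5sum(tmp))
}

a <- tempfile(); b <- tempfile(); d <- tempfile()
writeBin(as.raw(rep(1:200, 100)), a)   # 20,000 bytes
file.copy(a, b)                        # identical copy
writeBin(as.raw(rep(2:201, 100)), d)   # same size, different content

sig_a <- quick_sig_sketch(a)
sig_b <- quick_sig_sketch(b)
sig_d <- quick_sig_sketch(d)
```

In this sketch a change confined to the middle of a large file that leaves its size unchanged would go unnoticed, which is one concrete way the collisions mentioned above can arise.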

`summarise_activity()`

summarise_activity() aggregates file observations by modification time and path structure.

It derives:

period: time bucket such as day, week, month, or year
group_path: analytical grouping key derived with path_prefix()
file_names: display string of filenames in the group
n_files: number of file instances in the group

The output is intended for reporting and exploratory reconstruction, not for file-level joins.
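
The shape of such a summary can be approximated with base R aggregation. The sketch below is not the package's implementation: it uses invented rows, a monthly period, and the first path component as a stand-in for the group_path that path_prefix() would produce.

```r
scan <- data.frame(
  rel_path = c("projA/R/utils.R", "projA/R/io.R", "projB/report.Rmd"),
  mtime = as.POSIXct(
    c("2026-04-28 09:00:00", "2026-04-30 17:30:00", "2026-05-01 08:15:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
scan$filename   <- basename(scan$rel_path)
scan$period     <- format(scan$mtime, "%Y-%m")      # monthly time bucket
scan$group_path <- sub("/.*$", "", scan$rel_path)   # stand-in grouping key

# One row per period and group, with a display string and a count.
activity <- aggregate(
  filename ~ period + group_path, data = scan,
  FUN = function(x) paste(sort(x), collapse = ", ")
)
names(activity)[names(activity) == "filename"] <- "file_names"
counts <- aggregate(filename ~ period + group_path, data = scan, FUN = length)
activity$n_files <- counts$filename
```

The file_names column is a single concatenated string per group, which is why it cannot serve as a join key.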

`summarise_duplicates()`

summarise_duplicates() compares files with the same filename using their content signatures.

It reports, per filename:

total_copies: number of observed file instances
identical_copies: size of the largest identical-signature cluster, if larger than one
versioned_copies: number of additional copies outside the largest identical-signature cluster
n_versions: number of distinct signatures observed

This function treats filename as a weak identity and quick_sig as the content-equivalence signal.
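
The four measures can be reproduced on a toy example with base R. The sketch below is an illustration of the definitions above, not the package's code. It assumes that identical_copies is reported as 0 when no signature occurs more than once, which the definition leaves open.

```r
scan <- data.frame(
  filename  = c("utils.R", "utils.R", "utils.R", "README.md"),
  quick_sig = c("a1", "a1", "b2", "c3"),   # invented signatures
  stringsAsFactors = FALSE
)

dup <- do.call(rbind, lapply(split(scan, scan$filename), function(g) {
  cluster_sizes <- table(g$quick_sig)           # copies per distinct signature
  largest <- as.integer(max(cluster_sizes))
  data.frame(
    filename         = g$filename[1],
    total_copies     = nrow(g),
    identical_copies = if (largest > 1L) largest else 0L,  # assumed 0 if none
    versioned_copies = nrow(g) - largest,
    n_versions       = length(cluster_sizes),
    stringsAsFactors = FALSE
  )
}))
rownames(dup) <- NULL
```

For utils.R this yields three copies in total, of which two are identical and one is a divergent version, across two distinct signatures.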

Interpretation of Results

Observational nature of the data

A snapshot represents the state of accessible files at a specific point in time. Each row corresponds to one observed file occurrence, identified by rel_path within a given storage_id and scan_time.

This means that:

the data reflects what is observable, not necessarily what is complete
multiple rows may correspond to files with the same filename
identical filenames may represent identical copies or different versions
absence of a file in a snapshot does not prove it never existed

Identity and comparison

File identity should be interpreted carefully:

rel_path identifies a file occurrence within a snapshot
filename is a weak identity and must not be used alone for joins
quick_sig provides a heuristic signal for content equivalence

Analytical functions such as summarise_duplicates() combine these signals to distinguish:

identical copies (same filename, same quick_sig)
versioned copies (same filename, different quick_sig)
unique files

Aggregated outputs

Functions such as summarise_activity() and summarise_duplicates() produce aggregated views of the data.

These outputs:

group file occurrences by time (period) and structure (group_path)
provide counts (n_files) and summaries (file_names)
are intended for reporting and interpretation

They should not be used as a replacement for file-level data when performing:

joins between datasets
identity resolution
detailed provenance tracking

Temporal interpretation

Modification times (mtime) provide a useful but imperfect proxy for activity:

they reflect the last modification observed at scan time
they do not capture full editing history
they may be affected by copying, syncing, or system processes

Activity summaries should therefore be interpreted as evidence of observed modification patterns, not as exact work logs.

Practical implications

Taken together, the snapshot and derived summaries allow:

reconstruction of development activity patterns
identification of duplicated or fragmented work
detection of files outside version control
alignment of local work with repository structure

They do not, on their own, establish:

authoritative file versions
complete development history
semantic meaning of file contents

Use in forensic analysis and audit contexts

Snapshot files (.rds) provide a timestamped observational record of a filesystem at a given point in time.

Each snapshot contains:

a reproducible dataset of observed file instances
file-level metadata (paths, timestamps, size, permissions)
optional content signatures (quick_sig)
information about Git repositories and tracking status

This allows the snapshots to be used in a broad range of contexts, including:

forensic analysis of development environments
technical audits of data processing workflows
reconstruction of project activity across multiple folders or systems
identification of duplicated, diverging, or unmanaged files
alignment between local files and version-controlled repositories

When multiple snapshots are available over time, they enable:

reconstruction of activity patterns and timelines
detection of changes in file populations and structures
comparison of states across systems or storage devices
identification of long-lived vs transient files
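
A comparison of two snapshots can be sketched with base R set operations and a join on rel_path. The sketch assumes both snapshots come from the same storage and scan root, so that rel_path values are comparable. The rows are invented, and the comparison uses only size and mtime, which detects changes without requiring content signatures.

```r
t0 <- as.POSIXct("2026-04-01 12:00:00", tz = "UTC")
t1 <- as.POSIXct("2026-04-20 09:30:00", tz = "UTC")

old <- data.frame(
  rel_path = c("a.R", "b.R", "c.R"),
  size = c(10, 20, 30), mtime = c(t0, t0, t0),
  stringsAsFactors = FALSE
)
new <- data.frame(
  rel_path = c("a.R", "b.R", "d.R"),
  size = c(10, 25, 40), mtime = c(t0, t1, t1),
  stringsAsFactors = FALSE
)

added   <- setdiff(new$rel_path, old$rel_path)   # only in the later snapshot
removed <- setdiff(old$rel_path, new$rel_path)   # only in the earlier snapshot

# Files present in both snapshots whose size or modification time differs.
both    <- merge(old, new, by = "rel_path", suffixes = c("_old", "_new"))
changed <- both$rel_path[
  both$size_old != both$size_new | both$mtime_old != both$mtime_new
]
```

A path listed in removed only shows that the file was not observed in the later scan; it may have been moved, renamed, or simply become inaccessible.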

Importantly, the snapshots provide observational evidence, not a complete or authoritative history. They should be interpreted together with:

version control systems (e.g. Git)
project documentation
domain knowledge of the workflows involved

Recommended usage pattern

A practical and robust workflow is:

perform periodic scans of entire storages (without signatures) to maintain a lightweight inventory
perform targeted scans (with signatures) on active project areas where duplicate detection and version comparison are important
store snapshots persistently as .rds files to create a longitudinal audit trail
apply analytical functions (e.g. summarise_activity(), summarise_duplicates()) on filtered subsets of snapshots

This separation keeps scanning fast and analysis flexible.

Future extensions

The package deliberately separates:

observational acquisition,
contextual projection,
semantic stabilisation,
and analytical interpretation.

This staged architecture allows lightweight operational workflows while preserving compatibility with future semantic enrichment, provenance modelling, and RiC-aligned contextual interpretation.

The current model focuses on file-level observation. Several extensions are possible without breaking the core design:

Record Part extraction: parsing files into functions, sections, or structured components
Cross-snapshot identity resolution: linking file instances across scans into longer-lived entities
Canonical file selection: identifying the most relevant version among duplicates based on time, location, or repository context
Integration with Git history: linking file observations to commits and branches
Quality and risk diagnostics: detecting patterns such as uncontrolled duplication, untracked work, or fragmented project structures

These extensions build on the same principle: preserve a reliable observational base first, then add interpretation layers in a controlled and reproducible way.