library(fscontext)

Conceptual Model (RiC-aligned, operational)

This package adopts a minimal and operational interpretation of the Records in Contexts (RiC) model for file-level archival tracking, time accounting, and Git integration.

The model is deliberately observational. It records what can be observed in a filesystem at a given point in time, without assuming that the observed state is complete, canonical, or historically exhaustive.

Analytical layers

The package operates across three distinct analytical layers.

1. Observational layer

Raw filesystem observations created by scan_storage().

At this stage the package only records:

  • observed file occurrences
  • filesystem metadata
  • optional content signatures
  • repository context

No interpretation is performed.

2. Contextual layer

The contextual layer supports the construction of lightweight contextual Record Sets.

These Record Sets are operational and analytical projections derived from filesystem observations, repository structures, curatorial groupings, or analytical heuristics.

They should not automatically be interpreted as authoritative archival arrangements or fonds structures.

Examples:

  • structural grouping heuristics for later record set creation
  • grouping by project roots
  • repository affiliation
  • analytical path prefixes

The package sharply distinguishes between:

  • observed filesystem evidence
  • derived contextual interpretations
  • analytical reconstruction

3. Analytical reconstruction layer

Analytical outputs may be represented either as:

  • contextualised recordset_df objects, where rows still represent contextual Record Set members or operational resources,

or:

  • derived dataset_df analytical summaries, where rows represent aggregates, metrics, or statistical projections rather than individual resources.

Core assumptions

  • Record Resource (operational approximation): a logical file or document inferred from repeated observations of file Instantiations.
  • Instantiation: a specific occurrence of a file on a given storage at a given observation time (scan_time). One Record Resource may have many Instantiations as the file evolves through versions or is copied across storage points or network repositories.
  • Record Part: not parsed at this stage, but may later include R functions, document sections, bibliographic entries, or other sub-file structures.
  • Record Set: an operational or intellectual aggregation of Record Resources constructed from observational evidence, curatorial decisions, repository structure, filesystem organisation, or analytical grouping rules.

This means that the package does not attempt to decide, during scanning, what the “true” or “canonical” file is. It records file occurrences first. Later analytical functions may compare, group, or reconcile those observations.

The package also does not assume that filesystem folders, repositories, or storage locations inherently correspond to Record Sets.

Operational grouping functions may derive structural heuristics from filesystem paths, repository roots, or storage layouts. These derived groupings are analytical projections and should not be interpreted as authoritative archival Record Sets without additional contextual or curatorial interpretation.

Rationale

This model reflects the needs of:

  • reconstructing work activity, for example for timesheets or audits
  • aligning local work with Git repositories
  • identifying uncommitted, duplicated, or distributed work
  • preserving evidence before file metadata changes further

It deliberately treats operational filesystem resources as the primary observable unit because they are:

  • consistently visible across operating systems
  • directly linked to version control systems
  • stable enough for modification-time and path-based reconstruction
  • suitable for later comparison by content signature

Higher-level intellectual objects, such as reports, R packages, publications, or software components, may later support Record Set construction or other analytical interpretations, but are not treated as primary Records during scanning.

Canonical variable semantics

To avoid ambiguity across snapshots and analytical outputs, the package uses a fixed naming convention. The distinction between atomic, derived, and aggregated variables is part of the data model.

File-level atomic variables

Each row in a scan represents one observed file occurrence.

| Variable | Meaning |
|---|---|
| rel_path | Full relative path from the scan root. This is the primary file-instance identifier within a scan. |
| filename | Basename of the file. This is not unique and must not be used alone as an identifier. |
| dir_path | Directory containing the file, derived from rel_path. |

rel_path is the safest key for joining file-level data within a snapshot. filename is a weak identity: the same name may occur in many folders and may refer to identical copies or genuinely different versions.
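The difference between the two keys can be illustrated on a small hand-built table. The data frame below is hypothetical and only mimics the documented column names; it is not produced by scan_storage().

```r
# Hypothetical file-level observations mimicking the documented columns.
scan <- data.frame(
  rel_path = c("projA/R/utils.R", "projB/R/utils.R", "projA/README.md"),
  stringsAsFactors = FALSE
)
scan$filename <- basename(scan$rel_path)
scan$dir_path <- dirname(scan$rel_path)

# rel_path is unique within the snapshot, so it is safe as a join key.
rel_path_is_unique <- !anyDuplicated(scan$rel_path)

# filename repeats across folders, so it cannot identify a file occurrence.
repeated_filenames <- unique(scan$filename[duplicated(scan$filename)])
```

Here utils.R occurs in two folders, so a join on filename alone would pair rows that may describe different files.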

Derived structural variables

Some variables are computed from paths for grouping or interpretation.

| Variable | Meaning |
|---|---|
| dir_path | Immediate containing directory of a file. |
| group_path | Analytical structural grouping key derived with path_prefix(). |

group_path is not necessarily a filesystem folder. It is a deterministic analytical prefix used for reporting, activity summaries, and lightweight structural grouping for later record set creation.
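As a rough illustration of a deterministic prefix, the sketch below keeps the first components of each containing directory. The helper prefix_sketch() and its depth argument are hypothetical; the actual behaviour and arguments of path_prefix() may differ.

```r
# Hypothetical helper: keep the first `depth` components of a directory path.
# This is an illustration only and not the package's path_prefix().
prefix_sketch <- function(dir_path, depth = 2) {
  parts <- strsplit(dir_path, "/", fixed = TRUE)
  vapply(
    parts,
    function(p) paste(head(p, depth), collapse = "/"),
    character(1)
  )
}

rel_path <- c(
  "clients/acme/report/draft.docx",
  "clients/acme/data/raw.csv",
  "notes.txt"
)
group_path <- prefix_sketch(dirname(rel_path), depth = 2)
```

The first two files live in different folders but share the grouping key clients/acme, which is why a grouping key should not be read as a folder.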

Aggregated analytical variables

Aggregated outputs introduce variables that should not be confused with file-level identity.

| Variable | Meaning |
|---|---|
| file_names | Concatenated filenames for display in summaries. |
| n_files | Number of file instances in an analytical group. |

file_names is for human-readable reporting only. It is not suitable for joins, identity resolution, or duplicate detection.

Naming rules

The following names are reserved for specific meanings:

  • rel_path: full relative file path
  • filename: one basename
  • dir_path: containing directory
  • group_path: analytical grouping key
  • file_names: aggregated display string
  • n_files: count of file instances

The following names are avoided in package outputs because they are ambiguous:

  • file
  • files
  • folder

Provenance

Each scan records minimal provenance metadata, including:

  • the time of observation (scan_time, also stored as created_at)
  • the function used to generate the dataset
  • the package and package version
  • detected Git repositories, where available

This ensures reproducibility while keeping the data model lightweight.
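A minimal sketch of how dataset-level provenance can travel with a data.frame as attributes is shown below. The values are illustrative assumptions, and the exact attribute contents written by the package may differ.

```r
# Hypothetical scan result; only the attribute mechanism is illustrated here.
scan <- data.frame(rel_path = "R/utils.R", stringsAsFactors = FALSE)

scan_time <- as.POSIXct("2026-05-01 10:40:52", tz = "UTC")
scan$scan_time <- scan_time

# Attach dataset-level provenance as attributes of the data.frame.
attr(scan, "created_at") <- scan_time
attr(scan, "created_by") <- "scan_storage"
attr(scan, "package") <- "fscontext"

# Read the provenance back.
provenance <- attributes(scan)[c("created_at", "created_by", "package")]
```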

Future extensions

The model allows later refinement:

  • splitting files into Record Parts, such as functions or document sections
  • constructing contextual and semantically enriched Record Sets from repeated observational evidence, repository structures, or analytical grouping heuristics
  • resolving identity across multiple Instantiations
  • classifying duplicated files as identical copies or versioned copies
  • linking file observations to Git commits and repository states

These steps are intentionally deferred. The first responsibility of the package is to preserve a robust and transparent observational base.

Operational Functions and Outputs

This package provides a small set of functions that implement the observational model in a reproducible and auditable way.

Core functions

scan_storage()

scan_storage() scans a directory recursively and returns a data.frame where each row represents one observed file occurrence.

Purpose:

  • create an inventory of accessible files on a storage
  • capture file-level metadata for audit and reconstruction
  • compute optional content signatures
  • detect Git repository context and Git tracking status

Key features:

  • recursive scan of accessible files
  • read-only operation
  • cross-platform filtering of common system artefacts
  • optional content fingerprinting with quick_sig
  • detection of Git repositories and file membership

Output:

A data.frame with one row per file occurrence.

Identification variables:

| Variable | Meaning |
|---|---|
| storage_id | Identifier of the storage being scanned. |
| person_id | Identifier of the person or operator responsible for the scan. |
| full_path | Absolute path to the observed file. |
| rel_path | Full relative path from the scan root; primary file-instance identifier within the scan. |
| filename | Basename of the file; not unique. |
| dir_path | Directory containing the file, derived from rel_path. |
| repo_rel_path | Path relative to the Git repository root, if applicable. |
| storage_path_id | Stable file occurrence identifier, constructed as storage_id::rel_path. |
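The storage_id::rel_path construction can be sketched in base R as follows. The storage identifiers and paths are made-up examples.

```r
# Hypothetical observations from two storages that share a relative path.
obs <- data.frame(
  storage_id = c("laptop", "backup", "laptop"),
  rel_path = c("R/utils.R", "R/utils.R", "README.md"),
  stringsAsFactors = FALSE
)

# Combine storage identity and relative path into one occurrence identifier.
obs$storage_path_id <- paste(obs$storage_id, obs$rel_path, sep = "::")
```

The same rel_path on two storages yields two distinct identifiers, so occurrences remain distinguishable when scans of several storages are combined.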

File metadata variables:

| Variable | Meaning |
|---|---|
| stem | Filename without extension. |
| extension | Lowercase file extension, without leading dot. |
| type | Filesystem type reported by fs::file_info(). |
| size | File size in bytes. |
| mtime | Last modification time. |
| ctime | Metadata change time or creation-like time, depending on platform. |
| atime | Last access time, if available. |
| birth_time | File creation time, if available. |
| depth | Number of path components in rel_path. |
| links | Number of hard links, if reported by the filesystem. |
| permissions | Filesystem permissions. |
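The path-derived variables in this table can be approximated with base R and the tools package. This is a sketch, not the package's code; in particular, counting the filename itself as a path component for depth is an assumption.

```r
rel_path <- c("docs/Report.PDF", "R/utils.R", "Makefile")
filename <- basename(rel_path)

# Filename without its extension.
stem <- tools::file_path_sans_ext(filename)

# Lowercase extension without the leading dot; empty when there is none.
extension <- tolower(tools::file_ext(filename))

# Number of "/"-separated components in the relative path (assumed definition).
depth <- lengths(strsplit(rel_path, "/", fixed = TRUE))
```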

Content fingerprint variables:

| Variable | Meaning |
|---|---|
| quick_sig | Fast, non-cryptographic content signature used for duplicate detection. |

Git integration variables:

| Variable | Meaning |
|---|---|
| repo_root | Nearest Git repository root, if detected. |
| git_tracked | Whether the file is tracked by Git, if repository information is available. |
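These two variables are enough to list work that sits outside version control. The sketch below assumes that git_tracked is a logical column that is NA where no repository information is available; the rows are hypothetical.

```r
# Hypothetical scan rows; git_tracked is assumed to be logical, with NA
# where no repository information is available.
scan <- data.frame(
  rel_path = c("pkg/R/a.R", "pkg/scratch.R", "notes/todo.txt"),
  repo_root = c("pkg", "pkg", NA),
  git_tracked = c(TRUE, FALSE, NA),
  stringsAsFactors = FALSE
)

# Files inside a detected repository that Git does not track.
untracked <- scan[!is.na(scan$repo_root) & scan$git_tracked %in% FALSE, ]

# Files outside any detected repository.
outside_repo <- scan[is.na(scan$repo_root), ]
```

Using %in% FALSE rather than a plain negation keeps rows with NA out of the untracked set.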

Provenance variables:

| Variable | Meaning |
|---|---|
| scan_time | Timestamp of the observation, repeated for each row. |

Attributes:

  • created_at: timestamp of the scan
  • created_by: function name
  • package: package name
  • package_version: package version, if available
  • repos: detected repositories with remote and branch information

snapshot_storage()

snapshot_storage() runs scan_storage() and immediately saves the result as an .rds snapshot file.

Purpose:

  • create persistent, timestamped records of a storage state
  • preserve file observations before metadata changes further
  • support later comparison between scans
  • provide auditable artefacts for reporting and forensic analysis

Behaviour:

  • executes a scan with the supplied parameters
  • optionally attaches a human-readable label
  • saves the scan result as an .rds file
  • invisibly returns the saved file path
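The persistence step relies on the .rds format, which stores a data.frame together with its attributes. The following sketch uses only base R and a hypothetical scan result to show the round trip; it does not call snapshot_storage().

```r
# Hypothetical scan result with one provenance attribute.
scan <- data.frame(rel_path = c("a.txt", "b/c.txt"), stringsAsFactors = FALSE)
attr(scan, "created_by") <- "scan_storage"

# Save to a temporary .rds file and read it back.
snapshot_file <- tempfile(pattern = "scan_", fileext = ".rds")
saveRDS(scan, snapshot_file)
restored <- readRDS(snapshot_file)

# Columns and attributes survive the round trip unchanged.
round_trip_ok <- identical(scan, restored)
```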

save_scan()

save_scan() saves an existing scan result to disk using a structured filename.

Filename structure:

scan_<storage_id>_<label>_<timestamp>_<hash>.rds

Example:

scan_l480-ssd_d_drive_full_20260501-104052_1cecc8.rds

Purpose:

  • ensure uniqueness of snapshot filenames
  • provide chronological ordering
  • preserve the storage identity and optional scan scope label

make_scan_filename()

make_scan_filename() generates a filesystem-safe, chronologically sortable filename for a scan snapshot.

Key properties:

  • includes storage_id
  • includes an optional scope label
  • includes a normalised timestamp
  • includes a short deterministic hash
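A filename with these properties could be assembled as sketched below. The helper scan_filename_sketch() is hypothetical, the cleaning rule is an assumption, and MD5 via tools::md5sum() stands in for whatever hash the package actually uses, so the hash part will not match real package output.

```r
# Hypothetical helper, not the package's make_scan_filename().
scan_filename_sketch <- function(storage_id, scan_time, label = NULL) {
  # Replace anything outside letters, digits and hyphens to stay filesystem-safe.
  clean <- function(x) gsub("[^A-Za-z0-9-]+", "_", tolower(x))

  # Normalised, sortable timestamp.
  stamp <- format(scan_time, "%Y%m%d-%H%M%S", tz = "UTC")
  key <- paste(c(storage_id, label, stamp), collapse = "|")

  # Assumption: MD5 via tools::md5sum() stands in for the package's hash.
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(key, tmp)
  hash <- substr(unname(tools::md5sum(tmp)), 1, 6)

  # The label is optional and is skipped when it is NULL.
  parts <- c("scan", clean(storage_id), if (!is.null(label)) clean(label), stamp, hash)
  paste0(paste(parts, collapse = "_"), ".rds")
}

when <- as.POSIXct("2026-05-01 10:40:52", tz = "UTC")
with_label <- scan_filename_sketch("l480-ssd", when, label = "D drive full")
without_label <- scan_filename_sketch("l480-ssd", when)
```

Because the hash is computed from the same inputs as the rest of the name, repeated calls with identical arguments return the same filename.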

quick_signature()

quick_signature() computes a fast content signature for a file.

Method:

  • hashes selected byte blocks from the file
  • uses a fast, non-cryptographic hash
  • avoids the cost of full-file hashing for large scans

Purpose:

  • detect likely identical copies
  • identify likely versioned copies
  • support duplicate diagnostics

Limitations:

  • not cryptographically secure
  • collisions are possible
  • intended for operational duplicate detection, not legal proof of integrity
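The block-sampling idea can be sketched in base R as follows. This is not the package's implementation: quick_sig_sketch() is a hypothetical helper, the choice of the first and last block plus the file size is an assumption, and MD5 via tools::md5sum() is used only because it ships with R, in place of the fast non-cryptographic hash described above.

```r
# Hypothetical helper, not the package's quick_signature().
# Assumption: MD5 via tools::md5sum() replaces the fast non-cryptographic hash.
quick_sig_sketch <- function(path, block_size = 4096L) {
  size <- file.size(path)
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)

  # Sample the first block and, for larger files, the last block.
  first_block <- readBin(con, what = "raw", n = block_size)
  last_block <- raw(0)
  if (size > block_size) {
    seek(con, where = size - block_size, origin = "start")
    last_block <- readBin(con, what = "raw", n = block_size)
  }

  # Hash the sampled bytes together with the file size.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  size_bytes <- charToRaw(format(size, scientific = FALSE))
  writeBin(c(first_block, last_block, size_bytes), tmp)
  unname(tools::md5sum(tmp))
}

# Two files with equal content share a signature; an edited copy does not.
f1 <- tempfile(); f2 <- tempfile(); f3 <- tempfile()
writeLines(rep("same content", 1000), f1)
writeLines(rep("same content", 1000), f2)
writeLines(c(rep("same content", 999), "last line edited"), f3)
sigs <- vapply(c(f1, f2, f3), quick_sig_sketch, character(1), USE.NAMES = FALSE)
```

A sampled signature of this kind misses edits that fall entirely between the sampled blocks and leave the size unchanged, which is one concrete reason why collisions are possible.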

summarise_activity()

summarise_activity() aggregates file observations by modification time and path structure.

It derives:

  • period: time bucket such as day, week, month, or year
  • group_path: analytical grouping key derived with path_prefix()
  • file_names: display string of filenames in the group
  • n_files: number of file instances in the group

The output is intended for reporting and exploratory reconstruction, not for file-level joins.
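The shape of such a summary can be reproduced on a toy table with base R. The sketch below is not summarise_activity(); it assumes that group_path has already been computed and uses a monthly period.

```r
# Hypothetical file-level rows; group_path is assumed to be precomputed.
scan <- data.frame(
  filename = c("a.R", "b.R", "notes.md", "a.R"),
  group_path = c("pkg/R", "pkg/R", "docs", "pkg/R"),
  mtime = as.POSIXct(
    c("2026-04-02 09:00:00", "2026-04-20 17:30:00",
      "2026-04-21 08:15:00", "2026-05-03 11:00:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Monthly time bucket derived from the modification time.
scan$period <- format(scan$mtime, "%Y-%m")

# One output row per (period, group_path) combination.
groups <- split(scan, list(scan$period, scan$group_path), drop = TRUE)
activity <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    period = g$period[1],
    group_path = g$group_path[1],
    n_files = nrow(g),
    file_names = paste(sort(unique(g$filename)), collapse = ", "),
    stringsAsFactors = FALSE
  )
}))
rownames(activity) <- NULL
```

Each output row describes a group rather than a file, which is why file_names cannot be joined back to file-level data.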

summarise_duplicates()

summarise_duplicates() compares files with the same filename using their content signatures.

It reports, per filename:

  • total_copies: number of observed file instances
  • identical_copies: size of the largest identical-signature cluster, if larger than one
  • versioned_copies: number of additional copies outside the largest identical-signature cluster
  • n_versions: number of distinct signatures observed

This function treats filename as a weak identity and quick_sig as the content-equivalence signal.
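The four measures can be worked through on a small example. The sketch below is not summarise_duplicates(); summarise_one() is a hypothetical helper, and reporting identical_copies as 0 when no signature repeats is an assumption.

```r
# Hypothetical rows: three copies of utils.R (two identical) and one README.md.
scan <- data.frame(
  filename = c("utils.R", "utils.R", "utils.R", "README.md"),
  quick_sig = c("aa11", "aa11", "bb22", "cc33"),
  stringsAsFactors = FALSE
)

# Summarise the signatures observed for one filename.
# Assumption: identical_copies is 0 when no signature occurs more than once.
summarise_one <- function(sigs) {
  counts <- table(sigs)
  largest <- max(counts)
  c(
    total_copies = length(sigs),
    identical_copies = if (largest > 1) largest else 0,
    versioned_copies = length(sigs) - largest,
    n_versions = length(counts)
  )
}

per_name <- lapply(split(scan$quick_sig, scan$filename), summarise_one)
duplicates <- as.data.frame(do.call(rbind, per_name))
duplicates$filename <- rownames(duplicates)
rownames(duplicates) <- NULL
```

For utils.R this gives three copies in total, an identical cluster of two, one versioned copy, and two distinct signatures.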

Interpretation of Results

Observational nature of the data

A snapshot represents the state of accessible files at a specific point in time. Each row corresponds to one observed file occurrence, identified by rel_path within a given storage_id and scan_time.

This means that:

  • the data reflects what is observable, not necessarily what is complete
  • multiple rows may correspond to files with the same filename
  • identical filenames may represent identical copies or different versions
  • absence of a file in a snapshot does not prove it never existed

Identity and comparison

File identity should be interpreted carefully:

  • rel_path identifies a file occurrence within a snapshot
  • filename is a weak identity and must not be used alone for joins
  • quick_sig provides a heuristic signal for content equivalence

Analytical functions such as summarise_duplicates() combine these signals to distinguish:

  • identical copies (same filename, same quick_sig)
  • versioned copies (same filename, different quick_sig)
  • unique files

Aggregated outputs

Functions such as summarise_activity() and summarise_duplicates() produce aggregated views of the data.

These outputs:

  • group file occurrences, for example by time (period) and structure (group_path) in summarise_activity(), or by filename in summarise_duplicates()
  • provide counts (n_files) and summaries (file_names)
  • are intended for reporting and interpretation

They should not be used as a replacement for file-level data when performing:

  • joins between datasets
  • identity resolution
  • detailed provenance tracking

Temporal interpretation

Modification times (mtime) provide a useful but imperfect proxy for activity.

  • they reflect the last modification observed at scan time
  • they do not capture full editing history
  • they may be affected by copying, syncing, or system processes

Therefore, activity summaries should be interpreted as evidence of observed modification patterns, not as exact work logs.

Practical implications

Taken together, the snapshot and derived summaries allow:

  • reconstruction of development activity patterns
  • identification of duplicated or fragmented work
  • detection of files outside version control
  • alignment of local work with repository structure

They do not, on their own, establish:

  • authoritative file versions
  • complete development history
  • semantic meaning of file contents

Use in forensic analysis and audit contexts

Snapshot files (.rds) provide a timestamped observational record of a filesystem at a given point in time.

Each snapshot contains:

  • a reproducible dataset of observed file instances
  • file-level metadata (paths, timestamps, size, permissions)
  • optional content signatures (quick_sig)
  • information about Git repositories and tracking status

This allows the snapshots to be used in a broad range of contexts, including:

  • forensic analysis of development environments
  • technical audits of data processing workflows
  • reconstruction of project activity across multiple folders or systems
  • identification of duplicated, diverging, or unmanaged files
  • alignment between local files and version-controlled repositories

When multiple snapshots are available over time, they enable:

  • reconstruction of activity patterns and timelines
  • detection of changes in file populations and structures
  • comparison of states across systems or storage devices
  • identification of long-lived vs transient files
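A comparison of two snapshots of the same storage can be sketched with base R set operations and a join on rel_path. Both data frames are hypothetical and carry only the two columns needed here.

```r
# Two hypothetical snapshots of the same storage at different scan times.
older <- data.frame(
  rel_path = c("R/a.R", "R/b.R", "old/tmp.txt"),
  quick_sig = c("s1", "s2", "s3"),
  stringsAsFactors = FALSE
)
newer <- data.frame(
  rel_path = c("R/a.R", "R/b.R", "R/c.R"),
  quick_sig = c("s1", "s9", "s4"),
  stringsAsFactors = FALSE
)

# Paths observed in only one of the two snapshots.
appeared <- setdiff(newer$rel_path, older$rel_path)
disappeared <- setdiff(older$rel_path, newer$rel_path)

# Paths observed in both snapshots whose content signature differs.
both <- merge(older, newer, by = "rel_path", suffixes = c("_old", "_new"))
changed <- both$rel_path[both$quick_sig_old != both$quick_sig_new]
```

A path in disappeared only means that the file was not observed at that location in the later scan; it may have been moved, renamed, or become inaccessible rather than deleted.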

Importantly, the snapshots provide observational evidence, not a complete or authoritative history. They should be interpreted together with:

  • version control systems (e.g. Git)
  • project documentation
  • domain knowledge of the workflows involved

A practical and robust workflow is:

  • perform periodic scans of entire storages (without signatures) to maintain a lightweight inventory
  • perform targeted scans (with signatures) on active project areas where duplicate detection and version comparison are important
  • store snapshots persistently as .rds files to create a longitudinal audit trail
  • apply analytical functions (e.g. summarise_activity(), summarise_duplicates()) on filtered subsets of snapshots

This separation keeps scanning fast and analysis flexible.

Future extensions

The package deliberately separates:

  • observational acquisition
  • contextual projection
  • semantic stabilisation
  • analytical interpretation

This staged architecture allows lightweight operational workflows while preserving compatibility with future semantic enrichment, provenance modelling, and RiC-aligned contextual interpretation.

The current model focuses on file-level observation. Several extensions are possible without breaking the core design:

  • Record Part extraction: parsing files into functions, sections, or structured components
  • Cross-snapshot identity resolution: linking file instances across scans into longer-lived entities
  • Canonical file selection: identifying the most relevant version among duplicates based on time, location, or repository context
  • Integration with Git history: linking file observations to commits and branches
  • Quality and risk diagnostics: detecting patterns such as uncontrolled duplication, untracked work, or fragmented project structures

These extensions build on the same principle: preserve a reliable observational base first, then add interpretation layers in a controlled and reproducible way.