
Summarise repeated and divergent filesystem observations
summarise_duplicates.Rd
Aggregates filesystem observations by filename and lightweight content signature (quick_sig).
Arguments
- df
A snapshot data.frame conforming to the canonical snapshot schema created by scan_storage() or read_snapshot(). The dataset must contain:
filename
quick_sig (may contain NA)
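The column requirement above can be checked before calling the function. The following is a minimal sketch; has_required_columns() is a hypothetical helper written for illustration and is not part of the package.

```r
# Hypothetical pre-flight check (not a package function): confirms that the
# input is a data.frame carrying the two columns named in the requirement.
has_required_columns <- function(df) {
  is.data.frame(df) && all(c("filename", "quick_sig") %in% names(df))
}

# Toy inputs with made-up values.
ok  <- data.frame(filename = "a.txt", quick_sig = NA_character_)
bad <- data.frame(filename = "a.txt")

has_required_columns(ok)
has_required_columns(bad)
```

Note that a quick_sig column consisting only of NA still satisfies the check, since missing signatures are permitted.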
Value
A data.frame with one row per filename.
The returned variables include:
- filename
File basename used as grouping key.
- total_copies
Total number of observed filesystem occurrences.
- identical_copies
Size of the largest identical-signature group.
- versioned_copies
Number of observations outside the largest identical-signature group.
- n_versions
Number of distinct observed signatures.
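To make the column definitions concrete, the sketch below recomputes them in base R for a toy snapshot. It follows the definitions listed above but is not the package implementation; the data values and the helper summarise_one() are hypothetical.

```r
# Toy snapshot with made-up values; only the two required columns are included.
toy <- data.frame(
  filename  = c("a.txt", "a.txt", "a.txt", "b.txt", "b.txt"),
  quick_sig = c("s1", "s1", "s2", "s9", "s9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: derives the four counts for one filename
# from its vector of signatures.
summarise_one <- function(sigs) {
  counts <- table(sigs, useNA = "ifany")  # NA, if present, forms its own group
  largest <- max(counts)
  data.frame(
    total_copies     = length(sigs),
    identical_copies = as.integer(largest),
    versioned_copies = length(sigs) - as.integer(largest),
    n_versions       = length(counts)
  )
}

# Group signatures by filename and bind the per-file summaries together.
groups <- split(toy$quick_sig, toy$filename)
sketch <- do.call(rbind, lapply(groups, summarise_one))
sketch <- data.frame(filename = names(groups), sketch, stringsAsFactors = FALSE)
rownames(sketch) <- NULL
sketch
```

In this sketch total_copies always equals identical_copies plus versioned_copies, which matches the definitions above: a.txt has three observations, two of which share a signature, so it is reported with one versioned copy and two versions.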
Details
The function identifies:
repeated identical observations;
potentially synchronised copies;
diverging versions of similarly named resources;
distributed working duplicates.
The function operates on observational filesystem evidence only.
It does not:
infer authoritative file identity;
establish Record Resource equivalence;
reconstruct provenance lineage;
determine curatorial relationships.
In RiC-aligned operational terminology:
rows in the snapshot represent filesystem observations;
repeated identical quick_sig values provide operational evidence that multiple observations may correspond to the same underlying digital resource;
differing signatures associated with the same filename may indicate divergent versions, forks, or independently evolving resources.
The function therefore supports:
longitudinal reconstruction;
distributed workflow analysis;
duplicate detection;
exploratory Record Set construction;
provenance-aware analytical workflows.
Duplicate observations are not inherently anomalous.
In distributed development workflows the same file may legitimately appear:
across multiple machines;
across synchronised project folders;
in backup or staging locations;
in derived analytical Record Sets.
The function therefore reports observational duplication rather than asserting erroneous copying.
The function treats:
filename as a weak identity signal;
quick_sig as a lightweight content equivalence signal.
Missing signatures (NA) are treated as a valid observational group.
This means:
multiple NA signatures are considered identical;
a mix of NA and non-NA signatures counts as versioning.
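The NA rule can be illustrated with base R's table(), which drops NA by default and only counts it as a group when useNA is set. This is a sketch of the described behaviour under that assumption, not the function's actual code; the signature vectors are made up.

```r
# Hypothetical signature vectors for two filenames.
all_na <- rep(NA_character_, 3)           # three observations, none with a signature
mixed  <- c("s1", NA_character_, NA_character_)

# Count signature groups, keeping NA as a group of its own.
count_groups <- function(sigs) table(sigs, useNA = "ifany")

n_groups_all_na <- length(count_groups(all_na))  # NA values collapse into one group
n_groups_mixed  <- length(count_groups(mixed))   # "s1" and NA are separate groups
largest_mixed   <- max(count_groups(mixed))      # the NA group is the largest here

# For comparison: the default table() call ignores NA entirely.
n_groups_default <- length(table(mixed))

c(all_na = n_groups_all_na, mixed = n_groups_mixed, default = n_groups_default)
```

Under the rule described above, the all-NA file would be reported with one version and no versioned copies, while the mixed file would be reported with two versions and one versioned copy.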
The function operates on observational snapshots and does not resolve identity across time or storage contexts.
Examples
data("fscontextdemo_snapshot_01")
combined_snapshot <- rbind(
fscontextdemo_snapshot_01,
fscontextdemo_snapshot_01
)
summarise_duplicates(combined_snapshot)
#> filename total_copies identical_copies versioned_copies
#> 1 .Rbuildignore 2 2 0
#> 2 .gitignore 6 2 4
#> 3 404.html 2 2 0
#> 4 404.md 2 2 0
#> 5 DESCRIPTION 2 2 0
#> 6 LICENSE 2 2 0
#> 7 LICENSE-text.html 2 2 0
#> 8 LICENSE-text.md 2 2 0
#> 9 LICENSE.html 2 2 0
#> 10 LICENSE.md 4 2 2
#> 11 NAMESPACE 2 2 0
#> 12 README.Rmd 2 2 0
#> 13 README.md 2 2 0
#> 14 _pkgdown.yml 2 2 0
#> 15 all.css 2 2 0
#> 16 all.min.css 2 2 0
#> 17 authors.html 2 2 0
#> 18 authors.md 2 2 0
#> 19 autocomplete.jquery.min.js 2 2 0
#> 20 bootstrap-toc.min.js 2 2 0
#> 21 bootstrap.bundle.min.js 2 2 0
#> 22 bootstrap.bundle.min.js.map 2 2 0
#> 23 bootstrap.min.css 2 2 0
#> 24 clipboard.min.js 2 2 0
#> 25 country_barplot.jpg 2 2 0
#> 26 country_barplot.png 4 2 2
#> 27 country_barplot.svg 2 2 0
#> 28 create_fsdemo_country_data.R 2 2 0
#> 29 data-deps.txt 2 2 0
#> 30 data-fsdemo_country_data.R 2 2 0
#> 31 demo.Rmd 2 2 0
#> 32 fa-brands-400.ttf 2 2 0
#> 33 fa-brands-400.woff2 2 2 0
#> 34 fa-regular-400.ttf 2 2 0
#> 35 fa-regular-400.woff2 2 2 0
#> 36 fa-solid-900.ttf 2 2 0
#> 37 fa-solid-900.woff2 2 2 0
#> 38 fa-v4compatibility.ttf 2 2 0
#> 39 fa-v4compatibility.woff2 2 2 0
#> 40 fscontextdemo.Rproj 2 2 0
#> 41 fsdemo_country_data.Rd 2 2 0
#> 42 fsdemo_country_data.csv 2 2 0
#> 43 fsdemo_country_data.html 2 2 0
#> 44 fsdemo_country_data.md 2 2 0
#> 45 fsdemo_country_data.rda 2 2 0
#> 46 fuse.min.js 2 2 0
#> 47 headroom.min.js 2 2 0
#> 48 hello_world.R 2 2 0
#> 49 hello_world.Rd 2 2 0
#> 50 hello_world.html 2 2 0
#> 51 hello_world.md 2 2 0
#> 52 index.html 4 2 2
#> 53 index.md 4 2 2
#> 54 jQuery.headroom.min.js 2 2 0
#> 55 jquery-3.6.0.js 2 2 0
#> 56 jquery-3.6.0.min.js 2 2 0
#> 57 jquery-3.6.0.min.map 2 2 0
#> 58 katex-auto.js 2 2 0
#> 59 lightswitch.js 2 2 0
#> 60 link.svg 2 2 0
#> 61 llms.txt 2 2 0
#> 62 mark.min.js 2 2 0
#> 63 package_initialisation.R 2 2 0
#> 64 pkgdown.js 2 2 0
#> 65 pkgdown.yaml 2 2 0
#> 66 pkgdown.yml 2 2 0
#> 67 search.json 2 2 0
#> 68 sitemap.xml 2 2 0
#> 69 test-hello_world.R 2 2 0
#> 70 testthat.R 2 2 0
#> 71 v4-shims.css 2 2 0
#> 72 v4-shims.min.css 2 2 0
#> n_versions
#> 1 1
#> 2 3
#> 3 1
#> 4 1
#> 5 1
#> 6 1
#> 7 1
#> 8 1
#> 9 1
#> 10 2
#> 11 1
#> 12 1
#> 13 1
#> 14 1
#> 15 1
#> 16 1
#> 17 1
#> 18 1
#> 19 1
#> 20 1
#> 21 1
#> 22 1
#> 23 1
#> 24 1
#> 25 1
#> 26 2
#> 27 1
#> 28 1
#> 29 1
#> 30 1
#> 31 1
#> 32 1
#> 33 1
#> 34 1
#> 35 1
#> 36 1
#> 37 1
#> 38 1
#> 39 1
#> 40 1
#> 41 1
#> 42 1
#> 43 1
#> 44 1
#> 45 1
#> 46 1
#> 47 1
#> 48 1
#> 49 1
#> 50 1
#> 51 1
#> 52 2
#> 53 2
#> 54 1
#> 55 1
#> 56 1
#> 57 1
#> 58 1
#> 59 1
#> 60 1
#> 61 1
#> 62 1
#> 63 1
#> 64 1
#> 65 1
#> 66 1
#> 67 1
#> 68 1
#> 69 1
#> 70 1
#> 71 1
#> 72 1
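A common follow-up is to keep only the filenames with more than one observed signature. The sketch below shows this with plain data.frame subsetting on a few rows copied from the output above; the object name res is an assumption standing in for the full result of summarise_duplicates().

```r
# Four rows copied from the printed output, standing in for the full result.
res <- data.frame(
  filename         = c(".Rbuildignore", ".gitignore", "LICENSE.md", "index.html"),
  total_copies     = c(2, 6, 4, 4),
  identical_copies = c(2, 2, 2, 2),
  versioned_copies = c(0, 4, 2, 2),
  n_versions       = c(1, 3, 2, 2),
  stringsAsFactors = FALSE
)

# Keep filenames observed with more than one signature,
# listing those with the most versioned copies first.
diverging <- res[res$n_versions > 1, ]
diverging <- diverging[order(-diverging$versioned_copies), ]
diverging
```

Because the function reports observational duplication only, the filtered rows are candidates for closer inspection rather than confirmed conflicts.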