Stage 0 · discover

Walk a library. Propose events. Never copy a byte.

Discover is the cheap preprocessing step that turns a flat tree of FTP uploads into a list of proposed events you can accept or reject. It runs before ingest, reads EXIF only, and is safe to re-run every time new frames land.

The thing it does

The library is a rolling tree — every frame the camera has ever uploaded. Events are the atomic unit of work. Discover reads EXIF only, clusters by time, and proposes events. It never opens pixel data. It never copies a file.

Re-running is safe. Frames already recorded in an existing event are skipped, so a second run over an unchanged library proposes nothing new.
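A minimal sketch of how that skip could work, assuming the set is keyed by library plus relative path joined with a NUL byte (the key format described in step 2 of the algorithm). The names KnownSet, memberKey, and newFrames are hypothetical and are not taken from the photofly source.

```go
package main

import "fmt"

// memberKey joins library and relative path with a NUL byte. NUL cannot
// appear in a file path, so two different pairs never produce the same key.
func memberKey(library, relPath string) string {
	return library + "\x00" + relPath
}

// KnownSet is a hypothetical name for the set of already-indexed frames.
type KnownSet map[string]struct{}

// Add records a frame as known.
func (k KnownSet) Add(library, relPath string) {
	k[memberKey(library, relPath)] = struct{}{}
}

// Has reports whether a frame is already a member of some event.
func (k KnownSet) Has(library, relPath string) bool {
	_, ok := k[memberKey(library, relPath)]
	return ok
}

// newFrames returns only the candidates that are not yet known.
func newFrames(known KnownSet, library string, relPaths []string) []string {
	var out []string
	for _, p := range relPaths {
		if !known.Has(library, p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	known := KnownSet{}
	known.Add("/lib", "DCIM/100MSDCF/DSC00001.JPG")

	fresh := newFrames(known, "/lib", []string{
		"DCIM/100MSDCF/DSC00001.JPG", // already in an event, skipped
		"DCIM/100MSDCF/DSC00002.JPG", // new, would go on to the EXIF read
	})
	fmt.Println(fresh)
}
```

Because the lookup happens before any EXIF read, a re-run over a mostly known library costs little more than the directory walk itself.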

The algorithm, in seven steps
  1. Validate inputs.

    The library path must exist and be a directory. The events root from ~/.photofly/config.json must be set, or the command exits with a hint to run photofly init.

    code anchor: src/agent-photofly/internal/discover/discover.go:62–83
  2. Index already-known frames.

    Read every existing event.json under <eventsRoot>/events/. For each member, compute the key library + "\x00" + relative-path and build a set. This single set is the entire idempotency mechanism.

    code anchor: src/agent-photofly/internal/event/event.go:IndexedMembers
  3. Walk the library.

    Depth-first via filepath.WalkDir. For each file: filter by extension (JPEG only in v1 — .jpg, .jpeg, .JPG, .JPEG), skip if the (library, relpath) key is in the known set, otherwise read EXIF. The EXIF reader extracts DateTimeOriginal (becomes captured), Model (becomes camera), and LensModel (becomes lens). When a JPEG has no EXIF — re-saved files, screenshots — the file's mtime is used as captured and the photo still flows through. Emit a record per frame.

    code anchors: discover.go:104–150 (the walk), internal/exif/exif.go:Read (the EXIF read)
  4. Sort by captured time, ascending.

    A single sort.Slice over the collected frames.

    code anchor: discover.go:142
  5. Cluster.

    Walk the sorted slice pairwise. Start a new event whenever either of these is true: the gap between this frame and the previous one is greater than the gap threshold (default 2 hours, configurable with --gap-hours); or day-split is on (default) and the day-of-year has changed. Otherwise, append the frame to the current cluster.

    code anchor: discover.go:cluster (around line 152)
  6. Build a manifest per cluster.

    For each cluster, derive a slug of the shape YYYY-MM-DD-event-N, where N disambiguates multiple events on the same day. Deduplicate the cameras and lenses across members. Set status to proposed. Set genre to unknown: discover does not predict genre in v1; that is ingest's job once it has the materialized set.

    code anchor: discover.go:buildEvent (around line 175)
  7. Write event.json atomically.

    One file per event under <eventsRoot>/events/<slug>/event.json. Atomic write: tmp file plus rename, so a crashed run never leaves a partial manifest.

    code anchor: internal/event/event.go:Write
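Steps 4 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the code at the anchors above: the Frame type and the helpers clusterFrames, sameDay, slugs, and at are hypothetical, the day comparison uses the full calendar date rather than day-of-year, and N is assumed to count from 1 within each day (as the example id 2026-05-15-event-1 suggests).

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Frame is a hypothetical record for one discovered photo.
type Frame struct {
	RelPath  string
	Captured time.Time
}

// sameDay reports whether a and b fall on the same calendar date.
func sameDay(a, b time.Time) bool {
	ay, am, ad := a.Date()
	by, bm, bd := b.Date()
	return ay == by && am == bm && ad == bd
}

// clusterFrames sorts a copy of frames by capture time, then starts a new
// cluster whenever the gap to the previous frame exceeds gap, or daySplit
// is on and the calendar day has changed.
func clusterFrames(frames []Frame, gap time.Duration, daySplit bool) [][]Frame {
	sorted := append([]Frame(nil), frames...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Captured.Before(sorted[j].Captured)
	})
	var clusters [][]Frame
	for i, f := range sorted {
		startNew := i == 0
		if i > 0 {
			prev := sorted[i-1]
			startNew = f.Captured.Sub(prev.Captured) > gap ||
				(daySplit && !sameDay(prev.Captured, f.Captured))
		}
		if startNew {
			clusters = append(clusters, []Frame{f})
			continue
		}
		last := len(clusters) - 1
		clusters[last] = append(clusters[last], f)
	}
	return clusters
}

// slugs names each cluster YYYY-MM-DD-event-N, counting N per day from 1.
func slugs(clusters [][]Frame) []string {
	perDay := map[string]int{}
	out := make([]string, 0, len(clusters))
	for _, c := range clusters {
		day := c[0].Captured.Format("2006-01-02")
		perDay[day]++
		out = append(out, fmt.Sprintf("%s-event-%d", day, perDay[day]))
	}
	return out
}

// at parses an RFC 3339 timestamp and panics on malformed input (demo only).
func at(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	frames := []Frame{
		{"DSC00003.JPG", at("2026-05-15T12:30:00Z")},
		{"DSC00001.JPG", at("2026-05-15T09:00:00Z")},
		{"DSC00002.JPG", at("2026-05-15T09:30:00Z")},
		{"DSC00005.JPG", at("2026-05-16T00:10:00Z")},
		{"DSC00004.JPG", at("2026-05-15T23:50:00Z")},
	}
	clusters := clusterFrames(frames, 2*time.Hour, true)
	for i, slug := range slugs(clusters) {
		fmt.Println(slug, len(clusters[i]), "frame(s)")
	}
}
```

The last two frames in the example are only 20 minutes apart but straddle midnight, so day-split places them in separate events. That is the burst-across-midnight behavior listed under the failure modes.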
Honest v1 dumbness · failure modes
What v1 deliberately does not do
Time-only clustering
    No GPS, no burst-rate detection, no image-content similarity.
    GPS gates on EXIF GPS being present (many photos don't have it). Image-content clustering needs the ML runtime decision in ADR 0002. EXIF time is the one signal that's universally available.
No hashing
    Frames identified by (library, relpath, size), not sha256.
    SHA256 over 75 GB is prohibitive on a first run. A --with-hash flag will land when you actually need duplicate detection across libraries.
JPEG only
    .ARW, .NEF, .CR3 are skipped.
    The Sony A7 IV FTP-uploads JPEG by default. RAW support is an additive swap to evanoberholster/imagemeta.
Genre = "unknown"
    Discover does not predict genre.
    The signal is too thin without face counts and lens variance context. Ingest refines genre on the materialized set.
Slug = <date>-event-N
    No copenhagen-zoo, no --hint.
    Reverse-geocoding needs an offline geo database or a network call. Both are v2. photofly events rename will support the human-friendly slug.
Failure modes
A photo with a wrong camera clock
    The frame lands in the wrong event.
    photofly events split / events merge (currently stubs; high on the list once we hit a real misclassification on the live tree).
A photo with no EXIF at all
    Discover falls back to file mtime. If upload time differs from capture time, the frame may land in a wrong event.
    Same as above. Consider running discover on the library before it's re-saved or re-uploaded; preserve the camera's original timestamps.
A burst across midnight
    Day-split puts it in two events.
    ADR 0003 specifies that an active burst should join the split; the heuristic is not yet implemented. v1 currently splits blindly.
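The mtime fallback behind the second failure mode can be sketched like this. It assumes the EXIF reader hands back a zero time.Time when DateTimeOriginal is missing. The function capturedTime, its "exif"/"mtime" source label, and the helpers are hypothetical and only illustrate the decision, not the real exif.Read signature.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// noEXIF stands for "the reader found no DateTimeOriginal" (assumed convention).
var noEXIF time.Time

// capturedTime picks the timestamp used for clustering. It prefers the EXIF
// capture time and falls back to the file's mtime when EXIF is absent.
// The returned label says which source was used.
func capturedTime(exifTime time.Time, path string) (time.Time, string, error) {
	if !exifTime.IsZero() {
		return exifTime, "exif", nil
	}
	info, err := os.Stat(path)
	if err != nil {
		return time.Time{}, "", err
	}
	return info.ModTime(), "mtime", nil
}

// fixedTime is an arbitrary whole-second timestamp for the demo.
func fixedTime() time.Time {
	return time.Date(2026, 5, 15, 9, 1, 0, 0, time.UTC)
}

// tempFileWithMtime creates an empty temp file and sets its mtime.
func tempFileWithMtime(mtime time.Time) (string, error) {
	f, err := os.CreateTemp("", "frame-*.jpg")
	if err != nil {
		return "", err
	}
	name := f.Name()
	if err := f.Close(); err != nil {
		return "", err
	}
	return name, os.Chtimes(name, mtime, mtime)
}

func main() {
	path, err := tempFileWithMtime(fixedTime())
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)

	got, source, err := capturedTime(noEXIF, path)
	if err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Println(source, got.UTC().Format(time.RFC3339))
}
```

Recording which source supplied the timestamp is a design suggestion rather than documented behavior. It would make frames clustered by mtime easy to spot during review.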
The contract · what discover writes

Discover writes one file per event. The schema is documented in PRD §5.4. The shape, abbreviated, looks like this:

{
  "id": "2026-05-15-event-1",
  "name": "Fri May 15 2026",
  "status": "proposed",
  "discoveredAt": "2026-05-17T22:15:44Z",
  "cluster": {
    "method": "exif-v1",
    "timeWindow": { "start": "2026-05-15T09:01:00Z", "end": "2026-05-15T12:47:00Z" },
    "cameras": ["ILCE-7M4"]
  },
  "genre":  { "detected": "unknown", "confidence": 0 },
  "source": {
    "library":     "/var/lib/.../uploads/camera/",
    "memberCount": 412,
    "members":     [{ "path": "DCIM/100MSDCF/DSC00001.JPG", ... }]
  }
}
members[] is truncated here; in the real file it lists every member, 412 entries in this example, matching memberCount.
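A minimal sketch of the write side of this contract, assuming Go types that mirror only the abbreviated fields shown above (PRD §5.4 may define more) and the tmp-file-plus-rename approach from step 7. The type and function names here are hypothetical and are not the ones in internal/event/event.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// These types cover only the fields in the abbreviated example.
type TimeWindow struct {
	Start time.Time `json:"start"`
	End   time.Time `json:"end"`
}
type Cluster struct {
	Method     string     `json:"method"`
	TimeWindow TimeWindow `json:"timeWindow"`
	Cameras    []string   `json:"cameras"`
}
type Genre struct {
	Detected   string  `json:"detected"`
	Confidence float64 `json:"confidence"`
}
type Member struct {
	Path string `json:"path"`
}
type Source struct {
	Library     string   `json:"library"`
	MemberCount int      `json:"memberCount"`
	Members     []Member `json:"members"`
}
type Event struct {
	ID           string    `json:"id"`
	Name         string    `json:"name"`
	Status       string    `json:"status"`
	DiscoveredAt time.Time `json:"discoveredAt"`
	Cluster      Cluster   `json:"cluster"`
	Genre        Genre     `json:"genre"`
	Source       Source    `json:"source"`
}

// writeEventAtomic writes <eventsRoot>/events/<id>/event.json via a temp
// file in the same directory followed by a rename, so readers see either
// the old manifest or the complete new one.
func writeEventAtomic(eventsRoot string, ev Event) (string, error) {
	dir := filepath.Join(eventsRoot, "events", ev.ID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	data, err := json.MarshalIndent(ev, "", "  ")
	if err != nil {
		return "", err
	}
	tmp, err := os.CreateTemp(dir, "event-*.json.tmp")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // cleans up on failure; harmless after rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	final := filepath.Join(dir, "event.json")
	return final, os.Rename(tmp.Name(), final)
}

// demoRoundTrip writes one manifest into a temp events root and reads it back.
func demoRoundTrip() (Event, error) {
	root, err := os.MkdirTemp("", "photofly-events-")
	if err != nil {
		return Event{}, err
	}
	defer os.RemoveAll(root)
	ev := Event{
		ID: "2026-05-15-event-1", Name: "Fri May 15 2026", Status: "proposed",
		DiscoveredAt: time.Date(2026, 5, 17, 22, 15, 44, 0, time.UTC),
		Cluster:      Cluster{Method: "exif-v1", Cameras: []string{"ILCE-7M4"}},
		Genre:        Genre{Detected: "unknown"},
		Source: Source{Library: "/tmp/library", MemberCount: 1,
			Members: []Member{{Path: "DCIM/100MSDCF/DSC00001.JPG"}}},
	}
	path, err := writeEventAtomic(root, ev)
	if err != nil {
		return Event{}, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return Event{}, err
	}
	var got Event
	return got, json.Unmarshal(data, &got)
}

func main() {
	ev, err := demoRoundTrip()
	fmt.Println(ev.ID, ev.Status, ev.Source.MemberCount, err)
}
```

Creating the temp file inside the target directory keeps the rename on a single filesystem, which is what makes the replace atomic on POSIX systems.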
What you do with it

Review with photofly events ls. Rename, merge, or split if a cluster looks wrong. Then run photofly events accept <slug>; acceptance is what gates ingest. Accept, merge, split, and rename are stubs in v1: the manifests are the contract, and the editing tools land next.