
Personal Ebook Archive — Project Description

Goal

Build a self-hosted, searchable library of stories and books — primarily fanfiction from AO3, but designed to scale to millions of works including original fiction, non-fiction, and other long-form text. The library must be:

  • Comprehensive — no significant works missing from sources we choose to mirror
  • Searchable — meaningful filtering across many independent dimensions (genre, theme, content warnings, characters, etc.), not just title/author
  • Portable — accessible from all devices via VPN, with metadata that travels with the files
  • Self-hosted — no reliance on external services that could disappear or change terms
  • Resilient — backups, no single point of failure, schema-versioned so future changes don't invalidate old work

Scale

The target is millions of works (the entire AO3 archive, roughly 17 million works as of 2026, or more). The existing, chaotic dataset is already ~400GB of raw text. This rules out approaches that work fine for 10K books but break at this scale:

  • Full per-file decompression on every search → too slow
  • One row per work in a single SQLite table → fine for the metadata, but not for full-text
  • Naive AI re-tagging on every classification change → cost-prohibitive
  • Loading the whole library into memory → impossible

Infrastructure (existing or planned)

  • Homeserver running Proxmox with Docker VMs
  • Calibre-Web (Docker container) — self-hosted ebook server providing web UI, OPDS feed for e-readers, metadata management. Calibre handles ~100K books well; beyond that we may need to shard or supplement.
  • Tailscale mesh VPN (via self-hosted Headscale) — connects all devices, library accessible from laptop/phone/anywhere
  • Kiwix — separate offline reference content (Wikipedia, Project Gutenberg) — not part of this project
  • Existing 400GB messy text dataset — needs deduplication, format normalization, and classification

File Format

EPUB as the storage format:

  • Already compressed (ZIP-based)
  • Universally readable across devices
  • Standard metadata format (OPF) supports custom fields
  • Calibre handles them natively
  • Even at large scale, individual files stay small (~50KB-2MB typical)

Acquisition

  • AO3 — built-in EPUB download, allows non-commercial personal use. FanFicFare (Calibre plugin) automates downloading with rate-limit respect.
  • Other fanfic sites — FanFicFare supports many sources
  • Original fiction / nonfiction — Project Gutenberg, Standard Ebooks, manual sources
  • Existing dataset — needs cleanup pipeline before integration

Classification Taxonomy

A custom multi-axis classification system. Each work gets values across independent axes that capture different aspects of the reading experience. Designed from scratch because no existing system (Dewey, Library of Congress, BISAC, AO3 freeform tags) covers all the dimensions needed for granular search at this scale.

Axes Overview

| # | Axis | What It Describes | Detection |
|---|------|-------------------|-----------|
| 0 | Extractable Metadata | Word count, author, dates, source, hash, raw AO3 tags | Automated |
| 1 | Language | Text language (ISO 639-1) | Automated |
| 2 | Format | Flash, one-shot, novel, book series, article, paper | Auto + AI |
| 3 | Work Type | Fanfiction, original fiction, nonfiction, poetry, script, religious text | AI |
| 4 | Completion Status | Complete, incomplete (substantial/partial/early) | AI + metadata |
| 5 | Narrative POV | First, second, third limited/omniscient, multiple, epistolary | AI |
| 6 | Origin | Cultural/literary tradition (American, British, Japanese, Chinese, etc.); non-fanfic only | AI |
| 7 | Genre | Multi-select with sub-genre trees; ~20 top-level genres including web fiction (Cultivation, Travel/Portal, etc.) and Sexual Content/Adult | AI |
| 8 | Themes | Tonal qualities, story shapes, settings; genre-independent | AI |
| 9 | Content Warnings | Specific content readers may want to filter: sexual level, violence, psychological, language | AI |
| 10 | Tags | Structural flags, entities/creatures, animals | AI |
| 11 | Characters | Protagonist gender, age, alignment, power arc, social status, species, occupation, personality, characterization style, identity, narrative role, conditions | AI |
| — | Fandom Metadata | Fandoms, character pairings, AU type, crossover | AI (future) |

Full axis definitions live in axis-*.md files. Genre sub-trees live in genres/*.md.

Design Principles

  • Independence — axes are orthogonal where possible. A genre choice doesn't constrain a theme choice.
  • Multi-select where natural — most axes allow many values. Fiction is rarely one thing.
  • Sub-genre depth — detailed sub-categories within each genre, based on BISAC fiction codes plus web fiction additions
  • Deduplication — concepts live in exactly one place. If something exists as a genre, it doesn't also appear as a theme.
  • Fallback always allowed — every axis accepts "Unknown / Unclassified"
  • Granularity for filtering — categories are split when readers genuinely want to filter at that level (e.g., "two teenagers together" vs "adult and teenager" are separate sub-genres of Sexual Content)
  • Schema-versioned — taxonomy evolves; works should be tagged with the schema version they were classified against

AI Tagging Pipeline (Proposed)

A multi-pass approach to balance accuracy, cost, and context limits.

Pass 1: Triage / Top-Level

Send the work + only the top-level genre list (one-line descriptions). AI returns multi-select genre array. Cheap, broad, multi-select reduces lock-out risk.

Pass 2: Sub-Genre Drill-Down

For each top-level genre identified in Pass 1, send only that genre's sub-genre file along with the work. Focused context = better decisions.

Pass 3: Independent Axes

Themes, Warnings, Characters, Tags, POV, Origin run with their own focused prompts. Can run in parallel after Pass 1.

Pass 4: Validation

Final pass takes all assigned tags + the work and asks "do these classifications match?" Catches obvious mistakes.
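The four passes above can be sketched as a single orchestration function. This is a minimal sketch, not the implementation: `llm` stands in for any chat-completion call, and the prompt strings, axis names, and `subgenre_files` mapping are all illustrative placeholders.

```python
def run_pipeline(work_text, llm, top_level_genres, subgenre_files):
    """Run the four classification passes; `llm(prompt)` returns a list of labels."""
    # Pass 1: triage against the one-line top-level genre list only.
    genres = llm(f"Genres (multi-select) from: {top_level_genres}\n---\n{work_text}")

    # Pass 2: drill down with one focused call per genre found in Pass 1.
    subgenres = {}
    for g in genres:
        tree = subgenre_files.get(g, "")
        subgenres[g] = llm(f"Sub-genres of {g}:\n{tree}\n---\n{work_text}")

    # Pass 3: independent axes; these could run in parallel after Pass 1.
    axes = {axis: llm(f"Classify {axis}:\n---\n{work_text}")
            for axis in ("themes", "warnings", "characters", "tags", "pov")}

    # Pass 4: validation — show everything assigned and ask for mismatches.
    assigned = {"genres": genres, "subgenres": subgenres, **axes}
    issues = llm(f"Do these classifications match the work? {assigned}\n---\n{work_text}")
    return assigned, issues
```

Keeping the model call behind a plain callable makes it easy to swap a cheaper model into Pass 1 and a larger one into Pass 4, per the cost strategy below.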

Bootstrap from existing metadata

For AO3 fanfic, the raw freeform tags are gold. A pre-pass that maps known AO3 tags to taxonomy entries can do most of the work, with AI only filling gaps. This dramatically reduces token cost and improves accuracy on the messy bulk import.
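A minimal sketch of that pre-pass, assuming a hand-built lookup table. The mapping entries and taxonomy values shown are hypothetical examples; the real table would be derived from the axis-*.md definitions.

```python
# Illustrative AO3-tag → (axis, taxonomy value) table; real entries TBD.
AO3_TAG_MAP = {
    "fluff": ("themes", "Comfort / Low Stakes"),
    "hurt/comfort": ("themes", "Hurt/Comfort"),
    "alternate universe": ("fandom", "AU"),
    "graphic depictions of violence": ("warnings", "Violence"),
}

def bootstrap_from_ao3(raw_tags):
    """Map known AO3 freeform tags onto taxonomy axes; return leftovers for AI."""
    mapped, unmapped = {}, []
    for tag in raw_tags:
        key = tag.strip().lower()
        if key in AO3_TAG_MAP:
            axis, value = AO3_TAG_MAP[key]
            mapped.setdefault(axis, []).append(value)
        else:
            unmapped.append(tag)
    return mapped, unmapped
```

Only the `unmapped` list needs to go to the AI passes, which is where the token savings come from.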

Long-work handling

Most novels exceed model context limits and can't be sent in full. Options:

  • First chapter + summary + selected later passages (cross-section)
  • Source metadata + AO3 tags + first chapter (bootstrap)
  • Chunked analysis with synthesis pass
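The cross-section option could look like the sketch below: the opening in full, then evenly spaced excerpts from the rest. The character budgets are illustrative defaults, not tuned values.

```python
def cross_section(text, head_chars=8000, n_excerpts=4, excerpt_chars=2000):
    """Return the first head_chars plus evenly spaced later excerpts."""
    if len(text) <= head_chars:
        return text
    parts = [text[:head_chars]]
    body = text[head_chars:]
    stride = max(1, len(body) // n_excerpts)
    for i in range(n_excerpts):
        start = i * stride
        parts.append(body[start:start + excerpt_chars])
    # Mark the gaps so the model knows material was skipped.
    return "\n[...]\n".join(parts)
```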

Storage Architecture

Per-File Metadata

Classification data lives inside each EPUB in the OPF metadata file using a custom namespace. The EPUB is the source of truth. Format: JSON blob in a custom <meta> element.

<meta property="archive:taxonomy">
{"schema_version":"0.9","work_type":"Fanfiction","genres":["Fantasy","Romance"],...}
</meta>

Why embedded

  • Metadata travels with files
  • No external DB to keep in sync
  • Backups = backing up EPUBs
  • Re-import to fresh Calibre = automatic re-population
  • Schema-versioned so old tags can be identified and re-processed

Index / Search Layer

A separate search index derived from the embedded metadata. Calibre's SQLite DB works for the metadata side (classification fields → custom columns). Full-text search at this scale is the open problem.

Open Problems

1. Full-Text Search at Scale

Calibre-Web's built-in search isn't designed for millions of works. The 400GB existing dataset is mostly raw text that's already indexable. Open questions:

  • Can EPUBs be searched without decompression? (Short answer: not really — EPUB is ZIP, and you need to decompress chapters to read them. But indexing each EPUB once and then searching the index is fast.)
  • Best index format? Candidates: SQLite FTS5, Elasticsearch, Meilisearch, Tantivy, Manticore, Typesense, Vespa
  • Index size: inverted indexes typically run ~30-50% of the source text, so expect 100-200GB of index for 400GB of text
  • Update strategy: incremental indexing as new works arrive
  • Search query complexity: combine taxonomy filtering (structured) with full-text (unstructured) in one query
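The combined structured + full-text query is easy to prototype with the SQLite FTS5 candidate: taxonomy fields live in a normal table, text lives in an FTS5 table, and the two join on the work id. Table and column names below are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE works(id INTEGER PRIMARY KEY, title TEXT, genres TEXT);
    -- Contentless FTS5 table: index only, original text stays in the EPUBs.
    CREATE VIRTUAL TABLE work_text USING fts5(body, content='');
""")
con.execute("INSERT INTO works VALUES (1, 'Example', 'Fantasy,Romance')")
con.execute("INSERT INTO work_text(rowid, body) VALUES (1, 'the dragon guarded the tower')")

# One query combining full-text MATCH with a taxonomy filter.
rows = con.execute("""
    SELECT w.id, w.title
    FROM work_text JOIN works w ON w.id = work_text.rowid
    WHERE work_text MATCH ? AND w.genres LIKE ?
""", ("dragon", "%Fantasy%")).fetchall()
```

Incremental indexing then reduces to inserting one `work_text` row per newly arrived work.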

2. Calibre at Scale

Calibre starts struggling well before millions of works. Options:

  • Shard library across multiple Calibre instances (e.g., by fandom or by source)
  • Replace Calibre with a custom front-end and use Calibre only for OPDS export
  • Use Calibre as a metadata viewer over a custom SQL backend
  • Accept that classification metadata lives in a separate system from Calibre's view

3. Deduplication

The existing 400GB has duplicates, reuploads, split stories, and inconsistent naming. Need a pipeline:

  • Compute file hashes (exact dupes)
  • Compute content hashes (after stripping formatting — near-dupes)
  • Detect "same story, different format/quality"
  • Detect split stories (one work spread across multiple files)
  • Decide canonical version (longest, latest, highest quality)
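The first two steps can be sketched as two hash functions: one over the raw bytes for exact dupes, one over text normalized to ignore formatting for near-dupes. The normalization rules here are a starting assumption, not a final design.

```python
import hashlib
import re

def file_hash(data: bytes) -> str:
    """Exact-duplicate detection: hash the raw file bytes."""
    return hashlib.sha256(data).hexdigest()

def content_hash(text: str) -> str:
    """Near-duplicate detection: hash text with formatting noise stripped."""
    plain = re.sub(r"<[^>]+>", " ", text)          # drop markup-ish tags
    plain = re.sub(r"\s+", " ", plain).strip().lower()  # collapse whitespace
    return hashlib.sha256(plain.encode("utf-8")).hexdigest()
```

Split-story detection and canonical-version selection need fuzzier similarity measures (e.g. shingling) and are left open here.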

4. AI Cost / Volume

Tagging millions of works isn't cheap. Strategies:

  • Batch process during off-hours
  • Use cheaper/smaller models for Pass 1, larger for ambiguous cases
  • Bootstrap aggressively from existing metadata (AO3 tags)
  • Partial coverage: prioritize what gets tagged based on popularity or user interest
  • Re-tagging when schema changes: only re-tag affected axes, not whole works

5. Schema Evolution

The taxonomy will change. When it does:

  • Old works keep their old schema_version
  • New works use the new version
  • Migration scripts or AI re-passes handle version bumps for selected fields
  • Must avoid breaking changes that invalidate huge chunks of data
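One way to keep version bumps incremental is a chain of per-version migration functions that touch only the fields that changed. The version numbers and field defaults below are hypothetical.

```python
# Each entry: old version -> (new version, function patching only affected fields).
MIGRATIONS = {
    "0.9": ("1.0", lambda t: {**t, "work_type": t.get("work_type", "Unknown")}),
}

def migrate(taxonomy):
    """Apply migrations in order until the taxonomy is at the latest version."""
    while taxonomy["schema_version"] in MIGRATIONS:
        new_version, fn = MIGRATIONS[taxonomy["schema_version"]]
        taxonomy = {**fn(taxonomy), "schema_version": new_version}
    return taxonomy
```

Works already at the latest version pass through unchanged, so the same function can run over the whole library idempotently.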

6. Calibre Custom Columns

Calibre's custom column system has limits — number of columns, query speed with many filters. Multi-axis taxonomy with multi-select fields may push past what custom columns can do gracefully. Investigate before committing.

Current State

  • Taxonomy axes: drafted (v0.9)
  • Genre sub-trees: drafted with BISAC + web fiction additions
  • AI pipeline design: outlined, not implemented
  • Storage format: outlined, not implemented
  • Search architecture: open question
  • Deduplication pipeline: not started
  • Calibre integration: not started

Next Steps

  1. Validate taxonomy on ~50 sample works manually
  2. Choose search index technology and test on a subset
  3. Build the AO3-tag-to-taxonomy bootstrap mapper
  4. Build Pass 1 prompt and test multi-select accuracy
  5. Define the JSON schema for embedded EPUB metadata
  6. Build a small end-to-end pipeline: download → tag → embed → index → search
  7. Iterate