
Personal Ebook Archive — Project Description

Goal

Build a self-hosted, searchable library of stories and books — primarily fanfiction from AO3, but designed to scale to millions of works including original fiction, non-fiction, and other long-form text. The library must be:

  • Comprehensive — no significant works missing from sources we choose to mirror
  • Searchable — meaningful filtering across many independent dimensions (genre, theme, content warnings, characters, etc.), not just title/author
  • Portable — accessible from all devices via VPN, with metadata that travels with the files
  • Self-hosted — no reliance on external services that could disappear or change terms
  • Resilient — backups, no single point of failure, schema-versioned so future changes don't invalidate old work

Scale

The target is millions of works (the entire AO3 archive, roughly 17 million works as of 2026, or more). The existing, chaotic dataset is already ~400GB of raw text. This rules out approaches that work fine for 10K books but break at this scale:

  • Full per-file decompression on every search → too slow
  • One row per work in a single SQLite table → fine for the metadata, but not for full-text
  • Naive AI re-tagging on every classification change → cost-prohibitive
  • Loading the whole library into memory → impossible

Infrastructure (existing or planned)

  • Homeserver running Proxmox with Docker VMs
  • Calibre-Web (Docker container) — self-hosted ebook server providing web UI, OPDS feed for e-readers, metadata management. Calibre handles ~100K books well; beyond that we may need to shard or supplement.
  • Tailscale mesh VPN (via self-hosted Headscale) — connects all devices, library accessible from laptop/phone/anywhere
  • Kiwix — separate offline reference content (Wikipedia, Project Gutenberg) — not part of this project
  • Existing 400GB messy text dataset — needs deduplication, format normalization, and classification

File Format

EPUB as the storage format:

  • Already compressed (ZIP-based)
  • Universally readable across devices
  • Standard metadata format (OPF) supports custom fields
  • Calibre handles them natively
  • Even at large scale, individual files stay small (~50KB-2MB typical)

Acquisition

  • AO3 — built-in EPUB download, allows non-commercial personal use. FanFicFare (Calibre plugin) automates downloading with rate-limit respect.
  • Other fanfic sites — FanFicFare supports many sources
  • Original fiction / nonfiction — Project Gutenberg, Standard Ebooks, manual sources
  • Existing dataset — needs cleanup pipeline before integration

Classification Taxonomy

A custom multi-axis classification system. Each work gets values across independent axes that capture different aspects of the reading experience. Designed from scratch because no existing system (Dewey, Library of Congress, BISAC, AO3 freeform tags) covers all the dimensions needed for granular search at this scale.

Axes Overview

| # | Axis | What It Describes | Detection |
|---|------|-------------------|-----------|
| 0 | Extractable Metadata | Word count, author, dates, source, hash, raw AO3 tags | Automated |
| 1 | Language | Text language (ISO 639-1) | Automated |
| 2 | Format | Flash, one-shot, novel, book series, article, paper | Auto + AI |
| 3 | Work Type | Fanfiction, original fiction, nonfiction, poetry, script, religious text | AI |
| 4 | Completion Status | Complete, incomplete (substantial/partial/early) | AI + metadata |
| 5 | Narrative POV | First, second, third limited/omniscient, multiple, epistolary | AI |
| 6 | Origin | Cultural/literary tradition (American, British, Japanese, Chinese, etc.); non-fanfic only | AI |
| 7 | Genre | Multi-select with sub-genre trees; ~20 top-level genres including web fiction (Cultivation, Travel/Portal, etc.) and Sexual Content/Adult | AI |
| 8 | Themes | Tonal qualities, story shapes, settings; genre-independent | AI |
| 9 | Content Warnings | Specific content readers may want to filter: sexual level, violence, psychological, language | AI |
| 10 | Tags | Structural flags, entities/creatures, animals | AI |
| 11 | Characters | Protagonist gender, age, alignment, power arc, social status, species, occupation, personality, characterization style, identity, narrative role, conditions | AI |
| — | Fandom Metadata | Fandoms, character pairings, AU type, crossover | AI (future) |

Full axis definitions live in axis-*.md files. Genre sub-trees live in genres/*.md.

Design Principles

  • Independence — axes are orthogonal where possible. A genre choice doesn't constrain a theme choice.
  • Multi-select where natural — most axes allow many values. Fiction is rarely one thing.
  • Sub-genre depth — detailed sub-categories within each genre, based on BISAC fiction codes plus web fiction additions
  • Deduplication — concepts live in exactly one place. If something exists as a genre, it doesn't also appear as a theme.
  • Fallback always allowed — every axis accepts "Unknown / Unclassified"
  • Granularity for filtering — categories are split when readers genuinely want to filter at that level (e.g., "two teenagers together" vs "adult and teenager" are separate sub-genres of Sexual Content)
  • Schema-versioned — taxonomy evolves; works should be tagged with the schema version they were classified against

AI Tagging Pipeline (Proposed)

A multi-pass approach to balance accuracy, cost, and context limits.

Pass 1: Triage / Top-Level

Send the work + only the top-level genre list (one-line descriptions). AI returns multi-select genre array. Cheap, broad, multi-select reduces lock-out risk.

Pass 2: Sub-Genre Drill-Down

For each top-level genre identified in Pass 1, send only that genre's sub-genre file along with the work. Focused context = better decisions.

Pass 3: Independent Axes

Themes, Warnings, Characters, Tags, POV, Origin run with their own focused prompts. Can run in parallel after Pass 1.

Pass 4: Validation

Final pass takes all assigned tags + the work and asks "do these classifications match?" Catches obvious mistakes.
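The four passes above can be sketched as a single orchestration function. This is a minimal sketch, not the implementation: `llm` stands in for any chat-completion call, and the prompt strings, axis names, and `subgenre_files` mapping are all illustrative placeholders.

```python
def run_pipeline(work_text, llm, top_level_genres, subgenre_files):
    """Run the four classification passes; `llm(prompt)` returns a list of labels."""
    # Pass 1: triage against the one-line top-level genre list only.
    genres = llm(f"Genres (multi-select) from: {top_level_genres}\n---\n{work_text}")

    # Pass 2: drill down with one focused call per genre found in Pass 1.
    subgenres = {}
    for g in genres:
        tree = subgenre_files.get(g, "")
        subgenres[g] = llm(f"Sub-genres of {g}:\n{tree}\n---\n{work_text}")

    # Pass 3: independent axes; these could run in parallel after Pass 1.
    axes = {axis: llm(f"Classify {axis}:\n---\n{work_text}")
            for axis in ("themes", "warnings", "characters", "tags", "pov")}

    # Pass 4: validation — show everything assigned and ask for mismatches.
    assigned = {"genres": genres, "subgenres": subgenres, **axes}
    issues = llm(f"Do these classifications match the work? {assigned}\n---\n{work_text}")
    return assigned, issues
```

Keeping the model call behind a plain callable makes it easy to swap a cheaper model into Pass 1 and a larger one into Pass 4, per the cost strategy below.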

Bootstrap from existing metadata

For AO3 fanfic, the raw freeform tags are gold. A pre-pass that maps known AO3 tags to taxonomy entries can do most of the work, with AI only filling gaps. This dramatically reduces token cost and improves accuracy on the messy bulk import.
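A minimal sketch of that pre-pass, assuming a hand-built lookup table. The mapping entries and taxonomy values shown are hypothetical examples; the real table would be derived from the axis-*.md definitions.

```python
# Illustrative AO3-tag → (axis, taxonomy value) table; real entries TBD.
AO3_TAG_MAP = {
    "fluff": ("themes", "Comfort / Low Stakes"),
    "hurt/comfort": ("themes", "Hurt/Comfort"),
    "alternate universe": ("fandom", "AU"),
    "graphic depictions of violence": ("warnings", "Violence"),
}

def bootstrap_from_ao3(raw_tags):
    """Map known AO3 freeform tags onto taxonomy axes; return leftovers for AI."""
    mapped, unmapped = {}, []
    for tag in raw_tags:
        key = tag.strip().lower()
        if key in AO3_TAG_MAP:
            axis, value = AO3_TAG_MAP[key]
            mapped.setdefault(axis, []).append(value)
        else:
            unmapped.append(tag)
    return mapped, unmapped
```

Only the `unmapped` list needs to go to the AI passes, which is where the token savings come from.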

Long-work handling

Most novels exceed model context limits and can't be sent in full. Options:

  • First chapter + summary + selected later passages (cross-section)
  • Source metadata + AO3 tags + first chapter (bootstrap)
  • Chunked analysis with synthesis pass
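The cross-section option could look like the sketch below: the opening in full, then evenly spaced excerpts from the rest. The character budgets are illustrative defaults, not tuned values.

```python
def cross_section(text, head_chars=8000, n_excerpts=4, excerpt_chars=2000):
    """Return the first head_chars plus evenly spaced later excerpts."""
    if len(text) <= head_chars:
        return text
    parts = [text[:head_chars]]
    body = text[head_chars:]
    stride = max(1, len(body) // n_excerpts)
    for i in range(n_excerpts):
        start = i * stride
        parts.append(body[start:start + excerpt_chars])
    # Mark the gaps so the model knows material was skipped.
    return "\n[...]\n".join(parts)
```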

Storage Architecture

Per-File Metadata

Classification data lives inside each EPUB in the OPF metadata file using a custom namespace. The EPUB is the source of truth. Format: JSON blob in a custom <meta> element.

<meta property="archive:taxonomy">
{"schema_version":"0.9","work_type":"Fanfiction","genres":["Fantasy","Romance"],...}
</meta>

Why embedded

  • Metadata travels with files
  • No external DB to keep in sync
  • Backups = backing up EPUBs
  • Re-import to fresh Calibre = automatic re-population
  • Schema-versioned so old tags can be identified and re-processed

Index / Search Layer

A separate search index derived from the embedded metadata. Calibre's SQLite DB works for the metadata side (classification fields → custom columns). Full-text search at this scale is the open problem.

Open Problems

1. Full-Text Search at Scale

Calibre-Web's built-in search isn't designed for millions of works. The 400GB existing dataset is mostly raw text that's already indexable. Open questions:

  • Can EPUBs be searched without decompression? (Short answer: not really — EPUB is ZIP, and you need to decompress chapters to read them. But indexing each EPUB once and then searching the index is fast.)
  • Best index format? Candidates: SQLite FTS5, Elasticsearch, Meilisearch, Tantivy, Manticore, Typesense, Vespa
  • Index size: inverted indexes typically run ~30-50% of the source text, so expect 100-200GB of index for 400GB of text
  • Update strategy: incremental indexing as new works arrive
  • Search query complexity: combine taxonomy filtering (structured) with full-text (unstructured) in one query
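The combined structured + full-text query is easy to prototype with the SQLite FTS5 candidate: taxonomy fields live in a normal table, text lives in an FTS5 table, and the two join on the work id. Table and column names below are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE works(id INTEGER PRIMARY KEY, title TEXT, genres TEXT);
    -- Contentless FTS5 table: index only, original text stays in the EPUBs.
    CREATE VIRTUAL TABLE work_text USING fts5(body, content='');
""")
con.execute("INSERT INTO works VALUES (1, 'Example', 'Fantasy,Romance')")
con.execute("INSERT INTO work_text(rowid, body) VALUES (1, 'the dragon guarded the tower')")

# One query combining full-text MATCH with a taxonomy filter.
rows = con.execute("""
    SELECT w.id, w.title
    FROM work_text JOIN works w ON w.id = work_text.rowid
    WHERE work_text MATCH ? AND w.genres LIKE ?
""", ("dragon", "%Fantasy%")).fetchall()
```

Incremental indexing then reduces to inserting one `work_text` row per newly arrived work.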

2. Calibre at Scale

Calibre starts struggling well before millions of works. Options:

  • Shard library across multiple Calibre instances (e.g., by fandom or by source)
  • Replace Calibre with a custom front-end and use Calibre only for OPDS export
  • Use Calibre as a metadata viewer over a custom SQL backend
  • Accept that classification metadata lives in a separate system from Calibre's view

3. Deduplication

The existing 400GB has duplicates, reuploads, split stories, and inconsistent naming. Need a pipeline:

  • Compute file hashes (exact dupes)
  • Compute content hashes (after stripping formatting — near-dupes)
  • Detect "same story, different format/quality"
  • Detect split stories (one work spread across multiple files)
  • Decide canonical version (longest, latest, highest quality)
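The first two steps can be sketched as two hash functions: one over the raw bytes for exact dupes, one over text normalized to ignore formatting for near-dupes. The normalization rules here are a starting assumption, not a final design.

```python
import hashlib
import re

def file_hash(data: bytes) -> str:
    """Exact-duplicate detection: hash the raw file bytes."""
    return hashlib.sha256(data).hexdigest()

def content_hash(text: str) -> str:
    """Near-duplicate detection: hash text with formatting noise stripped."""
    plain = re.sub(r"<[^>]+>", " ", text)          # drop markup-ish tags
    plain = re.sub(r"\s+", " ", plain).strip().lower()  # collapse whitespace
    return hashlib.sha256(plain.encode("utf-8")).hexdigest()
```

Split-story detection and canonical-version selection need fuzzier similarity measures (e.g. shingling) and are left open here.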

4. AI Cost / Volume

Tagging millions of works isn't cheap. Strategies:

  • Batch process during off-hours
  • Use cheaper/smaller models for Pass 1, larger for ambiguous cases
  • Bootstrap aggressively from existing metadata (AO3 tags)
  • Partial coverage: prioritize what gets tagged based on popularity or user interest
  • Re-tagging when schema changes: only re-tag affected axes, not whole works

5. Schema Evolution

The taxonomy will change. When it does:

  • Old works keep their old schema_version
  • New works use the new version
  • Migration scripts or AI re-passes handle version bumps for selected fields
  • Must avoid breaking changes that invalidate huge chunks of data
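One way to keep version bumps incremental is a chain of per-version migration functions that touch only the fields that changed. The version numbers and field defaults below are hypothetical.

```python
# Each entry: old version -> (new version, function patching only affected fields).
MIGRATIONS = {
    "0.9": ("1.0", lambda t: {**t, "work_type": t.get("work_type", "Unknown")}),
}

def migrate(taxonomy):
    """Apply migrations in order until the taxonomy is at the latest version."""
    while taxonomy["schema_version"] in MIGRATIONS:
        new_version, fn = MIGRATIONS[taxonomy["schema_version"]]
        taxonomy = {**fn(taxonomy), "schema_version": new_version}
    return taxonomy
```

Works already at the latest version pass through unchanged, so the same function can run over the whole library idempotently.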

6. Calibre Custom Columns

Calibre's custom column system has limits — number of columns, query speed with many filters. Multi-axis taxonomy with multi-select fields may push past what custom columns can do gracefully. Investigate before committing.

Current State

  • Taxonomy axes: drafted (v0.9)
  • Genre sub-trees: drafted with BISAC + web fiction additions
  • AI pipeline design: outlined, not implemented
  • Storage format: outlined, not implemented
  • Search architecture: open question
  • Deduplication pipeline: not started
  • Calibre integration: not started

Next Steps

  1. Validate taxonomy on ~50 sample works manually
  2. Choose search index technology and test on a subset
  3. Build the AO3-tag-to-taxonomy bootstrap mapper
  4. Build Pass 1 prompt and test multi-select accuracy
  5. Define the JSON schema for embedded EPUB metadata
  6. Build a small end-to-end pipeline: download → tag → embed → index → search
  7. Iterate