Diagnose · Place · Build
Where does your system actually need to be?
Most stats feel wrong because they're built one tier too low — or one tier too high. Answer a few questions, find your rung on the ladder, then see what it costs, why it works, the pain that tells you to climb, and how to build it.
The ladder
Climb by ambition (the tier); branch by plumbing (the event level). Open any rung to see what it costs, why it works, and how to build it.
Tier 1 Live query on current state Compute it on demand from rows as they exist now. ▸
- Complexity
- trivial
- Setup
- DB + the right index
- Maintenance
- none
What it captures
The current value, exact to the second. "How many contacts," "open jobs," "current pipeline."
Why it works
Always correct, zero infrastructure, nothing to invalidate. The right answer far more often than people think.
The pains / hard limits
No history — the table only knows now. Deletes and edits erase the past. Cannot answer "how many last March."
What people do to avoid it
Raise timeouts, throw hardware at the DB, sprinkle ad-hoc view caching.
Result: Mystery slow pages and a count query that quietly becomes the app bottleneck.
How to build it
1. Index what you filter on
# migration
add_index :contacts, :status
2. Just ask the database
Contact.where(status: :active).count
Job.where(state: :open).sum(:value_cents)
3. That is the whole tier
# no job, no table, no cache. climb only when a trigger below fires.
Tier 2 Cached current state The same value, memoized so the read is cheap. ▸
- Complexity
- low
- Setup
- counter_cache / Redis / matview
- Maintenance
- keeping the cache correct (invalidation)
What it captures
Same questions as Tier 1, at scale or on hot paths.
Why it works
Fast reads. counter_cache is the clean form — the framework keeps the number current as rows come and go, so reads are free.
The pains / hard limits
Still no history. New risk: a cache that drifts if invalidation is wrong. Freshness becomes a choice, not a guarantee.
What people do to avoid it
Cache it manually in fifty places with no single invalidation path.
Result: Stale numbers nobody trusts and bugs you can never reproduce.
How to build it
1. counter_cache — the clean version
class Job < ApplicationRecord
belongs_to :contact, counter_cache: true
end
# migration: add_column :contacts, :jobs_count, :integer, default: 0
# backfill: Contact.find_each { |c| Contact.reset_counters(c.id, :jobs) }
2. Or a Redis counter for cross-cutting totals
# on write
$redis.incr("open_jobs") # decr on close
# on read
$redis.get("open_jobs").to_i
3. Or a materialized view on a refresh schedule
CREATE MATERIALIZED VIEW pipeline_now AS ...;
-- refresh nightly / on demand
REFRESH MATERIALIZED VIEW CONCURRENTLY pipeline_now;
Tier 3 Periodic snapshot A scheduled job writes the current value to a history table. ▸
- Complexity
- low–med
- Setup
- a scheduled job + a metrics table
- Maintenance
- the cron job; backfilling gaps when a run fails
What it captures
Trends of a level over time: "contacts at each month-end," "pipeline over the year."
Why it works
First tier that creates history the current-state tables can't reconstruct. Deleting a contact never touches yesterday's snapshot.
The pains / hard limits
Granularity is fixed by cadence — daily means no hourly, ever. Captures levels, lies about flow: 3 created + 2 deleted shows as +1, and the 3 and 2 are gone.
What people do to avoid it
Keep soft-deleted rows forever to reconstruct history; track numbers by hand in a spreadsheet.
Result: Bloated tables, "deleted" data leaking into live queries, and a spreadsheet that becomes the real (always-stale) source of truth.
How to build it
1. A history table keyed by date
create_table :daily_metrics do |t|
t.date :as_of, null: false
t.integer :contacts, :open_jobs
t.bigint :pipeline_cents
end
add_index :daily_metrics, :as_of, unique: true
2. A nightly job that samples current state
class SnapshotJob
def perform
DailyMetric.upsert({
as_of: Date.today,
contacts: Contact.count,
open_jobs: Job.open.count,
pipeline_cents: Job.open.sum(:value_cents)
}, unique_by: :as_of)
end
end
3. Schedule it (sidekiq-cron / whenever)
# every day at 00:05
SnapshotJob.perform_at("5 0 * * *")
Tier 4 Event log + rollups Record every occurrence as an immutable event; aggregate from it. ▸
- Complexity
- high — the big jump
- Setup
- event defs · emission at each change · events table · rollups · registry
- Maintenance
- rollup correctness + reconciliation · event versioning
What it captures
Flow and causality exactly — every thing that happened, in order, with its values frozen. Rates, conversion, time-between-states, drift-proof attribution.
Why it works
The churn problem disappears (you see the 3 and the 2). Attribution survives later edits because the value was copied into the event, not pointed at. Any new breakdown is just a re-aggregation.
The pains / hard limits
You only get the dimensions you snapshotted into the payload. Records what happened; does not rebuild an entity's arbitrary past state. Most moving parts so far.
What people do to avoid it
Compute rates from same-period snapshots (this-month closes ÷ this-month leads); store the daily ratio instead of its two counts; derive "events" by diffing state at end of day.
Result: Metrics that invert under volume, average-of-averages errors, and a corrupt event log with fabricated timestamps — the "feels wrong because it IS wrong" problem.
How to build it
1. Define the event (schema-first)
class DealClosed < EventDefinition
required_payload :lead_id, :salesperson_id, :source_id
required_payload :amount_cents
# occurred_at (spine) = when it happened
end
2. Emit at the moment of the change — freeze the values
def close_deal!(deal)
deal.update!(state: :won)
emit :deal_closed,
lead_id: deal.lead_id, salesperson_id: deal.rep_id,
source_id: deal.lead.source_id, # frozen NOW, survives later edits
amount_cents: deal.total_cents
end
3. Project into a permanent, queryable events table
create_table :events do |t|
t.uuid :event_id; t.string :name
t.datetime :occurred_at
t.bigint :source_id, :salesperson_id # promoted = filterable
t.bigint :amount_cents
t.jsonb :payload
end
add_index :events, :event_id, unique: true
add_index :events, [:name, :occurred_at]
4. Roll up by recomputation (idempotent, auditable)
INSERT INTO sales_rollups (period, source_id, deals, revenue_cents)
SELECT date_trunc('month', occurred_at), source_id,
count(*), sum(amount_cents)
FROM events WHERE name = 'deal_closed'
GROUP BY 1, 2
ON CONFLICT (period, source_id) DO UPDATE SET ...;
5. Reconcile — the recompute is the auditor
# nightly: rebuild from events, diff vs incremental rollup.
# any mismatch = a caught bug. unique event_id stops double-counting.
Tier 5 Event sourcing Events ARE the record; state is a projection you replay. ▸
- Complexity
- high — model inverts
- Setup
- event store · replay · snapshotting
- Maintenance
- event versioning/upcasting forever · replay performance
What it captures
Everything Tier 4 does, plus arbitrary past state and full audit replay — questions you never anticipated.
Why it works
One source of truth. Reconstruct any moment. Fix a projection and re-derive it cleanly. Strongest possible audit position.
The pains / hard limits
Heaviest to build and reason about. Overkill for most data. Replay and old-event versioning become permanent concerns. Resist it unless you truly need reconstruction or regulatory replay.
What people do to avoid it
Bolt on an audit_log; add history columns ad hoc; keep a changes table.
Result: Partial history covering only what you anticipated, audit gaps exactly where you didn't log, two sources of truth that disagree.
How to build it
1. Events are the only writes
# never UPDATE state. append facts.
store.append(deal_id, :deal_closed, amount_cents: 1_850_000)
2. State is folded from the stream
def rebuild(deal_id)
store.events_for(deal_id).reduce(Deal.new) do |deal, event|
deal.apply(event) # pure function: (state, event) -> state
end
end
3. Snapshot for replay speed; query as-of any date
# "what did this look like on 2026-03-31?"
store.events_for(deal_id, before: Date.new(2026, 3, 31)).reduce(...)
Event levels
How durably events are collected — these only matter from Tier 4 up.
Level In-process · sync / async Emit in code, or enqueue a job. ▸
- Complexity
- trivial
- Setup
- a method call / an ActiveJob
- Maintenance
- low
What it captures
Events handled inside your app, synchronously or in a background job.
Why it works
Cheapest possible event handling. Right when losing one is genuinely fine (directional, insight-grade metrics).
The pains / hard limits
Events lost on crash or job failure, silently. No dedup, no dead-letter, can't bind the event to the business transaction.
What people do to avoid it
Naive retries with no idempotency.
Result: Duplicate or lost events and no way to know what you missed.
How to build it
1. Sync (Level 1)
subscribers.each { |s| s.handle(event) } # in the request
2. Async (Level 2)
HandleEventJob.perform_later(event) # survives the request, not a crash
Level Outbox · durable Committed in the same transaction as the change. ▸
- Complexity
- medium
- Setup
- outbox table · relay/publisher · idempotency keys
- Maintenance
- monitor outbox · dead-letter · retention pruning
What it captures
Events that cannot be lost and are atomic with the business write.
Why it works
Solves the dual-write problem: the event and the state change commit together or not at all. The floor for money-grade and for event sourcing.
The pains / hard limits
Real machinery to run and monitor. The outbox is a queue — mutable, pruned — so it is not your permanent record (project to an events table for that).
What people do to avoid it
Dual-write — write the DB, then publish separately.
Result: The classic failure: DB commits but publish fails (or the reverse), and state and events silently diverge.
How to build it
1. Write event in the same txn as the change
transaction do
deal.update!(state: :won)
Outbox.create!(event_id: SecureRandom.uuid, name: "deal_closed", payload: {...})
end
2. A relay ships committed rows, then marks them
# background loop
Outbox.unpublished.find_each { |o| deliver(o); o.update!(published_at: Time.now) }
Level Broker · cross-service Events must reach another service. ▸
- Complexity
- high (ops)
- Setup
- Kafka/broker · consumers · schema registry · partitioning
- Maintenance
- broker ops · consumer lag · ordering · replay
What it captures
Durable delivery to consumers outside your process.
Why it works
Decouples services through an event stream; supports replay and many independent consumers.
The pains / hard limits
Operational weight: a broker to run, ordering and lag to reason about, schema evolution across teams.
What people do to avoid it
Services poll each other's databases or share one DB.
Result: Tight coupling, brittle integrations, no replay — a distributed monolith.
How to build it
1. Outbox relays to the broker
producer.produce(topic: "deals", key: deal_id, payload: event.to_json)
Level Telemetry sink · fire-and-forget High-volume anonymous signals (page views, clicks). ▸
- Complexity
- medium
- Setup
- collector endpoint · batching · columnar sink · pre-aggregation
- Maintenance
- the ingest pipeline + columnar store
What it captures
Pre-identity, high-frequency web signals feeding top-of-funnel marketing stats.
Why it works
Built for volume the outbox could never carry. Lossy by design — one page view is worthless, the aggregate is everything. Stitches to business events at lead capture.
The pains / hard limits
A whole second store to operate. Never your system of record. Must not share the business-events table.
What people do to avoid it
Stuff page views into the same events/outbox table.
Result: Outbox and DB overwhelmed; billions of near-worthless rows drown your business events; retention becomes a nightmare.
How to build it
1. Thin collector: accept, validate, enqueue, 202
# POST /collect
def collect
validate!(params) # against the shared schema
Buffer.push(params) # fire-and-forget
head :accepted # 202, no inline work
end
2. Batch-flush to the columnar store, pre-aggregate
# worker drains Buffer -> ClickHouse/lake in batches
# pre-roll sessions, landing-page CVR, A/B variant lift on ingest
3. Stitch to a business event at conversion
# form_submitted -> lead_created, freezing utm/variant/anonymous_id
emit :lead_created, source_id:, variant_id:, anonymous_id:
Which combinations are real
Tier × event level — not every pairing is valid.
| L0 · lossy sink | L1–2 · in-proc | L3 · outbox | L4 · broker | |
|---|---|---|---|---|
| Tier 1–3 | N/A | N/A | N/A | N/A |
| Tier 4 · insight | ✓ telemetry | ✓ ok | ✓ ideal | ✓ multi-svc |
| Tier 4 · money | ✗ wrong src | ⚠ dangerous | ✓ required | ✓ |
| Tier 5 · sourcing | N/A | ✗ contradiction | ✓ floor | ✓ |
- ⚠ Tier 4 on lossy collection silently undercounts rollups with no way to detect or recompute — fine for a directional glance, unacceptable for anything tied to money.
- ✗ Tier 5 on lossy is a contradiction: you can't make events the source of truth if the pipe drops them.