DYB
← All labs

Diagnose · Place · Build

Where does your system actually need to be?

Most stats feel wrong because they're built one tier too low — or one tier too high. Answer a few questions, find your rung on the ladder, then see what it costs, why it works, the pain that tells you to climb, and how to build it.

Explore the ladder

The ladder

Climb by ambition (the tier); branch by plumbing (the event level). Open any rung to see what it costs, why it works, and how to build it.

Tier 1 Live query on current state Compute it on demand from rows as they exist now.
Complexity
trivial
Setup
DB + the right index
Maintenance
none

What it captures

The current value, exact to the second. "How many contacts," "open jobs," "current pipeline."

Why it works

Always correct, zero infrastructure, nothing to invalidate. The right answer far more often than people think.

The pains / hard limits

No history — the table only knows now. Deletes and edits erase the past. Cannot answer "how many last March."

What people do to avoid it

Raise timeouts, throw hardware at the DB, sprinkle ad-hoc view caching.

Result: Mystery slow pages and a count query that quietly becomes the app bottleneck.

How to build it

1. Index what you filter on

# migration
add_index :contacts, :status

2. Just ask the database

Contact.where(status: :active).count
Job.where(state: :open).sum(:value_cents)

3. That is the whole tier

# no job, no table, no cache. climb only when a trigger below fires.
Tier 2 Cached current state The same value, memoized so the read is cheap.
Complexity
low
Setup
counter_cache / Redis / matview
Maintenance
keeping the cache correct (invalidation)

What it captures

Same questions as Tier 1, at scale or on hot paths.

Why it works

Fast reads. counter_cache is the clean form — the framework keeps the number current as rows come and go, so reads are free.

The pains / hard limits

Still no history. New risk: a cache that drifts if invalidation is wrong. Freshness becomes a choice, not a guarantee.

What people do to avoid it

Cache it manually in fifty places with no single invalidation path.

Result: Stale numbers nobody trusts and bugs you can never reproduce.

How to build it

1. counter_cache — the clean version

class Job < ApplicationRecord
  belongs_to :contact, counter_cache: true
end
# migration: add_column :contacts, :jobs_count, :integer, default: 0
# backfill: Contact.find_each { |c| Contact.reset_counters(c.id, :jobs) }

2. Or a Redis counter for cross-cutting totals

# on write
$redis.incr("open_jobs")   # decr on close
# on read
$redis.get("open_jobs").to_i

3. Or a materialized view on a refresh schedule

CREATE MATERIALIZED VIEW pipeline_now AS ...;
-- refresh nightly / on demand
REFRESH MATERIALIZED VIEW CONCURRENTLY pipeline_now;
Tier 3 Periodic snapshot A scheduled job writes the current value to a history table.
Complexity
low–med
Setup
a scheduled job + a metrics table
Maintenance
the cron job; backfilling gaps when a run fails

What it captures

Trends of a level over time: "contacts at each month-end," "pipeline over the year."

Why it works

First tier that creates history the current-state tables can't reconstruct. Deleting a contact never touches yesterday's snapshot.

The pains / hard limits

Granularity is fixed by cadence — daily means no hourly, ever. Captures levels, lies about flow: 3 created + 2 deleted shows as +1, and the 3 and 2 are gone.

What people do to avoid it

Keep soft-deleted rows forever to reconstruct history; track numbers by hand in a spreadsheet.

Result: Bloated tables, "deleted" data leaking into live queries, and a spreadsheet that becomes the real (always-stale) source of truth.

How to build it

1. A history table keyed by date

create_table :daily_metrics do |t|
  t.date :as_of, null: false
  t.integer :contacts, :open_jobs
  t.bigint :pipeline_cents
end
add_index :daily_metrics, :as_of, unique: true

2. A nightly job that samples current state

class SnapshotJob
  def perform
    DailyMetric.upsert({
      as_of: Date.today,
      contacts: Contact.count,
      open_jobs: Job.open.count,
      pipeline_cents: Job.open.sum(:value_cents)
    }, unique_by: :as_of)
  end
end

3. Schedule it (sidekiq-cron / whenever)

# every day at 00:05
SnapshotJob.perform_at("5 0 * * *")
Tier 4 Event log + rollups Record every occurrence as an immutable event; aggregate from it.
Complexity
high — the big jump
Setup
event defs · emission at each change · events table · rollups · registry
Maintenance
rollup correctness + reconciliation · event versioning

What it captures

Flow and causality exactly — every thing that happened, in order, with its values frozen. Rates, conversion, time-between-states, drift-proof attribution.

Why it works

The churn problem disappears (you see the 3 and the 2). Attribution survives later edits because the value was copied into the event, not pointed at. Any new breakdown is just a re-aggregation.

The pains / hard limits

You only get the dimensions you snapshotted into the payload. Records what happened; does not rebuild an entity's arbitrary past state. Most moving parts so far.

What people do to avoid it

Compute rates from same-period snapshots (this-month closes ÷ this-month leads); store the daily ratio instead of its two counts; derive "events" by diffing state at end of day.

Result: Metrics that invert under volume, average-of-averages errors, and a corrupt event log with fabricated timestamps — the "feels wrong because it IS wrong" problem.

How to build it

1. Define the event (schema-first)

class DealClosed < EventDefinition
  required_payload :lead_id, :salesperson_id, :source_id
  required_payload :amount_cents
  # occurred_at (spine) = when it happened
end

2. Emit at the moment of the change — freeze the values

def close_deal!(deal)
  deal.update!(state: :won)
  emit :deal_closed,
    lead_id: deal.lead_id, salesperson_id: deal.rep_id,
    source_id: deal.lead.source_id,   # frozen NOW, survives later edits
    amount_cents: deal.total_cents
end

3. Project into a permanent, queryable events table

create_table :events do |t|
  t.uuid :event_id; t.string :name
  t.datetime :occurred_at
  t.bigint :source_id, :salesperson_id  # promoted = filterable
  t.bigint :amount_cents
  t.jsonb :payload
end
add_index :events, :event_id, unique: true
add_index :events, [:name, :occurred_at]

4. Roll up by recomputation (idempotent, auditable)

INSERT INTO sales_rollups (period, source_id, deals, revenue_cents)
SELECT date_trunc('month', occurred_at), source_id,
  count(*), sum(amount_cents)
FROM events WHERE name = 'deal_closed'
GROUP BY 1, 2
ON CONFLICT (period, source_id) DO UPDATE SET ...;

5. Reconcile — the recompute is the auditor

# nightly: rebuild from events, diff vs incremental rollup.
# any mismatch = a caught bug. unique event_id stops double-counting.
Tier 5 Event sourcing Events ARE the record; state is a projection you replay.
Complexity
high — model inverts
Setup
event store · replay · snapshotting
Maintenance
event versioning/upcasting forever · replay performance

What it captures

Everything Tier 4 does, plus arbitrary past state and full audit replay — questions you never anticipated.

Why it works

One source of truth. Reconstruct any moment. Fix a projection and re-derive it cleanly. Strongest possible audit position.

The pains / hard limits

Heaviest to build and reason about. Overkill for most data. Replay and old-event versioning become permanent concerns. Resist it unless you truly need reconstruction or regulatory replay.

What people do to avoid it

Bolt on an audit_log; add history columns ad hoc; keep a changes table.

Result: Partial history covering only what you anticipated, audit gaps exactly where you didn't log, two sources of truth that disagree.

How to build it

1. Events are the only writes

# never UPDATE state. append facts.
store.append(deal_id, :deal_closed, amount_cents: 1_850_000)

2. State is folded from the stream

def rebuild(deal_id)
  store.events_for(deal_id).reduce(Deal.new) do |deal, event|
    deal.apply(event)   # pure function: (state, event) -> state
  end
end

3. Snapshot for replay speed; query as-of any date

# "what did this look like on 2026-03-31?"
store.events_for(deal_id, before: Date.new(2026, 3, 31)).reduce(...)

Event levels

How durably events are collected — these only matter from Tier 4 up.

Level In-process · sync / async Emit in code, or enqueue a job.
Complexity
trivial
Setup
a method call / an ActiveJob
Maintenance
low

What it captures

Events handled inside your app, synchronously or in a background job.

Why it works

Cheapest possible event handling. Right when losing one is genuinely fine (directional, insight-grade metrics).

The pains / hard limits

Events lost on crash or job failure, silently. No dedup, no dead-letter, can't bind the event to the business transaction.

What people do to avoid it

Naive retries with no idempotency.

Result: Duplicate or lost events and no way to know what you missed.

How to build it

1. Sync (Level 1)

subscribers.each { |s| s.handle(event) }  # in the request

2. Async (Level 2)

HandleEventJob.perform_later(event)  # survives the request, not a crash
Level Outbox · durable Committed in the same transaction as the change.
Complexity
medium
Setup
outbox table · relay/publisher · idempotency keys
Maintenance
monitor outbox · dead-letter · retention pruning

What it captures

Events that cannot be lost and are atomic with the business write.

Why it works

Solves the dual-write problem: the event and the state change commit together or not at all. The floor for money-grade and for event sourcing.

The pains / hard limits

Real machinery to run and monitor. The outbox is a queue — mutable, pruned — so it is not your permanent record (project to an events table for that).

What people do to avoid it

Dual-write — write the DB, then publish separately.

Result: The classic failure: DB commits but publish fails (or the reverse), and state and events silently diverge.

How to build it

1. Write event in the same txn as the change

transaction do
  deal.update!(state: :won)
  Outbox.create!(event_id: SecureRandom.uuid, name: "deal_closed", payload: {...})
end

2. A relay ships committed rows, then marks them

# background loop
Outbox.unpublished.find_each { |o| deliver(o); o.update!(published_at: Time.now) }
Level Broker · cross-service Events must reach another service.
Complexity
high (ops)
Setup
Kafka/broker · consumers · schema registry · partitioning
Maintenance
broker ops · consumer lag · ordering · replay

What it captures

Durable delivery to consumers outside your process.

Why it works

Decouples services through an event stream; supports replay and many independent consumers.

The pains / hard limits

Operational weight: a broker to run, ordering and lag to reason about, schema evolution across teams.

What people do to avoid it

Services poll each other's databases or share one DB.

Result: Tight coupling, brittle integrations, no replay — a distributed monolith.

How to build it

1. Outbox relays to the broker

producer.produce(topic: "deals", key: deal_id, payload: event.to_json)
Level Telemetry sink · fire-and-forget High-volume anonymous signals (page views, clicks).
Complexity
medium
Setup
collector endpoint · batching · columnar sink · pre-aggregation
Maintenance
the ingest pipeline + columnar store

What it captures

Pre-identity, high-frequency web signals feeding top-of-funnel marketing stats.

Why it works

Built for volume the outbox could never carry. Lossy by design — one page view is worthless, the aggregate is everything. Stitches to business events at lead capture.

The pains / hard limits

A whole second store to operate. Never your system of record. Must not share the business-events table.

What people do to avoid it

Stuff page views into the same events/outbox table.

Result: Outbox and DB overwhelmed; billions of near-worthless rows drown your business events; retention becomes a nightmare.

How to build it

1. Thin collector: accept, validate, enqueue, 202

# POST /collect
def collect
  validate!(params)        # against the shared schema
  Buffer.push(params)      # fire-and-forget
  head :accepted           # 202, no inline work
end

2. Batch-flush to the columnar store, pre-aggregate

# worker drains Buffer -> ClickHouse/lake in batches
# pre-roll sessions, landing-page CVR, A/B variant lift on ingest

3. Stitch to a business event at conversion

# form_submitted -> lead_created, freezing utm/variant/anonymous_id
emit :lead_created, source_id:, variant_id:, anonymous_id:

Which combinations are real

Tier × event level — not every pairing is valid.

L0 · lossy sink L1–2 · in-proc L3 · outbox L4 · broker
Tier 1–3 N/A N/A N/A N/A
Tier 4 · insight ✓ telemetry ✓ ok ✓ ideal ✓ multi-svc
Tier 4 · money ✗ wrong src ⚠ dangerous ✓ required
Tier 5 · sourcing N/A ✗ contradiction ✓ floor
  • ⚠ Tier 4 on lossy collection silently undercounts rollups with no way to detect or recompute — fine for a directional glance, unacceptable for anything tied to money.
  • ✗ Tier 5 on lossy is a contradiction: you can't make events the source of truth if the pipe drops them.