Designing Data-Intensive Applications, 2nd Edition

The core mental model

  • Goals first: optimize for Reliability, Scalability, and Maintainability. Make the trade-offs explicit with concrete SLIs/SLOs (p99 latency, throughput, error rate, RPO/RTO).
  • Workload-driven design: let access patterns and failure modes drive every choice (data model, storage engine, replication, partitioning).

Data models & schemas

  • Pick models by shape of data + queries: relational (SQL), document, wide-column, graph.
  • Treat schema as a contract: enforce compatibility (backward/forward), version messages (Avro/Protobuf/JSON-Schema), and keep a Schema Registry (see the sketch after this list).
  • Prefer evolution over “schema-less”: you always have a schema—make it explicit.
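
To make the compatibility rules concrete, here is a minimal sketch assuming JSON-encoded messages (the field names and defaults are invented for illustration). A schema registry with Avro or Protobuf would enforce these rules automatically; the point is that every added field needs a default, and readers must tolerate fields they do not know.

  import json

  # Reader schema: field -> default. Fields added later must carry defaults so
  # that old messages (which lack them) still decode: backward compatibility.
  READER_SCHEMA = {"user_id": None, "email": None, "plan": "free"}  # "plan" added in v2

  def decode(raw: bytes) -> dict:
      msg = json.loads(raw)
      # Keep only known fields (forward compatibility: unknown fields from newer
      # writers are ignored) and fill defaults for missing ones (backward).
      return {field: msg.get(field, default) for field, default in READER_SCHEMA.items()}

  # A v1 message (no "plan") and a v3 message (extra "referrer") both decode.
  print(decode(b'{"user_id": 1, "email": "a@b.c"}'))
  print(decode(b'{"user_id": 2, "email": "d@e.f", "plan": "pro", "referrer": "ad"}'))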

Storage engines & indexing

  • B-Tree vs LSM-Tree:
    • B-Trees → fast point reads/updates, steadier read amp.
    • LSMs → high write throughput via SSTables + compaction; mind write-amp/read-amp, Bloom filters, compaction tuning (see the sketch after this list).
  • Know your primary and secondary indexes (and their costs) and when to denormalize.
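
The B-Tree/LSM trade-off is easier to see in a toy sketch (everything here is illustrative, not how a real engine is written): writes are cheap because they hit an in-memory memtable and flush sequentially, while reads may have to check several sorted runs, which is the read amplification that Bloom filters and compaction exist to contain.

  # A toy LSM-tree: writes go to an in-memory memtable; when it fills, it is
  # flushed as an immutable sorted run (an SSTable stand-in). Reads check the
  # memtable, then runs from newest to oldest. Compaction and Bloom filters
  # are omitted for brevity.
  class ToyLSM:
      def __init__(self, memtable_limit=4):
          self.memtable = {}
          self.runs = []            # each run is a sorted list of (key, value); newest last
          self.memtable_limit = memtable_limit

      def put(self, key, value):
          self.memtable[key] = value                          # cheap in-memory update
          if len(self.memtable) >= self.memtable_limit:
              self.runs.append(sorted(self.memtable.items())) # sequential flush
              self.memtable = {}

      def get(self, key):
          if key in self.memtable:
              return self.memtable[key]
          for run in reversed(self.runs):   # newer runs shadow older ones: read amplification
              for k, v in run:              # real SSTables use binary search + Bloom filters
                  if k == key:
                      return v
          return None

  db = ToyLSM()
  for i in range(10):
      db.put(f"k{i}", i)
  print(db.get("k3"), db.get("k9"))         # 3 9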

Replication: why, how, and trade-offs

  • Leader-based (single leader, followers), multi-leader, leaderless (quorums)—each shifts the balance of consistency, availability, and latency.
  • User-visible guarantees to design for: read-your-writes, monotonic reads, causal consistency.
  • Conflict handling: last-write-wins is easy but lossy; consider CRDTs or app-level merge.
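
As a concrete alternative to last-write-wins, here is a minimal grow-only counter CRDT (replica names are illustrative): each replica increments only its own slot, and merging takes per-slot maxima, so concurrent increments from different replicas converge without losing updates.

  # A minimal G-Counter CRDT. Merge is commutative, associative, and idempotent,
  # so replicas can exchange state in any order and still converge.
  class GCounter:
      def __init__(self, replica_id):
          self.replica_id = replica_id
          self.counts = {}          # replica_id -> count

      def increment(self, n=1):
          self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

      def value(self):
          return sum(self.counts.values())

      def merge(self, other):
          # Element-wise max: no increment is ever lost, unlike last-write-wins.
          for rid, c in other.counts.items():
              self.counts[rid] = max(self.counts.get(rid, 0), c)

  a, b = GCounter("a"), GCounter("b")
  a.increment(3); b.increment(2)    # concurrent updates on different replicas
  a.merge(b); b.merge(a)
  assert a.value() == b.value() == 5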

Partitioning (sharding)

  • Choose partition keys to spread load and enable locality; beware hot keys and skew.
  • Plan rebalancing (e.g., consistent hashing with virtual nodes, or a fixed number of partitions that move between nodes) and its impact on secondary indexes and joins; a hash-ring sketch follows this list.
  • Co-partition data that must be joined frequently.
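
A minimal consistent-hash ring with virtual nodes, as one rebalancing approach (node names and the vnode count are illustrative): adding or removing a node remaps only the keys on the affected arcs rather than reshuffling the whole keyspace.

  import bisect, hashlib

  def _hash(s: str) -> int:
      return int(hashlib.md5(s.encode()).hexdigest(), 16)

  class HashRing:
      def __init__(self, nodes, vnodes=64):
          # Each physical node gets many points on the ring to smooth out skew.
          self.ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
          self.points = [h for h, _ in self.ring]

      def node_for(self, key: str) -> str:
          # Walk clockwise to the first vnode at or after the key's hash.
          i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
          return self.ring[i][1]

  ring = HashRing(["node-a", "node-b", "node-c"])
  print(ring.node_for("user:42"), ring.node_for("user:43"))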

Transactions & correctness

  • ACID is a toolbox, not a religion. Pick the isolation level that prevents the anomalies you actually care about: serializable > snapshot isolation > read committed, etc.
  • Recognize anomalies (write skew, lost updates) and know when to reach for explicit locks, unique constraints, or serializable isolation.
  • In distributed systems, “exactly-once” usually means idempotent processing + deduplication at boundaries.
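
A minimal sketch of idempotent processing plus deduplication (names are illustrative; a real system would persist the dedupe set and commit it atomically with the side effect, e.g., via a unique constraint in the same transaction):

  processed_ids = set()   # stand-in for a persistent store with a unique constraint
  balance = {"alice": 0}

  def apply_deposit(msg_id: str, account: str, amount: int):
      if msg_id in processed_ids:      # duplicate delivery (at-least-once): skip
          return
      balance[account] += amount       # the side effect
      processed_ids.add(msg_id)        # must commit atomically with the effect

  apply_deposit("m-1", "alice", 100)
  apply_deposit("m-1", "alice", 100)   # redelivery is harmless
  assert balance["alice"] == 100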

Distributed systems realities

  • Clocks drift; wall-clock time lies. Use logical clocks, vector clocks, or watermarks for ordering (see the sketch after this list).
  • Consensus (Raft/Paxos) is for small, critical state (membership, metadata), not for your hot data path.
  • Treat network partitions and partial failures as normal; design degraded modes.
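
A minimal Lamport logical clock as one way to order events without trusting wall clocks: every local event ticks the counter, and every receive advances it past the sender's stamp, so causally related events are always ordered.

  class LamportClock:
      def __init__(self):
          self.time = 0

      def tick(self):                  # local event
          self.time += 1
          return self.time

      def send(self):                  # stamp an outgoing message
          return self.tick()

      def receive(self, msg_time):     # merge the sender's stamp on delivery
          self.time = max(self.time, msg_time) + 1
          return self.time

  a, b = LamportClock(), LamportClock()
  t = a.send()                         # a sends at logical time 1
  print(b.receive(t))                  # b's clock jumps past the stamp -> 2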

Batch, streaming, and the dataflow unification

  • Batch (e.g., Spark): great for large recomputations and backfills.
  • Streaming (e.g., Flink/Kafka Streams): treats batch as a special case; focus on event time, watermarks, stateful operators, backpressure, and checkpointing (see the sketch after this list).
  • Clarify delivery semantics: at-most-once, at-least-once, effectively-once (exactly-once processing semantics)—and still make writes idempotent.
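
A minimal event-time sketch of tumbling windows with a watermark (window size and allowed lateness are illustrative): the watermark trails the maximum event time seen, windows are finalized once the watermark passes their end, and anything arriving later for a closed window is dropped or would need separate late-data handling.

  WINDOW, LATENESS = 10, 2
  windows, watermark, max_event_time = {}, 0, 0

  def on_event(event_time: int):
      global watermark, max_event_time
      if event_time <= watermark:               # window already finalized: late data
          print(f"dropped late event at t={event_time}")
          return
      start = (event_time // WINDOW) * WINDOW   # tumbling window assignment
      windows[start] = windows.get(start, 0) + 1
      max_event_time = max(max_event_time, event_time)
      watermark = max_event_time - LATENESS     # watermark trails max event time
      for s in [s for s in windows if s + WINDOW <= watermark]:
          print(f"window [{s},{s+WINDOW}) -> {windows.pop(s)} events")

  for t in [1, 4, 12, 3, 15, 27]:               # out-of-order event times
      on_event(t)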

Data integration & change propagation

  • Prefer Change Data Capture (CDC) over app-level dual writes to avoid inconsistencies (see the sketch after this list).
  • Use event logs as the source of truth; consider event sourcing when auditability and replay matter.
  • Establish lineage and data contracts to prevent “silent breakage” between teams.
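
A minimal sketch of applying a CDC stream downstream (field names and the log format are illustrative): because each change carries the full row state plus a log position, applying the log in order is idempotent, and replaying from a checkpoint after a crash is safe.

  # Ordered change events from the source database's log, applied as
  # upserts/deletes to a downstream copy.
  changelog = [
      {"lsn": 1, "op": "insert", "key": "u1", "row": {"name": "Ada", "plan": "free"}},
      {"lsn": 2, "op": "update", "key": "u1", "row": {"name": "Ada", "plan": "pro"}},
      {"lsn": 3, "op": "delete", "key": "u1", "row": None},
  ]

  replica, applied_lsn = {}, 0

  def apply(event):
      global applied_lsn
      if event["lsn"] <= applied_lsn:   # replayed event: safe to skip
          return
      if event["op"] == "delete":
          replica.pop(event["key"], None)
      else:                             # insert and update are the same upsert
          replica[event["key"]] = event["row"]
      applied_lsn = event["lsn"]

  for e in changelog:
      apply(e)
  print(replica, applied_lsn)           # {} 3 (u1 was deleted)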

Security, governance, and cost

  • Basics that bite: encryption at rest/in transit, least-privilege access, key rotation.
  • Build for privacy/compliance (e.g., data minimization, retention, erasure workflows) from day one.
  • In cloud, model cost as a constraint: storage tiers, egress, compaction, checkpoint size, cross-region traffic.

Operability (what senior engineers sweat)

  • Observability: RED/USE metrics, distributed tracing, slow-query visibility, checkpoint and compaction dashboards.
  • Failure drills: test quorum loss, lag catch-up, index rebuilds, backfills, replays.
  • Runbooks and SLO-based alerts (actionable, with clear owner and playbook).

Common mistakes to avoid

  • Designing for theoretical consistency when your SLOs demand availability/latency (or vice versa).
  • Ignoring schema evolution and compatibility—most “outages” in data land are contract breaks.
  • Blind sharding without thinking about co-location for joins/aggregations.
  • Trusting “exactly-once” to save you—skip the magic, implement idempotence.
  • Underestimating state size in streaming (checkpoint bloat, long recovery times).

Quick “day-two” checklist

  • Define SLIs/SLOs (latency, throughput, durability, freshness).
  • Pick the data model and storage engine for your read/write mix.
  • Decide replication mode and consistency guarantees users will feel.
  • Choose partition keys and a rebalance strategy.
  • Lock in schema versioning rules and a registry.
  • Design idempotent endpoints and dedupe keys for pipelines.
  • Add observability, lineage, access control, and retention before launch.