Book: [https://drive.google.com/file/d/1SoKNd7g_h92FWd-cOjlwYqvv2HvfMWkE/view?usp=sharing Designing Data-Intensive Applications, 2nd Edition]
= The core mental model =
* '''Goals first:''' optimize for '''Reliability''', '''Scalability''', and '''Maintainability'''. Make the trade-offs explicit with concrete '''SLIs/SLOs''' (p99 latency, throughput, error rate, RPO/RTO); an illustrative spec is sketched below.
* '''Workload-driven design:''' let '''access patterns''' and '''failure modes''' drive every choice (data model, storage engine, replication, partitioning).
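As a concrete starting point, a minimal sketch of an SLO spec as plain data; the service names and targets are invented for illustration, not taken from the book.

<syntaxhighlight lang="python">
# Illustrative SLO spec for a hypothetical read API; all numbers are made up.
SLOS = {
    "read-api": {
        "latency_p99_ms": 250,     # SLI: 99th-percentile request latency
        "availability": 0.999,     # SLI: fraction of successful requests
        "error_rate_max": 0.001,   # SLI: 5xx responses / total
    },
    "ingest-pipeline": {
        "freshness_s": 60,         # SLI: event-time lag at the sink
        "rpo_s": 300,              # max tolerable data loss on failover
        "rto_s": 900,              # max tolerable downtime on failover
    },
}

def error_budget(slo_availability: float, window_requests: int) -> int:
    """How many failed requests the SLO tolerates over a window."""
    return int((1.0 - slo_availability) * window_requests)

print(error_budget(SLOS["read-api"]["availability"], 10_000_000))  # -> 10000
</syntaxhighlight>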
= Data models & schemas =
* Pick models by the shape of data + queries: '''relational (SQL)''', '''document''', '''wide-column''', '''graph'''.
* Treat '''schema''' as a contract: enforce '''compatibility''' (backward/forward), version messages (Avro/Protobuf/JSON-Schema), and keep a '''Schema Registry''' (Avro sketch below).
* Prefer '''evolution''' over “schema-less”: you always have a schema—make it explicit.
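A minimal sketch of backward-compatible evolution, assuming the <code>fastavro</code> package; the <code>User</code> record and its fields are invented. The key rule: a field added on the reader side must carry a default so data written under the old schema still decodes.

<syntaxhighlight lang="python">
# Sketch of backward-compatible schema evolution with Avro (fastavro).
import io
import fastavro

writer_schema = fastavro.parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
    ],
})
reader_schema = fastavro.parse_schema({
    "type": "record", "name": "User", "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": "string"},
        # New field: MUST have a default to stay backward compatible.
        {"name": "plan", "type": "string", "default": "free"},
    ],
})

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"id": 1, "email": "a@example.com"}])
buf.seek(0)
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'id': 1, 'email': 'a@example.com', 'plan': 'free'}
</syntaxhighlight>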
= Storage engines & indexing =
* '''B-Tree vs LSM-Tree''' (toy LSM sketch below):
** B-Trees → fast point reads/updates, steadier read amplification.
** LSMs → high write throughput via '''SSTables''' + '''compaction'''; mind '''write-amp/read-amp''', '''Bloom filters''', compaction tuning.
* Know your '''primary''' and '''secondary indexes''' (and their costs) and when to denormalize.
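A toy sketch of the LSM idea only, not any real engine's code: writes land in a memtable, flush to immutable sorted runs, reads probe newest-first, and compaction merges runs to bound read amplification.

<syntaxhighlight lang="python">
# Toy LSM-tree: mutable memtable -> immutable sorted runs -> compaction.
# Illustrative only; real engines add WALs, Bloom filters, levels, etc.

class ToyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest last
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value                # sequential-write friendly
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        self.sstables.append(sorted(self.memtable.items()))  # immutable run
        self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:                  # newest data wins
            return self.memtable[key]
        for run in reversed(self.sstables):       # then newest run first
            for k, v in run:                      # real code: binary search
                if k == key:
                    return v                      # read-amp: one probe per run
        return None

    def compact(self) -> None:
        """Merge all runs, keeping only the newest value per key."""
        merged: dict[str, str] = {}
        for run in self.sstables:                 # oldest to newest
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
db.compact()
print(db.get("k3"))  # -> v3
</syntaxhighlight>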
= Replication: why, how, and trade-offs =
* '''Leader-based''' (single leader, followers), '''multi-leader''', '''leaderless (quorums)'''—each shifts the balance of '''consistency''', '''availability''', and '''latency'''.
* User-visible guarantees to design for: '''read-your-writes''', '''monotonic reads''', '''causal consistency'''.
* Conflict handling: last-write-wins is easy but lossy; consider CRDTs or app-level merge (G-counter sketch below).
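A minimal sketch of a G-counter, the simplest CRDT alternative to last-write-wins for counters: each replica increments only its own slot, and merge takes per-replica maxima, so no concurrent increment is lost.

<syntaxhighlight lang="python">
# G-counter CRDT: concurrent increments on different replicas converge
# under merge, where last-write-wins would silently drop one side.

class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(3)            # concurrent updates on different replicas
b.increment(2)
a.merge(b); b.merge(a)    # anti-entropy exchange, any order, any repetition
assert a.value() == b.value() == 5   # LWW would have kept only one side
</syntaxhighlight>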
= Partitioning (sharding) =
* Choose '''partition keys''' to spread load and enable locality; beware '''hot keys''' and '''skew'''.
* Plan '''rebalancing''' (consistent hashing, power-of-two choices) and the impact on secondary indexes and joins (ring sketch below).
* Co-partition data that must be joined frequently.
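A minimal consistent-hashing sketch with virtual nodes (node names invented): adding or removing a node moves only the keys whose nearest ring points change, instead of reshuffling everything.

<syntaxhighlight lang="python">
# Consistent-hash ring with virtual nodes: each physical node owns many
# points on the ring, smoothing load and limiting movement on rebalance.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.ring: list[tuple[int, str]] = []
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str, vnodes: int = 100) -> None:
        for i in range(vnodes):                       # one point per vnode
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        h = self._hash(key)
        points = [p for p, _ in self.ring]
        i = bisect.bisect_right(points, h) % len(self.ring)  # wrap around
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))   # stable unless nearby vnodes change
</syntaxhighlight>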
= Transactions & correctness =
* '''ACID''' is a toolbox, not a religion. Pick the '''isolation level''' that prevents the anomalies you actually care about: '''serializable''' > '''snapshot isolation''' > '''read committed''', etc.
* Recognize anomalies (write skew, lost update) and when to use '''explicit locks''', '''unique constraints''', or '''serializable''' (compare-and-set sketch below).
* In distributed systems, “'''exactly-once'''” usually means '''idempotent''' processing + '''deduplication''' at boundaries.
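A minimal sketch of preventing the lost-update anomaly with optimistic concurrency control, using stdlib <code>sqlite3</code>; the <code>account</code> table and the withdrawal scenario are invented.

<syntaxhighlight lang="python">
# Lost-update prevention via compare-and-set: the UPDATE only applies if
# the row is still at the version we read; 0 rows updated means conflict.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE account (id INTEGER PRIMARY KEY,"
           " balance INTEGER, version INTEGER)")
db.execute("INSERT INTO account VALUES (1, 100, 0)")

def withdraw(conn: sqlite3.Connection, account_id: int, amount: int) -> bool:
    balance, version = conn.execute(
        "SELECT balance, version FROM account WHERE id = ?", (account_id,)
    ).fetchone()
    if balance < amount:
        return False
    cur = conn.execute(
        # Compare-and-set: no-op if another writer committed in between.
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (balance - amount, account_id, version),
    )
    return cur.rowcount == 1   # 0 rows -> conflict: retry the read-modify-write

print(withdraw(db, 1, 30))   # True: balance 100 -> 70
print(withdraw(db, 1, 90))   # False: insufficient funds after the first call
</syntaxhighlight>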
= Distributed systems realities =
* Clocks drift; wall-clock time lies. Use '''logical clocks''', '''vector clocks''', or '''watermarks''' for ordering (Lamport clock sketch below).
* '''Consensus''' (Raft/Paxos) is for small, critical state (membership, metadata), not for your hot data path.
* Treat '''network partitions''' and '''partial failures''' as normal; design degraded modes.
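A minimal Lamport-clock sketch: a counter that ticks on local events and fast-forwards on receive, giving an ordering consistent with causality without trusting wall clocks.

<syntaxhighlight lang="python">
# Lamport logical clock: if event x causally precedes event y, then
# timestamp(x) < timestamp(y), regardless of each machine's wall clock.

class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or send: advance and stamp."""
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        """Merge the sender's stamp: never go backwards."""
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()            # A sends a message stamped 1
t_recv = b.receive(t_send)   # B's clock jumps to 2: receive is "after" send
assert t_recv > t_send       # holds even if B's wall clock lags far behind
</syntaxhighlight>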
= Batch, streaming, and the dataflow unification =
* '''Batch (e.g., Spark)''': great for large recomputations and backfills.
* '''Streaming (e.g., Flink/Kafka Streams)''': treats batch as a special case; focus on '''event time''', '''watermarks''', '''stateful operators''', '''backpressure''', and '''checkpointing''' (windowing sketch below).
* Clarify delivery semantics: '''at-most-once''', '''at-least-once''', '''effectively-once (exactly-once processing semantics)'''—and still make writes '''idempotent'''.
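A minimal sketch of event-time tumbling windows with a watermark, here simply the max event time seen minus an allowed lateness; the window size and lateness values are illustrative.

<syntaxhighlight lang="python">
# Event-time windowing: events are bucketed by their OWN timestamps, and a
# window is emitted only once the watermark has passed its end, tolerating
# out-of-order arrival up to LATENESS_S.
from collections import defaultdict

WINDOW_S = 60       # tumbling 1-minute windows
LATENESS_S = 10     # how long we wait for stragglers

windows: dict[int, int] = defaultdict(int)   # window start -> event count
max_event_time = 0

def on_event(event_time: int) -> list[tuple[int, int]]:
    """Ingest one event; return any windows the watermark just closed."""
    global max_event_time
    windows[event_time - event_time % WINDOW_S] += 1
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - LATENESS_S
    closed = [(start, n) for start, n in windows.items()
              if start + WINDOW_S <= watermark]
    for start, _ in closed:
        del windows[start]       # real systems checkpoint this state
    return closed

for t in [5, 30, 61, 62, 75]:
    emitted = on_event(t)
    if emitted:
        print(f"t={t}: closed {emitted}")   # t=75 closes window [0,60): 2 events
</syntaxhighlight>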
= Data integration & change propagation =
* Prefer '''Change Data Capture (CDC)''' over app-level dual writes to avoid inconsistencies (outbox sketch below).
* Use '''event logs''' as the source of truth; consider '''event sourcing''' when auditability and replay matter.
* Establish '''lineage''' and '''data contracts''' to prevent “silent breakage” between teams.
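A minimal sketch of the transactional-outbox pattern, one common way to avoid the dual-write race: the business row and the outgoing event commit in one local transaction, and a relay (or CDC on the outbox table) publishes afterwards. Uses stdlib <code>sqlite3</code>; the table names are invented.

<syntaxhighlight lang="python">
# Transactional outbox: "update DB, then separately publish to the broker"
# can lose either side on a crash; committing both rows atomically cannot.
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int) -> None:
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced", "order_id": order_id}),))

def relay_once(publish) -> None:
    """Poll unpublished events in commit order and hand them to a broker."""
    rows = db.execute("SELECT id, payload FROM outbox "
                      "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(payload)   # at-least-once: consumers must be idempotent
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(42)
relay_once(print)   # -> {"type": "OrderPlaced", "order_id": 42}
</syntaxhighlight>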
= Security, governance, and cost =
* Basics that bite: '''encryption at rest/in transit''', '''least-privilege access''', key rotation (rotation sketch below).
* Build for '''privacy/compliance''' (e.g., data minimization, retention, erasure workflows) from day one.
* In the cloud, model '''cost''' as a constraint: storage tiers, egress, compaction, checkpoint size, cross-region traffic.
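A minimal key-rotation sketch, assuming the <code>cryptography</code> package: <code>MultiFernet</code> encrypts with the newest key while still decrypting and re-encrypting tokens written under older keys, so rotation needs no flag day.

<syntaxhighlight lang="python">
# Key rotation with MultiFernet: new writes use the newest key; old
# ciphertexts stay readable until they are lazily or batch re-encrypted.
from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet(Fernet.generate_key())
token = old_key.encrypt(b"secret-record")     # data at rest, old key

new_key = Fernet(Fernet.generate_key())
keyring = MultiFernet([new_key, old_key])     # newest key listed first

rotated = keyring.rotate(token)               # re-encrypt under new_key
assert keyring.decrypt(rotated) == b"secret-record"
# Once every stored token has been rotated, old_key can be retired.
</syntaxhighlight>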
= Operability (what senior engineers sweat) =
* '''Observability:''' RED/USE metrics, distributed tracing, slow-query visibility, checkpoint and compaction dashboards.
* '''Failure drills:''' test quorum loss, lag catch-up, index rebuilds, backfills, replays.
* '''Runbooks''' and '''SLO-based alerts''' (actionable, with a clear owner and playbook); a burn-rate sketch follows below.
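A minimal burn-rate sketch for SLO-based alerting; the thresholds are illustrative (the multiwindow variants in Google's SRE material are the usual reference).

<syntaxhighlight lang="python">
# SLO burn rate: how many times faster than "budget-neutral" errors are
# consuming the error budget. A burn rate of 1 exactly exhausts the budget
# over the full SLO window; sustained high multiples warrant a page.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

# 0.6% errors over the last hour against a 99.9% SLO:
rate = burn_rate(error_ratio=0.006, slo_target=0.999)
print(f"burn rate: {rate:.0f}x")       # -> 6x
if rate >= 6:                          # illustrative paging threshold
    print("PAGE: error budget burning too fast")
</syntaxhighlight>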
= Common mistakes to avoid =
* Designing for theoretical '''consistency''' when your SLOs demand '''availability/latency''' (or vice versa).
* Ignoring '''schema evolution''' and '''compatibility'''—most “outages” in data land are contract breaks.
* Blind sharding without thinking about '''co-location''' for joins/aggregations.
* Trusting “'''exactly-once'''” to save you—skip the magic, implement '''idempotence'''.
* Underestimating '''state size''' in streaming (checkpoint bloat, long recovery times).
= Quick “day-two” checklist =
* Define SLIs/SLOs (latency, throughput, durability, freshness).
* Pick the '''data model''' and '''storage engine''' for your read/write mix.
* Decide the '''replication''' mode and the '''consistency''' guarantees users will feel.
* Choose '''partition keys''' and a '''rebalance''' strategy.
* Lock in '''schema versioning''' rules and a '''registry'''.
* Design '''idempotent''' endpoints and '''dedupe keys''' for pipelines (sketched after this list).
* Add '''observability''', '''lineage''', '''access control''', and '''retention''' before launch.
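A minimal sketch of an idempotent endpoint keyed by a client-supplied dedupe key (all names invented); a real service would back the key table with a database unique constraint plus a TTL, and store the key atomically with the side effect.

<syntaxhighlight lang="python">
# Idempotent endpoint: retries with the same idempotency key replay the
# stored outcome instead of applying the side effect twice.

_results: dict[str, dict] = {}   # idempotency key -> stored response

def charge(idempotency_key: str, amount_cents: int) -> dict:
    if idempotency_key in _results:          # duplicate delivery or retry
        return _results[idempotency_key]     # replay outcome, no new charge
    response = {"charged": amount_cents, "status": "ok"}  # the side effect
    _results[idempotency_key] = response     # must commit with the effect!
    return response

first = charge("req-abc", 500)
retry = charge("req-abc", 500)   # network retry carries the same key
assert first is retry            # effect applied exactly once
</syntaxhighlight>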