Data Layer & Databases


1. Key Theoretical Foundations

CAP Theorem

Tradeoffs:


ACID Transactions

Example (Bank Transfer):

begin transaction
    debit Alice $100
    credit Bob $100
commit

If the system crashes mid-way, rollback ensures no partial transfer.


BASE Model (common in NoSQL)


2. Database Types & Design Choices

Type Example Systems Strengths Limitations When to Use
Relational (SQL) PostgreSQL, MySQL, Oracle Strong consistency, ACID, rich queries (SQL, joins) Harder to scale horizontally, schema rigidity Banking, ERP, analytics with strong correctness
Key-Value Redis, DynamoDB, Riak Very fast lookups, simple model No joins, limited query flexibility Caching, session stores, user profiles
Document MongoDB, Couchbase Flexible schema, hierarchical docs Joins are weak, eventual consistency common CMS, product catalogs, flexible JSON data
Columnar Cassandra, HBase, Bigtable Wide-column storage, great for writes and time series Complex setup, weaker joins Logging, time series, IoT data
Graph Neo4j, ArangoDB Optimized for relationships (edges) Slower for non-graph workloads Social networks, recommendation engines

3. Core Database Concepts

Indexing

Normalization vs Denormalization

Scaling Approaches


4. Patterns in System Design Interviews


Type Example Systems CAP Posture (typical) Strengths Limitations Common Uses
Relational (SQL) PostgreSQL, MySQL (single-node) CA* Strong consistency on a single node, ACID, rich SQL/joins Vertical scaling limits; schema rigidity OLTP, finance, ERP, analytics where correctness matters
Relational (clustered/replicated) Postgres w/ sync replicas, Galera/MySQL Group Replication CP (sync quorum) / AP-ish (async replicas) ACID with high correctness; can survive node loss with quorum May refuse writes under partition (CP); async replicas risk stale reads High-integrity systems needing HA
Key–Value DynamoDB AP (tunable) Massive scale, predictable latency; eventual or strong reads selectable Limited querying/joins; modeling discipline needed Caching, user profiles, session stores
  Redis (standalone) CA* Extremely fast, simple; great cache Single-node unless clustered; no durability by default Caching, rate limiting, queues
  Redis Cluster AP (tunable) Scales horizontally, high throughput Eventual consistency windows; key-hash slotting constraints Distributed cache, pub/sub
Document MongoDB (replica set) CP (tunable) Flexible schema (JSON/BSON), rich queries & indexes Cross-document transactions newer/limited by design Content mgmt, product catalogs
Wide-Column Cassandra AP Writes-first design, linear scalability, multi-DC friendly Eventual consistency; complex data modeling Time-series, logging, large-scale writes
  HBase / Bigtable CP Strong consistency, huge tables, range scans Operationally heavier; limited ad-hoc queries Analytics backends, large KV/range workloads
Search Elasticsearch, OpenSearch AP (tunable) Full-text search, aggregations, near-real-time Eventual consistency; not a primary source of truth Search, logs, analytics
Graph Neo4j, ArangoDB CA* (single-node) / CP (cluster) Efficient traversals, relationship-heavy queries Not ideal for wide scans/OLAP Social graphs, recommendations

* CA is only meaningful when there’s effectively no partition (e.g., single-node or same-box). In real distributed settings you must pick between CP and AP under partitions. Tunable = posture can be adjusted (e.g., quorum reads/writes, read/write concerns).

If you want, I can drop this directly into the Data Layer section and keep going with indexing strategies, normalization vs. denormalization, replication vs. sharding, and caching—all in the same reference style.

Indexing Strategies

What is an index?

A secondary data structure that lets the database find rows without scanning the whole table.

Common Index Types

Index Best For Notes
B-Tree Range scans, ordering (BETWEEN, <, >, ORDER BY) Default in most RDBMS; balanced tree keeps O(log n) lookups
Hash Exact matches (=) No range scans; O(1) average lookups
GIN / GIST Full-text, arrays, geo Postgres: GIN for inverted (contains), GiST for spatial/nearest
Bitmap Low-cardinality columns Often used in data warehouses for analytics
Covering Index Query can be answered by the index alone Add included columns so base table lookup is avoided

General Rules

SQL Snippets (PostgreSQL-flavored)

-- B-Tree (default)
CREATE INDEX idx_users_email ON users(email);

-- Composite index (equality, equality, then range)
CREATE INDEX idx_orders_user_date ON orders(user_id, status, created_at);

-- Partial index (only for active rows)
CREATE INDEX idx_active_users_email ON users(email) WHERE is_active = true;

-- Covering index (include extra columns)
CREATE INDEX idx_orders_lookup ON orders(order_id) INCLUDE (total_amount, status);

-- Full-text (GIN)
CREATE INDEX idx_docs_ft ON docs USING GIN (to_tsvector('english', body));

Gotchas


Normalization vs Denormalization

Normalization (3NF+)

Denormalization

When to Use

Pattern: Start normalized → measure → denormalize targeted hot paths.


Replication vs Sharding

Replication (Copy the same data to multiple nodes)

Terms

Example Read Scaling

Sharding (Horizontal partitioning; split data across nodes)

When to Shard

Cross-shard Challenges


Caching (and Layers)

Why Cache?

Reduce latency and database load by serving hot data from memory.

Types of Caches

Layer Example Pros Cons
Client-side Browser cache Zero network hops Stale control limited
CDN/Edge CloudFront, Fastly Global low-latency Purge/invalidation needed
Application cache In-process LRU Ultra fast Memory pressure; not shared
Distributed cache Redis, Memcached Shared across services Network hop; consistency risk
Database cache Buffer pool, query cache Transparent DB-specific behavior

Caching Strategies

Redis Patterns

# Cache-aside (pseudo/Python)
value = redis.get(key)
if value is None:
    value = db.read(key)
    redis.setex(key, ttl_seconds, serialize(value))
return value

Invalidation

Gotchas


Putting It Together — Quick Decision Guide


If you want, I can now fold these sections into your master document and then move on to the DevOps & Cloud chapter (Terraform/IaC with CLI & examples, AWS/Azure/GCP core services, CI/CD including Jenkins, Observability with CloudWatch/Splunk/New Relic, Chaos Engineering tools, Containers beyond Docker, Security/HIPAA).