Design a Calendar and Scheduling Service

TL;DR

Build Google Calendar -- a system where users create events, schedule meetings with others, and share calendars. The deceptively hard part is recurring events. "Team standup every weekday at 9 AM" is a single RRULE that expands to ~260 events per year. Do you store 260 rows or 1 row? Google stores the rule and expands on read -- this makes writes trivial but reads expensive. Editing recurring events introduces the three-way edit problem: "this instance only," "this and all future," or "all instances" -- each requires a different storage strategy. Then there are the edge cases that haunt every calendar engineer: February 29th (what happens to a monthly event on the 31st?), DST transitions (a 2:30 AM event during spring-forward does not exist), and timezone changes when the user travels. Conflict detection for meeting room booking uses an interval tree to find overlapping reservations in O(log N + K) time. CalDAV sync tokens enable efficient incremental sync between client and server.

The System

Google Calendar. A user creates a recurring team standup at 9 AM every weekday. Their manager can see it on a shared calendar. The system suggests a meeting time when all attendees are free (free/busy lookup). Conference rooms are bookable resources with capacity limits. The calendar syncs across phone, laptop, and web -- changes on one device appear on all others within seconds.

Why is this hard? Because time is hard. Seriously. Time zones, daylight saving transitions, recurring event expansion, timezone-aware conflict detection, and multi-device sync are each individually tricky. Combined, they create an astonishing number of edge cases. Google Calendar has had DST-related bugs reported every single year since launch. Microsoft Outlook's recurring event implementation has 40+ known edge cases documented in their RFC 5545 conformance notes. If you think this is a simple CRUD application, you have not thought about what happens to "monthly meeting on the 31st" in February.

Requirements

Functional Requirements

Requirement	Details
Create/edit/delete events	Single events and recurring events (daily, weekly, monthly, yearly, custom).
Recurring event editing	Edit this instance, this and all future, or all instances.
Attendee management	Invite users. Track RSVP status (accepted, declined, tentative).
Free/busy lookup	Given a set of users and a time range, find slots where all are free.
Meeting room booking	Reserve rooms as resources. Rooms have capacity. Prevent double-booking.
Calendar sharing	Share calendars with read or read-write access.
Multi-device sync	Changes sync to all devices within 5 seconds.
Reminders/notifications	Push notification N minutes before event start.

Non-Functional Requirements

Requirement	Target
Event creation latency	< 200 ms
Calendar view load (month view)	< 500 ms
Free/busy query	< 300 ms for 10 attendees over 5-day range
Sync latency	< 5 seconds to all devices
Availability	99.99% (calendar is mission-critical for business users)
Scale	500M users, 2B events, 50K event writes/sec

Back-of-Envelope Math

Total users:                 500 million (Google Calendar's approximate user base)
Daily active users:          200 million
Events per user:             ~20 per week (including recurring expansions)
Events per user (stored):    ~5 per week (recurring rules, not individual instances)

Event writes per day:        200M users * 2 event changes/day = 400M writes/day
Event writes per second:     ~4,600 avg, ~15,000 peak (Monday mornings)

Calendar view reads:
  Each user checks calendar ~10 times/day
  Reads per day:             200M * 10 = 2 billion
  Reads per second:          ~23,000 avg, ~70,000 peak

Recurring event expansion:
  Avg recurring events per user: 10
  Avg expansions per view:       10 events * 30 days = 300 instance calculations
  Reads requiring expansion:     70,000/sec * 300 = 21M instance calculations/sec

Event storage:
  Event record: ~500 bytes (title, time, attendees, description, recurrence rule)
  Total events: 2 billion
  Raw storage: 2B * 500 = 1 TB
  With indexes: ~3 TB

The number that matters: 21 million recurring event expansions per second during peak reads. This is why the "expand on read" decision is so important and why you cache aggressively.

Naive Design

PostgreSQL with a simple events table.

Schema:

CREATE TABLE events (
    event_id UUID PRIMARY KEY,
    user_id UUID,
    title VARCHAR(255),
    start_time TIMESTAMPTZ,
    end_time TIMESTAMPTZ,
    timezone VARCHAR(50),
    is_recurring BOOLEAN,
    recurrence_rule TEXT,   -- RRULE string
    description TEXT,
    created_at TIMESTAMPTZ
);

CREATE TABLE attendees (
    event_id UUID,
    user_id UUID,
    status VARCHAR(20),  -- ACCEPTED, DECLINED, TENTATIVE, NEEDS_ACTION
    PRIMARY KEY (event_id, user_id)
);

Monthly view query:

SELECT * FROM events
WHERE user_id = :user
  AND start_time >= '2024-03-01' AND start_time < '2024-04-01'
ORDER BY start_time;

Problem: This query does not return recurring events whose start_time is before March but whose recurrence extends into March. A "daily standup" created on January 1st has start_time = 2024-01-01. It will not appear in the March query.

Workaround: Also query recurring events regardless of start_time and expand them in application code. Now you are scanning ALL recurring events for every calendar view.

Where It Breaks

Problem 1: Recurring Event Expansion Is Expensive

A user has 10 recurring events. Each must be expanded for the queried time window. A weekly event over 30 days = ~4 instances. A daily event = ~30 instances. With 10 events, that is 50-200 instances to compute. At 70K reads/sec, that is 3.5-14M instance computations per second. In the application layer. On every request.

Problem 2: The Three-Way Edit Problem

User has "Team standup every weekday at 9 AM." They want to move Thursday's standup to 10 AM. Three options:

This instance only: Create an exception for this specific Thursday. All other instances stay at 9 AM.
This and all future: Split the recurring series. All past instances stay at 9 AM. This Thursday and all future instances move to 10 AM.
All instances: Move the entire series to 10 AM. All past and future instances change.

Each option requires a different storage operation. "This instance only" adds an exception record. "This and all future" splits the RRULE into two series. "All instances" modifies the original RRULE. If you do not handle all three, your calendar is broken for enterprise users.

Problem 3: February 29th and Month-End Dates

"Monthly meeting on the 31st." What happens in February? April? Options:

Skip months without a 31st (user misses meetings in Feb, Apr, Jun, Sep, Nov).
Fall back to the last day of the month (Feb 28/29, Apr 30, etc.).
RFC 5545 says: BYMONTHDAY=31 falls on the last day of months shorter than 31 days. But not all implementations agree.

"Yearly event on February 29th." What happens in non-leap years? Skip? Move to Feb 28? Move to Mar 1? Google Calendar skips. Outlook moves to Feb 28. Both are valid interpretations.

Problem 4: DST Transitions

"Daily alarm at 2:30 AM." On the spring-forward night, 2:30 AM does not exist (clocks jump from 2:00 to 3:00). On the fall-back night, 2:30 AM happens twice. What does the system do?

The standard answer: Store events in the user's local timezone with a timezone identifier (e.g., "America/New_York"), not UTC offset. When expanding, use the tz database to compute the actual UTC time for each instance. During spring-forward, an event at 2:30 AM is skipped or moved to 3:00 AM (platform-specific). During fall-back, the first occurrence is used.

Problem 5: Multi-Device Sync Conflicts

User edits an event on their phone (offline). Simultaneously edits the same event on their laptop. Phone comes online. Which edit wins?

Real Design

Calendar Scheduling — Calendar Scheduling High-Level Design

Architecture Overview

┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  Web Client  │  │  Mobile App  │  │  CalDAV      │
│              │  │              │  │  Client      │
└──────┬───────┘  └──────┬───────┘  └──────┬───────┘
       │                 │                 │
       └────────────┬────┴────────────────┘
                    │
           ┌────────┴────────┐
           │  Calendar API    │
           │  Service         │
           └────────┬────────┘
                    │
    ┌───────────────┼───────────────┐
    │               │               │
┌───┴───────┐ ┌─────┴─────┐ ┌──────┴──────┐
│ Event     │ │ Free/Busy │ │ Notification│
│ Store     │ │ Service   │ │ Service     │
│ (sharded  │ │ (interval │ │ (push +     │
│  by user) │ │  tree +   │ │  email)     │
│           │ │  cache)   │ │             │
└───────────┘ └───────────┘ └─────────────┘

Component 1: RRULE -- Recurring Events Stored as Rules

RFC 5545 defines RRULE, the standard format for recurring event rules. Google Calendar, Apple Calendar, and Outlook all use it.

Examples:

Every weekday at 9 AM:
  RRULE:FREQ=WEEKLY;BYDAY=MO,TU,WE,TH,FR

Every 2 weeks on Monday:
  RRULE:FREQ=WEEKLY;INTERVAL=2;BYDAY=MO

Monthly on the 15th, 10 occurrences:
  RRULE:FREQ=MONTHLY;BYMONTHDAY=15;COUNT=10

Yearly on March 1, until 2030:
  RRULE:FREQ=YEARLY;BYMONTH=3;BYMONTHDAY=1;UNTIL=20300101T000000Z

Storage: One row in the database per recurring series. The RRULE string is stored as a text field. Individual instances are NOT stored as rows.

Expand on read, not write: When a user views March 2024, the server:

Fetches all single events in the March window.
Fetches all recurring event rules for the user (regardless of start date).
For each rule, computes which instances fall within March 2024 using the RRULE expansion algorithm.
Merges single events and expanded instances.
Returns the combined list.

Why not expand on write? Because a "every weekday" event would generate ~260 rows per year. Over 5 years, one event becomes 1,300 rows. A user with 10 recurring events has 13,000 rows for events that are conceptually 10 items. Write amplification is massive. Worse: editing "this and all future" on a write-expanded event requires updating hundreds of rows. With expand-on-read, you update one RRULE record.

The cost of expand-on-read: CPU. Each RRULE expansion requires parsing the rule and iterating through dates. A complex rule (e.g., "every third Wednesday of months where the first day is a Tuesday") can take 1-5 ms to expand over a 30-day window. For simple rules (daily, weekly), it is < 0.1 ms.

Component 2: Exception Handling for "This Instance Only"

When a user modifies a single instance of a recurring event, you store an exception (called EXDATE in RFC 5545).

Data model:

CREATE TABLE recurring_events (
    event_id UUID PRIMARY KEY,
    user_id UUID,
    title VARCHAR(255),
    start_time TIMESTAMPTZ,    -- first occurrence
    duration INTERVAL,
    timezone VARCHAR(50),
    rrule TEXT,
    created_at TIMESTAMPTZ
);

CREATE TABLE recurring_exceptions (
    event_id UUID,             -- parent recurring event
    original_date DATE,        -- which instance is being overridden
    exception_type VARCHAR(10),-- MODIFIED or DELETED
    -- Modified fields (null if DELETED):
    new_title VARCHAR(255),
    new_start_time TIMESTAMPTZ,
    new_end_time TIMESTAMPTZ,
    PRIMARY KEY (event_id, original_date)
);

Expansion with exceptions:

Expand the RRULE to get all instances in the window.
For each instance, check the exceptions table:
If DELETED: skip this instance.
If MODIFIED: use the modified fields instead of the rule-generated values.
Return the final list.

"This and all future" (THISANDFUTURE):

This is the hardest case. The user splits the series. Implementation:

Set an UNTIL date on the original RRULE (ending before the split date).
Create a new recurring event with a new RRULE starting from the split date, with the modifications applied.
Exceptions before the split stay with the original event. Exceptions after the split are deleted (they are now covered by the new rule).

Component 3: Conflict Detection with Interval Tree

Free/busy lookup and meeting room double-booking prevention both reduce to: "given a time interval, find all overlapping events."

Naive approach: Scan all events for the user/room in the time range and check for overlaps. O(N) per query where N is events in the range.

Interval tree approach: Build an augmented BST where each node represents an event interval [start, end]. Each node also stores the maximum endpoint in its subtree.

Overlap query for [query_start, query_end]:
1. If current node's interval overlaps [query_start, query_end]:
   add to results.
2. If left subtree's max endpoint > query_start:
   recurse into left subtree.
3. If right subtree exists and current node's start < query_end:
   recurse into right subtree.

Complexity: O(log N + K) where K is the number of overlapping events.

Free/busy for N attendees: For each attendee, query their interval tree for the time range. Merge the results. Find gaps. Return available slots. Total: O(N * (log E + K)) where E is events per user and K is events in the range.

Caching: For popular meeting rooms and executives (whose calendars are frequently queried for scheduling), cache the interval tree in Redis. The cache is invalidated on event creation/modification. Cache hit rate for free/busy queries on rooms: ~90% (rooms are queried far more than they are modified).

Component 4: Timezone Handling

This is where most calendar implementations have bugs. The rules:

Rule 1: Store timezone identifiers, not UTC offsets.

America/New_York is correct. UTC-05:00 is wrong. Why? Because New York is UTC-05:00 in winter and UTC-04:00 in summer. If you store UTC-05:00, your event is wrong for 6 months of the year.

Rule 2: Store the event's "wall clock" time in the user's timezone.

"Meeting at 9 AM in New York" is stored as start_time = 09:00, timezone = America/New_York. When displayed to a user in London, the system converts: 9 AM Eastern = 2 PM GMT (or 1 PM GMT during US EST, because DST transitions differ between the US and UK).

Rule 3: Use the IANA tz database (tzdata) for conversions.

The tz database is updated 10-20 times per year as governments change DST rules. Your system must use the latest version. Failing to update caused a wave of incorrect meeting times when Egypt abolished DST in 2014 with 2 weeks notice.

Rule 4: All-day events have no timezone.

"Birthday on March 15" is the same date regardless of timezone. Store it as a DATE, not a TIMESTAMP. Display it as March 15 everywhere. Do not convert.

Component 5: Meeting Room Booking as a Reservation Pattern

Meeting rooms are shared resources with capacity constraints. Booking a room is a reservation problem.

Requirements:

No double-booking: two events cannot overlap on the same room.
Capacity: room has N seats. Events can request a capacity-appropriate room.
Auto-release: if the meeting organizer cancels, the room is freed.
Suggested rooms: given a time slot and attendee list, suggest available rooms near the attendees' usual office.

Implementation: Use optimistic locking on the room resource:

-- Attempt to book room
INSERT INTO room_bookings (room_id, event_id, start_time, end_time, version)
SELECT :room_id, :event_id, :start, :end, 0
WHERE NOT EXISTS (
    SELECT 1 FROM room_bookings
    WHERE room_id = :room_id
      AND start_time < :end
      AND end_time > :start
);
-- If INSERT affects 0 rows, the room is already booked for that time.

For high-contention rooms (the one good conference room on the floor that everyone wants for 10 AM Monday), add a short TTL hold: "Room tentatively held for 2 minutes while you finalize the event." This prevents the scenario where you select a room, spend 5 minutes adding attendees and a description, and the room is taken when you click Save.

Component 6: CalDAV Sync Tokens

CalDAV (RFC 4791) is the standard protocol for calendar synchronization. The key mechanism is sync tokens -- an opaque string that represents the calendar's state at a point in time.

Sync flow:

Client does initial sync: REPORT /calendar/ HTTP/1.1 with no sync token. Server returns ALL events and a sync token: sync-token: "abc123".
Client stores the sync token.
On subsequent sync: REPORT /calendar/ HTTP/1.1 with sync-token: "abc123". Server returns only events CHANGED since that token was issued.
Server returns a new sync token: sync-token: "def456".

Implementation: The sync token is a logical timestamp (or a database sequence number). The server stores a change log:

CREATE TABLE calendar_changes (
    change_id BIGSERIAL PRIMARY KEY,  -- monotonically increasing = sync token
    user_id UUID,
    event_id UUID,
    change_type VARCHAR(10),  -- CREATED, UPDATED, DELETED
    changed_at TIMESTAMPTZ
);

When a client syncs with sync-token: 42, the server queries SELECT * FROM calendar_changes WHERE user_id = :user AND change_id > 42. Returns the changed events and the latest change_id as the new sync token.

Efficiency: Instead of transferring the entire calendar on every sync (thousands of events), only changed events are transferred. For a user who syncs every 5 seconds, the typical delta is 0-2 events. This reduces sync bandwidth by 99%+.

Deep Dives

Calendar Scheduling — Calendar Scheduling Rrule

Deep Dive 1: The February 29th and Month-End Problem

This is the kind of edge case that separates a production calendar from a toy one.

Scenario: "Monthly on the 31st."

January 31    ✓ (31 days)
February 28/29 ? (no 31st)
March 31      ✓
April 30      ? (no 31st)
May 31        ✓

RFC 5545 behavior: RRULE:FREQ=MONTHLY;BYMONTHDAY=31 generates instances only on months that have a 31st day. February and April are skipped. This is often not what the user wants.

Google Calendar behavior: If you create "monthly on the 31st," Google stores BYMONTHDAY=31 and skips short months. If you want "last day of every month," Google uses BYMONTHDAY=-1 (RFC 5545 supports negative offsets: -1 = last day, -2 = second-to-last, etc.).

System design impact: Your RRULE expansion library must correctly handle negative BYMONTHDAY values and month-length variations. Test cases:

BYMONTHDAY=31 in February  -> skip (no 31st)
BYMONTHDAY=-1 in February  -> Feb 28 (or 29 in leap year)
BYMONTHDAY=29 in February  -> skip in non-leap year, Feb 29 in leap year
BYMONTHDAY=30 in February  -> always skip

Deep Dive 2: DST and the "Lost Hour" Problem

Spring forward (March, US):

Clocks jump from 2:00 AM to 3:00 AM. The time 2:30 AM does not exist.

If a recurring event is scheduled for 2:30 AM: - Google Calendar moves it to 3:30 AM (the next valid time). - Apple Calendar moves it to 3:00 AM (the new DST time). - Both are defensible. Neither is "correct."

Fall back (November, US):

Clocks go from 2:00 AM back to 1:00 AM. The time 1:30 AM occurs twice.

If a recurring event is at 1:30 AM: - Most implementations use the first occurrence (before the clocks change back). - Some implementations show the event twice (once for each 1:30 AM).

International DST differences:

The US and Europe change DST on different dates. A "9 AM New York" meeting is 2 PM London for most of the year, but during the 2-3 weeks between US and European DST transitions, it is 1 PM London or 3 PM London.

System design rule: ALWAYS store the timezone identifier (America/New_York), not the UTC offset. Recompute UTC times on every expansion using the latest tz database. Update the tz database promptly when governments change DST rules. Run a nightly job that checks for upcoming DST transitions and pre-computes affected events.

Deep Dive 3: Conflict-Free Multi-Device Sync

Two devices edit the same event offline. Which edit wins?

Last-write-wins (LWW):

Simple. The device that syncs last overwrites the other's changes. Problem: the user on Device A changes the title, the user on Device B changes the location. LWW preserves only one change.

Field-level merge:

Track changes at the field level. If Device A changes the title and Device B changes the location, merge both changes. Conflict only if both change the same field.

Event at sync time:
  title: "Standup"  location: "Room 1"  time: "9 AM"

Device A (offline): title -> "Daily Standup"
Device B (offline): location -> "Room 2"

Merged result:
  title: "Daily Standup"  location: "Room 2"  time: "9 AM"

Implementation: Each event has a vector clock or version number per field. On sync, the server compares field versions and auto-merges non-conflicting changes. If two devices modify the same field, the server uses LWW for that field and notifies both devices of the conflict.

Google Calendar's approach: Google uses operational transformation (similar to Google Docs) for calendar events. Each change is an operation (set_title, set_time, add_attendee). Operations are commutative where possible and transformed when they conflict. This provides a better user experience than LWW but is significantly more complex to implement.

Alternative Designs

Approach	Pros	Cons	When to Use
RRULE + expand on read (described above)	Minimal storage. Trivial writes. Clean model.	CPU-intensive reads. Complex expansion logic.	Google Calendar, Apple Calendar, any standard calendar.
Expand on write (materialize all instances)	Simple reads (just query by time range). No expansion logic at read time.	Massive write amplification. "Edit all future" touches hundreds of rows. Infinite recurring events cannot be stored.	Small-scale calendars with bounded recurrence (e.g., "10 occurrences").
Hybrid: expand and cache	Pre-expand for a rolling 90-day window. Cache in Redis. Expand on-demand outside the window.	Cache invalidation on edits. Two code paths.	When read performance is critical and recurring events are heavily queried.
Event sourcing	Full history of all changes. Trivial undo/redo.	Complex read path (replay events to compute state). Higher storage.	When audit trail and change history are primary requirements.
Shared database table (all users in one table)	Simple schema. Easy joins for free/busy.	Hotspots on popular users' partitions. Scale limit ~10M events per table before index performance degrades.	< 1M users.

Scaling Math Verification

Event Store

Total events (stored):       2 billion (including recurring rules, not expansions)
Event size:                  500 bytes average
Total storage:               1 TB
With indexes (user_id, time): ~3 TB

Sharding:                    By user_id
Shard count:                 100
Events per shard:            20 million
Storage per shard:            30 GB
Read throughput per shard:    700/sec (70K total / 100 shards)
Write throughput per shard:   150/sec (15K total / 100 shards)

Recurring Event Expansion

Peak read requests:          70,000/sec
Requests requiring expansion: ~60% (rest are single events or cached)
Expansions per request:       10 recurring events * 30 days = 300 instances
Total expansions/sec:         70K * 0.6 * 300 = 12.6M/sec
CPU per expansion:            0.1-1 us (simple rules) to 5 us (complex rules)
Total CPU:                    12.6M * 1 us avg = 12.6 seconds/sec = 13 cores
Server count for expansion:   13 / 8 cores = 2 servers (trivial with caching)

With caching (90% hit rate for repeated views):
  Backend expansions/sec:      1.26M
  CPU:                         1.3 cores

Free/Busy Service

Free/busy queries/sec:       5,000 (scheduling meetings)
Attendees per query:         5 avg
Events per attendee/week:    20
Interval tree operations:    5K * 5 * log(20) = ~108K tree operations/sec
CPU per operation:           0.5 us
Total CPU:                   54 ms/sec (one core handles this easily)

Cache hit rate for rooms:    90%
Cache hit rate for users:    60%

Sync Token Queries

Active sync clients:         200M users * 2 devices avg = 400M
Sync interval:               30 seconds (background) to 5 seconds (active)
Sync queries/sec:            400M / 30 = 13.3M/sec (background)
                             + 5M / 5 = 1M/sec (active foreground)
                             = ~14.3M/sec total

Per query:                   Index lookup on (user_id, change_id > token)
Changes per sync (avg):      0.2 events (most syncs find nothing)
Change log table size:        ~50GB (30-day retention, daily compaction)

Failure Analysis

Failure	Impact	Mitigation
Event store shard goes down	Users on that shard cannot view or create events.	Read replicas for calendar views. Primary failover for writes. Stale reads from replica are acceptable for calendar views.
RRULE expansion library bug	Recurring events display incorrectly for affected patterns.	Comprehensive test suite with 100+ edge cases (Feb 29, DST, month-end, timezone changes). Canary deployment with regression tests.
tz database out of date	Events display at wrong times after DST rule changes.	Auto-update tz database from IANA releases. Monitor government announcements for DST changes. Recompute affected events on tz update.
Sync token log grows unbounded	Database runs out of disk. Sync queries slow down.	Compact change log: retain last 30 days only. If a client presents a token older than 30 days, force a full resync.
Meeting room double-booking	Two meetings booked in the same room at the same time.	Database-level constraint (unique index on room_id + time range). If the constraint is violated, reject the second booking with a clear error.
Multi-device sync conflict	User sees unexpected event changes. Data loss if one device's changes are silently overwritten.	Field-level merge for non-conflicting changes. Conflict notification for same-field conflicts. User can choose which version to keep.
Notification service down	Users miss meeting reminders. They are late or miss meetings.	Retry with exponential backoff. Client-side local notifications as fallback (phone alarm for events synced locally).

Level Expectations

Level	What the Interviewer Expects
Mid (L4)	Simple events table with CRUD operations. Knows about recurring events conceptually. Basic free/busy query. Mentions timezone handling as a concern.
Senior (L5)	RRULE storage with expand-on-read. Three-way edit problem (this/future/all) with concrete solution. Interval tree for conflict detection. Timezone identifiers vs. UTC offsets. CalDAV sync tokens for incremental sync. Meeting room booking with optimistic locking.
Staff+ (L6)	February 29th and BYMONTHDAY=-1 edge cases. DST "lost hour" handling with concrete examples. Field-level merge for multi-device conflicts. tz database update pipeline. Hybrid caching strategy for recurring expansion. Quantified expansion cost and caching impact. Reference to RFC 5545 and real Google Calendar / Outlook behaviors.

References from Our Courses

Partitioning, Replication, and Pooling — time-range partitioning for calendar event storage
CRDTs — conflict resolution for concurrent calendar edits
Redis Data Structures and Use Cases — caching free/busy slots for availability lookups

Red Team This Design

Ready to stress-test this architecture? The Attack companion tears apart every decision in this design — from hardware physics to security holes to what actually happens at 10x scale.

Attack: Design a Calendar and Scheduling Service →