Design a Price Alert Service
TL;DR
A price alert service monitors product prices across e-commerce sites and notifies users when prices drop below their target. The hard part is not checking a price -- it is doing it efficiently at scale when you have 50 million tracked products and 200 million active alerts. Crawling every product hourly would generate 50 million HTTP requests per hour, most of which discover that nothing changed. The key insight: use adaptive crawl frequency (popular products hourly, long tail daily), store only price changes (80-90% write reduction), and deduplicate notifications so a user who set a $50 alert on a product that fluctuates between $49 and $51 does not get pinged every hour. Amazon's PA-API rate limits (1 request/sec for new associates) force creative architectural decisions around batching and caching.
The System
A price alert service lets users set target prices on products. "Tell me when this laptop drops below $800." When the product price falls to or below the target, the user gets notified via push, email, or SMS. The user then clicks through, makes the purchase, and the service earns affiliate commission.
CamelCamelCamel tracks prices on 26+ million Amazon products and has sent hundreds of millions of price alerts since launch. Honey (acquired by PayPal for $4 billion) monitors prices and applies coupons at checkout. Google Shopping Alerts, Keepa, and Slickdeals all operate in this space. The economics are driven by affiliate commissions (typically 1-10% of purchase price) -- if you alert a user to a price drop on a $500 product and they buy, you earn $5-50. At millions of alerts per month, this is a viable business model. The Chrome extension model (Honey, Keepa) adds a dimension: the extension can detect when a user is viewing a product and retroactively create alerts, driving engagement without requiring explicit user action.
Requirements
Functional
- Create alert: User specifies a product URL and a target price (e.g., "Alert me when this item is under $50")
- Price tracking: System periodically checks current prices for all tracked products
- Notification: When a product's price drops to or below the target, notify the user via email (primary), push notification, or SMS
- Price history: Show the historical price graph for a product (useful for users deciding whether a current price is actually a deal)
- Multiple alerts per product: Many users can set alerts on the same product at different target prices
- Browser extension: Optionally, a Chrome extension that shows price history on product pages and lets users set alerts in one click
Non-Functional
- Freshness: Detect price changes within 1 hour for popular products, within 24 hours for long-tail products
- Alert latency: Once a price drop is detected, notify the user within 5 minutes
- Scale: 50 million tracked products, 200 million active alerts across 30 million users
- Crawl efficiency: Minimize HTTP requests to source sites (both for politeness and cost)
- Notification accuracy: Do not alert a user for a price drop they have already been alerted about (dedup)
- Availability: 99.9% uptime for the alert pipeline
Back-of-Envelope Math
Tracked products: 50M
Active alerts: 200M (avg 4 alerts per product)
Crawl frequency (adaptive):
Tier 1 (popular, > 100 alerts): 500K products, crawled hourly
Tier 2 (moderate, 10-100 alerts): 5M products, crawled every 6 hours
Tier 3 (long tail, < 10 alerts): 44.5M products, crawled daily
Crawl volume:
Tier 1: 500K * 24 crawls/day = 12M crawls/day
Tier 2: 5M * 4 crawls/day = 20M crawls/day
Tier 3: 44.5M * 1 crawl/day = 44.5M crawls/day
Total: 76.5M crawls/day = 885 crawls/sec
Compared to naive (all hourly): 50M * 24 = 1.2B crawls/day = 13,889/sec
Adaptive approach is 16x fewer crawls.
Price changes:
Average product changes price 2x per week
50M * 2/7 = 14.3M price changes/day = 165 changes/sec
Of 76.5M crawls/day, only 14.3M discover changes = 18.7% change rate
81.3% of crawls find no change (wasted, but necessary for freshness)
Storage:
Current prices: 50M products * 200 bytes = 10 GB
Price history (1 year, only changes):
14.3M changes/day * 365 * 50 bytes = 261 GB
Compact -- fits in a single database
Alert evaluation:
14.3M price changes/day, each product has avg 4 alerts
14.3M * 4 = 57.2M alert checks/day = 662 checks/sec
Each check: compare new_price <= target_price. Trivial compute.
Notifications:
If 5% of alert checks trigger: 57.2M * 0.05 = 2.86M notifications/day
= 33 notifications/sec
Very manageable for a notification service.
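The arithmetic above can be reproduced with a few lines of Python. This is a sketch that simply plugs in the tier sizes and rates stated in this section; the variable names are illustrative. Because it uses the unrounded 14.29M changes/day rather than the rounded 14.3M, the downstream figures can differ from the text by a fraction of a percent.

```python
# Back-of-envelope check using the tier sizes and rates stated above.
SECONDS_PER_DAY = 86_400

tiers = {  # tier name -> (products, crawls per day)
    "tier1_hourly": (500_000, 24),
    "tier2_6h": (5_000_000, 4),
    "tier3_daily": (44_500_000, 1),
}

total_products = sum(n for n, _ in tiers.values())
crawls_per_day = sum(n * freq for n, freq in tiers.values())
naive_crawls_per_day = total_products * 24           # everything hourly

changes_per_day = total_products * 2 / 7             # 2 price changes per week
alert_checks_per_day = changes_per_day * 4           # avg 4 alerts per product
notifications_per_day = alert_checks_per_day * 0.05  # 5% of checks trigger

print(f"crawls/day:      {crawls_per_day / 1e6:.1f}M")
print(f"crawls/sec:      {crawls_per_day / SECONDS_PER_DAY:.0f}")
print(f"naive/adaptive:  {naive_crawls_per_day / crawls_per_day:.1f}x")
print(f"change rate:     {changes_per_day / crawls_per_day:.1%}")
print(f"notifications/s: {notifications_per_day / SECONDS_PER_DAY:.0f}")
```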
The Naive Design
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Cron Job   │────>│   Crawler    │────>│   Database   │
│   (hourly)   │     │   (single)   │     │   (MySQL)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 v
                                          ┌──────────────┐
                                          │ Alert Check  │
                                          │  (compare)   │
                                          └──────┬───────┘
                                                 │
                                                 v
                                          ┌──────────────┐
                                          │  Email/SMS   │
                                          │    (send)    │
                                          └──────────────┘
-- Every hour:
FOR each product in products_table:
    new_price = fetch_price(product.url)
    UPDATE products SET current_price = new_price WHERE id = product.id
    FOR each alert WHERE product_id = product.id AND target_price >= new_price:
        send_notification(alert.user, product, new_price)
        UPDATE alerts SET triggered = true WHERE id = alert.id
This works for tracking 1,000 products. One server, one cron job, done.
Where Does This Break First?
At 50M products, the hourly crawl takes longer than an hour. At 300ms per HTTP request (DNS + fetch + parse), 50M requests take 15M seconds = 173 days of sequential crawling. Even with 1,000 concurrent threads across multiple machines, you need 4.2 hours per crawl cycle. You cannot maintain hourly freshness.
Where It Breaks
Problem 1: Crawling 50M products hourly is impossible. At 300ms per request, 50M requests take 173 days sequentially. With 1,000 threads: 4.2 hours. But that assumes 100% success rate and no rate limiting from source sites. In reality, Amazon rate-limits scrapers aggressively, and robots.txt specifies crawl delays. The naive "crawl everything every hour" approach does not scale.
Problem 2: Almost all hourly crawls discover no change. Products change price an average of 2 times per week. Checking hourly means 166 out of 168 checks find the same price. You are wasting 99% of your crawl budget on discovering nothing. (Even with the adaptive tiers from the back-of-envelope math, 81% of crawls find no change.)
Problem 3: Amazon PA-API rate limits. If you use Amazon's Product Advertising API (the legitimate way to get prices), new associates get 1 request per second, and an account held to a daily cap of 8,640 requests averages only 1 request per 10 seconds. At one product per request, a single pass over 50M products takes about 579 days at a sustained 1 request/sec (86,400/day) and about 5,787 days (nearly 16 years) at 8,640/day. You need a mix of API and web scraping, with smart allocation.
Problem 4: Notification spam from oscillating prices. A product's price fluctuates between $49 and $51 daily (common for algorithmic pricing). A user with a $50 alert gets notified every day when the price dips to $49, then sees the price back at $51 when they check. After 3 days of this, they unsubscribe.
Problem 5: Storing every price check is wasteful. If you store a price record every time you check (76.5M rows/day), you accumulate 27.9B rows/year. But 81% of those rows are "no change" -- identical to the previous record. Storing only changes reduces this to 5.2B rows/year (14.3M/day * 365).
The Real Design
┌───────────────────────────────────────────────────────────┐
│                      Crawl Scheduler                      │
│          Assigns crawl frequency by product tier          │
│   Tier 1 (popular): hourly │ Tier 2: 6h │ Tier 3: daily   │
└────────────────────┬──────────────────────────────────────┘
                     │
                     v
┌───────────────────────────────────────────────────────────┐
│                    Crawl Queue (Kafka)                    │
│         Topic: crawl-tasks, partitioned by domain         │
└────────────────────┬──────────────────────────────────────┘
                     │
         ┌───────────┼───────────┐
         │           │           │
    ┌────v───┐  ┌────v───┐   ┌───v────┐
    │Crawler │  │Crawler │   │Crawler │
    │Worker 1│  │Worker 2│   │Worker N│
    │(pool   │  │        │   │        │
    │per     │  │        │   │        │
    │domain) │  │        │   │        │
    └────┬───┘  └────┬───┘   └───┬────┘
         │           │           │
         └───────────┼───────────┘
                     │
                     v
┌───────────────────────────────────────────────────────────┐
│                   Price Change Detector                   │
│        Compare new price vs. stored current price         │
│         If changed: write to price_changes table          │
│     If unchanged: update last_checked timestamp only      │
└────────────────────┬──────────────────────────────────────┘
                     │ (only price changes)
                     v
┌───────────────────────────────────────────────────────────┐
│                      Alert Evaluator                      │
│       For each price change: find all alerts where        │
│    target_price >= new_price AND not recently notified    │
└────────────────────┬──────────────────────────────────────┘
                     │
                     v
┌───────────────────────────────────────────────────────────┐
│                   Notification Service                    │
│         Dedup → Template → Channel Router → Send          │
└───────────────────────────────────────────────────────────┘
Adaptive Crawl Frequency
The single biggest optimization in the system. Instead of crawling all products at the same rate, assign each product a tier based on how many alerts it has (proxy for importance) and how frequently it changes price.
def assign_crawl_tier(product):
    alert_count = get_alert_count(product.id)
    change_frequency = get_price_change_frequency(product.id)  # changes per week
    if alert_count > 100 or change_frequency > 5:
        return Tier.HOURLY  # 500K products
    elif alert_count > 10 or change_frequency > 1:
        return Tier.SIX_HOURS  # 5M products
    else:
        return Tier.DAILY  # 44.5M products
Dynamic tier adjustment: If a Tier 3 product suddenly gets 50 new alerts (someone shared a deal link), promote it to Tier 2; once it passes 100 alerts, promote it to Tier 1. If a Tier 1 product has not changed price in 2 weeks, demote to Tier 2. Run tier reassignment every 6 hours.
Crawl budget per domain: Within each tier, respect per-domain crawl budgets. Amazon might get 100 crawls/sec (they have millions of products). A small retailer might get 1 crawl/sec. Domain budgets are configured manually for major sites and default to 1/sec for unknown domains.
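One common way to enforce a per-domain budget is a token bucket per domain. The sketch below is an in-memory illustration, not the service's actual limiter: the class name `DomainRateLimiter`, its parameters, and the injectable `clock` are all assumptions made for the example, and a real deployment would need shared state across crawler workers.

```python
import time


class DomainRateLimiter:
    """Token bucket per domain; rates are crawls per second (hypothetical helper)."""

    def __init__(self, rates, default_rate=1.0, clock=time.monotonic):
        self.rates = rates                # domain -> allowed crawls/sec
        self.default_rate = default_rate  # budget for unknown domains
        self.clock = clock                # injectable for testing
        self.state = {}                   # domain -> (tokens, last_refill_time)

    def try_acquire(self, domain):
        """Return True if one crawl may be sent to `domain` right now."""
        rate = self.rates.get(domain, self.default_rate)
        capacity = max(rate, 1.0)  # allow at most one second of burst
        now = self.clock()
        tokens, last = self.state.get(domain, (capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket size.
        tokens = min(capacity, tokens + (now - last) * rate)
        if tokens >= 1.0:
            self.state[domain] = (tokens - 1.0, now)
            return True
        self.state[domain] = (tokens, now)
        return False


limiter = DomainRateLimiter({"amazon.com": 100.0})
if limiter.try_acquire("amazon.com"):
    pass  # safe to dispatch one crawl task for this domain
```

A worker that gets `False` would typically requeue the task with a short delay rather than drop it.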
Store-Only-Changes (80-90% Write Reduction)
The price change detector compares the fetched price against the stored current price. If they are the same, it only updates the last_checked_at timestamp. If different, it writes a new record to the price_changes table.
-- Products table (current state)
CREATE TABLE products (
    id BIGINT PRIMARY KEY,
    url TEXT,
    domain TEXT,
    title TEXT,
    current_price DECIMAL(10,2),
    currency CHAR(3),
    last_checked_at TIMESTAMP,
    last_changed_at TIMESTAMP,
    crawl_tier INT,
    INDEX idx_domain_tier (domain, crawl_tier)
);

-- Price changes table (append-only history)
CREATE TABLE price_changes (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    product_id BIGINT,
    old_price DECIMAL(10,2),
    new_price DECIMAL(10,2),
    changed_at TIMESTAMP,
    INDEX idx_product_time (product_id, changed_at)
);
Write math: 76.5M crawls/day, but only 14.3M result in a price change. The price_changes table gets 14.3M inserts/day instead of 76.5M. The products table gets 76.5M timestamp updates/day (lightweight UPDATE ... SET last_checked_at = NOW()), but these are cheap because they do not create new rows.
Timestamp-only updates can be batched: Instead of updating last_checked_at for each product individually, group unchanged products into batches of 1,000 and update each batch with a single UPDATE ... WHERE id IN (...) statement. That is 1,000 updates in one query, so 76.5K batched queries/day instead of 76.5M individual queries.
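A minimal sketch of how such batches might be built, assuming a DB-API style driver that uses `%s` placeholders (as the other snippets in this article do). The function name `build_batched_updates` is hypothetical; it only constructs the statements and leaves execution to the caller.

```python
def build_batched_updates(product_ids, batch_size=1000):
    """Yield (sql, params) pairs, one UPDATE per batch of product IDs."""
    for start in range(0, len(product_ids), batch_size):
        batch = product_ids[start:start + batch_size]
        # One placeholder per ID keeps the query parameterized.
        placeholders = ", ".join(["%s"] * len(batch))
        sql = (
            "UPDATE products SET last_checked_at = NOW() "
            f"WHERE id IN ({placeholders})"
        )
        yield sql, tuple(batch)


# 2,500 unchanged products become 3 statements instead of 2,500.
statements = list(build_batched_updates(list(range(2500))))
print(len(statements))
```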
Notification Deduplication with last_notified_price
This is the key to preventing notification spam from oscillating prices.
CREATE TABLE alerts (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    product_id BIGINT,
    target_price DECIMAL(10,2),
    last_notified_price DECIMAL(10,2) NULL,
    last_notified_at TIMESTAMP NULL,
    is_active BOOLEAN DEFAULT TRUE,
    created_at TIMESTAMP,
    -- Composite index used by alert evaluation; also serves product_id lookups
    INDEX idx_product_target (product_id, target_price, is_active),
    INDEX idx_user (user_id)
);
The last_notified_price field tracks the price at which the user was last alerted. The dedup rule:
def should_notify(alert, new_price):
    # Only notify if price is at or below target
    if new_price > alert.target_price:
        return False
    # Never notified before -- send it
    if alert.last_notified_price is None:
        return True
    # Only re-notify if price dropped FURTHER than last notification
    if new_price < alert.last_notified_price:
        return True
    # Price is at or above last notified price -- skip
    # (prevents spam when price oscillates around target)
    return False
Example of dedup in action:
User sets alert: target = $50
Day 1: price = $49 → notify (first time). last_notified_price = $49
Day 2: price = $51 → no notify (above target)
Day 3: price = $49 → no notify (same as last_notified_price, not lower)
Day 4: price = $45 → notify (below last_notified_price). last_notified = $45
Day 5: price = $49 → no notify (above $45, not a new low)
Day 6: price = $42 → notify (below $45). last_notified = $42
Without dedup, this user gets 5 notifications in 6 days (one for every day the price is at or below $50). With dedup, they get 3, each at a new low. This dramatically reduces notification fatigue and keeps open rates high.
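The six-day example can be replayed in code to confirm which days trigger. This is a self-contained sketch: the `Alert` dataclass and the `replay` helper are illustrative stand-ins for the alerts table row and the evaluation loop, and `should_notify` is condensed from the rule above.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Alert:
    target_price: Decimal
    last_notified_price: Optional[Decimal] = None


def should_notify(alert, new_price):
    if new_price > alert.target_price:
        return False  # above target
    if alert.last_notified_price is None:
        return True   # first notification
    return new_price < alert.last_notified_price  # only on a new low


def replay(alert, daily_prices):
    """Return the 1-based days on which a notification would be sent."""
    notified_days = []
    for day, price in enumerate(daily_prices, start=1):
        if should_notify(alert, price):
            alert.last_notified_price = price
            notified_days.append(day)
    return notified_days


prices = [Decimal(p) for p in ("49", "51", "49", "45", "49", "42")]
days = replay(Alert(target_price=Decimal("50")), prices)
print(days)  # [1, 4, 6]
```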
Alert Evaluation Pipeline
When a price change is detected, the system needs to find all alerts whose target price is at or above the new price.
def evaluate_alerts(product_id, new_price, old_price):
    # Only evaluate if price DROPPED (not increased)
    if new_price >= old_price:
        return
    # Find all active alerts for this product where target >= new_price
    alerts = db.query("""
        SELECT * FROM alerts
        WHERE product_id = %s
        AND target_price >= %s
        AND is_active = TRUE
    """, (product_id, new_price))
    for alert in alerts:
        if should_notify(alert, new_price):
            enqueue_notification(alert, new_price)
            db.execute("""
                UPDATE alerts
                SET last_notified_price = %s, last_notified_at = NOW()
                WHERE id = %s
            """, (new_price, alert.id))
Index optimization: The query on alerts uses an index on (product_id, target_price, is_active). For a product with 100 alerts and a new price of $45, this index efficiently returns only alerts with target >= $45.
Batch evaluation: Price changes are consumed from a Kafka topic. The alert evaluator processes them in batches of 100, grouping by product_id to minimize database queries.
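One way the grouping step might look is sketched below. It goes slightly beyond what the text states: besides grouping by product_id, it keeps only the latest event per product and drops products whose latest event is not a price drop. That collapse is an assumption for illustration, and it relies on events for a given product arriving in order. The event shape (dicts with `product_id`, `old_price`, `new_price`) and the name `group_price_drops` are also hypothetical.

```python
def group_price_drops(batch):
    """Collapse a batch of change events to one price drop per product.

    Returns {product_id: latest new_price} for products whose most recent
    event in the batch lowered the price.
    """
    latest = {}
    for event in batch:
        latest[event["product_id"]] = event  # later events overwrite earlier ones
    return {
        pid: e["new_price"]
        for pid, e in latest.items()
        if e["new_price"] < e["old_price"]
    }


batch = [
    {"product_id": 1, "old_price": 60, "new_price": 55},
    {"product_id": 2, "old_price": 20, "new_price": 25},  # increase, skipped
    {"product_id": 1, "old_price": 55, "new_price": 45},
]
print(group_price_drops(batch))  # {1: 45}
```

Each surviving entry then needs a single alerts query, so a batch of 100 events touching 30 distinct products costs at most 30 queries.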
Deep Dives

Deep Dive 1: Crawling Strategy and Anti-Bot Defenses
E-commerce sites actively fight scrapers. Large retailers such as Amazon, Walmart, and Best Buy deploy bot detection, either commercial products such as Akamai Bot Manager and Cloudflare Bot Management or in-house equivalents, to detect and block automated requests.
Avoiding detection:
1. Respect robots.txt and crawl-delay directives
2. Rotate user agents (real browser user agents, not "Python-urllib/3.10")
3. Randomize crawl timing (not exactly every 3600 seconds)
4. Use residential proxies for high-volume domains
5. Maintain cookies/sessions (bot detectors flag stateless requests)
6. Render JavaScript when needed (price may be loaded dynamically)
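Item 3 in the list above (randomized timing) is simple to sketch. The helper below adds symmetric jitter to a tier's base interval; the function name and the 10% default are assumptions chosen for illustration, not values from the design.

```python
import random


def next_crawl_delay(interval_seconds, jitter_fraction=0.1, rng=random):
    """Return the base interval shifted by up to +/- jitter_fraction."""
    jitter = interval_seconds * jitter_fraction
    return interval_seconds + rng.uniform(-jitter, jitter)


# An hourly product is rescheduled somewhere between 54 and 66 minutes out.
delay = next_crawl_delay(3600)
print(f"next crawl in {delay / 60:.1f} minutes")
```

Jitter also spreads load: products added in the same batch drift apart instead of all coming due in the same second.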
API vs. scraping strategy:
For Amazon specifically, the Product Advertising API (PA-API) is the legitimate way to get prices. But the rate limits are severe:
New associates: 1 request/sec (86,400/day)
Tier 2: 1 request/sec with higher daily cap
Tier 3: 8,640 requests/day (an average of 1 request per 10 sec)
With 50M Amazon products:
PA-API at 86,400/day: covers 86K products/day
Need scraping for the remaining 49.9M products
Hybrid approach: Use the PA-API for real-time price checks (when a user is actively viewing a product page in the extension) and scraping for bulk background crawling. Allocate the PA-API budget to user-initiated requests (more valuable) and use scrapers for scheduled crawls.
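The allocation idea can be sketched as a daily quota with a reserved share that background crawls may not touch. The class name `DailyApiBudget` and the specific numbers are hypothetical; a production version would also need a daily reset and shared storage.

```python
class DailyApiBudget:
    """Tracks one day's API quota, holding back a share for user requests."""

    def __init__(self, daily_limit, user_reserve):
        if user_reserve > daily_limit:
            raise ValueError("reserve cannot exceed the daily limit")
        self.remaining = daily_limit
        self.user_reserve = user_reserve

    def try_spend(self, user_initiated):
        """Spend one request if allowed; return whether it was granted."""
        # Background crawls may not dip into the reserved share.
        floor = 0 if user_initiated else self.user_reserve
        if self.remaining > floor:
            self.remaining -= 1
            return True
        return False


budget = DailyApiBudget(daily_limit=8640, user_reserve=6000)
if not budget.try_spend(user_initiated=False):
    pass  # fall back to the scraper for this scheduled crawl
```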
Price extraction: The crawler needs to extract the price from the HTML. This is fragile -- sites change their HTML structure regularly.
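from bs4 import BeautifulSoup

# Amazon price selectors (these change every few months)
PRICE_SELECTORS = [
    '#priceblock_ourprice',
    '#priceblock_dealprice',
    '.a-price .a-offscreen',
    '#corePrice_feature_div .a-price-whole',
]

# Per-domain selector lists; unknown domains fall back to generic selectors.
# The generic entries are illustrative guesses, not a vetted list.
DOMAIN_SELECTORS = {'amazon.com': PRICE_SELECTORS}
GENERIC_SELECTORS = ['[itemprop="price"]', '.price']

def extract_price(html, domain):
    soup = BeautifulSoup(html, 'lxml')
    selectors = DOMAIN_SELECTORS.get(domain, GENERIC_SELECTORS)
    # Try selectors in priority order; the first match wins
    for selector in selectors:
        element = soup.select_one(selector)
        if element:
            # parse_price: helper that turns text like "$1,299.99" into a number
            return parse_price(element.text)
    return None  # extraction failed, log for debugging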
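The extraction code calls parse_price without defining it. A minimal version might look like the following, assuming US-style formatting (comma as thousands separator, dot as decimal point); a European string such as "1.299,99" would be misread and needs locale-aware handling.

```python
import re
from decimal import Decimal, InvalidOperation

# First run of digits, with optional thousands commas and a decimal part.
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Parse a US-style price string such as '$1,299.99' into a Decimal.

    Returns None when no number is found.
    """
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    try:
        return Decimal(match.group(0).replace(",", ""))
    except InvalidOperation:
        return None


print(parse_price("$1,299.99"))  # 1299.99
```

Returning a Decimal rather than a float matches the DECIMAL(10,2) columns and avoids spurious "price changed" detections caused by float rounding.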
Monitoring extraction success rate: Track the percentage of crawls where price extraction succeeds. If it drops below 95% for a domain, the site probably changed its HTML structure, and the selectors need updating. Alert the on-call engineer.
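A rolling-window monitor is one way to track this. The sketch below keeps the last N extraction outcomes per domain in memory; the class name, window size, and minimum sample count are assumptions, and only the 95% threshold comes from the text.

```python
from collections import defaultdict, deque


class ExtractionMonitor:
    """Rolling per-domain success rate over the last `window` crawls."""

    def __init__(self, window=1000, threshold=0.95, min_samples=100):
        self.threshold = threshold
        self.min_samples = min_samples
        # Each domain keeps only its most recent `window` outcomes.
        self.results = defaultdict(lambda: deque(maxlen=window))

    def record(self, domain, succeeded):
        self.results[domain].append(bool(succeeded))

    def success_rate(self, domain):
        samples = self.results[domain]
        return sum(samples) / len(samples) if samples else None

    def needs_attention(self, domain):
        """True when enough samples exist and the rate is below threshold."""
        samples = self.results[domain]
        if len(samples) < self.min_samples:
            return False  # too little data to judge
        return self.success_rate(domain) < self.threshold


monitor = ExtractionMonitor()
monitor.record("amazon.com", True)   # extract_price returned a price
monitor.record("amazon.com", False)  # extract_price returned None
```

The `min_samples` guard keeps a handful of failures on a rarely crawled domain from paging anyone.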
Deep Dive 2: Chrome Extension Architecture
The browser extension adds a real-time layer on top of the backend crawling infrastructure.
Extension components:
Content Script (runs on product pages):
- Detects product page (URL pattern matching)
- Injects price history graph overlay
- "Set Alert" button with target price input
- Shows "Price dropped since you last viewed!"
Background Service Worker:
- Authenticates with backend API
- Caches product data locally (IndexedDB)
- Listens for push notifications (via FCM)
Popup:
- Shows active alerts and recent price drops
- Quick settings (notification preferences)
Content script injection: When the user navigates to an Amazon product page, the content script detects this via URL pattern (/dp/, /gp/product/) and extracts the ASIN (Amazon Standard Identification Number). It then queries the backend for price history and renders an overlay.
// content_script.js
const AMAZON_PRODUCT_PATTERN = /\/(dp|gp\/product)\/([A-Z0-9]{10})/;

function onPageLoad() {
  const match = window.location.href.match(AMAZON_PRODUCT_PATTERN);
  if (!match) return;
  const asin = match[2];
  // Ask the background service worker for this product's history
  chrome.runtime.sendMessage(
    { type: 'GET_PRICE_HISTORY', asin },
    (response) => {
      // response is undefined if the background worker did not reply
      if (response && response.history) {
        renderPriceChart(response.history);
        renderAlertButton(asin, response.currentPrice);
      }
    }
  );
}
API for extension: The backend exposes lightweight endpoints for the extension:
GET /api/product/{asin}/price-history → returns price data points
POST /api/alerts → create alert
GET /api/alerts/active → user's active alerts
GET /api/product/{asin}/current-price → latest cached price
The current-price endpoint returns the cached price (not a live crawl). If the product was not already being tracked, the request adds it to the crawl schedule. This way, the first extension view of an untracked product shows "Price data not yet available. Tracking started." Subsequent views show data.
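The endpoint's behavior can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding: `price_cache` is a dict, `tracked` is a set of products already scheduled, and `crawl_queue` is a list standing in for the real crawl queue.

```python
def get_current_price(asin, price_cache, tracked, crawl_queue):
    """Serve the cached price; start tracking the product if it is new."""
    if asin not in tracked:
        # First sighting: schedule a crawl, but do not block on it.
        tracked.add(asin)
        crawl_queue.append(asin)
    price = price_cache.get(asin)
    if price is None:
        return {"status": "pending", "price": None}
    return {"status": "ok", "price": price}


cache, tracked, queue = {}, set(), []
print(get_current_price("B000000001", cache, tracked, queue)["status"])  # pending
```

The `tracked` set keeps repeated views of the same untracked product from enqueuing duplicate crawl tasks.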
Deep Dive 3: Price History Storage and Visualization
Users want to see "Is this actually a good deal?" A price history graph that shows the past 12 months of price movement answers this question instantly.
Storage optimization:
The price_changes table stores only price change events, not periodic snapshots. To render a price graph, reconstruct the timeline:
from datetime import timedelta

def get_price_history(product_id, start_date, end_date):
    changes = db.query("""
        SELECT new_price, changed_at FROM price_changes
        WHERE product_id = %s
        AND changed_at BETWEEN %s AND %s
        ORDER BY changed_at
    """, (product_id, start_date, end_date))
    # Reconstruct timeline: price is constant between changes, so emit a
    # point at each change and another just before the next change
    # (this produces a step chart rather than sloped lines)
    timeline = []
    for i, change in enumerate(changes):
        timeline.append((change.changed_at, change.new_price))
        if i + 1 < len(changes):
            # Price holds until next change
            next_change = changes[i + 1].changed_at
            timeline.append((next_change - timedelta(seconds=1), change.new_price))
        else:
            # Last known price holds through the end of the window
            timeline.append((end_date, change.new_price))
    return timeline
Caching price history: Price history does not change often (only when a new price change is recorded). Cache the rendered timeline for each product with a TTL equal to the product's crawl interval. A Tier 1 product's history cache expires every hour; a Tier 3 product's cache lasts 24 hours.
Analytics from price history:
All-time low: $42 (2 months ago)
Average price (90 days): $67
Current price: $55
Recommendation: "Good deal — 18% below 90-day average"
This contextual information is far more valuable than a raw alert. Users trust the alert more when they can see the historical context.
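Because only changes are stored, a 90-day average should weight each price by how long it was in effect. The sketch below shows one way to compute that along with the "percent below average" figure. The function names are hypothetical, floats are used for brevity where the schema stores DECIMAL, and the first change event must be at or before the window start so the opening price is known.

```python
from datetime import datetime, timedelta


def time_weighted_average(changes, start, end):
    """Average price over [start, end], weighted by how long each price held.

    `changes` is a list of (changed_at, new_price) sorted by time.
    """
    total = 0.0
    for i, (changed_at, price) in enumerate(changes):
        seg_start = max(changed_at, start)
        seg_end = changes[i + 1][0] if i + 1 < len(changes) else end
        seg_end = min(seg_end, end)
        if seg_end > seg_start:
            total += price * (seg_end - seg_start).total_seconds()
    return total / (end - start).total_seconds()


def percent_below(current_price, average_price):
    return (average_price - current_price) / average_price * 100


start = datetime(2024, 1, 1)
end = start + timedelta(days=90)
changes = [(start, 70.0), (start + timedelta(days=45), 64.0)]
avg = time_weighted_average(changes, start, end)
print(f"{percent_below(55.0, avg):.0f}% below 90-day average")  # 18% below 90-day average
```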
Alternative Designs
Alternative 1: Real-Time Streaming from Retailer APIs
Some retailers offer webhooks or real-time price feeds (Best Buy Open API, Walmart Affiliate API). Subscribe to price change events instead of polling.
Alternative 2: User-Triggered Crawl Only
Do not crawl proactively at all. When a user opens a product page (via extension or website), crawl the price at that moment. Cache results for 15 minutes. Users who want alerts get prices checked only when some user views the product.
Alternative 3: Crowdsourced Price Data
The browser extension reports prices to the backend whenever any user visits a product page. With 1M extension users viewing 100 products/day, you get 100M price data points/day -- more than you could ever crawl yourself.
| Aspect | Adaptive Crawling | Real-Time API | User-Triggered Only | Crowdsourced |
|---|---|---|---|---|
| Freshness | 1h-24h (tiered) | Real-time (seconds) | Only when viewed | Depends on user traffic |
| Coverage | 50M+ products | API-supported only | Only popular products | User-visited only |
| Cost | Moderate (infra + proxy) | Low (API is free/cheap) | Very low | Very low |
| Reliability | High (you control crawl) | Depends on retailer API | Low (gaps in coverage) | Medium (uneven coverage) |
| Anti-bot risk | High | None (legitimate API) | Low (low volume) | None (real users) |
| Alert accuracy | High | Very high | Low (stale between views) | Medium |
The right answer is a combination: adaptive crawling as the backbone, retailer APIs where available (Best Buy, Walmart), and crowdsourced data from the extension to fill gaps and validate crawled prices. User-triggered crawling serves as a supplement for products that are not yet in the crawl queue.
Scaling Math Verification
Crawl throughput (885 crawls/sec):
- Average crawl time: 300ms (DNS + fetch + parse)
- Workers needed: 885 * 0.3 = 266 concurrent workers
- With 10 crawler machines, each running 30 threads: 300 concurrent workers. Fine.
- Bandwidth: 885 * 100 KB = 88.5 MB/sec = 708 Mbps across all crawlers.
Alert evaluation (662 checks/sec):
- Database query: SELECT * FROM alerts WHERE product_id = ? AND target_price >= ? AND is_active = TRUE
- With index on (product_id, target_price), each query: < 1ms
- 662 queries/sec is trivial for Postgres.
Notification volume (33/sec):
- 33 emails + push/sec is trivial. A managed email service such as Amazon SES can be provisioned for far higher sending rates.
- Even at 10x (330/sec during Black Friday), well within capacity.
Storage (1 year):
- Products table: 50M rows * 200 bytes = 10 GB
- Price changes: 14.3M/day * 365 * 50 bytes = 261 GB
- Alerts: 200M rows * 100 bytes = 20 GB
- Total: ~291 GB. Fits on a single database server.
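The capacity figures above follow from the stated inputs. The sketch below recomputes them: worker count is throughput times latency (Little's law), and bandwidth assumes the 100 KB average page size used in the text. Variable names are illustrative.

```python
import math

crawls_per_sec = 885
avg_crawl_seconds = 0.3
page_size_kb = 100

# Little's law: requests in flight = arrival rate * time per request
workers_needed = math.ceil(crawls_per_sec * avg_crawl_seconds)
# KB/sec -> kilobits/sec (x8) -> Mbps (/1000)
bandwidth_mbps = crawls_per_sec * page_size_kb * 8 / 1000

storage_gb = {
    "products": 50_000_000 * 200 / 1e9,
    "price_changes": 14_300_000 * 365 * 50 / 1e9,
    "alerts": 200_000_000 * 100 / 1e9,
}

print(f"workers needed: {workers_needed}")
print(f"bandwidth:      {bandwidth_mbps:.0f} Mbps")
print(f"storage total:  {sum(storage_gb.values()):.0f} GB")
```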
Failure Analysis
| Component | Current capacity | At 10x (500M products) | Breaks? | Fix |
|---|---|---|---|---|
| Crawler fleet (10) | 885 crawls/sec | 8,850 crawls/sec | Yes | Scale to 100 crawler machines |
| Products database | 50M rows, 10 GB | 500M rows, 100 GB | No | Single Postgres handles this |
| Price changes table | 5.2B rows/year, 261 GB | 52B rows/year, 2.6 TB | Yes | Partition by month, archive old data to S3 |
| Alerts table | 200M rows | 2B rows | Yes | Shard by product_id (alert eval is per-product) |
| Alert evaluation | 662 queries/sec | 6,620 queries/sec | Maybe | Index handles it, but may need read replica |
| Proxy/anti-bot costs | Moderate | 10x higher | Yes | Negotiate with proxy providers, use APIs more |
| Notification volume | 33/sec | 330/sec | No | -- |
The first things to break at 10x are the crawler fleet and the price changes table. Scaling crawlers is linear (add machines). The price changes table at 2.6 TB/year needs partitioning by month with automatic archival of partitions older than 12 months to cold storage (S3 + Parquet for analytics queries).
The more interesting bottleneck at 10x is anti-bot defenses. At 8,850 crawls/sec across e-commerce sites, you are generating significant traffic that bot detection systems will flag. You need a larger pool of rotating proxies, more sophisticated request patterns, and a stronger investment in legitimate APIs.
What's Expected at Each Level
| Aspect | Mid-Level | Senior | Staff+ |
|---|---|---|---|
| Crawl strategy | "Check prices every hour" | Adaptive frequency based on product importance | Tiered crawling with dynamic promotion/demotion, budget per domain |
| Storage | Store every price check | Store only changes (mentions write reduction) | Quantifies 80-90% reduction, batched timestamp updates |
| Notification dedup | Not mentioned | "Don't re-notify for same price" | last_notified_price field, only notify on new lows |
| API rate limits | Not mentioned | Mentions Amazon PA-API limits | PA-API budget allocation, hybrid API + scraping strategy |
| Alert evaluation | Full table scan | Index on (product_id, target_price) | Batch evaluation from Kafka, only evaluate on price drops |
| Extension architecture | Not mentioned | Mentions browser extension | Content script injection, background service worker, crowdsourced data |
| Anti-bot handling | Not mentioned | User-agent rotation, proxies | robots.txt compliance, session management, extraction monitoring |
| Price analytics | Show current price | Price history graph | "Good deal" indicators, all-time low, average comparison |
The single most important signal at any level: do you understand that the system's efficiency depends on not crawling most products most of the time? The adaptive crawl frequency is the core insight. A system that crawls everything hourly burns 16x the crawl budget and still cannot achieve the same freshness for popular products because it is wasting bandwidth on long-tail products that change once a month.