My System Design Interview Experience
Overview
This document details a candidate's experience during a system design interview. The interview question involved designing a system that scrolls through infinite-scroll websites, extracts the raw data (HTML, CSS, JS, images), and stores it. The candidate reflects on their approach, the interviewer's feedback (or lack thereof), and key takeaways for future interviews.
Interview Rounds
The interview began with clarifying questions to define the scope of the problem, including whether the design should focus on end-to-end architecture or frontend flow, the identification of end-users, and the need for a UI or a purely backend system.
Initial Design
The candidate proposed an initial flow:
Client → API Gateway → Backend Services (Web Content Service + others)
When prompted about automating the process without manual intervention, the candidate introduced a message queue for job distribution.
Discussion & Evolution
During the interview, the candidate's design was discussed and refined. The interviewer's expectations leaned towards a more autonomous, backend-focused system, triggering a shift in the candidate's approach.
In the refined design, a scheduler triggers the system and enqueues jobs for worker nodes running headless browsers. Each job extracts data and streams its results to storage asynchronously. This approach aligned more closely with the interviewer's distributed systems background.
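The scheduler, queue, and worker flow described above can be sketched in a few lines. This is a minimal single-process illustration, not the design presented in the interview: it assumes Python's standard `queue` and `threading` modules in place of a real message broker, and `Job`, `fake_extract`, `storage_writer`, and `run_pipeline` are hypothetical names. The `fake_extract` function stands in for the headless-browser step, and a dict stands in for storage.

```python
import queue
import threading
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Job:
    url: str


def fake_extract(url: str) -> str:
    """Stand-in for the headless-browser step; returns placeholder HTML."""
    return f"<html><!-- content of {url} --></html>"


def worker(jobs: queue.Queue, results: queue.Queue) -> None:
    """Pull jobs until a None sentinel arrives, pushing results downstream."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work for this worker
            jobs.task_done()
            break
        results.put((job.url, fake_extract(job.url)))
        jobs.task_done()


def storage_writer(results: queue.Queue, store: Dict[str, str]) -> None:
    """Drain results into a dict that stands in for S3 or a database."""
    while True:
        item = results.get()
        if item is None:  # sentinel: all workers have finished
            break
        url, html = item
        store[url] = html


def run_pipeline(urls: Iterable[str], num_workers: int = 3) -> Dict[str, str]:
    jobs: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    store: Dict[str, str] = {}

    writer = threading.Thread(target=storage_writer, args=(results, store))
    writer.start()
    workers = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for t in workers:
        t.start()

    # The "scheduler": enqueue one job per URL, then one sentinel per worker.
    for url in urls:
        jobs.put(Job(url))
    for _ in workers:
        jobs.put(None)

    for t in workers:
        t.join()
    results.put(None)  # workers are done, so no further results will arrive
    writer.join()
    return store
```

Because workers only talk to the two queues, the extraction step and the storage step can be scaled or replaced independently, which is the decoupling the refined design relies on.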
High-Level Design Components:
Input: List of target URLs
↓
Orchestrator / Scheduler
↓
Workers (Headless Browsers - Puppeteer / Playwright)
↓
Scrolling & Extraction Logic
↓
Data Processing / Queue
↓
Storage (S3, GCS, or Database)
↓
Monitoring & Logging
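One way the "Scrolling & Extraction Logic" stage might work is to keep scrolling until the page height stops growing. The sketch below is an assumption about that stage, not something stated in the interview. `ScrollablePage` is a hypothetical interface, not a real Puppeteer or Playwright API; a real worker would implement it on top of the browser library and would also wait for network activity between scrolls. `FakePage` exists only so the loop can be exercised without a browser.

```python
from typing import Protocol


class ScrollablePage(Protocol):
    """Hypothetical minimal page interface (not a real browser-library API)."""

    def scroll_to_bottom(self) -> None: ...
    def content_height(self) -> int: ...
    def html(self) -> str: ...


def scroll_and_extract(
    page: ScrollablePage, max_scrolls: int = 50, patience: int = 2
) -> str:
    """Scroll until the page height stops growing, then return the HTML.

    Stops after `patience` consecutive scrolls with no growth, or after
    `max_scrolls` scrolls in total, whichever comes first.
    """
    last_height = page.content_height()
    stalled = 0
    for _ in range(max_scrolls):
        page.scroll_to_bottom()
        height = page.content_height()
        if height > last_height:
            last_height = height
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return page.html()


class FakePage:
    """In-memory page that loads a fixed number of extra batches."""

    def __init__(self, batches: int) -> None:
        self.remaining = batches
        self.items = 1
        self.scrolls = 0

    def scroll_to_bottom(self) -> None:
        self.scrolls += 1
        if self.remaining > 0:
            self.remaining -= 1
            self.items += 1

    def content_height(self) -> int:
        return self.items * 100

    def html(self) -> str:
        return "<ul>" + "<li>item</li>" * self.items + "</ul>"
```

The `max_scrolls` cap matters because a truly endless feed never stops growing; without it a single job could occupy a worker indefinitely.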
Component Details:
- Orchestrator: Coordinates scraping jobs, handles retries & rate limiting.
- Workers: Use Puppeteer/Playwright to scroll & extract HTML.
- Queue: Decouples ingestion and processing (Kafka / RabbitMQ).
- Storage: Raw HTML → S3; Metadata → DB.
- Monitoring: Centralized logs & retry tracking.
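The Orchestrator's retry and rate-limiting duties could be sketched as follows. This is an illustration under assumptions the write-up does not specify: a minimum interval per domain as the rate-limiting policy and exponential backoff as the retry policy. `DomainRateLimiter` and `run_with_retries` are hypothetical names, and the clock and sleep functions are injectable so the logic can be tested without real delays.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforces a minimum interval between requests to the same domain."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_request: Dict[str, float] = {}

    def wait(self, url: str) -> None:
        """Block until the domain of `url` may be requested again."""
        domain = urlparse(url).netloc
        last = self.last_request.get(domain)
        if last is not None:
            remaining = self.min_interval - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request[domain] = self.clock()


def run_with_retries(
    fetch: Callable[[str], str],
    url: str,
    limiter: DomainRateLimiter,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        limiter.wait(url)
        try:
            return fetch(url)
        except Exception:  # a real orchestrator would catch narrower errors
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Counting attempts here also gives the Monitoring component something concrete to log, since each retry is a single, identifiable event.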
Scaling and Performance:
- Horizontally scale workers (Kubernetes / ECS).
- Reuse browser sessions, limit tabs, cache pages.
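The "limit tabs" and "cache pages" ideas can be illustrated with a bounded-concurrency crawl. This sketch assumes `asyncio` from the standard library; `crawl` and its `fetch` parameter are hypothetical, with `fetch` standing in for whatever coroutine opens a tab and returns the page HTML. The cache is a plain in-memory dict, so it only helps within one process.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, List, Optional


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[str]],
    max_tabs: int = 4,
    cache: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Fetch all URLs with at most `max_tabs` fetches in flight at once."""
    cache = {} if cache is None else cache
    tabs = asyncio.Semaphore(max_tabs)

    async def fetch_one(url: str) -> str:
        if url in cache:  # cache hit: skip the browser entirely
            return cache[url]
        async with tabs:  # waits here while max_tabs fetches are running
            html = await fetch(url)
        # Note: duplicate URLs within one batch may both miss the cache.
        cache[url] = html
        return html

    return await asyncio.gather(*(fetch_one(u) for u in urls))
```

Capping concurrency per worker keeps memory use predictable, which in turn makes it easier to size containers when scaling workers horizontally.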
Tradeoffs:
| Aspect      | Option A               | Option B                 |
| ----------- | ---------------------- | ------------------------ |
| Scalability | More worker containers | Centralized browser pool |
| Storage     | Raw HTML               | Parsed content           |
| Network     | Cache                  | Re-fetch every time      |
Original Source
This experience was originally published on Medium. Support the author by visiting the original post.