My System Design Interview Experience

Overview

This document details a candidate's experience during a system design interview. The interview question involved designing a system that scrolls through infinitely scrolling web pages, extracts the raw data (HTML, CSS, JS, images), and stores it. The candidate reflects on their approach, the interviewer's feedback (or lack thereof), and key takeaways for future interviews.

Interview Rounds

The interview began with clarifying questions to scope the problem: whether the design should cover the end-to-end architecture or only the frontend flow, who the end users were, and whether the system needed a UI or was purely backend.

Initial Design

The candidate proposed an initial flow:

Client → API Gateway → Backend Services (Web Content Service + others)

When prompted about automating the process without manual intervention, the candidate introduced a message queue for job distribution.
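As a rough illustration of job distribution through a queue, the sketch below uses Python's in-process `queue.Queue` and threads as a stand-in for a real broker. The names `fetch_page` and `run_jobs` are hypothetical and were not part of the interview; `fetch_page` returns fake HTML so the example runs offline.

```python
import queue
import threading


def fetch_page(url: str) -> str:
    """Hypothetical placeholder for the real scrape; returns fake HTML."""
    return f"<html><!-- content of {url} --></html>"


def run_jobs(urls, num_workers=3):
    """Distribute URL jobs across worker threads through a shared queue."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = jobs.get()
            if url is None:  # sentinel: no more work for this worker
                jobs.task_done()
                return
            try:
                html = fetch_page(url)
                with lock:  # protect the shared results dict
                    results[url] = html
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()  # block until every queued item has been processed
    for t in threads:
        t.join()
    return results
```

No producer needs to know which worker picks up a job, which is the decoupling the queue was introduced for. A production system would replace the in-memory queue with a durable broker.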

Discussion & Evolution

During the interview, the candidate's design was discussed and refined. The interviewer's expectations leaned towards a more autonomous, backend-focused system, which prompted a shift in the candidate's approach.

In the refined design, a scheduler triggers the system and queues jobs for worker nodes running headless browsers. Each job extracts data and streams the results to storage asynchronously. This refined approach aligned more closely with the interviewer's distributed systems background.
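The scheduler-to-workers-to-storage flow described above can be sketched as a two-stage `asyncio` pipeline. This is a minimal illustration under stated assumptions, not the design drawn in the interview: `extract` is a hypothetical placeholder for a headless-browser job, and a plain dict stands in for the storage layer.

```python
import asyncio


async def extract(url):
    """Hypothetical placeholder for a headless-browser job; returns fake content."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return {"url": url, "html": f"<html>{url}</html>"}


async def worker(jobs, results):
    """Pull URLs from the job queue and push extracted records downstream."""
    while True:
        url = await jobs.get()
        try:
            await results.put(await extract(url))
        finally:
            jobs.task_done()


async def storage_writer(results, store):
    """Drain the results queue; the dict write stands in for an S3 put or DB insert."""
    while True:
        record = await results.get()
        store[record["url"]] = record["html"]
        results.task_done()


async def run_pipeline(urls, num_workers=2):
    jobs, results, store = asyncio.Queue(), asyncio.Queue(), {}
    tasks = [asyncio.create_task(worker(jobs, results)) for _ in range(num_workers)]
    tasks.append(asyncio.create_task(storage_writer(results, store)))
    for url in urls:  # the "scheduler" step: enqueue one job per URL
        jobs.put_nowait(url)
    await jobs.join()     # every job has been extracted
    await results.join()  # every result has been written
    for t in tasks:
        t.cancel()        # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return store
```

Calling `asyncio.run(run_pipeline([...]))` returns the populated store. Because extraction and storage only meet at the results queue, a slow storage write does not block a worker from starting its next page.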

High-Level Design Components:

Input: List of target URLs
↓
Orchestrator / Scheduler
↓
Workers (Headless Browsers - Puppeteer / Playwright)
↓
Scrolling & Extraction Logic
↓
Data Processing / Queue
↓
Storage (S3, GCS, or Database)
↓
Monitoring & Logging
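The "Scrolling & Extraction Logic" step in the flow above could look roughly like the loop below: keep scrolling until the page height stops growing or a scroll budget runs out. The `ScrollablePage` interface and `FakeFeedPage` are assumptions made so the sketch runs without a browser. With Puppeteer or Playwright, the same three methods would typically wrap in-page JavaScript such as `window.scrollTo(0, document.body.scrollHeight)` and a read of `document.body.scrollHeight`, plus a wait for new content to load between scrolls.

```python
from typing import Protocol


class ScrollablePage(Protocol):
    """Hypothetical minimal interface over a headless-browser page."""

    def scroll_to_bottom(self) -> None: ...
    def content_height(self) -> int: ...
    def html(self) -> str: ...


def scroll_until_stable(page, max_scrolls=50, stable_rounds=2):
    """Scroll until the height is unchanged for `stable_rounds` scrolls in a row,
    or until `max_scrolls` is reached, then return the page HTML."""
    last_height = page.content_height()
    unchanged = 0
    for _ in range(max_scrolls):
        page.scroll_to_bottom()
        height = page.content_height()
        if height == last_height:
            unchanged += 1
            if unchanged >= stable_rounds:
                break  # nothing new loaded: treat the feed as exhausted
        else:
            unchanged = 0
            last_height = height
    return page.html()


class FakeFeedPage:
    """In-memory page that loads one more item per scroll until it runs out."""

    def __init__(self, batches):
        self.remaining = batches
        self.items = 1
        self.scrolls = 0

    def scroll_to_bottom(self):
        self.scrolls += 1
        if self.remaining > 0:
            self.remaining -= 1
            self.items += 1

    def content_height(self):
        return self.items * 100

    def html(self):
        return "<ul>" + "<li></li>" * self.items + "</ul>"
```

The `max_scrolls` cap matters because a truly endless feed never stabilizes. Without a budget a single job could hold a worker indefinitely.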

Component Details:

Orchestrator: Coordinates scraping jobs, handles retries & rate limiting.
Workers: Use Puppeteer/Playwright to scroll & extract HTML.
Queue: Decouples ingestion and processing (Kafka / RabbitMQ).
Storage: Raw HTML → S3; Metadata → DB.
Monitoring: Centralized logs & retry tracking.
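The orchestrator's "retries & rate limiting" responsibility can be illustrated with two small helpers. Both are hypothetical sketches rather than anything specified in the interview: `with_retries` applies exponential backoff, and `MinIntervalLimiter` enforces a minimum gap between requests to the same domain. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


def with_retries(job, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run job(); on failure retry with exponential backoff.

    Delays are base_delay, 2*base_delay, 4*base_delay, ...
    The last error is re-raised once attempts are exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


class MinIntervalLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # domain -> earliest time of the next request

    def wait(self, domain):
        """Block until `domain` may be hit again; return the delay applied."""
        now = self._clock()
        ready_at = self._next_allowed.get(domain, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed[domain] = max(now, ready_at) + self._min_interval
        return delay
```

A worker would call `limiter.wait(domain)` before each page load and wrap the load itself in `with_retries`. This limiter is single-process only, so a distributed deployment would need shared state for the per-domain timestamps.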

Scaling and Performance:

Horizontally scale workers (Kubernetes / ECS).
Reuse browser sessions, limit tabs, cache pages.
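The "limit tabs, cache pages" idea can be sketched with an `asyncio.Semaphore` that caps concurrent page loads and a dict that serves repeat requests. `TabLimitedFetcher` and `demo` are hypothetical names, and the injected `fetch` coroutine stands in for opening a tab in a shared browser session.

```python
import asyncio


class TabLimitedFetcher:
    """Cap concurrent page loads and cache completed ones (illustrative only)."""

    def __init__(self, fetch, max_tabs=4):
        self._fetch = fetch  # coroutine function: url -> html
        self._tabs = asyncio.Semaphore(max_tabs)
        self._cache = {}
        self.open_tabs = 0
        self.peak_tabs = 0

    async def get(self, url):
        if url in self._cache:  # cache hit: no tab needed
            return self._cache[url]
        async with self._tabs:  # waits here while max_tabs loads are in flight
            self.open_tabs += 1
            self.peak_tabs = max(self.peak_tabs, self.open_tabs)
            try:
                html = await self._fetch(url)
            finally:
                self.open_tabs -= 1
        self._cache[url] = html
        return html


async def demo():
    """Fetch six URLs with at most two tabs, then re-request one of them."""
    fetch_calls = []

    async def fake_fetch(url):
        fetch_calls.append(url)
        await asyncio.sleep(0.01)  # simulate page load time
        return f"<html>{url}</html>"

    fetcher = TabLimitedFetcher(fake_fetch, max_tabs=2)
    urls = [f"https://example.com/{i}" for i in range(6)]
    await asyncio.gather(*(fetcher.get(u) for u in urls))
    await fetcher.get(urls[0])  # served from the cache
    return fetcher.peak_tabs, len(fetch_calls)
```

Running `asyncio.run(demo())` returns the peak number of simultaneous loads and the number of real fetches. This sketch does not deduplicate concurrent requests for the same URL that arrive before the first one finishes, and a real cache would also need an expiry policy.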

Tradeoffs:

| Aspect      | Option A               | Option B                 |
| ----------- | ---------------------- | ------------------------ |
| Scalability | More worker containers | Centralized browser pool |
| Storage     | Raw HTML               | Parsed content           |
| Network     | Cache                  | Re-fetch every time      |

System Summary:

The proposed system utilizes an orchestrator to manage a scalable pool of headless browsers that scroll, extract content, and push it asynchronously to storage.

Final Whiteboard Diagram:

                ┌───────────────────┐
                │ URL Scheduler     │
                └───────┬───────────┘
                        │
                ┌───────▼───────────┐
                │ Worker Pool       │
                │ (Puppeteer/Play)  │
                └───────┬───────────┘
                        │ Extracts content
                        ▼
                ┌───────────────────┐
                │ Message Queue     │
                └───────┬───────────┘
                        │
                ┌───────▼───────────┐
                │ Storage Layer     │
                │ (S3, DB, Logs)    │
                └───────────────────┘

Key Takeaways

The candidate identified several things that went well:

Clarifying questions: Demonstrated systems thinking.
Modular design: Showed clear separation of concerns.
Adaptability: Introduced message queues when prompted.
Full-stack awareness: Considered frontend, backend, and infra links.

Divergence in Expectations:

The candidate noted a difference in focus, stemming from the interviewer's backend/distributed systems background versus the intended frontend role.

| Area        | Candidate's Focus    | Interviewer's Expectation     |
| ----------- | -------------------- | ----------------------------- |
| Entry Point | UI-triggered flow    | Autonomous system             |
| Automation  | Added queue later    | Expected scheduler from start |
| Focus       | Architecture clarity | Distributed scalability       |
| Terminology | Frontend/system mix  | Backend/infrastructure terms  |

Had the candidate opened with a scheduler-driven, automated system, the design would have addressed the interviewer's concerns more effectively.

Although the candidate did not move forward, the experience provided valuable learning and reinforced the importance of understanding the interviewer's perspective and tailoring the response accordingly.