What it was
A web scraping pipeline that pulled from 10 sources (HackerNews, Dev.to, IndieHackers, GitHub Issues, Stack Overflow, Product Hunt, YC RFS, Failory, Medium, and Quora, later disabled), ran relevance scoring through Claude, extracted pain points, and clustered them by theme. The goal: surface real founder problems with citations, faster than manual Reddit research.
Why it was killed
After 20 hours of building, the tool produced 9 pain points in 68 seconds at $0.50 per search. Asking Claude directly produces 50+ pain points in 3 seconds at $0.01. The comparison wasn't close.
The fundamental problem: we were building a complex pipeline to extract worse data than Claude already has in its training. Claude has seen millions of Reddit threads, founder blogs, and startup discussions. No amount of web scraping optimization could compete with that.
Three days of aggressive optimization—keyword expansion, volume increases, new data sources—moved us from 8 pain points to 9. Meanwhile, half the sources were failing: Medium hit rate limits (429 errors), Quora's React app couldn't be scraped with regex, and Reddit's API policies made it off-limits. We were fighting infrastructure instead of solving problems.
What was learned
- Validate the competitive premise first. The question "Can web scraping beat Claude for founder problems?" should have been asked on Day 1, not discovered on Day 3.
- When your complex solution is worse than "just ask ChatGPT," you're building the wrong thing. Pivot or stop—don't optimize further.
- Modern web apps can't be scraped with regex. Quora, Medium, and most 2026-era sites use React/SSR. Scraping them requires headless browsers, which are slow, expensive, and fragile.
- Optimize the bottleneck, not the source. Spent 4 hours increasing fetch volume when the real problem was keyword matching filtering out 94% of posts.
- AI-generated content must match how it's consumed. Generated "problem phrases" like "can't find product market fit" for substring matching. Posts don't contain exact long phrases—needed short topic keywords instead.
- Monday planning gaps cost days, not hours. Missing competitive analysis, TypeScript interface-first guidelines, and multi-file update checklists caused 90+ minutes of preventable friction.
Build context
Week 1. Feb 9–13, 2026. 20 hours invested. ROI: negative—but the learnings shaped how Week 2 will run.
The pipeline
- Keyword expansion — User enters a topic. Claude generates 10-12 related keywords.
- Multi-source fetch — Hit 10 APIs/feeds in parallel: HackerNews, Dev.to, IndieHackers, GitHub Issues, Stack Overflow, Product Hunt, YC RFS, Failory, Medium, (Quora disabled).
- Relevance scoring — Claude scores each post 0-100 for problem signal relevance. Posts below threshold filtered out.
- Pain extraction — Claude extracts specific pain points from high-relevance posts with supporting quotes.
- Clustering — Group similar pain points into themes. Calculate composite scores.
- Display — Ranked list with expandable cards, source citations, and score breakdowns.
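The six stages can be sketched as a typed orchestration function. This is a hedged reconstruction, not the project's 1,103-line implementation: the stage types, `runPipeline`, and the threshold constant are all hypothetical names, with the threshold value taken from the friction log's "threshold 40" figure.

```typescript
// Illustrative pipeline sketch. All names and types are hypothetical.
interface Post { source: string; title: string; body: string }
interface ScoredPost extends Post { relevance: number }
interface PainPoint { text: string; quote: string; source: string }
interface Theme { label: string; points: PainPoint[]; score: number }

// Each stage is a pluggable async function, so Claude-backed stages can be
// swapped for stubs in tests.
type Expand = (topic: string) => Promise<string[]>;    // Claude keyword expansion
type Fetch = (keywords: string[]) => Promise<Post[]>;  // parallel multi-source fetch
type Score = (posts: Post[]) => Promise<ScoredPost[]>; // Claude 0-100 relevance
type Extract = (posts: ScoredPost[]) => Promise<PainPoint[]>;
type Cluster = (points: PainPoint[]) => Theme[];

const RELEVANCE_THRESHOLD = 40; // posts below this are dropped

async function runPipeline(
  topic: string,
  stages: { expand: Expand; fetch: Fetch; score: Score; extract: Extract; cluster: Cluster },
): Promise<Theme[]> {
  const keywords = await stages.expand(topic);
  const posts = await stages.fetch(keywords);
  const scored = await stages.score(posts);
  const relevant = scored.filter((p) => p.relevance >= RELEVANCE_THRESHOLD);
  const points = await stages.extract(relevant);
  return stages.cluster(points);
}
```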
What worked
- TypeScript integration was clean and type-safe (after fixing 15+ initial type errors)
- JSON parsing became robust after handling all Claude output variations
- Clustering algorithm grouped pain points effectively
- Progress indicators showed actual pipeline steps, not fake timers
What didn't work
- Wrong data sources — HN/Dev.to are developers discussing code bugs, not founders discussing business problems
- Keyword matching bottleneck — Substring matching filtered out 94% of posts
- Rate limits — Medium returned 429 errors when hitting 6 tags in parallel
- Modern SPAs — Quora's React app couldn't be scraped without headless browser
- Fundamental ceiling — Can't beat Claude's training data with real-time scraping
Stack
Next.js 15, TypeScript, Claude 3.5 Sonnet API, Vercel deployment. 1,103 lines of code across 10 data source integrations.
Friction log
The complete friction log from Week 1. Everything that went wrong, got confusing, or surprised me during the build. Organized by day, then by issue.
20 hours · 11 issues · ~10 hours lost · 6 learnings
Morning — 2 hours lost
Missing competitive analysis. Built collapsible cards and progress indicators (nice-to-have) for 6 hours before the real requirement finally surfaced: "Beat a Claude prompt." Should have been building the AI synthesis layer (must-have) from the start.
Severity: High
Afternoon — 45 min
TypeScript type errors cascade. Used any types during rapid prototyping. Got 15+ type errors across GitHub, Stack Overflow, Dev.to, IndieHackers, HackerNews integrations. Should have defined interfaces BEFORE implementation.
Severity: Medium
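What "interfaces BEFORE implementation" could look like, as a hedged sketch: `NormalizedPost`, `SourceFetcher`, and the HackerNews adapter are hypothetical names, and the HN field names assume the Algolia-style search response rather than the project's actual types.

```typescript
// Shared shape defined up front, before any per-source fetcher exists.
// Names are illustrative, not the project's actual code.
interface NormalizedPost {
  source: "hackernews" | "devto" | "github" | "stackoverflow" | "indiehackers";
  title: string;
  body: string;
  url: string;
  createdAt: string; // ISO 8601
}

// Every fetcher must satisfy the same signature, so the compiler flags a
// missing field at the source instead of letting `any` errors cascade
// through six files later.
type SourceFetcher = (keyword: string) => Promise<NormalizedPost[]>;

// Example adapter mapping an Algolia-style HN hit into the shared shape.
interface HNHit { title: string; story_text?: string; url?: string; created_at: string; objectID: string }

const fromHN = (hit: HNHit): NormalizedPost => ({
  source: "hackernews",
  title: hit.title,
  body: hit.story_text ?? "",
  url: hit.url ?? `https://news.ycombinator.com/item?id=${hit.objectID}`,
  createdAt: hit.created_at,
});
```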
Afternoon — 20 min
Multi-file update chaos. Adding data sources required updating 6 files (types.ts, API route, PainPointCard, page.tsx, sources.ts, validation array). Kept discovering new places that needed changes. Needed a checklist.
Severity: Medium
Evening — 20 min
Keyword expansion made results WORSE. Divided post limit across keywords (30 ÷ 6 = 5 per keyword) instead of multiplying. Got 4 pain points instead of 50. Math should have multiplied coverage, not divided it.
Severity: High
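An illustrative reconstruction of the budgeting bug; function names are hypothetical, but the limit of 30 and the six keywords mirror the numbers in the entry above.

```typescript
const POSTS_PER_KEYWORD = 30;

// Buggy: one fixed total divided across keywords. 30 / 6 = 5 posts each,
// so adding keywords *shrank* per-keyword coverage.
function buggyBudget(keywords: string[], totalLimit = 30): number {
  return Math.floor(totalLimit / keywords.length);
}

// Fixed: every keyword keeps its own budget, so total coverage multiplies
// (6 keywords x 30 posts = 180 posts fetched).
function fixedBudget(): number {
  return POSTS_PER_KEYWORD;
}

// Hypothetical keyword set, sized to match the log's "30 ÷ 6" example.
const kws = ["churn", "pricing", "onboarding", "retention", "growth", "funding"];
```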
Evening — 15 min
Overly specific keywords. Generated multi-word phrases ("can't find product market fit") that never match in substring searches. People don't type exact long phrases. Needed short topic keywords (1-2 words).
Severity: Medium
All day — 4 hours wasted
Optimized the wrong layer. Spent 4 hours on aggressive volume optimizations (posts per keyword 30→50, scoring limit 100→200, threshold 40→30, new sources). Result: +1 pain point (8→9). The bottleneck was keyword matching filtering out 94% of posts—not fetch volume.
Severity: High
Afternoon — 2 hours
Problem phrases vs topic keywords. Changed keyword expansion to generate "problem phrases" (2-5 words) thinking they'd be more specific. But sources do substring matching—long phrases almost never match. Reverted to short topic keywords (1-2 words).
Severity: High
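The substring-matching failure is easy to reproduce. A minimal sketch, where `matchesAny` is a hypothetical stand-in for the sources' filtering logic:

```typescript
// A post that clearly discusses product-market fit, in real-world phrasing.
const post = "We struggled with PMF for a year before our pricing pivot.";

// Case-insensitive substring match, as the sources effectively do.
function matchesAny(text: string, keywords: string[]): boolean {
  const lower = text.toLowerCase();
  return keywords.some((k) => lower.includes(k.toLowerCase()));
}

const phraseHit = matchesAny(post, ["can't find product market fit"]); // false: the exact phrase never appears
const topicHit = matchesAny(post, ["pmf", "pricing"]);                 // true: short topic keywords match
```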
Morning — 90 min
JSON parsing errors in production. Claude wrapped JSON responses in markdown code blocks, and the stripping regex was too strict to handle variations in whitespace and language tags. Had to implement robust markdown stripping.
Severity: Medium
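A hedged sketch of the robust stripping described above: the regex tolerates an optional language tag and surrounding whitespace, the two variations the entry says broke the stricter original. `stripMarkdownFences` is an illustrative name.

```typescript
// Strip an optional markdown code fence (with or without a language tag)
// before handing the payload to JSON.parse. Falls through to the raw
// string when no fence is present.
function stripMarkdownFences(raw: string): string {
  const match = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/i);
  return (match ? match[1] : raw).trim();
}

const wrapped = '```json\n{ "pain": "churn" }\n```';
const parsed = JSON.parse(stripMarkdownFences(wrapped)); // { pain: "churn" }
```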
Midday — 45 min
Medium rate limiting (429). Hitting 6 Medium tags in parallel triggered rate limits. Had to reduce to 2 tags with 500ms delays between requests. Parallel fetching isn't always better.
Severity: Medium
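The fix described above (two tags per batch, 500 ms pauses between batches) can be sketched like this; `chunk`, `fetchAllTags`, and the injected `fetchTag` are illustrative names, not the project's code.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Split a list into batches of `size`.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Two concurrent requests per batch, then a pause: enough parallelism to be
// fast, small enough to stay under the 429 limit.
async function fetchAllTags(
  tags: string[],
  fetchTag: (tag: string) => Promise<string[]>,
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(tags, 2)) {
    results.push(...(await Promise.all(batch.map(fetchTag))).flat());
    await sleep(500);
  }
  return results;
}
```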
Afternoon — 2 hours
Quora scraper completely broken. Implemented HTML regex scraper. Zero results. Modern Quora is a React/Next.js app—questions are in JSON data embedded in script tags, not parseable HTML. Can't scrape modern SPAs with regex.
Severity: High
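What extracting the embedded JSON looks like instead of regexing rendered HTML, as a hedged sketch. The `__NEXT_DATA__` id is the Next.js convention; Quora's actual embedding and payload shape differ, so treat both the selector and the sample as illustrative.

```typescript
// Pull the JSON payload out of a framework's data script tag rather than
// trying to parse rendered component HTML. Returns null when no payload
// is found or it isn't valid JSON.
function extractEmbeddedJson(html: string): unknown {
  const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Hypothetical sample page with an embedded payload.
const sample = '<script id="__NEXT_DATA__" type="application/json">{"props":{"questions":["q1"]}}</script>';
```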
End of day — The realization
The fundamental ceiling. After 20 hours: 9 pain points, 68 seconds, $0.50/search. Claude prompt: 50+ pain points, 3 seconds, $0.01. We're building a Rube Goldberg machine to extract worse data than Claude already has in its training. Project killed.
Severity: High
Summary
Time invested: 20 hours across 3 days
Preventable friction: ~6 hours (better Monday planning)
Biggest single friction: Not validating competitive premise upfront. Spent 20 hours building something that can't compete with a 3-second Claude prompt.
Categories: 60% from scope gaps (missing competitive analysis, TypeScript guidelines, checklists). 25% from optimizing wrong layer. 15% from unpredictable issues (rate limits, modern SPA architecture).