What it was
A web scraping pipeline that pulled from 10 sources (HackerNews, Dev.to, IndieHackers, GitHub Issues, Stack Overflow, Product Hunt, YC RFS, Failory, Medium, and Quora, later disabled), ran relevance scoring through Claude, extracted pain points, and clustered them by theme. The goal: surface real founder problems with citations, faster than manual Reddit research.
Why it was killed
After 20 hours of building, the tool produced 9 pain points in 68 seconds at $0.50 per search. Asking Claude directly produces 50+ pain points in 3 seconds at $0.01. The comparison wasn't close.
The fundamental problem: we were building a complex pipeline to extract worse data than Claude already has in its training. Claude has seen millions of Reddit threads, founder blogs, and startup discussions. No amount of web scraping optimization could compete with that.
Three days of aggressive optimization—keyword expansion, volume increases, new data sources—moved us from 8 pain points to 9. Meanwhile, half the sources were failing: Medium hit rate limits (429 errors), Quora's React app couldn't be scraped with regex, and Reddit's API policies made it off-limits. We were fighting infrastructure instead of solving problems.
What was learned
- Validate the competitive premise first. The question "Can web scraping beat Claude for founder problems?" should have been asked on Day 1, not discovered on Day 3.
- When your complex solution is worse than "just ask ChatGPT," you're building the wrong thing. Pivot or stop—don't optimize further.
- Modern web apps can't be scraped with regex. Quora, Medium, and most 2026-era sites use React/SSR. Scraping them requires headless browsers, which are slow, expensive, and fragile.
- Optimize the bottleneck, not the source. Spent 4 hours increasing fetch volume when the real problem was keyword matching filtering out 94% of posts.
- AI-generated content must match how it's consumed. Generated "problem phrases" like "can't find product market fit" for substring matching. Posts don't contain exact long phrases—needed short topic keywords instead.
- Monday planning gaps cost days, not hours. Missing competitive analysis, TypeScript interface-first guidelines, and multi-file update checklists caused 90+ minutes of preventable friction.
Build context
Week 1. Feb 9–13, 2026. 20 hours invested. ROI: negative—but the learnings shaped how Week 2 will run.
The pipeline
- Keyword expansion — User enters a topic. Claude generates 10-12 related keywords.
- Multi-source fetch — Hit 10 APIs/feeds in parallel: HackerNews, Dev.to, IndieHackers, GitHub Issues, Stack Overflow, Product Hunt, YC RFS, Failory, Medium, (Quora disabled).
- Relevance scoring — Claude scores each post 0-100 for problem signal relevance. Posts below threshold filtered out.
- Pain extraction — Claude extracts specific pain points from high-relevance posts with supporting quotes.
- Clustering — Group similar pain points into themes. Calculate composite scores.
- Display — Ranked list with expandable cards, source citations, and score breakdowns.
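The six stages can be sketched as a typed orchestration function. This is a hedged reconstruction, not the project's 1,103-line implementation: the stage types, `runPipeline`, and the threshold constant are all hypothetical names, with the threshold value taken from the friction log's "threshold 40" figure.

```typescript
// Illustrative pipeline sketch. All names and types are hypothetical.
interface Post { source: string; title: string; body: string }
interface ScoredPost extends Post { relevance: number }
interface PainPoint { text: string; quote: string; source: string }
interface Theme { label: string; points: PainPoint[]; score: number }

// Each stage is a pluggable async function, so Claude-backed stages can be
// swapped for stubs in tests.
type Expand = (topic: string) => Promise<string[]>;    // Claude keyword expansion
type Fetch = (keywords: string[]) => Promise<Post[]>;  // parallel multi-source fetch
type Score = (posts: Post[]) => Promise<ScoredPost[]>; // Claude 0-100 relevance
type Extract = (posts: ScoredPost[]) => Promise<PainPoint[]>;
type Cluster = (points: PainPoint[]) => Theme[];

const RELEVANCE_THRESHOLD = 40; // posts below this are dropped

async function runPipeline(
  topic: string,
  stages: { expand: Expand; fetch: Fetch; score: Score; extract: Extract; cluster: Cluster },
): Promise<Theme[]> {
  const keywords = await stages.expand(topic);
  const posts = await stages.fetch(keywords);
  const scored = await stages.score(posts);
  const relevant = scored.filter((p) => p.relevance >= RELEVANCE_THRESHOLD);
  const points = await stages.extract(relevant);
  return stages.cluster(points);
}
```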
What worked
- TypeScript integration was clean and type-safe (after fixing 15+ initial type errors)
- JSON parsing became robust after handling all Claude output variations
- Clustering algorithm grouped pain points effectively
- Progress indicators showed actual pipeline steps, not fake timers
What didn't work
- Wrong data sources — HN/Dev.to are developers discussing code bugs, not founders discussing business problems
- Keyword matching bottleneck — Substring matching filtered out 94% of posts
- Rate limits — Medium returned 429 errors when hitting 6 tags in parallel
- Modern SPAs — Quora's React app couldn't be scraped without headless browser
- Fundamental ceiling — Can't beat Claude's training data with real-time scraping
Stack
Next.js 15, TypeScript, Claude 3.5 Sonnet API, Vercel deployment. 1,103 lines of code across 10 data source integrations.
Friction log
The complete friction log from Week 1. Everything that went wrong, got confusing, or surprised me during the build. Organized by day, then by issue.
20 hours · 11 issues · ~10 hours lost · 6 learnings
Morning — 2 hours lost
Missing competitive analysis. Built collapsible cards and progress indicators (nice-to-have) for 6 hours before the real requirement finally surfaced: "Beat a Claude prompt." Should have been building the AI synthesis layer (must-have) from the start.
Severity: High
Afternoon — 45 min
TypeScript type errors cascade. Used any types during rapid prototyping. Got 15+ type errors across GitHub, Stack Overflow, Dev.to, IndieHackers, HackerNews integrations. Should have defined interfaces BEFORE implementation.
Severity: Medium
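What "interfaces BEFORE implementation" could look like, as a hedged sketch: `NormalizedPost`, `SourceFetcher`, and the HackerNews adapter are hypothetical names, and the HN field names assume the Algolia-style search response rather than the project's actual types.

```typescript
// Shared shape defined up front, before any per-source fetcher exists.
// Names are illustrative, not the project's actual code.
interface NormalizedPost {
  source: "hackernews" | "devto" | "github" | "stackoverflow" | "indiehackers";
  title: string;
  body: string;
  url: string;
  createdAt: string; // ISO 8601
}

// Every fetcher must satisfy the same signature, so the compiler flags a
// missing field at the source instead of letting `any` errors cascade
// through six files later.
type SourceFetcher = (keyword: string) => Promise<NormalizedPost[]>;

// Example adapter mapping an Algolia-style HN hit into the shared shape.
interface HNHit { title: string; story_text?: string; url?: string; created_at: string; objectID: string }

const fromHN = (hit: HNHit): NormalizedPost => ({
  source: "hackernews",
  title: hit.title,
  body: hit.story_text ?? "",
  url: hit.url ?? `https://news.ycombinator.com/item?id=${hit.objectID}`,
  createdAt: hit.created_at,
});
```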
Afternoon — 20 min
Multi-file update chaos. Adding data sources required updating 6 files (types.ts, API route, PainPointCard, page.tsx, sources.ts, validation array). Kept discovering new places that needed changes. Needed a checklist.
Severity: Medium
Evening — 20 min
Keyword expansion made results WORSE. Divided post limit across keywords (30 ÷ 6 = 5 per keyword) instead of multiplying. Got 4 pain points instead of 50. Math should have multiplied coverage, not divided it.
Severity: High
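An illustrative reconstruction of the budgeting bug; function names are hypothetical, but the limit of 30 and the six keywords mirror the numbers in the entry above.

```typescript
const POSTS_PER_KEYWORD = 30;

// Buggy: one fixed total divided across keywords. 30 / 6 = 5 posts each,
// so adding keywords *shrank* per-keyword coverage.
function buggyBudget(keywords: string[], totalLimit = 30): number {
  return Math.floor(totalLimit / keywords.length);
}

// Fixed: every keyword keeps its own budget, so total coverage multiplies
// (6 keywords x 30 posts = 180 posts fetched).
function fixedBudget(): number {
  return POSTS_PER_KEYWORD;
}

// Hypothetical keyword set, sized to match the log's "30 ÷ 6" example.
const kws = ["churn", "pricing", "onboarding", "retention", "growth", "funding"];
```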
Evening — 15 min
Overly specific keywords. Generated multi-word phrases ("can't find product market fit") that never match in substring searches. People don't type exact long phrases. Needed short topic keywords (1-2 words).
Severity: Medium
All day — 4 hours wasted
Optimized the wrong layer. Spent 4 hours on aggressive volume optimizations (posts per keyword 30→50, scoring limit 100→200, threshold 40→30, new sources). Result: +1 pain point (8→9). The bottleneck was keyword matching filtering out 94% of posts—not fetch volume.
Severity: High
Afternoon — 2 hours
Problem phrases vs topic keywords. Changed keyword expansion to generate "problem phrases" (2-5 words) thinking they'd be more specific. But sources do substring matching—long phrases almost never match. Reverted to short topic keywords (1-2 words).
Severity: High
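The substring-matching failure is easy to reproduce. A minimal sketch, where `matchesAny` is a hypothetical stand-in for the sources' filtering logic:

```typescript
// A post that clearly discusses product-market fit, in real-world phrasing.
const post = "We struggled with PMF for a year before our pricing pivot.";

// Case-insensitive substring match, as the sources effectively do.
function matchesAny(text: string, keywords: string[]): boolean {
  const lower = text.toLowerCase();
  return keywords.some((k) => lower.includes(k.toLowerCase()));
}

const phraseHit = matchesAny(post, ["can't find product market fit"]); // false: the exact phrase never appears
const topicHit = matchesAny(post, ["pmf", "pricing"]);                 // true: short topic keywords match
```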
Morning — 90 min
JSON parsing errors in production. Claude wrapped JSON responses in markdown code blocks, and the stripping regex was too strict to handle variations in whitespace and language tags. Had to implement robust markdown stripping.
Severity: Medium
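A hedged sketch of the robust stripping described above: the regex tolerates an optional language tag and surrounding whitespace, the two variations the entry says broke the stricter original. `stripMarkdownFences` is an illustrative name.

```typescript
// Strip an optional markdown code fence (with or without a language tag)
// before handing the payload to JSON.parse. Falls through to the raw
// string when no fence is present.
function stripMarkdownFences(raw: string): string {
  const match = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/i);
  return (match ? match[1] : raw).trim();
}

const wrapped = '```json\n{ "pain": "churn" }\n```';
const parsed = JSON.parse(stripMarkdownFences(wrapped)); // { pain: "churn" }
```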
Midday — 45 min
Medium rate limiting (429). Hitting 6 Medium tags in parallel triggered rate limits. Had to reduce to 2 tags with 500ms delays between requests. Parallel fetching isn't always better.
Severity: Medium
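The fix described above (two tags per batch, 500 ms pauses between batches) can be sketched like this; `chunk`, `fetchAllTags`, and the injected `fetchTag` are illustrative names, not the project's code.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Split a list into batches of `size`.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Two concurrent requests per batch, then a pause: enough parallelism to be
// fast, small enough to stay under the 429 limit.
async function fetchAllTags(
  tags: string[],
  fetchTag: (tag: string) => Promise<string[]>,
): Promise<string[]> {
  const results: string[] = [];
  for (const batch of chunk(tags, 2)) {
    results.push(...(await Promise.all(batch.map(fetchTag))).flat());
    await sleep(500);
  }
  return results;
}
```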
Afternoon — 2 hours
Quora scraper completely broken. Implemented HTML regex scraper. Zero results. Modern Quora is a React/Next.js app—questions are in JSON data embedded in script tags, not parseable HTML. Can't scrape modern SPAs with regex.
Severity: High
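What extracting the embedded JSON looks like instead of regexing rendered HTML, as a hedged sketch. The `__NEXT_DATA__` id is the Next.js convention; Quora's actual embedding and payload shape differ, so treat both the selector and the sample as illustrative.

```typescript
// Pull the JSON payload out of a framework's data script tag rather than
// trying to parse rendered component HTML. Returns null when no payload
// is found or it isn't valid JSON.
function extractEmbeddedJson(html: string): unknown {
  const match = html.match(/<script[^>]*id="__NEXT_DATA__"[^>]*>([\s\S]*?)<\/script>/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null;
  }
}

// Hypothetical sample page with an embedded payload.
const sample = '<script id="__NEXT_DATA__" type="application/json">{"props":{"questions":["q1"]}}</script>';
```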
End of day — The realization
The fundamental ceiling. After 20 hours: 9 pain points, 68 seconds, $0.50/search. Claude prompt: 50+ pain points, 3 seconds, $0.01. We're building a Rube Goldberg machine to extract worse data than Claude already has in its training. Project killed.
Severity: High
Summary
Time invested: 20 hours across 3 days
Preventable friction: ~6 hours (better Monday planning)
Biggest single friction: Not validating competitive premise upfront. Spent 20 hours building something that can't compete with a 3-second Claude prompt.
Categories: 60% from scope gaps (missing competitive analysis, TypeScript guidelines, checklists). 25% from optimizing wrong layer. 15% from unpredictable issues (rate limits, modern SPA architecture).