Methodology
How ideas get from public posts to the directory, in full detail. This page is intentionally technical — for the shorter version, see /about.
1. Collection (scraping)
Six source-specific Supabase Edge Functions run on scheduled pg_cron jobs:
- Reddit — 06:00 and 18:00 UTC. Hits both /hot and /new per configured subreddit, deduplicated by Reddit post ID. Currently paused pending Reddit API access following their November 2025 Responsible Builder Policy update.
- Hacker News — 07:00 and 19:00 UTC. Hits Algolia's /search (relevance) and /search_by_date (newest-first) per configured query, deduplicated by Algolia objectID. Capped at 120 posts per run to stay within the Edge Function wall-clock limit.
- GitHub Trending — 08:00 UTC. Repos gaining recent stars in software-adjacent topics.
- Product Hunt — 09:00 UTC. Daily featured products and launches.
- Indie Hackers — 10:00 UTC. Posts from the public forum.
- Google Trends — 08:00, 14:00, 20:00 UTC. Breakout queries.
Source lists are stored in the scrape_sources table and read at runtime, so adding or removing a subreddit / query doesn't require redeploying an Edge Function.
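As a rough illustration of the merge step described above (two listings per source, deduplicated by the source's native ID, capped per run), here is a minimal sketch. The RawPost shape and the mergeListings helper are hypothetical names chosen for this example, not taken from the repository.

```typescript
// Hypothetical shape of a scraped post; the real pipeline's types may differ.
interface RawPost {
  id: string; // source-native ID (Reddit post ID or Algolia objectID)
  title: string;
}

// Merge several listings, keep the first occurrence of each ID,
// and stop once `cap` posts have been collected.
function mergeListings(listings: RawPost[][], cap: number): RawPost[] {
  const seen = new Set<string>();
  const merged: RawPost[] = [];
  for (const listing of listings) {
    for (const post of listing) {
      if (merged.length >= cap) return merged;
      if (seen.has(post.id)) continue; // already collected from another listing
      seen.add(post.id);
      merged.push(post);
    }
  }
  return merged;
}

// Example: a relevance-ordered and a newest-first listing that overlap on "b".
const relevance: RawPost[] = [{ id: "a", title: "A" }, { id: "b", title: "B" }];
const newest: RawPost[] = [{ id: "b", title: "B" }, { id: "c", title: "C" }];
const run = mergeListings([relevance, newest], 120);
```

Keeping the first occurrence means the order of the listings decides which copy of a duplicate survives; that ordering is an assumption of this sketch.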
2. Extraction (LLM)
Each batch of raw posts is passed through Anthropic Claude Haiku 4.5 in chunks of 10. The prompt asks the model to identify app and SaaS product ideas and return structured JSON for each idea it finds. The exact fields:
- idea_title — concise product name
- summary — 2–3 sentence pitch describing problem, solution, and target user
- category — one of 14 fixed slugs (fintech, devtools, automation, ai-ml, ecommerce, health, education, creator-tools, productivity, marketing, hr-recruiting, real-estate, logistics, other)
- tags — 3–5 lowercase topic tags
- confidence — 0.0–1.0 score indicating how clearly the post describes a buildable product. Ideas below 0.5 are dropped; 0.5–0.7 are marked needs_review and hidden from public views; 0.7+ are active.
- difficulty — 1 (weekend build with AI assistance) to 5 (significant infrastructure required)
- market_signal — strong (clear unmet demand; people explicitly asking for this), moderate (some signal), or weak (speculative)
- competition_level — low / medium / high estimate of existing alternatives
- revenue_potential — free-text monthly revenue range (e.g. $2k-10k/mo) or unknown
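The chunking and the confidence thresholds can be sketched as follows. This is an illustration under stated assumptions, not the code in extract.ts: the helper names are hypothetical, and the handling of the exact boundary values (0.5 and 0.7) is a guess, since the description above does not say which side they fall on.

```typescript
type IdeaStatus = "active" | "needs_review";

// Split a batch into fixed-size chunks (the text above uses chunks of 10).
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Map a confidence score to a status, or null when the idea is dropped.
// Assumption: exactly 0.5 counts as needs_review and exactly 0.7 as active.
function statusForConfidence(confidence: number): IdeaStatus | null {
  if (confidence < 0.5) return null;
  if (confidence < 0.7) return "needs_review";
  return "active";
}
```

With this split, a batch of 25 posts would go to the model as three requests of 10, 10, and 5 posts.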
3. Deduplication
Every extracted idea is compared against the existing corpus using PostgreSQL pg_trgm trigram similarity on title and summary separately. An existing idea is considered a duplicate if either similarity(title, new_title) > 0.6 OR similarity(summary, new_summary) > 0.5. The asymmetric threshold reflects the observation that titles are shorter and noisier (so demand a higher bar) while summaries carry more signal. A match increments the existing idea's mention_count and updates last_seen_at; no match creates a new row. We chose pg_trgm (pure Postgres, no embedding calls) over semantic similarity for Phase 1 to keep ingestion cost near-zero and avoid a hard dependency on a second LLM provider. Semantic dedup via Ghost pgvectorscale is queued for a later phase. The canonical SQL lives in supabase/migrations/001_initial_schema.sql as find_similar_ideas().
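To make the duplicate rule concrete, here is a small TypeScript approximation of trigram similarity together with the two thresholds. It is not the find_similar_ideas() SQL and will not match pg_trgm exactly: it only handles ASCII letters and digits, and the Idea shape and function names are hypothetical. It follows pg_trgm's documented approach of padding each word with two leading spaces and one trailing space and dividing shared trigrams by the size of the union.

```typescript
// Hypothetical minimal shape; the real ideas table has more columns.
interface Idea {
  title: string;
  summary: string;
}

// Build the set of trigrams for a string: lowercase, split into words on
// non-alphanumeric characters, pad each word as "  word ", take 3-char windows.
function trigrams(text: string): Set<string> {
  const result = new Set<string>();
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  for (const word of words) {
    const padded = `  ${word} `;
    for (let i = 0; i + 3 <= padded.length; i++) {
      result.add(padded.slice(i, i + 3));
    }
  }
  return result;
}

// Shared trigrams divided by the size of the union (0 when either side is empty).
function similarity(a: string, b: string): number {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0 || tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Duplicate if EITHER the titles or the summaries clear their threshold.
function isDuplicate(existing: Idea, incoming: Idea): boolean {
  return (
    similarity(existing.title, incoming.title) > 0.6 ||
    similarity(existing.summary, incoming.summary) > 0.5
  );
}
```

Because the two conditions are joined with OR, a reworded title with a near-identical summary still counts as a mention of the existing idea.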
4. Popularity scoring
The popularity_score is a database-computed column recalculated on insert and on every mention increment. It combines three factors:
- Log-scaled mention count — each additional mention contributes less than the last. The 20th mention of an idea doesn't move the score as much as the 2nd.
- Source diversity — an idea surfaced by three distinct platforms (e.g. once on HN, once on Reddit, once on PH) outranks an idea surfaced three times on the same platform.
- Recency decay — gentle exponential decay on last_seen_at so stale ideas don't permanently dominate the rankings.
The Fresh sort filters to ideas with first_seen_at in the last 7 days; everything else uses the full corpus.
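The three factors could combine along the following lines. The actual formula and weights are not stated on this page, so everything here is an assumption made for illustration: the multiplicative form, the DIVERSITY_WEIGHT and HALF_LIFE_DAYS constants, and the IdeaStats shape are all hypothetical. The sketch only shows the qualitative behaviour described above.

```typescript
// Hypothetical inputs; the real column is computed inside Postgres.
interface IdeaStats {
  mentionCount: number;
  distinctSources: number;
  daysSinceLastSeen: number;
}

// Hypothetical constants, chosen only to make the example concrete.
const DIVERSITY_WEIGHT = 0.5;
const HALF_LIFE_DAYS = 30;

function popularityScore(s: IdeaStats): number {
  // Log-scaled mentions: each extra mention adds less than the previous one.
  const mentions = Math.log(1 + s.mentionCount);
  // Source diversity: every additional distinct platform boosts the score.
  const diversity = 1 + DIVERSITY_WEIGHT * (s.distinctSources - 1);
  // Recency: exponential decay that halves the score every HALF_LIFE_DAYS.
  const recency = Math.pow(0.5, s.daysSinceLastSeen / HALF_LIFE_DAYS);
  return mentions * diversity * recency;
}

// Three mentions spread over three platforms vs. three on a single platform.
const spread = popularityScore({ mentionCount: 3, distinctSources: 3, daysSinceLastSeen: 0 });
const concentrated = popularityScore({ mentionCount: 3, distinctSources: 1, daysSinceLastSeen: 0 });
```

Under these assumptions, spread is larger than concentrated, which matches the source-diversity rule; different constants would change the magnitudes but not the ordering.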
5. Commentary
Every new active idea additionally gets a 2–4 sentence editorial commentary from Claude Sonnet 4.6, covering:
- Market timing — why this is interesting right now, grounded in real trends
- Closest competitor or substitute — named if a well-known one exists
- Unit economics hint — why the revenue band makes sense (or doesn't)
- Biggest risk — the single most likely failure mode
We chose Sonnet 4.6 over Haiku 4.5 after an A/B test on 21 diverse ideas. Sonnet reliably names real competitors (e.g. TradingView, ForeFlight, Polygon) and cites real pricing, which Haiku approximates but doesn't ground. Cost delta at our volume is ~$33/year, trivial relative to the citation-quality gain.
6. Storage and delivery
All ideas and their metadata live in Supabase Postgres. Related-idea suggestions are powered by Tiger Data's Ghost (Agentic Postgres) using BM25 text search, with a Supabase same-category fallback. The Next.js frontend on Vercel renders HTML idea pages, plus markdown-native variants at /ideas/{slug}.md designed for LLM ingestion.
7. Known limitations
- LLM-assigned signals are approximate. market_signal, competition_level, and revenue_potential are Haiku's best-effort estimates from a single post. They're directionally useful but shouldn't drive investment decisions without independent validation.
- Reddit is temporarily offline. Reddit's November 2025 Responsible Builder Policy removed self-service API keys. Our formal access application is pending.
- Source bias. The corpus skews toward developer-tool and consumer-software categories because those dominate the public discussions we scrape. Logistics and HR ideas are underrepresented.
- English only. The extraction prompt runs in English and our sources are English-dominant.
- Commentary is a synthesis, not a source. The editorial paragraph is AI-generated analysis of the idea, not a quote from the original post. Treat it as a prompt for your own research rather than a verified claim.
Audit and reproducibility
The full pipeline is open source at github.com/lld-gif/easy-saas. The extraction prompt lives at supabase/functions/_shared/extract.ts and the commentary prompt at src/lib/commentary.ts. Scrape schedules are stored in the cron.job table on the Supabase side. The numbered migrations in supabase/migrations/ document how the schema has evolved.