Really American Story Identification System - Technical Specification

Date: December 11, 2025
For: Justin Horwitz / Really American Media
Author: Gary Sheng


Overview

A low-cost agent service that continuously monitors social media accounts, newsletters, and news sites to surface story opportunities for the Really American content team. Team members can claim stories, mark them as complete, and add new sources to monitor.


Data Sources to Monitor

1. X (Twitter) Accounts

  • Method: X API v2 (Basic or Pro tier)
  • Cost:
    • Basic: $100/month (10,000 tweets read/month, limited to user's own tweets)
    • Pro: $5,000/month (1M tweets read/month, full access)
    • Recommendation: Start with a workaround (see below) or accept the $5K/month cost

X API Workarounds:

  • Nitter instances (free but unreliable, frequently go down)
  • Apify Twitter Scraper: ~$50-100/month for moderate usage (see the sketch after this list)
  • SocialData.tools: ~$0.002 per tweet, likely $50-150/month
  • RapidAPI Twitter alternatives: Various pricing, some as low as $30/month
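
To sanity-check the Apify route, here is a minimal sketch using the official apify-client Python package. The actor ID and its input fields are placeholders; each Twitter-scraper actor on the Apify store defines its own input schema, so check its docs before wiring this up.

from apify_client import ApifyClient  # pip install apify-client

client = ApifyClient("YOUR_APIFY_TOKEN")

# Placeholder actor ID and input fields; every scraper actor has its own schema
run = client.actor("someauthor/twitter-scraper").call(run_input={
    "handles": ["account_to_monitor"],  # hypothetical input field
    "maxItems": 100,                    # hypothetical input field
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), item.get("text"))  # output field names vary by actor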

2. Facebook Pages (e.g., Occupy Democrats)

  • Method: Meta Graph API (limited) or third-party scrapers
  • Limitation: Facebook aggressively blocks scraping. Public pages can be accessed via the Graph API with an approved app.
  • Cost: Free for approved apps, or Apify Facebook Scraper (~$40-100/month)
  • Alternative: Meta Content Library, CrowdTangle's successor (Meta shut down CrowdTangle in August 2024). Access is application-based, but if the team qualifies, it is free and powerful.

3. Newsletters (Politico Playbook, etc.)

  • Method: Email forwarding + parsing
  • How it works:
    1. Subscribe with a dedicated email address
    2. Forward all emails to your service
    3. Parse with an LLM or regex
  • Cost: Near-free (just email hosting, $5-10/month for domain email)
  • Tools: Zapier/Make.com for email parsing, or a custom Lambda function (parsing sketch below)
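
For the custom route, a minimal parsing sketch using Python's standard email module. The regex link extraction is deliberately naive; swapping in an LLM pass is the upgrade path for messy newsletter layouts.

import email
import re
from email import policy

def parse_newsletter(raw_bytes: bytes) -> list[dict]:
    """Extract candidate story links from a forwarded newsletter email."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html")).get_content()
    links = re.findall(r"https?://[^\s\"'<>]+", body)
    return [{"subject": msg["subject"], "url": url} for url in links]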

4. News Sites (Daily Beast, Raw Story, Irish Star, etc.)

  • Method: RSS feeds + web scraping fallback
  • Many sites have RSS feeds (polling sketch below):
    • Daily Beast: https://www.thedailybeast.com/rss
    • Raw Story: https://www.rawstory.com/feed/
  • For sites without RSS: Puppeteer/Playwright scraping with rotating proxies
  • Cost:
    • RSS: Free
    • Scraping: Proxy costs ~$20-50/month (BrightData, ScrapingBee, etc.)
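
A minimal RSS polling sketch using the feedparser library; the two feed URLs come from the list above, and the output dict shape is an arbitrary choice.

import feedparser  # pip install feedparser

FEEDS = [
    "https://www.thedailybeast.com/rss",
    "https://www.rawstory.com/feed/",
]

def poll_feeds() -> list[dict]:
    """Pull the latest entries from every configured feed."""
    items = []
    for feed_url in FEEDS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            items.append({
                "title": entry.get("title"),
                "url": entry.get("link"),
                "published": entry.get("published"),
                "source": feed_url,
            })
    return items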

5. YouTube Channels

  • Method: YouTube Data API v3
  • Cost: Free. The API grants 10,000 quota units/day; a search.list call costs 100 units (so ~100 checks/day that way), while polling a channel's uploads playlist costs only 1 unit per call.
  • Ample for monitoring 50-100 channels (see the sketch below)
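
A quota-frugal sketch: rather than calling search.list (100 units per call), resolve each channel's uploads playlist and poll it at 1 unit per call. Endpoints and response fields follow the public YouTube Data API v3 docs; the env var name is an assumption.

import os
import requests

API = "https://www.googleapis.com/youtube/v3"
KEY = os.environ["YOUTUBE_API_KEY"]  # assumed env var name

def latest_uploads(channel_id: str, max_results: int = 10) -> list[dict]:
    # channels.list (1 quota unit): look up the channel's uploads playlist
    ch = requests.get(f"{API}/channels", params={
        "part": "contentDetails", "id": channel_id, "key": KEY,
    }, timeout=10).json()
    uploads = ch["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    # playlistItems.list (1 quota unit): newest uploads first
    pl = requests.get(f"{API}/playlistItems", params={
        "part": "snippet", "playlistId": uploads,
        "maxResults": max_results, "key": KEY,
    }, timeout=10).json()
    return [{
        "title": item["snippet"]["title"],
        "video_id": item["snippet"]["resourceId"]["videoId"],
        "published_at": item["snippet"]["publishedAt"],
    } for item in pl["items"]]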

6. Kay's Archive (assuming archive.org or similar)

  • Method: Archive.org API (free) or direct scraping
  • Cost: Free

Architecture

High-Level Flow

┌──────────────────────────────────────────────────────────────┐
│                       DATA COLLECTORS                        │
│ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌─────────┐ ┌─────────┐ │
│ │X/Twitter │ │ Facebook │ │ Email  │ │ RSS/Web │ │ YouTube │ │
│ │ Scraper  │ │ Scraper  │ │ Parser │ │ Scraper │ │   API   │ │
│ └────┬─────┘ └────┬─────┘ └───┬────┘ └────┬────┘ └────┬────┘ │
└──────┼────────────┼───────────┼───────────┼───────────┼──────┘
       └────────────┴───────────┼───────────┴───────────┘
                                ▼
                       ┌──────────────────┐
                       │   RAW CONTENT    │
                       │    DATABASE      │
                       │   (PostgreSQL)   │
                       └────────┬─────────┘
                                ▼
                       ┌──────────────────┐
                       │   STORY RANKER   │
                       │   (LLM Agent)    │
                       │                  │
                       │ - Relevance score│
                       │ - Topic tagging  │
                       │ - Urgency rating │
                       │ - Deduplication  │
                       └────────┬─────────┘
                                ▼
                       ┌──────────────────┐
                       │   STORY QUEUE    │
                       │   (Web App UI)   │
                       │                  │
                       │ - View stories   │
                       │ - Claim stories  │
                       │ - Mark complete  │
                       │ - Add sources    │
                       └──────────────────┘

Tech Stack Recommendation

Component      | Technology                         | Why
---------------|------------------------------------|---------------------------------------------
Backend        | Python + FastAPI                   | Fast to build, great for scraping/AI
Database       | Supabase (PostgreSQL)              | Generous free tier, real-time subscriptions
Job Scheduler  | Modal.com or Railway cron          | Serverless, pay-per-use
LLM Processing | OpenAI GPT-4o-mini or Claude Haiku | Cheap, fast, good enough
Frontend       | Next.js + Vercel                   | Fast deployment, free hosting
Scraping Infra | Apify or BrightData                | Handles anti-bot for you
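
For the scheduler row, a minimal Modal cron sketch (assuming the current Modal Python SDK; the app name and function body are placeholders):

import modal

app = modal.App("story-collectors")  # placeholder app name

@app.function(schedule=modal.Cron("*/15 * * * *"))  # every 15 minutes
def collect_all():
    # Run each collector (RSS, YouTube, newsletter email) and
    # upsert new items into the raw_content table in Supabase.
    ...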

Feature Breakdown

Core Features (MVP)

  1. Source Management

    • Add/remove X accounts to monitor
    • Add/remove Facebook pages
    • Add/remove RSS feeds
    • Add/remove YouTube channels
    • Subscribe/unsubscribe newsletters
  2. Story Queue

    • List of detected stories ranked by relevance/urgency
    • Filter by topic (Trump admin, Hegseth, Noem, etc.)
    • Filter by source type
    • Search functionality
  3. Story Claiming

    • Claim a story (locks it for you)
    • Mark as "in progress"
    • Mark as "published" (with link to content)
    • Mark as "passed" (won't do this one)
    • Release claim if you change your mind
  4. Topic Tracking

    • Pre-defined topics: Trump corruption, Hegseth, Noem, etc.
    • Custom topic creation
    • LLM auto-tags stories to topics
  5. Notifications

    • Slack/Discord integration for high-priority stories (webhook sketch below)
    • Daily digest email
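
The Slack half of the notifications item is a few lines against a standard incoming webhook; a sketch (the env var name is an assumption):

import os
import requests

def notify_slack(story_title: str, story_url: str) -> None:
    """Post a high-priority story to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed env var name
    requests.post(webhook_url, json={
        "text": f"New high-priority story: {story_title}\n{story_url}",
    }, timeout=10)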

Nice-to-Have Features (V2)

  • Underreported story detection (comparing mainstream coverage)
  • Viral potential scoring
  • Suggested headlines/thumbnails
  • Cross-reference with trending topics
  • Performance feedback loop (learn from what works)

Cost Breakdown (Monthly)

Minimum Viable Setup (~$145-365/month)

Item                | Cost     | Notes
--------------------|----------|------------------------------
X/Twitter Data      | $50-100  | Via SocialData.tools or Apify
Facebook Scraping   | $40-80   | Apify or Meta Content Library
Proxy Service       | $20-50   | For news site scraping
LLM (GPT-4o-mini)   | $20-50   | ~500K tokens/day processing
Supabase            | $0-25    | Free tier likely sufficient
Vercel              | $0-20    | Free tier likely sufficient
Modal/Railway       | $10-30   | For scheduled jobs
Email (newsletters) | $5-10    | Fastmail or similar
TOTAL               | $145-365 |

More Robust Setup (~$500-800/month)

Add:

  • X Pro API ($5,000/month) → Only if volume demands it
  • Dedicated scraping infrastructure
  • More LLM usage for better analysis
  • Dedicated database instance

If X API is Required at Scale

The X Pro API at $5,000/month is brutal. Alternatives:

  1. Curate accounts carefully: Only monitor 20-30 high-signal accounts
  2. Use free search via Nitter mirrors: Unreliable but free
  3. Partner with existing tooling: Some newsrooms share API access
  4. Accept limited X coverage: Focus on other sources, use X manually

Development Timeline

Phase 1: MVP (2-3 weeks)

  • Set up database schema
  • Build RSS/news site collectors
  • Build newsletter email parser
  • Build YouTube channel monitor
  • Basic LLM story ranking
  • Simple web UI for viewing/claiming
  • Basic topic tagging

Phase 2: Social Media (1-2 weeks)

  • X/Twitter integration (via third-party)
  • Facebook page integration
  • Deduplication across sources

Phase 3: Polish (1 week)

  • Slack/Discord notifications
  • Better UI/UX
  • Admin panel for adding sources
  • Daily digest emails

Total: ~4-6 weeks for a solo developer


Database Schema (Simplified)

-- Sources to monitor
CREATE TABLE sources (
  id UUID PRIMARY KEY,
  type TEXT, -- 'twitter', 'facebook', 'rss', 'youtube', 'newsletter'
  identifier TEXT, -- handle, page_id, url, channel_id, email
  name TEXT,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMP
);

-- Raw collected content
CREATE TABLE raw_content (
  id UUID PRIMARY KEY,
  source_id UUID REFERENCES sources(id),
  external_id TEXT, -- tweet_id, post_id, etc.
  content TEXT,
  url TEXT,
  published_at TIMESTAMP,
  collected_at TIMESTAMP,
  metadata JSONB
);

-- Users (team members); created before stories so the foreign key below resolves
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT,
  name TEXT,
  role TEXT -- 'admin', 'editor', 'writer'
);

-- Processed stories
CREATE TABLE stories (
  id UUID PRIMARY KEY,
  raw_content_ids UUID[], -- can come from multiple sources
  title TEXT,
  summary TEXT,
  relevance_score FLOAT,
  urgency_score FLOAT,
  topics TEXT[],
  status TEXT DEFAULT 'new', -- 'new', 'claimed', 'in_progress', 'published', 'passed'
  claimed_by UUID REFERENCES users(id),
  claimed_at TIMESTAMP,
  published_url TEXT,
  created_at TIMESTAMP
);

-- Topics to track
CREATE TABLE topics (
  id UUID PRIMARY KEY,
  name TEXT,
  keywords TEXT[],
  active BOOLEAN DEFAULT true
);
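
The claiming flow from the feature list can be made race-safe with one conditional UPDATE against this schema; a sketch, with $1/$2 as bind parameters:

-- Atomically claim a story: succeeds only if it is still unclaimed
UPDATE stories
SET status = 'claimed', claimed_by = $1, claimed_at = now()
WHERE id = $2 AND status = 'new'
RETURNING id; -- zero rows returned means someone else claimed it first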

LLM Processing Pipeline

For each piece of raw content:

import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def process_content(content: RawContent) -> Story:
    # RawContent, Story, parse_story, and TOPIC_LIST are defined elsewhere
    prompt = f"""
Analyze this content for newsworthy story potential.

Content: {content.text}
Source: {content.source_type} - {content.source_name}

Evaluate:
1. Is this a distinct newsworthy story? (not just commentary)
2. Relevance score (0-1) for a progressive political audience
3. Urgency score (0-1) - is this time-sensitive?
4. Topics it relates to: {TOPIC_LIST}
5. Suggested headline (if newsworthy)
6. Brief summary (2-3 sentences)

Return JSON.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # forces parseable JSON output
    )
    return parse_story(json.loads(response.choices[0].message.content))

Estimated cost per item: ~$0.001-0.002 (4o-mini pricing)


Key Risks & Mitigations

Risk                     | Impact | Mitigation
-------------------------|--------|---------------------------------------------------------------
X API cost prohibitive   | High   | Use third-party scrapers, limit accounts
Facebook blocks scraping | Medium | Apply for Meta Content Library access, use approved Graph API
LLM costs spiral         | Medium | Use cheapest model, batch processing, caching
Anti-bot detection       | Medium | Rotating proxies, rate limiting, human-like patterns
Story duplication        | Low    | Semantic similarity matching, URL deduping (sketch below)
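
For the story-duplication row, a minimal semantic-dedup sketch using OpenAI embeddings. The model name is a real current one, but the 0.88 similarity threshold is a guess to tune on real data.

import numpy as np
from openai import OpenAI  # pip install openai numpy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def is_duplicate(new_title: str, seen_titles: list[str], threshold: float = 0.88) -> bool:
    """True if the new story is semantically close to one already collected."""
    if not seen_titles:
        return False
    vecs = embed([new_title] + seen_titles)
    new, rest = vecs[0], vecs[1:]
    # Cosine similarity between the new title and every seen title
    sims = rest @ new / (np.linalg.norm(rest, axis=1) * np.linalg.norm(new))
    return float(sims.max()) >= threshold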

Recommended Next Steps

  1. Define 20-30 must-have X accounts to monitor
  2. List all RSS feeds available from target news sites
  3. Subscribe to all newsletters with a dedicated email
  4. Set up Supabase and create the schema
  5. Build RSS collector first (lowest friction)
  6. Test LLM ranking with sample content
  7. Build minimal UI to view/claim stories
  8. Add social media collectors last (most complex)

Questions to Resolve

  1. How many X accounts need monitoring? (affects API cost decision)
  2. What's the team size using this? (affects claiming workflow)
  3. What Slack/Discord does the team use? (for notifications)
  4. Is Meta Content Library (formerly CrowdTangle) access available through any existing relationships?
  5. What's the acceptable delay? (real-time vs. hourly vs. daily batches)
  6. Should this integrate with existing Really American tooling?

Summary

Building this is very doable in the ~$145-365/month range if we're creative about X/Twitter access. The core architecture is straightforward: collectors → database → LLM ranking → web UI.

The main cost variable is X API access. If Really American absolutely needs full X coverage of many accounts, budget $5K/month just for that. Otherwise, we can work around it with third-party scrapers.

Development time: 4-6 weeks for MVP with a single developer.

Let me know which sources are highest priority and we can sequence accordingly.