Really American Story Identification System - Technical Specification

Date: December 11, 2025
For: Justin Horwitz / Really American Media
Author: Gary Sheng


Overview

A low-cost agent service that continuously monitors social media accounts, newsletters, and news sites to surface story opportunities for the Really American content team. Team members can claim stories, mark them as complete, and add new sources to monitor.


Data Sources to Monitor

1. X (Twitter) Accounts

  • Method: X API v2 (Basic or Pro tier)
  • Cost:
    • Basic: $100/month (10,000 tweets read/month, limited to user's own tweets)
    • Pro: $5,000/month (1M tweets read/month, full access)
    • Recommendation: Start with a workaround (see below) or accept the $5K/month cost

X API Workarounds:

  • Nitter instances (free but unreliable, frequently go down)
  • Apify Twitter Scraper: ~$50-100/month for moderate usage (see the sketch after this list)
  • SocialData.tools: ~$0.002 per tweet, likely $50-150/month
  • RapidAPI Twitter alternatives: Various pricing, some as low as $30/month
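
To sanity-check the Apify route, here is a minimal sketch using the official apify-client Python package. The actor ID and its input fields are placeholders; each Twitter-scraper actor on the Apify store defines its own input schema, so check its docs before wiring this up.

from apify_client import ApifyClient  # pip install apify-client

client = ApifyClient("YOUR_APIFY_TOKEN")

# Placeholder actor ID and input fields; every scraper actor has its own schema
run = client.actor("someauthor/twitter-scraper").call(run_input={
    "handles": ["account_to_monitor"],  # hypothetical input field
    "maxItems": 100,                    # hypothetical input field
})
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), item.get("text"))  # output field names vary by actor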

2. Facebook Pages (e.g., Occupy Democrats)

  • Method: Meta Graph API (limited) or third-party scrapers
  • Limitation: Facebook aggressively blocks scraping. Public pages can be accessed via the Graph API with an approved app.
  • Cost: Free for approved apps, or Apify Facebook Scraper (~$40-100/month)
  • Alternative: Meta Content Library, CrowdTangle's successor (Meta shut down CrowdTangle in August 2024). Access is application-based, but if the team qualifies, it is free and powerful.

3. Newsletters (Politico Playbook, etc.)

  • Method: Email forwarding + parsing
  • How it works:
    1. Subscribe with a dedicated email address
    2. Forward all emails to your service
    3. Parse with an LLM or regex
  • Cost: Near-free (just email hosting, $5-10/month for domain email)
  • Tools: Zapier/Make.com for email parsing, or a custom Lambda function (parsing sketch below)
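
For the custom route, a minimal parsing sketch using Python's standard email module. The regex link extraction is deliberately naive; swapping in an LLM pass is the upgrade path for messy newsletter layouts.

import email
import re
from email import policy

def parse_newsletter(raw_bytes: bytes) -> list[dict]:
    """Extract candidate story links from a forwarded newsletter email."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain", "html")).get_content()
    links = re.findall(r"https?://[^\s\"'<>]+", body)
    return [{"subject": msg["subject"], "url": url} for url in links]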

4. News Sites (Daily Beast, Raw Story, Irish Star, etc.)

  • Method: RSS feeds + web scraping fallback
  • Many sites have RSS feeds (polling sketch below):
    • Daily Beast: https://www.thedailybeast.com/rss
    • Raw Story: https://www.rawstory.com/feed/
  • For sites without RSS: Puppeteer/Playwright scraping with rotating proxies
  • Cost:
    • RSS: Free
    • Scraping: Proxy costs ~$20-50/month (BrightData, ScrapingBee, etc.)
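
A minimal RSS polling sketch using the feedparser library; the two feed URLs come from the list above, and the output dict shape is an arbitrary choice.

import feedparser  # pip install feedparser

FEEDS = [
    "https://www.thedailybeast.com/rss",
    "https://www.rawstory.com/feed/",
]

def poll_feeds() -> list[dict]:
    """Pull the latest entries from every configured feed."""
    items = []
    for feed_url in FEEDS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            items.append({
                "title": entry.get("title"),
                "url": entry.get("link"),
                "published": entry.get("published"),
                "source": feed_url,
            })
    return items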

5. YouTube Channels

  • Method: YouTube Data API v3
  • Cost: Free. The API grants 10,000 quota units/day; a search.list call costs 100 units (so ~100 checks/day that way), while polling a channel's uploads playlist costs only 1 unit per call.
  • Ample for monitoring 50-100 channels (see the sketch below)
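
A quota-frugal sketch: rather than calling search.list (100 units per call), resolve each channel's uploads playlist and poll it at 1 unit per call. Endpoints and response fields follow the public YouTube Data API v3 docs; the env var name is an assumption.

import os
import requests

API = "https://www.googleapis.com/youtube/v3"
KEY = os.environ["YOUTUBE_API_KEY"]  # assumed env var name

def latest_uploads(channel_id: str, max_results: int = 10) -> list[dict]:
    # channels.list (1 quota unit): look up the channel's uploads playlist
    ch = requests.get(f"{API}/channels", params={
        "part": "contentDetails", "id": channel_id, "key": KEY,
    }, timeout=10).json()
    uploads = ch["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    # playlistItems.list (1 quota unit): newest uploads first
    pl = requests.get(f"{API}/playlistItems", params={
        "part": "snippet", "playlistId": uploads,
        "maxResults": max_results, "key": KEY,
    }, timeout=10).json()
    return [{
        "title": item["snippet"]["title"],
        "video_id": item["snippet"]["resourceId"]["videoId"],
        "published_at": item["snippet"]["publishedAt"],
    } for item in pl["items"]]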

6. Kay's Archive (assuming archive.org or similar)

  • Method: Archive.org API (free) or direct scraping
  • Cost: Free

Architecture

High-Level Flow

┌──────────────────────────────────────────────────────────────┐
│                       DATA COLLECTORS                        │
│ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌─────────┐ ┌─────────┐ │
│ │X/Twitter │ │ Facebook │ │ Email  │ │ RSS/Web │ │ YouTube │ │
│ │ Scraper  │ │ Scraper  │ │ Parser │ │ Scraper │ │   API   │ │
│ └────┬─────┘ └────┬─────┘ └───┬────┘ └────┬────┘ └────┬────┘ │
└──────┼────────────┼───────────┼───────────┼───────────┼──────┘
       └────────────┴───────────┼───────────┴───────────┘
                                ▼
                       ┌──────────────────┐
                       │   RAW CONTENT    │
                       │    DATABASE      │
                       │   (PostgreSQL)   │
                       └────────┬─────────┘
                                ▼
                       ┌──────────────────┐
                       │   STORY RANKER   │
                       │   (LLM Agent)    │
                       │                  │
                       │ - Relevance score│
                       │ - Topic tagging  │
                       │ - Urgency rating │
                       │ - Deduplication  │
                       └────────┬─────────┘
                                ▼
                       ┌──────────────────┐
                       │   STORY QUEUE    │
                       │   (Web App UI)   │
                       │                  │
                       │ - View stories   │
                       │ - Claim stories  │
                       │ - Mark complete  │
                       │ - Add sources    │
                       └──────────────────┘

Tech Stack Recommendation

Component      | Technology                         | Why
---------------|------------------------------------|---------------------------------------------
Backend        | Python + FastAPI                   | Fast to build, great for scraping/AI
Database       | Supabase (PostgreSQL)              | Generous free tier, real-time subscriptions
Job Scheduler  | Modal.com or Railway cron          | Serverless, pay-per-use
LLM Processing | OpenAI GPT-4o-mini or Claude Haiku | Cheap, fast, good enough
Frontend       | Next.js + Vercel                   | Fast deployment, free hosting
Scraping Infra | Apify or BrightData                | Handles anti-bot for you
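
For the scheduler row, a minimal Modal cron sketch (assuming the current Modal Python SDK; the app name and function body are placeholders):

import modal

app = modal.App("story-collectors")  # placeholder app name

@app.function(schedule=modal.Cron("*/15 * * * *"))  # every 15 minutes
def collect_all():
    # Run each collector (RSS, YouTube, newsletter email) and
    # upsert new items into the raw_content table in Supabase.
    ...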

Feature Breakdown

Core Features (MVP)

  1. Source Management

    • Add/remove X accounts to monitor
    • Add/remove Facebook pages
    • Add/remove RSS feeds
    • Add/remove YouTube channels
    • Subscribe/unsubscribe newsletters
  2. Story Queue

    • List of detected stories ranked by relevance/urgency
    • Filter by topic (Trump admin, Hegseth, Noem, etc.)
    • Filter by source type
    • Search functionality
  3. Story Claiming

    • Claim a story (locks it for you)
    • Mark as "in progress"
    • Mark as "published" (with link to content)
    • Mark as "passed" (won't do this one)
    • Release claim if you change your mind
  4. Topic Tracking

    • Pre-defined topics: Trump corruption, Hegseth, Noem, etc.
    • Custom topic creation
    • LLM auto-tags stories to topics
  5. Notifications

    • Slack/Discord integration for high-priority stories (webhook sketch below)
    • Daily digest email
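
The Slack half of the notifications item is a few lines against a standard incoming webhook; a sketch (the env var name is an assumption):

import os
import requests

def notify_slack(story_title: str, story_url: str) -> None:
    """Post a high-priority story to a Slack incoming webhook."""
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed env var name
    requests.post(webhook_url, json={
        "text": f"New high-priority story: {story_title}\n{story_url}",
    }, timeout=10)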

Nice-to-Have Features (V2)

  • Underreported story detection (comparing mainstream coverage)
  • Viral potential scoring
  • Suggested headlines/thumbnails
  • Cross-reference with trending topics
  • Performance feedback loop (learn from what works)

Cost Breakdown (Monthly)

Minimum Viable Setup (~$145-365/month)

Item                | Cost     | Notes
--------------------|----------|------------------------------
X/Twitter Data      | $50-100  | Via SocialData.tools or Apify
Facebook Scraping   | $40-80   | Apify or Meta Content Library
Proxy Service       | $20-50   | For news site scraping
LLM (GPT-4o-mini)   | $20-50   | ~500K tokens/day processing
Supabase            | $0-25    | Free tier likely sufficient
Vercel              | $0-20    | Free tier likely sufficient
Modal/Railway       | $10-30   | For scheduled jobs
Email (newsletters) | $5-10    | Fastmail or similar
TOTAL               | $145-365 |

More Robust Setup (~$500-800/month)

Add:

  • X Pro API ($5,000/month) → Only if volume demands it
  • Dedicated scraping infrastructure
  • More LLM usage for better analysis
  • Dedicated database instance

If X API is Required at Scale

The X Pro API at $5,000/month is brutal. Alternatives:

  1. Curate accounts carefully: Only monitor 20-30 high-signal accounts
  2. Use free search via Nitter mirrors: Unreliable but free
  3. Partner with existing tooling: Some newsrooms share API access
  4. Accept limited X coverage: Focus on other sources, use X manually

Development Timeline

Phase 1: MVP (2-3 weeks)

  • Set up database schema
  • Build RSS/news site collectors
  • Build newsletter email parser
  • Build YouTube channel monitor
  • Basic LLM story ranking
  • Simple web UI for viewing/claiming
  • Basic topic tagging

Phase 2: Social Media (1-2 weeks)

  • X/Twitter integration (via third-party)
  • Facebook page integration
  • Deduplication across sources

Phase 3: Polish (1 week)

  • Slack/Discord notifications
  • Better UI/UX
  • Admin panel for adding sources
  • Daily digest emails

Total: ~4-6 weeks for a solo developer


Database Schema (Simplified)

-- Sources to monitor
CREATE TABLE sources (
  id UUID PRIMARY KEY,
  type TEXT, -- 'twitter', 'facebook', 'rss', 'youtube', 'newsletter'
  identifier TEXT, -- handle, page_id, url, channel_id, email
  name TEXT,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMP
);

-- Raw collected content
CREATE TABLE raw_content (
  id UUID PRIMARY KEY,
  source_id UUID REFERENCES sources(id),
  external_id TEXT, -- tweet_id, post_id, etc.
  content TEXT,
  url TEXT,
  published_at TIMESTAMP,
  collected_at TIMESTAMP,
  metadata JSONB
);

-- Users (team members); created before stories so the foreign key below resolves
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT,
  name TEXT,
  role TEXT -- 'admin', 'editor', 'writer'
);

-- Processed stories
CREATE TABLE stories (
  id UUID PRIMARY KEY,
  raw_content_ids UUID[], -- can come from multiple sources
  title TEXT,
  summary TEXT,
  relevance_score FLOAT,
  urgency_score FLOAT,
  topics TEXT[],
  status TEXT DEFAULT 'new', -- 'new', 'claimed', 'in_progress', 'published', 'passed'
  claimed_by UUID REFERENCES users(id),
  claimed_at TIMESTAMP,
  published_url TEXT,
  created_at TIMESTAMP
);

-- Topics to track
CREATE TABLE topics (
  id UUID PRIMARY KEY,
  name TEXT,
  keywords TEXT[],
  active BOOLEAN DEFAULT true
);
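
The claiming flow from the feature list can be made race-safe with one conditional UPDATE against this schema; a sketch, with $1/$2 as bind parameters:

-- Atomically claim a story: succeeds only if it is still unclaimed
UPDATE stories
SET status = 'claimed', claimed_by = $1, claimed_at = now()
WHERE id = $2 AND status = 'new'
RETURNING id; -- zero rows returned means someone else claimed it first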

LLM Processing Pipeline

For each piece of raw content:

import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def process_content(content: RawContent) -> Story:
    # RawContent, Story, parse_story, and TOPIC_LIST are defined elsewhere
    prompt = f"""
Analyze this content for newsworthy story potential.

Content: {content.text}
Source: {content.source_type} - {content.source_name}

Evaluate:
1. Is this a distinct newsworthy story? (not just commentary)
2. Relevance score (0-1) for a progressive political audience
3. Urgency score (0-1) - is this time-sensitive?
4. Topics it relates to: {TOPIC_LIST}
5. Suggested headline (if newsworthy)
6. Brief summary (2-3 sentences)

Return JSON.
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # forces parseable JSON output
    )
    return parse_story(json.loads(response.choices[0].message.content))

Estimated cost per item: ~$0.001-0.002 (4o-mini pricing)


Key Risks & Mitigations

Risk                     | Impact | Mitigation
-------------------------|--------|---------------------------------------------------------------
X API cost prohibitive   | High   | Use third-party scrapers, limit accounts
Facebook blocks scraping | Medium | Apply for Meta Content Library access, use approved Graph API
LLM costs spiral         | Medium | Use cheapest model, batch processing, caching
Anti-bot detection       | Medium | Rotating proxies, rate limiting, human-like patterns
Story duplication        | Low    | Semantic similarity matching, URL deduping (sketch below)
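
For the story-duplication row, a minimal semantic-dedup sketch using OpenAI embeddings. The model name is a real current one, but the 0.88 similarity threshold is a guess to tune on real data.

import numpy as np
from openai import OpenAI  # pip install openai numpy

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def is_duplicate(new_title: str, seen_titles: list[str], threshold: float = 0.88) -> bool:
    """True if the new story is semantically close to one already collected."""
    if not seen_titles:
        return False
    vecs = embed([new_title] + seen_titles)
    new, rest = vecs[0], vecs[1:]
    # Cosine similarity between the new title and every seen title
    sims = rest @ new / (np.linalg.norm(rest, axis=1) * np.linalg.norm(new))
    return float(sims.max()) >= threshold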

Recommended Next Steps

  1. Define 20-30 must-have X accounts to monitor
  2. List all RSS feeds available from target news sites
  3. Subscribe to all newsletters with a dedicated email
  4. Set up Supabase and create the schema
  5. Build RSS collector first (lowest friction)
  6. Test LLM ranking with sample content
  7. Build minimal UI to view/claim stories
  8. Add social media collectors last (most complex)

Questions to Resolve

  1. How many X accounts need monitoring? (affects API cost decision)
  2. What's the team size using this? (affects claiming workflow)
  3. What Slack/Discord does the team use? (for notifications)
  4. Is Meta Content Library (formerly CrowdTangle) access available through any existing relationships?
  5. What's the acceptable delay? (real-time vs. hourly vs. daily batches)
  6. Should this integrate with existing Really American tooling?

Summary

Building this is very doable in the ~$145-365/month range if we're creative about X/Twitter access. The core architecture is straightforward: collectors → database → LLM ranking → web UI.

The main cost variable is X API access. If Really American absolutely needs full X coverage of many accounts, budget $5K/month just for that. Otherwise, we can work around it with third-party scrapers.

Development time: 4-6 weeks for MVP with a single developer.

Let me know which sources are highest priority and we can sequence accordingly.