Really American Story Identification System - Technical Specification
Date: December 11, 2025
For: Justin Horwitz / Really American Media
Author: Gary Sheng
Overview
A low-cost agent service that continuously monitors social media accounts, newsletters, and news sites to surface story opportunities for the Really American content team. Team members can claim stories, mark them as complete, and add new sources to monitor.
Data Sources to Monitor
1. X (Twitter) Accounts
- Method: X API v2 (Basic or Pro tier)
- Cost:
- Basic: $100/month (10,000 tweets read/month; far too few for continuous multi-account monitoring)
- Pro: $5,000/month (1M tweets read/month, full access)
- Recommendation: Start with a workaround (see below) or accept the $5K/month cost
X API Workarounds:
- Nitter instances (free but unreliable, frequently go down)
- Apify Twitter Scraper: ~$50-100/month for moderate usage
- SocialData.tools: ~$0.002 per tweet, likely $50-150/month
- RapidAPI Twitter alternatives: Various pricing, some as low as $30/month
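As an illustration of the third-party route, a rough sketch using the Apify Python client (pip install apify-client). The actor ID and input fields here are hypothetical; check the chosen scraper's documented input schema:

from apify_client import ApifyClient

client = ApifyClient("APIFY_TOKEN")  # placeholder: token from the Apify console

# Hypothetical actor ID and input schema; real scrapers differ.
run = client.actor("example/twitter-scraper").call(
    run_input={"handles": ["OccupyDemocrats"], "tweetsDesired": 50}
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"), item.get("text"))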
2. Facebook Pages (e.g., Occupy Democrats)
- Method: Meta Graph API (limited) or third-party scrapers
- Limitation: Facebook aggressively blocks scraping. Public pages can be accessed via the Graph API with an approved app.
- Cost: Free for approved apps, or Apify Facebook Scraper (~$40-100/month)
- Alternative: CrowdTangle was retired by Meta in August 2024; its successor, the Meta Content Library, is free but access is limited to approved researchers, so treat it as a long shot
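For reference, reading a public Page with the Graph API and an approved app looks roughly like this; the page ID, token, and API version are placeholders:

import requests

PAGE_ID = "OccupyDemocrats"  # placeholder
ACCESS_TOKEN = "PAGE_TOKEN"  # assumption: token from an approved Meta app

resp = requests.get(
    f"https://graph.facebook.com/v19.0/{PAGE_ID}/posts",
    params={
        "fields": "id,message,created_time,permalink_url",
        "access_token": ACCESS_TOKEN,
    },
    timeout=30,
)
resp.raise_for_status()
for post in resp.json().get("data", []):
    print(post.get("created_time"), post.get("message", "")[:80])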
3. Newsletters (Politico Playbook, etc.)
- Method: Email forwarding + parsing
- How it works:
- Subscribe with a dedicated email address
- Forward all emails to your service
- Parse with an LLM or regex
- Cost: Near-free (just email hosting, $5-10/month for domain email)
- Tools: Zapier/Make.com for email parsing, or custom Lambda function
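A minimal parsing sketch using the stdlib email module, assuming raw MIME arrives via whatever forwarding hook is chosen (Zapier webhook, SES, etc.):

import email
import re
from email import policy

def parse_newsletter(raw_mime: bytes) -> dict:
    msg = email.message_from_bytes(raw_mime, policy=policy.default)
    body = msg.get_body(preferencelist=("html", "plain"))
    text = body.get_content() if body else ""
    # Extract candidate story links, deduped but order-preserving.
    links = list(dict.fromkeys(re.findall(r'https?://[^\s"\'<>]+', text)))
    return {"subject": msg["subject"], "from": msg["from"], "links": links}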
4. News Sites (Daily Beast, Raw Story, Irish Star, etc.)
- Method: RSS feeds + web scraping fallback
- Many sites have RSS:
  - Daily Beast: https://www.thedailybeast.com/rss
  - Raw Story: https://www.rawstory.com/feed/
- For sites without RSS: Puppeteer/Playwright scraping with rotating proxies
- Cost:
- RSS: Free
- Scraping: Proxy costs ~$20-50/month (BrightData, ScrapingBee, etc.)
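As a sketch, the RSS collector is a few lines of feedparser (pip install feedparser); the yielded fields line up with the raw_content columns in the schema below:

import feedparser

FEEDS = [
    "https://www.thedailybeast.com/rss",
    "https://www.rawstory.com/feed/",
]

def collect_rss():
    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            yield {
                "external_id": entry.get("id") or entry.get("link"),
                "content": entry.get("title", "") + "\n" + entry.get("summary", ""),
                "url": entry.get("link"),
                "published_at": entry.get("published"),
            }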
5. YouTube Channels
- Method: YouTube Data API v3
- Cost: Free (10,000 quota units/day; a search.list call costs 100 units, so roughly 100 search-based checks per day, while playlistItems.list calls cost 1 unit each)
- Ample for monitoring maybe 50-100 channels
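A quota-friendly monitor sketch using the official client (pip install google-api-python-client); the API key is a placeholder:

from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YT_API_KEY")  # placeholder key

def latest_uploads(channel_id: str, n: int = 5):
    # Resolve the channel's uploads playlist (1 quota unit)...
    ch = youtube.channels().list(part="contentDetails", id=channel_id).execute()
    uploads = ch["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    # ...then read its most recent items (1 quota unit per call).
    items = youtube.playlistItems().list(
        part="snippet", playlistId=uploads, maxResults=n
    ).execute()
    for it in items["items"]:
        print(it["snippet"]["publishedAt"], it["snippet"]["title"])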
6. Kay's Archive (assuming archive.org or similar)
- Method: Archive.org API (free) or direct scraping
- Cost: Free
Architecture
High-Level Flow
┌─────────────────────────────────────────────────────────────┐
│                       DATA COLLECTORS                        │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │X/Twitter│ │Facebook │ │  Email  │ │ RSS/Web │ │ YouTube │ │
│ │ Scraper │ │ Scraper │ │ Parser  │ │ Scraper │ │   API   │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
└──────┼───────────┼───────────┼───────────┼───────────┼──────┘
       │           │           │           │           │
       └───────────┴───────────┼───────────┴───────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │   RAW CONTENT    │
                      │     DATABASE     │
                      │   (PostgreSQL)   │
                      └────────┬─────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │   STORY RANKER   │
                      │   (LLM Agent)    │
                      │                  │
                      │ - Relevance score│
                      │ - Topic tagging  │
                      │ - Urgency rating │
                      │ - Deduplication  │
                      └────────┬─────────┘
                               │
                               ▼
                      ┌──────────────────┐
                      │   STORY QUEUE    │
                      │   (Web App UI)   │
                      │                  │
                      │ - View stories   │
                      │ - Claim stories  │
                      │ - Mark complete  │
                      │ - Add sources    │
                      └──────────────────┘
Tech Stack Recommendation
| Component | Technology | Why |
|---|---|---|
| Backend | Python + FastAPI | Fast to build, great for scraping/AI |
| Database | Supabase (PostgreSQL) | Free tier generous, real-time subscriptions |
| Job Scheduler | Modal.com or Railway cron | Serverless, pay-per-use |
| LLM Processing | OpenAI GPT-4o-mini or Claude Haiku | Cheap, fast, good enough |
| Frontend | Next.js + Vercel | Fast deployment, free hosting |
| Scraping Infra | Apify or BrightData | Handles anti-bot for you |
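If Modal ends up as the scheduler, each collector can run as a cron-style serverless function. A sketch, assuming the project's own collectors module exists:

import modal

app = modal.App("really-american-collectors")

@app.function(schedule=modal.Cron("*/30 * * * *"))  # every 30 minutes
def run_collectors():
    from collectors import collect_rss  # hypothetical project module
    for item in collect_rss():
        print(item)  # stand-in for writing into raw_content (schema below)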
Feature Breakdown
Core Features (MVP)
1. Source Management
- Add/remove X accounts to monitor
- Add/remove Facebook pages
- Add/remove RSS feeds
- Add/remove YouTube channels
- Subscribe/unsubscribe newsletters
2. Story Queue
- List of detected stories ranked by relevance/urgency
- Filter by topic (Trump admin, Hegseth, Noem, etc.)
- Filter by source type
- Search functionality
3. Story Claiming (see the endpoint sketch after this list)
- Claim a story (locks it for you)
- Mark as "in progress"
- Mark as "published" (with link to content)
- Mark as "passed" (won't do this one)
- Release claim if you change your mind
4. Topic Tracking
- Pre-defined topics: Trump corruption, Hegseth, Noem, etc.
- Custom topic creation
- LLM auto-tags stories to topics
5. Notifications
- Slack/Discord integration for high-priority stories
- Daily digest email
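A minimal sketch of the claim endpoint, assuming FastAPI with psycopg against the stories schema below; the atomic UPDATE ensures two writers can't double-claim:

from fastapi import FastAPI, HTTPException
import psycopg

app = FastAPI()
DSN = "postgresql://user:pass@host/db"  # placeholder connection string

@app.post("/stories/{story_id}/claim")
def claim_story(story_id: str, user_id: str):
    with psycopg.connect(DSN) as conn:
        # Only rows still in 'new' can be claimed, so concurrent
        # claims on the same story cannot both succeed.
        row = conn.execute(
            """
            UPDATE stories
            SET status = 'claimed', claimed_by = %s, claimed_at = now()
            WHERE id = %s AND status = 'new'
            RETURNING id
            """,
            (user_id, story_id),
        ).fetchone()
    if row is None:
        raise HTTPException(status_code=409, detail="Already claimed or not found")
    return {"claimed": story_id}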
Nice-to-Have Features (V2)
- Underreported story detection (comparing mainstream coverage)
- Viral potential scoring
- Suggested headlines/thumbnails
- Cross-reference with trending topics
- Performance feedback loop (learn from what works)
Cost Breakdown (Monthly)
Minimum Viable Setup (~$200-400/month)
| Item | Cost | Notes |
|---|---|---|
| X/Twitter Data | $50-100 | Via SocialData.tools or Apify |
| Facebook Scraping | $40-80 | Apify or Meta Content Library |
| Proxy Service | $20-50 | For news site scraping |
| LLM (GPT-4o-mini) | $20-50 | ~500K tokens/day processing |
| Supabase | $0-25 | Free tier likely sufficient |
| Vercel | $0-20 | Free tier likely sufficient |
| Modal/Railway | $10-30 | For scheduled jobs |
| Email (newsletters) | $5-10 | Fastmail or similar |
| TOTAL | $145-365 | |
More Robust Setup (~$500-800/month)
Add:
- X Pro API ($5,000/month) → only if volume truly demands it; note this alone exceeds the range above
- Dedicated scraping infrastructure
- More LLM usage for better analysis
- Dedicated database instance
If X API is Required at Scale
The X Pro API at $5,000/month is brutal. Alternatives:
- Curate accounts carefully: Only monitor 20-30 high-signal accounts
- Use Nitter mirrors for free access: unreliable and frequently down
- Partner with existing tooling: Some newsrooms share API access
- Accept limited X coverage: Focus on other sources, use X manually
Development Timeline
Phase 1: MVP (2-3 weeks)
- Set up database schema
- Build RSS/news site collectors
- Build newsletter email parser
- Build YouTube channel monitor
- Basic LLM story ranking
- Simple web UI for viewing/claiming
- Basic topic tagging
Phase 2: Social Media (1-2 weeks)
- X/Twitter integration (via third-party)
- Facebook page integration
- Deduplication across sources
Phase 3: Polish (1 week)
- Slack/Discord notifications
- Better UI/UX
- Admin panel for adding sources
- Daily digest emails
Total: ~4-6 weeks for a solo developer
Database Schema (Simplified)
-- Sources to monitor
CREATE TABLE sources (
  id UUID PRIMARY KEY,
  type TEXT, -- 'twitter', 'facebook', 'rss', 'youtube', 'newsletter'
  identifier TEXT, -- handle, page_id, url, channel_id, email
  name TEXT,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMP
);

-- Raw collected content
CREATE TABLE raw_content (
  id UUID PRIMARY KEY,
  source_id UUID REFERENCES sources(id),
  external_id TEXT, -- tweet_id, post_id, etc.
  content TEXT,
  url TEXT,
  published_at TIMESTAMP,
  collected_at TIMESTAMP,
  metadata JSONB,
  UNIQUE (source_id, external_id) -- skip re-collected items at the DB level
);

-- Users (team members): created before stories so the claimed_by FK resolves
CREATE TABLE users (
  id UUID PRIMARY KEY,
  email TEXT,
  name TEXT,
  role TEXT -- 'admin', 'editor', 'writer'
);

-- Processed stories
CREATE TABLE stories (
  id UUID PRIMARY KEY,
  raw_content_ids UUID[], -- can come from multiple sources
  title TEXT,
  summary TEXT,
  relevance_score FLOAT,
  urgency_score FLOAT,
  topics TEXT[],
  status TEXT DEFAULT 'new', -- 'new', 'claimed', 'in_progress', 'published', 'passed'
  claimed_by UUID REFERENCES users(id),
  claimed_at TIMESTAMP,
  published_url TEXT,
  created_at TIMESTAMP
);

-- Topics to track
CREATE TABLE topics (
  id UUID PRIMARY KEY,
  name TEXT,
  keywords TEXT[],
  active BOOLEAN DEFAULT true
);
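An ingestion helper that pairs with the UNIQUE (source_id, external_id) guard above, so re-collected items are skipped instead of duplicated (psycopg 3; gen_random_uuid() needs PostgreSQL 13+):

import psycopg

def store_raw(conn: psycopg.Connection, source_id: str, item: dict):
    # ON CONFLICT makes collectors safely re-runnable.
    conn.execute(
        """
        INSERT INTO raw_content
            (id, source_id, external_id, content, url, published_at, collected_at)
        VALUES (gen_random_uuid(), %s, %s, %s, %s, %s, now())
        ON CONFLICT (source_id, external_id) DO NOTHING
        """,
        (source_id, item["external_id"], item["content"],
         item["url"], item["published_at"]),
    )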
LLM Processing Pipeline
For each piece of raw content:
import json
from openai import OpenAI  # official SDK; replaces the pseudocode llm.complete()

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def process_content(content: RawContent) -> Story:
    prompt = f"""
    Analyze this content for newsworthy story potential.
    Content: {content.text}
    Source: {content.source_type} - {content.source_name}
    Evaluate:
    1. Is this a distinct newsworthy story? (not just commentary)
    2. Relevance score (0-1) for progressive political audience
    3. Urgency score (0-1) - is this time-sensitive?
    4. Topics it relates to: {TOPIC_LIST}
    5. Suggested headline (if newsworthy)
    6. Brief summary (2-3 sentences)
    Return JSON.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # guarantees parseable output
    )
    return parse_story(json.loads(response.choices[0].message.content))
Estimated cost per item: ~$0.001-0.002 (4o-mini pricing)
Key Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| X API cost prohibitive | High | Use third-party scrapers, limit accounts |
| Facebook blocks scraping | Medium | Apply for Meta Content Library access, use approved Graph API |
| LLM costs spiral | Medium | Use cheapest model, batch processing, caching |
| Anti-bot detection | Medium | Rotating proxies, rate limiting, human-like patterns |
| Story duplication | Low | Semantic similarity matching, URL deduping |
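For the duplication row above, a hedged sketch of semantic matching with OpenAI embeddings; the 0.9 threshold is an assumption to tune against real data:

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def is_duplicate(new_title: str, existing: list[str], threshold: float = 0.9) -> bool:
    if not existing:
        return False
    vecs = embed([new_title] + existing)
    # These embeddings are unit-length, so dot product = cosine similarity.
    sims = vecs[1:] @ vecs[0]
    return float(sims.max()) >= threshold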
Recommended First Steps
- Define 20-30 must-have X accounts to monitor
- List all RSS feeds available from target news sites
- Subscribe to all newsletters with a dedicated email
- Set up Supabase and create the schema
- Build RSS collector first (lowest friction)
- Test LLM ranking with sample content
- Build minimal UI to view/claim stories
- Add social media collectors last (most complex)
Questions to Resolve
- How many X accounts need monitoring? (affects API cost decision)
- What's the team size using this? (affects claiming workflow)
- What Slack/Discord does the team use? (for notifications)
- Is Meta Content Library (CrowdTangle's successor) access available through any existing relationships?
- What's the acceptable delay? (real-time vs. hourly vs. daily batches)
- Should this integrate with existing Really American tooling?
Summary
Building this is very doable at the $200-400/month range if we're creative about X/Twitter access. The core architecture is straightforward: collectors → database → LLM ranking → web UI.
The main cost variable is X API access. If Really American absolutely needs full X coverage of many accounts, budget $5K/month just for that. Otherwise, we can work around it with third-party scrapers.
Development time: 4-6 weeks for MVP with a single developer.
Let me know which sources are highest priority and we can sequence accordingly.