cat.web is a thin domain wrapper around cat.stack that adds:
- Pass URLs as input_data and cat.web downloads each page, strips boilerplate, and classifies the body text in a single call.
- The source_domain, content_type, and web_metadata arguments inject relevant context into the classification prompt (“This is a news article from nytimes.com…”).

Everything else (supported models, output format, ensemble voting) is identical to cat.stack.

library(cat.web)

urls <- c(
"https://www.nytimes.com/2025/01/15/opinion/some-essay.html",
"https://www.nytimes.com/2025/01/16/us/breaking-news.html",
"https://www.nytimes.com/2025/01/17/technology/product-review.html"
)
results <- classify(
categories = c("News", "Opinion", "Tutorial/Review", "Other"),
input_data = urls,
source_domain = "nytimes.com",
content_type = "news article",
api_key = Sys.getenv("OPENAI_API_KEY"),
user_model = "gpt-4o-mini"
)

cat.web fetches each URL with a polite User-Agent, extracts the main content (dropping navigation, footers, and comment sections), and then runs the LLM classifier. The output data.frame includes the original URL, the extracted body text (or a snippet of it), and one 0/1 column per category.
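As a rough sketch of working with output in that shape, the snippet below builds a small mock data.frame and tallies the category columns. The column names (url, text, and the category names) and all values are assumptions made for illustration; the names in the data.frame that classify() actually returns may differ.

```r
# Mock result in the shape described above. Column names and values are
# illustrative assumptions, not actual cat.web output.
mock_results <- data.frame(
  url = c("https://example.com/a", "https://example.com/b", "https://example.com/c"),
  text = c("Essay text...", "Report text...", "Review text..."),
  News = c(0, 1, 0),
  Opinion = c(1, 0, 0),
  Tutorial.Review = c(0, 0, 1),
  Other = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Treat every column other than the URL and text as a 0/1 category indicator.
category_cols <- setdiff(names(mock_results), c("url", "text"))

# Number of pages assigned to each category.
category_counts <- colSums(mock_results[, category_cols])
print(category_counts)

# URLs of the pages flagged as Opinion.
opinion_urls <- mock_results$url[mock_results$Opinion == 1]
print(opinion_urls)
```

Because each category is its own indicator column, counting and filtering need nothing beyond base R.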
If you already have the page content (perhaps from a scraping pipeline), skip the fetch and pass strings directly:
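A minimal sketch of that pattern, assuming classify() accepts a character vector of page text in input_data in the same way it accepts URLs. The example strings and the example.com domain are placeholders, and the call is guarded so it only runs when the package and an API key are available.

```r
# Page text obtained elsewhere, e.g. from a scraping pipeline.
# These strings are placeholders for illustration.
page_texts <- c(
  "The city council voted on Tuesday to approve the new transit budget.",
  "Why I think the new transit budget is a mistake.",
  "Step by step: how to read a municipal budget document."
)

# Only call the API when cat.web is installed and a key is set.
if (requireNamespace("cat.web", quietly = TRUE) &&
    nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  results <- cat.web::classify(
    categories = c("News", "Opinion", "Tutorial/Review", "Other"),
    input_data = page_texts,        # assumption: text is accepted in place of URLs
    source_domain = "example.com",  # placeholder domain
    content_type = "news article",
    api_key = Sys.getenv("OPENAI_API_KEY"),
    user_model = "gpt-4o-mini"
  )
}
```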
The source_domain, content_type, and
web_metadata arguments inject context the model wouldn’t
otherwise have. This matters most for short pages or pages where the
domain affects meaning (an opinion on nytimes.com vs. a personal
blog).
results <- classify(
categories = c("Pro-policy", "Critical", "Neutral-explainer"),
input_data = urls,
source_domain = "vox.com",
content_type = "explainer article",
web_metadata = list(
section = "Policy",
audience = "general public",
style = "long-form explainer"
),
api_key = Sys.getenv("OPENAI_API_KEY"),
user_model = "gpt-4o-mini"
)

For long pages, summarize first to cut tokens and improve focus:
summaries <- summarize(
input_data = urls,
source_domain = "nytimes.com",
content_type = "news article",
format = "bullets",
api_key = Sys.getenv("OPENAI_API_KEY"),
user_model = "gpt-4o-mini"
)
results <- classify(
categories = c("Domestic", "International", "Business", "Other"),
input_data = summaries$summary,
api_key = Sys.getenv("OPENAI_API_KEY")
)

- cat.web doesn’t enforce crawl politeness; that’s on you. For large jobs, add row_delay = 1 (or higher) to space out requests.
- If you run several classifications over the same pages, keep the input_data from one fetch and re-use it.
- Increase timeout if you’re hitting slow or large pages; the default (30s) is short for some sites.

vignette("getting-started", package = "cat.llm")
?cat.web::classify, ?cat.web::extract, ?cat.web::explore, ?cat.web::summarize
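
The pacing and timeout tips above might be combined as in the sketch below. It assumes that row_delay and timeout are both passed to classify() and are both measured in seconds. The text only states the 30s default for timeout, so treat the unit of row_delay as a guess and check ?cat.web::classify. The call is guarded so it only runs when the package and an API key are available.

```r
urls <- c(
  "https://www.nytimes.com/2025/01/15/opinion/some-essay.html",
  "https://www.nytimes.com/2025/01/16/us/breaking-news.html"
)

# Only call the API when cat.web is installed and a key is set.
if (requireNamespace("cat.web", quietly = TRUE) &&
    nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  results <- cat.web::classify(
    categories = c("News", "Opinion", "Tutorial/Review", "Other"),
    input_data = urls,
    source_domain = "nytimes.com",
    content_type = "news article",
    row_delay = 1,  # space out requests (unit assumed to be seconds)
    timeout = 60,   # above the 30s default, for slow or large pages
    api_key = Sys.getenv("OPENAI_API_KEY"),
    user_model = "gpt-4o-mini"
  )
}
```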