---
title: "Classifying Web Content"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Classifying Web Content}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(eval = FALSE, comment = "#>", collapse = TRUE)
```

# What `cat.web` adds

`cat.web` is a thin domain wrapper around `cat.stack` that adds:

1. **Automatic URL fetching** — pass a vector of URLs as `input_data`
   and `cat.web` downloads each page, strips boilerplate, and
   classifies the body text in a single call.
2. **Web-context prompt injection** — `source_domain`, `content_type`,
   and `web_metadata` arguments inject relevant context into the
   classification prompt ("This is a news article from nytimes.com…").

Everything else — supported models, output format, ensemble voting —
is identical to `cat.stack`.

# Install

```{r install}
install.packages(
  "cat.web",
  repos = c("https://chrissoria.r-universe.dev",
            "https://cloud.r-project.org")
)
library(cat.web)
```

# Classify a list of URLs

```{r classify-urls}
urls <- c(
  "https://www.nytimes.com/2025/01/15/opinion/some-essay.html",
  "https://www.nytimes.com/2025/01/16/us/breaking-news.html",
  "https://www.nytimes.com/2025/01/17/technology/product-review.html"
)

results <- classify(
  categories    = c("News", "Opinion", "Tutorial/Review", "Other"),
  input_data    = urls,
  source_domain = "nytimes.com",
  content_type  = "news article",
  api_key       = Sys.getenv("OPENAI_API_KEY"),
  user_model    = "gpt-4o-mini"
)
```

`cat.web` fetches each URL with a polite User-Agent, extracts the main
content (dropping navigation, footers, comment sections), and then runs
the LLM classifier. The output `data.frame` includes the original URL,
the extracted body text (or a snippet of it), and one 0/1 column per
category.
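
To make that result shape concrete, here is a minimal sketch that tallies the
0/1 category columns. It runs on a mock data frame rather than real
`classify()` output; the column names (`url` plus one column per category) are
assumptions for illustration and may differ from what `cat.web` returns.

```{r inspect-results}
# Mock of the result shape described above; real column names may differ.
results <- data.frame(
  url     = c("https://example.com/a",
              "https://example.com/b",
              "https://example.com/c"),
  News    = c(1, 0, 0),
  Opinion = c(0, 1, 0),
  Other   = c(0, 0, 0),
  stringsAsFactors = FALSE
)

# Every column except the URL is treated as a 0/1 category indicator.
category_cols <- setdiff(names(results), "url")

# Number of pages assigned to each category.
category_counts <- colSums(results[category_cols])

# Pages that received no category at all are worth a manual look.
unlabeled <- results$url[rowSums(results[category_cols]) == 0]
```

Checking for rows with no category is a cheap way to catch pages where
extraction or classification may have gone wrong.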

# Classify raw text instead

If you already have the page content (perhaps from a scraping pipeline),
skip the fetch and pass strings directly:

```{r classify-text}
results <- classify(
  categories    = c("News", "Opinion", "Tutorial", "Other"),
  input_data    = df$article_text,  # df: your own data frame of page text
  source_domain = "example.com",
  content_type  = "blog post",
  api_key       = Sys.getenv("OPENAI_API_KEY")
)
```

# Use web context to disambiguate

The `source_domain`, `content_type`, and `web_metadata` arguments
inject context the model wouldn't otherwise have. This matters most for
short pages, and for pages where the publishing domain changes how the
text should be read (an opinion piece on nytimes.com vs. the same piece
on a personal blog).

```{r web-context}
results <- classify(
  categories    = c("Pro-policy", "Critical", "Neutral-explainer"),
  input_data    = urls,  # here, a vector of vox.com article URLs
  source_domain = "vox.com",
  content_type  = "explainer article",
  web_metadata  = list(
    section = "Policy",
    audience = "general public",
    style = "long-form explainer"
  ),
  api_key       = Sys.getenv("OPENAI_API_KEY"),
  user_model    = "gpt-4o-mini"
)
```
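
As a rough mental model of what "injecting context" means, the sketch below
builds a context sentence from the three arguments. `build_context_prefix()`
is a hypothetical helper written for this vignette, not a `cat.web` function,
and the wording the package actually sends to the model may differ.

```{r context-prefix-sketch}
# Hypothetical helper: NOT part of cat.web, only illustrates the idea.
build_context_prefix <- function(source_domain, content_type,
                                 web_metadata = list()) {
  prefix <- sprintf("This is a %s from %s.", content_type, source_domain)
  if (length(web_metadata) > 0) {
    # Render each metadata entry as "name: value", separated by semicolons.
    details <- paste(names(web_metadata), unlist(web_metadata),
                     sep = ": ", collapse = "; ")
    prefix <- paste0(prefix, " Additional context: ", details, ".")
  }
  prefix
}

build_context_prefix(
  source_domain = "nytimes.com",
  content_type  = "news article",
  web_metadata  = list(section = "Policy", audience = "general public")
)
```

Seeing the context as plain text makes it easier to judge whether a given
`web_metadata` entry is likely to help or just add noise.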

# Summarize before classifying

For long pages, summarize first to cut tokens and improve focus:

```{r summarize-first}
summaries <- summarize(
  input_data    = urls,
  source_domain = "nytimes.com",
  content_type  = "news article",
  format        = "bullets",
  api_key       = Sys.getenv("OPENAI_API_KEY"),
  user_model    = "gpt-4o-mini"
)

results <- classify(
  categories = c("Domestic", "International", "Business", "Other"),
  input_data = summaries$summary,
  api_key    = Sys.getenv("OPENAI_API_KEY")
)
```
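
If only some pages are long, a simple length check can decide which ones are
worth summarizing first. The sketch below uses base R only; the 8,000-character
cutoff is an arbitrary assumption, not a `cat.web` default, and the texts are
stand-ins for page bodies you already hold.

```{r length-check}
# Stand-in page bodies; in practice these would be extracted article texts.
texts <- c(
  short = "A brief note.",
  long  = strrep("Long article text. ", 1000)
)

# Assumed cutoff in characters; tune it to your model and budget.
max_chars <- 8000

# TRUE for pages that exceed the cutoff.
is_long <- nchar(texts) > max_chars

to_summarize <- texts[is_long]   # candidates for summarize()
to_classify  <- texts[!is_long]  # short enough to classify directly
```

Character counts are only a proxy for tokens, but they are cheap to compute
and good enough for a first split.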

# Tips for web-data work

1. **Respect robots.txt and rate limits.** `cat.web` doesn't enforce
   crawl politeness — that's on you. For large jobs, add `row_delay = 1`
   (or higher) to space out requests.
2. **Validate the extracted text.** Boilerplate stripping isn't
   perfect; for some sites the model may end up classifying a cookie
   banner. Spot-check a sample of inputs before scaling.
3. **Cache aggressively.** Fetching the same URLs repeatedly during
   development wastes bandwidth and bumps you up against rate limits.
   Save the extracted text from the first fetch and pass it back as
   `input_data` on later runs.
4. **Raise `timeout`** if you're hitting slow or large pages; the
   default (30s) is too short for some sites.
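
The caching and spot-checking tips can be sketched with base R alone. The
`cached()` helper below is hypothetical (it is not part of `cat.web`), and the
data frame inside it is a stand-in for text kept from a real first run; the
`url` and `text` column names are assumptions.

```{r cache-sketch}
# Hypothetical helper, not part of cat.web: run an expensive step once and
# reuse the saved result on later calls.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  value <- compute()
  saveRDS(value, path)
  value
}

cache_file <- file.path(tempdir(), "fetched_pages.rds")

# Stand-in for the real fetch; in practice this would be the extracted text
# kept from a first cat.web run.
pages <- cached(cache_file, function() {
  data.frame(
    url  = c("https://example.com/a", "https://example.com/b"),
    text = c("First page body.", "Second page body."),
    stringsAsFactors = FALSE
  )
})

# Spot-check one randomly chosen page before scaling up.
pages[sample(nrow(pages), 1), ]
```

On later runs `cached()` reads the `.rds` file instead of calling `compute`,
so `pages$text` can be passed as `input_data` without re-fetching. For a cache
that persists across sessions, use a project path instead of `tempdir()`.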

# Where to learn more

- Full Getting Started guide:
  `vignette("getting-started", package = "cat.llm")`
- Per-function reference: `?cat.web::classify`, `?cat.web::extract`,
  `?cat.web::explore`, `?cat.web::summarize`
- Companion package for higher-precision retrieval:
  [llm-web-research](https://pypi.org/project/llm-web-research/)
  (Python only currently).