---
title: "Getting Started with CatLLM for R"
author: "Chris Soria"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with CatLLM for R}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  eval = FALSE,
  comment = "#>",
  collapse = TRUE
)
```

# Introduction

**CatLLM** is an ecosystem of R packages that use large language models (LLMs) to categorize open-ended text — survey responses, social media posts, academic papers, policy documents, web content — at scale. It's designed for researchers who want quantitative analysis of free-text data without manual coding or hiring research assistants.

CatLLM achieves **98% accuracy compared to human consensus** on classification tasks using leading models such as GPT-5, Gemini 2.5 Pro, and Qwen 3. These results were validated against expert human coders across 21 LLMs and 4 surveys; see the [SocArXiv preprint](https://osf.io/preprints/socarxiv/gjvcf_v1) for methodology.

The R packages are thin [reticulate](https://rstudio.github.io/reticulate/) wrappers around the underlying Python implementation. **Every parameter, default, and behavior is identical to the Python version** — only the calling syntax differs. For deep conceptual content, advanced configuration, or the full 50-parameter `classify()` reference, the Python [cat-llm README](https://github.com/chrissoria/cat-llm#readme) is the canonical source.


---

# Installation

Install the meta-package (brings in all 7 sub-packages) from R-universe:

```{r install-meta}
install.packages(
  "cat.llm",
  repos = c("https://chrissoria.r-universe.dev", "https://cloud.r-project.org")
)
```

Or install a single domain package for a lighter footprint:

```{r install-single}
install.packages(
  c("cat.stack", "cat.survey"),
  repos = c("https://chrissoria.r-universe.dev", "https://cloud.r-project.org")
)
```

One-time setup: install the Python backend (requires Python 3.9+ on your system):

```{r install-python}
library(cat.llm)
install_cat_stack()

# With PDF processing support:
# install_cat_stack(pdf = TRUE)
```

---

# Quick Start

CatLLM is designed for **building datasets at scale**, not one-off queries. While you can classify individual responses, its primary purpose is batch processing entire text columns, image collections, or PDF corpora into structured research datasets. All outputs are R `data.frame`s ready for analysis or CSV export.

## Option A — via the meta-package

`library(cat.llm)` attaches every domain package and exposes domain-suffixed aliases (`classify_survey()`, `classify_political()`, `classify_social()`, etc.):

```{r quickstart-meta}
library(cat.llm)

api_key <- Sys.getenv("OPENAI_API_KEY")

# Domain-neutral classification (from cat.stack)
results <- classify(
  input_data  = c("I love this product!", "Terrible experience.", "It was fine."),
  categories  = c("Positive", "Negative", "Neutral"),
  description = "Customer feedback sentiment",
  api_key     = api_key
)

# Survey classification — adds survey-tuned prompts
results <- classify_survey(
  input_data      = df$responses,
  categories      = c("Job change", "Family reasons", "Cost of living"),
  survey_question = "Why did you move to a new city?",
  api_key         = api_key
)

# Academic paper classification — fetches by journal
results <- classify_academic(
  input_data   = NULL,
  categories   = c("Empirical", "Theoretical", "Review"),
  journal_issn = "0894-4393",
  paper_limit  = 50L,
  polite_email = "you@university.edu",
  api_key      = api_key
)

# Social media classification
results <- classify_social(
  input_data = df$posts,
  categories = c("Misinformation", "Opinion", "News"),
  api_key    = api_key
)

# Political text classification (built-in registered sources)
results <- classify_political(
  source     = "city_san_diego",
  doc_type   = "ordinance",
  since      = "2025-01-01",
  n          = 50L,
  categories = c("Housing", "Public Safety", "Finance"),
  api_key    = api_key
)

# Cognitive assessment scoring (CERAD drawings)
scores <- cerad_drawn_score(
  shape       = "diamond",
  image_input = df$drawing_paths,
  api_key     = api_key
)
```

## Option B — install only the domain you need

For a lighter dependency footprint, install only the package you actually use:

```{r quickstart-single}
# install.packages("cat.survey", repos = ...)
library(cat.survey)

results <- cat.survey::classify(
  input_data      = df$responses,
  categories      = c("Job change", "Family reasons", "Cost of living"),
  survey_question = "Why did you move to a new city?",
  api_key         = Sys.getenv("OPENAI_API_KEY")
)
```

The two options produce identical results — `classify_survey()` from `cat.llm` is just a thin re-export of `cat.survey::classify()`.

---

# The Ecosystem

| Package        | Domain                              | Wraps                                                                   |
|----------------|-------------------------------------|-------------------------------------------------------------------------|
| **cat.stack**  | General-purpose classification base | `classify`, `extract`, `explore`, `summarize`                           |
| **cat.survey** | Open-ended survey responses         | Adds `survey_question=` framing                                         |
| **cat.vader**  | Social media posts                  | Platform connectors (Threads, Reddit, Bluesky, etc.)                    |
| **cat.ademic** | Academic papers                     | OpenAlex-based journal/topic fetching, PDF support                      |
| **cat.cog**    | Cognitive assessment scoring        | `cerad_drawn_score()` for CERAD constructional praxis                   |
| **cat.pol**    | Policy documents                    | 17 registered sources (ordinances, federal laws, EOs, political speech) |
| **cat.web**    | Web content                         | Automatic URL fetching, web-context prompt injection                    |
| **cat.llm**    | Meta-package (installs all 7)       | Re-exports + domain-suffixed aliases                                    |

Every domain package shares the same core API — `classify()`, `extract()`, `explore()`, `summarize()` (where applicable) — and depends on `cat.stack`, which holds the underlying classification engine.

---

# Best Practices for Classification

These recommendations are based on empirical testing across 4 surveys, 4 models (7B to frontier-class), and 250-row subsamples compared against human-coded ground truth. They apply identically to R and Python.

## What works

**Detailed category descriptions** — the single biggest lever for accuracy. Instead of short labels like `"Job change"`, use verbose descriptions like `"The person had a job or school or career change, including transferred and retired."` Consistently improves accuracy by several percentage points across all models.

```{r best-detailed}
verbose_categories <- c(
  "Job/school: A change in employment, education, or career, including transfers and retirement.",
  "Family: Relationship changes, having children, supporting relatives, or relocating to be near family.",
  "Cost of living: Housing affordability, cost of goods, or general economic pressure.",
  "Other: The response does not fit any of the above categories."
)

results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  api_key    = Sys.getenv("OPENAI_API_KEY")
)
```

**Include an "Other" category** — a catch-all like `"Other: The response does not fit any of the above categories."` prevents the model from forcing ambiguous responses into ill-fitting categories.
By default, R wrappers will prompt to add one if your category list lacks one (`add_other = "prompt"`).

**Few-shot examples** (`example1`–`example6`) — providing 2–4 labeled examples can help, especially for weaker models. Effects are modest (+0–1 pp on average) and model-dependent.

```{r best-fewshot}
results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  example1   = list(text = "Took a new job in Chicago", label = "Job/school"),
  example2   = list(text = "Wanted to be closer to grandkids", label = "Family"),
  api_key    = Sys.getenv("OPENAI_API_KEY")
)
```

**Low temperature** (`creativity = 0`) — for classification, deterministic output is preferable. Higher temperatures add noise without improving accuracy.

## What doesn't help (or hurts)

- **Chain of Thought** (`chain_of_thought = TRUE`): no measurable improvement in our testing; slightly degraded performance for some models. Off by default.
- **Chain of Verification** (`chain_of_verification = TRUE`): uses ~4x the API calls for self-verification. Consistently *reduced* accuracy by 1–2 pp by retracting correct classifications. Not recommended for classification.
- **Step-back prompting** (`step_back_prompt = TRUE`): inconsistent — slight gains for weaker models (~+1.8 pp), slight losses for stronger ones (~−0.5 pp). Not recommended as a default.
- **Context prompting** (`context_prompt = TRUE`): no consistent benefit observed.

## Summary

The most effective approach is straightforward: **write detailed category descriptions, include an "Other" category, use a capable model at low temperature.** Advanced prompting adds complexity and cost without reliable gains for classification.

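Because these recommendations come from comparing model output with human-coded labels, it can be worth repeating that check on a small hand-coded subsample of your own data. The following is a minimal base-R sketch under stated assumptions: the `check` data frame and its `human` and `model` columns are hypothetical names chosen for illustration, not columns that CatLLM is known to produce.

```r
# Hypothetical hand-coded subsample: `human` is the coder's label and
# `model` is the label assigned by the LLM (illustrative column names).
check <- data.frame(
  human = c("Job/school", "Family", "Cost of living", "Other", "Family"),
  model = c("Job/school", "Family", "Cost of living", "Family", "Family"),
  stringsAsFactors = FALSE
)

# Overall agreement: share of rows where the two labels match.
agreement <- mean(check$human == check$model)

# Per-category agreement, to see which definitions may need more detail.
by_category <- tapply(check$human == check$model, check$human, mean)

# Confusion table: rows are human labels, columns are model labels.
confusion <- table(human = check$human, model = check$model)

print(agreement)
print(by_category)
print(confusion)
```

Categories with low per-category agreement are natural candidates for a more detailed description, and the confusion table shows which category the model is choosing instead.
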

---

# Configuration

## Get an API key

Get an API key from your preferred provider:

| Provider    | Where                                                                    |
|-------------|--------------------------------------------------------------------------|
| OpenAI      | [platform.openai.com](https://platform.openai.com)                       |
| Anthropic   | [console.anthropic.com](https://console.anthropic.com)                   |
| Google      | [aistudio.google.com](https://aistudio.google.com)                       |
| HuggingFace | [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) |
| xAI         | [console.x.ai](https://console.x.ai)                                     |
| Mistral     | [console.mistral.ai](https://console.mistral.ai)                         |
| Perplexity  | [perplexity.ai/settings/api](https://www.perplexity.ai/settings/api)     |

Most providers require adding a payment method. Store your key securely and never share it publicly.

## Store your key in `.Renviron` (recommended)

Rather than pasting your key into scripts, store it in `~/.Renviron` so it's automatically available to every R session:

```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
```

Then restart R and use:

```{r config-getenv}
api_key <- Sys.getenv("OPENAI_API_KEY")
```

To find or edit your `.Renviron`:

```{r config-renviron}
usethis::edit_r_environ()  # opens it for editing; creates if missing
```

After saving, restart R for the changes to take effect.

## Run entirely locally with no API key

For sensitive data, free use, or air-gapped environments, run against a local model via [Ollama](https://ollama.com):

```{r config-ollama}
# In a separate terminal: install Ollama, then pull a model.
# Recommended (larger, more accurate, ~9 GB):
#   ollama pull qwen2.5:14b
# Smaller fallback if disk/RAM constrained (~4.7 GB):
#   ollama pull qwen2.5:7b

results <- classify(
  input_data   = df$responses,
  categories   = c("Positive", "Negative", "Neutral"),
  user_model   = "qwen2.5:14b",  # or "qwen2.5:7b" if you pulled the smaller one
  model_source = "ollama"
)
```

> ⚠️ **Disk-space heads-up:** `qwen2.5:14b` is ~9 GB on disk and Ollama
> needs roughly that much *free* during the download. Check
> `df -h /` first — if under ~12 GB free, use `qwen2.5:7b`.

No API key needed; your data never leaves the machine.

---

# Supported Models

Specify any of these via `user_model = "..."`:

- **OpenAI**: `gpt-4o`, `gpt-4o-mini`, `gpt-4`, `gpt-5`, ...
- **Anthropic**: `claude-sonnet-4-20250514`, `claude-3-5-sonnet-20241022`, `claude-3-5-haiku-20241022`, ...
- **Google**: `gemini-2.5-flash`, `gemini-2.5-pro`, ...
- **HuggingFace**: `Qwen/Qwen3-235B`, `meta-llama/Llama-4-Scout`, `deepseek-ai/DeepSeek-V3`, and thousands of community models
- **xAI**: `grok-2`, ...
- **Mistral**: `mistral-large-latest`, `pixtral-large-latest`, ...
- **Perplexity**: `sonar-large`, `sonar-small`, ...
- **Ollama (local)**: `qwen2.5:14b` (recommended, ~9 GB), `qwen2.5:7b` (smaller fallback, ~4.7 GB), `llama3.1:8b`, ... (set `model_source = "ollama"`)

**Fully tested:** OpenAI, Anthropic, Perplexity, Google Gemini (free tier has 5 RPM limit), HuggingFace, xAI, Mistral.

For best results when starting out, **OpenAI** (`gpt-4o-mini`) or **Anthropic** (`claude-3-5-haiku-20241022`) are cheap, fast, and reliable.

---

# Ensemble & multi-model classification

Run the same input through multiple models and combine results via majority voting. This often improves accuracy by reducing individual model biases.
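
To make the voting step concrete, here is a minimal base-R sketch of how a consensus label could be derived from several models' predictions. It is an illustration only, not CatLLM's implementation: the `consensus_label()` helper, its `threshold` argument, and the example predictions are all hypothetical, and the package's own tie-breaking and threshold semantics may differ.

```r
# Hypothetical helper: return the most common label if its share of the
# votes exceeds `threshold`, otherwise NA (no consensus).
consensus_label <- function(votes, threshold = 0.5) {
  counts <- table(votes)
  top <- names(counts)[which.max(counts)]
  share <- max(counts) / length(votes)
  if (share > threshold) top else NA_character_
}

# Illustrative predictions from three models for three responses.
preds <- data.frame(
  model_a = c("Positive", "Negative", "Neutral"),
  model_b = c("Positive", "Negative", "Positive"),
  model_c = c("Positive", "Neutral",  "Negative"),
  stringsAsFactors = FALSE
)

# Apply the vote row by row; unlist() turns each row into a character vector.
consensus <- vapply(
  seq_len(nrow(preds)),
  function(i) consensus_label(unlist(preds[i, ])),
  character(1)
)
print(consensus)
```

In this sketch, raising `threshold` makes the vote stricter, and rows without enough agreement come back as `NA` so they can be reviewed by hand. The `classify()` call that follows asks the package itself to combine predictions from three providers.
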

```{r ensemble}
results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  models = list(
    c("gpt-4o-mini", "openai", Sys.getenv("OPENAI_API_KEY")),
    c("claude-3-5-haiku-20241022", "anthropic", Sys.getenv("ANTHROPIC_API_KEY")),
    c("gemini-2.5-flash", "google", Sys.getenv("GOOGLE_API_KEY"))
  ),
  consensus_threshold = "unanimous"  # or 0.5 for majority, etc.
)
```

The output `data.frame` includes per-model predictions (e.g., `category_1_gpt_4o_mini`, `category_1_claude`) plus a consensus column.

---

# API Reference (brief)

Every parameter from the Python `classify()`, `extract()`, `explore()`, and `summarize()` functions is exposed in R with identical semantics. The full per-parameter documentation lives in the in-R help system and on the R-universe per-package reference manuals.

| Function                       | In-R help                     | Online                                                                            |
|--------------------------------|-------------------------------|-----------------------------------------------------------------------------------|
| Domain-neutral `classify()`    | `?cat.stack::classify`        | [cat.stack manual](https://chrissoria.r-universe.dev/cat.stack/doc/manual.html)   |
| Survey `classify()`            | `?cat.survey::classify`       | [cat.survey manual](https://chrissoria.r-universe.dev/cat.survey/doc/manual.html) |
| Academic `classify()`          | `?cat.ademic::classify`       | [cat.ademic manual](https://chrissoria.r-universe.dev/cat.ademic/doc/manual.html) |
| Political `classify()`         | `?cat.pol::classify`          | [cat.pol manual](https://chrissoria.r-universe.dev/cat.pol/doc/manual.html)       |
| Web `classify()`               | `?cat.web::classify`          | [cat.web manual](https://chrissoria.r-universe.dev/cat.web/doc/manual.html)       |
| Social `classify()`            | `?cat.vader::classify`        | [cat.vader manual](https://chrissoria.r-universe.dev/cat.vader/doc/manual.html)   |
| CERAD scoring                  | `?cat.cog::cerad_drawn_score` | [cat.cog manual](https://chrissoria.r-universe.dev/cat.cog/doc/manual.html)       |
| List registered policy sources | `?cat.pol::list_sources`      | [cat.pol manual](https://chrissoria.r-universe.dev/cat.pol/doc/manual.html)       |

For full conceptual coverage of every parameter — batch mode, prompt tuning, embeddings, JSON formatting, advanced ensemble configurations — see the **[Python README API Reference](https://github.com/chrissoria/cat-llm#api-reference)**. The R wrappers expose every Python kwarg.

---

# R ↔ Python type translation

When adapting Python examples from the project README, the table below covers the syntax differences. All conversions are handled automatically by `reticulate::r_to_py()` inside the R wrappers — you write R, the wrapper passes Python.

| Python                                       | R                                                 |
|----------------------------------------------|---------------------------------------------------|
| `["a", "b", "c"]`                            | `c("a", "b", "c")`                                |
| `{"key": "value"}`                           | `list(key = "value")`                             |
| `True` / `False` / `None`                    | `TRUE` / `FALSE` / `NULL`                         |
| `[(model, provider, key), (...)]` (ensemble) | `list(c(model, provider, key), c(...))`           |
| `df['col']`                                  | `df$col`                                          |
| `import catllm`                              | `library(cat.llm)`                                |
| `catllm.classify_survey(...)`                | `classify_survey(...)` (after `library(cat.llm)`) |

---

# Where to go from here

- **Full conceptual reference**: the Python [cat-llm README](https://github.com/chrissoria/cat-llm#readme) — covers every parameter, advanced configuration, prompt tuning, embeddings, etc. Since R is a thin reticulate layer, every Python concept applies directly.
- **Per-package R reference manuals**: https://chrissoria.r-universe.dev — pick a package, then click the "Reference Manual" link for full `@param` docs.
- **End-to-end smoke test**: see `r-package/test-all-packages.R` in the [GitHub repo](https://github.com/chrissoria/cat-llm/tree/main/r-package) — a single R script that installs all 8 packages and runs a minimal classification per package.
- **Issues, questions, contributions**: [github.com/chrissoria/cat-llm/issues](https://github.com/chrissoria/cat-llm/issues)
- **Citation** — if you use CatLLM in published research, please cite:

  > Soria, C. (2026). *Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting.* SocArXiv.

  and the software DOI:

  > Soria, C. (2026). *CatLLM: A Reproducible Python Ecosystem for Generating, Assigning, and Scoring Open-Ended Text, Images, and Documents Across Research Domains* (v3.0.0) [Software]. Zenodo.