---
title: "Classifying Open-Ended Survey Responses"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Classifying Open-Ended Survey Responses}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(eval = FALSE, comment = "#>", collapse = TRUE)
```

# What `cat.survey` adds

`cat.survey` is a thin domain wrapper around `cat.stack` that injects
**survey-question context** into every prompt. When you call
`classify(input_data = ..., survey_question = "Why did you move?")`,
the LLM sees:

> "A respondent was asked: *Why did you move?* Their answer was: *...*"

That framing measurably improves accuracy on open-ended survey data
versus generic text classification, because the model uses the question
to disambiguate short or context-dependent responses.

Everything else (supported models, output format, ensemble voting,
batch mode) is identical to `cat.stack`.
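
To make the framing concrete, the sketch below builds a prompt of that
shape by hand. `build_survey_prompt()` is a hypothetical helper written
for illustration, not a `cat.survey` function, and the package's actual
prompt template may differ in wording.

```{r prompt-sketch}
# Hypothetical helper, not part of cat.survey: illustrates the framing only.
build_survey_prompt <- function(survey_question, answer) {
  sprintf(
    "A respondent was asked: %s Their answer was: %s",
    survey_question, answer
  )
}

# sprintf() recycles the single question across all answers.
prompts <- build_survey_prompt(
  "Why did you move?",
  c("Rent", "Took a new job in Chicago")
)
prompts
```

A one-word answer like "Rent" is hard to classify on its own; paired
with the question, it reads as a reason for moving.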

# Install

```{r install}
install.packages(
  "cat.survey",
  repos = c("https://chrissoria.r-universe.dev",
            "https://cloud.r-project.org")
)
library(cat.survey)
```

# Quick classification

```{r quick}
responses <- c(
  "Took a new job in Chicago",
  "Wanted to be closer to grandkids",
  "Couldn't afford rent in the Bay Area",
  "Job market collapsed after the layoffs",
  "Family pressure to move home"
)

# Verbose category descriptions classify better than short labels.
verbose_cats <- c(
  "Job/school: A change in employment, education, or career, including transfers and retirement.",
  "Family: Relationship changes, having children, supporting relatives, or relocating to be near family.",
  "Cost of living: Housing affordability, cost of goods, or general economic pressure.",
  "Other: The response does not fit any of the above categories."
)

results <- classify(
  input_data      = responses,
  categories      = verbose_cats,
  survey_question = "Why did you move to a new city?",
  api_key         = Sys.getenv("OPENAI_API_KEY"),
  user_model      = "gpt-4o-mini"
)
```
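
The call above reads the key from the `OPENAI_API_KEY` environment
variable. `Sys.getenv()` returns an empty string when a variable is
unset, so a missing key may only surface once the request fails. A small
guard can fail earlier; `require_env()` below is a hypothetical helper,
not part of `cat.survey`.

```{r check-key}
# Hypothetical helper: stop with a clear message if the variable is
# unset or empty, otherwise return its value invisibly.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set.", call. = FALSE)
  }
  invisible(value)
}
```

Calling `require_env("OPENAI_API_KEY")` before `classify()` is one way
to use it.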

# Multi-label survey responses

Many survey responses fit more than one category ("I moved for a new job
and to be closer to family"). The default classifier is multi-label:
`results` has one 0/1 column per category, and a row can contain
multiple 1s.

To force single-label, set `add_other = FALSE` and shrink the category
list. To make multi-label explicit in your analysis, use the binary
columns directly:

```{r multi-label}
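# Example downstream summary: share of responses in each category.
# The column names below are illustrative; check names(results) for the
# actual names in your output.
library(dplyr)
results %>%
  summarize(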
    pct_job    = mean(`Job/school`),
    pct_family = mean(Family),
    pct_cost   = mean(`Cost of living`),
    pct_other  = mean(Other)
  )
```
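
The same idea can be tried without an API call. The sketch below uses a
hand-made data frame as a stand-in for `results` (the column names and
values are assumptions, not real package output) and computes both the
per-category shares and the share of responses with more than one label.

```{r multi-label-sketch}
# Mock stand-in for `results`: one 0/1 column per category (assumed names).
mock_results <- data.frame(
  job    = c(1, 0, 0, 1, 0),
  family = c(0, 1, 0, 0, 1),
  cost   = c(0, 0, 1, 1, 0),
  other  = c(0, 0, 0, 0, 0)
)

# Share of responses assigned to each category.
category_share <- colMeans(mock_results)

# Number of categories assigned to each response, and the share with 2+.
labels_per_row   <- rowSums(mock_results)
share_multilabel <- mean(labels_per_row > 1)
```

Because rows can carry several 1s, the per-category shares need not sum
to 1.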

# Discovering a category scheme when you don't have one

If you don't already have a coding scheme, use `extract()` to discover
one from the responses themselves, then pass the result to `classify()`:

```{r extract-then-classify}
cats <- extract(
  input_data      = responses,
  survey_question = "Why did you move to a new city?",
  max_categories  = 8L,
  api_key         = Sys.getenv("OPENAI_API_KEY")
)
cats$top_categories

# Optionally rewrite the labels to be more verbose, then classify:
results <- classify(
  input_data      = responses,
  categories      = cats$top_categories,
  survey_question = "Why did you move to a new city?",
  api_key         = Sys.getenv("OPENAI_API_KEY")
)
```
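
The optional rewriting step mentioned in the chunk above could look
roughly like this. The short labels and descriptions here are invented
for illustration and stand in for whatever `cats$top_categories`
contains; only base R is used.

```{r verbose-labels-sketch}
# Hypothetical short labels, standing in for cats$top_categories.
short_labels <- c("Employment", "Family", "Housing costs")

# Hand-written descriptions keyed by label (illustrative wording).
descriptions <- c(
  "Employment"    = "A new job, transfer, layoff, or other change in work.",
  "Family"        = "Moving to be near relatives or because of family changes.",
  "Housing costs" = "Rent, home prices, or general affordability."
)

# Append the description where one exists; keep the bare label otherwise.
has_desc <- short_labels %in% names(descriptions)
verbose_labels <- ifelse(
  has_desc,
  paste0(short_labels, ": ", descriptions[short_labels]),
  short_labels
)
```

The resulting character vector has the same "Label: description" shape
as `verbose_cats` in the quick-classification example.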

See also `extract.Rmd` in the `r-package/examples/` directory for a
deeper walkthrough of category discovery.

# Recommendations for survey work

1. **Always set `survey_question`.** It is the whole point of using
   `cat.survey` over `cat.stack`. Without it you might as well use
   `cat.stack::classify()` directly.
2. **Write verbose category descriptions.** A label like
   `"Family: relocating to be near family, having a child, divorce..."`
   classifies several percentage points more accurately than just
   `"Family"`. This is the single biggest accuracy lever.
3. **Include an "Other" category.** This keeps the model from forcing
   ambiguous responses into ill-fitting boxes. `cat.survey` prompts you
   to add one if you forget (`add_other = "prompt"` is the default).
4. **Validate on a hand-coded subsample.** For published research,
   never trust classifications without spot-checking against human
   coding on at least 50–100 responses.
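
For the validation step, a minimal base-R sketch is shown below. It
draws a random subsample to hand-code, then compares human and model
codes for a single category using raw agreement and Cohen's kappa. All
values are made up for illustration; in practice `human` would come from
your coders and `model` from the matching column of `results`.

```{r validation-sketch}
# Draw 100 of a hypothetical 500 responses for hand-coding.
set.seed(42)
n_responses   <- 500
subsample_idx <- sample(seq_len(n_responses), size = 100)

# Hypothetical 0/1 codes for one category (shortened to 10 responses).
human <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
model <- c(1, 0, 1, 0, 0, 0, 1, 1, 1, 0)

# Raw agreement: share of responses where the two codes match.
agreement <- mean(human == model)

# Cohen's kappa: agreement corrected for what chance alone would give.
p_both_1    <- mean(human) * mean(model)
p_both_0    <- (1 - mean(human)) * (1 - mean(model))
expected    <- p_both_1 + p_both_0
cohen_kappa <- (agreement - expected) / (1 - expected)
```

Kappa is worth reporting next to raw agreement because a rare category
can show high agreement simply from both coders assigning 0 most of the
time.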

# Where to learn more

- Full Getting Started guide:
  `vignette("getting-started", package = "cat.llm")`
- Per-function reference: `?cat.survey::classify`, `?cat.survey::extract`,
  `?cat.survey::explore`
- Empirical best-practices research (including why verbose labels help)
  is in the project [Python README](https://github.com/chrissoria/cat-llm#best-practices-for-classification).