| Title: | Survey Response Classification with LLMs |
|---|---|
| Description: | R interface to the Python cat-survey package. Classifies, extracts, and explores open-ended survey responses using LLMs. A thin domain wrapper around cat.stack that adds the survey_question parameter for survey-specific context. |
| Authors: | Chris Soria [aut, cre] |
| Maintainer: | Chris Soria <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.2 |
| Built: | 2026-07-04 06:19:29 UTC |
| Source: | https://github.com/chrissoria/cat-llm |
Wraps the Python cat_survey.classify() function. Adds survey_question
context to the base cat.stack classification engine.

    classify(
      input_data,
      categories,
      survey_question = "",
      description = "",
      add_other = "prompt",
      check_verbosity = TRUE,
      api_key = NULL,
      user_model = "gpt-4o",
      mode = "image",
      creativity = NULL,
      safety = FALSE,
      chain_of_verification = FALSE,
      chain_of_thought = FALSE,
      step_back_prompt = FALSE,
      context_prompt = FALSE,
      thinking_budget = 0L,
      example1 = NULL,
      example2 = NULL,
      example3 = NULL,
      example4 = NULL,
      example5 = NULL,
      example6 = NULL,
      filename = NULL,
      save_directory = NULL,
      model_source = "auto",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 10L,
      research_question = NULL,
      models = NULL,
      consensus_threshold = "unanimous",
      use_json_schema = TRUE,
      max_workers = NULL,
      fail_strategy = "partial",
      max_retries = 5L,
      batch_retries = 2L,
      retry_delay = 1,
      row_delay = 0,
      pdf_dpi = 150L,
      auto_download = FALSE,
      prompt_tune = NULL,
      tune_iterations = 1L,
      tune_ui = "browser",
      tune_optimize = "balanced"
    )


| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| categories | A character vector of category names, or |
| survey_question | Character. The survey question text. Default `""`. |
| description | Character. Additional context for the classification task. Default `""`. |
| add_other | Logical or `"prompt"`. Default `"prompt"`. |
| check_verbosity | Logical. Check category descriptions. Default `TRUE`. |
| api_key | API key for the model provider (single-model mode). |
| user_model | Character. Model name. Default `"gpt-4o"`. |
| mode | Character. PDF processing mode. Default `"image"`. |
| creativity | Numeric or `NULL`. |
| safety | Logical. Save progress after each item. Default `FALSE`. |
| chain_of_verification | Logical. Default `FALSE`. |
| chain_of_thought | Logical. Default `FALSE`. |
| step_back_prompt | Logical. Default `FALSE`. |
| context_prompt | Logical. Default `FALSE`. |
| thinking_budget | Integer. Extended thinking budget. Default `0L`. |
| example1, example2, example3, example4, example5, example6 | Optional few-shot examples. |
| filename | Character or `NULL`. |
| save_directory | Character or `NULL`. |
| model_source | Character. Provider hint. Default `"auto"`. |
| max_categories | Integer. Max categories for auto mode. Default `12L`. |
| categories_per_chunk | Integer. Default `10L`. |
| divisions | Integer. Default `10L`. |
| research_question | Character or `NULL`. |
| models | List of model specs for ensemble mode. |
| consensus_threshold | Character or numeric. Default `"unanimous"`. |
| use_json_schema | Logical. Default `TRUE`. |
| max_workers | Integer or `NULL`. |
| fail_strategy | Character. Default `"partial"`. |
| max_retries | Integer. Default `5L`. |
| batch_retries | Integer. Default `2L`. |
| retry_delay | Numeric. Default `1`. |
| row_delay | Numeric. Default `0`. |
| pdf_dpi | Integer. Default `150L`. |
| auto_download | Logical. Default `FALSE`. |
| prompt_tune | Integer or `NULL`. |
| tune_iterations | Integer. APO optimization passes. Default `1L`. |
| tune_ui | Character. Correction UI. Default `"browser"`. |
| tune_optimize | Character. Metric to optimize. Default `"balanced"`. |

A data.frame with classification results.

    ## Not run:
    results <- classify(
      input_data = c("Took a new job in Chicago",
                     "Wanted to be closer to grandkids",
                     "Couldn't afford rent in the Bay Area"),
      categories = c("Job/school", "Family", "Cost of living", "Other"),
      survey_question = "Why did you move?",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    ## End(Not run)

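
The example above reads the key with Sys.getenv(), which returns an empty string when the variable is unset. A minimal sketch of a guard that fails early in that case is shown below. The helper name get_api_key is hypothetical and not part of the package; it uses only base R.

```r
# Hypothetical helper (not part of the package): read an API key from an
# environment variable and stop early if it is unset or empty.
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}
```

The result could then be passed as api_key = get_api_key() so that a missing key surfaces before any request is attempted.
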
Wraps the Python cat_survey.explore() function. Returns every category
string extracted from every chunk across every iteration – with duplicates
intact. Useful for analysing category stability and saturation.

    explore(
      input_data,
      api_key,
      survey_question = "",
      description = "",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 12L,
      user_model = "gpt-4o",
      creativity = NULL,
      specificity = "broad",
      research_question = NULL,
      filename = NULL,
      model_source = "auto",
      iterations = 8L,
      random_state = NULL,
      focus = NULL,
      chunk_delay = 0
    )


| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| api_key | Character. API key for the model provider. |
| survey_question | Character. The survey question text. Default `""`. |
| description | Character. Additional context. Default `""`. |
| max_categories | Integer. Max categories per chunk. Default `12L`. |
| categories_per_chunk | Integer. Default `10L`. |
| divisions | Integer. Number of data chunks. Default `12L`. |
| user_model | Character. Model name. Default `"gpt-4o"`. |
| creativity | Numeric or `NULL`. |
| specificity | Character. Default `"broad"`. |
| research_question | Character or `NULL`. |
| filename | Character or `NULL`. |
| model_source | Character. Provider hint. Default `"auto"`. |
| iterations | Integer. Number of passes. Default `8L`. |
| random_state | Integer or `NULL`. |
| focus | Character or `NULL`. |
| chunk_delay | Numeric. Seconds between API calls. Default `0`. |

A character vector of every category string extracted.

    ## Not run:
    raw_categories <- explore(
      input_data = df$open_response,
      survey_question = "What concerns you most about your community?",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini",
      iterations = 4L
    )
    table(raw_categories)
    ## End(Not run)

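
Because explore() keeps duplicates, the frequency of each string gives a rough view of how stable a category is across chunks and iterations. The sketch below uses a made-up character vector in place of real explore() output and only base R; the column names in the summary are illustrative choices, not something the package returns.

```r
# Stand-in for the character vector returned by explore(); these values are
# invented for illustration only.
raw_categories <- c("Crime", "Housing costs", "Crime", "Traffic",
                    "Housing costs", "Crime", "Schools")

# Count how often each category string was extracted, most frequent first.
counts <- sort(table(raw_categories), decreasing = TRUE)

# Share of all extractions accounted for by each category string.
shares <- round(as.numeric(counts) / length(raw_categories), 2)

stability <- data.frame(
  category = names(counts),
  n = as.integer(counts),
  share = shares
)
print(stability)
```

Strings that recur in most passes are candidates for a stable category; strings that appear once may reflect wording variation or noise and may need manual merging.
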
Wraps the Python cat_survey.extract() function. Discovers and returns a
normalised, deduplicated set of categories found in survey response data.

    extract(
      input_data,
      api_key,
      survey_question = "",
      description = "",
      input_type = "text",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 12L,
      user_model = "gpt-4o",
      creativity = NULL,
      specificity = "broad",
      research_question = NULL,
      mode = "text",
      filename = NULL,
      model_source = "auto",
      iterations = 8L,
      random_state = NULL,
      focus = NULL,
      chunk_delay = 0
    )


| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| api_key | Character. API key for the model provider. |
| survey_question | Character. The survey question text. Default `""`. |
| description | Character. Additional context. Default `""`. |
| input_type | Character. Type of input. Default `"text"`. |
| max_categories | Integer. Maximum final categories. Default `12L`. |
| categories_per_chunk | Integer. Default `10L`. |
| divisions | Integer. Number of data chunks. Default `12L`. |
| user_model | Character. Model name. Default `"gpt-4o"`. |
| creativity | Numeric or `NULL`. |
| specificity | Character. Default `"broad"`. |
| research_question | Character or `NULL`. |
| mode | Character. Processing mode. Default `"text"`. |
| filename | Character or `NULL`. |
| model_source | Character. Provider hint. Default `"auto"`. |
| iterations | Integer. Number of passes. Default `8L`. |
| random_state | Integer or `NULL`. |
| focus | Character or `NULL`. |
| chunk_delay | Numeric. Seconds between API calls. Default `0`. |

A named list with counts_df, top_categories, and raw_top_text.

    ## Not run:
    result <- extract(
      input_data = c("Took a new job in Chicago",
                     "Wanted to be closer to grandkids",
                     "Couldn't afford rent in the Bay Area"),
      survey_question = "Why did you move?",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    print(result$top_categories)
    ## End(Not run)

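
One plausible workflow is to discover categories with extract() and then pass them to classify(). The sketch below is an assumption about how the two functions might be combined, not a documented recipe. The function discover_then_classify is hypothetical, and it assumes top_categories can be coerced to a character vector. The two package functions are passed in as arguments so the flow can be exercised with stubs, without an API key.

```r
# Hypothetical wrapper (not part of the package): discover categories, then
# classify the same responses against them. In real use, extract_fn and
# classify_fn would be the package's extract() and classify().
discover_then_classify <- function(responses, question, api_key,
                                   extract_fn, classify_fn) {
  found <- extract_fn(
    input_data = responses,
    survey_question = question,
    api_key = api_key
  )
  # Assumption: top_categories can be coerced to a character vector.
  cats <- as.character(found$top_categories)
  classify_fn(
    input_data = responses,
    categories = cats,
    survey_question = question,
    api_key = api_key
  )
}
```

With the package loaded, a call might look like discover_then_classify(df$open_response, "Why did you move?", Sys.getenv("OPENAI_API_KEY"), extract, classify). Reviewing the discovered categories by hand before classifying is likely worthwhile, since the wrapper passes them through unchanged.
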