| Title: | General-Purpose LLM Text Classification Engine |
|---|---|
| Description: | R interface to the Python cat-stack package. General-purpose text, image, and PDF classification using LLMs with no domain assumptions. The base engine for the CatLLM ecosystem. |
| Authors: | Chris Soria [aut, cre] |
| Maintainer: | Chris Soria <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.2.2 |
| Built: | 2026-07-04 06:19:21 UTC |
| Source: | https://github.com/chrissoria/cat-llm |
Returns TRUE if the named model is available in your local Ollama
installation, FALSE otherwise. Partial name matching is supported
(e.g. "llama3.2" matches "llama3.2:latest").
check_ollama_model(model, host = "localhost", port = 11434L)
| Argument | Description |
|---|---|
| model | Character. Model name to look for (e.g. …) |
| host | Character. Hostname Ollama is reachable on. Default "localhost". |
| port | Integer. Port Ollama is reachable on. Default 11434L. |
Logical scalar.
## Not run:
check_ollama_model("qwen2.5:7b")
## End(Not run)
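The partial name matching described for check_ollama_model() can be pictured with a small base-R sketch. This is not the package's implementation: the helper name matches_model, the made-up model list, and the matching rule (an exact match, or the requested name followed by a ":tag" suffix) are all assumptions chosen for illustration.

```r
# Hypothetical helper: illustrates one plausible partial-matching rule.
# Not part of the package; the real matching logic may differ.
matches_model <- function(model, available) {
  # Exact match, or the requested name followed by a ":<tag>" suffix
  any(available == model | startsWith(available, paste0(model, ":")))
}

installed <- c("llama3.2:latest", "qwen2.5:7b")  # made-up local model list

matches_model("llama3.2", installed)    # TRUE: matches "llama3.2:latest"
matches_model("qwen2.5:7b", installed)  # TRUE: exact match
matches_model("mistral", installed)     # FALSE
```

Anchoring the match at the ":" separator keeps a short name from matching an unrelated longer model name that merely shares a prefix.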
Wraps the Python cat_stack.classify() function. Supports both single-model
and multi-model (ensemble) classification.
classify(
  input_data,
  categories,
  api_key = NULL,
  description = "",
  user_model = "gpt-4o",
  mode = "image",
  creativity = NULL,
  safety = FALSE,
  chain_of_verification = FALSE,
  chain_of_thought = FALSE,
  step_back_prompt = FALSE,
  context_prompt = FALSE,
  thinking_budget = 0L,
  example1 = NULL,
  example2 = NULL,
  example3 = NULL,
  example4 = NULL,
  example5 = NULL,
  example6 = NULL,
  filename = NULL,
  save_directory = NULL,
  model_source = "auto",
  max_categories = 12L,
  categories_per_chunk = 10L,
  divisions = 10L,
  research_question = NULL,
  models = NULL,
  consensus_threshold = "unanimous",
  survey_question = "",
  use_json_schema = TRUE,
  max_workers = NULL,
  fail_strategy = "partial",
  max_retries = 5L,
  batch_retries = 1L,
  json_retries = 2L,
  retry_delay = 1,
  row_delay = 0,
  pdf_dpi = 150L,
  auto_download = FALSE,
  add_other = "prompt",
  check_verbosity = TRUE,
  multi_label = TRUE,
  batch_mode = FALSE,
  batch_poll_interval = 30,
  batch_timeout = 86400,
  json_formatter = NULL,
  two_step_classify = NULL,
  embedding_tiebreaker = FALSE,
  min_centroid_size = 3L,
  auto_start_ollama = TRUE,
  system_prompt = "",
  prompt_tune = NULL,
  tune_iterations = 1L,
  tune_ui = "browser",
  tune_optimize = "balanced"
)
| Argument | Description |
|---|---|
| input_data | A character vector, list of text strings, or … |
| categories | A character vector of category names, or … |
| api_key | API key for the model provider (single-model mode). Not required when … |
| description | Character. Context description for the classification task (e.g., the survey question or image subject). Default "". |
| user_model | Character. Model name to use in single-model mode. Default "gpt-4o". |
| mode | Character. PDF processing mode: … |
| creativity | Numeric or … |
| safety | Logical. If … |
| chain_of_verification | Logical. Enable Chain of Verification. Empirically degrades accuracy; provided for research only. Default FALSE. |
| chain_of_thought | Logical. Enable chain-of-thought reasoning. Default FALSE. |
| step_back_prompt | Logical. Enable step-back prompting. Default FALSE. |
| context_prompt | Logical. Add expert context to prompts. Default FALSE. |
| thinking_budget | Integer. Extended thinking token budget (0 = off). Default 0L. |
| example1, example2, example3, example4, example5, example6 | Optional few-shot example strings. Empirically degrades accuracy; provided for research only. |
| filename | Character or … |
| save_directory | Character or … |
| model_source | Character. Provider hint for single-model mode: … |
| max_categories | Integer. Maximum number of categories when … |
| categories_per_chunk | Integer. Categories extracted per chunk when … |
| divisions | Integer. Number of data chunks when … |
| research_question | Character or … |
| models | A list of model specifications for multi-model ensemble mode. Each element is either a 3-element character vector … |
| consensus_threshold | Character or numeric. Agreement threshold for ensemble mode. Options: … The output … |
| survey_question | Character. Soft-deprecated alias for … |
| use_json_schema | Logical. Use JSON schema for structured output. Default TRUE. |
| max_workers | Integer or … |
| fail_strategy | Character. How to handle failures: … |
| max_retries | Integer. Max retries per API call. Default 5L. |
| batch_retries | Integer. Max retries for batch-level failures. Default 1L. |
| json_retries | Integer. Per-row retries when the LLM returns JSON that fails schema validation. On each retry the prompt appends "Respond with ONLY valid JSON". On the final attempt the formatter fallback (if enabled via … |
| retry_delay | Numeric. Seconds between retries. Default 1. |
| row_delay | Numeric. Seconds between processing each row (useful for rate limiting). Default 0. |
| pdf_dpi | Integer. DPI for PDF page rendering. Default 150L. |
| auto_download | Logical. Auto-download Ollama models. Default FALSE. |
| add_other | Logical or … |
| check_verbosity | Logical. Check whether each category has a description and examples (1 API call). Default TRUE. |
| multi_label | Logical. If … |
| batch_mode | Logical. If … |
| batch_poll_interval | Numeric. Seconds between batch-job status polls when … |
| batch_timeout | Numeric. Maximum seconds to wait for a batch job to complete. Default 86400. |
| json_formatter | Auto-enabled when … |
| two_step_classify | … |
| embedding_tiebreaker | Logical. Resolve true ensemble ties (50/50 splits at the threshold) using embedding centroids built from unanimously-agreed rows; the closer centroid wins. Companion for … |
| min_centroid_size | Integer. Minimum number of unanimously-agreed rows needed to build a centroid for a category when … |
| auto_start_ollama | Logical. If … |
| system_prompt | Character. Custom system-level instruction prepended to every classification call. Use this to apply a prompt returned by … |
| prompt_tune | Integer or … |
| tune_iterations | Integer. Number of APO optimization passes. Default 1L. |
| tune_ui | Character. Correction UI: … |
| tune_optimize | Character. Metric to optimize: … |
A data.frame with one row per input item and classification
columns. In single-model mode the columns are the category names. In
ensemble mode additional consensus_* and agreement_* columns are
included.
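To make the ensemble output more concrete, here is a minimal base-R sketch of how per-model 0/1 votes for a single category could be collapsed into one consensus value. It is an illustration only, not the package's code: the function name consensus_vote and the vote matrix are hypothetical, and the rules are assumptions ("unanimous" requires every model to assign the category; "majority" requires strictly more than half, so an even split resolves to 0, in line with the "tie -> 0" default mentioned in the examples).

```r
# Hypothetical illustration of consensus rules; not the package's code.
# votes: matrix of 0/1 labels, one row per input item, one column per model.
consensus_vote <- function(votes, threshold = c("unanimous", "majority")) {
  threshold <- match.arg(threshold)
  share <- rowMeans(votes)  # fraction of models that assigned the category
  if (threshold == "unanimous") {
    as.integer(share == 1)
  } else {
    as.integer(share > 0.5)  # strict majority: an even split resolves to 0
  }
}

votes <- rbind(
  c(1, 1),  # both models assign the category
  c(1, 0),  # 50/50 split
  c(0, 0)   # neither model assigns the category
)

consensus_vote(votes, "unanimous")  # 1 0 0
consensus_vote(votes, "majority")   # 1 0 0 (the tie becomes 0)
```

With an even number of models the two rules can coincide, as they do here, which is the situation the embedding_tiebreaker argument is meant to address.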
## Not run:
# Single-model classification
results <- classify(
  input_data = c("I love this!", "Terrible service.", "It was okay."),
  categories = c("Positive", "Negative", "Neutral"),
  description = "Customer feedback",
  api_key = Sys.getenv("OPENAI_API_KEY")
)

# Single-label: force exactly one best-matching category per response
# (the prompt asks for the single most appropriate category instead of
# all that apply). Use for mutually exclusive coding frames.
results <- classify(
  input_data = c("I love this!", "Terrible service.", "It was okay."),
  categories = c("Positive", "Negative", "Neutral"),
  description = "Customer feedback",
  multi_label = FALSE,
  api_key = Sys.getenv("OPENAI_API_KEY")
)

# Multi-model ensemble
results <- classify(
  input_data = df$responses,
  categories = c("Positive", "Negative", "Neutral"),
  models = list(
    c("gpt-4o", "openai", Sys.getenv("OPENAI_API_KEY")),
    c("claude-sonnet-4-5-20250929", "anthropic", Sys.getenv("ANTHROPIC_API_KEY"))
  ),
  consensus_threshold = "unanimous"
)

# Even-model ensemble with strict-majority + embedding tiebreaker
# (resolves true 50/50 ties via centroid similarity instead of
# the default "tie -> 0"; requires cat-stack[embeddings])
results <- classify(
  input_data = df$responses,
  categories = c("Positive", "Negative", "Neutral"),
  models = list(
    c("gpt-4o-mini", "openai", Sys.getenv("OPENAI_API_KEY")),
    c("claude-haiku-4-5", "anthropic", Sys.getenv("ANTHROPIC_API_KEY"))
  ),
  consensus_threshold = "majority",
  embedding_tiebreaker = TRUE
)

# Async batch mode (50% cheaper, slower): OpenAI / Anthropic /
# Google / Mistral / xAI only; not yet supported with PDFs/images
# or embedding_tiebreaker.
results <- classify(
  input_data = df$responses,
  categories = c("Positive", "Negative", "Neutral"),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  batch_mode = TRUE
)
## End(Not run)
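The idea behind embedding_tiebreaker ("the closer centroid wins") can be sketched with toy vectors. Everything here is an assumption made for illustration: the 2-D "embeddings" are made-up numbers, cosine similarity is assumed as the closeness measure, and the helper names are hypothetical. Real embeddings come from an embedding model and the package's internals may differ.

```r
# Hypothetical sketch of a centroid tiebreaker; not the package's code.
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy 2-D "embeddings" of rows where all models agreed (made-up numbers)
agreed_present <- rbind(c(0.9, 0.1), c(0.8, 0.2), c(1.0, 0.0))  # category = 1
agreed_absent  <- rbind(c(0.1, 0.9), c(0.0, 1.0), c(0.2, 0.8))  # category = 0

# One centroid per outcome, built from the unanimously agreed rows
centroid_present <- colMeans(agreed_present)
centroid_absent  <- colMeans(agreed_absent)

# Resolve a tied row by assigning it to the closer centroid
break_tie <- function(embedding) {
  as.integer(cosine(embedding, centroid_present) >
               cosine(embedding, centroid_absent))
}

break_tie(c(0.7, 0.3))  # 1: closer to the "present" centroid
break_tie(c(0.2, 0.6))  # 0: closer to the "absent" centroid
```

Each centroid above is built from three agreed rows, which matches the documented default of min_centroid_size = 3L.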
Checks whether an Ollama server is reachable at host:port. If not,
attempts to start it using the platform-appropriate command and
polls until the server responds (or timeout is reached). Call
this once at the top of an R session before classifying with
model_source = "ollama".
ensure_ollama_running(
  auto_start = TRUE,
  timeout = 30,
  host = "localhost",
  port = 11434L,
  verbose = TRUE
)
| Argument | Description |
|---|---|
| auto_start | Logical. If … |
| timeout | Numeric. Seconds to wait for Ollama to become ready after … |
| host | Character. Hostname Ollama is reachable on. Default "localhost". |
| port | Integer. Port Ollama is reachable on. Default 11434L. |
| verbose | Logical. Print status messages. Default TRUE. |
Platform start commands:
macOS — open -a Ollama (launches the Ollama.app daemon).
Falls back to ollama serve if the app is not installed.
Linux — ollama serve (run in a detached process).
Windows — ollama serve.
If Ollama is not installed, the function returns a clear error message linking to https://ollama.com.
Invisibly returns TRUE when Ollama is running.
## Not run:
# Ensure Ollama is up before classifying with a local model
ensure_ollama_running()

results <- classify(
  input_data = c("text 1", "text 2"),
  categories = c("Positive", "Negative", "Neutral"),
  user_model = "qwen2.5:7b",
  model_source = "ollama"
)

# Just check without auto-starting
ensure_ollama_running(auto_start = FALSE)
## End(Not run)
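The "poll until the server responds or the timeout is reached" behaviour of ensure_ollama_running() follows a common pattern that can be sketched in base R. This is a generic illustration under stated assumptions, not the package's implementation: wait_until_ready, its interval argument, and the simulated fake_server check are all hypothetical.

```r
# Hypothetical polling helper; not the package's implementation.
# is_ready: a zero-argument function returning TRUE once the server responds.
wait_until_ready <- function(is_ready, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(is_ready())) return(TRUE)        # server answered
    if (Sys.time() >= deadline) return(FALSE)   # gave up after `timeout` seconds
    Sys.sleep(interval)                         # wait before the next check
  }
}

# Simulated server that becomes reachable on the third check
checks <- 0
fake_server <- function() {
  checks <<- checks + 1
  checks >= 3
}

wait_until_ready(fake_server, timeout = 5, interval = 0.1)  # TRUE
```

Checking readiness before testing the deadline means a server that is already up returns immediately, even with a timeout of zero.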
Wraps the Python cat_stack.explore() function. Returns every category
string extracted from every chunk across every iteration – with duplicates
intact. Useful for analysing category stability and saturation across
repeated extraction runs.
explore(
  input_data,
  api_key,
  description = "",
  max_categories = 12L,
  categories_per_chunk = 10L,
  divisions = 12L,
  user_model = "gpt-4o",
  creativity = NULL,
  specificity = "broad",
  research_question = NULL,
  filename = NULL,
  model_source = "auto",
  iterations = 8L,
  random_state = NULL,
  focus = NULL,
  chunk_delay = 0,
  auto_start_ollama = TRUE
)
| Argument | Description |
|---|---|
| input_data | A character vector, list, or … |
| api_key | Character. API key for the model provider. |
| description | Character. The survey question or data description. Default "". |
| max_categories | Integer. Maximum categories per chunk. Default 12L. |
| categories_per_chunk | Integer. Categories to extract per chunk. Default 10L. |
| divisions | Integer. Number of data chunks. Default 12L. |
| user_model | Character. Model name. Default "gpt-4o". |
| creativity | Numeric or … |
| specificity | Character. … |
| research_question | Character or … |
| filename | Character or … |
| model_source | Character. Provider hint. Default "auto". |
| iterations | Integer. Number of passes over the data. Default 8L. |
| random_state | Integer or … |
| focus | Character or … |
| chunk_delay | Numeric. Seconds between API calls. Default 0. |
| auto_start_ollama | Logical. If … |
Unlike extract(), which normalises and deduplicates categories, explore()
returns the raw unprocessed output suitable for frequency and saturation
analysis.
A character vector of every category string extracted across all
chunks and iterations. Length is approximately
iterations * divisions * categories_per_chunk.
## Not run:
raw_cats <- explore(
  input_data = df$responses,
  description = "Why did you move?",
  api_key = Sys.getenv("OPENAI_API_KEY"),
  iterations = 3L,
  divisions = 5L
)
length(raw_cats) # ~150
head(raw_cats, 10)
## End(Not run)
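The "~150" in the example follows from the length formula iterations * divisions * categories_per_chunk with the default categories_per_chunk = 10L. The sketch below works through that arithmetic and then shows one way the raw output could be tabulated for a frequency analysis. The raw_cats vector here is a made-up stand-in, not real explore() output.

```r
# Approximate size of the explore() output for the example call
iterations <- 3L
divisions <- 5L
categories_per_chunk <- 10L  # the documented default
iterations * divisions * categories_per_chunk  # 150

# Made-up stand-in for an explore() result, duplicates intact
raw_cats <- c("Job", "Family", "Job", "Cost of living", "Family", "Job")

# Frequency of each raw category string, most common first
freq <- sort(table(raw_cats), decreasing = TRUE)
freq

# Share of all extractions accounted for by each category
round(prop.table(freq), 2)
```

Categories that keep reappearing across iterations end up with high counts, which is the signal a saturation analysis looks for.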
Wraps the Python cat_stack.extract() function. Discovers and returns a
normalised, deduplicated set of categories found in the input data.
extract(
  input_data,
  api_key,
  input_type = "text",
  description = "",
  max_categories = 12L,
  categories_per_chunk = 10L,
  divisions = 12L,
  user_model = "gpt-4o",
  creativity = NULL,
  specificity = "broad",
  research_question = NULL,
  mode = "text",
  filename = NULL,
  model_source = "auto",
  iterations = 8L,
  random_state = NULL,
  focus = NULL,
  chunk_delay = 0,
  auto_start_ollama = TRUE
)
| Argument | Description |
|---|---|
| input_data | A character vector, list, or … |
| api_key | Character. API key for the model provider. |
| input_type | Character. Type of input: … |
| description | Character. The survey question or data description. Default "". |
| max_categories | Integer. Maximum number of final categories to return. Default 12L. |
| categories_per_chunk | Integer. Categories to extract per data chunk. Default 10L. |
| divisions | Integer. Number of chunks to divide the data into. Default 12L. |
| user_model | Character. Model name. Default "gpt-4o". |
| creativity | Numeric or … |
| specificity | Character. Category granularity: … |
| research_question | Character or … |
| mode | Character. Processing mode. For PDFs: … |
| filename | Character or … |
| model_source | Character. Provider hint: … |
| iterations | Integer. Number of passes over the data. Default 8L. |
| random_state | Integer or … |
| focus | Character or … |
| chunk_delay | Numeric. Seconds between API calls (rate limiting). Default 0. |
| auto_start_ollama | Logical. If … |
A named list with:
counts_df: A data.frame of discovered categories with counts.
top_categories: A character vector of the top category names.
raw_top_text: The raw model output from the final merge step.
## Not run:
result <- extract(
  input_data = df$responses,
  description = "Why did you move to this city?",
  api_key = Sys.getenv("OPENAI_API_KEY")
)
print(result$top_categories)
print(result$counts_df)
## End(Not run)
Installs the cat-stack Python package into the Python environment used by
reticulate. Optionally installs PDF extras.
install_cat_stack(
  method = "auto",
  conda = "auto",
  pdf = FALSE,
  upgrade = FALSE,
  ...
)
| Argument | Description |
|---|---|
| method | Installation method passed to … |
| conda | Conda environment name. Default "auto". |
| pdf | Logical. If … |
| upgrade | Logical. If … |
| ... | Additional arguments passed to … |
The version floor is pinned to cat-stack >= 2.0.1. The stable 2.0 line
centralizes provider parameter handling (current Anthropic models no
longer return HTTP 400 errors on creativity / thinking_budget), grades
thinking_budget consistently across providers, and fixes description=
context routing in classify() / prompt_tune(). Older installs of the
Python package work for old models, but silently degrade on the newest
Anthropic generation.
Invisibly NULL.
## Not run:
# Standard install
install_cat_stack()

# With PDF support (installs cat-stack[pdf])
install_cat_stack(pdf = TRUE)

# Upgrade an existing install
install_cat_stack(upgrade = TRUE)
## End(Not run)
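If you already know the version string of an installed cat-stack, comparing it against the documented floor takes one line of base R. The helper below is a hypothetical convenience, not a package function. It only compares strings you pass in, and how you obtain the installed version is left open.

```r
# Hypothetical helper: compare a version string with the documented floor.
# Not part of the package; it does not query the Python environment.
meets_floor <- function(installed, floor = "2.0.1") {
  package_version(installed) >= package_version(floor)
}

meets_floor("2.0.1")  # TRUE
meets_floor("2.1.0")  # TRUE
meets_floor("1.9.4")  # FALSE
```

package_version() expects purely numeric components, so a pre-release string with a letter suffix would need to be cleaned up before the comparison.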
Returns the names of all models already downloaded to your local Ollama
installation. Requires Ollama to be running (call ensure_ollama_running()
first, or start it manually with ollama serve).
list_ollama_models(host = "localhost", port = 11434L)
| Argument | Description |
|---|---|
| host | Character. Hostname Ollama is reachable on. Default "localhost". |
| port | Integer. Port Ollama is reachable on. Default 11434L. |
A character vector of model names (e.g. c("qwen2.5:7b", "mistral:7b")),
or an empty character vector if Ollama is not running.
## Not run:
ensure_ollama_running()
list_ollama_models()
## End(Not run)
Wraps the Python cat_stack.prompt_tune() function. Runs a
coordinate-descent loop: classifies a small sample, asks you to
correct the model's output, then has a meta-LLM rewrite the
classification instructions for each category that had errors.
Returns the best system prompt found plus per-iteration metrics.
prompt_tune(
  input_data,
  categories,
  api_key = NULL,
  user_model = "gpt-4o",
  model_source = "auto",
  models = NULL,
  description = "",
  survey_question = "",
  sample_size = 10L,
  max_iterations = 3L,
  multi_label = TRUE,
  creativity = NULL,
  use_json_schema = TRUE,
  consensus_threshold = "unanimous",
  max_retries = 5L,
  input_mode = NULL,
  ui = "terminal",
  optimize = "balanced",
  add_other = "prompt",
  thinking_budget = 0L,
  auto_start_ollama = TRUE
)
| Argument | Description |
|---|---|
| input_data | A character vector, list, or … |
| categories | A character vector of category names. The labels themselves are never modified by tuning; only the classification instructions change. |
| api_key | Character or … |
| user_model | Character. Model name. Default "gpt-4o". |
| model_source | Character. Provider hint. Default "auto". |
| models | List of model specs for ensemble mode (each … |
| description | Character. Context description. Default "". |
| survey_question | Character. Soft-deprecated alias for … |
| sample_size | Integer. Items to test per iteration. Default 10L. |
| max_iterations | Integer. Max instruction attempts per category. Default 3L. |
| multi_label | Logical. Multi-label classification. Default TRUE. |
| creativity | Numeric or … |
| use_json_schema | Logical. Default TRUE. |
| consensus_threshold | Character or numeric. For ensemble mode. Default "unanimous". |
| max_retries | Integer. Default 5L. |
| input_mode | Character or … |
| ui | Character. Review interface for corrections. … |
| optimize | Character. Which metric to maximize. … |
| add_other | Logical or … |
| thinking_budget | Integer. Default 0L. |
| auto_start_ollama | Logical. If … |
This function is interactive — you'll be asked to review and
correct the model's labels at least once. From an R session, the
default ui = "terminal" reads your corrections from stdin (works
in R, Rscript, and most IDE consoles). ui = "browser" opens a
local web page with checkboxes; depending on your R setup this may
or may not auto-launch the browser, so terminal is the safer
default for R users.
Use the returned system_prompt with classify() via the
system_prompt = argument to apply the tuned instructions.
A named list with components:
system_prompt — the optimized system prompt (best found)
iterations — list of per-iteration records (label,
system_prompt, metrics, per_category, total_flips)
per_category_summary — per-category metrics from the
best-scoring iteration
## Not run:
result <- prompt_tune(
  input_data = df$open_response,
  categories = c("Positive", "Negative", "Neutral"),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini",
  sample_size = 10L,
  max_iterations = 3L,
  ui = "terminal"
)

# Inspect the optimized prompt
cat(result$system_prompt)

# Use it in classify() via the system_prompt argument
results <- classify(
  input_data = df$open_response,
  categories = c("Positive", "Negative", "Neutral"),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini",
  system_prompt = result$system_prompt
)
## End(Not run)
Downloads the named model into your local Ollama installation. Prints the
estimated model size and a resource check before downloading. Set
auto_confirm = TRUE to skip the interactive confirmation prompt — useful
in scripts and RMarkdown documents.
pull_ollama_model(
  model,
  host = "localhost",
  port = 11434L,
  auto_confirm = FALSE
)
| Argument | Description |
|---|---|
| model | Character. Model name to download (e.g. …) |
| host | Character. Hostname Ollama is reachable on. Default "localhost". |
| port | Integer. Port Ollama is reachable on. Default 11434L. |
| auto_confirm | Logical. Skip the confirmation prompt. Default FALSE. |
Invisibly returns TRUE on success, FALSE on failure.
## Not run:
pull_ollama_model("llama3.2", auto_confirm = TRUE)
## End(Not run)
Wraps the Python cat_stack.summarize() function. Generates summaries of
input data using one or more LLM models. Supports single-model and
multi-model (ensemble) summarization.
summarize(
  input_data,
  api_key = NULL,
  description = "",
  instructions = "",
  format = "paragraph",
  max_length = NULL,
  focus = NULL,
  user_model = "gpt-4o",
  model_source = "auto",
  mode = "image",
  input_mode = NULL,
  input_type = "auto",
  pdf_dpi = 150L,
  creativity = NULL,
  thinking_budget = 0L,
  chain_of_thought = TRUE,
  context_prompt = FALSE,
  step_back_prompt = FALSE,
  filename = NULL,
  save_directory = NULL,
  models = NULL,
  max_workers = NULL,
  parallel = NULL,
  auto_download = FALSE,
  safety = FALSE,
  max_retries = 5L,
  batch_retries = 1L,
  retry_delay = 1,
  row_delay = 0,
  fail_strategy = "partial",
  batch_mode = FALSE,
  batch_poll_interval = 30,
  batch_timeout = 86400,
  auto_start_ollama = TRUE
)
| Argument | Description |
|---|---|
| input_data | A character vector, list, or … |
| api_key | Character or … |
| description | Character. Context description for the summarization task. Default "". |
| instructions | Character. Specific instructions for the summary. Default "". |
| format | Character. Output format: … |
| max_length | Integer or … |
| focus | Character or … |
| user_model | Character. Model name. Default "gpt-4o". |
| model_source | Character. Provider hint: … |
| mode | Character. Processing mode for images/PDFs: … |
| input_mode | Character or … |
| input_type | Character. Type of input: … |
| pdf_dpi | Integer. DPI for PDF page rendering. Default 150L. |
| creativity | Numeric or … |
| thinking_budget | Integer. Extended thinking token budget (0 = off). Default 0L. |
| chain_of_thought | Logical. Enable chain-of-thought reasoning. Default TRUE. |
| context_prompt | Logical. Add expert context to prompts. Default FALSE. |
| step_back_prompt | Logical. Enable step-back prompting. Default FALSE. |
| filename | Character or … |
| save_directory | Character or … |
| models | A list of model specifications for multi-model ensemble mode. Each element is either a 3-element character vector … |
| max_workers | Integer or … |
| parallel | Logical or … |
| auto_download | Logical. Auto-download Ollama models. Default FALSE. |
| safety | Logical. If … |
| max_retries | Integer. Max retries per API call. Default 5L. |
| batch_retries | Integer. Max retries for batch-level failures. Default 1L. |
| retry_delay | Numeric. Seconds between retries. Default 1. |
| row_delay | Numeric. Seconds between processing each row. Default 0. |
| fail_strategy | Character. How to handle failures: … |
| batch_mode | Logical. Use batch processing mode. Default FALSE. |
| batch_poll_interval | Numeric. Seconds between batch status polls. Default 30. |
| batch_timeout | Numeric. Maximum seconds to wait for batch completion. Default 86400. |
| auto_start_ollama | Logical. If … |
A data.frame with summarization results.
## Not run:
# Single-model summarization
results <- summarize(
  input_data = c("A long article about climate change...",
                 "A detailed report on economic trends..."),
  description = "News articles",
  instructions = "Provide a 2-sentence summary",
  api_key = Sys.getenv("OPENAI_API_KEY")
)

# PDF summarization
results <- summarize(
  input_data = "path/to/documents/",
  input_type = "pdf",
  api_key = Sys.getenv("OPENAI_API_KEY")
)
## End(Not run)