| Title: | Political Document Classification with LLMs |
|---|---|
| Description: | R interface to the Python catpol package. Classifies, extracts, explores, and summarizes political and policy documents using LLMs. A thin domain wrapper around cat.stack that adds a registered-source fetcher (city ordinances, federal laws, executive orders, presidential speeches, social media) and policy-document prompt framing. |
| Authors: | Chris Soria [aut, cre] |
| Maintainer: | Chris Soria <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.2 |
| Built: | 2026-07-04 06:20:11 UTC |
| Source: | https://github.com/chrissoria/cat-llm |
Wraps the Python catpol.classify() function. Can classify either raw
text (via input_data) or pull directly from a registered political
data source (via source). All catstack classification arguments are
supported.

    classify(
      categories,
      input_data = NULL,
      source = NULL,
      doc_type = NULL,
      since = NULL,
      until = NULL,
      n = NULL,
      document_context = "",
      description = "",
      api_key = NULL,
      user_model = "gpt-4o",
      mode = "image",
      creativity = NULL,
      safety = FALSE,
      chain_of_verification = FALSE,
      chain_of_thought = FALSE,
      step_back_prompt = FALSE,
      context_prompt = FALSE,
      thinking_budget = 0L,
      example1 = NULL,
      example2 = NULL,
      example3 = NULL,
      example4 = NULL,
      example5 = NULL,
      example6 = NULL,
      filename = NULL,
      save_directory = NULL,
      model_source = "auto",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 10L,
      research_question = NULL,
      models = NULL,
      consensus_threshold = "unanimous",
      use_json_schema = TRUE,
      max_workers = NULL,
      fail_strategy = "partial",
      max_retries = 5L,
      batch_retries = 2L,
      retry_delay = 1,
      row_delay = 0,
      pdf_dpi = 150L,
      auto_download = FALSE,
      add_other = "prompt",
      check_verbosity = TRUE,
      prompt_tune = NULL,
      tune_iterations = 1L,
      tune_ui = "browser",
      tune_optimize = "balanced"
    )


| Argument | Description |
|---|---|
| `categories` | A character vector of category names. |
| `input_data` | A character vector, list, or `NULL` (default). |
| `source` | Character or `NULL` (default). |
| `doc_type` | Character or `NULL` (default). |
| `since` | Character or `NULL` (default). |
| `until` | Character or `NULL` (default). |
| `n` | Integer or `NULL` (default). |
| `document_context` | Character. Context about the policy document being analyzed. Default `""`. |
| `description` | Character. Additional context description. Default `""`. |
| `api_key` | Character or `NULL` (default). |
| `user_model` | Character. Model name. Default `"gpt-4o"`. |
| `mode` | Character. Processing mode. Default `"image"`. |
| `creativity` | Numeric or `NULL` (default). |
| `safety` | Logical. Save progress after each item. Default `FALSE`. |
| `chain_of_verification` | Logical. Default `FALSE`. |
| `chain_of_thought` | Logical. Default `FALSE`. |
| `step_back_prompt` | Logical. Default `FALSE`. |
| `context_prompt` | Logical. Default `FALSE`. |
| `thinking_budget` | Integer. Default `0L`. |
| `example1`, `example2`, `example3`, `example4`, `example5`, `example6` | Optional few-shot examples. |
| `filename` | Character or `NULL` (default). |
| `save_directory` | Character or `NULL` (default). |
| `model_source` | Character. Provider hint. Default `"auto"`. |
| `max_categories` | Integer. Default `12L`. |
| `categories_per_chunk` | Integer. Default `10L`. |
| `divisions` | Integer. Default `10L`. |
| `research_question` | Character or `NULL` (default). |
| `models` | List of model specs for ensemble mode. Default `NULL`. |
| `consensus_threshold` | Character or numeric. Default `"unanimous"`. |
| `use_json_schema` | Logical. Default `TRUE`. |
| `max_workers` | Integer or `NULL` (default). |
| `fail_strategy` | Character. Default `"partial"`. |
| `max_retries` | Integer. Default `5L`. |
| `batch_retries` | Integer. Default `2L`. |
| `retry_delay` | Numeric. Default `1`. |
| `row_delay` | Numeric. Default `0`. |
| `pdf_dpi` | Integer. Default `150L`. |
| `auto_download` | Logical. Default `FALSE`. |
| `add_other` | Logical or `"prompt"` (default). |
| `check_verbosity` | Logical. Default `TRUE`. |
| `prompt_tune` | Integer or `NULL` (default). |
| `tune_iterations` | Integer. APO optimization passes. Default `1L`. |
| `tune_ui` | Character. Correction UI. Default `"browser"`. |
| `tune_optimize` | Character. Metric to optimize. Default `"balanced"`. |

A data.frame with classification results.

    ## Not run:
    # Pull recent San Diego ordinances from a registered source
    results <- classify(
      source = "city_san_diego",
      doc_type = "ordinance",
      since = "2024-01-01",
      n = 50L,
      categories = c("Housing", "Public Safety", "Finance",
                     "Infrastructure", "Health"),
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )

    # Or classify your own text directly
    results <- classify(
      input_data = df$bill_text,
      categories = c("Housing", "Public Safety", "Finance"),
      api_key = Sys.getenv("OPENAI_API_KEY")
    )
    ## End(Not run)

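Since classify() takes its documents either from input_data or from source, a small pre-flight check can catch a call that supplies neither (or both) before any API request is made. The sketch below is illustrative only: `check_input_choice()` is a hypothetical helper, not part of catpol, and it assumes the two arguments are meant to be mutually exclusive, which the package itself may or may not enforce.

```r
# Hypothetical helper (not part of catpol): require exactly one of
# `input_data` or `source`, mirroring the two ways of feeding classify().
check_input_choice <- function(input_data = NULL, source = NULL) {
  has_data   <- !is.null(input_data)
  has_source <- !is.null(source)
  if (has_data == has_source) {
    # TRUE == TRUE (both given) or FALSE == FALSE (neither given)
    stop("Supply exactly one of `input_data` or `source`.", call. = FALSE)
  }
  if (has_source) "source" else "input_data"
}

# Registered-source route
check_input_choice(source = "city_san_diego")

# Raw-text route
check_input_choice(input_data = c("An ordinance amending the zoning code."))
```

The helper returns the name of the route in use, so the same value can drive later branching, for example whether since, until, and n are worth passing at all.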
Wraps the Python catpol.explore() function. Returns every category
string extracted from every chunk across every iteration, with
duplicates intact.

    explore(
      input_data = NULL,
      api_key = NULL,
      source = NULL,
      doc_type = NULL,
      since = NULL,
      until = NULL,
      n = NULL,
      document_context = "",
      description = "",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 12L,
      user_model = "gpt-4o",
      creativity = NULL,
      specificity = "broad",
      research_question = NULL,
      filename = NULL,
      model_source = "auto",
      iterations = 8L,
      random_state = NULL,
      focus = NULL,
      chunk_delay = 0
    )

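
| Argument | Description |
|---|---|
| `input_data` | A character vector, list, or `NULL` (default). |
| `api_key` | Character or `NULL` (default). |
| `source` | Character or `NULL` (default). |
| `doc_type` | Character or `NULL` (default). |
| `since` | Character or `NULL` (default). |
| `until` | Character or `NULL` (default). |
| `n` | Integer or `NULL` (default). |
| `document_context` | Character. Context about the document. Default `""`. |
| `description` | Character. Additional context. Default `""`. |
| `max_categories` | Integer. Default `12L`. |
| `categories_per_chunk` | Integer. Default `10L`. |
| `divisions` | Integer. Default `12L`. |
| `user_model` | Character. Default `"gpt-4o"`. |
| `creativity` | Numeric or `NULL` (default). |
| `specificity` | Character. Default `"broad"`. |
| `research_question` | Character or `NULL` (default). |
| `filename` | Character or `NULL` (default). |
| `model_source` | Character. Default `"auto"`. |
| `iterations` | Integer. Default `8L`. |
| `random_state` | Integer or `NULL` (default). |
| `focus` | Character or `NULL` (default). |
| `chunk_delay` | Numeric. Default `0`. |
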
A character vector of every category string extracted.

    ## Not run:
    raw_cats <- explore(
      source = "federal_executive_orders",
      since = "2025-01-01",
      n = 30L,
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini",
      iterations = 4L
    )
    sort(table(raw_cats), decreasing = TRUE)
    ## End(Not run)

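Because explore() keeps duplicates, the frequency of each label is the useful signal. The sketch below tallies a mock vector standing in for real explore() output (the labels are invented for illustration) and applies light normalisation first, so that differences in case or stray whitespace do not split the counts.

```r
# Mock output standing in for explore(); real labels depend on the model
# and the documents, so these values are purely illustrative.
raw_cats <- c("Housing", "housing ", "Public Safety", "Housing",
              "Immigration", "public safety", "Trade")

# Normalise case and whitespace so near-identical labels are counted together.
clean <- tolower(trimws(raw_cats))

# Frequency table, most common label first.
counts <- sort(table(clean), decreasing = TRUE)
counts

# Keep labels that were proposed at least twice as candidate categories.
candidates <- names(counts)[counts >= 2]
candidates
#> [1] "housing"       "public safety"
```

The threshold of two is arbitrary; with more iterations a higher cut-off may separate stable categories from one-off suggestions more cleanly.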
Wraps the Python catpol.extract() function. Returns a normalised,
deduplicated set of categories from policy text or a registered source.

    extract(
      input_data = NULL,
      api_key = NULL,
      source = NULL,
      doc_type = NULL,
      since = NULL,
      until = NULL,
      n = NULL,
      document_context = "",
      description = "",
      max_categories = 12L,
      categories_per_chunk = 10L,
      divisions = 12L,
      user_model = "gpt-4o",
      creativity = NULL,
      specificity = "broad",
      research_question = NULL,
      mode = "text",
      filename = NULL,
      model_source = "auto",
      iterations = 8L,
      random_state = NULL,
      focus = NULL,
      chunk_delay = 0
    )

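
| Argument | Description |
|---|---|
| `input_data` | A character vector, list, or `NULL` (default). |
| `api_key` | Character or `NULL` (default). |
| `source` | Character or `NULL` (default). |
| `doc_type` | Character or `NULL` (default). |
| `since` | Character or `NULL` (default). |
| `until` | Character or `NULL` (default). |
| `n` | Integer or `NULL` (default). |
| `document_context` | Character. Context about the document. Default `""`. |
| `description` | Character. Additional context. Default `""`. |
| `max_categories` | Integer. Default `12L`. |
| `categories_per_chunk` | Integer. Default `10L`. |
| `divisions` | Integer. Default `12L`. |
| `user_model` | Character. Default `"gpt-4o"`. |
| `creativity` | Numeric or `NULL` (default). |
| `specificity` | Character. Default `"broad"`. |
| `research_question` | Character or `NULL` (default). |
| `mode` | Character. Default `"text"`. |
| `filename` | Character or `NULL` (default). |
| `model_source` | Character. Default `"auto"`. |
| `iterations` | Integer. Default `8L`. |
| `random_state` | Integer or `NULL` (default). |
| `focus` | Character or `NULL` (default). |
| `chunk_delay` | Numeric. Default `0`. |
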
A named list with counts_df, top_categories, and raw_top_text.

    ## Not run:
    result <- extract(
      input_data = df$bill_text,
      document_context = "California state legislation",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    print(result$top_categories)
    ## End(Not run)

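One plausible use of the returned list is to pass top_categories on as the categories argument of classify(). The sketch below mocks only that one field and assumes it is a character vector (or coercible to one); the documentation names the field but does not state its type, so treat this as an assumption. The actual classify() call is left commented out because it needs the package and a valid API key.

```r
# Mock of the one field used here; in practice `result` comes from extract().
# Assumption: top_categories is a character vector or coercible to one.
result <- list(top_categories = c("Housing", "Public Safety", "Finance"))

# Defensive clean-up before reuse: flatten, trim, drop empties and repeats.
categories <- trimws(as.character(unlist(result$top_categories)))
categories <- unique(categories[nzchar(categories)])

# Assemble the arguments for a follow-up classify() call.
classify_args <- list(
  categories = categories,
  input_data = c("An ordinance amending the zoning code."),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini"
)
names(classify_args)

# results <- do.call(classify, classify_args)  # requires catpol and a valid key
```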
Returns the names of all data sources registered with the Python catpol
package (city ordinances, federal laws, executive orders, presidential
speeches, social media archives, etc.).

    list_sources()

A character vector of source names.

    ## Not run:
    list_sources()
    #> [1] "city_san_diego"     "city_san_francisco"
    #> [3] "federal_laws"       "federal_executive_orders"
    #> [5] "social_trump_truth" ...
    ## End(Not run)

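The returned vector can be filtered or used to validate a source name before it is passed to classify(), explore(), extract(), or summarize(). The sketch below works on a hard-coded copy of the five names shown in the example output (the real list is longer, as the trailing `...` indicates); `check_source()` is a hypothetical helper, not part of catpol.

```r
# Copied from the example output above; a live session would use
# sources <- list_sources() instead. The real list contains more entries.
sources <- c("city_san_diego", "city_san_francisco", "federal_laws",
             "federal_executive_orders", "social_trump_truth")

# Filter by prefix, for example all city-level sources.
city_sources <- grep("^city_", sources, value = TRUE)
city_sources
#> [1] "city_san_diego"     "city_san_francisco"

# Hypothetical guard: fail early on a misspelled source name.
check_source <- function(source, available) {
  if (!source %in% available) {
    stop(sprintf("Unknown source '%s'.", source), call. = FALSE)
  }
  invisible(source)
}

check_source("federal_laws", sources)
```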
Wraps the Python catpol.summarize() function. Generates summaries from
policy text or from a registered political data source. Adds a tone
parameter for policy-specific framing.

    summarize(
      input_data = NULL,
      source = NULL,
      doc_type = NULL,
      since = NULL,
      until = NULL,
      n = NULL,
      format = "paragraph",
      tone = "eli5",
      api_key = NULL,
      description = "",
      instructions = "",
      max_length = NULL,
      focus = NULL,
      user_model = "gpt-4o",
      model_source = "auto",
      mode = "image",
      input_mode = NULL,
      input_type = "auto",
      pdf_dpi = 150L,
      creativity = NULL,
      thinking_budget = 0L,
      chain_of_thought = TRUE,
      context_prompt = FALSE,
      step_back_prompt = FALSE,
      filename = NULL,
      save_directory = NULL,
      models = NULL,
      max_workers = NULL,
      parallel = NULL,
      auto_download = FALSE,
      safety = FALSE,
      max_retries = 5L,
      batch_retries = 2L,
      retry_delay = 1,
      row_delay = 0,
      fail_strategy = "partial",
      batch_mode = FALSE,
      batch_poll_interval = 30,
      batch_timeout = 86400
    )

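
| Argument | Description |
|---|---|
| `input_data` | A character vector, list, or PDF/URL paths. Default `NULL`. |
| `source` | Character or `NULL` (default). |
| `doc_type` | Character or `NULL` (default). |
| `since` | Character or `NULL` (default). |
| `until` | Character or `NULL` (default). |
| `n` | Integer or `NULL` (default). |
| `format` | Character. Output format. Default `"paragraph"`. |
| `tone` | Character. Policy-specific tone, e.g. `"eli5"` (default). |
| `api_key` | Character or `NULL` (default). |
| `description` | Character. Default `""`. |
| `instructions` | Character. Specific instructions for the summary. Default `""`. |
| `max_length` | Integer or `NULL` (default). |
| `focus` | Character or `NULL` (default). |
| `user_model` | Character. Default `"gpt-4o"`. |
| `model_source` | Character. Default `"auto"`. |
| `mode` | Character. Default `"image"`. |
| `input_mode` | Character or `NULL` (default). |
| `input_type` | Character. Default `"auto"`. |
| `pdf_dpi` | Integer. Default `150L`. |
| `creativity` | Numeric or `NULL` (default). |
| `thinking_budget` | Integer. Default `0L`. |
| `chain_of_thought` | Logical. Default `TRUE`. |
| `context_prompt` | Logical. Default `FALSE`. |
| `step_back_prompt` | Logical. Default `FALSE`. |
| `filename` | Character or `NULL` (default). |
| `save_directory` | Character or `NULL` (default). |
| `models` | List of model specs for ensemble mode. Default `NULL`. |
| `max_workers` | Integer or `NULL` (default). |
| `parallel` | Logical or `NULL` (default). |
| `auto_download` | Logical. Default `FALSE`. |
| `safety` | Logical. Default `FALSE`. |
| `max_retries` | Integer. Default `5L`. |
| `batch_retries` | Integer. Default `2L`. |
| `retry_delay` | Numeric. Default `1`. |
| `row_delay` | Numeric. Default `0`. |
| `fail_strategy` | Character. Default `"partial"`. |
| `batch_mode` | Logical. Default `FALSE`. |
| `batch_poll_interval` | Numeric. Default `30`. |
| `batch_timeout` | Numeric. Default `86400`. |
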
A data.frame with summarization results.

    ## Not run:
    results <- summarize(
      source = "federal_executive_orders",
      since = "2025-01-01",
      n = 20L,
      format = "paragraph",
      tone = "eli5",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    ## End(Not run)

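The batch defaults in the signature are easier to reason about when converted to familiar units. The arithmetic below assumes that batch_poll_interval and batch_timeout are both expressed in seconds and that the job is polled once per interval until the timeout; neither point is stated in the documentation, so this is a rough sketch rather than a description of the package's behaviour.

```r
# Defaults taken from the summarize() signature.
batch_poll_interval <- 30     # assumed to be seconds between status checks
batch_timeout       <- 86400  # assumed to be seconds before giving up

# Timeout expressed in hours.
timeout_hours <- batch_timeout / 3600
timeout_hours
#> [1] 24

# Upper bound on the number of status checks before the timeout is reached.
max_polls <- ceiling(batch_timeout / batch_poll_interval)
max_polls
#> [1] 2880
```

Under these assumptions, raising batch_poll_interval reduces the number of status requests at the cost of noticing a finished batch later.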