| Title: | Web Content Classification with LLMs |
|---|---|
| Description: | R interface to the Python catweb package. Classifies, extracts, explores, and summarizes web content (URLs or text) using LLMs. A thin domain wrapper around cat.stack that adds automatic URL fetching and web-context prompt injection (source domain, content type, metadata). |
| Authors: | Chris Soria [aut, cre] |
| Maintainer: | Chris Soria <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.2 |
| Built: | 2026-07-04 06:20:15 UTC |
| Source: | https://github.com/chrissoria/cat-llm |
Wraps the Python catweb.classify() function. Accepts URLs (auto-fetched
to text) or raw text strings. Injects web context (source domain, content
type, metadata) into the classification prompt.
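
Because input_data may mix URLs and raw text, it can help to preview which elements are likely to trigger a fetch. The sketch below is a hypothetical helper, not part of the package; it assumes an http(s):// prefix marks a URL, and the actual detection rule in catweb may differ.

```r
# Hypothetical helper: guess which inputs look like URLs.
# Assumption: an http(s):// prefix marks a URL; the package may use another rule.
looks_like_url <- function(x) {
  grepl("^https?://", x, ignore.case = TRUE)
}

inputs <- c(
  "https://example.com/article-1",
  "Plain text that should be classified as-is.",
  "http://example.com/article-2"
)

is_url <- looks_like_url(inputs)
urls_to_fetch <- inputs[is_url]   # candidates for auto-fetching
raw_text      <- inputs[!is_url]  # passed through as text
```
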

    classify(
      categories, input_data = NULL, api_key = NULL,
      source_domain = NULL, content_type = NULL, web_metadata = NULL,
      description = "", filename = NULL, save_directory = NULL, timeout = 30L,
      user_model = "gpt-4o", mode = "image", creativity = NULL, safety = FALSE,
      chain_of_verification = FALSE, chain_of_thought = FALSE,
      step_back_prompt = FALSE, context_prompt = FALSE, thinking_budget = 0L,
      example1 = NULL, example2 = NULL, example3 = NULL,
      example4 = NULL, example5 = NULL, example6 = NULL,
      model_source = "auto", max_categories = 12L, categories_per_chunk = 10L,
      divisions = 10L, research_question = NULL, models = NULL,
      consensus_threshold = "unanimous", use_json_schema = TRUE,
      max_workers = NULL, fail_strategy = "partial", max_retries = 5L,
      batch_retries = 2L, retry_delay = 1, row_delay = 0, pdf_dpi = 150L,
      auto_download = FALSE, add_other = "prompt", check_verbosity = TRUE,
      prompt_tune = NULL, tune_iterations = 1L, tune_ui = "browser",
      tune_optimize = "balanced"
    )


| Argument | Description |
|---|---|
| categories | A character vector of category names. |
| input_data | A character vector / list of URLs or raw text strings. Default NULL. |
| api_key | Character or NULL. Default NULL. |
| source_domain | Character or NULL. Default NULL. |
| content_type | Character or NULL. Default NULL. |
| web_metadata | Named list or NULL. Default NULL. |
| description | Character. Context description. Default "". |
| filename | Character or NULL. Default NULL. |
| save_directory | Character or NULL. Default NULL. |
| timeout | Integer. URL fetch timeout (seconds). Default 30L. |
| user_model | Character. Model name. Default "gpt-4o". |
| mode | Character. Processing mode. Default "image". |
| creativity | Numeric or NULL. Default NULL. |
| safety | Logical. Default FALSE. |
| chain_of_verification | Logical. Default FALSE. |
| chain_of_thought | Logical. Default FALSE. |
| step_back_prompt | Logical. Default FALSE. |
| context_prompt | Logical. Default FALSE. |
| thinking_budget | Integer. Default 0L. |
| example1, example2, example3, example4, example5, example6 | Optional few-shot examples. Default NULL. |
| model_source | Character. Default "auto". |
| max_categories | Integer. Default 12L. |
| categories_per_chunk | Integer. Default 10L. |
| divisions | Integer. Default 10L. |
| research_question | Character or NULL. Default NULL. |
| models | List of model specs for ensemble mode. Default NULL. |
| consensus_threshold | Character or numeric. Default "unanimous". |
| use_json_schema | Logical. Default TRUE. |
| max_workers | Integer or NULL. Default NULL. |
| fail_strategy | Character. Default "partial". |
| max_retries | Integer. Default 5L. |
| batch_retries | Integer. Default 2L. |
| retry_delay | Numeric. Default 1. |
| row_delay | Numeric. Default 0. |
| pdf_dpi | Integer. Default 150L. |
| auto_download | Logical. Default FALSE. |
| add_other | Logical or character. Default "prompt". |
| check_verbosity | Logical. Default TRUE. |
| prompt_tune | Integer or NULL. Default NULL. |
| tune_iterations | Integer. APO optimization passes. Default 1L. |
| tune_ui | Character. Correction UI. Default "browser". |
| tune_optimize | Character. Metric to optimize. Default "balanced". |

A data.frame with classification results.

    ## Not run:
    # Classify a list of URLs (auto-fetched to text)
    results <- classify(
      categories = c("News", "Opinion", "Tutorial"),
      input_data = c("https://example.com/article-1",
                     "https://example.com/article-2"),
      source_domain = "example.com",
      content_type = "blog post",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )

    # Or classify raw text (no fetching).
    # df is assumed to be an existing data.frame with an article_text column.
    results <- classify(
      categories = c("News", "Opinion", "Tutorial"),
      input_data = df$article_text,
      api_key = Sys.getenv("OPENAI_API_KEY")
    )
    ## End(Not run)

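
The examples read the key with Sys.getenv(), which returns an empty string when the variable is unset. A minimal sketch of a guard that fails early in that case follows; require_api_key is a hypothetical helper introduced here for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): stop early if the key is missing.
require_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}
```

The result could then be passed as api_key, for example api_key = require_api_key().
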
Wraps the Python catweb.explore() function. Returns every category
string extracted from every chunk across every iteration – with
duplicates intact.

    explore(
      input_data = NULL, api_key = NULL, source_domain = NULL,
      content_type = NULL, web_metadata = NULL, description = "",
      timeout = 30L, max_categories = 12L, categories_per_chunk = 10L,
      divisions = 12L, user_model = "gpt-4o", creativity = NULL,
      specificity = "broad", research_question = NULL, filename = NULL,
      model_source = "auto", iterations = 8L, random_state = NULL,
      focus = NULL, chunk_delay = 0
    )


| Argument | Description |
|---|---|
| input_data | A character vector / list of URLs or text. Default NULL. |
| api_key | Character or NULL. Default NULL. |
| source_domain | Character or NULL. Default NULL. |
| content_type | Character or NULL. Default NULL. |
| web_metadata | Named list or NULL. Default NULL. |
| description | Character. Default "". |
| timeout | Integer. URL fetch timeout (seconds). Default 30L. |
| max_categories | Integer. Default 12L. |
| categories_per_chunk | Integer. Default 10L. |
| divisions | Integer. Default 12L. |
| user_model | Character. Default "gpt-4o". |
| creativity | Numeric or NULL. Default NULL. |
| specificity | Character. Default "broad". |
| research_question | Character or NULL. Default NULL. |
| filename | Character or NULL. Default NULL. |
| model_source | Character. Default "auto". |
| iterations | Integer. Default 8L. |
| random_state | Integer or NULL. Default NULL. |
| focus | Character or NULL. Default NULL. |
| chunk_delay | Numeric. Default 0. |

A character vector of every category string extracted.

    ## Not run:
    # urls is assumed to be an existing character vector of URLs.
    raw_cats <- explore(
      input_data = urls,
      source_domain = "example.com",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini",
      iterations = 4L
    )
    table(raw_cats)
    ## End(Not run)

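
Since explore() keeps duplicates, the frequency of each string is the useful signal. The sketch below tallies a mock vector standing in for real output; the values and the threshold of two occurrences are made up for illustration.

```r
# Mock output standing in for explore(); real strings depend on the model.
raw_cats <- c("News", "Opinion", "News", "Tutorial", "News", "Opinion")

# Frequency of each extracted string, most common first.
counts <- sort(table(raw_cats), decreasing = TRUE)

# Keep strings that recur at least twice (threshold chosen for illustration).
recurring <- names(counts)[counts >= 2]
```
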
Wraps the Python catweb.extract() function. Accepts URLs (auto-fetched)
or raw text. Returns a normalised, deduplicated set of categories.

    extract(
      input_data = NULL, api_key = NULL, source_domain = NULL,
      content_type = NULL, web_metadata = NULL, description = "",
      timeout = 30L, max_categories = 12L, categories_per_chunk = 10L,
      divisions = 12L, user_model = "gpt-4o", creativity = NULL,
      specificity = "broad", research_question = NULL, mode = "text",
      filename = NULL, model_source = "auto", iterations = 8L,
      random_state = NULL, focus = NULL, chunk_delay = 0
    )


| Argument | Description |
|---|---|
| input_data | A character vector / list of URLs or text. Default NULL. |
| api_key | Character or NULL. Default NULL. |
| source_domain | Character or NULL. Default NULL. |
| content_type | Character or NULL. Default NULL. |
| web_metadata | Named list or NULL. Default NULL. |
| description | Character. Default "". |
| timeout | Integer. URL fetch timeout (seconds). Default 30L. |
| max_categories | Integer. Default 12L. |
| categories_per_chunk | Integer. Default 10L. |
| divisions | Integer. Default 12L. |
| user_model | Character. Default "gpt-4o". |
| creativity | Numeric or NULL. Default NULL. |
| specificity | Character. Default "broad". |
| research_question | Character or NULL. Default NULL. |
| mode | Character. Default "text". |
| filename | Character or NULL. Default NULL. |
| model_source | Character. Default "auto". |
| iterations | Integer. Default 8L. |
| random_state | Integer or NULL. Default NULL. |
| focus | Character or NULL. Default NULL. |
| chunk_delay | Numeric. Default 0. |

A named list with counts_df, top_categories, and raw_top_text.

    ## Not run:
    result <- extract(
      input_data = c("https://example.com/page1",
                     "https://example.com/page2"),
      source_domain = "example.com",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    print(result$top_categories)
    ## End(Not run)

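
To make the contrast with explore() concrete, the sketch below shows one simple way a raw list with duplicates could be normalised and reduced to a ranked set. It is an illustration only: the package performs this step in Python and its rules are not documented here, and the names tally and top, the lower-casing rule, and the cut-off of two are all assumptions.

```r
# Illustration only: the package's own normalisation may differ.
raw_cats <- c("News", "news ", "Opinion", "Tutorial", " opinion", "News")

# Normalise by trimming whitespace and lower-casing.
normalised <- tolower(trimws(raw_cats))

# Count each distinct category and rank by frequency.
tally <- as.data.frame(table(category = normalised), stringsAsFactors = FALSE)
tally <- tally[order(-tally$Freq), ]

# Take the two most frequent categories (cut-off chosen for illustration).
top <- head(tally$category, 2)
```
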
Wraps the Python catweb.summarize() function. Accepts URLs (auto-fetched)
or raw text. Web context (source domain, content type, metadata) is
injected into the summarization prompt.

    summarize(
      input_data = NULL, source_domain = NULL, content_type = NULL,
      web_metadata = NULL, timeout = 30L, api_key = NULL, description = "",
      instructions = "", format = "paragraph", max_length = NULL,
      focus = NULL, user_model = "gpt-4o", model_source = "auto",
      mode = "image", input_mode = NULL, input_type = "auto", pdf_dpi = 150L,
      creativity = NULL, thinking_budget = 0L, chain_of_thought = TRUE,
      context_prompt = FALSE, step_back_prompt = FALSE, filename = NULL,
      save_directory = NULL, models = NULL, max_workers = NULL,
      parallel = NULL, auto_download = FALSE, safety = FALSE,
      max_retries = 5L, batch_retries = 2L, retry_delay = 1, row_delay = 0,
      fail_strategy = "partial", batch_mode = FALSE,
      batch_poll_interval = 30, batch_timeout = 86400
    )


| Argument | Description |
|---|---|
| input_data | Data to summarize: URLs or text. Default NULL. |
| source_domain | Character or NULL. Default NULL. |
| content_type | Character or NULL. Default NULL. |
| web_metadata | Named list or NULL. Default NULL. |
| timeout | Integer. URL fetch timeout (seconds). Default 30L. |
| api_key | Character or NULL. Default NULL. |
| description | Character. Default "". |
| instructions | Character. Specific instructions for the summary. Default "". |
| format | Character. Default "paragraph". |
| max_length | Integer or NULL. Default NULL. |
| focus | Character or NULL. Default NULL. |
| user_model | Character. Default "gpt-4o". |
| model_source | Character. Default "auto". |
| mode | Character. Default "image". |
| input_mode | Character or NULL. Default NULL. |
| input_type | Character. Default "auto". |
| pdf_dpi | Integer. Default 150L. |
| creativity | Numeric or NULL. Default NULL. |
| thinking_budget | Integer. Default 0L. |
| chain_of_thought | Logical. Default TRUE. |
| context_prompt | Logical. Default FALSE. |
| step_back_prompt | Logical. Default FALSE. |
| filename | Character or NULL. Default NULL. |
| save_directory | Character or NULL. Default NULL. |
| models | List of model specs for ensemble mode. Default NULL. |
| max_workers | Integer or NULL. Default NULL. |
| parallel | Logical or NULL. Default NULL. |
| auto_download | Logical. Default FALSE. |
| safety | Logical. Default FALSE. |
| max_retries | Integer. Default 5L. |
| batch_retries | Integer. Default 2L. |
| retry_delay | Numeric. Default 1. |
| row_delay | Numeric. Default 0. |
| fail_strategy | Character. Default "partial". |
| batch_mode | Logical. Default FALSE. |
| batch_poll_interval | Numeric. Default 30. |
| batch_timeout | Numeric. Default 86400. |

A data.frame with summarization results.

    ## Not run:
    summaries <- summarize(
      input_data = c("https://example.com/article-1",
                     "https://example.com/article-2"),
      source_domain = "example.com",
      content_type = "news article",
      format = "bullets",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    ## End(Not run)
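
The batch defaults are easier to read once converted to familiar units. The arithmetic below assumes that both values are in seconds and that a batch job is polled once every batch_poll_interval until batch_timeout elapses; those semantics are inferred from the parameter names, not stated in this page.

```r
# Defaults copied from the summarize() usage; units assumed to be seconds.
batch_poll_interval <- 30
batch_timeout       <- 86400

# Total wait expressed in hours.
timeout_hours <- batch_timeout / 3600

# Upper bound on status checks if polling runs at a fixed interval.
max_polls <- batch_timeout / batch_poll_interval
```

Under those assumptions the defaults amount to a 24-hour limit with at most 2880 status checks.
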