| Title: | Academic Paper Classification with LLMs |
|---|---|
| Description: | R interface to the Python catademic package. Classifies, extracts, explores, and summarizes academic papers using LLMs. A thin domain wrapper around cat.stack that adds journal and topic sourcing parameters for academic literature analysis. |
| Authors: | Chris Soria [aut, cre] |
| Maintainer: | Chris Soria <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.1.1 |
| Built: | 2026-07-04 06:19:37 UTC |
| Source: | https://github.com/chrissoria/cat-llm |
Wraps the Python catademic.classify() function. Adds journal and topic
sourcing parameters to the base cat.stack classification engine.

    classify(
      categories, input_data = NULL, api_key = NULL,
      journal_issn = NULL, journal_name = NULL, journal_field = NULL,
      topic_name = NULL, topic_id = NULL, paper_limit = 50L,
      date_from = NULL, date_to = NULL, polite_email = NULL,
      journal = NULL, field = NULL, research_focus = NULL,
      paper_metadata = NULL, description = "",
      filename = NULL, save_directory = NULL,
      user_model = "gpt-4o", mode = "image", creativity = NULL,
      safety = FALSE, chain_of_verification = FALSE,
      chain_of_thought = FALSE, step_back_prompt = FALSE,
      context_prompt = FALSE, thinking_budget = 0L,
      example1 = NULL, example2 = NULL, example3 = NULL,
      example4 = NULL, example5 = NULL, example6 = NULL,
      model_source = "auto", max_categories = 12L,
      categories_per_chunk = 10L, divisions = 10L,
      research_question = NULL, models = NULL,
      consensus_threshold = "unanimous", use_json_schema = TRUE,
      max_workers = NULL, fail_strategy = "partial",
      max_retries = 5L, batch_retries = 2L,
      retry_delay = 1, row_delay = 0, pdf_dpi = 150L,
      auto_download = FALSE, add_other = "prompt",
      check_verbosity = TRUE, prompt_tune = NULL,
      tune_iterations = 1L, tune_ui = "browser",
      tune_optimize = "balanced"
    )

| Argument | Description |
|---|---|
| categories | A character vector of category names, or |
| input_data | A character vector, list, or |
| api_key | Character or `NULL` |
| journal_issn | Character or `NULL` |
| journal_name | Character or `NULL` |
| journal_field | Character or `NULL` |
| topic_name | Character or `NULL` |
| topic_id | Character or `NULL` |
| paper_limit | Integer. Max papers to fetch. Default `50L` |
| date_from | Character or `NULL` |
| date_to | Character or `NULL` |
| polite_email | Character or `NULL` |
| journal | Character or `NULL` |
| field | Character or `NULL` |
| research_focus | Character or `NULL` |
| paper_metadata | Named list or `NULL` |
| description | Character. Context description. Default `""` |
| filename | Character or `NULL` |
| save_directory | Character or `NULL` |
| user_model | Character. Model name. Default `"gpt-4o"` |
| mode | Character. Processing mode. Default `"image"` |
| creativity | Numeric or `NULL` |
| safety | Logical. Save progress after each item. Default `FALSE` |
| chain_of_verification | Logical. Default `FALSE` |
| chain_of_thought | Logical. Default `FALSE` |
| step_back_prompt | Logical. Default `FALSE` |
| context_prompt | Logical. Default `FALSE` |
| thinking_budget | Integer. Default `0L` |
| example1, example2, example3, example4, example5, example6 | Optional few-shot examples. |
| model_source | Character. Provider hint. Default `"auto"` |
| max_categories | Integer. Default `12L` |
| categories_per_chunk | Integer. Default `10L` |
| divisions | Integer. Default `10L` |
| research_question | Character or `NULL` |
| models | List of model specs for ensemble mode. |
| consensus_threshold | Character or numeric. Default `"unanimous"` |
| use_json_schema | Logical. Default `TRUE` |
| max_workers | Integer or `NULL` |
| fail_strategy | Character. Default `"partial"` |
| max_retries | Integer. Default `5L` |
| batch_retries | Integer. Default `2L` |
| retry_delay | Numeric. Default `1` |
| row_delay | Numeric. Default `0` |
| pdf_dpi | Integer. Default `150L` |
| auto_download | Logical. Default `FALSE` |
| add_other | Logical or |
| check_verbosity | Logical. Default `TRUE` |
| prompt_tune | Integer or `NULL` |
| tune_iterations | Integer. APO optimization passes. Default `1L` |
| tune_ui | Character. Correction UI: |
| tune_optimize | Character. Metric to optimize: |

A data.frame with classification results.

    ## Not run:
    # Classify abstracts directly
    results <- classify(
      categories = c("Methods", "Theory", "Review", "Other"),
      input_data = df$abstract,
      mode = "text",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )

    # Fetch papers from a journal via OpenAlex
    results <- classify(
      categories = c("Empirical", "Theoretical", "Review"),
      journal_name = "American Sociological Review",
      paper_limit = 100L,
      polite_email = "[email protected]",
      api_key = Sys.getenv("OPENAI_API_KEY")
    )
    ## End(Not run)

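
The examples read the key with Sys.getenv("OPENAI_API_KEY"), which returns an empty string when the variable is unset. The sketch below is one possible guard that fails early instead of passing an empty api_key along. The helper name get_api_key and the variable CATADEMIC_DEMO_KEY are hypothetical and not part of the package.

```r
# Hypothetical helper (not part of the package): read an API key from an
# environment variable and stop with a clear message if it is missing.
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Demonstration with a made-up variable name so the block runs anywhere.
Sys.setenv(CATADEMIC_DEMO_KEY = "sk-demo")
key <- get_api_key("CATADEMIC_DEMO_KEY")

# An unset variable raises an error rather than returning "".
missing_key <- tryCatch(
  get_api_key("CATADEMIC_SURELY_UNSET_VAR"),
  error = function(e) e
)
```

In a real call the result would be passed as api_key = get_api_key().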
Wraps the Python catademic.explore() function. Returns every category
string extracted from every chunk across every iteration – with duplicates
intact.

    explore(
      input_data = NULL, api_key = NULL, description = "",
      journal_issn = NULL, journal_name = NULL, journal_field = NULL,
      topic_name = NULL, topic_id = NULL, paper_limit = 50L,
      date_from = NULL, date_to = NULL, polite_email = NULL,
      max_categories = 12L, categories_per_chunk = 10L, divisions = 12L,
      user_model = "gpt-4o", creativity = NULL, specificity = "broad",
      research_question = NULL, filename = NULL, model_source = "auto",
      iterations = 8L, random_state = NULL, focus = NULL,
      chunk_delay = 0
    )

| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| api_key | Character or `NULL` |
| description | Character. Context description. Default `""` |
| journal_issn | Character or `NULL` |
| journal_name | Character or `NULL` |
| journal_field | Character or `NULL` |
| topic_name | Character or `NULL` |
| topic_id | Character or `NULL` |
| paper_limit | Integer. Max papers to fetch. Default `50L` |
| date_from | Character or `NULL` |
| date_to | Character or `NULL` |
| polite_email | Character or `NULL` |
| max_categories | Integer. Default `12L` |
| categories_per_chunk | Integer. Default `10L` |
| divisions | Integer. Default `12L` |
| user_model | Character. Default `"gpt-4o"` |
| creativity | Numeric or `NULL` |
| specificity | Character. Default `"broad"` |
| research_question | Character or `NULL` |
| filename | Character or `NULL` |
| model_source | Character. Default `"auto"` |
| iterations | Integer. Default `8L` |
| random_state | Integer or `NULL` |
| focus | Character or `NULL` |
| chunk_delay | Numeric. Default `0` |

A character vector of every category string extracted.

    ## Not run:
    raw_cats <- explore(
      input_data = df$abstracts,
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini",
      iterations = 4L
    )
    table(raw_cats)
    ## End(Not run)

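
Because explore() returns every extracted string with duplicates intact, tallying is left to the caller. The sketch below runs on a made-up character vector standing in for real output. The trimming and lower-casing step is an assumption about what may be useful before counting, not something the package is documented to do.

```r
# Made-up stand-in for the character vector returned by explore().
raw_cats <- c("Methods", "Theory", "methods", "Review", "Theory", "Theory ")

# Light normalisation so trivially different strings are counted together.
cleaned <- tolower(trimws(raw_cats))

# Frequency of each category string, most frequent first.
counts <- sort(table(cleaned), decreasing = TRUE)

# Keep the two most frequent strings as candidate categories.
top_two <- names(head(counts, 2))
print(counts)
```

Strings that recur across many chunks and iterations are natural candidates for the categories argument of classify().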
Wraps the Python catademic.extract() function. Discovers and returns a
normalised, deduplicated set of categories from academic paper data.

    extract(
      input_data = NULL, api_key = NULL,
      journal_issn = NULL, journal_name = NULL, journal_field = NULL,
      topic_name = NULL, topic_id = NULL, paper_limit = 50L,
      date_from = NULL, date_to = NULL, polite_email = NULL,
      journal = NULL, field = NULL, research_focus = NULL,
      paper_metadata = NULL, description = "",
      max_categories = 12L, categories_per_chunk = 10L, divisions = 12L,
      user_model = "gpt-4o", creativity = NULL, specificity = "broad",
      research_question = NULL, mode = "text", filename = NULL,
      model_source = "auto", iterations = 8L, random_state = NULL,
      focus = NULL, chunk_delay = 0
    )

| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| api_key | Character or `NULL` |
| journal_issn | Character or `NULL` |
| journal_name | Character or `NULL` |
| journal_field | Character or `NULL` |
| topic_name | Character or `NULL` |
| topic_id | Character or `NULL` |
| paper_limit | Integer. Max papers to fetch. Default `50L` |
| date_from | Character or `NULL` |
| date_to | Character or `NULL` |
| polite_email | Character or `NULL` |
| journal | Character or `NULL` |
| field | Character or `NULL` |
| research_focus | Character or `NULL` |
| paper_metadata | Named list or `NULL` |
| description | Character. Context description. Default `""` |
| max_categories | Integer. Default `12L` |
| categories_per_chunk | Integer. Default `10L` |
| divisions | Integer. Default `12L` |
| user_model | Character. Default `"gpt-4o"` |
| creativity | Numeric or `NULL` |
| specificity | Character. Default `"broad"` |
| research_question | Character or `NULL` |
| mode | Character. Default `"text"` |
| filename | Character or `NULL` |
| model_source | Character. Default `"auto"` |
| iterations | Integer. Default `8L` |
| random_state | Integer or `NULL` |
| focus | Character or `NULL` |
| chunk_delay | Numeric. Default `0` |

A named list with counts_df, top_categories, and raw_top_text.

    ## Not run:
    result <- extract(
      topic_name = "climate change adaptation",
      paper_limit = 200L,
      polite_email = "[email protected]",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    print(result$top_categories)
    ## End(Not run)

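
The value of extract() is described only as a named list with counts_df, top_categories, and raw_top_text. The sketch below uses a mocked list of that shape to show one way the discovered categories might be handed on to classify(). The element types, the counts_df column names, and the category strings are all assumptions made for illustration.

```r
# Mocked result with the three documented element names. The types and
# the counts_df column names are assumptions, not documented behaviour.
result <- list(
  counts_df = data.frame(
    category = c("Adaptation policy", "Risk perception"),
    count = c(14L, 9L)
  ),
  top_categories = c("Adaptation policy", "Risk perception"),
  raw_top_text = "1. Adaptation policy\n2. Risk perception"
)

# Check the documented names are present before relying on them.
stopifnot(all(c("counts_df", "top_categories", "raw_top_text") %in% names(result)))

# Build a category vector for classify(), adding a catch-all bucket.
categories <- unique(c(result$top_categories, "Other"))

# Arguments that could later be passed on with do.call(classify, ...).
classify_args <- list(
  categories = categories,
  mode = "text",
  user_model = "gpt-4o-mini"
)
```

Collecting the arguments in a list keeps the discovery and classification steps separate, so the category set can be reviewed or edited by hand before any classification call is made.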
Wraps the Python catademic.summarize() function. Generates summaries of
academic paper data. The Python function accepts input_data and passes all
other arguments through via **kwargs to cat_stack.summarize().

    summarize(
      input_data, api_key = NULL, description = "", instructions = "",
      format = "paragraph", max_length = NULL, focus = NULL,
      user_model = "gpt-4o", model_source = "auto", mode = "image",
      input_mode = NULL, input_type = "auto", pdf_dpi = 150L,
      creativity = NULL, thinking_budget = 0L,
      chain_of_thought = TRUE, context_prompt = FALSE,
      step_back_prompt = FALSE, filename = NULL, save_directory = NULL,
      models = NULL, max_workers = NULL, parallel = NULL,
      auto_download = FALSE, safety = FALSE,
      max_retries = 5L, batch_retries = 2L,
      retry_delay = 1, row_delay = 0, fail_strategy = "partial",
      batch_mode = FALSE, batch_poll_interval = 30,
      batch_timeout = 86400
    )

| Argument | Description |
|---|---|
| input_data | A character vector, list, or |
| api_key | Character or `NULL` |
| description | Character. Context description. Default `""` |
| instructions | Character. Specific instructions for the summary. Default `""` |
| format | Character. Output format. Default `"paragraph"` |
| max_length | Integer or `NULL` |
| focus | Character or `NULL` |
| user_model | Character. Model name. Default `"gpt-4o"` |
| model_source | Character. Provider hint. Default `"auto"` |
| mode | Character. Processing mode. Default `"image"` |
| input_mode | Character or `NULL` |
| input_type | Character. Input type. Default `"auto"` |
| pdf_dpi | Integer. DPI for PDFs. Default `150L` |
| creativity | Numeric or `NULL` |
| thinking_budget | Integer. Default `0L` |
| chain_of_thought | Logical. Default `TRUE` |
| context_prompt | Logical. Default `FALSE` |
| step_back_prompt | Logical. Default `FALSE` |
| filename | Character or `NULL` |
| save_directory | Character or `NULL` |
| models | List of model specs for ensemble mode. Default `NULL` |
| max_workers | Integer or `NULL` |
| parallel | Logical or `NULL` |
| auto_download | Logical. Default `FALSE` |
| safety | Logical. Default `FALSE` |
| max_retries | Integer. Default `5L` |
| batch_retries | Integer. Default `2L` |
| retry_delay | Numeric. Default `1` |
| row_delay | Numeric. Default `0` |
| fail_strategy | Character. Default `"partial"` |
| batch_mode | Logical. Default `FALSE` |
| batch_poll_interval | Numeric. Default `30` |
| batch_timeout | Numeric. Default `86400` |

A data.frame with summarization results.

    ## Not run:
    summaries <- summarize(
      input_data = df$abstracts,
      description = "Sociology journal abstracts",
      instructions = "Summarize the key findings in 2 sentences",
      format = "paragraph",
      api_key = Sys.getenv("OPENAI_API_KEY"),
      user_model = "gpt-4o-mini"
    )
    ## End(Not run)

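
The description of summarize() notes that the Python function accepts input_data and passes everything else through via **kwargs. The closest R analogue is forwarding with `...`, sketched below with two hypothetical stand-in functions. This illustrates the pass-through pattern only and makes no claim about how the package implements its wrapper.

```r
# Hypothetical stand-in for the underlying engine: it owns the full
# argument list and its defaults.
base_summarize <- function(input_data, format = "paragraph", max_length = NULL) {
  list(n = length(input_data), format = format, max_length = max_length)
}

# Hypothetical thin wrapper: it names only input_data and forwards all
# other arguments untouched through `...`.
domain_summarize <- function(input_data, ...) {
  base_summarize(input_data, ...)
}

# Arguments the wrapper never declares still reach the engine.
out <- domain_summarize(c("abstract one", "abstract two"), max_length = 50L)
```

With this pattern the wrapper stays in step with the engine automatically, at the cost that misspelled argument names are only caught by the inner function.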