---
title: "Classifying Academic Papers"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Classifying Academic Papers}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(eval = FALSE, comment = "#>", collapse = TRUE)
```

# What `cat.ademic` adds

`cat.ademic` is a thin domain wrapper around `cat.stack` that adds
**OpenAlex-based paper fetching** plus academic prompt framing.
You can:

1. **Fetch papers from a journal or topic** via OpenAlex (`journal_name`,
   `journal_issn`, `journal_field`, `topic_name`, `topic_id`) and
   classify them in one call.
2. **Classify text you already have** (abstracts, full text, or PDFs)
   as a plain character vector or file directory.

Everything else — supported models, output format, ensemble voting,
batch mode — is identical to `cat.stack`.

# Install

```{r install}
install.packages(
  "cat.ademic",
  repos = c("https://chrissoria.r-universe.dev",
            "https://cloud.r-project.org")
)
library(cat.ademic)
```

# Classify abstracts you already have

```{r classify-abstracts}
abstracts <- c(
  "We use mixed-methods to study labor market outcomes for...",
  "This paper develops a formal model of bargaining under...",
  "A systematic review of 47 studies on educational interventions..."
)

results <- classify(
  categories = c("Empirical-quantitative",
                 "Empirical-qualitative",
                 "Theoretical-formal",
                 "Review/meta-analysis",
                 "Other"),
  input_data = abstracts,
  mode       = "text",
  api_key    = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini"
)
```

# Fetch papers from a journal

`cat.ademic` connects to [OpenAlex](https://openalex.org) — a free,
open scholarly database — to fetch papers by journal, field, or topic.
Set `polite_email` (your email) for higher rate limits.

```{r fetch-journal}
results <- classify(
  categories   = c("Quantitative", "Qualitative", "Mixed Methods"),
  journal_name = "American Sociological Review",
  paper_limit  = 100L,
  date_from    = "2024-01-01",
  polite_email = "you@university.edu",
  api_key      = Sys.getenv("OPENAI_API_KEY")
)
```

Or by ISSN for unambiguous journal identification:

```{r fetch-issn}
results <- classify(
  categories   = c("Empirical", "Theoretical", "Review"),
  journal_issn = "0003-1224",                # AJS
  paper_limit  = 50L,
  polite_email = "you@university.edu",
  api_key      = Sys.getenv("OPENAI_API_KEY")
)
```

# Fetch papers by topic

OpenAlex auto-tags papers with research topics. You can pull all papers
on a topic across journals:

```{r fetch-topic}
results <- classify(
  categories   = c("Causal-identification", "Descriptive",
                   "Theoretical", "Other"),
  topic_name   = "climate change adaptation",
  paper_limit  = 200L,
  date_from    = "2023-01-01",
  polite_email = "you@university.edu",
  api_key      = Sys.getenv("OPENAI_API_KEY")
)
```

# Classify full-text PDFs

Pass a directory or a vector of file paths. `cat.ademic` extracts the
text (or renders pages as images for vision models) and classifies:

```{r classify-pdfs}
# One-time: install PDF extras
# cat.stack::install_cat_stack(pdf = TRUE)

results <- classify(
  categories = c("Has-DGP-assumption", "No-DGP-assumption",
                 "Unclear", "Other"),
  input_data = "./papers/",                  # directory of PDFs
  mode       = "image",                      # rendered-page vision mode
  api_key    = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o"                      # vision-capable model
)
```

# Summarize before classifying

For long full-text inputs, summarizing first can improve downstream
classification quality (and reduce token cost):

```{r summarize-first}
summaries <- summarize(
  input_data   = "./papers/",
  description  = "Sociology articles",
  instructions = "Summarize methodology and key findings in 3 sentences",
  format       = "paragraph",
  api_key      = Sys.getenv("OPENAI_API_KEY"),
  user_model   = "gpt-4o-mini"
)

results <- classify(
  categories = c("Causal", "Descriptive", "Theoretical", "Other"),
  input_data = summaries$summary,
  api_key    = Sys.getenv("OPENAI_API_KEY")
)
```

# Tips for academic work

1. **Always set `polite_email`** when fetching from OpenAlex — without
   it you're throttled to a low rate limit.
2. **Abstract vs. full text.** Abstracts are cheap and fast; full-text
   classification (PDF input) is more accurate for methodological
   categories but costs more. Use abstracts for screening, full text
   for in-depth coding.
3. **Cite OpenAlex** if you publish on data fetched through it — see
   [openalex.org](https://openalex.org) for citation guidance.

# Where to learn more

- Full Getting Started guide:
  `vignette("getting-started", package = "cat.llm")`
- Per-function reference: `?cat.ademic::classify`, `?cat.ademic::extract`,
  `?cat.ademic::explore`, `?cat.ademic::summarize`
- OpenAlex docs: <https://docs.openalex.org>