--- title: "Classifying Policy Documents and Political Text" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Classifying Policy Documents and Political Text} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(eval = FALSE, comment = "#>", collapse = TRUE) ``` # What `cat.pol` adds `cat.pol` is a thin domain wrapper around `cat.stack` that adds: 1. **A registered-source fetcher** — 17 built-in political data sources (municipal ordinances, federal laws, executive orders, presidential speeches, Truth Social posts, etc.) accessible via a single `source=` argument. The data lives on HuggingFace and is refreshed weekly. 2. **Policy-document prompt framing** — context like `"This is a policy document; identify what it does and who it affects"` injected automatically. Everything else — supported models, output format, ensemble voting — is identical to `cat.stack`. # Install ```{r install} install.packages( "cat.pol", repos = c("https://chrissoria.r-universe.dev", "https://cloud.r-project.org") ) library(cat.pol) ``` # See available data sources ```{r list-sources} list_sources() #> [1] "city_san_diego" "city_san_francisco" #> [3] "city_los_angeles" "federal_laws" #> [5] "federal_executive_orders" "social_trump_truth" #> ... ``` Each source maps to a curated HuggingFace dataset with weekly updates. See the [Python catpol README](https://pypi.org/project/cat-pol/) for the current full list and the schema of each source. # Fetch and classify ordinances ```{r fetch-classify} results <- classify( source = "city_san_diego", doc_type = "ordinance", since = "2024-01-01", n = 50L, categories = c("Housing", "Public Safety", "Finance", "Infrastructure", "Health", "Other"), api_key = Sys.getenv("OPENAI_API_KEY"), user_model = "gpt-4o-mini" ) ``` The returned `data.frame` has one row per ordinance with the original text, the date, the URL/ID, and one 0/1 column per category. # Filter by date range and document type ```{r filters} # Resolutions only, between two dates: results <- classify( source = "city_san_francisco", doc_type = "resolution", since = "2024-06-01", until = "2024-12-31", n = 200L, categories = c("Climate", "Housing", "Transportation", "Police accountability", "Other"), api_key = Sys.getenv("OPENAI_API_KEY") ) ``` # Classify your own policy text If you have policy documents not in the registered sources (state legislation, agency rules, advocacy white papers), pass them as `input_data`: ```{r classify-text} results <- classify( input_data = df$bill_text, document_context = "California state Senate bills, 2024 session", categories = c("Housing", "Public Safety", "Education", "Healthcare", "Environment", "Other"), api_key = Sys.getenv("OPENAI_API_KEY"), user_model = "gpt-4o-mini" ) ``` `document_context` is `cat.pol`'s analog of `cat.survey`'s `survey_question` — it gives the model framing for the documents being analyzed. # Summarize before classifying Long ordinances (10–20k words) can blow past context limits and cost a lot in tokens. Summarize first, then classify the summaries: ```{r summarize-first} summaries <- summarize( source = "city_san_diego", doc_type = "ordinance", since = "2024-01-01", n = 50L, format = "paragraph", tone = "eli5", # plain-language summary api_key = Sys.getenv("OPENAI_API_KEY"), user_model = "gpt-4o-mini" ) results <- classify( input_data = summaries$summary, categories = c("Housing", "Public Safety", "Finance", "Other"), api_key = Sys.getenv("OPENAI_API_KEY") ) ``` The `tone` parameter is specific to `cat.pol::summarize()`; options include `"eli5"` (plain language), `"neutral"` (technical), and `"academic"` (formal). Useful for downstream readability or for generating press-friendly summaries alongside the analytic classification. # Tips for political-text work 1. **Be specific about category boundaries.** Policy domains overlap — a housing ordinance might also touch finance and zoning. Either use multi-label (default) or write categories with explicit exclusion criteria. 2. **Watch for ideological priors.** LLMs have political biases. For research where the political lean of the classifier matters, use a multi-model ensemble (see the meta-package vignette, `vignette("getting-started", "cat.llm")`). 3. **Cite the data source.** Each `source=` value corresponds to a public HuggingFace dataset that has its own preferred citation. # Where to learn more - Full Getting Started guide: `vignette("getting-started", package = "cat.llm")` - Per-function reference: `?cat.pol::classify`, `?cat.pol::extract`, `?cat.pol::explore`, `?cat.pol::summarize`, `?cat.pol::list_sources` - Built-in source schemas: