cat.pol adds

cat.pol is a thin domain wrapper around cat.stack that adds:

- A source= argument for pulling documents from registered sources. The data lives on HuggingFace and is refreshed weekly.
- A policy-specific prompt, "This is a policy document; identify what it does and who it affects", injected automatically.

Everything else (supported models, output format, ensemble voting) is identical to cat.stack.

list_sources()
#> [1] "city_san_diego" "city_san_francisco"
#> [3] "city_los_angeles" "federal_laws"
#> [5] "federal_executive_orders" "social_trump_truth"
#> ...

Each source maps to a curated HuggingFace dataset with weekly updates. See the Python catpol README for the current full list and the schema of each source.

results <- classify(
  source = "city_san_diego",
  doc_type = "ordinance",
  since = "2024-01-01",
  n = 50L,
  categories = c("Housing", "Public Safety", "Finance",
                 "Infrastructure", "Health", "Other"),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini"
)

The returned data.frame has one row per ordinance with the original text, the date, the URL/ID, and one 0/1 column per category.

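As a quick illustration of working with that output, the sketch below tallies how many documents fall in each category. It uses a small mock data.frame rather than a real classify() result, and the column names (text, date, id, and the category columns) are assumptions made for illustration; check the actual names in the object you get back.

```r
# Mock stand-in for a classify() result; all column names are hypothetical.
mock_results <- data.frame(
  text = c("Ordinance A text", "Ordinance B text", "Ordinance C text"),
  date = as.Date(c("2024-01-15", "2024-02-03", "2024-03-20")),
  id = c("O-001", "O-002", "O-003"),
  Housing = c(1L, 0L, 1L),
  Finance = c(0L, 1L, 0L),
  Other = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# The 0/1 indicator columns, one per category.
category_cols <- c("Housing", "Finance", "Other")

# Number of documents assigned to each category.
category_counts <- colSums(mock_results[, category_cols])

# Share of documents assigned to each category.
category_share <- category_counts / nrow(mock_results)

print(category_counts)
print(round(category_share, 2))
```

Because the categories are plain 0/1 columns, base R functions such as colSums() are enough for a first look; no extra packages are needed.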
# Resolutions only, between two dates:
results <- classify(
  source = "city_san_francisco",
  doc_type = "resolution",
  since = "2024-06-01",
  until = "2024-12-31",
  n = 200L,
  categories = c("Climate", "Housing", "Transportation",
                 "Police accountability", "Other"),
  api_key = Sys.getenv("OPENAI_API_KEY")
)

If you have policy documents not in the registered sources (state legislation, agency rules, advocacy white papers), pass them as input_data:

results <- classify(
  input_data = df$bill_text,
  document_context = "California state Senate bills, 2024 session",
  categories = c("Housing", "Public Safety", "Education",
                 "Healthcare", "Environment", "Other"),
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini"
)

document_context is cat.pol’s analog of cat.survey’s survey_question: it gives the model framing for the documents being analyzed.

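When supplying your own documents, it may help to clean the text column before passing it as input_data. The sketch below is one cautious way to do that in base R. The df, bill_id, and bill_text names and values are hypothetical, and how classify() itself treats missing or empty strings is not covered here, so dropping them up front is an assumption rather than a requirement.

```r
# Hypothetical bill table; in practice this would be your own data.
df <- data.frame(
  bill_id = c("SB-1", "SB-2", "SB-3", "SB-4"),
  bill_text = c("  An act relating to housing.  ", NA, "",
                "An act relating to schools."),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace (NA values stay NA).
df$bill_text <- trimws(df$bill_text)

# Keep only rows whose text is neither missing nor empty.
keep <- !is.na(df$bill_text) & nzchar(df$bill_text)
clean_df <- df[keep, ]

# clean_df$bill_text is the vector that would be passed as input_data.
print(clean_df$bill_id)
```

Filtering on the data.frame rather than on the bare vector keeps identifiers such as bill_id aligned with the texts, which makes it easier to join results back afterwards.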
Long ordinances (10–20k words) can blow past context limits and cost a lot in tokens. Summarize first, then classify the summaries:
summaries <- summarize(
  source = "city_san_diego",
  doc_type = "ordinance",
  since = "2024-01-01",
  n = 50L,
  format = "paragraph",
  tone = "eli5", # plain-language summary
  api_key = Sys.getenv("OPENAI_API_KEY"),
  user_model = "gpt-4o-mini"
)

results <- classify(
  input_data = summaries$summary,
  categories = c("Housing", "Public Safety", "Finance", "Other"),
  api_key = Sys.getenv("OPENAI_API_KEY")
)

The tone parameter is specific to cat.pol::summarize(); options include "eli5" (plain language), "neutral" (technical), and "academic" (formal). It is useful for downstream readability or for generating press-friendly summaries alongside the analytic classification.

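If only some documents are long, one option is to check lengths first and summarize only those above a cutoff. The sketch below is a rough base R heuristic, not part of cat.pol: the count_words() helper and the 10,000-word threshold are assumptions for illustration, and word counts only approximate token counts.

```r
# Rough word count per document: split on runs of whitespace.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

# Two toy documents: one short, one artificially long.
docs <- c(
  "A brief resolution honoring a retiring city employee.",
  paste(rep("section", 12000), collapse = " ")
)

# Hypothetical cutoff; tune it to your model's context window and budget.
word_threshold <- 10000

word_counts <- count_words(docs)

# TRUE for documents that are candidates for summarize-then-classify.
needs_summary <- word_counts > word_threshold
print(needs_summary)
```

Documents flagged TRUE would go through the summarize step before classification, while the rest could be classified directly.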
vignette("getting-started", "cat.llm")).

Each source= value corresponds to a public HuggingFace dataset that has its own preferred citation.

vignette("getting-started", package = "cat.llm")
?cat.pol::classify, ?cat.pol::extract, ?cat.pol::explore, ?cat.pol::summarize, ?cat.pol::list_sources