Understanding LLM Output: A Practical Guide for Social Scientists
1 Introduction
This lecture provides a hands-on introduction to working with Large Language Models (LLMs) programmatically using R. Rather than interacting with ChatGPT or Claude through a web interface, we will learn to call these models from code—enabling systematic analysis, batch processing, and reproducible research.
1.1 What You Will Learn
By the end of this lecture, you will be able to:
- Call LLMs programmatically from R using the ellmer package
- Evaluate and compare LLM outputs across different models and prompts
- Extract structured data from LLM responses for downstream analysis
1.2 Prerequisites
- Basic familiarity with R and the tidyverse
- API keys for OpenAI and/or Anthropic (we’ll discuss how to obtain these)
- For local models: Ollama installed on your machine (optional but recommended)
1.3 Why Programmatic Access Matters for Research
When you use ChatGPT through the web interface, you’re having a conversation. That’s useful for exploration, but you won’t be able to run systematic tests and analyses of the outputs if you are typing the prompts manually. As researchers, we often need to:
- Process hundreds or thousands of text inputs
- Compare how different models respond to identical prompts
- Ensure reproducibility of our analyses
- Extract structured data (not just free-form text) for statistical analysis
Programmatic access gives us all of this.
2 Setup
2.1 Installing Required Packages
We will use the ellmer package, which provides a unified interface to multiple LLM providers. Install it once with install.packages("ellmer") if you haven't already.
Load the packages we need:
library(tidyverse)
library(ellmer)
2.2 Setting Up API Keys
Before you can call OpenAI or Anthropic models, you need API keys. These are secret tokens that authenticate your requests.
This lecture assumes your ~/.Renviron file already exists and contains your API keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY). If not, follow the instructions below to set them up.
To obtain API keys:
- OpenAI: Visit platform.openai.com and create an API key
- Anthropic (Claude): Visit console.anthropic.com and create an API key
To set your keys in R:
# Run these once per session (or add to your .Renviron file)
Sys.setenv(OPENAI_API_KEY = "your-openai-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key-here")
For persistent storage, add these lines to your .Renviron file (without the Sys.setenv() wrapper) so they load automatically when R starts.
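If you are unsure whether the keys are visible to your current R session, here is a quick check (no API call involved):
# Sys.getenv() returns "" when a variable is not set
nchar(Sys.getenv("OPENAI_API_KEY")) > 0
nchar(Sys.getenv("ANTHROPIC_API_KEY")) > 0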
3 Your First LLM Calls
3.1 Discovering Available Models
Before we start making API calls, it’s useful to know what models are available. The ellmer package provides helper functions to list models from each provider:
# See available OpenAI models
ellmer::models_openai()
# See available Anthropic (Claude) models
ellmer::models_anthropic()
# See available Google Gemini models
ellmer::models_google_gemini()
These functions query the providers’ APIs and return current model names. This is helpful when model names change or new models are released.
3.2 Basic Chat with GPT-5-mini
Let’s start with the simplest possible example: asking a question and getting an answer.
chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat$chat("What is the capital of Germany?")The capital of Germany is Berlin.
That’s it. We created a chat object connected to OpenAI’s GPT-5-mini model, then sent a message and received a response.
3.3 The Role of System Prompts
A system prompt is an instruction that shapes how the model behaves throughout the conversation. It’s like giving the model a persona or a set of ground rules.
Compare these two approaches:
# Without a specific system prompt
chat_default <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_default$chat("Is Munich in France?")No. Munich (German: München) is in southern Germany — it’s the capital of the
state of Bavaria, not in France.
# With a system prompt requesting terse responses
chat_terse <- chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a terse assistant who gives one-word answers to questions."
)
chat_terse$chat("Is Munich in France?")No.
The system prompt dramatically changes the response style. This is powerful: you can instruct the model to be formal, casual, technical, simple, or to adopt specific personas relevant to your research.
3.3.1 Example: A Sarcastic Assistant
chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a rude assistant who gives sarcastic and very short answers."
)$chat("Is Paris in the U.K.?")No — Paris is in France, not the U.K. (Unless you mean Paris, Texas.)
3.4 Conversation History: Context Matters
When you continue chatting with the same chat object, the model remembers the previous exchanges:
chat_terse$chat("Is R a good programming language?")Yes.
chat_terse$chat("Is Stata used by economists?")Yes.
chat_terse$chat("Have I already asked you about R?")Yes.
The model recalls that we asked about R earlier. This is because conversation history has been maintained.
3.4.1 Viewing Conversation History
You can inspect what’s been said so far:
chat_terse$get_turns()
[[1]]
<Turn: user>
Is Munich in France?
[[2]]
<Turn: assistant>
No.
[[3]]
<Turn: user>
Is R a good programming language?
[[4]]
<Turn: assistant>
Yes.
[[5]]
<Turn: user>
Is Stata used by economists?
[[6]]
<Turn: assistant>
Yes.
[[7]]
<Turn: user>
Have I already asked you about R?
[[8]]
<Turn: assistant>
Yes.
Why does this matter? Conversation history affects model responses. In research applications, you typically want each query to be independent (so prior context doesn’t influence results). We’ll address this when we write functions for batch processing.
4 Writing Reusable Functions
When processing many inputs, you don’t want to manually type each query. Instead, we write functions that wrap the API calls.
4.1 A Simple Wrapper Function
Here’s a function that sends a prompt to GPT-5-mini and returns a terse response:
ask5miniTerse <- function(prompt, echo = NULL) {
# Create a fresh chat object for each call (no conversation history carryover)
# Note: Some models (like gpt-5-mini) only support the default temperature
chat <- chat_openai(
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a terse assistant who gives one-word answers to questions.",
echo = echo
)
# Send the prompt and return the response
chat$chat(prompt)
}
Key design decisions:
- Fresh chat object each time: By creating a new chat inside the function, each call is independent—no conversation history leaks between queries.
- Temperature: Temperature controls randomness. Setting it to 0 makes outputs more deterministic (the model picks the most likely response). Note that some models (like GPT-5-mini) only support the default temperature. For models that support it, you can add api_args = list(temperature = 0) to improve reproducibility.
- Echo parameter: Controls whether the conversation is printed to the console during execution. Useful for debugging.
4.2 Testing the Function
ask5miniTerse("What country is Vienna in?")Austria
ask5miniTerse("Is the sky blue?")Yes
4.3 A More Flexible Function Template
Here’s a more general pattern you can adapt for different use cases:
ask_llm <- function(prompt,
model = "gpt-4o-mini",
system_prompt = "You are a helpful assistant.",
temperature = 0,
echo = NULL) {
chat <- chat_openai(
model = model,
api_args = list(temperature = temperature),
system_prompt = system_prompt,
echo = echo
)
chat$chat(prompt)
}
Now you can easily adjust the model, system prompt, or temperature:
# Use as a terse assistant
ask_llm("Is water wet?", system_prompt = "Give one-word answers only.")No.
# Use as a more verbose explainer
ask_llm("Is water wet?", system_prompt = "Explain your reasoning briefly.")The question of whether water is wet can be debated based on definitions. Water
itself is not wet; rather, it has the ability to make other materials wet.
"Wetness" is a property that describes the condition of a surface being covered
in a liquid. Since water is a liquid, it can cause other substances to become
wet, but it does not possess the quality of being wet in the same way that a
surface does when it is in contact with water.
5 Batch Processing with purrr
One of the most powerful applications of programmatic LLM access is processing many inputs at once.
5.1 The map_chr() Pattern
The map_chr() function from purrr applies a function to each element of a vector and returns a character vector of results.
# A set of questions we want to process
questions <- c(
"Is the Earth round?",
"Is water wet?",
"Do fish swim?",
"Can birds fly?"
)
# Process all questions
answers <- map_chr(questions, ~ ask5miniTerse(.x))
Yes
No.
Yes
Generally.
# View results
answers
[1] "Yes" "No." "Yes" "Generally."
The ~ ask5miniTerse(.x) syntax is a shorthand for an anonymous function. The .x represents “the current element being processed.”
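If the formula shorthand is new to you, these three calls are equivalent (each one would re-send all four questions to the API):
# purrr formula shorthand
map_chr(questions, ~ ask5miniTerse(.x))
# base R anonymous-function shorthand (R >= 4.1)
map_chr(questions, \(q) ask5miniTerse(q))
# explicit anonymous function
map_chr(questions, function(q) ask5miniTerse(q))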
5.2 Storing Results in a Tibble
It’s good practice to keep questions and answers together:
results <- tibble(
question = questions,
answer = answers
)
results
# A tibble: 4 × 2
question answer
<chr> <chr>
1 Is the Earth round? Yes
2 Is water wet? No.
3 Do fish swim? Yes
4 Can birds fly? Generally.
This creates a structured dataset that you can analyze, export, or extend.
5.3 Why Not Just Use a Loop?
You could write a for loop instead:
# This works but is more verbose
answers_loop <- character(length(questions))
for (i in seq_along(questions)) {
answers_loop[i] <- ask5miniTerse(questions[i])
}
The map_chr() approach is more concise and idiomatic in the tidyverse. It also makes it clearer that you’re applying the same operation to each element.
6 Prompt Sensitivity — Does Wording Matter?
A crucial question for research: Do small changes in how we phrase a prompt affect the model’s response?
Let’s test this with health advice scenarios:
fever_prompts <- c(
"I have a very mild fever, should I go to the doctor?",
"I have a very mild fever, would you recommend seeing a doctor?",
"I have a very mild fever, is that serious?",
"I have a very mild fever, is that OK?",
"I have a very mild fever, but that's not serious, correct?",
"I have a very mild fever, but that could still be serious, correct?"
)
Notice the variations:
- Items 1-2: Direct questions about seeking medical care
- Items 3-4: Asking about severity (opposite framings: “serious?” vs “OK?”)
- Items 5-6: Leading questions that suggest an expected answer (testing for “sycophancy” or pandering)
fever_answers <- map_chr(fever_prompts, ~ ask5miniTerse(.x))
Monitor
Depends
Usually.
Monitor
Sometimes
Possibly.
fever_results <- tibble(
prompt = fever_prompts,
response = fever_answers
)
fever_results
# A tibble: 6 × 2
prompt response
<chr> <chr>
1 I have a very mild fever, should I go to the doctor? Monitor
2 I have a very mild fever, would you recommend seeing a doctor? Depends
3 I have a very mild fever, is that serious? Usually.
4 I have a very mild fever, is that OK? Monitor
5 I have a very mild fever, but that's not serious, correct? Sometimes
6 I have a very mild fever, but that could still be serious, correct? Possibly.
Discussion questions:
- Do prompts 3 and 4 produce semantically opposite answers (as the questions suggest)?
- Do the leading questions (5-6) cause the model to agree with the implied answer?
- What are the implications for using LLMs in research involving subjective assessments?
If models are sensitive to prompt framing, researchers must carefully design and pre-register their prompts. Small wording changes could systematically bias results.
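One way to probe this empirically is to repeat each prompt several times and tabulate the answers. A minimal sketch, reusing ask5miniTerse() and fever_prompts from above (each repetition is a separate, billable API call):
# Three independent calls per prompt, then a frequency table of answers
sensitivity_check <- map_dfr(fever_prompts, function(p) {
  tibble(
    prompt   = p,
    response = map_chr(1:3, ~ ask5miniTerse(p))
  )
})

sensitivity_check %>%
  count(prompt, response)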
7 Open-Weight Models and (Potential) Local Deployment
So far we’ve used OpenAI’s API, which means our queries go to OpenAI’s servers. Open-weight models offer an alternative: you can download and run them on your own computer. DeepSeek is a popular and impressive open-weight model (but you probably won’t be able to run its largest version locally, so I want to show you a few ways to access it).
7.1 Why Use Local Models?
- Privacy: Your data never leaves your machine
- Cost: No per-query API charges (just your electricity)
- Availability: Works offline
- Reproducibility: You control the exact model version
7.2 Three Ways to Run DeepSeek
7.2.1 Option 1: DeepSeek API
DeepSeek offers an API similar to OpenAI:
# First set your API key
Sys.setenv(DEEPSEEK_API_KEY = "your-deepseek-key")
# Then call the model
chat_deepseek(model = "deepseek-chat")$chat("Ni hao ma?")
7.2.2 Option 2: OpenRouter (Multi-Model Gateway)
OpenRouter provides access to many models through a single API:
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")$chat("Hello!")
7.2.3 Option 3: Ollama (Run Models Locally)
Ollama lets you download and run models on your laptop or desktop.
Setup:
- Download Ollama from ollama.com/download
- In your terminal, run: ollama run llama3.2 (this downloads the model)
- Now you can call it from R:
llama <- chat_ollama(
model = "llama3.2",
system_prompt = "Make your response extremely terse."
)
llama$chat("Is using social media good for me?")7.3 Comparing Local vs. API Models
Let’s compare how different models respond to the same question:
# Local model (requires Ollama)
llama <- chat_ollama(model = "llama3.2", system_prompt = "Be very brief.")
llama_response <- llama$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")
# API model (DeepSeek)
ds <- chat_deepseek(model = "deepseek-chat", system_prompt = "Be very brief.")
ds_response <- ds$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")
# Compare
tibble(
model = c("Llama 3.2 (local)", "DeepSeek (API)"),
response = c(llama_response, ds_response)
)
Different models may give substantially different answers to the same question. Even the same model may vary across runs. Always test consistency!
8 Comparing Multiple Models
For robust research, you often want to compare outputs across different LLMs.
8.1 Setting Up Claude
Anthropic’s Claude models are another major option. Here’s how to set up a function for Claude:
ask_claude_terse <- function(prompt,
system_prompt = "You are a terse assistant who gives one-word answers.",
model = "claude-3-5-haiku-20241022",
temperature = 0,
echo = NULL) {
chat <- chat_claude(
model = model,
api_args = list(temperature = temperature),
system_prompt = system_prompt,
echo = echo
)
chat$chat(prompt)
}
# Test it
ask_claude_terse("What continent is Brazil on?")
South America
8.2 Running the Same Queries on Multiple Models
Now let’s compare GPT and Claude on identical prompts:
political_statements <- c(
"The corrupt elites look down on us",
"Taxes are immoral",
"Taxes are necessary",
"Taxes are a necessary evil",
"Censorship is always immoral",
"Social media posts containing threats should be deleted",
"We only have one planet",
"The government should provide free healthcare to all",
"The government should provide free healthcare to those who take care of themselves"
)
# Create prompts asking about ideology
ideology_prompts <- map_chr(
political_statements,
~ paste("If a person expressed the following sentiment, are they more likely to be left-wing or right-wing?", shQuote(.x))
)
# Get responses from both models
gpt_responses <- map_chr(ideology_prompts, ~ ask5miniTerse(.x, echo = "none"))
claude_responses <- map_chr(ideology_prompts, ~ ask_claude_terse(.x, echo = "none"))
# Compare
comparison <- tibble(
statement = political_statements,
GPT = gpt_responses,
Claude = claude_responses
)
# Display a scrollable kable for easier browsing if there are many statements
library(kableExtra)
comparison %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
scroll_box(width = "100%", height = "300px")
| statement | GPT | Claude |
|---|---|---|
| The corrupt elites look down on us | Both | Right-wing |
| Taxes are immoral | Right | Right-wing |
| Taxes are necessary | Left | Left-wing |
| Taxes are a necessary evil | Right-wing | Centrist |
| Censorship is always immoral | Right-wing | Libertarian |
| Social media posts containing threats should be deleted | Ambiguous | Right-wing |
| We only have one planet | Left | Left-wing |
| The government should provide free healthcare to all | Left | Left-wing |
| The government should provide free healthcare to those who take care of themselves | Right | Right-wing |
8.3 Visualizing Model Agreement
A tile chart provides a quick visual comparison of how different models classify the same statements:
Code
# Reshape to long format for ggplot
comparison_long <- comparison %>%
mutate(statement_id = row_number()) %>%
pivot_longer(
cols = c(GPT, Claude),
names_to = "model",
values_to = "classification"
) %>%
# Normalize classification labels (e.g., "Left-wing" -> "Left", "Right-wing" -> "Right")
mutate(
classification_clean = case_when(
str_detect(tolower(classification), "left") ~ "Left",
str_detect(tolower(classification), "right") ~ "Right",
str_detect(tolower(classification), "center|moderate") ~ "Center",
TRUE ~ "Other"
),
statement_short = str_trunc(statement, 30)
)
# Create tile chart
ggplot(comparison_long, aes(x = factor(statement_id), y = model, fill = classification_clean)) +
geom_tile(color = "white", linewidth = 0.5) +
scale_fill_manual(
values = c(
"Left" = "#3B82F6",
"Right" = "#EF4444",
"Center" = "#A855F7",
"Other" = "#6B7280"
),
na.value = "#9CA3AF"
) +
scale_x_discrete(
labels = comparison_long %>%
distinct(statement_id, statement_short) %>%
arrange(statement_id) %>%
pull(statement_short)
) +
labs(
title = "Model Classifications of Political Statements",
x = "Statement",
y = "Model",
fill = "Classification"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
panel.grid = element_blank(),
legend.position = "bottom"
)
This visualization makes it easy to spot:
- Agreement: Where both models show the same color
- Disagreement: Where colors differ between rows
- Patterns: Whether one model tends to classify statements differently than another
Key insight: Do the models agree? Where do they disagree, and why might that be?
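You can also turn the visual comparison into a single number. A quick sketch using the comparison_long data frame built above:
# Share of statements where both models land on the same normalized label
comparison_long %>%
  group_by(statement_id) %>%
  summarise(agree = n_distinct(classification_clean) == 1) %>%
  summarise(agreement_rate = mean(agree))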
9 Structured Output — Beyond Free Text
Free-form text responses are useful, but for quantitative analysis, we often need structured data: specific fields with defined types. This section shows how to extract structured output from both OpenAI and Claude models.
9.1 Defining Output Structure
The ellmer package uses type_* functions to specify the structure you want:
# Define what we want the model to return
ideology_schema <- type_object(
"Ideology analysis of a text statement",
is_political = type_boolean("Is this statement about politics?"),
ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
confidence = type_number("Confidence score from 0.0 to 1.0")
)
This tells the model: “I want you to return an object with three fields: a boolean, a string, and a number.”
9.2 Extracting Structured Data with OpenAI
Use the chat object’s extract_data() method (newer ellmer versions call this chat_structured()) to get structured output:
chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
# Extract structured data from a statement
result <- chat$extract_data(
"Taxes are theft and the government wastes our money",
type = ideology_schema
)
result
$is_political
[1] TRUE
$ideology
[1] "right"
$confidence
[1] 0.9
Now result is a list with named fields you can access directly:
result$ideology
[1] "right"
result$confidence
[1] 0.9
9.3 Creating a Structured Analysis Function
Let’s wrap this in a reusable function:
analyze_ideology <- function(text,
model = "gpt-5-mini-2025-08-07",
system_prompt = "You are a political analyst.") {
schema <- type_object(
"Ideology analysis",
is_political = type_boolean("Is this about politics?"),
ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
left_score = type_number("Left-wing score from 0.0 to 1.0"),
right_score = type_number("Right-wing score from 0.0 to 1.0")
)
chat <- chat_openai(
model = model,
system_prompt = system_prompt
)
chat$extract_data(text, type = schema)
}analyze_ideology("The minimum wage should be raised to help workers")$is_political
[1] TRUE
$ideology
[1] "left"
$left_score
[1] 0.9
$right_score
[1] 0.1
9.4 Batch Structured Analysis
Process multiple texts and combine into a data frame:
# Analyze all political statements
structured_results <- map(political_statements, analyze_ideology)
# Convert to tibble
ideology_df <- tibble(
statement = political_statements,
is_political = map_lgl(structured_results, "is_political"),
ideology = map_chr(structured_results, "ideology"),
left_score = map_dbl(structured_results, "left_score"),
right_score = map_dbl(structured_results, "right_score")
)
ideology_df
# A tibble: 9 × 5
statement is_political ideology left_score right_score
<chr> <lgl> <chr> <dbl> <dbl>
1 The corrupt elites look down on … TRUE unclear 0.5 0.5
2 Taxes are immoral TRUE right 0.05 0.95
3 Taxes are necessary TRUE center 0.6 0.2
4 Taxes are a necessary evil TRUE right 0.25 0.75
5 Censorship is always immoral TRUE right 0.1 0.9
6 Social media posts containing th… TRUE center 0.5 0.5
7 We only have one planet TRUE left 0.75 0.15
8 The government should provide fr… TRUE left 0.9 0.1
9 The government should provide fr… TRUE right 0.3 0.7
Now you have a proper dataset ready for statistical analysis!
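For example, here is a quick descriptive check on the scores (purely illustrative):
# Do the left and right scores behave roughly as complements?
ideology_df %>%
  mutate(score_sum = left_score + right_score) %>%
  summarise(
    mean_sum       = mean(score_sum),               # near 1 if the scores are complementary
    cor_left_right = cor(left_score, right_score)   # strongly negative if complementary
  )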
9.5 Structured Output with Claude
The same structured output approach works with Claude. Here we’ll also demonstrate extracting an array of topics:
analyze_text_claude <- function(text,
model = "claude-3-5-haiku-20241022",
temperature = 0) {
schema <- type_object(
"Text analysis",
is_political = type_boolean("Is this text about politics?"),
topics = type_array(
items = type_string("A topic mentioned in the text"),
description = "Array of topics covered in the text"
),
ideology = type_string("Ideological leaning: 'left', 'right', or 'none'"),
persuasiveness = type_number("How persuasive is this? 0.0 to 1.0")
)
chat <- chat_claude(
model = model,
api_args = list(temperature = temperature),
system_prompt = "You are a terse assistant with deep knowledge about politics."
)
chat$extract_data(text, type = schema)
}analyze_text_claude("Trump is good for America")$is_political
[1] TRUE
$topics
[1] "Donald Trump" "American politics"
[3] "Presidential leadership"
$ideology
[1] "right"
$persuasiveness
[1] 0.4
analyze_text_claude("The weather in Miami is great but climate change is a threat")$is_political
[1] TRUE
$topics
[1] "climate change" "environmental policy"
$ideology
[1] "left"
$persuasiveness
[1] 0.6
Notice how the model identifies multiple topics and distinguishes political from non-political content within the same text.
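As with the OpenAI example, you can map this function over many texts. A sketch reusing political_statements from Section 8 (each call hits the Anthropic API):
# Run the Claude analyzer over all statements and pull out a few fields
claude_structured <- map(political_statements, analyze_text_claude)

tibble(
  statement = political_statements,
  ideology  = map_chr(claude_structured, "ideology"),
  n_topics  = map_int(claude_structured, ~ length(.x$topics))  # number of topics Claude listed
)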
10 Consistency and Reliability
A critical concern for research: Are LLM outputs consistent across repeated runs?
10.1 Testing Consistency
Let’s run the same query multiple times:
# Function to query Llama and reset conversation each time
run_consistency_test <- function(prompt, n_runs = 5, model = "llama3.2") {
llama <- chat_ollama(model = model, system_prompt = "Be terse.")
results <- map_chr(1:n_runs, function(i) {
llama$set_turns(NULL) # Clear conversation history
llama$chat(prompt)
})
tibble(
run = 1:n_runs,
response = results
)
}
# Test with a subjective question
consistency_results <- run_consistency_test(
"Is America a force for good in the world?"
)
consistency_results
Questions to consider:
- How much do responses vary?
- Is the variation meaningful (different content) or superficial (different wording)?
- How should we account for this in research design?
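A simple way to quantify the first question is to count distinct answers and the share of runs that produced the most common one. A minimal sketch:
# 1 unique answer = perfectly consistent; share_modal = share of runs giving the modal answer
consistency_results %>%
  summarise(
    n_unique    = n_distinct(response),
    share_modal = max(table(response)) / n()
  )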
10.2 Reducing Variability
Setting temperature = 0 reduces but doesn’t eliminate variability:
llama <- chat_ollama(
model = "llama3.2",
api_args = list(temperature = 0),
system_prompt = "Be terse."
)
Even with temperature = 0, some models may produce slightly different outputs due to internal randomness.
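To see how much residual variability remains, you can repeat the same temperature-0 query a few times, creating a fresh chat object per run so no history carries over. A sketch, assuming Ollama is running locally:
# Three independent runs of the same prompt at temperature 0
responses_t0 <- map_chr(1:3, function(i) {
  chat_ollama(
    model = "llama3.2",
    api_args = list(temperature = 0),
    system_prompt = "Be terse."
  )$chat("Is America a force for good in the world?")
})

n_distinct(responses_t0)  # 1 means the runs were literally identical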
11 Best Practices for Prompting
Research from OpenAI and Anthropic provides guidance on writing effective prompts.
11.1 OpenAI’s Recommendations
11.1.1 Be Specific and Detailed
Include relevant details in your query to get more relevant answers:
# Vague
ask_llm("Summarize this text")
# Specific
ask_llm("Summarize this text in 2-3 sentences, focusing on the main argument and any policy recommendations")11.1.2 Use Delimiters
Clearly separate different parts of your input:
prompt <- "
Analyze the following text for political ideology.
<text>
Taxes are necessary to fund public services that benefit everyone.
</text>
Respond with: LEFT, RIGHT, or CENTER
"11.1.3 Specify Output Format
Tell the model exactly what format you want:
ask_llm("List the three main points. Format as a numbered list.")11.1.4 Ask for Chain-of-Thought Reasoning
For complex tasks, asking the model to explain its reasoning can improve accuracy:
ask_llm("Classify this statement as left or right wing. First, explain your reasoning step by step, then give your final answer.")11.2 Anthropic’s Recommendations
11.2.1 Define Success Criteria First
Before prompt engineering, have:
- A clear definition of success criteria for your use case
- Ways to empirically test against those criteria
- A baseline prompt to improve upon
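In practice, "empirically test" can be as simple as scoring model output against a handful of hand-coded labels. A minimal sketch, reusing ask5miniTerse() and two statements from Section 8 (the labels here are illustrative, not authoritative):
# A tiny validation set with hand-coded "ground truth" labels
validation <- tibble(
  text  = c("Taxes are immoral",
            "The government should provide free healthcare to all"),
  label = c("right", "left")
)

validation %>%
  mutate(
    model_answer = map_chr(text, ~ ask5miniTerse(
      paste("Is someone saying this more likely left-wing or right-wing?",
            "Answer 'left' or 'right':", shQuote(.x))
    )),
    correct = str_detect(tolower(model_answer), label)  # lenient match, so "Right." counts
  ) %>%
  summarise(accuracy = mean(correct))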
12 Applied Example — Full Workflow
Let’s put it all together with a complete analysis workflow.
12.1 Research Question
How do different LLMs classify the ideology of political statements?
12.2 Step 1: Define Your Inputs
statements <- c(
"The corrupt elites look down on us",
"Taxes are immoral",
"Taxes are necessary",
"Taxes are a necessary evil",
"Censorship is always immoral",
"Social media posts containing threats should be deleted",
"Climate change is the greatest threat we face",
"The free market produces the best outcomes"
)
12.3 Step 2: Define Expected Classifications
Before running the model, record your expectations (this is like pre-registration):
expectations <- tibble(
statement = statements,
expected = c(
"contextual", # Populist rhetoric used by both sides
"right", # Anti-tax sentiment
"left", # Pro-government services
"ambiguous", # Acknowledges necessity but frames as evil
"contextual", # Historically left, now used by right too
"left", # Pro-moderation
"left", # Environmental concern
"right" # Free market ideology
)
)
12.4 Step 3: Create Analysis Function
classify_ideology <- function(text, model_fn, model_name) {
prompt <- paste(
"Classify the ideology of someone who would say:",
shQuote(text),
"\nRespond with exactly one word: LEFT, RIGHT, or CENTER"
)
response <- model_fn(prompt, echo = "none")
tibble(
statement = text,
model = model_name,
classification = response
)
}
12.5 Step 4: Run Analysis Across Models
# Collect results from both models
results_gpt <- map_dfr(statements, ~ classify_ideology(.x, ask5miniTerse, "GPT-5-mini"))
results_claude <- map_dfr(statements, ~ classify_ideology(.x, ask_claude_terse, "Claude-Haiku"))
# Combine
all_results <- bind_rows(results_gpt, results_claude) %>%
pivot_wider(names_from = model, values_from = classification)
all_results
# A tibble: 8 × 3
statement `GPT-5-mini` `Claude-Haiku`
<chr> <ellmr_tp> <ellmr_tp>
1 The corrupt elites look down on us Sorry — I c… RIGHT
2 Taxes are immoral RIGHT … RIGHT
3 Taxes are necessary CENTER … CENTER
4 Taxes are a necessary evil RIGHT … CENTER
5 Censorship is always immoral RIGHT … LEFT
6 Social media posts containing threats should be d… CENTER … LEFT
7 Climate change is the greatest threat we face LEFT … LEFT
8 The free market produces the best outcomes RIGHT … RIGHT
12.6 Step 5: Compare with Expectations
final_analysis <- left_join(all_results, expectations, by = "statement")
final_analysis
# A tibble: 8 × 4
statement `GPT-5-mini` `Claude-Haiku` expected
<chr> <ellmr_tp> <ellmr_tp> <chr>
1 The corrupt elites look down on us Sorry — I c… RIGHT context…
2 Taxes are immoral RIGHT … RIGHT right
3 Taxes are necessary CENTER … CENTER left
4 Taxes are a necessary evil RIGHT … CENTER ambiguo…
5 Censorship is always immoral RIGHT … LEFT context…
6 Social media posts containing threats sh… CENTER … LEFT left
7 Climate change is the greatest threat we… LEFT … LEFT left
8 The free market produces the best outcom… RIGHT … RIGHT right
12.7 Step 6: Calculate Agreement
# Do models agree with each other?
final_analysis %>%
mutate(models_agree = `GPT-5-mini` == `Claude-Haiku`) %>%
summarise(
agreement_rate = mean(models_agree),
n_agree = sum(models_agree),
n_total = n()
)
# A tibble: 1 × 3
agreement_rate n_agree n_total
<dbl> <int> <int>
1 0.5 4 8
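Beyond inter-model agreement, you can also score each model against the pre-registered expectations. A sketch that drops the "contextual" and "ambiguous" items (which have no single correct label) and assumes the classifications coerce cleanly to character:
# Lenient matching so "RIGHT" counts as "right", and so on
final_analysis %>%
  filter(expected %in% c("left", "right", "center")) %>%
  mutate(
    gpt_correct    = str_detect(tolower(as.character(`GPT-5-mini`)), expected),
    claude_correct = str_detect(tolower(as.character(`Claude-Haiku`)), expected)
  ) %>%
  summarise(
    gpt_accuracy    = mean(gpt_correct),
    claude_accuracy = mean(claude_correct)
  )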
13 Conclusion and Next Steps
13.1 What We Covered
- Programmatic LLM access using the ellmer package
- System prompts to control model behavior
- Batch processing with purrr::map_chr()
- Prompt sensitivity and its implications for research
- Open-weight models via Ollama for local deployment
- Multi-model comparison for robustness
- Structured output for quantitative analysis
- Best practices from OpenAI and Anthropic
13.2 Key Takeaways for Researchers
Prompts matter: Small wording changes can affect results. Pre-register your prompts.
Test consistency: Run the same query multiple times. Report variability.
Compare models: Don’t rely on a single model. Cross-validate with alternatives.
Use structured output: When you need data for analysis, specify the structure explicitly.
Document everything: Record model versions, temperatures, and system prompts for reproducibility.
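A lightweight way to do this is to store the run settings alongside your results. A sketch (the field names are just suggestions):
# Record what was run, with what settings, and under which package/R versions
run_metadata <- tibble(
  timestamp      = Sys.time(),
  model          = "gpt-5-mini-2025-08-07",
  system_prompt  = "You are a terse assistant who gives one-word answers to questions.",
  temperature    = NA_real_,   # default; gpt-5-mini does not accept a custom value
  ellmer_version = as.character(packageVersion("ellmer")),
  r_version      = R.version.string
)
# e.g. readr::write_csv(run_metadata, "run_metadata.csv") next to the results file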
13.3 Suggested Reading
13.4 Exercises
- Modify the analyze_ideology() function to also extract “topics” as an array
- Run a consistency test: Query the same prompt 10 times and calculate the proportion of identical responses
- Compare GPT-5-mini, Claude-Haiku, and a local Llama model on 20 statements of your choosing
- Create a structured output schema for a different domain (e.g., sentiment analysis, factuality assessment)
14 Appendix: Quick Reference
14.1 Model Initialization
Some examples:
# OpenAI
chat_openai(model = "gpt-5-mini-2025-08-07")
chat_openai(model = "gpt-4o-mini")
chat_openai(model = "gpt-4o")
# Anthropic/Claude
chat_claude(model = "claude-3-5-haiku-20241022")
chat_claude(model = "claude-3-5-sonnet-20241022")
# Local (Ollama)
chat_ollama(model = "llama3.2")
chat_ollama(model = "deepseek-r1:8b")
# DeepSeek API
chat_deepseek(model = "deepseek-chat")
# OpenRouter (multiple models)
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")
# You'll also find Grok, Mistral, etc.
14.2 Common Parameters
chat_openai(
model = "gpt-4o-mini", # Model to use
system_prompt = "Be brief", # Behavior instructions
api_args = list(temperature = 0), # 0 = deterministic, 1 = creative
echo = "all" # Print conversation to console
)
14.3 Structured Output Types
# Boolean
type_boolean("Is this about politics?")
# String
type_string("The main topic")
# Number
type_number("Confidence score from 0 to 1")
# Array (note: items first, then description)
type_array(items = type_string("A topic"), description = "List of topics")
# Object (combine multiple fields)
type_object(
"Description of the object",
field1 = type_boolean("..."),
field2 = type_string("..."),
field3 = type_number("...")
)
14.4 Batch Processing Pattern
# Process multiple inputs
results <- map_chr(inputs, ~ my_function(.x))
# Store with inputs
tibble(input = inputs, output = results)
# For structured output, use map() then extract fields
structured <- map(inputs, ~ extract_structured(.x))
tibble(
input = inputs,
field1 = map_lgl(structured, "field1"),
field2 = map_chr(structured, "field2")
)