Understanding LLM Output: A Practical Guide for Social Scientists

Author: Jan Zilinsky

Published: January 1, 2025


1 Introduction

This lecture provides a hands-on introduction to working with Large Language Models (LLMs) programmatically using R. Rather than interacting with ChatGPT or Claude through a web interface, we will learn to call these models from code—enabling systematic analysis, batch processing, and reproducible research.

1.1 What You Will Learn

By the end of this lecture, you will be able to:

  1. Call LLMs programmatically from R using the ellmer package
  2. Evaluate and compare LLM outputs across different models and prompts
  3. Extract structured data from LLM responses for downstream analysis

1.2 Prerequisites

  • Basic familiarity with R and the tidyverse
  • API keys for OpenAI and/or Anthropic (we’ll discuss how to obtain these)
  • For local models: Ollama installed on your machine (optional but recommended)

1.3 Why Programmatic Access Matters for Research

When you use ChatGPT through the web interface, you’re having a conversation. That’s useful for exploration, but typing prompts by hand doesn’t scale to systematic tests and analyses of the outputs. As researchers, we often need to:

  • Process hundreds or thousands of text inputs
  • Compare how different models respond to identical prompts
  • Ensure reproducibility of our analyses
  • Extract structured data (not just free-form text) for statistical analysis

Programmatic access gives us all of this.


2 Setup

2.1 Installing Required Packages

We will use the ellmer package, which provides a unified interface to multiple LLM providers.

install.packages("ellmer")

Load the packages we need:

library(tidyverse)
library(ellmer)

2.2 Setting Up API Keys

Before you can call OpenAI or Anthropic models, you need API keys. These are secret tokens that authenticate your requests.

Assumption: API Keys Already Configured

This lecture assumes your ~/.Renviron file already exists and contains your API keys (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY). If not, follow the instructions below to set them up.

To obtain API keys, create an account and generate a key on OpenAI’s developer platform (platform.openai.com) and/or in Anthropic’s console (console.anthropic.com).

To set your keys in R:

# Run these once per session (or add to your .Renviron file)
Sys.setenv(OPENAI_API_KEY = "your-openai-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-anthropic-key-here")
Tip

For persistent storage, add these lines to your .Renviron file (without the Sys.setenv() wrapper) so they load automatically when R starts.
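
For reference, the corresponding lines in ~/.Renviron look like this (the values are placeholders for your actual keys):

OPENAI_API_KEY=your-openai-key-here
ANTHROPIC_API_KEY=your-anthropic-key-here

Restart R (or run readRenviron("~/.Renviron")) so the new values are picked up.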


3 Your First LLM Calls

3.1 Discovering Available Models

Before we start making API calls, it’s useful to know what models are available. The ellmer package provides helper functions to list models from each provider:

# See available OpenAI models
ellmer::models_openai()

# See available Anthropic (Claude) models
ellmer::models_anthropic()

# See available Google Gemini models
ellmer::models_google_gemini()

These functions query the providers’ APIs and return current model names. This is helpful when model names change or new models are released.

3.2 Basic Chat with GPT-5-mini

Let’s start with the simplest possible example: asking a question and getting an answer.

chat <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat$chat("What is the capital of Germany?")
The capital of Germany is Berlin.

That’s it. We created a chat object connected to OpenAI’s GPT-5-mini model, then sent a message and received a response.

3.3 The Role of System Prompts

A system prompt is an instruction that shapes how the model behaves throughout the conversation. It’s like giving the model a persona or a set of ground rules.

Compare these two approaches:

# Without a specific system prompt
chat_default <- chat_openai(model = "gpt-5-mini-2025-08-07")
chat_default$chat("Is Munich in France?")
No. Munich (German: München) is in southern Germany — it’s the capital of the 
state of Bavaria, not in France.
# With a system prompt requesting terse responses
chat_terse <- chat_openai(
  model = "gpt-5-mini-2025-08-07",
  system_prompt = "You are a terse assistant who gives one-word answers to questions."
)
chat_terse$chat("Is Munich in France?")
No.

The system prompt dramatically changes the response style. This is powerful: you can instruct the model to be formal, casual, technical, simple, or to adopt specific personas relevant to your research.

3.3.1 Example: A Sarcastic Assistant

chat_openai(
  model = "gpt-5-mini-2025-08-07",
  system_prompt = "You are a rude assistant who gives sarcastic and very short answers."
)$chat("Is Paris in the U.K.?")
No — Paris is in France, not the U.K. (Unless you mean Paris, Texas.)

3.4 Conversation History: Context Matters

When you continue chatting with the same chat object, the model remembers the previous exchanges:

chat_terse$chat("Is R a good programming language?")
Yes.
chat_terse$chat("Is Stata used by economists?")
Yes.
chat_terse$chat("Have I already asked you about R?")
Yes.

The model recalls that we asked about R earlier. This is because conversation history has been maintained.

3.4.1 Viewing Conversation History

You can inspect what’s been said so far:

chat_terse$get_turns()
[[1]]
<Turn: user>
Is Munich in France?

[[2]]
<Turn: assistant>
No.

[[3]]
<Turn: user>
Is R a good programming language?

[[4]]
<Turn: assistant>
Yes.

[[5]]
<Turn: user>
Is Stata used by economists?

[[6]]
<Turn: assistant>
Yes.

[[7]]
<Turn: user>
Have I already asked you about R?

[[8]]
<Turn: assistant>
Yes.

Why does this matter? Conversation history affects model responses. In research applications, you typically want each query to be independent (so prior context doesn’t influence results). We’ll address this when we write functions for batch processing.
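
If you prefer to keep a single chat object but discard its memory, ellmer also lets you clear the accumulated turns (the same method is used in the consistency test in Section 10):

# Wipe the conversation history of an existing chat object
chat_terse$set_turns(NULL)
chat_terse$get_turns()  # now empty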


4 Writing Reusable Functions

When processing many inputs, you don’t want to manually type each query. Instead, we write functions that wrap the API calls.

4.1 A Simple Wrapper Function

Here’s a function that sends a prompt to GPT-5-mini and returns a terse response:

ask5miniTerse <- function(prompt, echo = NULL) {
  # Create a fresh chat object for each call (no conversation history carryover)
  # Note: Some models (like gpt-5-mini) only support the default temperature
  chat <- chat_openai(
    model = "gpt-5-mini-2025-08-07",
    system_prompt = "You are a terse assistant who gives one-word answers to questions.",
    echo = echo
  )
  
  # Send the prompt and return the response
  chat$chat(prompt)
}

Key design decisions:

  1. Fresh chat object each time: By creating a new chat inside the function, each call is independent—no conversation history leaks between queries.

  2. Temperature: Temperature controls randomness. Setting it to 0 makes outputs more deterministic (the model picks the most likely response). Note that some models (like GPT-5-mini) only support the default temperature. For models that support it, you can add api_args = list(temperature = 0) to improve reproducibility.

  3. Echo parameter: Controls whether the conversation is printed to the console during execution. Useful for debugging.

4.2 Testing the Function

ask5miniTerse("What country is Vienna in?")
Austria
ask5miniTerse("Is the sky blue?")
Yes

4.3 A More Flexible Function Template

Here’s a more general pattern you can adapt for different use cases:

ask_llm <- function(prompt,
                    model = "gpt-4o-mini",
                    system_prompt = "You are a helpful assistant.",
                    temperature = 0,
                    echo = NULL) {
  
  chat <- chat_openai(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = system_prompt,
    echo = echo
  )
  
  chat$chat(prompt)
}

Now you can easily adjust the model, system prompt, or temperature:

# Use as a terse assistant
ask_llm("Is water wet?", system_prompt = "Give one-word answers only.")
No.
# Use as a more verbose explainer
ask_llm("Is water wet?", system_prompt = "Explain your reasoning briefly.")
The question of whether water is wet can be debated based on definitions. Water
itself is not wet; rather, it has the ability to make other materials wet. 
"Wetness" is a property that describes the condition of a surface being covered
in a liquid. Since water is a liquid, it can cause other substances to become 
wet, but it does not possess the quality of being wet in the same way that a 
surface does when it is in contact with water.

5 Batch Processing with purrr

One of the most powerful applications of programmatic LLM access is processing many inputs at once.

5.1 The map_chr() Pattern

The map_chr() function from purrr applies a function to each element of a vector and returns a character vector of results.

# A set of questions we want to process
questions <- c(
  "Is the Earth round?",
  "Is water wet?",
  "Do fish swim?",
  "Can birds fly?"
)

# Process all questions
answers <- map_chr(questions, ~ ask5miniTerse(.x))
Yes
No.
Yes
Generally.
# View results
answers
[1] "Yes"        "No."        "Yes"        "Generally."

The ~ ask5miniTerse(.x) syntax is a shorthand for an anonymous function. The .x represents “the current element being processed.”
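
For reference, the same mapping can be written with an explicit anonymous function, or by passing the function directly (the prompt is its first argument):

# Equivalent ways of writing the same call
map_chr(questions, function(q) ask5miniTerse(q))
map_chr(questions, ask5miniTerse)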

5.2 Storing Results in a Tibble

It’s good practice to keep questions and answers together:

results <- tibble(
  question = questions,
  answer = answers
)

results
# A tibble: 4 × 2
  question            answer    
  <chr>               <chr>     
1 Is the Earth round? Yes       
2 Is water wet?       No.       
3 Do fish swim?       Yes       
4 Can birds fly?      Generally.

This creates a structured dataset that you can analyze, export, or extend.
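
For example, you can export it for later analysis or sharing (the file name here is just illustrative):

write_csv(results, "llm_answers.csv")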

5.3 Why Not Just Use a Loop?

You could write a for loop instead:

# This works but is more verbose
answers_loop <- character(length(questions))
for (i in seq_along(questions)) {
  answers_loop[i] <- ask5miniTerse(questions[i])
}

The map_chr() approach is more concise and idiomatic in the tidyverse. It also makes it clearer that you’re applying the same operation to each element.


6 Prompt Sensitivity — Does Wording Matter?

A crucial question for research: Do small changes in how we phrase a prompt affect the model’s response?

Let’s test this with health advice scenarios:

fever_prompts <- c(
  "I have a very mild fever, should I go to the doctor?",
  "I have a very mild fever, would you recommend seeing a doctor?",
  "I have a very mild fever, is that serious?",
  "I have a very mild fever, is that OK?",
  "I have a very mild fever, but that's not serious, correct?",
  "I have a very mild fever, but that could still be serious, correct?"
)

Notice the variations:

  • Items 1-2: Direct questions about seeking medical care
  • Items 3-4: Asking about severity (opposite framings: “serious?” vs “OK?”)
  • Items 5-6: Leading questions that suggest an expected answer (testing for “sycophancy” or pandering)
fever_answers <- map_chr(fever_prompts, ~ ask5miniTerse(.x))
Monitor
Depends
Usually.
Monitor
Sometimes
Possibly.
fever_results <- tibble(
  prompt = fever_prompts,
  response = fever_answers
)

fever_results
# A tibble: 6 × 2
  prompt                                                              response 
  <chr>                                                               <chr>    
1 I have a very mild fever, should I go to the doctor?                Monitor  
2 I have a very mild fever, would you recommend seeing a doctor?      Depends  
3 I have a very mild fever, is that serious?                          Usually. 
4 I have a very mild fever, is that OK?                               Monitor  
5 I have a very mild fever, but that's not serious, correct?          Sometimes
6 I have a very mild fever, but that could still be serious, correct? Possibly.

Discussion questions:

  • Do prompts 3 and 4 produce semantically opposite answers (as the questions suggest)?
  • Do the leading questions (5-6) cause the model to agree with the implied answer?
  • What are the implications for using LLMs in research involving subjective assessments?
Research Implication

If models are sensitive to prompt framing, researchers must carefully design and pre-register their prompts. Small wording changes could systematically bias results.


7 Open-Weight Models and (Potential) Local Deployment

So far we’ve used OpenAI’s API, which means our queries go to OpenAI’s servers. Open-weight models offer an alternative: you can download and run them on your own computer. DeepSeek is a popular and impressive open-weight model (but you probably won’t be able to run its largest version locally, so I want to show you a few ways to access it).

7.1 Why Use Local Models?

  1. Privacy: Your data never leaves your machine
  2. Cost: No per-query API charges (just your electricity)
  3. Availability: Works offline
  4. Reproducibility: You control the exact model version

7.2 Three Ways to Run DeepSeek

7.2.1 Option 1: DeepSeek API

DeepSeek offers an API similar to OpenAI:

# First set your API key
Sys.setenv(DEEPSEEK_API_KEY = "your-deepseek-key")

# Then call the model
chat_deepseek(model = "deepseek-chat")$chat("Ni hao ma?")

7.2.2 Option 2: OpenRouter (Multi-Model Gateway)

OpenRouter provides access to many models through a single API:

chat_openrouter(model = "deepseek/deepseek-chat-v3.1")$chat("Hello!")

7.2.3 Option 3: Ollama (Run Models Locally)

Ollama lets you download and run models on your laptop or desktop.

Setup:

  1. Download Ollama from ollama.com/download
  2. In your terminal, run: ollama run llama3.2 (this downloads the model)
  3. Now you can call it from R:
llama <- chat_ollama(
  model = "llama3.2",
  system_prompt = "Make your response extremely terse."
)

llama$chat("Is using social media good for me?")

7.3 Comparing Local vs. API Models

Let’s compare how different models respond to the same question:

# Local model (requires Ollama)
llama <- chat_ollama(model = "llama3.2", system_prompt = "Be very brief.")
llama_response <- llama$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")

# API model (DeepSeek)
ds <- chat_deepseek(model = "deepseek-chat", system_prompt = "Be very brief.")
ds_response <- ds$chat("Rank these apps by potential harm: TikTok, Facebook, WeChat")

# Compare
tibble(
  model = c("Llama 3.2 (local)", "DeepSeek (API)"),
  response = c(llama_response, ds_response)
)
Model Variability

Different models may give substantially different answers to the same question. Even the same model may vary across runs. Always test consistency!


8 Comparing Multiple Models

For robust research, you often want to compare outputs across different LLMs.

8.1 Setting Up Claude

Anthropic’s Claude models are another major option. Here’s how to set up a function for Claude:

ask_claude_terse <- function(prompt,
                             system_prompt = "You are a terse assistant who gives one-word answers.",
                             model = "claude-3-5-haiku-20241022",
                             temperature = 0,
                             echo = NULL) {
  
  chat <- chat_claude(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = system_prompt,
    echo = echo
  )
  
  chat$chat(prompt)
}
# Test it
ask_claude_terse("What continent is Brazil on?")
South America

8.2 Running the Same Queries on Multiple Models

Now let’s compare GPT and Claude on identical prompts:

political_statements <- c(
  "The corrupt elites look down on us",
  "Taxes are immoral",
  "Taxes are necessary",
  "Taxes are a necessary evil",
  "Censorship is always immoral",
  "Social media posts containing threats should be deleted",
  "We only have one planet",
  "The government should provide free healthcare to all",
  "The government should provide free healthcare to those who take care of themselves"
)

# Create prompts asking about ideology
ideology_prompts <- map_chr(
  political_statements,
  ~ paste("If a person expressed the following sentiment, are they more likely to be left-wing or right-wing?", shQuote(.x))
)

# Get responses from both models
gpt_responses <- map_chr(ideology_prompts, ~ ask5miniTerse(.x, echo = "none"))
claude_responses <- map_chr(ideology_prompts, ~ ask_claude_terse(.x, echo = "none"))

# Compare
comparison <- tibble(
  statement = political_statements,
  GPT = gpt_responses,
  Claude = claude_responses
)

# Display a scrollable kable for easier browsing if there are many statements
library(kableExtra)

comparison %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  scroll_box(width = "100%", height = "300px")
statement                                                                           GPT         Claude
The corrupt elites look down on us                                                  Both        Right-wing
Taxes are immoral                                                                   Right       Right-wing
Taxes are necessary                                                                 Left        Left-wing
Taxes are a necessary evil                                                          Right-wing  Centrist
Censorship is always immoral                                                        Right-wing  Libertarian
Social media posts containing threats should be deleted                             Ambiguous   Right-wing
We only have one planet                                                             Left        Left-wing
The government should provide free healthcare to all                               Left        Left-wing
The government should provide free healthcare to those who take care of themselves  Right       Right-wing

8.3 Visualizing Model Agreement

A tile chart provides a quick visual comparison of how different models classify the same statements:

# Reshape to long format for ggplot
comparison_long <- comparison %>%
  mutate(statement_id = row_number()) %>%
  pivot_longer(
    cols = c(GPT, Claude),
    names_to = "model",
    values_to = "classification"
  ) %>%
  # Normalize classification labels (e.g., "Left-wing" -> "Left", "Right-wing" -> "Right")
  mutate(
    classification_clean = case_when(
      str_detect(tolower(classification), "left") ~ "Left",
      str_detect(tolower(classification), "right") ~ "Right",
      str_detect(tolower(classification), "center|moderate") ~ "Center",
      TRUE ~ "Other"
    ),
    statement_short = str_trunc(statement, 30)
  )

# Create tile chart
ggplot(comparison_long, aes(x = factor(statement_id), y = model, fill = classification_clean)) +
  geom_tile(color = "white", linewidth = 0.5) +
  scale_fill_manual(
    values = c(
      "Left" = "#3B82F6",
      "Right" = "#EF4444",
      "Center" = "#A855F7",
      "Other" = "#6B7280"
    ),
    na.value = "#9CA3AF"
  ) +
  scale_x_discrete(
    labels = comparison_long %>% 
      distinct(statement_id, statement_short) %>% 
      arrange(statement_id) %>% 
      pull(statement_short)
  ) +
  labs(
    title = "Model Classifications of Political Statements",
    x = "Statement",
    y = "Model",
    fill = "Classification"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
    panel.grid = element_blank(),
    legend.position = "bottom"
  )

This visualization makes it easy to spot:

  • Agreement: Where both models show the same color
  • Disagreement: Where colors differ between rows
  • Patterns: Whether one model tends to classify statements differently than another

Key insight: Do the models agree? Where do they disagree, and why might that be?
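
One quick way to put a number on agreement, using the normalized labels created in the plotting code above:

# Proportion of statements where GPT and Claude receive the same normalized label
comparison_long %>%
  select(statement_id, model, classification_clean) %>%
  pivot_wider(names_from = model, values_from = classification_clean) %>%
  summarise(agreement_rate = mean(GPT == Claude))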


9 Structured Output — Beyond Free Text

Free-form text responses are useful, but for quantitative analysis, we often need structured data: specific fields with defined types. This section shows how to extract structured output from both OpenAI and Claude models.

9.1 Defining Output Structure

The ellmer package uses type_* functions to specify the structure you want:

# Define what we want the model to return
ideology_schema <- type_object(
  "Ideology analysis of a text statement",
  is_political = type_boolean("Is this statement about politics?"),
  ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
  confidence = type_number("Confidence score from 0.0 to 1.0")
)

This tells the model: “I want you to return an object with three fields: a boolean, a string, and a number.”

9.2 Extracting Structured Data with OpenAI

Use extract_data() or chat_structured() to get structured output:

chat <- chat_openai(model = "gpt-5-mini-2025-08-07")

# Extract structured data from a statement
result <- chat$extract_data(
  "Taxes are theft and the government wastes our money",
  type = ideology_schema
)

result
$is_political
[1] TRUE

$ideology
[1] "right"

$confidence
[1] 0.9

Now result is a list with named fields you can access directly:

result$ideology
[1] "right"
result$confidence
[1] 0.9

9.3 Creating a Structured Analysis Function

Let’s wrap this in a reusable function:

analyze_ideology <- function(text,
                             model = "gpt-5-mini-2025-08-07",
                             system_prompt = "You are a political analyst.") {
  
  schema <- type_object(
    "Ideology analysis",
    is_political = type_boolean("Is this about politics?"),
    ideology = type_string("Most likely ideology: 'left', 'right', 'center', or 'unclear'"),
    left_score = type_number("Left-wing score from 0.0 to 1.0"),
    right_score = type_number("Right-wing score from 0.0 to 1.0")
  )
  
  chat <- chat_openai(
    model = model,
    system_prompt = system_prompt
  )
  
  chat$extract_data(text, type = schema)
}
analyze_ideology("The minimum wage should be raised to help workers")
$is_political
[1] TRUE

$ideology
[1] "left"

$left_score
[1] 0.9

$right_score
[1] 0.1

9.4 Batch Structured Analysis

Process multiple texts and combine into a data frame:

# Analyze all political statements
structured_results <- map(political_statements, analyze_ideology)

# Convert to tibble
ideology_df <- tibble(
  statement = political_statements,
  is_political = map_lgl(structured_results, "is_political"),
  ideology = map_chr(structured_results, "ideology"),
  left_score = map_dbl(structured_results, "left_score"),
  right_score = map_dbl(structured_results, "right_score")
)

ideology_df
# A tibble: 9 × 5
  statement                         is_political ideology left_score right_score
  <chr>                             <lgl>        <chr>         <dbl>       <dbl>
1 The corrupt elites look down on … TRUE         unclear        0.5         0.5 
2 Taxes are immoral                 TRUE         right          0.05        0.95
3 Taxes are necessary               TRUE         center         0.6         0.2 
4 Taxes are a necessary evil        TRUE         right          0.25        0.75
5 Censorship is always immoral      TRUE         right          0.1         0.9 
6 Social media posts containing th… TRUE         center         0.5         0.5 
7 We only have one planet           TRUE         left           0.75        0.15
8 The government should provide fr… TRUE         left           0.9         0.1 
9 The government should provide fr… TRUE         right          0.3         0.7 

Now you have a proper dataset ready for statistical analysis!
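
As a small illustration (the derived net_right column below is not part of the lecture’s output), you could rank statements by their net lean:

# Net right-leaning score per statement, most right-leaning first
ideology_df %>%
  mutate(net_right = right_score - left_score) %>%
  arrange(desc(net_right))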

9.5 Structured Output with Claude

The same structured output approach works with Claude. Here we’ll also demonstrate extracting an array of topics:

analyze_text_claude <- function(text,
                                model = "claude-3-5-haiku-20241022",
                                temperature = 0) {
  
  schema <- type_object(
    "Text analysis",
    is_political = type_boolean("Is this text about politics?"),
    topics = type_array(
      items = type_string("A topic mentioned in the text"),
      description = "Array of topics covered in the text"
    ),
    ideology = type_string("Ideological leaning: 'left', 'right', or 'none'"),
    persuasiveness = type_number("How persuasive is this? 0.0 to 1.0")
  )
  
  chat <- chat_claude(
    model = model,
    api_args = list(temperature = temperature),
    system_prompt = "You are a terse assistant with deep knowledge about politics."
  )
  
  chat$extract_data(text, type = schema)
}
analyze_text_claude("Trump is good for America")
$is_political
[1] TRUE

$topics
[1] "Donald Trump"            "American politics"      
[3] "Presidential leadership"

$ideology
[1] "right"

$persuasiveness
[1] 0.4
analyze_text_claude("The weather in Miami is great but climate change is a threat")
$is_political
[1] TRUE

$topics
[1] "climate change"       "environmental policy"

$ideology
[1] "left"

$persuasiveness
[1] 0.6

Notice how the model identifies multiple topics and distinguishes political from non-political content within the same text.


10 Consistency and Reliability

A critical concern for research: Are LLM outputs consistent across repeated runs?

10.1 Testing Consistency

Let’s run the same query multiple times:

# Function to query Llama and reset conversation each time
run_consistency_test <- function(prompt, n_runs = 5, model = "llama3.2") {
  
  llama <- chat_ollama(model = model, system_prompt = "Be terse.")
  
  results <- map_chr(1:n_runs, function(i) {
    llama$set_turns(NULL)  # Clear conversation history
    llama$chat(prompt)
  })
  
  tibble(
    run = 1:n_runs,
    response = results
  )
}

# Test with a subjective question
consistency_results <- run_consistency_test(
  "Is America a force for good in the world?"
)

consistency_results

Questions to consider:

  • How much do responses vary?
  • Is the variation meaningful (different content) or superficial (different wording)?
  • How should we account for this in research design?

10.2 Reducing Variability

Setting temperature = 0 reduces but doesn’t eliminate variability:

llama <- chat_ollama(
  model = "llama3.2",
  api_args = list(temperature = 0),
  system_prompt = "Be terse."
)

Even with temperature = 0, some models may produce slightly different outputs due to internal randomness.
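
Assuming the consistency_results tibble from above, a minimal way to quantify this variability is the share of runs that returned each distinct string:

# How often does the exact same response string come back?
consistency_results %>%
  count(response, sort = TRUE) %>%
  mutate(share = n / sum(n))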


11 Best Practices for Prompting

Research from OpenAI and Anthropic provides guidance on writing effective prompts.

11.1 OpenAI’s Recommendations

11.1.1 Be Specific and Detailed

Include relevant details in your query to get more relevant answers:

# Vague
ask_llm("Summarize this text")

# Specific
ask_llm("Summarize this text in 2-3 sentences, focusing on the main argument and any policy recommendations")

11.1.2 Use Delimiters

Clearly separate different parts of your input:

prompt <- "
Analyze the following text for political ideology.

<text>
Taxes are necessary to fund public services that benefit everyone.
</text>

Respond with: LEFT, RIGHT, or CENTER
"

11.1.3 Specify Output Format

Tell the model exactly what format you want:

ask_llm("List the three main points. Format as a numbered list.")

11.1.4 Ask for Chain-of-Thought Reasoning

For complex tasks, asking the model to explain its reasoning can improve accuracy:

ask_llm("Classify this statement as left or right wing. First, explain your reasoning step by step, then give your final answer.")

11.2 Anthropic’s Recommendations

11.2.1 Use XML Tags

Claude responds particularly well to XML-structured prompts:

prompt <- "
<instructions>
Analyze the text for political ideology.
</instructions>

<text>
The free market always produces the best outcomes.
</text>

<output_format>
Respond with a single word: LEFT, RIGHT, or CENTER
</output_format>
"

ask_claude_terse(prompt, system_prompt = "Follow the instructions precisely.")

11.2.2 Define Success Criteria First

Before prompt engineering, have:

  1. A clear definition of success criteria for your use case
  2. Ways to empirically test against those criteria
  3. A baseline prompt to improve upon

12 Applied Example — Full Workflow

Let’s put it all together with a complete analysis workflow.

12.1 Research Question

How do different LLMs classify the ideology of political statements?

12.2 Step 1: Define Your Inputs

statements <- c(
  "The corrupt elites look down on us",
  "Taxes are immoral",
  "Taxes are necessary",
  "Taxes are a necessary evil",
  "Censorship is always immoral",
  "Social media posts containing threats should be deleted",
  "Climate change is the greatest threat we face",
  "The free market produces the best outcomes"
)

12.3 Step 2: Define Expected Classifications

Before running the model, record your expectations (this is like pre-registration):

expectations <- tibble(
  statement = statements,
  expected = c(
    "contextual",  # Populist rhetoric used by both sides
    "right",       # Anti-tax sentiment
    "left",        # Pro-government services
    "ambiguous",   # Acknowledges necessity but frames as evil
    "contextual",  # Historically left, now used by right too
    "left",        # Pro-moderation
    "left",        # Environmental concern
    "right"        # Free market ideology
  )
)

12.4 Step 3: Create Analysis Function

classify_ideology <- function(text, model_fn, model_name) {
  
  prompt <- paste(
    "Classify the ideology of someone who would say:",
    shQuote(text),
    "\nRespond with exactly one word: LEFT, RIGHT, or CENTER"
  )
  
  response <- model_fn(prompt, echo = "none")
  
  tibble(
    statement = text,
    model = model_name,
    classification = response
  )
}

12.5 Step 4: Run Analysis Across Models

# Collect results from both models
results_gpt <- map_dfr(statements, ~ classify_ideology(.x, ask5miniTerse, "GPT-5-mini"))
results_claude <- map_dfr(statements, ~ classify_ideology(.x, ask_claude_terse, "Claude-Haiku"))

# Combine
all_results <- bind_rows(results_gpt, results_claude) %>%
  pivot_wider(names_from = model, values_from = classification)

all_results
# A tibble: 8 × 3
  statement                                          `GPT-5-mini` `Claude-Haiku`
  <chr>                                              <ellmr_tp>   <ellmr_tp>    
1 The corrupt elites look down on us                 Sorry — I c… RIGHT         
2 Taxes are immoral                                  RIGHT      … RIGHT         
3 Taxes are necessary                                CENTER     … CENTER        
4 Taxes are a necessary evil                         RIGHT      … CENTER        
5 Censorship is always immoral                       RIGHT      … LEFT          
6 Social media posts containing threats should be d… CENTER     … LEFT          
7 Climate change is the greatest threat we face      LEFT       … LEFT          
8 The free market produces the best outcomes         RIGHT      … RIGHT         

12.6 Step 5: Compare with Expectations

final_analysis <- left_join(all_results, expectations, by = "statement")

final_analysis
# A tibble: 8 × 4
  statement                                 `GPT-5-mini` `Claude-Haiku` expected
  <chr>                                     <ellmr_tp>   <ellmr_tp>     <chr>   
1 The corrupt elites look down on us        Sorry — I c… RIGHT          context…
2 Taxes are immoral                         RIGHT      … RIGHT          right   
3 Taxes are necessary                       CENTER     … CENTER         left    
4 Taxes are a necessary evil                RIGHT      … CENTER         ambiguo…
5 Censorship is always immoral              RIGHT      … LEFT           context…
6 Social media posts containing threats sh… CENTER     … LEFT           left    
7 Climate change is the greatest threat we… LEFT       … LEFT           left    
8 The free market produces the best outcom… RIGHT      … RIGHT          right   

12.7 Step 6: Calculate Agreement

# Do models agree with each other?
final_analysis %>%
  mutate(models_agree = `GPT-5-mini` == `Claude-Haiku`) %>%
  summarise(
    agreement_rate = mean(models_agree),
    n_agree = sum(models_agree),
    n_total = n()
  )
# A tibble: 1 × 3
  agreement_rate n_agree n_total
           <dbl>   <int>   <int>
1            0.5       4       8
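
You could extend this with a rough check against the pre-registered expectations. The sketch below is illustrative only: it uses a simple case-insensitive string match, so the "contextual" and "ambiguous" expectations can never count as matches, and as.character() is used defensively in case the response columns are not plain character vectors:

# Rough, illustrative comparison with the expected labels
final_analysis %>%
  mutate(
    gpt_matches    = str_detect(tolower(as.character(`GPT-5-mini`)), expected),
    claude_matches = str_detect(tolower(as.character(`Claude-Haiku`)), expected)
  ) %>%
  summarise(
    gpt_match_rate    = mean(gpt_matches),
    claude_match_rate = mean(claude_matches)
  )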

13 Conclusion and Next Steps

13.1 What We Covered

  1. Programmatic LLM access using the ellmer package
  2. System prompts to control model behavior
  3. Batch processing with purrr::map_chr()
  4. Prompt sensitivity and its implications for research
  5. Open-weight models via Ollama for local deployment
  6. Multi-model comparison for robustness
  7. Structured output for quantitative analysis
  8. Best practices from OpenAI and Anthropic

13.2 Key Takeaways for Researchers

  1. Prompts matter: Small wording changes can affect results. Pre-register your prompts.

  2. Test consistency: Run the same query multiple times. Report variability.

  3. Compare models: Don’t rely on a single model. Cross-validate with alternatives.

  4. Use structured output: When you need data for analysis, specify the structure explicitly.

  5. Document everything: Record model versions, temperatures, and system prompts for reproducibility.


13.3 Exercises

  1. Modify the analyze_ideology() function to also extract “topics” as an array
  2. Run a consistency test: Query the same prompt 10 times and calculate the proportion of identical responses
  3. Compare GPT-5-mini, Claude-Haiku, and a local Llama model on 20 statements of your choosing
  4. Create a structured output schema for a different domain (e.g., sentiment analysis, factuality assessment)

14 Appendix: Quick Reference

14.1 Model Initialization

Some examples:

# OpenAI
chat_openai(model = "gpt-5-mini-2025-08-07")
chat_openai(model = "gpt-4o-mini")
chat_openai(model = "gpt-4o")

# Anthropic/Claude  
chat_claude(model = "claude-3-5-haiku-20241022")
chat_claude(model = "claude-3-5-sonnet-20241022")

# Local (Ollama)
chat_ollama(model = "llama3.2")
chat_ollama(model = "deepseek-r1:8b")

# DeepSeek API
chat_deepseek(model = "deepseek-chat")

# OpenRouter (multiple models)
chat_openrouter(model = "deepseek/deepseek-chat-v3.1")
# You'll also find Grok, Mistral, etc.

14.2 Common Parameters

chat_openai(
  model = "gpt-4o-mini",           # Model to use
  system_prompt = "Be brief",     # Behavior instructions
  api_args = list(temperature = 0), # 0 = deterministic, 1 = creative
  echo = "all"                      # Print conversation to console
)

14.3 Structured Output Types

# Boolean
type_boolean("Is this about politics?")

# String
type_string("The main topic")

# Number
type_number("Confidence score from 0 to 1")

# Array (note: items first, then description)
type_array(items = type_string("A topic"), description = "List of topics")

# Object (combine multiple fields)
type_object(
  "Description of the object",
  field1 = type_boolean("..."),
  field2 = type_string("..."),
  field3 = type_number("...")
)

14.4 Batch Processing Pattern

# Process multiple inputs
results <- map_chr(inputs, ~ my_function(.x))

# Store with inputs
tibble(input = inputs, output = results)

# For structured output, use map() then extract fields
structured <- map(inputs, ~ extract_structured(.x))
tibble(
  input = inputs,
  field1 = map_lgl(structured, "field1"),
  field2 = map_chr(structured, "field2")
)