Large Language Models and Survey Research: A Recap

11/25/2024

Soubhik Barari, NORC, and Joshua Lerner, NORC

Generative AI is changing the world—and with it, the field of survey research. As AI tools such as large language models (LLMs) weave their way into survey practice, the need grows to equip researchers with the theoretical understanding and practical skills to leverage these tools effectively. That’s exactly what we set out to do in our recent AAPOR short course. Here are a few high-level takeaways from that course.

Effectively applying large language models requires prompt engineering (yes, even if you’re not an engineer).

At its core, the interface with LLMs relies on text-based prompts, which can be categorized as either system-level or user-level. System-level prompts define the model’s overarching behavior or constraints, while user-level prompts guide specific interactions or tasks. Beyond these foundational interactions, various applications built on LLMs—such as chat agents, sandboxes, and programmatic APIs—offer researchers user-friendly tools and options for advanced customization. In our course, we demonstrated how these tools can be harnessed effectively, showing workflows for both asynchronous use cases (e.g., analyzing data post-collection) and synchronous use cases (e.g., embedding real-time LLM outputs into survey instruments). Our research at NORC has shown that system-level prompts often carry more weight than user-level prompts for tasks like classification and synthetic response generation. In short, decisions about how to prompt, rather than just what to prompt, are critical.
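As a simple illustration (not the exact workflow from our course), the sketch below shows how a system-level and a user-level prompt might be combined in a programmatic call. It assumes the OpenAI Python client; the model name, label set, and coding task are purely hypothetical.

```python
# Minimal sketch of system- vs. user-level prompting (assumes the OpenAI
# Python client; the model name and prompts are illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a survey coding assistant. Classify each open-ended response "
    "into exactly one of: ECONOMY, HEALTHCARE, IMMIGRATION, OTHER. "
    "Reply with the label only."
)

def code_response(open_end: str) -> str:
    """Send one open-ended survey response to the model for classification."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative model choice
        temperature=0,         # damp run-to-run variation
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # overarching behavior
            {"role": "user", "content": open_end},         # task-specific input
        ],
    )
    return completion.choices[0].message.content.strip()

print(code_response("Prices at the grocery store keep going up every month."))
```

Setting the temperature to zero is one small way to make behavior more repeatable, a theme we return to below.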

Survey researchers: ignore prompt engineering at your own peril.

Large language models have applications at every stage of survey research.

LLMs have applications at every stage of the survey research process, from upstream decisions like question development, survey design, data collection, and interviewing to downstream tasks such as response processing, data augmentation, analysis, and weighting.

Put differently, generative AI is an extremely versatile technology, capable of reducing every component of total survey error: sampling error through better processing of address-based sampling frames, measurement error through improved respondent comprehension and recall, and nonresponse error through more engaging, interactive respondent experiences, among others.

That said, generative AI is equally capable of increasing those same errors if researchers are not careful – which brings us to our next point …

Validate, validate, validate.

It’s tempting to assume that the work of validating generative AI technology has already been done – after all, millions (perhaps billions!) of dollars have been collectively spent building, training, fine-tuning, and validating LLMs to ensure their performance across many kinds of benchmark tasks. That assumption would be mistaken: a model may ace the ‘elementary school’ tests that standard benchmarks represent, but your application may demand a ‘high school’ level of vocabulary – something you may need to teach the model yourself.

We emphasize that small-scale, task-specific tests of an LLM’s performance – sometimes called downstream model evaluation – are far more relevant than the generalized benchmarks for abstract inference or computational tasks that vendors may promote. Methods for demonstrating validity, such as cross-validation and active learning, are well established in natural language processing (NLP) and machine learning (ML).
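As a hypothetical sketch of what such a downstream evaluation might look like, the snippet below compares model-generated labels against a human-coded gold-standard subsample using scikit-learn; the labels and acceptance threshold are illustrative, not recommendations.

```python
# Minimal sketch of a downstream, task-specific evaluation: compare LLM-assigned
# labels against a human-coded gold standard (all labels here are hypothetical).
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = ["ECONOMY", "HEALTHCARE", "OTHER", "ECONOMY", "IMMIGRATION"]
llm_labels   = ["ECONOMY", "HEALTHCARE", "ECONOMY", "ECONOMY", "IMMIGRATION"]

accuracy = accuracy_score(human_labels, llm_labels)
kappa = cohen_kappa_score(human_labels, llm_labels)  # agreement beyond chance

print(f"Accuracy: {accuracy:.2f}  Cohen's kappa: {kappa:.2f}")

# Illustrative acceptance rule: only scale up to the full corpus if the model
# clears a threshold agreed upon before the evaluation was run.
if kappa < 0.70:
    print("Below threshold: revise the prompt or fine-tune before deploying.")
```

In practice, the gold-standard subsample should be drawn to reflect the full corpus, and the acceptance threshold should be set before the model is run at scale.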

While validation involves demonstrating acceptable performance at a single point in time, reliability requires demonstrating that performance holds steady across time and applications. This distinction is crucial because of two inherent characteristics of LLMs in practice. First, their outputs are generated probabilistically, meaning they can vary from run to run, even when strict system-level prompts are applied. Researchers may need to standardize outputs through additional processing steps or the use of output templates. Second, proprietary models are often subject to ongoing pre-training and fine-tuning, which can change their behavior unpredictably. One solution is to adopt open-source models and develop on-premises sandboxes where performance can be monitored and controlled.
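To make that concrete, here is one possible (and deliberately simple) way to standardize outputs against a fixed label set and check test-retest stability across repeated calls; the label set, the NEEDS_REVIEW flag, and the coding function being wrapped are all assumptions for illustration.

```python
# Minimal sketch of output standardization plus a test-retest reliability check.
# The label set, the NEEDS_REVIEW flag, and the code_fn argument (e.g., the
# code_response() helper sketched earlier) are illustrative assumptions.
from collections import Counter
from typing import Callable

VALID_LABELS = {"ECONOMY", "HEALTHCARE", "IMMIGRATION", "OTHER"}

def standardize(raw_output: str) -> str:
    """Map a raw completion onto the fixed label set, or flag it for human review."""
    label = raw_output.strip().upper().rstrip(".")
    return label if label in VALID_LABELS else "NEEDS_REVIEW"

def test_retest(code_fn: Callable[[str], str], open_end: str, runs: int = 5) -> float:
    """Share of repeated calls returning the modal label (1.0 = fully stable)."""
    labels = [standardize(code_fn(open_end)) for _ in range(runs)]
    _, modal_count = Counter(labels).most_common(1)[0]
    return modal_count / runs

# Example usage (plugging in the earlier code_response helper):
# stability = test_retest(code_response, "Prices keep going up every month.")
```

Logging the prompt, the model version, and stability scores like these alongside the data also makes it easier to notice when a vendor-side model update has shifted behavior.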

The future of generative AI in survey research: cautiously optimistic.

Many innovations are on the horizon, including multimode and multimedia survey designs, which some survey software vendors already support. At the same time, many developers of large language models are placing a greater emphasis on built-in safety guardrails and increasing transparency around their system-level prompts, making these tools more appealing for research.

This growing transparency is critical. The more researchers understand the models they use, the more feasible reproducibility becomes—fostering collaboration and scholarly exchange. In regulated environments, such as federal statistical agencies, transparency is particularly essential for ensuring reliability and safeguarding personally identifiable information (PII).

As we have previously written, the future of generative AI in surveys remains multifaceted, with no single dominant paradigm emerging yet. While these technologies hold great promise, they should not replace the expertise of trained survey methodologists and social scientists, whose technical skills are grounded in human insight and years of practical experience. We need extensive experimentation and rigorous testing to use these tools effectively and understand their biases.

Our course aims to provide some of the foundational training and practical tools to support this endeavor. We look forward to engaging with the AAPOR community to continue the conversation—and spark new ones—at the 2025 AAPOR conference and beyond.