Paradata to Shed Light on Data Quality Issues in Online Surveys

04/12/2023

By: Mario Callegaro, Google

Paradata, more recently also called survey logs or log data, can be extremely powerful for detecting and remediating data quality issues in online surveys, especially in the pretest phase. Not all paradata that we can collect tell us about data quality; this article highlights only the types most closely related to it.

When looking at all kinds of data quality paradata that can be collected, we can group them into three categories:

Email invitation delivery status
Questionnaire completion paradata
Interaction with the questionnaire paradata

In the first case, and depending heavily on the technology used (the survey platform's mailer), email invitation delivery paradata can tell us about the quality of an invitation through the email open rate. The survey invitation click rate is another useful statistic: it shows how many respondents were interested enough to open the survey URL.

Bounce backs tell us about the quality of the email list.

Undeliverable emails are more complex to debug, but they tell us the specific email addresses and/or email domains to which the invitation could not be sent.
Similar statistics can be obtained for text message invitations.
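
As a rough illustration of the invitation statistics described above, the following Python sketch computes bounce, open, and click rates from a delivery log. The log layout, the field names, and the status labels are assumptions made for this example; each survey platform exports this information differently, and the choice of denominator (delivered rather than sent invitations) is also just one possible convention.

```python
from collections import Counter

# Hypothetical delivery log: one record per invitation, holding the furthest
# status that invitation reached. Field names and labels are illustrative.
invitations = [
    {"email": "a@example.com", "status": "clicked"},
    {"email": "b@example.com", "status": "opened"},
    {"email": "c@example.org", "status": "delivered"},
    {"email": "d@example.org", "status": "bounced"},
    {"email": "e@example.net", "status": "clicked"},
]


def delivery_rates(records):
    """Return bounce, open, and click rates for a delivery log."""
    counts = Counter(r["status"] for r in records)
    sent = len(records)
    delivered = sent - counts["bounced"]
    opened = counts["opened"] + counts["clicked"]  # a click implies an open
    return {
        "bounce_rate": counts["bounced"] / sent if sent else 0.0,
        "open_rate": opened / delivered if delivered else 0.0,
        "click_rate": counts["clicked"] / delivered if delivered else 0.0,
    }


def bounced_domains(records):
    """Count bounced invitations per email domain."""
    return Counter(
        r["email"].split("@")[-1] for r in records if r["status"] == "bounced"
    )


rates = delivery_rates(invitations)
domains = bounced_domains(invitations)
```

Grouping bounces by domain, as in `bounced_domains`, is one simple way to see whether delivery problems are concentrated in a few email providers.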

Questionnaire completion paradata such as breakoff rates are a strong indication of major issues in a questionnaire. For example, if a specific question leads to a very high breakoff rate, then more investigation is warranted on that question.
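
A minimal sketch of a per-question breakoff rate is shown below. It assumes the platform records, for each respondent, the last question seen and whether the questionnaire was completed; the record layout and names are hypothetical. The rate is defined here as the number of breakoffs at a question divided by the number of respondents who reached it, which is one common convention among several.

```python
from collections import Counter

# Assumed questionnaire order and session records (illustrative only).
QUESTION_ORDER = ["q1", "q2", "q3", "q4"]

sessions = [
    {"id": 1, "last_seen": "q4", "completed": True},
    {"id": 2, "last_seen": "q2", "completed": False},
    {"id": 3, "last_seen": "q2", "completed": False},
    {"id": 4, "last_seen": "q4", "completed": True},
    {"id": 5, "last_seen": "q3", "completed": False},
]


def breakoff_by_question(sessions, order):
    """Breakoffs at each question divided by respondents who reached it."""
    position = {q: i for i, q in enumerate(order)}
    breakoffs = Counter(s["last_seen"] for s in sessions if not s["completed"])
    rates = {}
    for q in order:
        # A respondent reached q if their last seen question is q or later.
        reached = sum(
            1 for s in sessions if position[s["last_seen"]] >= position[q]
        )
        rates[q] = breakoffs[q] / reached if reached else 0.0
    return rates


breakoff_rates = breakoff_by_question(sessions, QUESTION_ORDER)
```

In this made-up data, q2 stands out because two of the five respondents who reached it abandoned the survey there.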

Median questionnaire completion time is important to measure, as it provides an indication of effort and can flag surveys that are too long. The more granular median time per question or per screen can help detect issues such as speeding, which can result in low data quality.
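
The timing measures above can be sketched as follows, assuming the platform exports the seconds each respondent spent on each question. The data layout and the speeding rule (total time below 30% of the median total time) are assumptions for illustration, not a recommended threshold.

```python
from statistics import median

# Hypothetical timing paradata: seconds spent per question, per respondent.
timings = {
    "r1": {"q1": 12.0, "q2": 20.0, "q3": 15.0},
    "r2": {"q1": 10.0, "q2": 18.0, "q3": 14.0},
    "r3": {"q1": 2.0, "q2": 3.0, "q3": 2.5},
    "r4": {"q1": 11.0, "q2": 22.0, "q3": 16.0},
}

# Assumed cutoff: flag respondents faster than 30% of the median total time.
SPEEDING_FRACTION = 0.3


def median_time_per_question(timings):
    """Median seconds per question across respondents who saw it."""
    questions = sorted({q for t in timings.values() for q in t})
    return {
        q: median(t[q] for t in timings.values() if q in t) for q in questions
    }


def flag_speeders(timings, fraction=SPEEDING_FRACTION):
    """Return respondents whose total time is below fraction * median total."""
    totals = {r: sum(t.values()) for r, t in timings.items()}
    cutoff = fraction * median(totals.values())
    return sorted(r for r, total in totals.items() if total < cutoff)


per_question = median_time_per_question(timings)
speeders = flag_speeders(timings)
```

The median is used rather than the mean because completion times are usually skewed by respondents who leave the survey open and return later.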

Interaction with the questionnaire can tell us about the quality of each single question.

Changes to a selected response option, for example, can tell us if the response options for a question are potentially confusing. The incidence of prompts and error messages can tell us when respondents had trouble answering an item. Similarly, item non-response or skip rates are an indication of potential problems with an item (e.g., the item is too sensitive).
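
These interaction indicators can be summarized per question from an event log. The sketch below assumes a log of (respondent, question, event) tuples with the hypothetical event labels "answer_selected", "error_shown", and "skipped"; real platforms use their own names and may need custom scripts to capture such events at all.

```python
from collections import defaultdict

# Hypothetical client-side event log: (respondent, question, event type).
events = [
    ("r1", "q1", "answer_selected"),
    ("r1", "q1", "answer_selected"),  # second selection = answer change
    ("r1", "q2", "answer_selected"),
    ("r2", "q1", "answer_selected"),
    ("r2", "q2", "error_shown"),
    ("r2", "q2", "answer_selected"),
    ("r3", "q1", "answer_selected"),
    ("r3", "q1", "answer_selected"),
    ("r3", "q2", "skipped"),
]


def question_indicators(events):
    """Share of respondents per question who changed answers, saw an error,
    or skipped the item."""
    selections = defaultdict(int)  # (respondent, question) -> selections made
    errors = defaultdict(set)      # question -> respondents shown an error
    skips = defaultdict(set)       # question -> respondents who skipped
    seen = defaultdict(set)        # question -> respondents with any event
    for respondent, question, event in events:
        seen[question].add(respondent)
        if event == "answer_selected":
            selections[(respondent, question)] += 1
        elif event == "error_shown":
            errors[question].add(respondent)
        elif event == "skipped":
            skips[question].add(respondent)
    result = {}
    for question, respondents in sorted(seen.items()):
        n = len(respondents)
        changed = sum(1 for r in respondents if selections[(r, question)] > 1)
        result[question] = {
            "answer_change_rate": changed / n,
            "error_rate": len(errors[question]) / n,
            "skip_rate": len(skips[question]) / n,
        }
    return result


indicators = question_indicators(events)
```

In this made-up log, q1 shows a high answer change rate, while q2 shows both an error prompt and a skip, so the two questions would call for different follow-up.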

Collecting online survey paradata can be time intensive and technically challenging because the most common online survey platforms do not use the term paradata or allow you to easily download all survey paradata. Researchers need to figure out what their online platform collects, what the survey platform calls each type of paradata, and how it can be collected. Some paradata, moreover, can be collected only with the use of JavaScript or other technologies that are not offered by the standard package of a survey platform.

In addition, paradata need some pre-processing before they can be analyzed, and they often come in a non-rectangular format that requires some data munging to make sense of them.
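
To give a sense of this kind of data munging, the sketch below flattens a nested export into a rectangular table with one row per event. The JSON structure and all field names are invented for the example; an actual export will look different and usually needs more cleaning.

```python
import json

# Hypothetical nested paradata export: sessions contain pages, which
# contain events. The structure is an assumption for illustration.
raw = """
[
  {"respondent": "r1", "device": "mobile",
   "pages": [
     {"page": 1, "events": [{"type": "focus", "t": 0.0},
                            {"type": "click", "t": 4.2}]},
     {"page": 2, "events": [{"type": "click", "t": 9.8}]}
   ]},
  {"respondent": "r2", "device": "desktop",
   "pages": [
     {"page": 1, "events": [{"type": "click", "t": 3.1}]}
   ]}
]
"""


def flatten(raw_json):
    """Turn nested session/page/event records into one flat row per event."""
    rows = []
    for session in json.loads(raw_json):
        for page in session.get("pages", []):
            for event in page.get("events", []):
                rows.append({
                    "respondent": session["respondent"],
                    "device": session.get("device"),
                    "page": page["page"],
                    "event": event["type"],
                    "seconds": event["t"],
                })
    return rows


rows = flatten(raw)
```

Once the data are in this one-row-per-event shape, they can be loaded into any statistical package and aggregated by respondent, page, or event type.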

The use of paradata can shed light on what is happening in an online survey and, when used purposefully, can improve a survey, especially if collected and analyzed during the pretest phase. Nevertheless, researchers should be mindful of their time, as collecting and especially analyzing paradata can take a lot of effort that detracts from working on the substantive findings of the survey.

Reading

McClain, C. A., Couper, M. P., Hupp, A. L., Keusch, F., Peterson, G., Piskorowski, A. D., & West, B. T. (2019). A typology of web survey paradata for assessing total survey error. Social Science Computer Review, 37(2), 196–213. https://doi.org/10.1177/0894439318759670
