Blog: Lessons Learned from a User Study
After submitting a paper on a chatbot study to an HCI venue, I shared with the rest of the research group some interesting findings from the paper and lessons learned from the project—or everything that was not fit to print, as I put it.
Here are two specific examples.
Crowd workers: finding the right annotators with suitable screening questions.
As a validity check, we wanted to ensure that the chatbot deployed for the user study didn't spit out untruths and functioned as intended. Since it would have been challenging to find domain experts on campus willing to commit the time to annotate the data, we were interested in having crowd workers evaluate the quality of our chatbot's output. We identified existing literature on leveraging the wisdom of crowds to perform effective fact checking [1] [2] [3] and crafted our own annotation instructions accordingly.
To ensure that our annotators were knowledgeable about the topic, we included a six-item screening quiz before the main annotation task. Our internal testers could easily answer all six questions correctly, so we assumed the savvy annotators we hoped to recruit would have no trouble with it. In the end, however, only 22 of the 215 people who met our filtering criteria and signed up for the task on Prolific qualified as annotators. Although the results were enough to justify further data analysis and qualitative interviews, the smaller-than-expected pool of annotators who passed a short screening quiz represented neither a diverse crowd nor an expert group. In a future study, it may be better to either 1) reduce the difficulty of the screening quiz and loosen the inclusion criteria, or 2) adopt a slightly more rigorous quiz to ensure that the recruited annotators are true experts. Of course, it would also help to keep the annotation task running longer on Prolific so that infrequent users have a chance to sign up.
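One thing we could have done differently is to pilot the screening quiz internally and project how many qualified annotators a given pass threshold would yield before launching. The sketch below is a hypothetical illustration of that kind of sanity check; the pilot scores, sign-up count, and thresholds are made-up values, not figures from our study.

```python
# Hypothetical sketch: projecting how many qualified annotators a screening
# quiz might yield, using pilot quiz scores. All names and numbers here are
# illustrative assumptions, not data from the actual study.

def expected_recruits(pilot_scores, planned_signups, threshold):
    """Project the number of annotators who would pass the screen,
    given pilot quiz scores (out of 6) and the expected number of sign-ups."""
    pass_rate = sum(score >= threshold for score in pilot_scores) / len(pilot_scores)
    return pass_rate * planned_signups

# Example: a ten-person pilot where most respondents miss one or two items.
pilot_scores = [6, 5, 4, 6, 3, 5, 6, 2, 5, 4]

# Compare a strict "all six correct" bar against relaxed thresholds.
for threshold in (6, 5, 4):
    projected = expected_recruits(pilot_scores, planned_signups=215, threshold=threshold)
    print(f"threshold {threshold}: ~{projected:.0f} of 215 expected to qualify")
```

Even a rough projection like this would have flagged early on that requiring all six answers correct could shrink the annotator pool dramatically.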
Survey design: the wording of questions and response scales matter.
In the study, we sought to gauge our participants' affinity for political parties. A survey question asked participants to indicate, both before and after our treatment, how likely they would be to vote for each of our parties of interest. The responses helped us identify the party each participant was most closely aligned with and any changes in opinion. However, when we tried to decide whether an opinion change constituted a reinforcement of a prior stance, the responses became less useful. Many participants gave every party but one a rating close to the minimum of the scale, and it was difficult to tell whether such a low rating reflected disagreement with that party's platform or simply the existence of an ideologically similar party the participant would certainly vote for. If we were to design a similar survey again, we would ask participants to indicate their level of agreement on a clearly labeled response scale (strongly disagree to strongly agree), which would better separate the parties a respondent regards highly from those they regard poorly.
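To make the ambiguity concrete, here is a hypothetical sketch of the kind of pre/post comparison involved. The party names, the 1-7 scale, and the classification rules are illustrative assumptions, not our actual instrument or analysis code.

```python
# Hypothetical pre/post comparison on 1-7 vote-likelihood ratings.
# Party names, scale, and classification rules are illustrative assumptions.

def classify_change(pre, post):
    """Label a participant's opinion change from pre/post vote-likelihood ratings."""
    top_pre = max(pre, key=pre.get)
    top_post = max(post, key=post.get)
    if top_post != top_pre:
        return "switched preferred party"
    if post[top_post] > pre[top_pre]:
        return "reinforced prior stance"
    return "no clear change"

# The ambiguity described above: a participant who rates every party except
# one near the scale minimum may genuinely reject those platforms, or may
# simply be certain about their first choice. The numbers alone can't say,
# and a top rating at the scale ceiling leaves no room to register reinforcement.
pre  = {"Party A": 7, "Party B": 1, "Party C": 1}
post = {"Party A": 7, "Party B": 2, "Party C": 1}
print(classify_change(pre, post))  # -> "no clear change"
```

An agreement scale with labeled anchors would at least let us distinguish "strongly disagree with this party" from "I just like another party more."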
Key takeaways
The finer points of study design—whether in screening for crowd workers or phrasing survey questions—can greatly impact the quality of your data. Since re-running a user study can be prohibitively costly, it's essential to conduct trial runs and address potential issues upfront. Every small detail can shape the overall reliability and insightfulness of the study results.