
Category Archives: research

CHI Highlights: Persuasive Tech and Social Software for Health and Wellness

I want to take a few minutes to highlight a few papers from CHI 2011, spread across a couple of posts. There was lots of good work at this conference. This post will focus on papers in the persuasive technology and social software for health and wellness space, which is the aspect of my work that I was thinking about most during this conference.

  • Fit4life: the design of a persuasive technology promoting healthy behavior and ideal weight
    Stephen Purpura, Victoria Schwanda, Kaiton Williams, William Stubler, Phoebe Sengers

    Fit4life is a hypothetical system that monitors users’ behavior using a variety of tactics from the Persuasive Systems Design model. After describing the system (in a way that, as someone in the room commented, made the audience “look horrified”), the authors transition to a reflection on persuasive technology research and design, and how such a design can “spiral out of control.” As someone working in this space, I found the authors hit on some of the aspects that leave me a bit unsettled: persuasion vs. coercion, individual good vs. societal good, whether people choose their own viewpoints or are pushed to adopt those of the system designers, measurement and control vs. personal experiences and responsibility, and increased sensing and monitoring vs. privacy and surveillance, with the potential to eliminate boundaries between front stage and back stage spaces. The authors also discuss how persuasive systems with very strong coaching features can reduce the opportunity for mindfulness and for their users to reflect on their own situation: people can simply follow the suggestions rather than weigh the input and decide among the options.

    This is a nice paper and a good starting point for lots of discussions. I’m a bit frustrated that it was presented in a session that ran concurrently with the session on persuasive technology for health. As such, it probably did not (immediately) reach the audience that would have led to the most interesting discussion about the paper. In many ways, it argued for a “think about what it is like to live with” rather than a “pitch” approach to thinking about systems. I agree with a good bit of the potential tensions the authors highlight, but I think they are a bit harder on the persuasive tech community than is appropriate: in general, persuasive tech folks are aware that we are building systems intended to change behavior and that this is fraught with ethical considerations, while people outside of the community often do not think of their systems as persuasive or coercive, even when they are (again, I mean this in a Nudge, choice-environments sense). On the other hand, one presentation at Persuasive last year did begin with the statement “for the sake of this paper, set aside ethical concerns” (paraphrased), so clearly there is still room for improvement.

  • Designing for peer involvement in weight management
    Julie Maitland, Matthew Chalmers

    Based on interviews with nineteen individuals, the authors present an overview of approaches for involving peers in technology for weight management. These approaches fall into passive involvement (norms and comparisons) and five types of active involvement (table 1 in the paper): obstructive (“don’t do it”), inductive (“you should do it”), proactive (“do it with me”), supportive (“I’ll do it too”), and cooperative (“let’s do it together”). The last category includes competition, though there was some disagreement during the Q&A about whether that is the right alignment. The authors also find gender- and role-based differences in the perceived usefulness of peer-based interventions, such as differences in attitudes about competition.

    Designers could use these types of engagement to think about what their applications are supporting well or not so well. Here, I wish the authors had gone a bit further in linking the types of involvement to the technical mechanisms or features of applications and contexts, as I think that would be a better jumping-off point for designers. For those thinking about how to design social support into online wellness interventions, I think this paper, combined with Skeels et al., “Catalyzing Social Support for Breast Cancer Patients” (CHI 2010), and our own paper from CSCW 2011 (“‘It’s not that I don’t have problems, I’m just not putting them on Facebook’: Challenges and Opportunities in Using Online Social Networks for Health”), offers a nice high-level overview of some of the challenges and opportunities for doing so.

  • Mining behavioral economics to design persuasive technology for healthy choices
    Min Kyung Lee, Sara Kiesler, Jodi Forlizzi

    A nice paper that evaluates different persuasive approaches for workplace snack selection. These include:

    • default choice: a robot that presented all snack choices with equal convenience or presented the healthy one more visibly, or a website that showed all snack choices (in random order) or paginated them, with healthy choices shown on the first page.
    • planning: asking people to order a snack for tomorrow rather than select at the time of consumption.
    • information strategy: showing calorie counts for each snack.

    As one would expect, the default choice strategy was highly effective in increasing the number of people who chose the healthy snack (apples) rather than the unhealthy snack (cookies). The planning strategy was effective among people who had a healthy snacking lifestyle, while those who snacked unhealthily continued to choose cookies. Interestingly, the information strategy had no effect on unhealthy snackers and actually led healthy snackers to choose cookies more than they otherwise would have. The authors speculate that this is either because the healthy snackers overestimate the caloric value of cookies in the absence of information (and thus avoid them more), or because considering the healthy apple was sufficiently fulfilling even if they ultimately chose the cookie.

    Some questions the study leaves open: Would people behave the same if they had to pay for the snacks? What would happen in a longer-term deployment? What would have happened if the cookies were made the default, particularly for otherwise healthy snackers?

  • Side effects and “gateway” tools: advocating a broader look at evaluating persuasive systems
    Victoria Schwanda, Steven Ibara, Lindsay Reynolds, Dan Cosley

    Interviews with 20 Wii Fit users reveal side effects of its use: some stopped using it because it did not work, while others stopped because they moved on to other, preferred fitness activities (abandonment as success); a tension between whether the Fit is viewed as a game or an exercise tool (people rarely view it as both); and negative emotional impacts (particularly frustration when the system misinterpreted data, such as weight gains). One suggestion the authors propose is that behavior change systems might start with activities that better resemble games but gradually transition users to activities with fewer game-like elements, eventually weaning users off of the system altogether. In practice, I’m not sure how this would work, but I like this direction because it gets at one of my main critiques of gamification: take away the game and its incentives (which may distract from the real benefits of changing one’s behavior) and the behavior reverts quite quickly.

  • Means based adaptive persuasive systems
    Maurits Kaptein, Steven Duplinsky, Panos Markopoulos

    A lab experiment evaluating the effects of using multiple sources of advice (a single expert or the consensus of similar others) at the same time, of disclosing that advice is intended to persuade, and of allowing users to select their source of advice. (This is framed more generally as being about persuasive systems, but I think the framing is too broad: it’s really a study about advice.) Results: people are more likely to follow advice when they choose the source; people are less likely to follow advice when they are told that it is intended to persuade; and when shown both expert advice and consensus advice from similar others, subjects were less likely to follow the advice than when they were shown only expert advice — regardless of whether the expert and consensus advice concurred. This last finding is surprising to me and to the authors, who suggest that it may be a consequence of the higher cognitive load of processing multiple sources of advice; I’d love to see further work on this.

  • Opportunities for computing technologies to support healthy sleep behaviors
    Eun Kyoung Choe, Sunny Consolvo, Nathaniel F. Watson, Julie A. Kientz

    An aggregation of a literature review, interviews with sleep experts, a survey of 230 individuals, and input from 16 potential users, used to learn about opportunities and challenges for designing sleep technologies. The work leads to a design framework that considers the goal of the individual using the system, the system’s features, the source of the information supporting the design choices made, the technology used, the stakeholders involved, and the input mechanism. During the presentation, I found myself thinking a lot about two things: (1) the value of design frameworks and how to construct a useful one (I’m unsure of both) and (2) how this stacks up against Julie’s recent blog post that is somewhat more down on the opportunities of tech for health.

  • How to evaluate technologies for health behavior change in HCI research
    Predrag Klasnja, Sunny Consolvo, Wanda Pratt

    The authors argue that evaluating behavior change systems based solely on whether they changed the behavior is not sufficient, and often infeasible. Instead, they argue, HCI should focus on whether systems or features effectively implement or support particular strategies, such as self-monitoring or conditioning, which can be measured in shorter-term evaluations.

    I agree with much of this. I think the more useful HCI contributions in this area speak to which particular mechanisms or features worked, why and how they worked, and in what context one might expect them to work. Contributions that throw the kitchen sink of features at a problem and do not get into the details of how people reacted to the specific features and what the features accomplished may tell us that technology can help with a condition, but they do not, in general, do a lot to inform the designers of other systems. I also agree that shorter-term evaluations are often able to show that a particular feature is or is not working as intended, though longer-term evaluations are appropriate for understanding whether it continues to work. I am also reminded of the gap between the HCI community and the sustainability community pointed out by Froehlich, Findlater, and Landay at CHI last year, and I fear that deemphasizing efficacy studies and RCTs will limit the ability of the HCI community to speak to the health community. Someone is going to have to do the efficacy studies, and the HCI community may have to carry some of this weight in order for our work to be taken seriously elsewhere. Research can make a contribution without showing health improvements, but if we ignore the importance of efficacy studies, we imperil the relevance of our work to other communities.

  • Reflecting on pills and phone use: supporting awareness of functional abilities for older adults
    Matthew L. Lee, Anind K. Dey

    A four-month deployment of a system for monitoring medication taking and phone use in the homes of two older adults. The participants sought out anomalies in the recorded data; when they found them, they generally trusted the system and focused on explaining why the anomaly might have happened, turning first to their memory of the event and then to going over their routines or other records such as calendars and diaries. I am curious whether this trust would extend to a purchased product rather than one provided by the researchers (if so, this could be hazardous with an unreliable system); I could see arguments for it going either way.

    The authors found that these systems can help older adults remain aware of their functional abilities and better make adaptations to those abilities. Similar to what researchers have recommended for fitness journals or sensors, the authors suggest that people be able to annotate or explain discrepancies in their data and be able to view it jointly. They also suggest highlighting anomalies and showing them with other available contextual information about that date or time.

  • Power ballads: deploying aversive energy feedback in social media
    Derek Foster, Conor Linehan, Shaun Lawson, Ben Kirman

    I generally agree with Sunny Consolvo that feedback and consequences in persuasive systems should generally range from neutral to positive, and I have been reluctant (colleagues might even say “obstinate”) about including negative feedback in GoalPost or Steps. Julie Kientz’s work, however, finds that people with certain personalities think they would respond well to negative feedback. This work in progress tests negative (“aversive”) feedback in a pilot with five participants: Facebook posts about songs, stating that the participant was using a lot of energy. The participants seemed to respond okay to the posts — which are, in my opinion, pretty mild and not all that negative — and often commented on them. The authors interpret this as aversive feedback not leading to disengagement, but I think that’s a bit too strong a claim to make on this data: the participants, though unpaid, had been recruited to the study and likely felt some obligation to follow through to its end in a way that they would not for a commercially or publicly available system, and, with that feeling, may have commented out of a need to publicly explain or justify the usage shown in the posts. That last point isn’t particularly problematic, as such reflection may be useful. Still, this WiP and the existence of tools like Blackmail Yourself (which *really* hits at the shame element) do suggest that more work is needed on the efficacy of public, aversive feedback.

  • Descriptive analysis of physical activity conversations on Twitter
    Logan Kendall, Andrea Civan Hartzler, Predrag Klasnja, Wanda Pratt

    In my work, I’ve heard a lot of concern about posting health-related status updates and about seeing similar status updates from others, but I haven’t taken a detailed look at the status updates that people are currently making, which this WiP starts to do for physical activity posts on Twitter. By analyzing the results of queries for “weight lifting”, “Pilates”, and “elliptical”, the authors find posts that show evidence of exercise, plans for exercise, attitudes about exercise, requests for help, and advertisements. As the authors note, the limited search terms probably lead to a lot of selection bias, and I’d like to see more information about posts coming from automated sources (e.g., FitBit), as well as how people reply to the different genres of fitness tweets.

  • HappinessCounter: smile-encouraging appliance to increase positive mood
    Hitomi Tsujita, Jun Rekimoto

    A fun yet concerning alt.chi paper on pushing people to smile in order to increase positive mood. With features such as requiring a smile to open the refrigerator, positive feedback (lights, music) in exchange for smiles, automatic sharing of photos of facial expressions with friends or family members, and automatic posting of whether or not someone is smiling enough, this paper hits many of the points about which the Fit4life authors raise concerns.

The panel I co-organized with Margaret E. Morris and Sunny Consolvo, “Facebook for health: opportunities and challenges for driving behavior change,” and featuring Adam D. I. Kramer, Janice Tsai, and Aaron Coleman, went pretty well. It was good to hear what everyone, both familiar and new faces, is up to and working on these days. Thanks to my fellow panelists and everyone who showed up!

There was a lot of interesting work — I came home with 41 papers in my “to read” folder — so I’m sure that I’m missing some great work in the above list. If I’m missing something you think I should be reading, let me know!

Mindful Technology vs. Persuasive Technology

On Monday, I had the pleasure of visiting Malcolm McCullough’s Architecture 531 – Networked Cities for final presentations. Many of the students in the class are from SI, where we talk a lot about incentive-centered design, choice architecture, and persuasive technology, which seems to have resulted in many of the projects having a persuasive technology angle. As projects were pitched as “extracting behavior” or “compelling” people to do things, it was interesting to watch the discomfort in the reactions from students and faculty who don’t frame problems in this way.1

Thinking about this afterwards brought me back to a series of conversations at Persuasive this past summer. A prominent persuasive technology researcher said something along the lines of “I’m really only focusing on people who already want to change their behavior.” This caused a lot of discussion, with the major themes being: Is this a cop-out? Shouldn’t we be worried about the people who aren’t trying? Is this just a neat way of skirting the ethical issues of persuasive (read: “manipulative”) technology?

I’m starting to think that there may be an important distinction that helps address these questions, one between technology that pushes people to do something without them knowing it and technology that supports people in achieving a behavior change they desire. The first category might be persuasive technology; for now, I’ll call the second category mindful technology.

Persuasive Technology

I’ll call systems that push people who interact with them to behave in certain ways, without those people choosing the behavior change as an explicit goal, Persuasive Technology. This is a big category, and I believe that most systems are persuasive systems in that their design and defaults will favor certain behaviors over others (this is a Nudge-inspired argument: whether or not it is the designer’s intent, any environment in which people make choices is inherently persuasive).

Mindful Technology

For now, I’ll call technology that helps people reflect on their behavior, whether or not people have goals and whether or not the system is aware of those goals, mindful technology. I’d put apps like Last.fm and Dopplr in this category, as well as a lot of tools that might be more commonly classified as persuasive technology, such as UbiFit, LoseIt, and other trackers. While designers of persuasive technology are steering users toward a goal that the designers have in mind, the designers of mindful technology give users the ability to better know their own behavior, supporting reflection and/or self-regulation in pursuit of goals that the users have chosen for themselves.

Others working in the broad persuasive tech space have also been struggling with the issue of persuasion versus support for behaviors an individual chooses, and I’m far from the first to start thinking of this work as being more about mindfulness. Mindfulness is, however, a somewhat loaded term with its own meaning, and that may or may not be helpful. If I were to go with the tradition of “support systems” naming, I might call applications in this category “reflection support systems,” “goal support systems,” or “self-regulation support systems.”

Where I try to do my work

I don’t quite think that this is the right distinction yet, but it’s a start, and I think these are two different types of problems (that may happen to share many characteristics) with different sets of ethical considerations.

Even though my thinking is still a bit rough, I’m finding this idea useful in thinking through some of the current projects in our lab. For example, among the team members on AffectCheck, a tool to help people see the emotional content of their tweets, we’ve been having a healthy debate about how prescriptive the system should be. Some team members prefer something more prescriptive – guiding people to tweet more positively, for example, or to tweet in ways that are likely to increase their follower and reply counts – while I lean toward something more reflective – showing information about the tweet currently being authored, how the user’s tweets have changed over time, and how they stack up against the user’s followers’ tweets or the rest of Twitter. While even comparisons with friends or others offer evidence of a norm and can be incredibly persuasive, the latter design still seems to be more about mindfulness than about persuasion.

This is also more of a spectrum than a dichotomy, and, as I said above, all systems, by nature of being designed, constrained environments, will have persuasive elements. (Sorry, there’s no way of dodging the related ethical issues!) For example, users of Steps, our Facebook application to promote walking (and other activity that registers on a pedometer), have opted in to the app to maintain or increase their current activity level. They can set their own daily goals, but the app’s goal recommender will push them toward the fairly widely accepted recommendation of 10,000 steps per day. Other tools such as Adidas’s miCoach or Nike+ have both tracking and coaching features. Even if people are opting into specific goals, the limited menu of available coaching programs is itself a bit persuasive, as it constrains people’s choices.

Overall, my preference when designing is to focus on helping people reflect on their behavior, set their own goals, and track progress toward them, rather than to nudge people toward goals that I have in mind. This is partly because I’m a data junkie, and I love systems that help me learn more about what my behavior is without telling me what it should be. It is also partly because I don’t trust myself to persuade people toward the right goal at all times. Systems have a long history of handling exceptions quite poorly. I don’t want to build the system that makes someone feel bad or publicly shames them for using hotter water or a second rinse after a kid throws up in bed, or that takes someone to task for driving more after an injury.

I also often eschew gamification (for many reasons), and to the extent that my apps show rankings or leaderboards, I often like to leave it to the viewer to decide whether it is good to be at the top of the leaderboard or the bottom. To see how too much gamification can interfere with people working toward their own goals, consider the leaderboards on TripIt and similar sites. One person may want to have the fewest trips or miles, because they are trying to reduce their environmental impact or because they are trying to spend more time at home with family and friends, while another may be trying to maximize their trips. Designs that simply reveal data can support both goals, while designs that use terms like “winning” or that award trophies or badges to the person with the most trips start to shout: this is what you should do.

Thoughts?

What do you think? Is this a useful distinction? A cluttering of terms? Have I missed an existing, better framework for thinking about this?


1 Some of the discomfort was related to some of the projects’ use of punishment (a “worst wasters” leaderboard or similar). This would be a good time to repeat Sunny Consolvo’s guideline that feedback in persuasive technology should range from neutral to positive (Consolvo 2009), especially, in my opinion, in discretionary use situations – because otherwise people will probably just opt out.

@display

For those interested in the software that drives the SIDisplay, SI master’s student Morgan Keys has been working to make a generalized and improved version available. You can find it, under the name “@display” at this GitHub repository.

SIDisplay is a Twitter-based public display described in a CSCW paper with Paul Resnick and Emily Rosengren. We built it for the School of Information community, where it replaced a number of previous displays, including a Thank You Board (which we compare it to in the paper), a photo collage (based on the Context, Content & Community Collage), and a version of the Plasma Poster Network. Unlike many other Twitter-based displays, SIDisplay and @display do not follow a hashtag; instead, they follow @-replies to the display’s Twitter account. They also include private tweets, so long as the Twitter user has given the display’s Twitter account permission to follow them.
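For a rough sense of how that selection rule works, here is a minimal, hypothetical polling loop using tweepy. This is not the @display implementation (see the repository for that); the credentials, polling interval, and hand-off to the renderer are all placeholders.

```python
import time

import tweepy

# Credentials for the display's Twitter account (placeholders, not real keys).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

def new_mentions(since_id=None):
    """Fetch @-replies to the display's account, oldest first. Tweets from
    protected accounts show up only if they let the display account follow them."""
    mentions = api.mentions_timeline(since_id=since_id, count=50)
    return list(reversed(mentions))

last_seen = None
while True:
    for status in new_mentions(last_seen):
        print(f"@{status.user.screen_name}: {status.text}")  # hand off to the display renderer here
        last_seen = status.id
    time.sleep(60)  # poll well within Twitter's rate limits
```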

Word clouds to support reflection

When preparing our Persuasive 2010 paper on Three Good Things, we ended up cutting a section on using word clouds to support reflection. The section wasn’t central to this paper, but it highlights one of the design challenges we encountered, and so I want to share it and take advantage of any feedback.

Our Three Good Things application (3GT) is based on a positive psychology exercise that encourages people to record three good things that happen to them, as well as the reasons why they happened. By focusing on the positive, rather than dwelling on the negative, it is believed that people can train themselves to be happier.

Example 3GT tag clouds

When moving the application onto a computer (and out of written diaries), I wanted to find a way to leverage a computer’s ability to analyze a user’s previous good things and reasons to help them identify trends. If people are more aware of what makes them happy, or why these things happen, they might make decisions that cause these good things to happen more often. In 3GT, I made a simple attempt to support this trend detection by generating word clouds from a participant’s good things and reasons, using simple stop-wording and lowercasing but no stemming.
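For the curious, the processing was about this simple. Here is a minimal sketch of the idea (not the production code); the stop list is abbreviated and the example entries are made up.

```python
import re
from collections import Counter

# Abbreviated stop list; the real one would be longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "that", "it", "was", "my", "i"}

def cloud_weights(entries):
    """Word frequencies across a user's good things (or reasons):
    lowercase, drop stop words, no stemming. Word size in the cloud
    is proportional to the resulting count."""
    counts = Counter()
    for entry in entries:
        for word in re.findall(r"[a-z']+", entry.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

good_things = ["Had a great morning walk with the cat",
               "The dissertation proposal draft is going well"]
print(cloud_weights(good_things).most_common(5))
```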

Limited success for Word Clouds

When we interviewed 3GT users, we expected to find that the participants believed the word clouds helped them notice and reinforce trends in their good things. Results here were mixed. Only one participant we interviewed described how the combination of listing reasons and seeing them summarized in the word clouds had helped her own reflection:

“You’ve got tags that show up, like tag clouds on the side, and it kind of pulls out the themes… as I was putting the reasoning behind why certain [good] things would happen, I started to see another aspect of a particular individual in my life. And so I found it very fascinating that I had pulled out that information… it’s made me more receptive to that person, and to that relationship.”

A second participant liked the word cloud but was not completely convinced of its utility:

I like having the word cloud. I noticed that the biggest thing in my reason words is “cat”. (Laughs). And the top good words isn’t quite as helpful, because I’ve written a lot of things like ‘great’ and ‘enjoying’ – evidently I’ve written these things a lot of times. So it’s not quite as helpful. But it’s got ‘cat’ pretty good there, and ‘morning’, and I’m not sure if that’s because I’ve had a lot of good mornings, or I tend to write about things in the morning.

Another participant who had examined the word cloud noticed that “people” was the largest tag in his good things cloud and “liked that… [his] happiness comes from interaction with people,” but that he did not think that this realization had any influence over his behavior outside of the application.

One participant reported looking at the word clouds shortly after beginning to post. The words selected did not feel representative of the good things or reasons he had posted, and, feeling that they were “useless,” he stopped looking at them. He did say that he could imagine them “maybe” being useful as the words evolved over time, and later in the interview he revisited one of the items in the word cloud: “you know the fact that it says ‘I’m’ as the biggest word is probably good – it shows that I’m giving myself some credit for these good things happening, and that’s good.” But this level of reflection was prompted by the interview, not by day-to-day use of 3GT.

Another participant did not understand that word size in the word cloud was determined by frequency of usage and was even more negative:

It was like you had taken random words that I’ve typed, and some of them have gotten bigger. But I couldn’t see any reason why some of them would be bigger than the other ones. I couldn’t see a pattern to it. It was sort of weird… Some of the words are odd words… And then under the Reason words, it’s like they’ve put together some random words that make no sense.

Word clouds did sometimes help in ways that we had not anticipated. Though participants did not find that the clouds helped them identify trends that would influence future decisions, looking at the word cloud of her good things helped at least one participant’s mood:

I remember ‘dissertation’ was a big thing, because for a while I was really gunning on my dissertation, and it was going so well, the proposal was going well with a first draft and everything. So that was really cool, to be able to document that and see… I can see how that would be really useful for when I get into a funk about not being able to be as productive as I was during that time… I like the ‘good’ words. They make me feel, I feel very good about them.

More work?

The importance of supporting reflection has been discussed in the original work on Three Good Things, as well as in other work showing that systems that support effective self-reflection can improve users’ ability to adopt positive behaviors and increase their feelings of self-efficacy. While some users found the word clouds helpful for reflection, a larger portion did not notice them or found them unhelpful. More explanation should be provided about how the word clouds are generated to avoid confusion, and perhaps they should not be shown until a participant has entered a sufficient amount of data. To help participants better notice trends, improved stop-wording might be used, as well as detection of n-grams (e.g., “didn’t smoke” versus “smoke”) and grouping of similar terms (e.g., combining “bread” and “pork” into “food”). Alternatively, a different kind of reflection exercise might be more effective, one where participants are asked to review their three good things posts and write a longer summary of the trends they have noticed.
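As a sketch of the kind of preprocessing improvement I have in mind, the counting step could merge negation bigrams and map related terms to a shared label before building the cloud. The particular negation rule and category map below are only illustrations.

```python
NEGATIONS = {"didn't", "not", "no", "never"}
GROUPS = {"bread": "food", "pork": "food", "dinner": "food"}  # illustrative grouping only

def normalize(tokens):
    """Merge negation bigrams ("didn't smoke" -> "didn't_smoke") and map
    related terms to a shared label before counting word frequencies."""
    merged, skip = [], False
    for i, tok in enumerate(tokens):
        if skip:
            skip = False
            continue
        if tok in NEGATIONS and i + 1 < len(tokens):
            merged.append(tok + "_" + tokens[i + 1])
            skip = True
        else:
            merged.append(GROUPS.get(tok, tok))
    return merged

print(normalize(["didn't", "smoke", "ate", "bread"]))  # ["didn't_smoke", "ate", "food"]
```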

Using Mechanical Turk for experiments

In my upcoming CHI paper, “Presenting Diverse Political Opinions: How and How Much,” we used Amazon’s Mechanical Turk (AMT) to recruit subjects and to administer the study. I’ll talk a bit more about the research questions and results in a future post, but I’ve had enough questions about using Mechanical Turk that I think a blog post may be helpful.

In this study, Paul Resnick and I explored individuals’ preferences for diversity of political opinion in online news aggregators and evaluated whether some very basic presentation techniques might affect satisfaction with the range of opinions represented in a collection of articles.

To address these questions, we needed subjects from the United States with known political preferences and at least some basic political knowledge, and we wanted to collect some demographic information about each subject. Each approved subject was then assigned either to a manipulation check group or to the experimental group. Subjects in the manipulation check group viewed individual articles and indicated their agreement or disagreement with each; subjects in the experimental group viewed entire collections and answered questions about each collection. Subjects in the experimental group were also assigned to a particular treatment (how the list would appear to them). Once approved, subjects could view a list up to once per day.

Screening. To screen subjects, we used a Qualification test in AMT. When unqualified subjects viewed a task (HIT – human intelligence task, in mTurk parlance), they were informed that they needed to complete a qualification. The qualification test asked subjects two questions about their political preferences, three multiple-choice questions about US politics, and a number of demographic questions. Responses were automatically downloaded and evaluated to complete screening and assignment.

To limit our subjects to US residents, we also used the automatic locale qualification.

Assignment. We handled subject assignment in two ways. To distinguish between the treatment group and the manipulation check group, we created two additional qualifications that were automatically assigned; an approved subject would be granted only one of these qualifications, and thus could only complete the associated task type.

Tasks (HITs). The task implementation was straightforward. We hosted tasks on our own server using the external question HIT type. When a subject loaded a task, AMT passed us the subject’s worker ID. We verified that the subject was qualified for the task and loaded the appropriate presentation for that subject. Each day, we uploaded one task of each type, each with many assignments (the number of turkers who can complete that task).

Because we needed real-time access to the manipulation check data, the responses to this task were stored in our own database after a subject submitted the form; the subject could then return to AMT. This was not necessary for the experimental data, and so the responses were sent directly to AMT for later retrieval.
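A minimal sketch of that flow, assuming a Flask app on our server: AMT appends the worker ID (and assignment ID) to the external question URL, and the qualification lookup, template names, and treatment records below are hypothetical stand-ins for what we actually stored.

```python
from flask import Flask, abort, render_template, request

app = Flask(__name__)

# Hypothetical stand-ins for the qualification and assignment records we kept.
GROUP_FOR_WORKER = {"A1EXAMPLE": "experimental", "A2EXAMPLE": "manipulation_check"}
TREATMENT_FOR_WORKER = {"A1EXAMPLE": "list_presentation_a"}

@app.route("/hit")
def serve_hit():
    worker_id = request.args.get("workerId")          # appended to the URL by AMT
    assignment_id = request.args.get("assignmentId")  # needed to submit back to AMT
    group = GROUP_FOR_WORKER.get(worker_id)
    if group is None:
        abort(403)  # not qualified (or previewing); point them at the qualification test
    # Render the manipulation check or the assigned list presentation for this subject.
    return render_template(f"{group}.html",
                           treatment=TREATMENT_FOR_WORKER.get(worker_id),
                           assignment_id=assignment_id)
```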

Quality control. Careless clicking or hurrying through the task is a potential problem on mTurk, and using multiple raters does not work when asking subjects about their opinions. Kittur, Chi, and Suh recommend asking turkers verifiable questions as a way to deal with this problem1. We did not, however, ask verifiable questions about any of the articles or the list, because doing so might have changed how turkers read the list and responded to our other questions. Instead, we randomly repeated a demographic question from the qualification test. Five subjects changed their answers substantially (e.g., aging more than one year or in reverse, or shifting on either of the political spectrum questions by 2 points or more). Though there are many possible explanations for these shifts – such as shared accounts within a household, careless clicking, easily shifting political opinions, deliberate deception, or lack of effort – none of them is desirable in study subjects, and so these subjects were excluded. We also examined how long each subject took to complete tasks (looking for implausibly fast responses); this did not lead to the exclusion of any additional subjects or responses.
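The repeated-question check boils down to a small consistency test like this sketch (the field names and answer encoding are assumptions):

```python
def answers_shifted(qual, repeat):
    """Flag a subject whose repeated demographic answers shifted implausibly:
    age going backwards or jumping by more than a year, or either political
    spectrum rating moving by two or more points."""
    age_change = repeat["age"] - qual["age"]
    if age_change < 0 or age_change > 1:
        return True
    return any(abs(repeat[q] - qual[q]) >= 2
               for q in ("political_spectrum_1", "political_spectrum_2"))  # hypothetical keys

# A subject who "aged" three years mid-study would be excluded.
print(answers_shifted({"age": 30, "political_spectrum_1": 2, "political_spectrum_2": 3},
                      {"age": 33, "political_spectrum_1": 2, "political_spectrum_2": 3}))  # True
```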

Some reflection. We had to pay turkers a bit more than we expected (~$12/hr), and we recruited fewer subjects than we anticipated. The unpaid qualification task may be a bit of a barrier, especially because potential subjects could only complete one of our paid tasks per day (and only one was listed at a time). We might instead have implemented the qualification as a paid task, but that might have meant paying for subjects who would never return to complete an actual task.

Further resources

1. Kittur, A., Chi, E. H., and Suh, B. (2009). “Crowdsourcing User Studies With Mechanical Turk,” Proc. CHI 2009: 453-456. (ACM | PARC)
2. Mason, W. and Watts, D. J. (2009). “Financial incentives and the ‘performance of crowds,’” SIGKDD Workshop on Human Computation: 77-85. (ACM | Yahoo)

This study is part of the BALANCE project and was funded by NSF award #IIS-0916099.

updated viz of political blogs’ link similarity

I’ve been meaning to post a simple update to my previous visualization of political blogs’ link similarities. In the previous post, I used GEM for layout, which was not, in hindsight, the best choice.

In the visualization in this post, the edges between blogs (the nodes, colored as liberal, independent, or conservative) are weighted by the Jaccard similarity between the two blogs’ sets of linked items. The visualization is then laid out in GUESS using multidimensional scaling (MDS) based on the Jaccard similarities.
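For reference, the edge weights are just pairwise Jaccard similarities over each blog’s set of linked items; a minimal sketch (the input format is assumed, and the MDS layout itself was done in GUESS):

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two blogs' sets of linked items."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# blog_links maps each blog to the set of items it linked to (assumed input format).
def similarity_edges(blog_links):
    for u, v in combinations(blog_links, 2):
        weight = jaccard(blog_links[u], blog_links[v])
        if weight > 0:
            yield u, v, weight
```

The dissimilarities (one minus these weights) are what the MDS layout works from.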

three good things

The first of my social software for wellness applications, Three Good Things, is available on Facebook (info page).

Three Good Things supports a positive psychology exercise in which participants record three good things, and why these things happened. When completed daily – even on the bad days – over time, participants report increased happiness and decreased symptoms of depression. The good things don’t have to be major events – a good meal, a phone call with a friend or a family member, or a relaxing walk are all good examples.

I’m interested in identifying best practices for deploying these interventions on new or existing social websites, where adding social features may make the intervention more or less effective for participants, or may just make some participants more likely to complete the exercise on a regular basis. Anyway, feel free to give the app a try – you’ll be helping my research and you may end up a bit happier.

Sidelines at ICWSM

Last week I presented our first Sidelines paper (with Daniel Zhou and Paul Resnick) at ICWSM in San Jose. Slides (hosted on slideshare) are embedded below, or you can watch a video of most of the talk on VideoLectures.

Opinion and topic diversity in the output sets of news aggregators can provide individual and societal benefits. But news aggregators that rely on votes and links to select subsets of the large quantity of news and opinion items generated each day may not yield as much diversity as is present in the overall pool of votes and links if they simply select the most popular items.

To help measure how well any given approach achieves these goals, we developed three diversity metrics that address different dimensions of diversity: inclusion/exclusion, nonalienation, and proportional representation (based on KL divergence).
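As a rough illustration of the third metric, one can compare the share of each opinion group represented in the selected items against its share of the overall pool using KL divergence. The smoothing and the notion of “share” below are my shorthand here, not the paper’s exact formulation.

```python
import math

def proportionality_gap(selected_share, pool_share, eps=1e-9):
    """KL divergence D(selected || pool) between the share of each opinion
    group represented in the selected items and its share of the overall
    pool; lower values mean more proportional representation."""
    return sum(p * math.log((p + eps) / (pool_share.get(group, 0.0) + eps))
               for group, p in selected_share.items() if p > 0)

pool = {"liberal": 0.45, "conservative": 0.45, "independent": 0.10}
selected = {"liberal": 0.70, "conservative": 0.25, "independent": 0.05}
print(proportionality_gap(selected, pool))
```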

To increase diversity in result sets chosen based on user votes (or things like votes), we developed the Sidelines algorithm, which temporarily suppresses a voter’s preferences after one of that voter’s preferred items has been selected. In comparison to collections of the most popular items, computed from user votes on Digg.com and from links from a panel of political blogs, the Sidelines algorithm increased inclusion while decreasing alienation. For the blog links, a set with known political preferences, we also found that Sidelines improved proportional representation.
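A minimal sketch of the idea (not the exact algorithm from the paper; the sidelining duration and tie-breaking here are assumptions):

```python
def sidelines(votes, k, sideline_turns=2):
    """Greedy item selection with temporary voter suppression. votes maps
    each item to the set of voters who voted for it. After an item is picked,
    its supporters sit out the next few rounds so that other voters'
    preferences can surface."""
    remaining = dict(votes)
    sidelined = {}                      # voter -> rounds of suppression left
    selected = []
    while remaining and len(selected) < k:
        # Count only support from voters who are not currently sidelined.
        best = max(remaining,
                   key=lambda item: sum(1 for v in remaining[item]
                                        if sidelined.get(v, 0) == 0))
        selected.append(best)
        # A round passes: tick down existing sideline clocks...
        sidelined = {v: t - 1 for v, t in sidelined.items() if t > 1}
        # ...then sideline the supporters of the item just chosen.
        for voter in remaining.pop(best):
            sidelined[voter] = sideline_turns
    return selected

votes = {"a": {1, 2, 3}, "b": {1, 2}, "c": {4}, "d": {3, 4}}
print(sidelines(votes, k=3))  # ['a', 'c', 'b']: in round two, items backed only by a's supporters score zero
```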

Our approach differs from, and is complementary to, work that selects for diversity or identifies bias by classifying content (e.g., Park et al., NewsCube) or by classifying referring blogs or voters (e.g., Gamon et al., BLEWS). While Sidelines requires votes (or something like votes), it doesn’t require any information about content, voters, or long-term voting histories. This is particularly useful for emerging topics and opinion groups, as well as for non-textual items.

visualizing political blogs’ linking

There are a number of visualizations of political bloggers’ linking behavior, notably Adamic and Glance’s 2005 work, which found that political bloggers of one bias tend to link to others of the same bias. Also check out Linkfluence’s Presidential Watch 08 map, which shows similar behavior.

These visualizations are based on graphs of when one blog links to another. I was curious to what extent this two-community behavior occurs if you include all of the links from these blogs (such as links to news items). Since I have link data for about 500 blogs from the news aggregator work, it was straightforward to visualize a projection of the bipartite blog->item graph. To classify each blog as liberal, conservative, or independent, I used a combination of the coding from Presidential Watch, Wonkosphere, and my own reading.


Projection of links from political blogs to items (Oct - Nov 2008). Layout using GEM algorithm in GUESS.

The visualization shows blogs as nodes. Edges represent shared links (at least 6 items must be shared before drawing an edge) and are sized based on their weight. Blue edges run between liberal blogs, red edges between conservative blogs, maroon between conservative and independent, violet blue between liberal and independent, purple between independent blogs, and orange between liberal and conservative blogs. Nodes are sized as a log of their total degree. This visualization is formatted to appear similar to the Adamic and Glance graph, though there are some important differences, principally because this graph is undirected and because I have included independent blogs in the sample.
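Concretely, the projection just counts shared linked items between each pair of blogs; a minimal sketch of how the weighted blog graph can be built (the input format is assumed):

```python
from itertools import combinations

# blog_items maps each blog to the set of item URLs it linked to (assumed input format).
def project_blog_graph(blog_items, min_shared=6):
    """Project the bipartite blog->item graph onto blogs: connect two blogs
    with an edge weighted by the number of linked items they share, keeping
    only pairs that share at least min_shared items."""
    edges = {}
    for a, b in combinations(blog_items, 2):
        shared = len(blog_items[a] & blog_items[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges
```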

This is just a quick look, but we can see that the overall linking behavior still produces two fairly distinct communities, though they are a bit more connected than in the graph of blog-to-blog links alone. It would be fun to remove the linked blog posts from this data (leaving mostly linked news items) to see whether that changes the picture much. Are some media sources setting the agenda for bloggers of both parties, or are conservative bloggers reading and reacting to one set of news items and liberal bloggers reading and reacting to another? That is, is the homophily primarily in links to opinion articles, or does it also extend to the linked news items?

I’m out of time at this point in the semester, though, so that will have to wait.

bias mining in political bloggers’ link patterns

I was pretty excited by the work that Andy Baio and Joshua Schachter did to identify and show the political leanings in the link behavior of blogs that are monitored by Memeorandum. They used singular value decomposition [1] on an adjacency matrix between sources and items based on link data from 360 snapshots of Memeorandum’s front page.

For the political news aggregator project, we’ve been gathering link data from about 500 blogs. Our list of sources is less than half the size of theirs (I only include blogs that make full posts available in their feeds), but we do have full link data rather than snapshots, so I was curious whether we would get similar results.

The first 10 columns of two different U matrices are below. They are both based on link data from 3 October to 7 November; the first includes items that had an in-degree of at least 4 (5934 items), the second includes items with an in-degree of at least 3 (9722 items). In the first, the second column (v2) seems to correspond fairly well to the political leaning of the blog; in the second, the second column (v3) is better.
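For concreteness, here is a minimal numpy sketch of the computation, assuming a binary source-by-item matrix built from the link data; the in-degree filtering happens when choosing which item columns to keep, and the tiny example input is made up.

```python
import numpy as np

def source_item_matrix(source_links, min_indegree=4):
    """Binary sources x items matrix, keeping only items linked to by at
    least min_indegree sources. source_links maps each source to the set
    of items it linked to."""
    item_counts = {}
    for links in source_links.values():
        for item in links:
            item_counts[item] = item_counts.get(item, 0) + 1
    items = sorted(i for i, c in item_counts.items() if c >= min_indegree)
    sources = sorted(source_links)
    col = {item: j for j, item in enumerate(items)}
    A = np.zeros((len(sources), len(items)))
    for row, source in enumerate(sources):
        for item in source_links[source]:
            if item in col:
                A[row, col[item]] = 1.0
    return sources, A

sources, A = source_item_matrix({"blog_a": {"x", "y"}, "blog_b": {"y", "z"}}, min_indegree=1)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(dict(zip(sources, U[:, 1])))  # a column of U that may track political leaning
```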

I’ll be the first to say that I haven’t had much time to look at these results in any detail, and, as some of the commenters on Andy’s post noted, there are probably better approaches than SVD for identifying bias. If you’d like to play too, you can download a csv file with the sources and all links with an in-degree >= 2 (21517 items, 481 sources). Each row consists of the source title, the source URL, and then a list of the items the source linked to from 3 October to 7 November. Some sources were added partway through this window, and I didn’t collect link data from before they were added.

[1] One of the more helpful singular value decomposition tutorials I found was written by Kirk Baker and is available in PDF.