How do Programmers Ask and Answer Questions on the Web?

Stack Overflow is a popular Q&A website featuring questions and answers on a wide range of programming topics. In the two years since its foundation in 2008, more than 1 million questions have been asked on Stack Overflow, and more than 2.5 million answers have been provided.

Stack Overflow — like many other Q&A websites such as Yahoo! Answers, or the recently created Facebook Questions — is founded on the success of social media and built around an “architecture of participation” where user data is aggregated as a side-effect of using Web 2.0 applications.

To understand the role of Q&A websites in the software documentation landscape, Ohad Barzilay, Peggy and I wrote a paper in which we pose research questions and report preliminary results, using qualitative and quantitative research methods. The paper was just accepted at the NIER (New Ideas and Emerging Results) track of ICSE 2011 in Hawaii.

We pose the following five research questions:

  1. What kinds of questions are asked on Q&A websites for programmers?
  2. Which questions are answered and which ones remain unanswered?
  3. Who answers questions and why?
  4. How are the best answers selected?
  5. How does a Q&A website contribute to the body of software development knowledge?

For the NIER paper, we focused on the first two questions. We created a script to extract questions along with all answers, tags and owners using the Stack Overflow API. We then analyzed quantitative properties of questions, answers and tags, and we applied qualitative codes to a sample of tags and questions.
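To give a flavor of the kind of extraction involved, here is a minimal sketch. This is not our original script: the endpoint and field names below follow the current Stack Exchange API (v2.3), which differs from the 2010-era API we actually used, and the sample payload is an invented example in the shape the API returns.

```python
import json
from urllib.parse import urlencode

# Current Stack Exchange API root; the paper used the 2010-era API.
API_ROOT = "https://api.stackexchange.com/2.3"

def questions_url(page=1, pagesize=100, tagged=None):
    """Build a URL for fetching Stack Overflow questions, newest first."""
    params = {"site": "stackoverflow", "page": page,
              "pagesize": pagesize, "sort": "creation", "order": "desc"}
    if tagged:
        params["tagged"] = tagged
    return f"{API_ROOT}/questions?{urlencode(params)}"

def extract_questions(payload):
    """Pull out the fields we analyzed: title, tags, owner, answers."""
    return [{"title": q["title"],
             "tags": q["tags"],
             "owner": q.get("owner", {}).get("display_name"),
             "answers": q.get("answer_count", 0),
             "answered": q.get("is_answered", False)}
            for q in payload.get("items", [])]

# A trimmed, invented sample response in the shape the API returns:
sample = json.loads("""{"items": [{"title": "How do I parse JSON?",
  "tags": ["python", "json"], "owner": {"display_name": "jane"},
  "answer_count": 3, "is_answered": true}]}""")

print(questions_url(page=2, tagged="java"))
print(extract_questions(sample))
```

Paging through results like this, question by question, is what lets the quantitative analysis of questions, answers and tags scale to the full archive.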

Our preliminary findings indicate that Stack Overflow is particularly effective for code reviews, conceptual questions and novices. The most common questions include how-to questions and questions about unexpected behaviors.

Understanding the processes that lead to the creation of knowledge on Q&A websites will enable us to make recommendations on how individuals and companies, as well as tools for programmers, can leverage the knowledge and use Q&A websites effectively. It will also shed light on the credibility of documentation on these websites.

A pre-print of the paper is available here
(© ACM, 2011. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in the Proceedings of the International Conference on Software Engineering (ICSE) 2011.)

This is the abstract of the paper:

Question and Answer (Q&A) websites, such as Stack Overflow, use social media to facilitate knowledge exchange between programmers and fill archives with millions of entries that contribute to the body of knowledge in software development. Understanding the role of Q&A websites in the documentation landscape will enable us to make recommendations on how individuals and companies can leverage this knowledge effectively. In this paper, we analyze data from Stack Overflow to categorize the kinds of questions that are asked, and to explore which questions are answered well and which ones remain unanswered. Our preliminary findings indicate that Q&A websites are particularly effective at code reviews and conceptual questions. We pose research questions and suggest future work to explore the motivations of programmers that contribute to Q&A websites, and to understand the implications of turning Q&A exchanges into technical mini-blogs through the editing of questions and answers.

Web2SE 2011: 2nd International Workshop on Web 2.0 for Software Engineering

Web 2.0 technologies such as wikis, blogs, tags and feeds have been adopted and adapted by software engineers. With Web2SE, we aim to provide a venue for pertinent work by highlighting current state-of-the-art research, by identifying future research directions, and by discussing implications of Web 2.0 on software engineering.

Following the success of the first Web2SE at ICSE 2010, Web2SE 2011 (the 2nd International Workshop on Web 2.0 for Software Engineering) has been accepted at ICSE 2011 in Honolulu, Hawaii. We welcome research papers (max. 6 pages) as well as poster and position papers (max. 2 pages) as submissions. The final version of the accepted papers will be published in the ICSE Proceedings. The deadline for submission is January 28, 2011 (Call for papers). We look forward to reading your papers!

Web2SE 2011 is organized by Margaret-Anne (Peggy) Storey, Arie van Deursen, Andrew Begel, Sue Black and myself. The workshop website is online at

You can also follow us on Twitter and connect with us on Facebook or LinkedIn.

Here’s the abstract:

Social software is built around an “architecture of participation” where user data is aggregated as a side-effect of using Web 2.0 applications. Web 2.0 implies that processes and tools are socially open, and that content can be used in several different contexts. Web 2.0 tools and technologies support interactive information sharing, data interoperability and user centered design. For instance, wikis, blogs, tags and feeds help us organize, manage and categorize content in an informal and collaborative way. Some of these technologies have made their way into collaborative software development processes and development platforms. These processes and environments are just scratching the surface of what can be done by incorporating Web 2.0 approaches and technologies into collaborative software development. Web 2.0 opens up new opportunities for developers to form teams and collaborate, but it also comes with challenges for developers and researchers. Web2SE aims to improve our understanding of how Web 2.0, manifested in technologies such as mashups or dashboards, can change the culture of collaborative software development.

FlexiTools 2011

I’m on the Program Committee for the Workshop on Flexible Modeling Tools (FlexiTools) at ICSE 2011 in Honolulu, Hawaii.

The workshop addresses the problem that formal modeling tools (such as UML diagram editors) and more informal but flexible, free-form approaches (such as whiteboards or office tools) have complementary strengths and weaknesses. Whichever approach practitioners choose for a particular task, they lose the advantages of the other, with attendant frustration and loss of productivity. The goal of this workshop is to develop flexible modeling tools that blend the advantages of modeling tools and the more free-form approaches.

The focus of FlexiTools 2011 will be on challenge problems in the area of flexible modeling, with the goal of identifying a foundational set of challenges and concerns for the field and promising directions for addressing each.

Prospective participants are invited to submit 2-5 page position papers on any topic relevant to the dichotomy between modeling tools and more free-form approaches.

I’m looking forward to reading your papers! Submissions are due on February 7, 2011.

This is the workshop website

2010 – A year in travels

2010 has been a year aboard planes, ferries, buses, trains and cars for me. I even reached the threshold for Air Canada Elite status a few months ago.

New Year’s Eve is a nice opportunity to look back on a year of travels, as I’m sitting at the ferry terminal south of Vancouver waiting for my last trip of the year — the ferry back to Victoria.

In terms of travels, 2010 started for me in February with a trip to Vancouver for the 2010 Winter Olympics. In March, I did a trip that combined a fun visit to Las Vegas, a talk at the University of California, Irvine, and a visit to my family in Germany. The next trip in April took me to IBM Toronto for the CAS University Days.

The highlight of the year was the trip to Cape Town, South Africa, via London in May for ICSE (see separate blog post here). Following ICSE, I spent a good three months at IBM in Ottawa, from late May until early September. During that time, I flew out to Vancouver for a weekend and did several weekend trips to Montreal. I also drove down to New York state for cross-border shopping; to Providence, Rhode Island, and the Boston area to visit friends; and to IBM Hawthorne to give a talk.

After returning to Canada’s west coast in September, I spent a week travelling around southern British Columbia with relatives from Germany and I attended SPLASH in Reno, CSER and CASCON in Toronto, and FSE in Santa Fe. The final trip of the year was a drive down to Seattle from Vancouver.

Here’s a Bing map as a summary. I’m looking forward to another year of travels starting soon!

Work Item Tagging: Communicating Concerns in Collaborative Software Development

After a round of minor revisions, the extension of our ICSE 2009 paper on tagging in software development has been accepted for the TSE Special Issue of Best Papers from ICSE 2009.

For the TSE paper, Peggy and I replicated the previously reported case study on how software developers use tags for work items in two ways:

  1. We extended the study of the Jazz team in several ways; and
  2. We conducted another case study with a different team at IBM that was less likely to be biased towards using the tagging feature.

For the Jazz case study, we analyzed another year of archival data on tagging activities; we observed the team and its tagging of work items for an additional five months on-site; and we conducted interviews with two additional team members. For the second case study, we observed a different group of developers for two weeks, conducted interviews with six individuals in various roles, and analyzed tagging data from a one-year period. Many of the original findings were replicated through these additional case studies, but the key differences in the extended paper are that:

  1. Our findings on the collaborative aspects of tagging are more comprehensive because of the extended ethnography and the eight additional interviews across both cases.
  2. The categorization for the types of tags used is treated much more thoroughly with new categories emerging from the analysis (many across both cases).

This is the abstract of the paper:

In collaborative software development projects, work items are used as a mechanism to coordinate tasks and track shared development work. In this paper, we explore how “tagging”, a lightweight social computing mechanism, is used to communicate matters of concern in the management of development tasks. We present the results from two empirical studies over 36 and 12 months respectively on how tagging has been adopted and what role it plays in the development processes of several professional development projects with more than 1,000 developers in total. Our research shows that the tagging mechanism was eagerly adopted by the teams, and that it has become a significant part of many informal processes. Different kinds of tags are used by various stakeholders to categorize and organize work items. The tags are used to support finding of tasks, articulation work and information exchange. Implicit and explicit mechanisms have evolved to manage the tag vocabulary. Our findings indicate that lightweight informal tool support, prevalent in the social computing domain, may play an important role in improving team-based software development practices.

Update [November 11, 2010]: The preprint of the paper is now available here (IEEE Computer Society).

Congress 2010 — An interdisciplinary experience in Montreal

As suggested by the outside member on my PhD committee — Ray Siemens from the Department of English at UVic — I attended a day of Congress 2010 in Montreal in early June.

Congress 2010 refers to the Congress of the Humanities and Social Sciences, an event held once a year at a Canadian university. The Congress is the premier destination for Canada’s scholarly community in the Humanities and Social Sciences. Congress 2010 at Concordia University in Montreal had about 9,000 attendees.

The main reason I attended was a panel discussion between Pierre Levy and Alan Liu, two of the leading minds in the field of new technologies and their impact on society. It was a bilingual conversation (truly Canadian, with simultaneous translation) called “Collective Intelligence or Silicon Cage?: Digital culture in the 21st century”. Levy’s point was basically that digital media can help us understand our knowledge as a society because it works as a mirror of collective intelligence, while Liu warned that we run the risk of monotony and singularity, and that everything converges towards the same idea if we don’t have several institutions between the individual and the universal. I was fortunate enough to have lunch with both panelists and to discuss some of my research with them. Collective intelligence and emergent knowledge structures in software development are closely related.

It was very stimulating and inspiring to attend an event from a different discipline, and I found it also really interesting to see how these events are organized in other disciplines. Some good ideas that we might be able to adapt for Software Engineering venues:

  • Use a mix of panel discussions and paper presentations, to foster a more interactive environment. We have controversial issues in Software Engineering as well.
  • Produce YouTube clips with highlights from every day, such as this one. It’s a great way to keep people in the loop who are unable to attend, and it captures the spirit of the event.
  • Choose a university campus as the venue, especially one that’s right downtown. It was great to be fully immersed in Montreal during the lunch breaks, with a huge selection of lunch places.
  • Use a different pricing model. I paid a total of 15 dollars to attend Congress 2010.
  • Be interdisciplinary. Meeting researchers from other disciplines can be very inspiring. It forces us to focus on the essence of our work, gives us the opportunity to find a broader perspective, and can lead to great ideas. When it comes to related work, I’m thinking 15th century now…

ICSE 2010 highlights

Now that the papers from ICSE 2010 are available in the ACM digital library (Volume 1, Volume 2, workshops), it’s time for a blog post about my personal highlights from ICSE 2010 in Cape Town, South Africa, in May 2010. This is of course very subjective, and it follows the tradition that Jorge Aranda started last year.


For me, ICSE 2010 started off with SUITE, the 2nd International Workshop on Search-Driven Development organized by Sushil Bajracharya, Adrian Kuhn, Joel Ossher and Yunwen Ye.

I was pleasantly surprised by how well the very discussion-focused format of SUITE worked. The paper presentations were short (5 minutes) and they were all done in the morning. That left quite a bit of time for discussion in the morning, and even more time for discussion in the afternoon. These discussions didn’t seem overly regulated, and it was great to see how topics emerged.

Topics of the workshop ranged from API search and immediate search in the IDE to dynamic filtering and Semantic Web. As discussion topics for the afternoon we selected IDE integration, developer needs, and the creation of a reference collection for SUITE researchers. It was great to see that developer needs turned out to be the most popular topic — a first sign that the ICSE community is focusing on human aspects more and more.


FlexiTools, the workshop on Flexible Modeling Tools, organized by Harold Ossher, André van der Hoek, Margaret-Anne Storey, John Grundy and Rachel Bellamy, covers a really interesting area. The workshop addresses the problem that formal modeling tools (such as UML diagram editors) and more informal but flexible, free-form approaches (such as whiteboards or office tools) have complementary strengths and weaknesses. The goal of this workshop is to develop flexible modeling tools that have the advantages of both approaches.

The workshop was structured into 3 paper sessions and a concluding discussion session. The day started with requirements for flexible modeling tools, focusing on support for creative work, support for incremental work and changing conditions, support for alternatives, and support for capturing the evolution of models. In the following session on “Unstructured to Structured”, we discussed how unstructured informal models could incrementally be transformed into structured formal models. Several tools such as BITKit were demoed in the afternoon session on Tool Infrastructure.


Due to several double-bookings, I wasn’t able to attend all of the Working Conference on Mining Software Repositories (MSR), but I did manage to present our MSR challenge paper on bug lifetimes in FreeBSD.

My personal highlight of MSR was Michele Lanza‘s keynote “The Visual Terminator”; his brilliant slides are online. The keynote was about software visualization, a term he defined as “the use of computer graphics to understand software”. After telling the stories behind tools such as CodeCrawler and CodeCity, he argued that it is time to rethink software, that software is more than text, and that visualization is the key. While I didn’t agree with all the points he made (I don’t think “empirical validation of visualizations is suicide”), the keynote had everything that a good keynote should have: an outsider’s perspective (visualization is not the core topic of MSR), a great speaker, and some provocative insights. My favorite quote: “Academic research is like Formula 1: driving around in circles, wasting gasoline … but it generates spin-off values!”


Our workshop on Web 2.0 for Software Engineering went really well. Two presentations by Sue Black outlining the results of surveys on the use of social media and Web 2.0 in software development set the stage for the rest of the workshop and also provided insights into her own use of social media, in particular Twitter. The following two sessions were more tool-focused, ranging from tagging and commit messages to Codebook, mashups and wikis. Three topics were chosen by the participants for the concluding panel discussion: information overflow, privacy and ethics, and the potential move of the IDE to the browser. The detailed notes from the discussion are available here.

To address information overflow, participants discussed several solutions: generating less information, generating summaries, voting, context-sensitive labeling, automated categorization, interaction mining and interruption management.

The discussion regarding privacy and ethics started with the example of the use of Twitter at conferences, in particular the question of whether it is ethical to quote other individuals, such as keynote speakers, in conference-related tweets. We moved on to discuss our ethical obligations, both moral and legal, before reading communication channels of open source projects. A general problem identified in this discussion is that we do not have good metaphors for privacy in social media; the metaphors of filing cabinets and public art projects were suggested.

Looking at projects such as Mozilla Bespin and Heroku, there seems to be a trend of moving the IDE into the browser. First of all, it should be noted that neither the use of Web 2.0 mechanisms nor the storage of data “in the cloud” implies that the IDE has to be in the browser. Nonetheless, despite challenges such as concurrency, editor speed and naming schemes, there are reasons why the IDE could move to the browser: accessibility, collaboration, data integration, the same configuration for everybody, and superficial reasons such as “the browser is the future”. To conclude the discussion, we held a vote among the workshop participants on whether IDEs should move to the browser: 10 voted yes, 4.5 voted no, and there was one maybe.

In the spirit of the workshop topic, we used Web 2.0 tools throughout the workshop to take notes collaboratively. While our Google Waves suffered from low bandwidth, Twitter with the #web2se hashtag was quite active.

Doctoral Symposium

Due to the overlap with Web2SE, I wasn’t able to attend the doctoral symposium, and only went there to present my paper on emergent knowledge structures.

Main Conference

The main conference started off with a video welcome address by Desmond Tutu, and an excellent keynote by Clem Sunter. Without notes or slides, he gave a highly entertaining lecture on scenario planning with regard to South Africa and the world. He used the metaphors of foxes and hedgehogs to demonstrate different approaches to dealing with business decisions. According to Clem, foxes embrace uncertainty and change their mind when they realize that there’s something better out there. They reach the optimal decisions through their knowledge of the system as a whole. Hedgehogs on the other hand simplify life around one idea, more or less disregarding everything else. Clem did a great job relating these ideas to current events and leaders. His website has more details.

The presentations of our paper on Awareness 2.0 and our NIER paper on tags went well. Highlights from other paper presentations included Andy Begel‘s presentation on Codebook. The room was very crowded when he talked about the survey they did at Microsoft, which revealed that engineers need better ways to find connections between each other; the Codebook framework addresses this issue. In his talk about Supporting Developers with Natural Language Queries, Michael Würsch presented a framework that can process guided-input natural language queries resembling plain English. The approach is based on an OWL ontology and the Semantic Web. The next paper in the same session addressed a similar problem, focusing on questions that require the integration of different kinds of project information. Thomas Fritz and Gail Murphy started by identifying 78 questions that developers want to ask but for which support is missing. They introduced an information fragment model along with a prototype implementation that was evaluated with positive results.

Rachel Bellamy gave a great presentation on the paper Moving into a New Software Project Landscape. They conducted a grounded theory study with 18 newcomers across 18 projects and found a wide range of interesting things, such as the three primary factors that impact the integration of newcomers: early experimentation, internalizing structures and cultures, and progress validation. Thomas Fritz‘s presentation of their distinguished paper A Degree-of-Knowledge Model to Capture Source Code Familiarity started off with a cartoon clip that did a great job of outlining the idea of the paper: if several people collaborate on writing a paper, an individual author’s degree of knowledge of the paper increases when that author edits the paper, and it decreases as soon as another author edits it. Transferring that idea to source code, they showed that the degree-of-knowledge model can provide better results than existing approaches.
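The intuition behind that idea can be sketched in a few lines. To be clear, this is a toy simplification of my own, not the actual model from the paper (which combines authorship and interaction data in a more sophisticated way), and the gain and decay values are invented:

```python
# Toy degree-of-knowledge sketch: an author's score for an artifact
# grows with their own edits and decays whenever someone else edits it.
# Illustrative only; NOT the model from the Fritz et al. paper.

def degree_of_knowledge(edits, own_gain=1.0, decay=0.5):
    """Given an ordered list of author names (one entry per edit),
    return each author's toy knowledge score for the artifact."""
    scores = {}
    for author in edits:
        # Every other author's knowledge decays on a foreign edit.
        for other in scores:
            if other != author:
                scores[other] *= decay
        # The editing author's knowledge grows.
        scores[author] = scores.get(author, 0.0) + own_gain
    return scores

history = ["alice", "alice", "bob", "alice"]
print(degree_of_knowledge(history))  # alice ends well ahead of bob
```

Even this crude version captures the key property: knowledge of an artifact is not just who wrote it, but who touched it last and how often.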

Cape Town and surroundings

Cape Town turned out to be a great place for a conference: Spectacular scenery, amazing food and wildlife not far away. I posted my best pictures here (public facebook album).

Evaluation and Usability of Programming Languages and Tools (PLATEAU) 2010

PLATEAU 2010 has been accepted at the Onward! Conference 2010 and SPLASH 2010 in Reno, Nevada, USA.

PLATEAU 2010 aims to be a first step in filling the void between the programming languages community and the human-computer interaction community by developing and stimulating discussion of the usability and evaluation of programming languages and tools with respect to language design and related areas.

The workshop has two goals:

  • Develop a research community that shares ideas and collaborates on research related to the evaluation and usability of languages and tools
  • Encourage the languages and tools communities to think more critically about how usability affects language and tool design and adoption

I’m on the program committee and I’m looking forward to reading your papers! Submissions are due on August 13, 2010.

This is the workshop website

My schedule for ICSE 2010 in Cape Town

Thanks Tom Zimmermann for inspiring this blog post!

This is what my ICSE schedule looks like:

May 1st: SUITE (attendee)
May 2nd: FlexiTools (attendee)

MSR (presenter)

May 3rd: MSR (attendee)
May 4th: Web2SE (co-organizer)

Doctoral Symposium (presenter)

May 5th: ICSE (attendee)
May 6th: ICSE (attendee)
May 7th: ICSE (presenter)

The Implications of How We Tag Software Artifacts: Exploring Different Schemata and Metadata for Tags

Social tagging has been adopted by software developers in various contexts from source code to work items and build definitions. While the success of tagging is usually attributed to the simplicity of tags, the implementation details of tagging systems vary significantly in terms of metadata, schemata and semantics. In a position paper that Peggy and I recently wrote for Web2SE, we argue that academia and industry should be aware of these differences and that we should start to examine their implications.

The idea of analyzing different dimensions of tagging systems is not new. A very detailed taxonomy is given by Marlow et al. They identify the following seven dimensions in the design of a tagging system:

  • Tagging rights: Users can tag everybody’s resources vs. users can only tag their own resources.
  • Tagging support: Blind tagging (users cannot see each other’s tags) vs. viewable tagging (users can see each other’s tags) vs. suggestive tagging (the system suggests tags to users).
  • Aggregation: Bag model (allows duplicate tags per resource) vs. set model (no duplicates).
  • Type of object: Type of the resource to be tagged.
  • Source of material: Resource is supplied by the system vs. resource is supplied by the users.
  • Resource connectivity: Linked vs. grouped vs. none (possible connections between the resources).
  • Social connectivity: Linked vs. grouped vs. none (possible connections between the users).

While these dimensions apply to tagging systems used by software developers, studying systems such as ICICLE, TagSEA, IBM’s Jazz, BITKit, Google Code, ConcernMapper and Concern Graphs reveals additional dimensions on top of Marlow’s taxonomy.

We identified the following additional dimensions:

  • Pre-defined vs. user-defined: Most current tagging systems are based on the concept of tags as “freely-chosen keywords or terms that are associated with or assigned to a piece of information”. However, in older tagging systems such as ICICLE, possible keywords were pre-defined, and software developers were not able to add new keywords to the system. In a dynamic environment such as software development, the just-in-time addition of new tags is the more promising approach.
  • Metadata: Different tagging systems store different amounts of metadata. For example, when tagging work items in IBM’s Jazz, information such as the tag author and the time a tag was applied can only be identified by browsing the work item’s history. In other systems such as TagSEA, the author and time can be explicitly added to each tag instance, and tags can be searched by author and creation time. To keep tagging simple, tag authors should not be required to add metadata; however, all metadata that can be recorded automatically should be stored to provide additional context.
  • Semantics: While most tagging systems treat keywords simply as terms that are associated with artifacts, some systems go beyond that and add semantics to tags. An interesting approach is taken by labels in the issue tracker of Google Code, which goes beyond basic labels to support key-value labels. Key-value labels contain one or more dashes, and the part before the first dash is considered to be a field name while the part after that dash is considered to be the value. Studying the use of key-value labels in Google Code is part of our ongoing work.
  • Hierarchies: Some tagging systems explicitly support tag hierarchies, using a dot-notation (e.g., TagSEA). Keywords that have dots in them can be treated as hierarchical, and they can be displayed in tree-views. In other systems such as IBM’s Jazz, some developers use the dot-notation even though there is no explicit support for hierarchies. A flexible approach that offers additional views when needed is promising.
  • Single type of resource vs. multiple types: Software developers handle many different kinds of artifacts from source code and work items to build scripts. Nevertheless, many tagging systems for software developers only support tagging a single kind of artifact. One exception is TagSEA. It allows software developers to tag locations in source code — called waypoints — and artifacts such as files, and it shows different kinds of artifacts in a single view. This allows for grouping and relating different kinds of artifacts while keeping the simplicity of tags.
  • Integration: Another dimension is the extent to which the tagging mechanism is integrated with other tooling. Some systems support social tagging of source code, but require the user to post code fragments on public servers before tags can be applied to code fragments (e.g., DZone Snippets and ByteMycode). In other systems such as IBM’s Jazz or TagSEA, the tagging mechanism is part of the IDE. With the recent trend of moving the IDE into the browser, tagging artifacts online is a promising approach.
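To make the key-value and dot-notation conventions above concrete, here is a small sketch. The split-on-the-first-dash rule is the one described for Google Code labels; the tree-building function is an illustrative grouping in the spirit of TagSEA’s tree views, not its actual implementation, and the label and tag names are invented examples:

```python
def parse_label(label):
    """Split a Google Code-style key-value label: the part before the
    first dash is the field name, the rest is the value. Labels without
    a dash are treated as plain tags."""
    if "-" in label:
        field, value = label.split("-", 1)
        return (field, value)
    return (label, None)

def tag_tree(tags):
    """Group dot-notation tags (e.g. 'ui.dialogs') into a nested dict,
    as a tree view over hierarchical tags might display them."""
    root = {}
    for tag in tags:
        node = root
        for part in tag.split("."):
            node = node.setdefault(part, {})
    return root

print(parse_label("Priority-High"))  # ('Priority', 'High')
print(parse_label("Milestone-Release1.0"))
print(tag_tree(["ui.dialogs", "ui.menus", "db"]))
```

Note that only the first dash is significant, so values like “Release1.0” (or values containing further dashes) survive intact, and tags without dots simply become top-level nodes in the tree.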

Update [June 6, 2010]: The paper is now available here (ACM Digital Library).