Which NLP library should I choose to analyze software documentation?

This blog post is based on our MSR 2017 paper.

Software developers author a wide variety of documents in natural language, ranging from commit messages and source code comments to documentation and questions or answers on Stack Overflow. To uncover interesting and actionable information from these natural language documents, many researchers rely on “out-of-the-box” natural language processing (NLP) libraries, often without justifying their choice of a particular library. In a systematic literature review, we identified 33 papers that mentioned the use of an NLP library (55% of which used Stanford’s CoreNLP), but only 2 papers offered a rudimentary justification for choosing a particular library.

Software artifacts written in natural language are different from other natural language documents: Their language is technical and often contains references to code elements that a natural language parser trained on a publication such as the Wall Street Journal will be unfamiliar with. In addition, natural language text written by software developers may not obey all grammatical rules, e.g., API documentation might feature sentences that are grammatically incomplete (e.g., “Returns the next page”) and posts on Stack Overflow might not have been authored by a native speaker.

To investigate the impact of choosing a particular NLP library and to help researchers and industry choose the appropriate library for their work, we conducted a series of experiments in which we applied four NLP libraries (Stanford’s CoreNLP, Google’s SyntaxNet, NLTK, and spaCy) to 400 paragraphs from Stack Overflow, 547 paragraphs from GitHub ReadMe files, and 1,410 paragraphs from the Java API documentation.

Comparing the output of different NLP libraries is not trivial since different overlapping parts of the analysis need to be considered. Let us use the sentence “Returns the C++ variable” as an example to illustrate these challenges. The results of different NLP libraries differ in a number of ways:

  • Tokenization: Even steps that could be considered as relatively straightforward, such as splitting a sentence into its tokens, become challenging when software artifacts are used as input. For example, Stanford’s CoreNLP tokenizes “C++” as “C”, “+”, and “+” while the other libraries treat “C++” as a single token.
  • General part-of-speech tagging (affecting the first two letters of a part-of-speech tag): Stanford’s CoreNLP mis-classifies “Returns” as a noun, while the other libraries correctly classify it as a verb. This difference can be explained by the fact that the example sentence is actually grammatically incomplete—it is missing a noun phrase such as “This method” in the beginning.
  • Specific part-of-speech tagging (affecting all letters of a part-of-speech tag): While several libraries correctly classify “Returns” as a verb, there are slight differences: Google’s SyntaxNet classifies the word as a verb in 3rd person, singular and present tense (VBZ) while spaCy simply tags it as a general verb (VB).
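
As a rough illustration of why a token such as “C++” is hard to handle, the following Python sketch contrasts a naive pattern that splits off every punctuation character with a pattern that special-cases identifiers ending in “++” or “#”. Both patterns are assumptions made for this example; they are not the tokenization rules of CoreNLP, SyntaxNet, NLTK, or spaCy.

```python
import re

# Naive pattern: runs of word characters, or any single punctuation character.
NAIVE = re.compile(r"\w+|[^\w\s]")

# Hypothetical code-aware pattern: first try an identifier followed by "++" or "#",
# then fall back to the naive rules.
CODE_AWARE = re.compile(r"[A-Za-z_]\w*(?:\+\+|#)|\w+|[^\w\s]")


def tokenize(sentence, pattern):
    """Return all non-overlapping matches of the pattern, left to right."""
    return pattern.findall(sentence)


sentence = "Returns the C++ variable"
naive_tokens = tokenize(sentence, NAIVE)
code_aware_tokens = tokenize(sentence, CODE_AWARE)

print(naive_tokens)       # "C++" is split into "C", "+", "+"
print(code_aware_tokens)  # "C++" is kept as a single token
```

A single special case like this does not generalize to other code elements (qualified names, method signatures, operators), which is part of what makes software text difficult for general-purpose tokenizers.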

The figure at the beginning of this blog post shows the agreement between the four NLP libraries for the different documentation sources. The libraries agreed on 89 to 94% of the tokens, depending on the documentation source. The general part-of-speech tag was identical for 68 to 76% of the tokens, and the specific part-of-speech tag was identical for 60 to 71% of the tokens. In other words, the libraries disagreed on about one of every three part-of-speech tags—strongly suggesting that the choice of NLP library has a large impact on any result.
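
The two levels of part-of-speech agreement described above can be sketched in a few lines of Python. The sketch assumes that all libraries produced the same tokens, so that tags can be compared position by position; the library names and tags below are made up for illustration and are not results from our study.

```python
def agreement(tags_by_library, prefix_len=None):
    """Fraction of tokens for which every library assigned the same tag.

    With prefix_len=2 only the first two letters are compared (general tag);
    with prefix_len=None the full tag is compared (specific tag).
    """
    columns = list(zip(*tags_by_library.values()))
    identical = sum(
        1 for column in columns
        if len({tag[:prefix_len] for tag in column}) == 1
    )
    return identical / len(columns)


# Hypothetical tags for the four tokens of "Returns the next page".
tags_by_library = {
    "library_a": ["VBZ", "DT", "JJ", "NN"],
    "library_b": ["VB", "DT", "JJ", "NN"],
    "library_c": ["NNS", "DT", "JJ", "NNP"],
}

general = agreement(tags_by_library, prefix_len=2)
specific = agreement(tags_by_library)
print(general, specific)
```

In this made-up example, the libraries disagree on the general tag of “Returns” only, while the specific tag additionally differs for “page” (NN versus NNP), so specific agreement is lower than general agreement.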

To investigate which of the libraries achieves the best result, we manually annotated a sample of sentences from each source (a total of 1,116 tokens) with the correct token splitting and part-of-speech tags, and compared the results of each library with this gold standard. We found that spaCy had the best performance on software artifacts from Stack Overflow and the Java API documentation while Google’s SyntaxNet worked best on text from GitHub. The best performance was reached by spaCy on text from Stack Overflow (90% of the part-of-speech tags correct) while the worst performance came from SyntaxNet when we applied it to natural language text from the Java API Documentation (75% of the part-of-speech tags correct). Detailed results and examples of disagreements between the gold standard and the four NLP libraries are available in our MSR 2017 paper.

This work raises two main issues. The first one is that many researchers apply NLP libraries to software artifacts written in natural language without justifying their choice of library. In our work, we were able to show that the output of different libraries is not identical, and that the choice of an NLP library matters when it is applied to software engineering artifacts written in natural language. In addition, in most cases, the commonly used Stanford CoreNLP library was outperformed by other libraries, and spaCy, which provided the best overall experience, was not mentioned in any recent software engineering paper that we included in our literature review.

The second issue is that the choice of the best NLP library depends on the task and the source: For all three sources, NLTK achieved the highest agreement with our manual annotation in terms of tokenization. On the other hand, if the goal is accurate part-of-speech tagging, NLTK actually yielded the worst results among the four libraries for Stack Overflow and GitHub data. In other words, the best choice of an NLP library depends on which part of the NLP pipeline is going to be employed. In addition, while spaCy outperformed its competition on Stack Overflow and Java API documentation data, Google’s SyntaxNet showed better performance for GitHub ReadMe files.

The worst results were generally observed when analyzing Java API documentation, which confirms our initial assumption that the presence of code elements makes it particularly challenging for NLP libraries to analyze software artifacts written in natural language. In comparison, the NLP libraries were less impacted by the often informal language used in GitHub ReadMe files and on Stack Overflow. Going forward, the main challenge for researchers interested in improving the performance of NLP libraries on software artifacts will be the effective treatment of code elements. We expect that the best possible results will eventually be achieved by models trained specifically on natural language artifacts produced by software developers.

Improving Access to Software Documentation — Two ICSE 2016 papers

This is a cross-post from the University of Adelaide’s CREST blog

Software development is knowledge-intensive, and the effective management and exchange of knowledge is key in every software project. While much of the information needed by software developers is captured in some form of documentation, it is often not obvious where a particular piece of information is stored. Different documentation formats, such as wikis or blogs, contain different kinds of information, written by different individuals and intended for different purposes. Navigating this documentation landscape is particularly challenging for newcomers.

In collaboration with researchers from Canada and Brazil, we are envisioning, developing and evaluating tool support around software documentation for different stakeholders. Two of these efforts will be presented at the International Conference on Software Engineering — the premier conference in software engineering — this year.

In the first project in collaboration with Martin Robillard from McGill University in Canada, we developed an approach to automatically augment API documentation with “insight sentences” from Stack Overflow — sentences that are related to a particular API type and that provide insight not contained in the API documentation of that type. The preprint of the corresponding paper is available here.



Software developers need access to different kinds of information which is often dispersed among different documentation sources, such as API documentation or Stack Overflow. We present an approach to automatically augment API documentation with “insight sentences” from Stack Overflow — sentences that are related to a particular API type and that provide insight not contained in the API documentation of that type. Based on a development set of 1,574 sentences, we compare the performance of two state-of-the-art summarization techniques as well as a pattern-based approach for insight sentence extraction. We then present SISE, a novel machine learning based approach that uses as features the sentences themselves, their formatting, their question, their answer, and their authors as well as part-of-speech tags and the similarity of a sentence to the corresponding API documentation. With SISE, we were able to achieve a precision of 0.64 and a coverage of 0.7 on the development set. In a comparative study with eight software developers, we found that SISE resulted in the highest number of sentences that were considered to add useful information not found in the API documentation. These results indicate that taking into account the meta data available on Stack Overflow as well as part-of-speech tags can significantly improve unsupervised extraction approaches when applied to Stack Overflow data.

The second project was developed in collaboration with three Brazilian researchers: Igor Steinmacher from the Federal University of Technology — Paraná, Tayana Conte from the Federal University of Amazonas, and Marco Gerosa from the University of São Paulo. We developed and evaluated FLOSScoach, a portal to support project newcomers, which we found to be effective at lowering project entry barriers. The preprint of the corresponding paper is available here and FLOSScoach is available here.



Community-based Open Source Software (OSS) projects are usually self-organized and dynamic, receiving contributions from distributed volunteers. Newcomers are important to the survival, long-term success, and continuity of these communities. However, newcomers face many barriers when making their first contribution to an OSS project, leading in many cases to dropouts. Therefore, a major challenge for OSS projects is to provide ways to support newcomers during their first contribution. In this paper, we propose and evaluate FLOSScoach, a portal created to support newcomers to OSS projects. FLOSScoach was designed based on a conceptual model of barriers created in our previous work. To evaluate the portal, we conducted a study with 65 students, relying on qualitative data from diaries, self-efficacy questionnaires, and the Technology Acceptance Model. The results indicate that FLOSScoach played an important role in guiding newcomers and in lowering barriers related to the orientation and contribution process, whereas it was not effective in lowering technical barriers. We also found that FLOSScoach is useful, easy to use, and increased newcomers’ confidence to contribute. Our results can help project maintainers on deciding the points that need more attention in order to help OSS project newcomers overcome entry barriers.

Research study: Developers want to know about unusual events and think their input/output is impossible to measure

Software developers pursue a wide range of activities as part of their work, and making sense of what they did in a given time frame is far from trivial, as evidenced by the large number of awareness and coordination tools developed in recent years. To inform tool design for making sense of the information available about a developer’s activity, my colleagues Fernando Figueira Filho, Uirá Kulesza and I sent a questionnaire to 2000 randomly selected GitHub users (156 responses) to investigate what information developers would expect in a summary of development activity, how they would measure development activity, and what factors influence how such activity can be condensed into textual summaries or numbers. The questionnaire contained questions such as

“Assume it’s Monday morning and you have just returned from a week-long vacation. One of your colleagues is giving you an update on their development activities last week. What information would you expect to be included in their summary?”


“How would you design metrics to automatically measure the input/output of a software developer in a given month? Why?”

Here are the eight most important findings (for a more detailed account, read our ESEC/FSE 2015 paper [preprint]):

1. Developers want to know about unusual events

In addition to status updates on projects, tasks and features, many developers mentioned the importance of being aware of unusual events. One developer described the ideal summary of development activity as follows:

“Work log, what functionality [has] been implemented/tested. What were the challenges. Anything out of the ordinary.”

This anything-out-of-the-ordinary theme came up many times in our study:

“We cut our developer status meetings way down, and started stand up meetings focusing on problems and new findings rather than dead boring status. Only important point is when something is not on track, going faster than expected and why.”

When we asked about what unusual events they wanted to be kept aware of, the developers described several examples:

“If a developer hasn’t committed anything in a while, his first commit after a long silence could be particularly interesting, for example, because it took him a long time to fix a bug. Also, important commits might have unusual commit messages, for example including smileys, lots of exclamation marks or something like that. Basically something indicating that the developer was emotional about that particular commit.”

Another developer added:

“Changes to files that haven’t been changed in a long time or changes to a large number of files, a large number of deletions.”

Based on this feedback, we have started working on tool support for detecting unusual events in software projects. A first prototype for the detection of unusual commits is available online [demo, source code, paper]. We are in the process of expanding this work to detect unusual events related to issues, pull requests, and other artifacts.

2. Developers with more experience see less value in using method names or code comments in summaries of development activity

In the questionnaire, we asked about a few sources that could potentially be used to generate a summary automatically. The titles of issues (opened and closed) received the highest rating, while method names and code comments received the lowest. When we divided developers based on their experience, the more experienced ones (six or more years) ranked method names and code comments as significantly less important compared to less experienced developers. We hypothesize that these differences can be explained by the diversity of activities performed by more experienced developers. While junior developers might only work on well-defined tasks involving few artifacts, the diversity of the work carried out by senior developers makes it more difficult to summarize their work by simply considering method names, code comments, or issue titles.

3. C developers see more value in using code comments in summaries of development activity compared to other developers

Another statistically significant difference occurred when we divided developers based on the programming languages they use on GitHub. Developers using C rated the importance of code comments in summaries significantly higher than developers who do not use C. We hypothesize that this might be related to the projects developers undertake in different languages. C might be used for more complex tasks, which require more meaningful code comments. No other programming language resulted in statistically significant differences.

4. Many developers believe that their input/output (i.e., productivity) is impossible to measure

When we asked developers to design a metric for their input/output, many of them told us that it’s impossible to measure:

“It’s difficult to measure output. Simple quantitative measures like lines of code don’t convey the difficulty of a code task. Changing the architecture or doing a conceptual refactoring may have significant impact but very little evidence on the code base.”

While some developers mentioned potential metrics such as LOC, the overall consensus was that no metric is good enough:

“Anything objective, like lines of code written, hours logged, tags completed, bugs squashed, none of them can be judged outside of the context of the work being done and deciphering the appropriate context is something that automated systems are, not surprisingly, not very good at.”

One of the main reasons for not measuring developer input/output is that metrics can be gamed:

“Automatic is pretty challenging here, as developers are the most capable people on earth to game any system you create.”

And many metrics do not reflect quality either:

“A poor quality developer may be able to close more tickets than anyone else but a high quality developer often closes fewer tickets but of those few, almost none get reopened or result in regressions. For these reasons, metrics should seek to track quality as much as they track quantity.”

5. Developers with more experience see less value in measuring input/output with LOC, bugs fixed, and complexity

We asked about several potential measures in the questionnaire, including lines of code (LOC), number of bugs fixed, and complexity. Developers with at least six years of experience rated all of these measures as significantly less suitable for measuring input/output compared to developers with up to five years of experience.

6. Web developers see more value in measuring the number of bugs introduced compared to other developers

Developers who use JavaScript and CSS found the metric of “few bugs introduced” significantly more suitable compared to developers who do not use those languages. We hypothesize that it is particularly difficult to recover from bugs in web development.

7. C developers see LOC and complexity as more suitable measures for development activity compared to other developers

On the other hand, the measures of LOC and complexity were seen as significantly more suitable by developers using C compared to those who don’t use C (on GitHub, at least). We hypothesize that this difference is due to complex programs often being written in C.

8. Developers think textual summaries of development activity could be useful, possibly augmented with numbers

Developers who talked about the difficulty of measuring development activity generally felt positive about the idea of summarizing development activity:

“It’s dangerous to measure some number & have rankings. Because that can be easily gamed. I think having summaries of what everyone did is helpful. But ranking it & assessing it is very difficult/could encourage bad habits. I think it’s better to provide the information & leave it up to the reader to interpret the level of output.”

Numbers might be used to complement text, but not the other way around:

“I think that’s probably the better approach: text first, and maybe add numbers. […] I spend about 45 minutes every Friday reviewing git diffs, just to have a clearer picture in my mind of what happened over the week. […] The automatic summary would make it harder to miss something, and easier to digest.”

Next steps & follow-up survey

In addition to testing the various hypotheses mentioned above, we are now in the process of designing and building the tool support that the developers in our study envisioned: A development activity summarizer that reflects usual and unusual events, supported by numbers that are intended to augment the summaries instead of pitting developers against each other. Please leave a comment below if you’re interested in this work, and consider filling out our follow-up survey on summarizing GitHub data.

TaskNav: Extracting Development Tasks to Navigate Software Documentation

While much of the knowledge needed to develop software is captured in some form of documentation, there is often a gap between the information needs of software developers and the structure of this documentation. Any kind of hierarchical structure with sections and subsections can only enable effective navigation if the section headers are adequate cues for the information needs of developers.

To help developers navigate documentation, during my PostDoc with Martin Robillard at McGill University, we developed a technique for automatically extracting task descriptions from software documentation. Our tool, called TaskNav, suggests these task descriptions in an auto-complete search interface for software documentation along with concepts, code elements, and section headers.

We use natural language processing (NLP) techniques to detect every passage in a documentation corpus that describes how to accomplish some task. The core of the task extraction process is the use of grammatical dependencies identified by the Stanford NLP parser to detect every instance of a programming action described in a documentation corpus. Different dependencies are used to account for different grammatical structures (e.g., “returning an iterator”, “return iterator”, “iterator returned”, and “iterator is returned”):

In the easiest case, a task is indicated by a direct object relationship, as in the example shown above. TaskNav uses this information to extract two task descriptions from the example sentence: “generate receipt” and “generate other information”.

When passive voice is used, the passive nominal subject dependency connects the action and the object. In this case, TaskNav finds the task “set thumbnail size in templates”.

Some actions do not have a direct object. In those cases, TaskNav follows the preposition dependency and would extract the task “integrate with Google Checkout” from the example sentence above.
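
The three cases above can be sketched in Python as follows. This is a simplified illustration and not TaskNav’s actual implementation: it assumes that a parser has already produced dependency triples, that the relation names follow Stanford’s collapsed typed dependencies (dobj, nsubjpass, prep_with), and that verbs and noun phrases have already been lemmatized and joined. The function name extract_tasks and the input triples are hypothetical.

```python
from typing import List, Tuple

# (relation, governor, dependent), e.g. ("dobj", "generate", "receipt")
Dependency = Tuple[str, str, str]

OBJECT_RELATIONS = ("dobj", "nsubjpass")


def extract_tasks(dependencies: List[Dependency]) -> List[str]:
    """Turn dependency triples into short task descriptions."""
    # Verbs that already have an object via a direct object or passive subject.
    verbs_with_object = {
        governor for relation, governor, _ in dependencies
        if relation in OBJECT_RELATIONS
    }
    tasks = []
    for relation, governor, dependent in dependencies:
        if relation in OBJECT_RELATIONS:
            # Active ("generate receipt") or passive ("size is set") voice.
            tasks.append(f"{governor} {dependent}")
        elif relation.startswith("prep_") and governor not in verbs_with_object:
            # No object available: follow the preposition instead.
            preposition = relation[len("prep_"):]
            tasks.append(f"{governor} {preposition} {dependent}")
    return tasks


examples = [
    ("dobj", "generate", "receipt"),
    ("nsubjpass", "set", "thumbnail size"),
    ("prep_with", "integrate", "Google Checkout"),
]
tasks = extract_tasks(examples)
print(tasks)
```

The sketch deliberately omits steps that a complete extractor would need, such as attaching prepositional phrases to verbs that already have an object (as in “set thumbnail size in templates”).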

Once the TaskNav user runs a search query after selecting the search terms from auto-complete, search results are presented in a sidebar. When the user selects a result, the corresponding document is opened in TaskNav. The paragraph that matched the query is highlighted, and the document is automatically scrolled to that paragraph.

We conducted a field study in which six professional developers used TaskNav for two weeks as part of their ongoing work. We found search results identified through extracted tasks to be more helpful to developers than those found through concepts, code elements, and section headers.

TaskNav can automatically analyze and index any documentation corpus based on a starting URL and some configuration parameters, such as which HTML tags should be ignored. Documentation users can benefit from TaskNav by taking advantage of the task-based navigation offered by the auto-complete search interface. For documentation writers, TaskNav provides analytics that show how documentation is used (e.g., top queries, most frequently read documents, and unsuccessful searches). Researchers can benefit from the data accumulated by TaskNav’s logging mechanism as it provides detailed data on how software developers search and use software documentation.

All the details of our work on TaskNav are now available as a journal paper in IEEE Transactions on Software Engineering [link] [preprint], and TaskNav will also appear as a Demo at ICSE 2015.

Try TaskNav now!

WorkItemExplorer: Visualizing Software Development Tasks Using an Interactive Exploration Environment

In recent years, the focus of tool support for software developers has shifted from source code alone towards tools that incorporate the entire software development process. On top of source code editing and compilation, many development platforms, such as IBM’s Jazz or Microsoft’s Visual Studio, now offer explicit support for the management of development tasks.

These tasks have become important cogs in collaborative software development processes, and in a typical software project, developers as well as managers need to maintain an awareness of an abundance of tasks along with their properties and relationships. Current tools that aim at providing an understanding of the state of a task management system (e.g., developer dashboards) have several shortcomings, such as limited interactivity and insufficient visualizations. To better support developers and managers in their understanding of all aspects of their software development tasks, Patrick, Lars, Peggy, and I have developed WorkItemExplorer, an interactive visualization environment for the dynamic exploration of data gathered from a task management system.

WorkItemExplorer leverages multiple coordinated views by allowing users to have multiple different views, such as bar charts or time lines, open at the same time, all displaying the same data in different ways. The coordination comes into play when interacting with the views; highlighting one data element will have a mirrored effect on all other views. This enables the exploration of data relationships as well as the discovery of trends that might otherwise be difficult to see because they span multiple aspects. WorkItemExplorer is a web-based tool built on top of the Choosel framework. We have implemented an adapter for queries against the work item component of IBM’s Jazz platform, and we are working on integrating other task management systems.

WorkItemExplorer currently supports seven data elements and the relationships between them: work items, developers, iterations, project areas, team areas, tags, and comments. Using a drag-and-drop interface, these data elements can be moved into seven different views:

  • A text view with different grouping options (e.g., to see a list of work items grouped by their owner).
  • A tag cloud, primarily for the exploration of work item tags.
  • A graph for the exploration of relationships between different kinds of artifacts, such as work items and iterations.
  • A bar chart to visualize data with different groupings using bars of different lengths (e.g., to visualize developers by the number of work items they own).
  • A pie chart to visualize data with different groupings using pie wedges of different sizes (e.g., to show work items by priority).
  • A heat bars view to visualize work items over time, with an additional grouping option (e.g., to visualize the creation of different work item types, such as defects and enhancements, over time).
  • A time line to analyse data over time. Different time properties, such as creation or modification date, can be chosen (e.g., to visualize team area creation over time).

The video above shows an example use case for WorkItemExplorer by exploring who is working on “important” work items at the moment. The task “importance” can be defined in many different ways, and an exploratory tool, such as WorkItemExplorer, can be used to understand the implications of different approaches. To explore important work items and their owners, we open up two bar charts, and then drag all of the work items onto both of them. We then group one bar chart by priority, and the other one by severity, and we drag the bars that we are interested in into a third view, such as a text view. Here, we drag the bars corresponding to high priority and major as well as critical severity. If we now group the text view by work item owner, we get a list of all people working on important work items, and we can continue to explore their workload in more detail. In addition, this configuration allows us to immediately explore the relationship between severity and priority of work items in our data set. When mousing over major severity in the bar chart on the right, through partial highlighting, we can see how work items with major severity are distributed across the different priorities in the bar chart on the left.
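
Outside of the tool, the same exploration corresponds to a filter followed by a grouping. The Python sketch below uses made-up work items and assumes that dragging several bars into one view shows the union of the work items behind those bars; neither the data nor the function names come from WorkItemExplorer.

```python
from collections import defaultdict

# Hypothetical work items; a real data set would come from the task management system.
work_items = [
    {"id": 1, "owner": "alice", "priority": "high", "severity": "normal"},
    {"id": 2, "owner": "bob", "priority": "low", "severity": "critical"},
    {"id": 3, "owner": "alice", "priority": "medium", "severity": "major"},
    {"id": 4, "owner": "carol", "priority": "low", "severity": "minor"},
]


def is_important(item):
    """One possible definition of importance: high priority, or major/critical severity."""
    return item["priority"] == "high" or item["severity"] in {"major", "critical"}


def group_by_owner(items):
    """Map each owner to the ids of the work items they own."""
    groups = defaultdict(list)
    for item in items:
        groups[item["owner"]].append(item["id"])
    return dict(groups)


important_by_owner = group_by_owner(i for i in work_items if is_important(i))
print(important_by_owner)
```

Changing is_important is the scripted counterpart of dragging a different set of bars into the third view, which is how different definitions of “importance” can be compared.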

Our preliminary evaluation of WorkItemExplorer will be published at ICSE 2012 in Zurich, Switzerland (paper pre-print). We found that

  • WorkItemExplorer can answer questions that developers ask about task management systems,
  • WorkItemExplorer enables the acquisition of new insights through the free exploration of data, and
  • WorkItemExplorer offers a flexible environment in which different individuals solve the same task in different ways.

(© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.)

Programming in a Socially Networked World: the Evolution of the Social Programmer

Since I first blogged about Stack Overflow in February 2011, the number of questions on the Q&A portal has more than doubled (from 1 million to almost 2.5 million), as has the number of answers (from 2.5 million to 5.2 million). According to a recent study by Lena Mamykina and colleagues, over 92% of the questions on Stack Overflow are answered — in a median time of a staggering 11 minutes.

The virtually real-time access to a community of other programmers willing and eager to help is an almost irresistible resource, as shown by the 12 million visitors and 131 million page views in December 2011 alone. Also, as we found in a recent study for Web2SE 2011, Stack Overflow can reach high levels of coverage for a given topic. For example, we analyzed the Google search results for one particular API, jQuery, and found at least one Stack Overflow question on the first page of the search results for 84% of the API’s methods.

The access to such a vast repository of knowledge that is just a web search away raises several research questions:

  • Will developers who focus on reusing content from the web have sufficient understanding of the inner workings of the software they produce?
  • Are web resources going to cover all important aspects of a topic?
  • What meta-data is needed to facilitate technical information-seeking?
  • How can we address security and copyright concerns that come with using other developers’ code?

In a recent position paper, Fernando, Brendan, Peggy and I discuss the past, present, and future of software developers that have access to an unprecedented amount and diversity of resources on the web. The paper is available as a pre-print, and will be presented at the Future of Collaborative Software Development workshop co-located with CSCW 2012 in Seattle in February.

This is the abstract of the paper:

Social media has changed how software developers collaborate, how they coordinate their work, and where they find information. Social media sites, such as the Question and Answer (Q&A) portal Stack Overflow, fill archives with millions of entries that contribute to what we know about software development, covering a wide range of topics. For today’s software developers, reusable code snippets, introductory usage examples, and pertinent libraries are often just a web search away. In this position paper, we discuss the opportunities and challenges for software developers that rely on web content curated by the crowd, and we envision the future of an industry where individual developers benefit from and contribute to a body of knowledge maintained by the crowd using social media.

On using grounded theory in software engineering research

In this blog post, I reflect on my experiences from conducting a grounded theory study as a software engineering researcher in summer 2010. In the study, Peggy and I examined the role of a community portal, such as IBM’s Jazz or Microsoft’s MSDN, in the process of communicating software development knowledge. We just presented the results of the study at ESEC/FSE in September 2011 (paper pre-print). This is far from the first blog post on experiences using grounded theory. To read about other researchers’ experiences, you might want to take a look at L. Lennie Irvin’s collection of blog posts on grounded theory or the 2008 CASCON paper by Steve Adolph from UBC.

The Corbin / Strauss approach

Grounded theory is a systematic methodology to generate theory from data. The methodology originates from the Social Sciences and aims at studying social phenomena. There are different stances on how grounded theory should be carried out, most notably the positivist approach described by Anselm Strauss, and the more interpretative view that is for example described by Kathy Charmaz.

In our study, we followed the grounded theory approach as described by Juliet Corbin and Anselm Strauss in the Qualitative Sociology journal. They specify eleven procedures and canons that grounded theory researchers as well as the readers and evaluators of grounded theory studies should be familiar with:

  1. Data collection and analysis are interrelated processes. When grounded theory is used, data analysis begins as soon as the first bit of data is collected.
  2. Concepts are the basic units of analysis. Incidents from various data sources (in our case: interview transcripts, documentation artifacts, and ethnographic field notes) are given “conceptual labels”. The focus is on concepts that “earn their way into the theory by being present repeatedly”.
  3. Categories must be developed and related. Categories are more abstract than labels and can explain relationships between concepts. A category must be developed in terms of its properties, dimensions, conditions and consequences.
  4. Sampling in grounded theory proceeds on theoretical grounds. Sampling in grounded theory focuses on “incidents, events and happenings” (in our case: all incidents that were related to the creation or use of artifacts posted on a community portal).
  5. Analysis makes use of constant comparisons. When a new incident is noted, it has to be compared against other incidents for similarities and differences.
  6. Patterns and variations must be accounted for. Data must be examined for regularity as well as for irregularities.
  7. Process must be built into the theory. Grounded theory is about understanding processes.
  8. Writing theoretical memos is an integral part of doing grounded theory. To make sure that no concepts or categories are forgotten, memos have to be written throughout the course of the study.
  9. Hypotheses about relationships among categories should be developed and verified as much as possible during the research process. Hypotheses are constantly revised until they hold true for all of the evidence gathered in the study.
  10. A grounded theorist need not work alone. Concepts, categories and their relationships must be tested with other researchers.
  11. Broader structural conditions must be analyzed, however microscopic the research. A grounded theory study should specify how the microscopic perspective links with broader conditions (in our case: how does the particular community portal in our study compare to other portals?).

In grounded theory, coding is the fundamental process that researchers use to make sense of their data. Coding is done in three steps:

  • Open: Data is annotated line by line (see picture above for an example from our study) and concepts are created when they are present repeatedly. Open coding is applied to all data collected (in our case: interview transcripts, documentation artifacts, and ethnographic field notes). Based on the concepts, more abstract categories are developed and related. Each category has properties, dimensions, conditions, and consequences.
  • Axial: Data is put together in new ways by making explicit connections between categories and sub-categories.
  • Selective: The core category is identified and systematically related to other categories.
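To make the three steps more concrete, here is a minimal sketch of how open-coding annotations could be recorded so that every concept stays traceable to the data it came from. This is not the tooling we used in the study (we ended up with pen and paper); the names `Annotation` and `concepts_from`, the source names, and the sample labels are all made up for illustration. The threshold of two occurrences reflects the rule that concepts must be present repeatedly.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One open-coding annotation: a conceptual label attached to a line of data."""
    source: str  # hypothetical source name, e.g. a transcript or field note
    line: int
    label: str


def concepts_from(annotations, min_occurrences=2):
    """Keep only labels that occur repeatedly, with references back to the data."""
    references = defaultdict(list)
    for annotation in annotations:
        references[annotation.label].append((annotation.source, annotation.line))
    return {label: refs for label, refs in references.items()
            if len(refs) >= min_occurrences}


# Open coding: illustrative annotations, not data from the study.
annotations = [
    Annotation("interview-D1", 12, "wiki is stale"),
    Annotation("interview-M1", 40, "wiki is stale"),
    Annotation("field-notes", 7, "blog creates buzz"),
]
concepts = concepts_from(annotations)

# Axial coding: the researcher, not the program, relates concepts to categories.
categories = {"time sensitivity": ["wiki is stale"]}

# Selective coding: one category is chosen as the core category.
core_category = "time sensitivity"
```

The program only does the bookkeeping. Deciding which labels to assign, how categories relate, and which category is the core one remains the researcher's interpretive work.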

Making grounded theory explicit

For qualitative researchers, many of the guidelines described by Corbin and Strauss are nothing new, and in fact, we found that we had implicitly followed several of them already in previous studies. For example, when conducting interviews, researchers tend to revise their questions in later interviews based on the answers given in the first interviews and data collection is rarely completely separate from data analysis. However, there was a lot of benefit in making this process explicit:

  • We didn’t have to plan out every detail of our study beforehand. This is often a challenge in exploratory field research where researchers are not aware of all peculiarities of the setting they are about to conduct a study in. When using grounded theory, it is “officially” part of the research methodology that questions are refined over time, that not all interviewees are pre-determined, and that the resulting theme is unknown beforehand.
  • Similarly, we were able to change direction during the study when we found interesting themes to follow up on. Again, this is something that frequently happens in qualitative research, but grounded theory makes it explicit.
  • Grounded theory focuses on concepts that become part of the theory because they are present in the data more than once. This makes it easier for researchers to focus on themes that are relevant in the study context rather than themes that only matter to the researcher.
  • Especially during open coding, the use of grounded theory helps researchers set aside pre-conceptions of how and why certain incidents occur. Going through interview transcripts or ethnographic field notes on a line-by-line basis forces researchers to think about every aspect of the data collected.
  • Grounded theory also allows researchers to consider everything they encounter during a study, such as anecdotes or water-cooler conversations. This is not possible with a pre-defined set of interviewees or data sources.

The emergence of the core category

Going into the grounded theory study, I was concerned that after all the open and axial coding, there would be no “core category” that emerged from the data, and in fact, it seems a bit like magic the way that it is conventionally described: “Sufficient coding will eventually lead to a clear perception of which category or conceptual label integrates the entire analysis.”

At least from our experience, I can say that a core category did emerge quite clearly by the end of the selective coding. One of the challenges is to abstract the core category to the right level. For example, in our case, we found several interesting differences between artifacts on a community portal such as blog posts, wiki pages, and technical articles. While no single one of these differences stood out, we identified the fact that artifacts differ along several dimensions as the core category.

The role of research questions

We found the role of research questions tricky when using grounded theory as a methodology. As Corbin and Strauss describe it, “each investigator enters the field with some questions or areas for observation, or will soon generate them. Data will be collected on these matters throughout the research endeavor, unless the questions prove, during analysis, to be irrelevant.”

Researchers have questions going into a study, but these questions are refined, changed, and altered throughout the study. This presents a challenge when reporting the research questions for a study. To be thorough, one would have to report the initial questions along with their iterations over the course of the study. As research papers aim at the dissemination of research results rather than a discussion of the research process itself, we found it more useful to report the final set of questions.

Lack of tool support

Coding of ethnographic field notes, interview transcripts and software artifacts is tedious. Several researchers have developed tools to help with that process, in particular by offering traceability between data and codes. Examples of such tools include Saturate, Qualyzer, Atlas, MaxQDA and WeftQDA.

Unfortunately, I found that with all these tools, attaching codes to data and relating codes to each other is hard to do on a computer. After trying several tools (after all, as a Computer Science student I’d like to believe that computers can solve complex editing and annotation tasks), I gave up, printed all the data in font size 8, and went back to using pen and paper. While the traceability is only achieved by following hand-written annotations, it felt a lot more natural to annotate data “by hand”. We need a metaphor better than a list of file names to support our cognition when several sheets of paper are involved.

Reporting a grounded theory study

It is challenging to write a paper describing a qualitative study, even when there is no grounded theory involved. Reporting the qualitative coding in sufficient detail so that other researchers can replicate the work would require giving all the instances of a code being applied to an artifact in a 10-page paper. In approaches such as grounded theory, the problem gets worse as codes would have to be considered at different levels of detail (i.e., open coding, axial coding, selective coding). Instead of including all these details in their papers, some researchers choose to host the details online. That is not possible in all research settings though. For example, researchers who have access to proprietary data are usually not allowed to make their data available online.

To provide at least some traceability to readers and reviewers, we assigned a unique identifier to each of our interviewees, and we encoded the interviewee’s role in the identifier to add context without revealing confidential information (e.g., M1 for the first manager we interviewed, and D1 for the first developer). When quoting individuals in our paper, we referred to the interviewees using these identifiers. The right number of quotes in a qualitative research paper is a question of style. Some researchers prefer many exemplary quotes to make the research more concrete; others prefer generalizations and therefore discourage the use of concrete quotes. We found it easier to tell the story in a paper using quotes. However, it is important to understand that these quotes are only meant to represent a much larger body of qualitative data.
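As a small illustration of the identifier scheme, the following sketch numbers interviewees per role in the order in which they were interviewed. The function name `assign_identifiers` is hypothetical, and the sketch assumes that each role starts with a different letter, which holds for “manager” and “developer” but would need a lookup table for larger sets of roles.

```python
from collections import defaultdict


def assign_identifiers(roles):
    """Return identifiers such as M1 or D2 for roles given in interview order."""
    counters = defaultdict(int)
    identifiers = []
    for role in roles:
        prefix = role[0].upper()  # assumption: first letters of roles are unique
        counters[prefix] += 1
        identifiers.append(f"{prefix}{counters[prefix]}")
    return identifiers


print(assign_identifiers(["manager", "developer", "developer", "manager"]))
# prints ['M1', 'D1', 'D2', 'M2']
```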

In summary

Grounded theory is a great methodology to understand the “how” and “why” of a research problem. Making the coding process explicit and going through data on a line-by-line basis allows for new insights, and also helps ensure that no important themes are overlooked. While the coding and the reporting of results can be tedious, grounded theory should be in the toolbox of every researcher who tries to understand processes in software development and beyond.

PS – Thanks to Fernando Figueira Filho for proof-reading a draft version of this post!

An Exploratory Study of Software Reverse Engineering in a Security Context

Software reverse engineering—the process of analysing a system to identify its components and to create representations of the system in other forms or at higher levels of abstraction—is a challenging task. It becomes even more challenging in security contexts such as the detection of malware or the decryption of encrypted file systems. In such settings, web resources are often unavailable because work has to be performed offline, files can rarely be shared in order to avoid infecting co-workers with malware or because information is classified, time pressure is immense, and tool support is limited.

To gain insights into the work done by security reverse engineers, Peggy, Fernando Figueira Filho, Martin Salois from DRDC Valcartier and I conducted an exploratory study aimed at understanding their processes, tools, artifacts, challenges, and needs. The results of this study will be presented at WCRE 2011 in Limerick, Ireland, in October.

We identified five processes that are part of reverse engineering in a security context:

  • analyzing assembly code,
  • documenting findings through different kinds of artifacts,
  • transferring knowledge to other reverse engineers,
  • articulating work, and
  • reporting of findings to stakeholders.

There is no general process that can capture all of the work done by security reverse engineers. Task complexity, security context, time pressure, and tool constraints make it impossible to follow a structured heavyweight process. Therefore, process and tool support has to be lightweight and flexible.

In our future work, we hope to address the challenges with improved tools and processes, and to study their usefulness in the unique work environment of security reverse engineers.

A pre-print of the paper is available here
(© 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.)

This is the abstract of the paper:

Illegal cyberspace activities are increasing rapidly and many software engineers are using reverse engineering methods to respond to attacks. The security-sensitive nature of these tasks, such as the understanding of malware or the decryption of encrypted content, brings unique challenges to reverse engineering: work has to be done offline, files can rarely be shared, time pressure is immense, and there is a lack of tool and process support for capturing and sharing the knowledge obtained while trying to understand plain assembly code. To help us gain an understanding of this reverse engineering work, we report on an exploratory study done in a security context at a research and development government organization to explore their work processes, tools, and artifacts. In this paper, we identify challenges, such as the management and navigation of a myriad of artifacts, and we conclude by offering suggestions for tool and process improvements.

Effective Communication of Software Development Knowledge Through Community Portals

Effective management and exchange of knowledge is key in every software organization. Although various forms of documentation exist, software projects encounter many knowledge management challenges: How should knowledge be distributed? How should knowledge be kept up to date? How should feedback be solicited? How should knowledge be organized for easy access?

There is no roadmap for what kind of information is best presented in a given artifact and new forms of documentation, such as wikis and blogs, have evolved. Unlike more formal mechanisms, wikis and blogs are easy to create and maintain. However, wikis and blogs do not offer the same authoritativeness that comes with traditional documentation, and they can become outdated and less concise over time. While the informality of a wiki page or blog is sometimes enough, users often expect reviewed technical articles.

One mechanism that brings various communication channels together is the use of web or community portals. Web portals are not just used in software communities, but are essential to companies such as eBay or TripAdvisor, where they enable the development of communities around products. Similarly, many of today’s software projects wish to solicit feedback and input from a broader community of users, beta-testers, and stakeholders. Examples of community portals in software organizations include Microsoft’s MSDN or IBM’s Jazz.

In a paper that Peggy and I will present at ESEC/FSE 2011 in Szeged, Hungary, we report on an empirical study of the community portal for IBM’s Jazz: Using grounded theory, we developed a model that characterizes documentation artifacts in a community portal along the following eight dimensions:

  • Content: the type of content typically presented in the artifact.
  • Audience: the audience for which the artifact is intended.
  • Trigger: the motivation that triggers the creation of a new artifact.
  • Collaboration: the extent of collaboration during the creation of a new artifact.
  • Review: the extent to which new artifacts are reviewed before publication.
  • Feedback: the extent to which readers can give feedback.
  • Fanfare: the amount of fanfare with which a new artifact is released.
  • Time Sensitivity: the time sensitivity of information in the artifact.

In our study, we focused on four kinds of artifacts: the official product documentation accessible through the web and the product help menus, technical articles available in the library, the Jazz team blog, and the team wiki.

These are some of our main findings:

  • Content in wiki pages is often stale. Therefore, readers will not look at the wiki for reliable information, but rather use it as a backup option if information is not available elsewhere. To communicate important information to the community, articles and blog posts are better suited.
  • The official product documentation is reviewed rigorously. With that in mind, it can serve as the most reliable way to communicate knowledge in a community portal.
  • When content is produced by a separate documentation team, updates may not be feasible. In such a case, information may become outdated quickly and can only be fixed with a new product release.
  • New blog posts created more “buzz” or fanfare than articles or wiki pages. Thus, if there is a need to make a project’s community aware of something, a blog post may be the best-suited medium.
  • Writing can be time-consuming, in particular for technical articles and blog posts. In addition, those media forms may need to undergo a review process. To get content out quickly, the wiki may be the best solution. However, readers may only find wiki pages if they are pointed to them explicitly.
  • To solicit feedback from readers, articles and blog posts typically offer more comment functionality than the official product documentation or the wiki.

One of the goals of our research is to provide advice to managers and developers on how to effectively use a community portal. Based on the findings from our study, we make the following recommendations:

  • Make content available, but clearly distinguish different media forms.
  • Move content into more formal media formats where appropriate.
  • Be aware of the implications of different media artifacts and channels.
  • Offer developers a medium with a low entry barrier for quick externalization of knowledge.
  • Involve the community in a project’s documentation.
  • Provide readers with an option to give feedback.

A pre-print of the paper is available here
(© ACM, 2011. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version will be published in the Proceedings of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE) 2011.)

This is the abstract of the paper:

Knowledge management plays an important role in many software organizations. Knowledge can be captured and distributed using a variety of media, including traditional help files and manuals, videos, technical articles, wikis, and blogs. In recent years, web-based community portals have emerged as an important mechanism for combining various communication channels. However, there is little advice on how they can be effectively deployed in a software project.

In this paper, we present a first study of a community portal used by a closed source software project. Using grounded theory, we develop a model that characterizes documentation artifacts along several dimensions, such as content type, intended audience, feedback options, and review mechanisms. Our findings lead to actionable advice for industry by articulating the benefits and possible shortcomings of the various communication channels in a knowledge-sharing portal. We conclude by suggesting future research on the increasing adoption of community portals in software engineering projects.

What we know about Web 2.0 in Software Engineering — Part 2: Tags, feeds, and social networks

After focusing on wikis, blogs and microblogs in the last post, in this entry I’ll summarize the results of my (non-systematic) literature review on what we know about the use of tags, feeds, and social networks by software developers.


Tags

Because the term is used inconsistently, we have defined it in our recent TSE paper [»]: A tag is a freely-chosen keyword or term that is associated with or assigned to a piece of information. In the context of software development, tags are used to annotate resources such as source files or test cases in order to support the process of finding these resources. Multiple tags can be assigned to one resource.
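The definition maps onto a simple many-to-many index between tags and resources. The sketch below is only an illustration of that definition, not the implementation of any tool discussed in this post; the class name `TagIndex`, the file names, and the choice to treat tags as case-insensitive are assumptions.

```python
from collections import defaultdict


class TagIndex:
    """Many-to-many mapping between freely chosen tags and resources."""

    def __init__(self):
        self._resources_by_tag = defaultdict(set)
        self._tags_by_resource = defaultdict(set)

    def tag(self, resource, *tags):
        """Assign one or more tags to a resource."""
        for name in tags:
            key = name.strip().lower()  # assumption: tags are case-insensitive
            self._resources_by_tag[key].add(resource)
            self._tags_by_resource[resource].add(key)

    def find(self, name):
        """Return all resources that carry the given tag, sorted by name."""
        return sorted(self._resources_by_tag.get(name.strip().lower(), set()))

    def tags_of(self, resource):
        """Return all tags assigned to a resource, sorted alphabetically."""
        return sorted(self._tags_by_resource.get(resource, set()))


index = TagIndex()
index.tag("Pager.java", "pagination", "needs-review")
index.tag("PagerTest.java", "pagination")
print(index.find("pagination"))       # prints ['Pager.java', 'PagerTest.java']
print(index.tags_of("Pager.java"))    # prints ['needs-review', 'pagination']
```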

The concept of annotating resources is not new to software development. At IWPC 2004, Ahmed Hassan and Richard Holt presented an approach that recovers information from source control systems and attaches this information to the static dependency graph of a software system [»]. They refer to this attached information as source sticky notes and show that the sticky notes can help developers understand the architecture of large software systems. A complementary approach that employs annotations edited by humans rather than automatically generated annotations was presented by Andrea Brühlmann and colleagues at Models 2008 [»]. They propose a generic approach to capture informal human knowledge in the form of annotations during the reverse engineering process. Annotation types in their tool Metanool can be iteratively defined, refined and transformed, without requiring a fixed meta-model to be defined in advance. This strength of annotations, the ability to refine them iteratively, is also employed by the tool BITKit presented by Harold Ossher and colleagues [»]. In BITKit, tags are used to identify and organize concerns during pre-requirements analysis. The resulting tag structures can then be hardened into classifications to capture important concerns.

There is a considerable body of work on the use of annotations in source code. The use of task annotations in Java source code such as TODO, FIXME or HACK was studied by Margaret-Anne Storey and colleagues [»]. They describe how task management is negotiated between the more formal issue tracking systems and the informal annotations that developers write within their source code. They found that task annotations in source code have different meanings and are dependent on individual, team and community use. Based on this research, the tool TagSEA was developed [»]. TagSEA is a framework for tagging locations of interest within the Eclipse IDE, and it adds semantic information to annotations. The tool combines the concept of tagging and geographic navigation to make it easier to find information. TagSEA was evaluated in two longitudinal empirical studies that indicated that TagSEA was used to support reminding and refinding [»]. TagSEA was extended to include a Tours feature that allows programmers to give live technical presentations that combine static slides with dynamic content based on TagSEA annotations in the IDE [»].
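For readers who have not worked with task annotations, the following sketch shows how TODO, FIXME and HACK comments could be collected from a Java file. It is a rough line-based approximation and not the method used in the studies above or in TagSEA: the function name `find_task_annotations` and the sample class are made up, a keyword inside a string literal that happens to contain `//` would be reported by mistake, and a trailing `*/` is not stripped from the message.

```python
import re

# A keyword counts only if it follows a comment opener on the same line:
# "//", "/*", or the leading "*" of a block-comment continuation line.
TASK_ANNOTATION = re.compile(
    r"(?://|/\*|^\s*\*)\s*.*?\b(TODO|FIXME|HACK)\b:?\s*(.*)"
)


def find_task_annotations(java_source):
    """Return (line number, keyword, message) for each task annotation found."""
    found = []
    for number, line in enumerate(java_source.splitlines(), start=1):
        match = TASK_ANNOTATION.search(line)
        if match:
            found.append((number, match.group(1), match.group(2).strip()))
    return found


sample = """\
public class Pager {
    // TODO: handle the last page
    int next() {
        return page + 1; // FIXME off by one when page is negative
    }
    String todo = "TODO";
}
"""

for entry in find_task_annotations(sample):
    print(entry)
# prints (2, 'TODO', 'handle the last page')
# prints (4, 'FIXME', 'off by one when page is negative')
```

A tool meant for real use would rely on a Java parser to locate comments rather than on a regular expression.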

TagSEA has been applied by other researchers for more advanced use cases. Quentin Boucher and colleagues describe how they have used source code tagging to identify features in the source code [»]. These tags are then used to prune source code for a pragmatic approach to software product line management. In eMoose, presented by Uri Dekel and James Herbsleb [»], developers can associate annotations or tag directives within API documentation. eMoose then pushes these annotations to the context of the invoking code by decorating the calls and augmenting the hover mechanism. The tool was evaluated in a lab study that demonstrated the directive awareness problem in traditional documentation use and the potential benefits of the eMoose approach [»].

In our own work, we have studied the role of tags in task management systems. In our initial study, we examined how tagging had been adopted and adapted in the work item tracking system of IBM’s Rational Team Concert IDE [»]. We found that the tagging mechanism was eagerly adopted by the team in our study, and that it had become a significant part of many informal processes. We classified the tags into four categories: lifecycle-related, component-specific, cross-cutting, and idiosyncratic. When we replicated the study a year later with a different team, we were able to refine some of the categories and we found that tags were used to support finding of tasks, articulation work and information exchange. Implicit and explicit mechanisms had evolved to manage the tag vocabulary [»]. Our study was also replicated by Fabio Calefato and colleagues [»]. They confirmed our initial findings, and they discovered two additional tag categories, namely IDE-specific tags and divorced tags, i.e. tags that could only be interpreted as part of a compound name.

Closely related to the concept of tagging is the idea of social bookmarking, a method for organizing, storing, managing and searching bookmarks online. Such bookmarks are often organized using tags. Anja Guzzi and colleagues present Pollicino, a tool for collective code bookmarking. Their implementation was based on requirements gathered through an online survey and interviews with software developers. A user study found that Pollicino can be used effectively to document developers’ findings so that other developers can use them. This research will be presented at ICPC this year [»].


Feeds

Feeds in software development are used to achieve awareness in collaborative development settings. As we found in a study published at ICSE 2010, feeds allow the tracking of work at a small scale, i.e. on a per bug or per commit basis [»]. The feed reader then becomes a personal inbox for development tasks that helps developers plan their day. However, in these kinds of feeds, information overload quickly becomes a problem. Thomas Fritz proposed an approach that integrates dynamic and static information in a development environment to allow developers to continuously monitor the relevant information in the context of their work [»]. In a CHI paper from this year, Thomas Fritz and Gail Murphy give more details on their concept [»]: Through a series of interviews, they identified four factors that help developers determine the relevancy of events in feeds: content, target of content, relation with the creator, and previous interactions.

Fabio Calefato and colleagues proposed a tool to embed social network information about distributed co-workers into IBM’s Jazz [»]. They used the FriendFeed aggregator to enhance the current task-focused feed implementation with additional awareness information to foster the building of organizational values, attitudes, and inter-personal relationships.

Social networking

Research on social networking related to software development can be divided into two subgroups: understanding the network structure in existing projects, and transferring ideas from social networking sites such as Facebook into software development. Since the former isn’t really part of the adoption of Web 2.0 in software development, I’ll focus on the latter here. (For an overview of social networks mined from development projects, see the work of Christian Bird and colleagues from MSR 2006 [»] or from FSE 2008 [»] as well as Walt Scacchi’s FSE 2007 paper [»].)

Concepts from social networking sites such as Facebook have been brought into software development through the Codebook project by Andrew Begel and colleagues [»]. Codebook is a social networking service in which developers can be friends not only with other developers but also with work artifacts. These connections then allow developers to keep track of task dependencies and to discover and understand connections in source code. Codebook was inspired by a survey of developers that found that the most impactful problems concerned finding and keeping track of other developers [»]. Based on the Codebook framework, Andrew Begel and colleagues also introduced a newsfeed [»] and a search portal [»].