The Implications of How We Tag Software Artifacts: Exploring Different Schemata and Metadata for Tags

Social tagging has been adopted by software developers in various contexts from source code to work items and build definitions. While the success of tagging is usually attributed to the simplicity of tags, the implementation details of tagging systems vary significantly in terms of metadata, schemata and semantics. In a position paper that Peggy and I recently wrote for Web2SE, we argue that academia and industry should be aware of these differences and that we should start to examine their implications.

The idea of analyzing different dimensions of tagging systems is not new. A very detailed taxonomy is given by Marlow et al. They identify the following seven dimensions in the design of a tagging system:

Tagging rights: Users can tag everybody’s resources vs. users can only tag their own resources.
Tagging support: Blind tagging (users cannot see each other’s tags) vs. viewable tagging (users can see each other’s tags) vs. suggestive tagging (the system suggests tags to users).
Aggregation: Bag model (allows duplicate tags per resource) vs. set model (no duplicates).
Type of object: Type of the resource to be tagged.
Source of material: Resource is supplied by the system vs. resource is supplied by the users.
Resource connectivity: Linked vs. grouped vs. none (possible connections between the resources).
Social connectivity: Linked vs. grouped vs. none (possible connections between the users).
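
To make the aggregation dimension concrete, here is a minimal Python sketch of the difference between the bag model and the set model. The users, file name, and tags are made up for illustration and do not come from any of the systems discussed here.

```python
from collections import Counter

# Hypothetical tag assignments as (user, resource, tag) triples.
assignments = [
    ("alice", "Parser.java", "refactor"),
    ("bob", "Parser.java", "refactor"),
    ("bob", "Parser.java", "bug"),
]


def bag_model(assignments, resource):
    """Bag model: duplicates are kept, so every tag carries a count."""
    return Counter(tag for _, res, tag in assignments if res == resource)


def set_model(assignments, resource):
    """Set model: each tag appears at most once per resource."""
    return {tag for _, res, tag in assignments if res == resource}
```

With the bag model, a system can show how many users applied "refactor" to the same resource; with the set model, that information is lost, but the tag list stays short.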

While these dimensions also apply to tagging systems used by software developers, studying such systems (ICICLE, TagSEA, IBM’s Jazz, BITKit, Google Code, ConcernMapper and Concern Graphs) reveals additional dimensions on top of Marlow’s taxonomy.

We identified the following additional dimensions:

Pre-defined vs. user-defined: Most current tagging systems are based on the concept of tags as “freely-chosen keywords or terms that are associated with or assigned to a piece of information”. However, in older tagging systems such as ICICLE, possible keywords were pre-defined, and software developers were not able to add new keywords to the system. In a dynamic environment such as software development, the just-in-time addition of new tags is the more promising approach.
Metadata: Different tagging systems store different amounts of metadata. For example, in the case of tagging work items in IBM’s Jazz, information such as the tag author and the time a tag was applied to a work item can only be identified by browsing the work item’s history. In other systems such as TagSEA, the author and time can be explicitly added to each tag instance, and tags can be searched by author and creation time. To preserve the simplicity of tagging, tag authors should not be required to add metadata. However, all metadata that can be recorded automatically should be stored to provide additional context.
Semantics: While most tagging systems treat keywords simply as terms that are associated with artifacts, some systems go beyond that and add semantics to tags. An interesting approach is taken by labels in the issue tracker of Google Code, which goes beyond basic labels to support key-value labels. Key-value labels contain one or more dashes, and the part before the first dash is considered to be a field name while the part after that dash is considered to be the value. Studying the use of key-value labels in Google Code is part of our ongoing work.
Hierarchies: Some tagging systems explicitly support tag hierarchies, using a dot-notation (e.g., TagSEA). Keywords that have dots in them can be treated as hierarchical, and they can be displayed in tree-views. In other systems such as IBM’s Jazz, some developers use the dot-notation even though there is no explicit support for hierarchies. A flexible approach that offers additional views when needed is promising.
Single type of resource vs. multiple types: Software developers handle many different kinds of artifacts from source code and work items to build scripts. Nevertheless, many tagging systems for software developers only support tagging a single kind of artifact. One exception is TagSEA. It allows software developers to tag locations in source code — called waypoints — and artifacts such as files, and it shows different kinds of artifacts in a single view. This allows for grouping and relating different kinds of artifacts while keeping the simplicity of tags.
Integration: Another dimension is the extent to which the tagging mechanism is integrated with other tooling. Some systems support social tagging of source code, but require the user to post code fragments on public servers before tags can be applied to code fragments (e.g., DZone Snippets and ByteMycode). In other systems such as IBM’s Jazz or TagSEA, the tagging mechanism is part of the IDE. With the recent trend of moving the IDE into the browser, tagging artifacts online is a promising approach.
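
Three of the dimensions above (metadata, semantics and hierarchies) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of Jazz, TagSEA or Google Code: the names `TagInstance`, `split_key_value` and `build_tree` are hypothetical, and the example labels are invented. The only rules taken from the descriptions above are that a key-value label is split at its first dash and that dots in a keyword denote hierarchy levels.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TagInstance:
    """A tag with metadata that is recorded automatically (hypothetical model)."""
    keyword: str
    author: str
    # The timestamp is filled in by the system, not typed by the tag author.
    created: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def split_key_value(label):
    """Split a label at its first dash into (field name, value).

    Labels without a dash are plain labels and yield (None, label).
    """
    key, sep, value = label.partition("-")
    return (key, value) if sep else (None, label)


def build_tree(keywords):
    """Turn dot-notation keywords into a nested dict suitable for a tree view."""
    tree = {}
    for keyword in keywords:
        node = tree
        for part in keyword.split("."):
            node = node.setdefault(part, {})
    return tree
```

For example, `split_key_value("Priority-High")` yields the field name `Priority` and the value `High`, while `build_tree(["ui.layout", "ui.color", "build"])` groups `layout` and `color` under a common `ui` node. Because the hierarchy is derived from the keywords themselves, a tree view can be offered as an extra without changing how developers type tags.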

Update [June 6, 2010]: The paper is now available here (ACM Digital Library).

Christoph Treude

Singapore Management University
