Where does Google find API documentation?

The documentation of popular APIs is spread across many formats, from vendor-curated reference documentation to Stack Overflow threads. For developers, it is often not obvious where a particular piece of information can be found.

To understand this documentation landscape, Maurício Aniche and I systematically conducted Google searches for the elements of ten popular APIs for a paper that will be presented at WAPI 2018 in June. We queried Google with each API element separately, prefixing each query with the name of the corresponding API (for example, we searched for “Java ArrayList” and “jQuery .add()”). We then retrieved all links from the first page of search results and determined the domain of each link. The detailed results for each API and each domain are available on GitHub. As an example, for TensorFlow, the following domains play a prominent role:

| domain | coverage | median rank |
|---|---|---|
| tensorflow.org | 99.7% | 1 |
| github.com | 88.6% | 2 |
| stackoverflow.com | 69.6% | 4 |
| w3cschool.cn | 24.5% | 6 |
| keras.io | 17.6% | 2 |

We define coverage as the percentage of API elements for which a particular domain appeared on the first page of Google search results, and median rank as the median of all ranks at which that domain appeared on the first page.
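The two definitions above can be sketched in a few lines of Python. This is only an illustration and not the script used for the paper: the function names, the input format (a mapping from API element to the ordered list of first-page URLs), the `www.` stripping, and the `example.*` URLs are all assumptions made for this sketch.

```python
from statistics import median
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the host of a URL, dropping a leading 'www.' (an assumption)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def coverage_and_median_rank(results: dict[str, list[str]]) -> dict[str, tuple[float, float]]:
    """Map each domain to (coverage in percent, median rank).

    `results` maps an API element to its first-page URLs in rank order.
    """
    ranks: dict[str, list[int]] = {}    # every rank at which a domain appeared
    covered: dict[str, set[str]] = {}   # elements for which a domain appeared
    for element, urls in results.items():
        for rank, url in enumerate(urls, start=1):
            domain = domain_of(url)
            ranks.setdefault(domain, []).append(rank)
            covered.setdefault(domain, set()).add(element)
    return {
        domain: (100.0 * len(covered[domain]) / len(results), median(ranks[domain]))
        for domain in ranks
    }


# Made-up search results for two hypothetical API elements.
example = {
    "alpha": [
        "https://www.docs.example.org/alpha",
        "https://forum.example.com/q/1",
        "https://code.example.net/alpha",
    ],
    "beta": [
        "https://forum.example.com/q/2",
        "https://docs.example.org/beta",
    ],
}
stats = coverage_and_median_rank(example)
for domain, (coverage, rank) in sorted(stats.items()):
    print(f"{domain}: coverage {coverage:.1f}%, median rank {rank}")
```

In this sketch, a domain that appears several times on one results page contributes each of its ranks to the median but counts only once towards coverage.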

The following table shows the total number of domains from which search results originated, separately for each API. The numbers demonstrate that API documentation is widely dispersed among many domains: for example, the 5,693 searches for the Java API returned results from 4,139 domains on the first page of search results alone. While there is a strong correlation (Pearson’s r = 0.94) between the size of an API measured in terms of its number of elements (and consequently the number of queries we conducted) and the number of domains, the documentation of some APIs is more dispersed than that of others. Documentation for the 226 classes of JUnit can be found on 252 domains when only considering the first page of Google search results. In other words, there are more domains than API elements in this case.

We define the documentation dispersion factor of an API as the number of domains divided by the number of elements, shown in the last column of the following table. While many APIs have a factor between 0.72 and 0.84, JUnit is an outlier with a high factor, and TensorFlow, Qt, and Symfony are outliers with a low factor, suggesting that these APIs are documented on a relatively small set of domains. Note that even for these three APIs, the searches still returned results from at least 500 domains each.

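| API | elements | domains | domains/element |
|---|---|---|---|
| JUnit | 226 | 252 | 1.12 |
| jQuery | 296 | 249 | 0.84 |
| Guava | 399 | 320 | 0.80 |
| Android | 4,140 | 3,196 | 0.77 |
| Java | 5,693 | 4,139 | 0.73 |
| Hadoop | 826 | 594 | 0.72 |
| Laravel | 675 | 486 | 0.72 |
| Symfony | 1,700 | 738 | 0.43 |
| Qt | 1,609 | 524 | 0.33 |
| TensorFlow | 2,582 | 583 | 0.23 |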
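As a rough check of the numbers in the table above, the following Python sketch recomputes the dispersion factor for each API and the Pearson correlation between the number of elements and the number of domains. The counts are copied from the table; the function and variable names are assumptions made for this sketch, not part of the original analysis.

```python
from math import sqrt

# (number of elements, number of domains) per API, copied from the table above.
COUNTS = {
    "JUnit": (226, 252),
    "jQuery": (296, 249),
    "Guava": (399, 320),
    "Android": (4140, 3196),
    "Java": (5693, 4139),
    "Hadoop": (826, 594),
    "Laravel": (675, 486),
    "Symfony": (1700, 738),
    "Qt": (1609, 524),
    "TensorFlow": (2582, 583),
}


def dispersion_factor(elements: int, domains: int) -> float:
    """Number of domains divided by number of API elements."""
    return domains / elements


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


factors = {api: round(dispersion_factor(e, d), 2) for api, (e, d) in COUNTS.items()}
r = pearson_r([e for e, _ in COUNTS.values()], [d for _, d in COUNTS.values()])

for api, factor in factors.items():
    print(f"{api}: {factor:.2f}")
print(f"Pearson's r = {r:.2f}")
```

Rounded to two decimals, the factors should match the last column of the table, and the correlation should come out close to the reported 0.94.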

Based on this initial data, our next step is to study the documentation of popular APIs in more detail, looking beyond GitHub and Stack Overflow.

All details are available in:

Christoph Treude and Maurício Aniche. Where does Google find API documentation? In WAPI ’18: Proceedings of the 2nd International Workshop on API Usage and Evolution, 2018. To appear.

This work is in part a replication of our earlier paper from 2011:

Chris Parnin and Christoph Treude. Measuring API Documentation on the Web. In Web2SE ’11: Proceedings of the 2nd International Workshop on Web 2.0 For Software Engineering, pages 25-30, 2011.