Terminology Wars: A Web Content Analysis

Summary

I collect statistics on the Web usage of the terms "open source" and "free software", and show evidence for the following:

For a fast look at the results, go here.

In this paper I take no position on what should be the case or what terminology people should use; I simply report what is. The search URLs, methods, and data-reduction program I used are included; readers are urged to perform their own tests to verify the facts presented, and to assist in developing a more detailed and exact picture.

This is the third revision, removing some historical statistics that turned out to have been wrong, and with them the least-squares fit. The second revision added the "Linux" vs. "GNU/Linux" sanity check and the least-squares fit to the "free software" trend line. The first revision added more searches on Web media.

A cautionary note, 27 Oct 2004: Alex Fernandez has shown that the results for SourceForge, Savannah, and news.com have false-positives problems that call them into question. We're working on refining these.

Label, Label, Who's Got The Label?

Recently, as I write at the beginning of July 2004, I've received several queries, journalistic and otherwise, about the relative frequency and importance of the terms "free software" and "open source". I never like giving speculative answers, so I put some thought into how to address the question in a factual way.

I decided to perform a more rigorous and carefully-thought-out version of an experiment I've done twice before: using the Web search engines to perform statistical media content analysis on the Web.

Statistical content analysis is a well-understood tool in the social sciences. A representative example may be found in Time Series Analysis of Risk Frames in Media Communication of Agrobiotechnology (which shows up high on a Google search for "media content analysis"). For my purposes, time series analysis would be overkill — I am interested mainly in present-time reality, not in how that reality evolved. This paper is nevertheless a useful example for its discussion of category framing, phrase choice, and other methodological issues.

Here are a number of questions one might ask about these terms:

  1. Which term is more frequently used among software developers?
  2. Which term is more frequently used in the news media?
  3. Which term is more frequently used on the Web as a whole?
  4. What percentage of users of either term recognize or are comfortable with both?
  5. For each term, what percentage of its users are either unaware of or choose not to use the other?

Using Search Engines for Content Analysis

For Web content analysis, the ideal tool would be one that allowed specifying arbitrary Boolean combinations of words or phrases and returned a hit count summed over the entire Web. All real search engines are more limited. See Search Engine News for more detailed discussion.

It would be nice to use Google, simply because it has the largest database. But Google's Boolean-expression capabilities are limited, with no grouping and an OR operator that is inaccurate. Worse, though every Google search returns a hit count, the Google documentation warns that these hit counts are not exact. Furthermore the total number of hits reported can vary depending on the number of displayed hits you request! This does not inspire confidence.

Thus, instead, I use Yahoo, with the second largest database but full Boolean capability. Their hit counts are rounded and therefore almost certainly inaccurate, but at least consistent.

The Problem of False Positives

A substantial problem with short phrases like "open source" and "free software", especially those not containing proper nouns, is false positives.

The phrase "open source" is relatively immune to false-positive problems, as the only homographic usage is a relatively uncommon term of art among intelligence specialists. (In fact, the absence of common conflicting usages was one of the minor criteria used to select this term at the February 1998 meeting where it was brainstormed into existence.) Some noise (but probably not a great deal) comes from the use of "open source" on pages like this one, which use the phrase in a way inspired by the open-source software movement but without reference to software or movement.

The phrase "free software" is more problematic, due to the well-known gratis/libre ambiguity around the word "free". It is an observable fact that many Web references to "free software" are like this one, pointing to shareware or gratis programs rather than source-available code licensed under the Free Software Foundation's conditions.

Later in this paper I will measure and quantify the false-positive effect.

Methodology and Report Format

In the remainder of this article, I experiment with various qualifiers on the phrases "free software" and "open source". In each table, the first or 100% line is the set of all qualified matches including either "free software" or "open source". I take this to be the universe of all relevant sites from the community of free software and/or open-source developers (hereafter just "the community").

For each table, I deduce a noise level or statistical uncertainty by looking at entries that should sum to 100%. If they don't, I take the largest difference from 100% as the noise level. We'll see that the noise level is generally about 1%.
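The noise estimate can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the function name `noise_level` is hypothetical, it checks three row combinations that should each reproduce A (the tables' noise line names only B+E+F), and it works on raw hit counts rather than the rounded percentages shown in the tables, so its output need not match the noise lines reported there.

```python
def noise_level(a, b, c, d, e, f):
    """Largest relative deviation from A among sums that should equal A."""
    # Each combination partitions the universe A, so each would equal
    # A's hit count exactly if the search engine's numbers were exact.
    partitions = {
        "B+E+F": b + e + f,  # both terms, first term only, second term only
        "C+F": c + f,        # all first-term pages plus second-only pages
        "D+E": d + e,        # all second-term pages plus first-only pages
    }
    return max(abs(a - total) / a for total in partitions.values())


# Hit counts for rows A through F of the SourceForge table (Table 2).
noise = noise_level(2_470_000, 31_500, 2_410_000, 89_900, 2_370_000, 57_600)
print(f"{noise:.2%}")
```

Taking the maximum over several partitions follows the "largest difference from 100%" wording above; restricting the dictionary to the B+E+F entry gives the single-formula version.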

I evaluated each qualifier for effectiveness at screening out false positives by skimming the top hits. Page ranking systems are a powerful ally here, because people will take their language cues from sites they consider authoritative.

I evaluated each qualifier for false negatives by looking at its effect on line 2, the "free software" AND "open source" line. My assumption is that all these are community sites, and that a qualifier which depresses this figure by more than noise level has a false-negatives problem.

For purposes of this analysis, a good qualifier is one which reduces false positives by more than noise level and does not increase false negatives by more than noise level.

I've made the search strings in all my tables live links, so you can easily gather current numbers yourself. Your retrievals will change over time, but probably not to where they aren't recognizably related to what you see here.
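For readers who want to regenerate the searches rather than follow the links, the six rows (A through F) used in the site-specific tables below follow one pattern. The sketch here only builds the query strings as they appear in those tables; the function name `build_queries` is hypothetical, and it makes no request to any search engine.

```python
def build_queries(first, second, site=None):
    """Return the six searches (rows A-F) for two phrases."""
    p, q = f'"{first}"', f'"{second}"'        # quote each phrase
    prefix = f"site:{site} " if site else ""  # optional site restriction
    return {
        "A": f"{prefix}({p} OR {q})",       # universe: either phrase
        "B": f"{prefix}({p} AND {q})",      # both phrases
        "C": f"{prefix}{p}",                # first phrase, with or without second
        "D": f"{prefix}{q}",                # second phrase, with or without first
        "E": f"{prefix}({p} AND NOT {q})",  # first phrase only
        "F": f"{prefix}({q} AND NOT {p})",  # second phrase only
    }


queries = build_queries("open source", "free software", site="sourceforge.net")
for row, query in queries.items():
    print(row, query)
```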

Developer Usage

To assay present developer usage of these terms, I decided to begin with a site search on SourceForge. SourceForge is the largest repository of open-source and free software in the world. This means that any uses of our probe phrases are very likely to be real, i.e. natural utterances by the community we are examining. In particular, on SourceForge "free software" reliably means libre rather than gratis. We don't need to experiment with qualifiers here.

In the absence of any reason to believe that the SourceForge user population is systematically biased towards one term or the other, the fact that SourceForge is the largest such site probably makes it the best possible site for our purposes.

Here is our first set of numbers. Each row gives hit counts for the Yahoo search specified on the left. The second number is a percentage of the relevant universe.


Search                                                        Hits       Percentage  Row
site:sourceforge.net ("open source" OR "free software")       2,470,000  100%        A
site:sourceforge.net ("open source" AND "free software")      31,500     1%          B
site:sourceforge.net "open source"                            2,410,000  98%         C
site:sourceforge.net "free software"                          89,900     3%          D
site:sourceforge.net ("open source" AND NOT "free software")  2,370,000  96%         E
site:sourceforge.net ("free software" AND NOT "open source")  57,600     2%          F
Noise level estimate |A - (B+E+F)|                                       1%
Table 2: Yahoo search on SourceForge, 2004-06-27

On SourceForge, at least, use of the term "free software" has declined to where it is barely distinguishable from this noise level.

I found these numbers rather startling. So much so that I had to investigate the possibility that the "free software" crowd had decamped en masse to elsewhere.

The obvious candidate was Savannah, a SourceForge-like site set up by the Free Software Foundation specifically for those who became ideologically unhappy with SourceForge after its maintaining organization degenerated from an open-source company to a proprietary shop. So I did the same sort of site search:


Search                                                         Hits       Percentage  Row
site:savannah.gnu.org ("open source" OR "free software")       25,500     100%        A
site:savannah.gnu.org ("open source" AND "free software")      52         0%          B
site:savannah.gnu.org "open source"                            55         0%          C
site:savannah.gnu.org "free software"                          25,500     100%        D
site:savannah.gnu.org ("open source" AND NOT "free software")  2          0%          E
site:savannah.gnu.org ("free software" AND NOT "open source")  25,400     99%         F
Noise level estimate |A - (B+E+F)|                                        1%
Table 3: Yahoo search on Savannah, 2004-06-27

While there, I discovered that Savannah is now devoted to GNU projects and now has a companion site for projects not part of GNU. So I did another site search:


Search                                                            Hits       Percentage  Row
site:savannah.nongnu.org ("open source" OR "free software")       7,850      100%        A
site:savannah.nongnu.org ("open source" AND "free software")      15         0%          B
site:savannah.nongnu.org "open source"                            15         0%          C
site:savannah.nongnu.org "free software"                          7,840      100%        D
site:savannah.nongnu.org ("open source" AND NOT "free software")  39         0%          E
site:savannah.nongnu.org ("free software" AND NOT "open source")  7,820      100%        F
Noise level estimate |A - (B+E+F)|                                           0%
Table 4: Yahoo search on savannah.nongnu.org, 2004-06-27

The universe population of both these sites (33,350 pages) is comparable to the noise level in Yahoo's SourceForge numbers. If there is any population of pages still using the term "free software" that is significant in size relative to the whole community's output, we aren't going to find it here. (I looked at ibiblio.org, which was the largest repository before SourceForge, but it was an order of magnitude smaller than Savannah.)

News Media Usage

To assess news-media usage, I ran site-specific searches on several major Internet news organizations. I started with two of the leading technology trade press sites:


Search                                                 Hits       Percentage  Row
site:news.com ("open source" OR "free software")       1,050,000  100%        A
site:news.com ("open source" AND "free software")      6,930      0%          B
site:news.com "open source"                            1,020,000  97%         C
site:news.com "free software"                          37,300     3%          D
site:news.com ("open source" AND NOT "free software")  1,020,000  97%         E
site:news.com ("free software" AND NOT "open source")  30,100     2%          F
Noise level estimate |A - (B+E+F)|                                2%
Table 5a: Yahoo search on news.com, 2004-06-28

Note that news.com's article database is extremely volatile; your searches may show very different hit counts, though the phrase percentages and the top 4 seem to be pretty stable.


Search                                                  Hits       Percentage  Row
site:zdnet.com ("open source" OR "free software")       92,300     100%        A
site:zdnet.com ("open source" AND "free software")      689        1%          B
site:zdnet.com "open source"                            88,700     96%         C
site:zdnet.com "free software"                          2,030      2%          D
site:zdnet.com ("open source" AND NOT "free software")  88,400     95%         E
site:zdnet.com ("free software" AND NOT "open source")  1,250      1%          F
Noise level estimate |A - (B+E+F)|                                 3%
Table 5b: Yahoo search on zdnet.com, 2004-06-28


Search                                                      Hits       Percentage  Row
site:infoworld.com ("open source" OR "free software")       44,300     100%        A
site:infoworld.com ("open source" AND "free software")      558        1%          B
site:infoworld.com "open source"                            43,300     97%         C
site:infoworld.com "free software"                          928        2%          D
site:infoworld.com ("open source" AND NOT "free software")  42,300     95%         E
site:infoworld.com ("free software" AND NOT "open source")  367        0%          F
Noise level estimate |A - (B+E+F)|                                     4%
Table 5c: Yahoo search on infoworld.com, 2004-07-03


Search                                                  Hits       Percentage  Row
site:eweek.com ("open source" OR "free software")       126,000    100%        A
site:eweek.com ("open source" AND "free software")      9,390      7%          B
site:eweek.com "open source"                            122,000    96%         C
site:eweek.com "free software"                          12,600     10%         D
site:eweek.com ("open source" AND NOT "free software")  113,000    89%         E
site:eweek.com ("free software" AND NOT "open source")  1,900      2%          F
Noise level estimate |A - (B+E+F)|                                 2%
Table 5d: Yahoo search on eweek.com, 2004-07-03

The high 10% rate on "free software" at eweek.com turns out to be a false positive due to a Microsoft ad campaign touting "free software management tools"; when hits for the phrase "Microsoft free software" are excluded, its percentage drops below noise level.


Search                                                            Hits       Percentage  Row
site:informationweek.com ("open source" OR "free software")       27,100     100%        A
site:informationweek.com ("open source" AND "free software")      91         0%          B
site:informationweek.com "open source"                            26,900     99%         C
site:informationweek.com "free software"                          166        0%          D
site:informationweek.com ("open source" AND NOT "free software")  26,700     98%         E
site:informationweek.com ("free software" AND NOT "open source")  72         0%          F
Noise level estimate |A - (B+E+F)|                                           2%
Table 5e: Yahoo search on informationweek.com, 2004-07-03

The picture revealed by skimming these search results is simple. Use of the term "free software" is rare in the technology press and mainly associated with older stories. Of Yahoo's top four news.com hits on "free software" AND NOT "open source" at time of writing, two are from early 1998, one is a false positive from Microsoft with "Piracy-Free Software" in the title, and the fourth is an FSF-originated advocacy piece.

I then tried to find similarly large samples from general media, without success. It appears that neither the Reuters wire service nor the Associated Press wire service nor the Wall Street Journal nor MSNBC makes its archives accessible to search engines.

Whole-Web Usage

Now it's time to look at the Web as a whole. First, a search with no attempt to screen out false positives for either term:


Search                                 Hits        Percentage  Row
"open source" OR "free software"       52,000,000  100%        A
"open source" AND "free software"      5,040,000   10%         B
"open source"                          33,000,000  63%         C
"free software"                        24,500,000  47%         D
"open source" AND NOT "free software"  28,000,000  53%         E
"free software" AND NOT "open source"  19,300,000  37%         F
Noise level estimate |A - (B+E+F)|                 0%
Table 6: Yahoo search on entire Web, 2004-06-27

It seems that "open source" tops "free software", by a 16% margin, well over statistical noise level, even with the shareware/freeware pages left in. But how to filter them out for a more accurate picture?

Excluding all sites that mention "shareware" seems like an obvious first step:


Search                                                   Hits        Percentage  Row
("open source" OR "free software") AND NOT shareware     49,900,000  100%        A
"open source" AND "free software" AND NOT shareware      4,790,000   9%          B
"open source" AND NOT shareware                          32,100,000  64%         C
"free software" AND NOT shareware                        22,700,000  45%         D
"open source" AND NOT "free software" AND NOT shareware  27,300,000  54%         E
"free software" AND NOT "open source" AND NOT shareware  17,700,000  35%         F
Noise level estimate |A - (B+E+F)|                                   2%
Table 7: Yahoo search on entire Web, 2004-06-30

This makes a 3% difference, which is just above noise level. But examining the first 1000 actual hits by hand tells us something else, which is that pages mentioning "open source" and "shareware" are effectively always community pages whether or not they mention "free software", whereas pages not mentioning "open source" but mentioning "shareware" are effectively never community pages. That is, NOT shareware is a good filter for "free software" but not for "open source".

Conversely, these results yield exactly one good qualifier for the phrase "open source": the absence of the phrase "open source intelligence", which shows up with about 1% frequency on sites relating to espionage and intelligence.

So here is our third pass:


Search                                                                                          Hits        Percentage  Row
("open source" AND NOT "open source intelligence") OR ("free software" AND NOT shareware)       50,700,000  100%        A
("open source" AND NOT "open source intelligence") AND ("free software" AND NOT shareware)      4,820,000   9%          B
("open source" AND NOT "open source intelligence")                                              32,900,000  64%         C
"free software" AND NOT shareware                                                               22,700,000  44%         D
("open source" AND NOT "open source intelligence") AND NOT ("free software" AND NOT shareware)  28,300,000  56%         E
"free software" AND NOT ("open source" AND NOT "open source intelligence") AND NOT shareware    17,900,000  35%         F
Noise level estimate |A - (B+E+F)|                                                                          1%
Table 8: Yahoo search on entire Web, 2004-06-30

Looking at the top 1000 "free software" AND NOT "open source" AND NOT shareware hits by hand still shows a lot of binary-download and other pages irrelevant for our purposes. I searched diligently for other qualifiers that would identify non-community pages, checking obvious candidates like "download" and "freeware" first, but did not find any that were good by the standard I previously described.

So how much of that 44% is junk? To address this question, we need to get a better handle on the vulnerability of both phrases to false positives.

Estimating False Positives

On 1 July 2004 I looked through the top 1000 Yahoo hits for each of the six searches in Table 8 for false positives. I considered a page a positive if either (a) some use of one of the phrases "open source" or "free software" referred to software released under a license conforming to the Open Source Definition or documentation released under either GFDL or Creative Commons licenses, or (b) the page contained a clear and intentional reference to the open-source or free-software movements. For the benefit of anyone interested in running their own statistical analysis, here are both the raw false-positives list (collected by hand) and the Python program I used for reducing the data below; the absolute numbers won't stay valid for long because page ranks shift, but you can analyze the distribution.

Here is the output of report_false_positives(), a table of running averages showing how false-positive percentages vary with page rank in the top 1000 hits:


Rank      A       B       C       D       E       F
1-50      0.0000  0.0000  0.0000  0.5600  0.0000  0.8000
51-100    0.0100  0.0100  0.0000  0.6100  0.0000  0.8500
101-150   0.0067  0.0067  0.0000  0.6133  0.0000  0.8333
151-200   0.0050  0.0050  0.0000  0.6150  0.0000  0.8350
201-250   0.0040  0.0040  0.0000  0.6240  0.0000  0.7800
251-300   0.0033  0.0033  0.0000  0.6033  0.0067  0.7533
301-350   0.0057  0.0057  0.0000  0.6200  0.0114  0.7543
351-400   0.0050  0.0050  0.0000  0.6200  0.0125  0.7550
401-450   0.0044  0.0044  0.0044  0.6311  0.0111  0.7711
451-500   0.0040  0.0040  0.0040  0.6500  0.0100  0.7800
501-550   0.0036  0.0036  0.0073  0.6618  0.0091  0.7855
551-600   0.0033  0.0033  0.0067  0.6533  0.0100  0.7850
601-650   0.0031  0.0031  0.0077  0.6600  0.0092  0.7846
651-700   0.0029  0.0029  0.0086  0.6700  0.0100  0.7900
701-750   0.0027  0.0027  0.0080  0.6653  0.0093  0.7907
751-800   0.0025  0.0025  0.0075  0.6737  0.0088  0.7925
801-850   0.0024  0.0024  0.0082  0.6753  0.0082  0.8012
851-900   0.0022  0.0022  0.0078  0.6700  0.0078  0.8056
901-950   0.0021  0.0021  0.0074  0.6695  0.0074  0.8095
951-1000  0.0020  0.0020  0.0080  0.6760  0.0070  0.8060
Table 9: Running average of false-positive counts in top 1000.
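The values in Table 9 appear to be cumulative: each row gives the fraction of false positives among all hits from rank 1 through the end of that row's range. For instance, a single false positive somewhere in ranks 51-100 would produce 0.0000, 0.0100, 0.0067, 0.0050 in the first four rows, which is how column A begins. The following is a minimal sketch of such a running average, not the data-reduction program mentioned above; the function name and the example ranks are hypothetical.

```python
def running_false_positive_rates(false_positive_ranks, total=1000, block=50):
    """Cumulative false-positive fraction at the end of each block of ranks."""
    flagged = set(false_positive_ranks)
    rows = []
    count = 0
    for end in range(block, total + 1, block):
        start = end - block + 1
        # Add the false positives whose rank falls inside this block.
        count += sum(1 for rank in range(start, end + 1) if rank in flagged)
        # Divide by all hits seen so far, not just by this block's size.
        rows.append((f"{start}-{end}", round(count / end, 4)))
    return rows


# Hypothetical hand-flagged ranks, chosen for illustration only.
rows = running_false_positive_rates([60, 320], total=400)
for label, rate in rows:
    print(f"{label:<8} {rate:.4f}")
```

Because the denominator grows with rank, a column with no new false positives drifts downward on its own; a rising column (such as D) means new false positives are arriving faster than the running average.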

This table can be summarized as follows:

  1. The A search, ("open source" AND NOT "open source intelligence") OR ("free software" AND NOT shareware) has a rate of false positives below 1% noise level and trending down as page rank decreases.
  2. The B search, ("open source" AND NOT "open source intelligence") AND ("free software" AND NOT shareware) has a rate of false positives below 1% noise level and trending down as page rank decreases.
  3. The C search, ("open source" AND NOT "open source intelligence"), has a rate of false positives that is below 1% noise level, varying erratically.
  4. The D search, ("free software" AND NOT shareware) has a rate of false positives of about 67%, trending gradually upwards as page rank falls.
  5. The E search, ("open source" AND NOT "open source intelligence") AND NOT ("free software" AND NOT shareware), has false positives below noise level and trending gradually downwards as page rank falls.
  6. The F search, "free software" AND NOT ("open source" AND NOT "open source intelligence") AND NOT shareware, has a rate of false positives of 81%, rising as page rank drops.

All the variations in false-positive rate except A are readily explained by the combination of a very high false-positive rate for "free software" and a very low one for "open source". And A turns out to be an artifact resulting from the way Yahoo reports alternations. All hits for the first search key in an alternation get listed first; thus, since we specified "open source" first, the false positives are buried out of sight below pagerank 1000. On the other hand, had we listed "free software" first, the false-positive count in the first 1000 would have been artificially high.

Should we believe that these false-positive rates will maintain over millions of lower-ranked pages? The best reason to do so is that they are all explained by a simple theory: the combination of "open source" and "free software" has effectively no false positives, "open source" alone has them at below noise level, and "free software" has many. Composing these keys yields intermediate cases.

Now let's correct the numbers in Table 8 by adjusting for false positives. There is just one tricky bit here. Since we can't rely on the false-positive rate for A, we have to estimate an A' hit count by adding together the adjusted B, E, and F numbers. A'=100% gets used to calculate the percentages in the second column. We can also use it to deduce the rate of false positives in the A set.


Search                                                                                          Unadjusted hits  Unadjusted %  False positives  Adjusted hits  Adjusted %  Row
("open source" AND NOT "open source intelligence") OR ("free software" AND NOT shareware)       50,700,000       100%          28%              36,366,960     100%        A'
("open source" AND NOT "open source intelligence") AND ("free software" AND NOT shareware)      4,820,000        9%            0%               4,810,360      13%         B
("open source" AND NOT "open source intelligence")                                              32,900,000       64%           1%               32,636,800     89%         C
"free software" AND NOT shareware                                                               22,700,000       44%           68%              7,354,799      20%         D
("open source" AND NOT "open source intelligence") AND NOT ("free software" AND NOT shareware)  28,300,000       56%           0%               28,101,900     77%         E
"free software" AND NOT ("open source" AND NOT "open source intelligence") AND NOT shareware    17,900,000       35%           80%              3,454,699      9%          F
Table 10: Adjusted hits and percentages.
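The correction behind Table 10 can be sketched as follows. This is an illustration under stated assumptions rather than the exact computation used: the function name is hypothetical, and since the precise rates applied are not stated, the rates below are read off the last row of Table 9, so the D and F results may differ slightly from the table.

```python
def adjust_for_false_positives(hits, fp_rate):
    """Scale hit counts by (1 - false-positive rate) and rebuild the universe.

    hits maps row labels "A" through "F" to unadjusted hit counts;
    fp_rate maps "B" through "F" to false-positive fractions.
    """
    adjusted = {row: round(hits[row] * (1 - fp_rate[row])) for row in "BCDEF"}
    # The measured rate for A is unreliable, so A' is rebuilt from the
    # three disjoint pieces: both terms, first term only, second term only.
    adjusted["A'"] = adjusted["B"] + adjusted["E"] + adjusted["F"]
    percent = {row: count / adjusted["A'"] for row, count in adjusted.items()}
    # False-positive rate in the A set implied by the rebuilt universe.
    implied_fp_a = 1 - adjusted["A'"] / hits["A"]
    return adjusted, percent, implied_fp_a


# Unadjusted hits from Table 8; rates assumed from the last row of Table 9.
hits = {"A": 50_700_000, "B": 4_820_000, "C": 32_900_000,
        "D": 22_700_000, "E": 28_300_000, "F": 17_900_000}
fp_rate = {"B": 0.0020, "C": 0.0080, "D": 0.6760, "E": 0.0070, "F": 0.8060}

adjusted, percent, implied_fp_a = adjust_for_false_positives(hits, fp_rate)
for row in ("A'", "B", "C", "D", "E", "F"):
    print(f"{row:<2} {adjusted[row]:>12,} {percent[row]:>5.0%}")
print(f"implied false-positive rate for A: {implied_fp_a:.0%}")
```

Rebuilding A' from B, E, and F rather than adjusting A directly sidesteps the ordering artifact in how the alternation's hits are listed.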

Note that because the A, B, C and E sets are large relative to the D and F sets, any error in the estimated rate of false positives would have to be large to swing the adjusted numbers significantly.

There were few enough true-positive search-F hits in the top 1000 (fewer than 200) that I could review them more closely. This turned up another interesting statistic: just about 50% were project pages written by developers, as opposed to advocacy papers or journalism or academic research or weblogging. Therefore, we can estimate that half of F (4.5% of the whole-Web hits) count as developer pages using the term "free software" exclusively.

Linux vs. GNU/Linux

After I published the first version of this paper, a friend suggested that I run the same sort of cross-check on "Linux" vs. "GNU/Linux". His theory was that this could provide an independent check on the distribution of the same ideological positions, without requiring any significant correction for false positives. The null hypothesis is to expect the incidence ratio of these terms to track the incidence ratio of "open source" vs. "free software", because the FSF's campaign for the term "free software" is accompanied by an exhortation to use the term "GNU/Linux" rather than plain "Linux".

If these numbers do not track, then either (a) the communities that use the second set of terms are not strongly correlated with the communities that use the first set of terms, or (b) we need to look at our methodology for filtering false positives more suspiciously.


Search                                              Hits       Percentage  Row
site:sourceforge.net ("Linux" OR "GNU/Linux")       2,600,000  100%        A
site:sourceforge.net ("Linux" AND "GNU/Linux")      31,000     1%          B
site:sourceforge.net "Linux"                        2,650,000  100%        C
site:sourceforge.net "GNU/Linux"                    31,100     1%          D
site:sourceforge.net ("Linux" AND NOT "GNU/Linux")  2,580,000  99%         E
site:sourceforge.net ("GNU/Linux" AND NOT "Linux")  3,190      0%          F
Noise level estimate |A - (B+E+F)|                             2%
Table 11: Yahoo search on SourceForge, 2004-07-11


Search                                       Hits       Percentage  Row
site:news.com ("Linux" OR "GNU/Linux")       874,000    100%        A
site:news.com ("Linux" AND "GNU/Linux")      59         0%          B
site:news.com "Linux"                        889,000    100%        C
site:news.com "GNU/Linux"                    59         0%          D
site:news.com ("Linux" AND NOT "GNU/Linux")  871,000    99%         E
site:news.com ("GNU/Linux" AND NOT "Linux")  7          0%          F
Noise level estimate |A - (B+E+F)|                      2%
Table 12: Yahoo search on news.com, 2004-07-11


Search                              Hits        Percentage  Row
("Linux" OR "GNU/Linux")            87,800,000  100%        A
("Linux" AND "GNU/Linux")           3,090,000   4%          B
"Linux"                             89,600,000  100%        C
"GNU/Linux"                         3,100,000   4%          D
("Linux" AND NOT "GNU/Linux")       84,600,000  96%         E
("GNU/Linux" AND NOT "Linux")       1,200,000   1%          F
Noise level estimate |A - (B+E+F)|              2%
Table 13: Yahoo search on entire Web, 2004-07-11

Interpreting this as a sanity check on the previous numbers, we seem to pass. None of these results are surprising.

The SourceForge and news.com ratios track the corresponding "open source" vs. "free software" ratios to within 1%, actually more precisely than the noise level of the probes. There is a very slight difference downward, which would be consistent with the theory that the FSF has been slightly less effective at promoting the "GNU/Linux" label than the "free software" label. This is consistent with the historical record; they only began promoting the "GNU/Linux" label in the mid-1990s, compared to 1983 for "free software". The alternative is that, if anything, we are underestimating the false-positive problems of "free software".

Statistical Interpretation

Use of the term "free software" within the developer community has receded to the point where it is, at D = 3% (nonexclusive use) and F = 2% (exclusive use), barely distinguishable from statistical noise. The only alternative to this interpretation is to suppose both that the population of the largest repository site in the world is hugely biased by unknown factors and that the Free Software Foundation has been unsuccessful at attracting more than a tiny percentage of "free software" partisans to Savannah. This alternative is rendered very unlikely by other evidence summarized below.

Use of the term "free software" has dropped to the same D = 3%, F = 2%, near noise level, in the Web-accessible technical press, and in the hand-inspected top hits seems primarily associated with older articles.

Use of the term "free software" on the general Web is at D = 20%, F = 9%, and exclusive use on project pages written by software developers is about 4.5%. The latter figure is close enough to the SourceForge F = 2% to suggest that SourceForge is in fact representative of the whole-Web developer population.

F for the Savannah sites accounts for about 1% of the A in Table 10. Because we know that F = 4.5% for developers on the Web as a whole, this suggests that the Savannah sites have recruited about one fifth of their target population of partisans for the term "free software", with SourceForge's F accounting for another two fifths, leaving at most two fifths unaccounted for on the rest of the Web.

A for SourceForge is about 7% of A for the whole Web. Given that a significant fraction of Web hits on "open source" are journalism, this suggests that SourceForge has captured more than 7% of open-source developer activity. (Other estimates have run as high as 33%.)

The phrase "open source" standing alone does not have a significant level of false positives when used to probe for Web traces of open-source software activity, advocacy, and journalistic coverage. The phrase "free software", on the other hand, does have a significant false-positives problem, at or above the 80% level, mostly or entirely due to the gratis/libre ambiguity.

Web-usage frequency of the term "free software" relative to "open source" continues to exhibit a long-term decline from a trivially 100% incidence of "free software" before the "open source" label was launched in 1998. The rate is difficult to quantify, but we know that "free software" has lost 80% of web share in six years, suggesting an average drop of about 13 percentage points per year. However, the decay curve is more likely exponential, asymptotically approaching some nonzero stable population at or below the current 9%.
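The arithmetic behind the two readings of that decline can be made explicit. This is a rough sketch only: the exponential case here assumes decay toward zero with no floor, which is simpler than the nonzero asymptote suggested above, and the variable names are illustrative.

```python
share_lost = 0.80  # fraction of web share lost, from the text
years = 6          # 1998 to 2004

# Linear reading: the same number of percentage points lost every year.
linear_points_per_year = 100 * share_lost / years

# Exponential reading: the same fraction of the remaining share lost every
# year, so that (1 - rate) ** years equals the share still remaining.
compound_rate = 1 - (1 - share_lost) ** (1 / years)

print(f"linear:   {linear_points_per_year:.1f} points per year")  # 13.3
print(f"compound: {compound_rate:.1%} of remaining share per year")  # 23.5%
```

Under the compound reading the early years account for most of the loss, which fits a curve that flattens out as it nears a stable population.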

Previous versions of this paper showed rough Google statistics from 2001 and 2002, but a reviewer pointed out that those numbers were not internally consistent. Why, we don't know. Google hit counts are known to be fuzzy. Nevertheless, they suggested that the term "free software" has been in long-term decline.

So far, I have only written about the statistics of page hits. However, it would be surprising if these did not track preferences in the community population. They can fail to do so only if on average users of the term "free software" differ from users of the term "open source" in their propensity to write Web pages. One of the behavioral traits indexed by use of the term "free software" is a greater tendency towards ideological banner-waving; this is demonstrated by (among many other facts) the existence of Savannah, which has a purely ideological raison d'être. Thus, the most likely direction of bias is that "open source" is underrepresented by developer-web-page hits rather than overrepresented.

Unanswered Questions

These results indicate that the term "free software" retains much more presence on the Web as a whole (D = 20%, F = 9%) than it does among developers on SourceForge or at news.com (both D = 3%, F = 2%). What hypotheses might account for this almost order-of-magnitude difference?

One theory would start from the fact that "free software" has been a live term for the entire post-1990 lifespan of the Web, while "open source" was introduced only in 1998. Developer pages associated with live SourceForge projects get touched relatively often (due to release announcements, etc.) while on the general Web static content such as academic papers may remain untouched for indefinite periods. Are most "free software" mentions on the general Web holdovers from olden days?

While this hypothesis is attractively simple, it is difficult to test rigorously. Yahoo and Google offer only crude date filtering (last three months, last six months, last year). The Wayback Machine's recall facility is still in beta. The top trade-press hits indicate that the 20% probably does include a disproportionate number of older pages, but this is unlikely to be the whole story.

Another theory is that the whole-Web statistic reflects the language of some population not represented on SourceForge or in the technology-press sites I surveyed. The most obvious candidates are community evangelists and academics. Because the choice of "open source" vs. "free software" is perceived as a political issue within the hacker community, evangelists and academics often try to reinforce their impartiality by using both terms. 13% out of that 20% has this pattern.

Some of these questions could be answered by a more fine-grained content analysis that would try to identify hits by type, sorting them as project pages, advocacy, journalism, and so forth. That would be a valuable next step.