OSS in the Academic Sector

This page focuses on classifying GitHub users into academic institutions.
R text analysis/regex matching

Approach

One of the main goals of our summer project is to understand who is producing open-source software (OSS) and to describe important characteristics of those contributors, including the sectors they are embedded within. To classify contributors into the academic sector, we matched GitHub users’ email and self-reported affiliation columns to information about universities and colleges using the Hipo Labs university domain list, a dataset that has names, countries, and domain names for nearly 9,800 universities across the world. We relied on regular expressions to account for common abbreviations in the self-reported affiliations as well as aliases for specific academic institutions. The domain names in the Hipo Labs dataset also let us match GitHub users to specific universities based on the email address associated with their account. Lastly, we added a “Misc. Student” category for GitHub users who reported being students without naming a specific academic institution. After classifying users into this sector, we used the GitHub commits data to construct networks in which nodes represent GitHub users and edge weights correspond to the number of repositories that two users have both contributed to. This page documents this approach and details some of our major findings.
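The network construction described above can be sketched as follows. This is a minimal illustration in Python rather than the project's own code: the function name `build_collaboration_edges` and the `(user, repo)` input format are assumptions made for the example, not details taken from the GHTorrent schema.

```python
from collections import defaultdict
from itertools import combinations


def build_collaboration_edges(commits):
    """Build weighted edges from (user, repo) commit records.

    Returns a dict mapping a sorted user pair (u, v) to the number of
    repositories that both users committed to.
    """
    # Collect the set of distinct contributors for each repository.
    repo_users = defaultdict(set)
    for user, repo in commits:
        repo_users[repo].add(user)

    # Every pair of contributors to the same repository shares that repo,
    # so each repository adds 1 to the weight of each pair it contains.
    edges = defaultdict(int)
    for users in repo_users.values():
        for u, v in combinations(sorted(users), 2):
            edges[(u, v)] += 1
    return dict(edges)


# Hypothetical commit records: (user login, repository name).
example_commits = [
    ("alice", "repo1"), ("bob", "repo1"), ("alice", "repo1"),
    ("alice", "repo2"), ("bob", "repo2"), ("carol", "repo2"),
]
example_edges = build_collaboration_edges(example_commits)
```

Repeated commits by the same user to the same repository count once, because contributors are stored as a set per repository.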

Sectoring Results

While the GHTorrent data includes about 2.1 million GitHub users, only around 20% of them provided any organizational affiliation on their profile, leaving us with roughly 420,000 individuals to classify. To start, we used only the Hipo Labs data to classify GitHub users based on their self-reported affiliation, allocating around 16,000 users to the academic sector from this raw string data alone.

However, since the input is a free-form text field, those who did include affiliations showed a staggering amount of variation in how they reported that information. We accounted for this in the academic sector by using regular expressions to match common abbreviations and aliases for over 120 academic institutions. For example, anyone whose original self-reported affiliation was written as “UC San Diego” or “UCSD” was re-categorized into the same group as those listed under “University of California, San Diego.” This approach boosted our total count to over 27,000.
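A minimal sketch of this kind of alias matching is shown below, written in Python for illustration. The alias table holds only the UC San Diego example from the text, and the exact pattern, the `ALIAS_PATTERNS` name, and the `normalize_affiliation` helper are assumptions rather than the expressions actually used in the project.

```python
import re

# Hypothetical alias table: canonical institution name -> pattern covering
# its common abbreviations and spellings. Only one entry is shown here.
ALIAS_PATTERNS = {
    "University of California, San Diego": re.compile(
        r"\b(ucsd|uc san diego|university of california,? (at )?san diego)\b",
        re.IGNORECASE,
    ),
}


def normalize_affiliation(raw):
    """Return the canonical institution name, or None if no alias matches."""
    if not raw:
        return None
    # Collapse runs of whitespace so "UC  San Diego" still matches.
    text = " ".join(raw.split())
    for canonical, pattern in ALIAS_PATTERNS.items():
        if pattern.search(text):
            return canonical
    return None
```

The word boundaries (`\b`) keep a short abbreviation such as "UCSD" from matching inside a longer, unrelated word.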

Besides the self-reported affiliation information, we also investigated email addresses as a source of information about users’ affiliations. Since the university dataset from Hipo Labs includes domain names, we were able to impute the institutions for an additional 11,000 users based on their email domain. In total, we identified 40,273 users in the academic sector. Of these, 38,709 are matched to a specific institution while 1,564 are categorized as miscellaneous students.
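The email-based step can be sketched as a lookup from email domain to institution. The example below is illustrative Python, not the project's code. It assumes each university record has `name`, `country`, and `domains` fields, and the two sample records stand in for the full Hipo Labs list. Matching subdomains (for example, a departmental address) against the parent domain is also an assumption about how such matching might be done.

```python
# Illustrative stand-in for the university domain list.
SAMPLE_UNIVERSITIES = [
    {"name": "University of California, San Diego",
     "country": "United States", "domains": ["ucsd.edu"]},
    {"name": "University of Toronto",
     "country": "Canada", "domains": ["utoronto.ca"]},
]


def build_domain_index(universities):
    """Map each lowercase domain to its institution name."""
    index = {}
    for uni in universities:
        for domain in uni["domains"]:
            index[domain.lower()] = uni["name"]
    return index


def institution_from_email(email, domain_index):
    """Return the institution for an email address, or None if unknown."""
    if not email or "@" not in email:
        return None
    domain = email.rsplit("@", 1)[1].strip().lower()
    labels = domain.split(".")
    # Try the full domain first, then each shorter parent domain,
    # stopping before the bare top-level domain (e.g. "edu").
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in domain_index:
            return domain_index[candidate]
    return None


domain_index = build_domain_index(SAMPLE_UNIVERSITIES)
```

With this index, `institution_from_email("jane@cs.ucsd.edu", domain_index)` resolves to the San Diego record through the parent domain, while a generic address such as one at `gmail.com` returns `None`.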

Top OSS Countries in the Academic Sector

Generally, we found that users based in the United States are most prominently represented. Roughly one-third of the institutions represented in our data are in the US, and users from those institutions make up around 55 percent of all users in the academic sector. In this figure, you can see the locations with the greatest number of GitHub users in the academic sector (plotted on a logarithmic scale). Below the figure is a searchable table of countries, which includes the raw number of users at academic institutions in that country (n) as well as that country's percentage of all academic-sector users worldwide.
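The two columns of the country table can be computed along these lines. This is a small Python sketch under the assumption that each classified user carries a single country label, with the percentage taken as the country's share of all academic-sector users. The `country_table` name and the sample labels are hypothetical.

```python
from collections import Counter


def country_table(user_countries):
    """Return (country, n, percent) rows sorted by descending user count."""
    counts = Counter(user_countries)
    total = sum(counts.values())
    return [
        (country, n, round(100 * n / total, 1))
        for country, n in counts.most_common()
    ]


# Hypothetical country labels, one per classified user.
sample_rows = country_table(["US", "US", "CA", "US", "CN"])
```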

Top OSS Universities Around the World

This figure shows which specific academic institutions had the greatest number of GitHub users contributing to OSS. The top 10 are all located in the US, though both China and Canada also have universities with a high number of contributors. Below the figure is another searchable table, this time of academic institutions.

Top Universities in the United States

Here we focus specifically on the US. The figure below shows which US universities have the greatest number of contributors to OSS, with the color indicating whether the school is public or private.

Limitations and Future Steps

There are limitations to the methods we used to categorize users into the academic sector. For one, we relied heavily on the university dataset from Hipo Labs, which has flaws of its own, including gaps in consistency and coverage. Also, both we, the researchers, and Hipo Labs are based in North America, which likely means that we accounted for the academic institutions located here more accurately than those in the rest of the world. This is compounded by the fact that GitHub is also a US-based company and is a common software hosting platform here, while other platforms that we are not accounting for may be popular in different parts of the world.

Additionally, some compromises had to be made concerning university systems, or universities with multiple campuses. In general, for simplicity, we categorized users only into the flagship institution or campus of that system. For example, any user whose self-reported affiliation includes “University of Michigan” was grouped into “University of Michigan - Ann Arbor”, even if they were actually affiliated with the Flint campus of the university. However, there are a few exceptions: the campuses of the University of California system and the University of Illinois system were treated as separate, as were the member institutions of the University of London.
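The flagship rule can be illustrated with a simple substring lookup. This Python sketch encodes only the University of Michigan example from the text. The `FLAGSHIP_CAMPUS` table and `assign_flagship` helper are hypothetical names, and systems treated as exceptions would simply be left out of the table and handled campus by campus.

```python
# Hypothetical table: lowercase system name -> flagship campus label.
FLAGSHIP_CAMPUS = {
    "university of michigan": "University of Michigan - Ann Arbor",
}


def assign_flagship(affiliation):
    """Return the flagship campus if the text names a listed system, else None."""
    text = affiliation.lower()
    for system_name, flagship in FLAGSHIP_CAMPUS.items():
        # Any mention of the system maps to the flagship, which is why a
        # Flint affiliation is also grouped under Ann Arbor.
        if system_name in text:
            return flagship
    return None
```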

Other concerns include users reporting affiliations with more than one institution and users reporting their affiliations in different languages.