Data & Methods
Our work on OSS broadly aims to measure how much OSS is in use (stock), how much is created (flow), who is developing these tools, and how OSS tools are shared within and across different sectors. More specifically, our summer project applies computational text analysis, probabilistic matching, and social network analysis to classify GitHub users into economic sectors, identify the institutions users are affiliated with in each sector, and analyze how users collaborate within and across economic sectors and geographic boundaries.
Drawing on Keller and colleagues’ (2018) data science framework, we combined multiple data sources to classify GitHub users into one of five sectors: academic, business, government, household, and non-profit. Our main source of data is GHTorrent, which includes a list of ~2.1 million users scraped from GitHub along with their email addresses, self-reported affiliations, and location data. To supplement these user data, we leveraged the GHOSS.jl API to collect commit, addition, and deletion histories from these users’ OSI-licensed GitHub repositories from 2008 to 2019. In total, our GitHub commits dataset comprised ~3.2 million contributors and ~7.8 million distinct repositories.
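As a rough illustration of how these sources are joined, the sketch below merges GHTorrent-style user records with repository-level commit activity. The file and column names are hypothetical stand-ins for the actual GHTorrent dump and GHOSS.jl output, not the schemas those tools produce.

```python
import pandas as pd

# Hypothetical file and column names; the real GHTorrent dump and the
# GHOSS.jl output follow their own schemas.
users = pd.read_csv("ghtorrent_users.csv",
                    usecols=["login", "email", "company", "location"])
commits = pd.read_csv("ghoss_commits.csv",
                      usecols=["login", "repo", "commits", "additions", "deletions"])

# Join commit activity onto the user records so each contributor carries
# both affiliation information and repository-level activity.
activity = commits.merge(users, on="login", how="left")
```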
To classify contributors into the academic sector, we matched GitHub users’ email and self-reported affiliation fields to information about universities and colleges from around the world using Hipo Labs’ university domain list. We relied on regular expressions to account for common abbreviations in the self-reported affiliations as well as aliases for specific academic institutions. Having the domain names for academic institutions in the Hipo Labs dataset also allowed us to match GitHub users to specific universities based on the email address associated with their account. Lastly, we added a “Misc. Student” category for GitHub users who reported being students without naming a specific academic institution. After classifying users into this sector, we used the GitHub commits data to construct networks where nodes represent GitHub users and edges are weighted by the number of repositories those users have contributed to in common.
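The sketch below illustrates the email-domain portion of this matching step. The two-entry list mimics the structure of the Hipo Labs university-domains JSON (institution “name” and “domains” fields) and is only a stand-in for the full file; the actual pipeline also handles affiliation abbreviations and aliases with regular expressions.

```python
# Illustrative stand-in for the Hipo Labs list, which contains thousands of
# institutions with "name" and "domains" fields.
universities = [
    {"name": "University of Virginia", "domains": ["virginia.edu"]},
    {"name": "Stanford University", "domains": ["stanford.edu"]},
]

# Map each academic domain to its institution name.
domain_to_school = {dom.lower(): uni["name"]
                    for uni in universities
                    for dom in uni.get("domains", [])}

def match_academic_email(email):
    """Return a university name if the email's domain (or a parent domain,
    e.g. cs.virginia.edu -> virginia.edu) appears in the academic list."""
    if not email or "@" not in email:
        return None
    parts = email.split("@")[-1].lower().split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in domain_to_school:
            return domain_to_school[candidate]
    return None

print(match_academic_email("jdoe@cs.virginia.edu"))  # University of Virginia
```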
Next, we classified users into the government sector based on a combination of email and self-reported affiliation data. First, we acquired the US Government Domain list to classify government users by email. From these users, we created a dictionary of their self-reported affiliations and used this list to match users without government emails. Second, because variation in the self-reported data makes exact string matching difficult, we took two additional steps: (1) excluding all non-alphanumeric characters and prepositions, and (2) matching entries on bigrams (i.e., sequences of two adjacent words from the affiliation string) that occur more than once. Third, we generated a dictionary of government entities to match against GitHub users by combining institutions from the A-Z Index of Government Departments and Agencies, the list of US Government Federally Funded Research and Development Centers, and the US Government Manual. In addition to matching on these unique entries, we also extracted bigrams from this list to account for variations in self-reported affiliations. Lastly, we developed a list of government terms to supplement our matching strategy for this sector.
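The following sketch illustrates the normalization and bigram steps, using a short illustrative preposition list and a handful of example affiliations rather than the project’s full stop-word set and data.

```python
import re
from collections import Counter

# Illustrative preposition/stop-word list; the actual list is longer.
PREPOSITIONS = {"of", "for", "at", "in", "on", "the", "and"}

def normalize(affiliation):
    """Lowercase, strip non-alphanumeric characters, and drop prepositions."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", affiliation.lower()).split()
    return [t for t in tokens if t not in PREPOSITIONS]

def bigrams(tokens):
    """Return adjacent word pairs from a token list."""
    return list(zip(tokens, tokens[1:]))

# Count bigrams across all self-reported affiliations; only bigrams that
# occur more than once are kept as matching keys.
affiliations = ["Department of Energy", "US Department of Energy", "Dept. of Defense"]
counts = Counter(bg for aff in affiliations for bg in bigrams(normalize(aff)))
match_keys = {bg for bg, n in counts.items() if n > 1}
print(match_keys)  # {('department', 'energy')}
```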
To classify users into the non-profit sector, we combined Forbes’ Top-100 largest US charities, the United Nations list of Non-Governmental Organizations, and a list of non-profits that administer government laboratories (extracted from the US Government Federally Funded Research and Development Centers dataset), then used regular expressions to match GitHub users to these non-profit and NGO institutions around the world. To classify users into the household sector, we used regular expressions to catch common phrases that developers use to signify they work from home, including “freelancer,” “personal,” and “self-employed.” This approach identified approximately 4,800 users in this sector.
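A minimal sketch of the household-sector match is shown below; the pattern includes only the phrases named above, whereas the actual expression covers a longer list of work-from-home terms.

```python
import re

# Illustrative pattern; the project's regex covers additional phrases.
HOUSEHOLD_PATTERN = re.compile(
    r"\b(freelancer?|personal|self[- ]?employed)\b",
    flags=re.IGNORECASE,
)

def is_household(affiliation):
    """Flag a self-reported affiliation as household-sector if it mentions
    a work-from-home phrase."""
    return bool(affiliation and HOUSEHOLD_PATTERN.search(affiliation))

print(is_household("Self-employed developer"))  # True
print(is_household("University of Virginia"))   # False
```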
To assign users to the business sector, we took an exclusionary approach that depends on the other four sectors. First, we standardized the affiliation column by removing (1) all website domain information, using manually curated terms originally based on DataHub’s Domain Entries, (2) all legal entity nomenclature, based on a manually curated version of GLEIF’s legal entity abbreviations, and (3) a list of commonly occurring arbitrary symbols. After these procedures were applied, we removed (a) all users classified into the academic, government, non-profit, or household sectors and (b) all users whose listed institution appeared in the affiliation column five or fewer times. This threshold of five is arbitrary, but it helps establish some degree of commonality among users assigned to the business sector. Furthermore, while this exclusionary approach is less than ideal, classifying GitHub users into the business sector is complicated by the absence of a publicly available data source that comprehensively lists all businesses around the world.
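The sketch below illustrates the exclusion and frequency-threshold logic, assuming a hypothetical DataFrame with login, cleaned-affiliation, and sector columns produced by the earlier classification steps; the column names are placeholders.

```python
import pandas as pd

def assign_business(users: pd.DataFrame, min_mentions: int = 5) -> pd.DataFrame:
    """Assign remaining users to the business sector (hypothetical columns:
    "login", "affiliation_clean", "sector")."""
    # Keep only users not already classified into another sector.
    unclassified = users[users["sector"].isna()].copy()

    # Count how often each cleaned affiliation string appears.
    counts = unclassified["affiliation_clean"].value_counts()
    common = counts[counts > min_mentions].index

    # Users whose affiliation is mentioned more than five times are kept.
    business = unclassified[unclassified["affiliation_clean"].isin(common)].copy()
    business["sector"] = "business"
    return business
```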
Finally, to categorize GitHub users by geography, we used regular expressions to recode all self-reported location data available in the GHTorrent data, finding ~770,000 users with valid country codes. In addition to this self-reported location data, GHTorrent also provides latitude and longitude information for GitHub users (assigned using the Open Street Maps API). Drawing on these geocoded data, we aggregated all latitude and longitude coordinates that fell within a 2-degree difference of one another. Although this data reduction approach is somewhat crude, its main purpose is to cluster users under a common “city code” so that we can examine how users collaborate across more specific geographic units. In turn, we used these city codes to construct networks where nodes represent GitHub users and edges are weighted by the number of repositories those users have contributed to in common.
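The sketch below illustrates both the 2-degree binning and the construction of a co-contribution network, using small illustrative data frames in place of the full GHTorrent and commits data; the grid-based city code is one simple way to implement the aggregation described above.

```python
import networkx as nx
import pandas as pd

# Small illustrative stand-ins for the user and commit tables.
users = pd.DataFrame({
    "login": ["alice", "bob", "carol"],
    "lat":   [38.03, 38.90, 51.51],
    "lon":   [-78.48, -77.04, -0.13],
})
commits = pd.DataFrame({
    "login": ["alice", "bob", "alice", "carol"],
    "repo":  ["repo1", "repo1", "repo2", "repo2"],
})

def city_code(lat, lon, cell=2.0):
    """Snap coordinates to a 2-degree grid cell, used as a crude city code."""
    return f"{int(lat // cell)}_{int(lon // cell)}"

users["city_code"] = [city_code(la, lo) for la, lo in zip(users["lat"], users["lon"])]

# Build a user-user network where edge weights count shared repositories.
G = nx.Graph()
for _, contributors in commits.groupby("repo")["login"]:
    contributors = sorted(set(contributors))
    for i, u in enumerate(contributors):
        for v in contributors[i + 1:]:
            weight = G.get_edge_data(u, v, default={}).get("weight", 0)
            G.add_edge(u, v, weight=weight + 1)
```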