Sectoring Open Source Software

Symposium Presentation of the 2020 DSPG Project






Sectoring Open Source Software:
Where Do GitHub Contributions Come From?

Crystal Zang, Morgan Klutzke, Daniel Bullock,
Brandon Kramer, Gizem Korkmaz, and José Bayoán Santiago Calderón
Sponsors: Carol Robbins (NCSES) and Ledia Guci (NCSES)











Why Study Open Source Software?

Current NCSES and other economic indicators do not measure
the scope and impact of OSS developed outside the business sector











Sectoring Open Source Software on GitHub

Our two main goals for the 2020 DSPG Summer Project were to:

(1) Classify GitHub users into one of five economic sectors (Academic, Business, Household, Government and Non-Profit)

(2) Examine where GitHub users are located around the world











Methods

We used computational text analysis techniques (regular expressions, list matching, and bigrams) to standardize entries
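As a rough illustration of how these three techniques can fit together, here is a minimal sketch in Python. It is not the project's actual pipeline: the term lists, the email domains, the function names (`standardize`, `bigrams`, `classify`), and the rule that any unmatched affiliation counts as Business are all assumptions made for this example.

```python
import re

# Hypothetical lookup lists; the project's real lists are not shown in the slides.
ACADEMIC_TERMS = {"university", "college", "medical school"}
NONPROFIT_TERMS = {"foundation", "non profit"}
GOVERNMENT_DOMAINS = (".gov", ".mil")
HOUSEHOLD_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com"}


def standardize(text):
    """Regular expressions: lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def bigrams(text):
    """Return adjacent word pairs so two-word terms can be list-matched."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def classify(email=None, company=None):
    """Assign one of the five sectors, or None if nothing matches."""
    domain = email.rsplit("@", 1)[-1].lower() if email and "@" in email else ""
    if domain.endswith(".edu"):
        return "Academic"
    if domain.endswith(GOVERNMENT_DOMAINS):
        return "Government"
    if company:
        name = standardize(company)
        terms = set(name.split()) | set(bigrams(name))
        # List matching against single words and bigrams.
        if terms & ACADEMIC_TERMS:
            return "Academic"
        if terms & NONPROFIT_TERMS:
            return "Non-Profit"
        if name:
            # Assumption for this sketch: any other affiliation is a business.
            return "Business"
    if domain in HOUSEHOLD_EMAIL_DOMAINS:
        return "Household"
    return None
```

For example, `classify(company="Apache Software Foundation")` matches the hypothetical non-profit list through the single word "foundation", while `classify(email="x@gmail.com")` falls through to Household because no affiliation is given.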











Sectoring Results

Twenty percent of the ~2.1 million users in the GHTorrent data provide an email address or
work affiliation that can be used for sectoring, which gives us ~420,000 GitHub users.


Most users fall into the business sector followed by
the academic, household and government sectors.











Business Sector

Most OSS-producing companies are large tech companies based in Silicon Valley











Academic Sector

US-based academic institutions are the largest producers of OSS






Most of the top OSS-producing universities are close to
major tech hubs in CA, MA, NY, TX and WA











Geographic Analyses

Countries Where GitHub Users are Located

The number of GitHub users based in the US is around 4 times higher than in China











U.S. States Where GitHub Users are Located

Within the US, most GitHub users are based on the coasts and near major tech hubs











Cities Where GitHub Users are Located

Silicon Valley is the world’s most prominent OSS hotspot
followed by London, NYC, Moscow and Beijing





















City-Level GitHub User Collaboration Network

Node size: Weighted degree, reflects the amount of direct collaboration
Node size: Betweenness centrality, reflects importance in the global network
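To make the two node measures concrete, the sketch below computes both on a toy network in plain Python. The edge list, the node labels "A" to "E", and the function names are hypothetical and are not project data. Betweenness is computed here on the unweighted graph with Brandes' algorithm and left unnormalized, which may differ from the settings used for the figures.

```python
from collections import defaultdict, deque

# Hypothetical city-to-city collaboration counts: (city, city, shared contributions).
EDGES = [("A", "B", 5), ("A", "C", 8), ("B", "D", 2), ("C", "B", 3), ("B", "E", 1)]


def weighted_degree(edges):
    """Sum of edge weights at each node: the amount of direct collaboration."""
    totals = defaultdict(int)
    for a, b, weight in edges:
        totals[a] += weight
        totals[b] += weight
    return dict(totals)


def betweenness(edges):
    """Unnormalized betweenness on the undirected, unweighted graph (Brandes)."""
    adj = defaultdict(set)
    for a, b, _ in edges:
        adj[a].add(b)
        adj[b].add(a)
    nodes = list(adj)
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        # Breadth-first search counting shortest paths from the source.
        order = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to the source.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {n: total / 2 for n, total in score.items()}
```

In this toy network, node "A" has the largest weighted degree (13), but node "B" is the only one with nonzero betweenness (5.0), because every shortest path to "D" or "E" passes through it. The two measures can therefore rank the same cities differently.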











Main Findings

The majority of users come from the business sector followed
by the academic, government and household sectors


Most OSS production seems to be coming from the business sector


Most GitHub users are based in the US (both in general and in the academic sector)


Major universities in California may also be benefitting
from their proximity to Silicon Valley’s OSS production


Challenges & Future Directions

Scraping more user data to improve classification accuracy


Improving the government, non-profit and household classification systems


Determining how to classify contributions at the intersection of multiple sectors


Conducting network analysis within and across sectors to understand collaboration tendencies