Analyzing GitHub, how developers change programming languages over time

Waren Long, source{d}.


PyDays Vienna, 2018

The programming language competition in the open source world

History

The dataset

Programming language history of GitHub user X

Filtering

Quantization

Transportation Problem (LP)

How can we calculate the flows between two language profiles?

Minimum-cost flow problem & EMD

$$ (\mathcal{P})~~ \left\{ \begin{array}{lll} \min & \sum_{i=1}^S \sum_{j=1}^D ~ x_{i,j} c_{i,j} \\ \mbox{s.t.} & \sum_{j=1}^D x_{i,j} \leq s_i & ~~i = 1,...,S \\ & \sum_{i=1}^S x_{i,j} \geq d_j & ~~j = 1,...,D \\ & x_{i,j} \geq 0 & ~~i = 1,...,S,~ j = 1,...,D \end{array} \right. $$

Hypothesis: $$\sum_{i=1}^S s_i = \sum_{j=1}^D d_j ~~~~~\mbox{and}~~~~~ c_{i,j} = 1 ~~~~\forall i,j$$
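
A minimal sketch of how such flows could be computed for one developer, not the code used for the talk. With balanced totals and unit costs on every pair, every feasible flow has the same total cost, so this sketch adds one assumption that the slide does not state: activity that can stay in the same language stays there (as if c_{i,i} = 0). The function name `language_flows` and the sample profiles are hypothetical.

```python
def language_flows(source, target, tol=1e-12):
    """Return {(from_lang, to_lang): amount} moving `source` onto `target`.

    `source` and `target` map language names to non-negative amounts
    (for example, lines of code per year). Their totals must match.
    Assumption (not on the slide): mass stays in its own language first.
    """
    if abs(sum(source.values()) - sum(target.values())) > 1e-9:
        raise ValueError("supply and demand totals must be equal")

    flows, surplus, deficit = {}, [], []
    for lang in sorted(set(source) | set(target)):
        s = source.get(lang, 0.0)
        d = target.get(lang, 0.0)
        kept = min(s, d)  # mass that does not change language
        if kept > tol:
            flows[(lang, lang)] = kept
        if s - kept > tol:
            surplus.append([lang, s - kept])
        if d - kept > tol:
            deficit.append([lang, d - kept])

    # All remaining moves cost the same, so any matching of surplus to
    # deficit is optimal; a simple two-pointer sweep is enough.
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        moved = min(surplus[i][1], deficit[j][1])
        flows[(surplus[i][0], deficit[j][0])] = moved
        surplus[i][1] -= moved
        deficit[j][1] -= moved
        if surplus[i][1] <= tol:
            i += 1
        if deficit[j][1] <= tol:
            j += 1
    return flows


# Hypothetical profiles of one developer in two consecutive years.
year_1 = {"Java": 6, "Python": 4}
year_2 = {"Java": 2, "Python": 5, "Go": 3}
flows = language_flows(year_1, year_2)
```

For general, non-uniform costs this shortcut no longer applies and a real LP or EMD solver would be needed.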

Transition Matrix

Centrality measure

How likely are people coding in one language to switch to another?

Power iteration

Algorithm to calculate the dominant eigenvector of a well-conditioned matrix

Repeat until convergence:

$$x_{i+1} = P\cdot x_i$$

x: stationary distribution of the Markovian process associated with P

Convergence is guaranteed if P is stochastic, irreducible and aperiodic
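
The iteration above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the talk's implementation: P is taken to be column-stochastic (each column sums to 1) so that it matches the update x_{i+1} = P·x_i, and the function name `power_iteration`, the tolerance, and the 2×2 example matrix are hypothetical.

```python
def power_iteration(P, tol=1e-10, max_iter=1000):
    """Approximate the stationary distribution x with x = P x.

    P is a square list of lists, assumed column-stochastic.
    """
    n = len(P)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # One step of x_{i+1} = P . x_i
        nxt = [sum(P[r][c] * x[c] for c in range(n)) for r in range(n)]
        # Renormalise to guard against floating-point drift.
        total = sum(nxt)
        nxt = [v / total for v in nxt]
        # Stop when the L1 change between iterates is small enough.
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical two-language chain: column c holds the probabilities of
# moving from language c to each language r.
P = [[0.9, 0.5],
     [0.1, 0.5]]
stationary = power_iteration(P)
```

For this example the fixed point can be checked by hand: x = P·x gives x_0 = 5·x_1, so the stationary distribution is (5/6, 1/6).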

Markovian Process

Make the Transition Matrix well conditioned for power iteration

Google in 1998

Make the transition matrix well conditioned

Larry and Sergey had exactly the same objective with the WWW matrix.

C_{ij} is 1 if web page i links to web page j, and 0 otherwise.

They invented a trick to make C well conditioned.

They called it PageRank.

GitHub "LanguageRank"

Update the transition matrix as follows:

$$ P = \beta P + \frac{1-\beta}{N}\left( \begin{array}{cccc} 1 & 1 & ... & 1 \\ 1 & 1 & ... & 1 \\ ... & ... & \ddots & ... \\ 1 & 1 & ... & 1 \\ \end{array} \right) $$

N: number of languages

β: damping (random walk) factor, usually 0.85
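
The update above can be sketched as a small helper. This is an illustrative sketch, assuming P is stored as a square list of lists whose columns sum to 1; the function name `damp` and the example matrix are hypothetical.

```python
def damp(P, beta=0.85):
    """Return beta * P + (1 - beta) / N * (all-ones matrix)."""
    n = len(P)
    teleport = (1.0 - beta) / n  # uniform share added to every entry
    return [[beta * P[r][c] + teleport for c in range(n)] for r in range(n)]


# Hypothetical periodic chain: developers only ever swap between two
# languages, so plain power iteration is not guaranteed to converge.
P = [[0.0, 1.0],
     [1.0, 0.0]]
damped = damp(P)
```

After damping, every entry is strictly positive and each column still sums to 1, so the matrix is stochastic, irreducible and aperiodic.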

Which language is the most popular on GitHub?

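| Rank | Language | Popularity, % | Source code, % |
|------|----------|---------------|----------------|
| 1    | Python   | 17.7          | 11.0           |
| 2    | Java     | 15.5          | 16.2           |
| 3    | C        | 10.0          | 16.8           |
| 4    | C++      | 9.9           | 12.3           |
| 5    | PHP      | 8.8           | 23.8           |
| 6    | Ruby     | 8.7           | 2.5            |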

GitHub annual stats

Sorted Transition Matrix

Language Competition

Read this on your device


warenlg.github.io/pydays-vienna-2018/

Thank You