Managing Software Engineer Talent Pools using Unsupervised Learning

Waren Long, Athenian.

ML Conference, Berlin - December 10th, 2019

warenlg.github.io/ml-conference-berlin-2019

About me

Background in mathematics: Optimization and Graph Theory
Worked 2.5 years at source{d} as a Software Engineer in the ML team
⇒ MLonCode: code completion, automated code review, clone detection, ...
Now working at Athenian where I switched to Product
⇒ Provide engineering teams with metrics and insights to speed up software delivery and improve code quality

Plan

Origins
Developer Clustering
Challenges

Origins

Motivation

What is the skill set of your talent pool?
Which developers are key for your software development?
Which developers could work well with each other?
To which team of developers should we assign this project?

Data Requirements

Large open source org
Many contributors
Public team structure

GitLab

Fully open source org
117 repos gitlab.com/gitlab-org/*, 13M LoC
3k contributors overall including 250 employees
about.gitlab.com/company/team
Willing to validate the results

Developer Clustering

Commit Time Series

Collect and normalize the time series -> gitbase
Compute the distance matrix using Dynamic Time Warping -> fastdtw
Clustering -> HDBSCAN
Dimension reduction -> UMAP
Results for source{d}
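
A minimal sketch of the distance-matrix step, assuming each developer is described by a series of commit counts per time bucket. It swaps in the exact dynamic-programming DTW for fastdtw (which computes an approximation) to stay dependency-free. The sum-to-one normalization, the developer names and the counts are all hypothetical.

```python
from math import inf


def normalize(series):
    """Scale a commit-count series so it sums to 1 (all-zero series are returned unchanged)."""
    total = sum(series)
    return [x / total for x in series] if total else list(series)


def dtw_distance(a, b):
    """Exact dynamic time warping distance with absolute difference as local cost."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def distance_matrix(series_by_dev):
    """Return (sorted developer names, pairwise DTW matrix over normalized series)."""
    devs = sorted(series_by_dev)
    normalized = {d: normalize(series_by_dev[d]) for d in devs}
    matrix = [[dtw_distance(normalized[p], normalized[q]) for q in devs] for p in devs]
    return devs, matrix


# Hypothetical weekly commit counts per developer.
commits = {
    "alice": [0, 2, 4, 2, 0, 0],
    "bob": [0, 0, 2, 4, 2, 0],    # same shape as alice, shifted by one week
    "carol": [3, 0, 0, 0, 0, 3],
}
devs, matrix = distance_matrix(commits)
```

DTW absorbs the one-week shift between the first two series, so their distance is zero while the third stays far from both. A matrix like this could then be passed to HDBSCAN and UMAP with `metric="precomputed"`.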

Programming languages

Collect LoC added/deleted/changed by developer by language -> hercules
Exclude markup and autogenerated languages: XML, JSON, SVG, ...
Saturate to the 95th percentile
Dimension reduction
Clustering in the embedding space
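
The filtering and saturation steps above can be sketched as follows. This is one possible reading, not the talk's actual code: the cap is a nearest-rank 95th percentile computed per language over the developers who touched that language (an assumption), and all names and LoC counts are hypothetical.

```python
EXCLUDED = {"XML", "JSON", "SVG"}  # markup / autogenerated languages named above


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q being an integer in [1, 100]."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def language_features(loc_by_dev, q=95):
    """Drop excluded languages, then cap each language column at its q-th percentile."""
    languages = sorted({lang for locs in loc_by_dev.values() for lang in locs} - EXCLUDED)
    caps = {}
    for lang in languages:
        # Assumption: the cap only looks at developers with a non-zero contribution.
        positive = [locs[lang] for locs in loc_by_dev.values() if locs.get(lang, 0) > 0]
        caps[lang] = percentile(positive, q) if positive else 0
    return {
        dev: {lang: min(locs.get(lang, 0), caps[lang]) for lang in languages}
        for dev, locs in loc_by_dev.items()
    }


# Hypothetical lines of code changed per developer and language.
loc = {f"dev{i}": {"Go": 10 * i, "JSON": 500} for i in range(1, 20)}
loc["outlier"] = {"Go": 100000, "Ruby": 40}
features = language_features(loc)
```

Saturation keeps one bulk commit (vendored code, a large refactoring) from dominating the feature space before dimension reduction.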

Source code identifiers

Extract UASTs from source code files -> bblfsh
Represent code files as bag of TF-IDF scores of their identifiers
Assign developers an aggregation of the bags they contributed to
Find decorrelated sparse topics -> BigARTM
Manually label the topics
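
A minimal sketch of the bag-of-identifiers representation, assuming the identifiers have already been extracted per file (UAST extraction with bblfsh and topic modeling with BigARTM are not shown). The idf formula `log(N / df)`, the sum used to aggregate files per developer, and all file names and identifiers are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_bags(identifiers_by_file):
    """Return {file: {identifier: tf * idf}} with idf = log(N / df)."""
    n_files = len(identifiers_by_file)
    df = Counter()
    for identifiers in identifiers_by_file.values():
        df.update(set(identifiers))  # document frequency: count each file once
    bags = {}
    for path, identifiers in identifiers_by_file.items():
        tf = Counter(identifiers)
        bags[path] = {term: count * math.log(n_files / df[term]) for term, count in tf.items()}
    return bags


def developer_bags(bags, files_by_dev):
    """Aggregate the bags of the files each developer contributed to by summing scores."""
    result = {}
    for dev, paths in files_by_dev.items():
        total = Counter()
        for path in paths:
            for term, score in bags[path].items():
                total[term] += score
        result[dev] = dict(total)
    return result


# Hypothetical identifiers per file and files touched per developer.
files = {
    "app/views.py": ["flask", "route", "request", "flask"],
    "db/index.c": ["btree", "opclass", "request"],
    "ui/main.js": ["modernizr", "element", "route"],
}
contributions = {"alice": ["app/views.py", "ui/main.js"], "bob": ["db/index.c"]}
dev_bags = developer_bags(tfidf_bags(files), contributions)
```

Identifiers shared by many files get a low weight, so each developer's bag is dominated by the terms specific to the code they touched.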

Topic Modeling

Topic label	Top terms
Backend frameworks	`servlet`, `flask`, `javax`
Language detection	`language`, `java`, `linguist`
Data mining	`chartj`, `graphql`, `average`
Frontend + UI, CSS	`modernizr`, `mstyle`, `elementn`
Config management	`chef`, `runner`, `platform`
Low-level backend	`btree`, `opclass`, `using`

Teams of topic contributors

Frontend	Config Management	Gollum (wiki)
Verify FE	Monitor BE	Product Management
Core	Plan BE	Support Department
Manage FE	Core Alumni	Core Alumni
Create FE	Create BE	Core
Monitor FE	Distribution BE
Serverless FE

Challenges

Topic modeling on identifiers is hard

Parsing errors when extracting UASTs
Approximate tuning for the number of topics
Painful manual labelling

⇒ Easier to search for library experts

Extract imports using regex
Use public library classifications -> awesome-libraries
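
A minimal sketch of the import-based approach, limited to Python sources. The regex is an illustrative guess, not the one used in the talk, and it will also match import-like lines inside strings. `CATEGORY_BY_LIBRARY` is a hypothetical stand-in for a public library classification.

```python
import re
from collections import Counter

# Matches `import a.b, c` and `from a.b import c` at the start of a line.
IMPORT_RE = re.compile(r"^[ \t]*(?:from\s+([\w.]+)\s+import\b|import\s+(.+))", re.MULTILINE)

# Hypothetical excerpt of a library classification; real lists are not reproduced here.
CATEGORY_BY_LIBRARY = {"flask": "web", "django": "web", "numpy": "data", "pandas": "data"}


def extract_imports(source):
    """Return the set of top-level packages imported by a Python source string."""
    packages = set()
    for from_module, import_list in IMPORT_RE.findall(source):
        if from_module:
            names = [from_module]
        else:
            # Drop trailing comments, then `import a.b as c, d` -> ["a.b", "d"]
            import_list = import_list.split("#")[0]
            names = [part.split(" as ")[0].strip() for part in import_list.split(",")]
        for name in names:
            top_level = name.split(".")[0]
            if top_level:  # relative imports such as `from . import x` are skipped
                packages.add(top_level)
    return packages


def expertise(sources):
    """Count, per category, the classified libraries imported across a developer's files."""
    counts = Counter()
    for source in sources:
        for package in extract_imports(source):
            if package in CATEGORY_BY_LIBRARY:
                counts[CATEGORY_BY_LIBRARY[package]] += 1
    return counts


example = (
    "import os, numpy as np\n"
    "from flask.views import View\n"
    "from . import local\n"
    "    import pandas  # lazy import\n"
)
```

Unlike topic modeling, this needs no parser, no tuning of the number of topics and no manual labelling: the categories come with the classification.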

Who is who?

identity-matching

Bot Identities

Introduce noise in the analysis -> CloudFoundry analysis
Squash the contributions of the real developers

Developer	number of commits
ci bot	1852
dependabot	1630
James White	847
final release builder	719

Bot Detection

Build a training set of bot identities by running regex on the names
Train a BPE model on the emails
Extract statistical features from the commit activity
Train an XGBoost model to detect bots among developer identities
Model published in modelforge (asdf) and included in identity-matching
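
A minimal sketch of two of the steps above: seeding labels with a regex on names and deriving statistical features from commit activity. The pattern and the feature choice are hypothetical, since the talk does not give them. The BPE model on emails and the XGBoost classifier are not shown.

```python
import re
from statistics import mean, pstdev

# Hypothetical pattern: "bot" as a separate word or as a name suffix.
BOT_NAME_RE = re.compile(r"\bbot\b|bot$", re.IGNORECASE)


def weak_label(name):
    """Return 1 if the name looks like a bot, else 0 (a noisy seed label)."""
    return int(bool(BOT_NAME_RE.search(name.strip())))


def activity_features(timestamps):
    """Simple statistics over commit timestamps in Unix seconds (illustrative features)."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return {
        "n_commits": len(ordered),
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,  # near zero for strictly periodic commits
    }


names = ["ci bot", "dependabot", "James White"]
labels = {name: weak_label(name) for name in names}
hourly = activity_features([0, 3600, 7200])
```

Name-based labels are noisy in both directions (a surname ending in "bot" is a false positive, a bot with a neutral name is missed), which is why they only seed the training set for a classifier that also looks at emails and activity.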

Conclusion

Engineering Team Distribution at GitLab

Thank you