Managing Software Engineer Talent Pools using Unsupervised Learning

Waren Long, Athenian.

ML Conference, Berlin   -   December 10th, 2019

warenlg.github.io/ml-conference-berlin-2019

About me

Plan

  1. Origins
  2. Developer Clustering
  3. Challenges

Origins

Motivation

Data Requirements

GitLab

Developer Clustering

Commit Time Series

  1. Collect and normalize the time series -> gitbase
  2. Compute the distance matrix using Dynamic Time Warping -> fastdtw
  3. Clustering -> HDBSCAN
  4. Dimension reduction -> UMAP
  5. Results for source{d}
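
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, assuming exact dynamic time warping written in pure Python instead of the approximate fastdtw library named on the slide, and assuming that "normalize" means scaling each series to sum to 1. The function names and the toy commit counts are hypothetical.

```python
from typing import Dict, List, Sequence


def normalize(series: Sequence[float]) -> List[float]:
    """Scale a commit-count series to sum to 1 (an all-zero series stays zero)."""
    total = sum(series)
    return [x / total if total else 0.0 for x in series]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]


def distance_matrix(series_by_dev: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Pairwise DTW distances between normalized series, in dict order."""
    normed = [normalize(s) for s in series_by_dev.values()]
    return [[dtw_distance(x, y) for y in normed] for x in normed]


# Toy weekly commit counts (made-up data for illustration only).
commits = {
    "alice": [0, 4, 4, 0],
    "bob": [0, 0, 4, 4],    # same burst as alice, one week later
    "carol": [4, 0, 0, 4],  # different rhythm
}
matrix = distance_matrix(commits)
```

In this toy example the shifted series (alice, bob) end up closer than the series with a different rhythm (alice, carol). A matrix like this can then be handed to a clustering or embedding step that accepts precomputed distances.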

Programming languages

  1. Collect LoC added/deleted/changed by developer by language -> hercules
  2. Exclude markup and autogenerated languages: XML, JSON, SVG, ...
  3. Saturate to the 95th percentile
  4. Dimension reduction
  5. Clustering in the embedding space
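
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, assuming the saturation is applied per language across developers and using a nearest-rank percentile; the slide does not say which percentile definition or axis was used. The function names and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Set


def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def saturate(
    loc_by_dev: Dict[str, Dict[str, int]],
    excluded: Set[str],
    q: float = 95.0,
) -> Dict[str, Dict[str, int]]:
    """Drop excluded languages, then clip each language column at its q-th percentile."""
    languages = sorted({lang for row in loc_by_dev.values() for lang in row} - excluded)
    caps = {
        lang: percentile([row.get(lang, 0) for row in loc_by_dev.values()], q)
        for lang in languages
    }
    return {
        dev: {lang: min(row.get(lang, 0), caps[lang]) for lang in languages}
        for dev, row in loc_by_dev.items()
    }


# Made-up data: 19 ordinary contributors and one extreme outlier.
loc = {f"dev{i}": {"Go": i, "JSON": 5} for i in range(1, 20)}
loc["outlier"] = {"Go": 10000, "JSON": 999999}

clipped = saturate(loc, excluded={"XML", "JSON", "SVG"})
```

Clipping keeps a single vendored or bulk-imported change from dominating the scale of a language column before dimension reduction.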

Source code identifiers

  1. Extract UASTs from source code files -> bblfsh
  2. Represent each code file as a bag of TF-IDF scores of its identifiers
  3. Assign each developer an aggregation of the bags of the files they contributed to
  4. Find decorrelated sparse topics -> BigARTM
  5. Manually label the topics
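
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration in pure Python, assuming identifiers have already been extracted (the talk uses bblfsh UASTs for that), that TF-IDF uses raw counts with a plain logarithmic IDF, and that the per-developer aggregation is a sum. None of these choices is confirmed by the slides, and all names and sample data are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def split_identifier(name: str) -> List[str]:
    """Split snake_case and camelCase identifiers into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def tfidf_bags(files: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each file to {token: tf * idf} computed over its identifier tokens."""
    counts = {
        path: Counter(tok for ident in idents for tok in split_identifier(ident))
        for path, idents in files.items()
    }
    doc_freq = Counter(tok for bag in counts.values() for tok in bag)
    n_docs = len(counts)
    return {
        path: {tok: tf * math.log(n_docs / doc_freq[tok]) for tok, tf in bag.items()}
        for path, bag in counts.items()
    }


def developer_bags(
    bags: Dict[str, Dict[str, float]],
    contributions: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Sum the bags of the files each developer contributed to (assumed aggregation)."""
    result = {}
    for dev, paths in contributions.items():
        total: Dict[str, float] = {}
        for path in paths:
            for tok, weight in bags[path].items():
                total[tok] = total.get(tok, 0.0) + weight
        result[dev] = total
    return result


# Made-up identifiers per file and contributions per developer.
files = {
    "server.py": ["flaskApp", "request_handler"],
    "views.py": ["flaskApp", "render_template"],
    "index.c": ["btree_insert", "btree_page"],
}
contributions = {"dev_a": ["server.py", "views.py"], "dev_b": ["index.c"]}

dev_vectors = developer_bags(tfidf_bags(files), contributions)
```

The resulting per-developer vectors are the kind of input a topic model would consume in step 4.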

Topic Modeling

Topic label        | Top terms
Backend frameworks | servlet, flask, javax
Language detection | language, java, linguist
Data mining        | chartj, graphql, average
Frontend + UI, CSS | modernizr, mstyle, elementn
Config management  | chef, runner, platform
Low-level backend  | btree, opclass, using

Teams of topic contributors

Frontend      | Config Management | Gollum (wiki)
Verify FE     | Monitor BE        | Product Management
Core          | Plan BE           | Support Department
Manage FE     | Core Alumni       | Core Alumni
Create FE     | Create BE         | Core
Monitor FE    | Distribution BE   |
Serverless FE |                   |

Challenges

Topic modeling on identifiers is hard

⇒ Easier to search for library experts

Who is who?

identity-matching

Bot Identities

Developer             | Number of commits
ci bot                | 1852
dependabot            | 1630
James White           | 847
final release builder | 719

Bot Detection

  1. Build a training set of bot identities by running regex on the names
  2. Train a BPE model on the emails
  3. Extract statistical features from the commit activity
  4. Train an XGBoost model to detect bots among developer identities
  5. Model published in modelforge (asdf) and included in identity-matching
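
Step 1 above can be sketched as follows. This is a minimal illustration, assuming a small hand-written pattern; the actual regular expressions used to build the training set are not given in the slides, so the pattern and function name here are hypothetical. The sample names are the ones listed under Bot Identities.

```python
import re
from typing import Dict, Iterable

# Hypothetical seed pattern: names containing "bot", "ci" or "builder" as a word,
# or ending in "bot" (e.g. "dependabot").
BOT_NAME_PATTERN = re.compile(r"\bbot\b|bot$|\bci\b|\bbuilder\b", re.IGNORECASE)


def seed_labels(names: Iterable[str]) -> Dict[str, bool]:
    """Label each identity name as bot (True) or not (False) from the name alone."""
    return {name: bool(BOT_NAME_PATTERN.search(name)) for name in names}


labels = seed_labels(["ci bot", "dependabot", "James White", "final release builder"])
```

Labels produced this way are noisy: a human surname ending in "bot" would be flagged, and a bot with an ordinary name would be missed. That is one reason to treat them only as a seed set for a classifier that also looks at emails and commit activity.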

Conclusion

Engineering Team Distribution at GitLab

Thank you