Empower developers with ML-assisted code review

AI Meetup - April 18th, 2019

Waren Long

source{d}

About source{d}

35 software engineers
Offices in Madrid and San Francisco but fully remote
Activities:
1. Code as Data to turn code into actionable insights
2. ML on Code e.g. assisted code review

sourced.tech

Plan

ML on Code: Origins & Motivation
Lookout
style-analyzer

ML on Code

Software Development Workflow

Code review: 5h/week, 13% of coding time, generally done by senior developers
72% of developers work with blocking code reviews
45% of developers lack time to review code

codacy.com/blog

At Google

25,000 developers
20,000 code reviews per workday
4h of average latency for the entire review process
15% of comments indicate a possible defect

Modern Code Review: A Case Study at Google
C. Sadowski et al., 2018

25 million PR Review Comments on GitHub

dataset

The Alternative Hypothesis

"Programming languages are inherently harder to write and read... so programmers deliberately write code as unsurprising as possible."

"Code (in all languages) is more predictable than natural language because it is more technical and difficult to learn."

On the Naturalness of Software
P. Devanbu et al., 2016

Software is bimodal

"Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. Because the two channels interact, [...] bimodality is a natural fit for machine learning."

RefiNym: Using Names to Refine Types
E. Barr et al., 2018

Lookout

When to help?

While you type = IDE
While you check = CI
While you review = PR
Periodically, asynchronously

Part of the workflow
More time to run the models
Nice UI
High precision score required
Longer feedback loop

Goals

Assisted code review platform
Tight git/GitHub integration
Language agnostic
Batteries included

Example of Lookout Comment on GitHub

Architecture

Push event

Review event

style-analyzer

Approach

Parse to intermediate representation
Train Decision Tree Forest
Extract production rules
Generate fixes from mismatched predictions
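The last step of the approach can be sketched as follows. This is a minimal illustration, not style-analyzer's actual code: the `Sample` record, the `suggest_fixes` helper, and the 0.9 confidence threshold are all hypothetical. The idea is that a fix is proposed wherever the trained model predicts a formatting class that differs from the one found in the file, provided the model is confident enough.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    line: int          # line of the formatting gap (hypothetical field)
    actual: str        # formatting class found in the file
    predicted: str     # formatting class the model expects
    confidence: float  # confidence of the rule that fired


def suggest_fixes(samples, min_confidence=0.9):
    """Return (line, message) pairs where the model disagrees with the code."""
    fixes = []
    for s in samples:
        # Only confident mismatches become review comments.
        if s.predicted != s.actual and s.confidence >= min_confidence:
            fixes.append((s.line, f"replace {s.actual!r} with {s.predicted!r}"))
    return fixes


samples = [
    Sample(1, "␣", "␣", 0.99),  # matches the learned style: no fix
    Sample(2, "'", '"', 0.97),  # confident mismatch: fix suggested
    Sample(3, "∅", "␣", 0.55),  # mismatch, but below the threshold
]
fixes = suggest_fixes(samples)
print(fixes)
```

Filtering on confidence reflects the high-precision requirement of review-time feedback: a wrong suggestion costs more reviewer trust than a missed one.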

Representations of Source Code

Token-level models
→ Raw content

Syntactic models
→ Abstract Syntax Tree (AST)

doc.bblf.sh/architecture

Classes Predicted by style-analyzer

␣	whitespace
→	tabulation
↲	newline
␣+/-	whitespace indentation increase/decrease
→+/-	tabulation indentation increase/decrease
'/"	single/double quotes
∅	empty gaps between non-label nodes, NOOP

Feature Extraction

AST-augmented token stream
Window size of 10 tokens
2 levels up in the AST hierarchy
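A rough sketch of this feature extraction, under stated assumptions: the `Token` record and the `extract_features` helper are hypothetical, and the 10-token window is read here as 5 tokens on each side of the gap being classified (the slide does not say whether the window is symmetric). Each token carries the roles of its AST ancestors, nearest parent first.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    ancestors: list = field(default_factory=list)  # nearest parent first


def extract_features(stream, index, window=10, levels=2):
    """Describe the token at `index` by its neighbours and AST parents."""
    features = {}
    half = window // 2
    for offset in range(-half, half + 1):
        if offset == 0:
            continue  # the gap itself is the label, not a feature
        j = index + offset
        # Positions outside the stream are padded with None.
        features[f"tok{offset:+d}"] = stream[j].text if 0 <= j < len(stream) else None
    # Keep only the first `levels` ancestors in the AST hierarchy.
    for depth, role in enumerate(stream[index].ancestors[:levels], start=1):
        features[f"parent{depth}"] = role
    return features


stream = [
    Token("if", ["IfStatement", "Block", "Function"]),
    Token("␣", ["IfStatement", "Block", "Function"]),
    Token("(", ["IfStatement", "Block"]),
]
features = extract_features(stream, 1)
print(features["tok-1"], features["tok+1"], features["parent1"], features["parent2"])
```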

Annotated Code Snippet

∅function␣classesToArray∅(␣value␣)␣{↲
⇥   if␣(␣isArray∅(␣value␣)␣)␣{∅return␣value∅;∅}
    if␣(␣typeof␣value␣===␣"string"␣)␣{↲
⇥       return␣value∅.∅match(␣rnothtml␣)␣||␣[]∅;↲
⇤   }↲
    return␣[]∅;↲
⇤}∅

Explainability is key

Build trust with the users
Prefer interpretable output with human-readable decisions
Decision Tree Forest models
Optimize the number and the lengths of the rules

Generating Production Rules From Decision Trees
J.R. Quinlan, 1987

Rules

a≤5 ∧ b≤1 ∧ c ⇒ α
a≤5 ∧ 1<b<4 ⇒ β
5<a<10 ∧ c ⇒ γ
a>5 ∧ c ∧ b>2 ⇒ α
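Rules of this shape can be read off a decision tree by following each root-to-leaf path. The sketch below shows only that path enumeration, on a small hypothetical tree (not the one behind the rules above). The pruning and merging of conditions that keeps rules few and short is left out.

```python
def extract_rules(node, conditions=()):
    """Turn every root-to-leaf path into a (conditions, label) rule.

    A leaf is a label string; an internal node is a tuple
    (feature, threshold, left, right), where `left` is taken
    when feature <= threshold.
    """
    if isinstance(node, str):
        return [(list(conditions), node)]
    feature, threshold, left, right = node
    return (
        extract_rules(left, conditions + (f"{feature}≤{threshold}",))
        + extract_rules(right, conditions + (f"{feature}>{threshold}",))
    )


# Hypothetical tree: split on a, then on b in the left branch.
tree = ("a", 5, ("b", 1, "α", "β"), "γ")
rules = extract_rules(tree)
for conds, label in rules:
    print(" ∧ ".join(conds), "⇒", label)
```

For a forest, the same walk would be repeated over every tree and the resulting rules pooled before simplification.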

Machine Learning

Feature selection (univariate, ANOVA F-criterion)
Hyperparameter optimization (Bayesian)
Evaluation: 80%/20% train/test split
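Two of these steps are simple enough to sketch with the standard library alone. The helpers `anova_f` and `split_80_20` are hypothetical names, and the code illustrates the general techniques rather than style-analyzer's implementation. Bayesian hyperparameter optimization is not shown.

```python
import random
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature across label groups.

    Assumes at least two groups and some variance within groups.
    """
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    k, n = len(groups), len(values)
    # Variance explained by the class label vs. variance left inside classes.
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values()) / (k - 1)
    within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g) / (n - k)
    return between / within


def split_80_20(samples, seed=0):
    """Shuffle deterministically, then cut into 80% train and 20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


labels = ["a", "a", "a", "b", "b", "b"]
informative = anova_f([1, 2, 3, 7, 8, 9], labels)  # classes well separated
noisy = anova_f([1, 8, 3, 2, 7, 9], labels)        # classes overlap
train, test = split_80_20(range(10))
print(informative > noisy, len(train), len(test))
```

Univariate selection then keeps the features with the highest F scores, each scored independently of the others.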

Evaluation

[Chart: Precision and PredR; ~95% weighted avg.]
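A minimal sketch of how the two metrics could be computed, assuming PredR stands for the prediction rate, i.e. the share of samples on which the model emits a prediction at all, and that precision is measured only on those samples. The helper name and the use of `None` for abstentions are hypothetical. This computes a single overall figure, not the per-class weighted average.

```python
def precision_and_predr(y_true, y_pred):
    """Return (precision, prediction rate); None in y_pred means 'no prediction'."""
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    predr = len(answered) / len(y_true)
    # Share of emitted predictions that match the ground truth.
    precision = sum(t == p for t, p in answered) / len(answered) if answered else 0.0
    return precision, predr


y_true = ["␣", "↲", "∅", "␣", "'"]
y_pred = ["␣", "↲", None, "∅", "'"]  # one abstention, one wrong prediction
precision, predr = precision_and_predr(y_true, y_pred)
print(precision, predr)
```

Reporting both numbers matters because a model can raise its precision simply by abstaining more often.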

Evaluation improvements

Extend classes
Test the real behaviour of users
Random mutations
Extract from commits

Code as Data and ML on Code Applications

Clone Detection: src-d/gemini
Topic Modeling: src-d/vecino
Developer Similarity: src-d/dev-similarity
Git Repository Analysis:

Thank you

Contacts:

Resources: