ML on Code

Waren Long, source{d}.

Machine Learning on Code

from n-grams to GGNNs

Waren Long

Nantes ML Meetup - July 1st, 2019

Programs as

Token sequences → ASTs → Graphs

Early days

The first language models

Code is bimodal

"Source code is bimodal: it combines a formal algorithmic channel and a natural language channel of identifiers and comments. Because the two channels interact, [...] bimodality is a natural fit for machine learning."

  Earl Barr

fMRI scans of skilled programmers show the language-processing areas of the brain active when reading code

Decoding the representation of code in the brain
  B. Floyd et al. 2017

... but code is hard to write

"Programming languages are inherently harder to write and read... so programmers deliberately write code as unsurprising as possible."

"Code (in all languages) is more predictable than natural language because it is more technical and difficult to learn."

  Prem Devanbu at ML4P

On the Naturalness of Software
  A. Hindle et al. 2012

n-gram language model

On the Naturalness of Software
  A. Hindle et al. 2012
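The n-gram idea from Hindle et al. can be sketched as a toy next-token predictor over a code token stream. The corpus and tokenization below are illustrative, not from the paper:

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n=3):
    """Count n-grams to estimate P(token | n-1 previous tokens)."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict(counts, context):
    """Return the most frequent token observed after `context`."""
    ranked = counts[tuple(context)].most_common(1)
    return ranked[0][0] if ranked else None

# Toy "corpus": one tokenized line of code
tokens = "for i in range ( n ) : total += i".split()
model = train_ngram(tokens, n=3)
print(predict(model, ["in", "range"]))  # prints "("
```

Because code is so repetitive, even such a tiny model captures idioms like `range(` with high confidence; real models smooth the counts and train on millions of lines.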

Source code vocabularies

Modeling Vocabulary for Big Code Machine Learning
  R. Robbes et al. 2019

Token-based applications

The unveiling

Syntactical Features

Syntactical representation

Natural Language: I shot an elephant in my pyjamas
Code: Assert.NotNull(clazz)

Different Context

• Token neighbors

• AST-node neighbors

• AST paths

Design tradeoff from code2vec

Gemini

Code deduplication at Scale

  1. Create Public Git Archive, a dataset of 180k popular GitHub repos
  2. Extract UASTs and combine various syntactical features
  3. Create a pairwise similarity graph using Locality-Sensitive Hashing
  4. Extract connected components from the graph
  5. Perform community detection to highlight clone communities

src-d/gemini, source{d}
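Steps 3 and 4 above can be sketched with MinHash signatures and LSH banding. This is a toy version with made-up token sets; Gemini's actual feature extraction, hash family, and thresholds differ:

```python
import hashlib

def minhash(tokens, num_hashes=64):
    """MinHash signature: for each seed, keep the minimum hash over the
    token set. Similar token sets yield mostly-equal signatures."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in set(tokens)
        ))
    return sig

def lsh_buckets(signatures, bands=16):
    """Split signatures into bands; items sharing any band fall in the
    same bucket, giving candidate clone pairs without an all-pairs scan."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for name, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    return [v for v in buckets.values() if len(v) > 1]

# Toy "files" as token sets: a and b are clones, c is unrelated
sigs = {"a": minhash("def f x return x".split()),
        "b": minhash("def f x return x".split()),
        "c": minhash("class Foo bar baz".split())}
print(lsh_buckets(sigs))
```

The buckets form the edges of the pairwise similarity graph; connected components and community detection then run on top of it.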

1. Public Git Archive

Public Git Archive: a Big Code Dataset for All
  V. Markovtsev et al. 2018

2. Feature extraction

3. Community detection

code2vec

  1. AST path extraction
  2. Distributed representation of contexts
  3. Path-attention network
  4. Evaluation on semantic labelling

code2vec: Learning Distributed Representations of Code
  U. Alon et al. 2018

Attention mechanism
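code2vec's path-attention pooling (step 3 above) can be sketched as a softmax-weighted sum of path-context vectors. This toy has no learned parameters; the context vectors and the attention vector are assumed given:

```python
import math

def attention_pool(context_vecs, attn_vec):
    """Score each path-context vector against the attention vector,
    softmax the scores, and return the weighted sum as the code vector."""
    scores = [sum(c * a for c, a in zip(vec, attn_vec))
              for vec in context_vecs]
    m = max(scores)                      # stabilize the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]   # attention weights, sum to 1
    dim = len(context_vecs[0])
    code_vec = [sum(alphas[i] * context_vecs[i][d]
                    for i in range(len(context_vecs)))
                for d in range(dim)]
    return code_vec, alphas

# Two toy contexts; the first aligns with the attention vector
vec, alphas = attention_pool([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
print(vec, alphas)
```

The attention weights make the model interpretable: for each prediction you can inspect which AST paths it attended to.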

State of the Art

GGNNs

Programs as graphs

Start from syntax

            Assert.NotNull(clazz)
        

Programs as graphs

Adding data flows

Potentially big graphs

            def sum_positive(arr, lim):
              sum = 0
              for i in range(lim):
                  if arr[i] > 0:
                      sum += arr[i]
              return sum
        

~900 nodes/graph

~8k edges/graph
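The syntax backbone of such a graph can be measured with Python's `ast` module; the data-flow edges mentioned above come on top of these parent-child edges, which is why edge counts grow much faster than node counts:

```python
import ast

code = """
def sum_positive(arr, lim):
    sum = 0
    for i in range(lim):
        if arr[i] > 0:
            sum += arr[i]
    return sum
"""

tree = ast.parse(code)
# Collect every AST node, then the parent->child syntax edges.
nodes = list(ast.walk(tree))
edges = [(parent, child)
         for parent in nodes
         for child in ast.iter_child_nodes(parent)]
print(len(nodes), len(edges))  # a tree: edges = nodes - 1
```

Even this six-line function yields dozens of syntax nodes; adding typed data-flow edges (LastRead, LastWrite, ComputedFrom, ...) is what pushes real program graphs to thousands of edges.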

Graph Neural Networks

The Graph Neural Network Model
  F. Scarselli et al. 2009
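One propagation round over such a graph can be sketched as follows. A real GGNN uses a GRU cell for the state update and one weight matrix per edge type; this toy replaces both with a single tanh layer, and all weights are random rather than learned:

```python
import numpy as np

def ggnn_step(h, adj, W_msg, W_upd):
    """One round of message passing: each node aggregates transformed
    neighbor states along the adjacency matrix, then updates its state."""
    messages = adj @ h @ W_msg           # sum of neighbor states, transformed
    return np.tanh(h @ W_upd + messages) # simplified update (GRU in real GGNNs)

rng = np.random.default_rng(0)
n, d = 4, 8                              # 4 nodes, state dimension 8
h = rng.normal(size=(n, d))              # initial node states
adj = np.array([[0, 1, 0, 0],            # toy chain graph, one edge type
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
W_msg = rng.normal(size=(d, d)) * 0.1
W_upd = rng.normal(size=(d, d)) * 0.1
for _ in range(8):                       # unrolled propagation steps
    h = ggnn_step(h, adj, W_msg, W_upd)
print(h.shape)
```

After enough rounds, each node's state mixes in information from every node within that many hops; a readout layer then maps the final states to the prediction task.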


Thank you