How to Build a Code Search Tool Using PyTorch Transformers and Annoy

By Youness Mansar | Feb 2022

Leveraging joint text and code embeddings for search.

Have you ever looked for a code snippet on Google because you were too lazy to write it yourself? Most of us have! So how about building your own code search tool from scratch? That is what we will try to do in this project. We will use a parallel dataset of Stack Overflow questions and corresponding code samples to build a system that ranks existing code snippets by how well they “match” a search query expressed in natural language.

There is a Colab notebook link at the end of the blog if you want to try out the approach.

We are trying to build a system that can take a question expressed in natural language and return the code snippet that best matches it, out of a set of hundreds of thousands of Python code snippets.

The system will be based on learning embeddings for code snippets and embeddings for “How-to” questions. The search part will use Annoy, an approximate nearest neighbors implementation in Python/C++.

The data:

We will use the StaQC dataset, which contains (question, code snippet) pairs mined automatically from Stack Overflow.

Samples from the dataset look like this:

Question: How to get random single document from 1 billion documents in MongoDB using python?

Code:

import random

collection = mongodb["collection_name"]
rand = random.random()  # rand will be a floating point number between 0 and 1
random_record = collection.find_one({'random': {'$gte': rand}})

The dataset has 85,294 single-code snippet answers and 60,083 multi-code snippet answers.

Preprocessing:

Both the questions and the code snippets are strings, so we need to tokenize them before feeding them into a model. Since we are working with code, we can’t use regular pre-trained tokenizers and thus have to build our own.

First, we need our tokenizer to split on special characters like “_” or “=” and correctly split camelCase variable names.

We do that in two steps:

  • Add space around special characters and add space around words in camelCase.
  • Train a Hugging Face tokenizer with whitespace pre-tokenization (a minimal sketch of both steps follows).
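
Here is one way the two steps could look in code, assuming the Hugging Face tokenizers library; the helper name, the corpus variable, and the vocabulary size are illustrative, not the project’s exact values:

import re
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def add_spaces(text):
    # Step 1: space out special characters, then split camelCase words.
    text = re.sub(r"([^A-Za-z0-9\s])", r" \1 ", text)   # "a=b"     -> "a = b"
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)    # "myLabel" -> "my Label"
    return text

# Step 2: train a WordPiece tokenizer that pre-tokenizes on whitespace.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=10000,  # illustrative vocabulary size
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"],
)
tokenizer.train_from_iterator((add_spaces(s) for s in corpus), trainer=trainer)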

This gives the following result:

Question ->

How to get a random single document from 1 billion documents in MongoDB using python?

Tokenized question ->

['[CLS]', 'how', 'to', 'get', 'a', 'random', 'single', 'document', 'from', '1', 'bill', '##ion', 'documents', 'in', 'mongo', 'db', 'using', 'python', '?', '[SEP]']

Code snippet ->

class MyLabel(Label): pass

class AwesomeApp(App):
    def build(self):
        return MyLabel()

Tokenized code ->

['[CLS]', 'class', 'my', 'label', '(', 'label', ')', ':', 'pass', 'class', 'awesome', 'app', '(', 'app', ')', ':', 'def', 'build', '(', 'self', ')', ':', 'return', 'my', 'label', '(', ')', '[SEP]']

The ML part:

In this part, we will train a model that maps a string (either a code snippet or a question) to a fixed-dimension vector of size 256. The constraint we will apply to this output vector is that the similarity between the vector of a question and the vector of its corresponding code snippet should be high, while the similarity should be low in all other cases.
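
As a rough sketch of what such an encoder could look like in PyTorch (the layer sizes, depth, and mean-pooling choice are assumptions, not the project’s exact architecture):

import torch
import torch.nn as nn

class SnippetEncoder(nn.Module):
    # Hypothetical encoder: embed tokens, run a small transformer,
    # mean-pool over the sequence, and project to a 256-d vector.
    def __init__(self, vocab_size, dim=128, out_dim=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer ids from the tokenizer above.
        hidden = self.encoder(self.embed(token_ids))
        return self.proj(hidden.mean(dim=1))  # (batch, 256)

The same encoder can be used to embed both questions and code snippets.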

We use a bilinear similarity measure:

sim(q, c) = qᵀ W c

where q is a question embedding, c is a code embedding, and W is a learnable parameter. See “Contrastive Learning of General-Purpose Audio Representations” for details.

In order to learn useful embeddings, we need positive examples: the pairs of “How-to” questions and their corresponding code snippets. We also need negative examples, which here are all the other vectors in the batch. The representations of a positive pair are pushed toward a higher similarity score, while the similarity between a sample’s vector and all the unrelated vectors in the batch is pushed toward lower values.

In other words, we want the batch similarity matrix to be as close to the identity matrix as possible.

This way of doing contrastive learning is taken from the paper referenced above. It gives us negative examples for “free”, since they are already computed as part of the batch. It also has a big advantage over approaches like the triplet loss: instead of having only one negative example for each positive one, we get batch_size - 1 negatives, which forces the encoder to learn more discriminative embeddings.
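
A minimal sketch of this loss, assuming paired question/code embeddings of size 256 (the class name and the identity initialization of W are illustrative):

import torch
import torch.nn.functional as F

class BilinearContrastiveLoss(torch.nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # W is the learnable bilinear weight from the similarity measure above.
        self.W = torch.nn.Parameter(torch.eye(dim))

    def forward(self, questions, codes):
        # questions, codes: (batch_size, dim) embeddings of matching pairs.
        # logits[i, j] = sim(question_i, code_j); the diagonal holds the positives.
        logits = questions @ self.W @ codes.t()
        targets = torch.arange(questions.size(0), device=logits.device)
        # Cross entropy pushes the diagonal up and the off-diagonal entries down,
        # i.e. it pushes the batch similarity matrix toward the identity.
        return F.cross_entropy(logits, targets)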

The whole point of mapping text input to fixed-dimension vectors is that it makes the search part much more efficient, by using approximate nearest neighbors algorithms.

The search part:

Once we have all the code snippets encoded into vectors and a model to encode a new question the same way, we can start working on the search part of the system. We will be using a library called Annoy.

Using Annoy will allow us to query a dataset of n code snippets in O(log(n)) time.

from annoy import AnnoyIndex

# We define the Annoy index
t = AnnoyIndex(len(samples[0]["vector"]), "angular")

# We add all the vectors to the index
for i, sample in enumerate(samples):
    t.add_item(i, sample["vector"])

# We then build the trees for the approximate search
t.build(100)

# We can use it to query the nearest neighbors of a question vector
indexes = t.get_nns_by_vector(search_vector, n=3, search_k=-1)
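
Putting it all together at query time could look like the sketch below, where model, tokenizer, add_spaces, and samples are the hypothetical names from the earlier sketches:

# Hypothetical end-to-end query: encode the question with the trained model,
# then look up its nearest code snippets in the Annoy index.
query = "How to escape HTML special characters?"
token_ids = torch.tensor([tokenizer.encode(add_spaces(query)).ids])
search_vector = model(token_ids)[0].detach().numpy()

for i in t.get_nns_by_vector(search_vector, 3, search_k=-1):
    print(samples[i]["code"])

Below are the actual results for two example queries: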

How to escape HTML special characters?

# Search Result 1 -->
raw_html2 = raw_html.replace('\n', '')

# Search Result 2 -->
def escape(htmlstring):
    escapes = {'"': '&quot;',
               "'": '&#39;',
               '<': '&lt;',
               '>': '&gt;'}
    # This is done first to prevent escaping other escapes.
    htmlstring = htmlstring.replace('&', '&amp;')
    for seq, esc in escapes.iteritems():
        htmlstring = htmlstring.replace(seq, esc)
    return htmlstring

# Search Result 3 -->
mytext.replace("\r\n", "\n")

# Query encoded in 0.010133743286132812 seconds
# Search results found in 0.00034046173095703125 seconds

How uniquify list?

# Search Result 1 -->
seen = set()
L = []
if 'some_item' not in seen:
    L.append('some_item')
    seen.add('some_item')

# Search Result 2 -->
def is_duplicate(a, b):
    if a['name'] == b['name'] and a['cost'] == b['cost'] and abs(int(a['cost']-b['cost'])) < 2:
        return True
    return False

newlist = []
for a in oldlist:
    isdupe = False
    for b in newlist:
        if is_duplicate(a, b):
            isdupe = True
            break
    if not isdupe:
        newlist.append(a)

# Search Result 3 -->
>>> L = [2, 1, 4, 3, 5, 1, 2, 1, 1, 6, 5]
>>> S = set()
>>> M = []
>>> for e in L:
...     if e in S:
...         continue
...     S.add(e)
...     M.append(e)
...
>>> M
[2, 1, 4, 3, 5, 6]

M = list(set(L))

# Query encoded in 0.006329536437988281 seconds
# Search results found in 0.0003654956817626953 seconds

We can see that this system can, in a few milliseconds, find the relevant code snippets among 145k possible answers.

In this small project, we trained a transformer model to predict embeddings for both text and code. Those embeddings were then used to build a KNN search index using approximate nearest neighbors in Annoy.
This approach could be improved further by using an even bigger dataset of question/code pairs or by benchmarking different contrastive learning strategies.

You can run your own queries by cloning the repo:

Or by running this Colab notebook:

References:

Yao, Z., Weld, D. S., Chen, W.-P., & Sun, H. (2018). StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow. WWW 2018.

Saeed, A., Grangier, D., & Zeghidour, N. (2021). Contrastive Learning of General-Purpose Audio Representations. ICASSP 2021.