
Large Language Models for Code Generation – Part 1

What are code-generating LLMs? Helping developers be more productive

Introduction

With all the excitement about large language models and AGI powering applications everywhere, we developers have been quietly benefitting from an important use of this technology: code generation.

Code-writing assistance has been around for 40+ years, starting with syntax highlighting and autocompletion in IDEs, followed by code analysis and linting. More recently, AI-assisted tools like DeepCode (now Snyk) leveraged machine learning for more intelligent suggestions.

Figure 1: History of code-writing assistants

And then came GitHub Copilot (powered by OpenAI Codex) and everything changed!

Today many developers (including many of us at Vectara) use GitHub Copilot as well as the open source StarCoder to become more efficient and productive in creating software. Google Bard now also provides code generation capabilities that are quite good.

It is widely speculated that modern LLMs will eventually replace some jobs currently performed by humans, which raises the question: will developers eventually be replaced by code-generating LLMs?

In this first part, I explore in more detail how code-generating LLMs work and show some examples of what they do well and where they fail. In part 2, we will discuss what we can expect next for code-generating LLMs and whether they will ever replace developers.

What are code-generating LLMs?

As I’ve explained in the blog post about LLMs and Hallucinations, a large language model (LLM) is a neural network model, trained on a massive corpus of text to “predict the next token”.
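To make "predict the next token" concrete, here is a minimal sketch using the Hugging Face transformers library and GPT-2 (a small general-purpose model, used here only to illustrate the mechanism; code LLMs work the same way but are trained on code-heavy corpora):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# the model assigns a score (logit) to every vocabulary token
# as a possible continuation of the prompt
inputs = tokenizer("def add(a, b):\n    return", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring next token; generation simply repeats this step
next_token = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token]))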

A code-generating LLM is trained on a more specialized dataset that includes code repositories, technical forums, coding platforms, documentation of various products and general web data that’s useful for this purpose. 

Because code-generating LLMs are integrated with your IDE, they can draw on the context of your code (comments, function names, and variable names) and use that contextual information to further improve the suggestions they make.

This works very well in practice. Here's a simple example: when I write the following comment in Python to describe the functionality I would like:

# extract domain from URL

Copilot completes this to:

# extract domain from URL
def get_domain(url):
    return url.split('//')[-1].split('/')[0]

And Starcoder responds with:

# extract domain from URL
def get_domain(url):
    return urlparse.urlsplit(url).netloc

Both are valid solutions and work well, although the second one assumes Python 2's urlparse module; in Python 3 you would import urlsplit from urllib.parse instead.
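For reference, here is a self-contained Python 3 version of both suggestions (the import line and the test at the bottom are my additions):

from urllib.parse import urlsplit

def get_domain_copilot(url):
    # plain string splitting: everything after '//' and before the next '/'
    return url.split('//')[-1].split('/')[0]

def get_domain_starcoder(url):
    # standard-library URL parsing; netloc is the host (plus port, if present)
    return urlsplit(url).netloc

url = "https://vectara.com/blog/post"
print(get_domain_copilot(url))    # vectara.com
print(get_domain_starcoder(url))  # vectara.com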

Now let’s try something more complex.

We prompt with:

def quicksort(a):

And Copilot responds with:

def quicksort(a):
    # if the input list contains fewer than 2 elements
    if len(a) < 2:
        # return the input list
        return a
    # otherwise
    else:
        # select the first element of the input list as the pivot
        pivot = a[0]
        # create a list of all the elements less than the pivot
        less = [i for i in a[1:] if i <= pivot]
        # create a list of all the elements greater than the pivot
        greater = [i for i in a[1:] if i > pivot]
        # return the sorted list of all the elements less than the pivot
        # concatenated with the pivot
        # concatenated with the sorted list of all the elements greater than the pivot
        return quicksort(less) + [pivot] + quicksort(greater)

And right there we have a Python implementation of quicksort, complete with comments explaining the logic behind each step.
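A quick sanity check (my addition, not part of Copilot's output) confirms the suggestion behaves as expected:

print(quicksort([3, 6, 1, 8, 2, 9]))  # [1, 2, 3, 6, 8, 9]
print(quicksort([]))                  # [] (the base case handles empty input)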

Copilot is based on Codex, a GPT-3 descendant fine-tuned on code. Imagine what it can do with GPT-4, or with the even more powerful models that will follow. In fact, GPT-4 itself is already capable of generating code directly from a prompt, and it not only provides the code but also explains it:

Me: “Please write a python script to go through all files in a folder, read the contents of each file that ends with ‘html’ and print the title of that file”

GPT4:

You can achieve this using the os module to list files in a directory and the BeautifulSoup4 library to parse the HTML files and extract the title. First, install the beautifulsoup4 package if you haven’t already:

pip install beautifulsoup4

Then, you can use the following Python script:

import os
from bs4 import BeautifulSoup

def extract_title_from_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        soup = BeautifulSoup(content, 'html.parser')
        title = soup.title.string if soup.title else None
        return title

def main():
    folder_path = 'path/to/your/folder'

    for filename in os.listdir(folder_path):
        if filename.endswith('.html'):
            file_path = os.path.join(folder_path, filename)
            title = extract_title_from_html(file_path)
            if title:
                print(f"Title of {filename}: {title}")
            else:
                print(f"No title found in {filename}")

if __name__ == '__main__':
    main()

Replace ‘path/to/your/folder’ with the actual path to the folder containing the HTML files. This script will go through all the files in the specified folder, read the contents of each file ending with ‘.html’, and print the title of that file.

This is quite remarkable: a full solution, including instructions on which package to install, how to use the code, and more.
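One small tweak I would make before using it (my addition, not part of GPT-4's answer) is to take the folder path from the command line instead of editing the script:

import sys

def main(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith('.html'):
            file_path = os.path.join(folder_path, filename)
            title = extract_title_from_html(file_path)
            if title:
                print(f"Title of {filename}: {title}")
            else:
                print(f"No title found in {filename}")

if __name__ == '__main__':
    # default to the current directory when no argument is given
    main(sys.argv[1] if len(sys.argv) > 1 else '.')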

Hallucinations of code-generating LLMs

As a developer myself, I am super excited about tools like StarCoder and GitHub Copilot, but I’m also cautious: what guarantees do I have that they work as advertised?

As we know, LLMs sometimes hallucinate, and code-generating LLMs are not immune. As with other LLM hallucinations, the generated code may look great and appear to work, but it may not do the right thing, creating subtle bugs that are hard to find.

For example, I provided this prompt:

# function to select the lowest two even numbers that sum to 17

This, of course, is an impossible task: the sum of two even numbers is always even, so no two even numbers can sum to 17.

But Copilot quite confidently suggested:

def select_two_even_numbers(numbers):
    # filter out odd numbers
    even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
    # sort the list
    even_numbers.sort()
    # return the last two elements
    return even_numbers[-2:]

Which is wrong on two counts: it returns the two largest even numbers rather than the lowest, and it never checks the (impossible) sum condition.
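A quick test (mine, not Copilot's) makes the failure obvious:

result = select_two_even_numbers(range(1, 20))
print(result, sum(result))  # [16, 18] 34: not 17, and no error is raised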

This becomes even worse when the generated code introduces security weaknesses that can later be exploited by attackers. For example, with this prompt:

# using os.system, ask user for input file and use cat to display its contents
def cat_file():

The output is:

def cat_file():
    filename = input("Enter a filename: ")
    os.system("cat " + filename)

Aside from the missing import os, this works as advertised, but there is clearly a significant security risk in this code, as the user may, for example, enter the following:

important_file.txt; rm -rf /

And if the user executing this script has admin rights, this may result in deletion of all data on the local disk.
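A safer pattern (my sketch, not something Copilot suggested here) avoids the shell entirely, so the filename is never interpreted as a command:

import subprocess

def cat_file():
    filename = input("Enter a filename: ")
    # passing the arguments as a list bypasses the shell, so
    # "important_file.txt; rm -rf /" is treated as one odd filename,
    # not as two commands
    subprocess.run(["cat", filename], check=False)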
The bottom line is this: we need a way to validate that code generated by an LLM is correct and does not introduce security vulnerabilities, or find other ways to address this issue (a formally verified seL4 kernel, anyone?), to ensure LLM code generation does not result in less secure software.
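One lightweight habit that helps today is to treat generated code like any other untrusted contribution and put tests around it before it ships. A minimal sketch with pytest, reusing the earlier URL example (the module name and test cases are mine):

import pytest
from url_utils import get_domain  # hypothetical module holding the generated code

@pytest.mark.parametrize("url, expected", [
    ("https://vectara.com/blog/post", "vectara.com"),
    ("http://example.org", "example.org"),
])
def test_get_domain(url, expected):
    assert get_domain(url) == expected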

Summary

Code-generating LLMs like GitHub Copilot and StarCoder, among others, are large language models trained on datasets specifically collected so the model learns to generate code in many languages and across multiple use cases.

As the examples above demonstrate, code-generating LLMs can be quite helpful and can increase developer productivity in many day-to-day coding tasks.

As this technology improves, we will learn how best to work with code-generating LLMs: what prompts to use to get the results we want, and how to catch hallucinations or other mistakes the LLM makes.

I'm equally intrigued by the possibility of code-generating LLMs helping other members of a technology organization who interact with the engineering team, like product managers, data analysts, and others.

In part 2 (to be published 5/18) we’ll explore more advanced capabilities of code-generating LLMs, where this technology is headed, and whether or not we should expect software developers to be replaced by LLMs anytime in the near future.
