
Large Language Models for Code Generation – Part 1

What are code-generating LLMs? Helping developers be more productive

Introduction

With all the excitement about large language models and AGI powering applications everywhere, we developers have been quietly benefitting from an important use of this technology: code generation.

Code-writing assistance has been around for 40+ years, starting with syntax highlighting and autocompletion in IDEs, followed by code analysis and linting. More recently, AI-assisted tools like DeepCode (now Snyk) leveraged machine learning for more intelligent suggestions.

Figure 1: History of code-writing assistants

And then came GitHub Copilot (powered by OpenAI Codex) and everything changed!

Today many developers (including many of us at Vectara) use GitHub Copilot as well as the open source StarCoder to become more efficient and productive in creating software. Google Bard now also provides code generation capabilities that are quite good.

It is widely speculated that modern LLMs will eventually replace some jobs currently performed by humans, which raises the question: will developers eventually be replaced by code-generating LLMs?

In this first part, I explore in more detail how code-generating LLMs work and show some examples of what they do well and where they fail. In part 2, we will discuss what we can expect next for code-generating LLMs and whether they will ever replace developers.

What are code-generating LLMs?

As I’ve explained in the blog post about LLMs and Hallucinations, a large language model (LLM) is a neural network model, trained on a massive corpus of text to “predict the next token”.
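To make "predict the next token" concrete, here is a minimal sketch using the Hugging Face transformers library and GPT-2 (a small general-purpose model, used here only to illustrate the mechanism; code LLMs work the same way but are trained on code-heavy corpora):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# the model assigns a score (logit) to every vocabulary token
# as a possible continuation of the prompt
inputs = tokenizer("def add(a, b):\n    return", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# pick the highest-scoring next token; generation simply repeats this step
next_token = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token]))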

A code-generating LLM is trained on a more specialized dataset that includes code repositories, technical forums, coding platforms, documentation of various products and general web data that’s useful for this purpose. 

Because code-generating LLMs are integrated with your IDE, they can draw on the context of your code (comments, function names, and variable names) and use that contextual information to further improve the suggestions they make.

This works very well in practice. Here's a simple example: when I write the following comment in Python to describe the functionality I would like:

# extract domain from URL

Copilot completes this to:

# extract domain from URL
def get_domain(url):
    return url.split('//')[-1].split('/')[0]

And Starcoder responds with:

# extract domain from URL
def get_domain(url):
    return urlparse.urlsplit(url).netloc

Both are valid solutions and work well, although the second one assumes Python 2's urlparse module; in Python 3 you would import urlsplit from urllib.parse instead.
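For reference, here is a self-contained Python 3 version of both suggestions (the import line and the test at the bottom are my additions):

from urllib.parse import urlsplit

def get_domain_copilot(url):
    # plain string splitting: everything after '//' and before the next '/'
    return url.split('//')[-1].split('/')[0]

def get_domain_starcoder(url):
    # standard-library URL parsing; netloc is the host (plus port, if present)
    return urlsplit(url).netloc

url = "https://vectara.com/blog/post"
print(get_domain_copilot(url))    # vectara.com
print(get_domain_starcoder(url))  # vectara.com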

Now let’s try something more complex.

We prompt with:

def quicksort(a):

And Copilot responds with:

def quicksort(a):
    # if the input list contains fewer than 2 elements
    if len(a) < 2:
        # return the input list
        return a
    # otherwise
    else:
        # select the first element of the input list as the pivot
        pivot = a[0]
        # create a list of all the elements less than the pivot
        less = [i for i in a[1:] if i <= pivot]
        # create a list of all the elements greater than the pivot
        greater = [i for i in a[1:] if i > pivot]
        # return the sorted list of all the elements less than the pivot
        # concatenated with the pivot
        # concatenated with the sorted list of all the elements greater than the pivot
        return quicksort(less) + [pivot] + quicksort(greater)

And right there we have a Python implementation of quicksort, complete with comments explaining the logic behind each step.
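A quick sanity check (my addition, not part of Copilot's output) confirms the suggestion behaves as expected:

print(quicksort([3, 6, 1, 8, 2, 9]))  # [1, 2, 3, 6, 8, 9]
print(quicksort([]))                  # [] (the base case handles empty input)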

Copilot is based on Codex, a GPT-3 descendant fine-tuned on code. Imagine what it can do with GPT-4, or with the even more powerful models that will follow. In fact, GPT-4 itself is already capable of generating code directly from a prompt, and it not only provides the code but also explains it:

Me: “Please write a python script to go through all files in a folder, read the contents of each file that ends with ‘html’ and print the title of that file”

GPT4:

You can achieve this using the os module to list files in a directory and the BeautifulSoup4 library to parse the HTML files and extract the title. First, install the beautifulsoup4 package if you haven’t already:

pip install beautifulsoup4

Then, you can use the following Python script:

import os
from bs4 import BeautifulSoup

def extract_title_from_html(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        soup = BeautifulSoup(content, 'html.parser')
        title = soup.title.string if soup.title else None
        return title

def main():
    folder_path = 'path/to/your/folder'

    for filename in os.listdir(folder_path):
        if filename.endswith('.html'):
            file_path = os.path.join(folder_path, filename)
            title = extract_title_from_html(file_path)
            if title:
                print(f"Title of {filename}: {title}")
            else:
                print(f"No title found in {filename}")

if __name__ == '__main__':
    main()

Replace ‘path/to/your/folder’ with the actual path to the folder containing the HTML files. This script will go through all the files in the specified folder, read the contents of each file ending with ‘.html’, and print the title of that file.

This is quite remarkable: a full solution, including instructions on which package to install, how to use the code, and more.
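One small tweak I would make before using it (my addition, not part of GPT-4's answer) is to take the folder path from the command line instead of editing the script:

import sys

def main(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith('.html'):
            file_path = os.path.join(folder_path, filename)
            title = extract_title_from_html(file_path)
            if title:
                print(f"Title of {filename}: {title}")
            else:
                print(f"No title found in {filename}")

if __name__ == '__main__':
    # default to the current directory when no argument is given
    main(sys.argv[1] if len(sys.argv) > 1 else '.')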

Hallucinations of code-generating LLMs

As a developer myself, I am super excited about tools like StarCoder and GitHub Copilot, but I’m also cautious: what guarantees do I have that they work as advertised?

As we know, LLMs sometimes hallucinate, and code-generating LLMs are not immune. As with other LLM hallucinations, the generated code may look great and appear to work, but it may not do the right thing, creating subtle bugs that are hard to find.

For example, I provided this prompt:

# function to select the lowest two even numbers that sum to 17

This, of course, is an impossible task: the sum of two even numbers is always even, so no two even numbers can sum to 17.

But Copilot quite confidently suggested:

def select_two_even_numbers(numbers):
    # filter out odd numbers
    even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
    # sort the list
    even_numbers.sort()
    # return the last two elements
    return even_numbers[-2:]

Which is wrong on two counts: it returns the two largest even numbers rather than the lowest, and it never checks the (impossible) sum condition.
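A quick test (mine, not Copilot's) makes the failure obvious:

result = select_two_even_numbers(range(1, 20))
print(result, sum(result))  # [16, 18] 34: not 17, and no error is raised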

This becomes even worse when the generated code introduces security weaknesses that can later be exploited by attackers. For example, with this prompt:

# using os.system, ask user for input file and use cat to display its contents
def cat_file():

The output is:

def cat_file():
    filename = input("Enter a filename: ")
    os.system("cat " + filename)

Aside from the missing import os, this works as advertised, but there is clearly a significant security risk in this code, as the user may, for example, enter the following:

important_file.txt; rm -rf /

And if the user executing this script has admin rights, this may result in deletion of all data on the local disk.
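A safer pattern (my sketch, not something Copilot suggested here) avoids the shell entirely, so the filename is never interpreted as a command:

import subprocess

def cat_file():
    filename = input("Enter a filename: ")
    # passing the arguments as a list bypasses the shell, so
    # "important_file.txt; rm -rf /" is treated as one odd filename,
    # not as two commands
    subprocess.run(["cat", filename], check=False)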
The bottom line is this: we need a way to validate that code generated by an LLM is correct and does not introduce security vulnerabilities, or find other ways to address this issue (a formally verified seL4 kernel, anyone?), to ensure LLM code generation does not result in less secure software.
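One lightweight habit that helps today is to treat generated code like any other untrusted contribution and put tests around it before it ships. A minimal sketch with pytest, reusing the earlier URL example (the module name and test cases are mine):

import pytest
from url_utils import get_domain  # hypothetical module holding the generated code

@pytest.mark.parametrize("url, expected", [
    ("https://vectara.com/blog/post", "vectara.com"),
    ("http://example.org", "example.org"),
])
def test_get_domain(url, expected):
    assert get_domain(url) == expected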

Summary

Code-generating LLMs like GitHub Copilot and StarCoder, among others, are large language models trained on datasets specifically collected so the model learns to generate code in many languages and across multiple use cases.

As the examples above demonstrate, code-generating LLMs can be quite helpful and can increase developer productivity in many day-to-day coding tasks.

As this technology improves, we will learn how best to work with code-generating LLMs: what prompts to use to get the results we want, and how to catch hallucinations or other mistakes the LLM makes.

I'm equally intrigued by the possibility of code-generating LLMs helping other members of a technology organization who interact with the engineering team, like product managers, data analysts, and others.

In part 2 (to be published 5/18) we’ll explore more advanced capabilities of code-generating LLMs, where this technology is headed, and whether or not we should expect software developers to be replaced by LLMs anytime in the near future.
