Large language models (LLMs) like GPT-3 are capturing the imaginations of data scientists around the world, thanks to their advanced ability to understand and generate text. Now researchers at Salesforce have leveraged an LLM to build CodeGen, which can understand source code and even generate its own code in different programming languages.
To get the low-down on CodeGen, we turned to Salesforce Chief Scientist Silvio Savarese, who was kind enough to answer a few questions about the research project, how it was developed, and its possible role at Salesforce in the future.
The original inspiration for CodeGen came more than a year ago, when Savarese’s team “envisioned a conversational system that is capable of having a discussion with the user to solve a coding problem or task,” Savarese said via an email Q&A with Dataname.
“With CodeGen, this discussion takes the form of an English text-based discourse between a human and a machine,” he continued. “The result is AI-generated code that solves the discussed problem.”
Just as language models have demonstrated a capability to understand William Shakespeare’s writings and even to generate prose that closely resembles the Bard’s, CodeGen has the ability to understand the various textual components of a programming language and to generate code that matches the syntax, rules, and constraints of that language.
“CodeGen establishes a bridge between natural language and programming language,” Savarese said. “CodeGen can help democratize programming much like low-code tools do by lowering the barrier to entry for non-developers.”
CodeGen’s model has a GPT-style architecture, and was trained from scratch on Google’s TPU-v4, Savarese said. At this point, CodeGen remains mostly a research project, according to Savarese, although it is being tested with a small group of users.
Up to this point, the demonstrations of CodeGen have focused primarily on interactive data science scenarios, such as working with Jupyter notebooks. It has also been used for context-sensitive code completion within common development environments, he said.
“Since CodeGen is a flexible, foundational model, it can be applied broadly,” Savarese said. “For example, it helps us better understand existing code. It helps detect bugs in code that humans have written, estimate risks, and even summarize a code’s functionality to help new developers understand it. CodeGen even translates code across programming languages–another benefit when dealing with legacy code that may still have value but is difficult to maintain.”
CodeGen excels at frequently used programming patterns, Savarese said, such as “known efficient implementations of algorithms, file operations, data manipulation, custom analytics tools on top of platforms like Tableau, Web development and design, or the construction of larger programs composed of many ‘simpler’ steps.” “Programs for which entirely new algorithms to solve a problem are needed or code written in less common programming languages may be less approachable,” he continued.
The model could enable users with no experience to develop simple programs, Savarese said, while more complex programs would still require some development experience. It could accelerate the progression of IT users with some experience, such as administrators, who want to become full-blown developers, while it could be a time-saver for expert coders who want to eliminate redundant or repetitive tasks, he said.
CodeGen is available on GitHub.