OpenAI Unveils Codex AI Coding Assistant Amid Data Questions
AI's Current Struggles with Coding
Though artificial intelligence is rapidly advancing, it's still considered quite limited in tasks requiring a high degree of flexibility, such as writing computer code.
Earlier this year, ChatGPT creator OpenAI published a detailed white paper criticizing AI for its subpar performance in coding challenges. Among other findings, it noted that even the most advanced AI models are "still unable to solve the majority" of coding tasks.
OpenAI's Vision and the Arrival of Codex
Later, in an interview, OpenAI CEO Sam Altman stated that these models are "on the precipice of being incredible at software engineering." He added that "software engineering by the end of 2025 looks very different than software engineering at the beginning of 2025."
This was a bold prediction with little substance to support it. If anything, generative AI of the kind Altman promotes has arguably gotten worse at coding, as hallucination rates reportedly increase with each new model iteration.
Now, we have a clearer picture of his intentions.
Early on Friday, OpenAI unveiled a preview of Codex, the company's attempt at a specialized coding "agent" – a vague industry term whose meaning seems to shift depending on the company promoting it.
Codex Capabilities and Training Concerns
"Codex is a cloud-based software engineering agent that can work on many tasks in parallel," according to the company's official research preview.
The new tool is intended to assist software engineers by writing new features, debugging existing code, and answering questions about source code, among other functions.
Unlike ChatGPT's all-encompassing model, which is aimed at the broader mass market, Codex has purportedly been trained to "generate code that closely mirrors human style and PR preferences." Less charitably, that phrasing could be read as a way of saying "trained on other people's code" – a tactic OpenAI has faced lawsuits over in the recent past, particularly over its role in developing Microsoft's Copilot, which drew on open-source and copyrighted code from GitHub.
The Shadow of Past Legal Battles
Thanks largely to a legal technicality, OpenAI, GitHub, and Microsoft emerged from that challenge without major consequences. The outcome may give OpenAI some perceived legal cover if it chooses to proceed independently with its own model trained on GitHub code.
Cloud Operation and Lingering Data Questions
In the Codex release, OpenAI claims its coding agent operates entirely in the cloud and is cut off from the internet, meaning it cannot scour the web for data like ChatGPT. Instead, OpenAI states it "limits the agent’s interaction solely to the code explicitly provided via GitHub repositories and pre-installed dependencies configured by the user via a setup script."
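OpenAI's announcement describes that setup script only in general terms. As an illustration of the workflow it implies – dependencies installed up front so the sandboxed agent never needs network access – a hypothetical setup script might look like the following. Every file name and command here is an assumption for illustration, not taken from OpenAI's documentation:

```shell
#!/usr/bin/env bash
# Hypothetical Codex environment setup script (illustrative sketch only).
# This would run while the environment still has network access; the agent
# itself is then cut off from the internet and can only use what's here.
set -euo pipefail

# Pre-install the repository's pinned dependencies, since the agent
# cannot download packages on its own later.
pip install -r requirements.txt

# Pre-build anything the test suite needs for the same reason.
make build
```

In such a scheme, anything not fetched or built during this setup phase would simply be unavailable to the agent at runtime.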
Nevertheless, the data used for Codex's initial training must have come from somewhere. Judging by the growing number of copyright lawsuits plaguing the AI industry, it's likely only a matter of time before the origins of that data come to light.
More on OpenAI: ChatGPT Users Are Developing Bizarre Delusions