Four Futures for Large Language Models
June 2022 (original Twitter thread)
Almost two years after GPT-3, we've seen continued scaling of large language models, with multiple firms joining the fray
Here are four possible futures for how these competitive dynamics might shake out over the next few years:
1) Commoditization via public data
Current LLMs are mostly trained on publicly available sources, like Wikipedia, blog posts, GitHub, online books, etc.
GPT-3 was trained on 570GB of such text (for perspective, that's barely enough to fill half of a 1TB microSD card).
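As a quick sanity check on that comparison, here is the arithmetic, assuming decimal storage units (1 TB = 1000 GB). The function name is just illustrative.

```python
# Back-of-the-envelope check, assuming decimal storage units (1 TB = 1000 GB).
GPT3_TEXT_GB = 570        # training text size quoted above
CARD_CAPACITY_GB = 1000   # a 1TB microSD card


def fraction_of_card(data_gb: float, card_gb: float) -> float:
    """Return the fraction of the card that the data would occupy."""
    return data_gb / card_gb


print(f"{fraction_of_card(GPT3_TEXT_GB, CARD_CAPACITY_GB):.0%} of the card")
```

Under that assumption the text takes up a little over half of the card.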
If this continues, LLMs may end up commoditized, with mostly-interchangeable models available from multiple providers
Firms might gain a temporary edge by scaling up their model size, training data, context length, or retrieval bank
(The challenges here are nothing to sneeze at, and will likely pose barriers)
But, if others can quickly follow suit, this wouldn't fundamentally alter the competitive landscape
2) Market specialization via private data
To stave off commoditization, firms might focus on building LLMs for specific applications where private data gives a competitive edge
For example, a software company with a large, private codebase might build superior code LLMs
Similarly, a hospital system with a large EHR database may have an edge when building a medical LLM
And a company with a messaging app may be able to build a better LLM chatbot
Unique sources of unlabeled data are likely to become increasingly important for differentiation
3) Dominance through Data Flywheels (aka "Neural Network Effects")
Another way to prevent commoditization is to build data flywheels, where user behavior creates unique training data not accessible to competitors
For example, when people use an LLM like GitHub Copilot, they can accept or reject proposed completions
This produces training data that creates a powerful feedback loop:
More people use the model -> the model gets better -> more people use the model
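To make the accept/reject idea concrete, here is a minimal sketch of how such signals could be turned into labeled training examples. This is not how Copilot or any specific product does it; `CompletionEvent`, `to_training_examples`, and `acceptance_rate` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CompletionEvent:
    """One user interaction (hypothetical schema, not a real product's log format)."""
    prompt: str
    completion: str
    accepted: bool


def to_training_examples(events: List[CompletionEvent]) -> List[Tuple[str, str, int]]:
    """Convert events into (prompt, completion, label) tuples; label 1 means accepted."""
    return [(e.prompt, e.completion, 1 if e.accepted else 0) for e in events]


def acceptance_rate(events: List[CompletionEvent]) -> float:
    """Fraction of proposed completions that users accepted (0.0 if there are none)."""
    if not events:
        return 0.0
    return sum(e.accepted for e in events) / len(events)


events = [
    CompletionEvent("def add(a, b):", "    return a + b", accepted=True),
    CompletionEvent("def add(a, b):", "    return a - b", accepted=False),
]
print(to_training_examples(events))
print(acceptance_rate(events))
```

The point is only that every interaction yields a labeled example for free, and only the provider serving the model gets to collect it.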
These data flywheels can make it challenging for later entrants to catch up:
New users will gravitate towards the best existing models, further strengthening them at the expense of the newcomers
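Here is a toy simulation of that dynamic. Everything in it is an assumption made for illustration: model quality is taken to grow with the log of accumulated usage data, and new users split across providers in proportion to quality raised to a `sharpness` power. It is a sketch of the feedback loop, not a forecast.

```python
import math
from typing import List, Sequence


def simulate_flywheel(
    initial_users: Sequence[float],
    new_users_per_step: float = 100.0,
    steps: int = 20,
    sharpness: float = 4.0,
) -> List[float]:
    """Return each provider's user count after `steps` rounds of a toy flywheel."""
    users = [float(u) for u in initial_users]
    data = list(users)  # assumption: accumulated usage data starts equal to the user base
    for _ in range(steps):
        # Assumption: quality has diminishing returns in data.
        quality = [math.log1p(d) for d in data]
        # Higher sharpness means new users favor the best model more strongly.
        weights = [q ** sharpness for q in quality]
        total = sum(weights)
        for i, w in enumerate(weights):
            share = w / total if total > 0 else 1.0 / len(users)
            users[i] += new_users_per_step * share
        # Every current user generates one unit of new training data per step.
        for i, u in enumerate(users):
            data[i] += u
    return users


incumbent, entrant = simulate_flywheel([1000.0, 10.0])
print(f"incumbent: {incumbent:.0f} users, entrant: {entrant:.0f} users")
```

In this setup the incumbent captures the majority of new users at every step, because its head start in data keeps its quality ahead. Lowering `sharpness` weakens the effect, which is one way to think about how sensitive users are to quality differences.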
4) Loss of confidence from unsolved flaws
LLMs (and neural networks in general) have formidable problems, and there is no guarantee they will be solved soon
For example, even with today's best mitigations, LLMs still sometimes output false/toxic text and insecure/incorrect code
If these flaws remain unsolved over the next few years, we might see a series of high-profile offensive outputs or security vulnerabilities.
This could lead to a loss of confidence in LLMs and a lack of widespread adoption.
These are four possible futures for the competitive dynamics of LLMs, but of course there are many others!
I'm curious to hear which of these people think are most likely, and whether you think I've missed any important ones