Databricks has launched Dolly 2.0, the next version of its large language model (LLM) with ChatGPT-like human interaction (also called “instruction-following”), which the company first released just two weeks ago.
The company says that Dolly 2.0 is the first open-source, instruction-following LLM fine-tuned on a transparently sourced, freely available dataset that is also licensed for commercial use. This means businesses can build applications on Dolly 2.0 without paying for API access or sharing data with third parties.
Ali Ghodsi, the CEO of Databricks, says that other commercially usable LLMs exist, but “they won’t talk to you like Dolly 2.0.” Because the training data is provided free under an open-source license, he added, users can modify and improve it. He said –
“So you can make your own version of Dolly.”
As part of its ongoing pledge to open source, Databricks is also releasing databricks-dolly-15k, the dataset that Dolly 2.0 was fine-tuned on. This is a corpus of more than 15,000 records generated by thousands of Databricks employees, and Databricks says it is the “first open-source, human-generated instruction corpus specifically designed to enable large language models to exhibit the magical interactivity of ChatGPT.”
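As a rough illustration of how one such instruction record might be turned into a training prompt, here is a minimal sketch. The field names (`instruction`, `context`, `response`) and the prompt layout are assumptions for illustration, not the exact format Databricks used:

```python
# Sketch: rendering one databricks-dolly-15k-style record as a prompt string.
# Field names and headings here are illustrative assumptions, not the
# official Dolly training format.

def format_record(record: dict) -> str:
    """Render an instruction-following record as a single prompt string."""
    parts = [f"### Instruction:\n{record['instruction']}"]
    if record.get("context"):  # only some records carry supporting context
        parts.append(f"### Context:\n{record['context']}")
    parts.append(f"### Response:\n{record['response']}")
    return "\n\n".join(parts)

example = {
    "instruction": "Summarize the main idea of the passage.",
    "context": "Dolly 2.0 is a 12-billion-parameter model fine-tuned on an open corpus.",
    "response": "Dolly 2.0 is an openly licensed instruction-following model.",
}
print(format_record(example))
```

Each record pairs a human-written instruction (and optional context) with a human-written response, which is what lets a base model learn conversational instruction-following without any ChatGPT-generated data.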
In the past two months, there has been a wave of ChatGPT-like, instruction-following LLM releases, many of them open source by some definition or offering at least gated access. One of these was Meta’s LLaMA, which led to others such as Alpaca, Koala, Vicuna, and Dolly 1.0 from Databricks.
Many of these “open” models, however, were subject to “industrial capture,” according to Ghodsi, because they were trained on datasets whose terms limit commercial use, such as the Stanford Alpaca project’s 52,000-example question-and-answer dataset, which was generated using OpenAI’s ChatGPT. As he noted, OpenAI’s terms of service forbid using the service’s output to develop services that compete with OpenAI.
Databricks, however, found a way around this issue: Dolly 2.0 is a 12-billion-parameter language model based on the open-source EleutherAI Pythia model family, fine-tuned exclusively on a small, open-source corpus of instruction records created by Databricks employees.
The licensing terms for this dataset allow it to be used, modified, and extended for any purpose, academic or commercial. Until now, models trained on ChatGPT output have occupied a legal gray area. Ghodsi said –
“The whole community has been tiptoeing around this and everybody’s releasing these models, but none of them could be used commercially.”
“So that’s why we’re super excited.”
A Databricks blog post emphasized that, like the original Dolly, the 2.0 version is not state-of-the-art, but “exhibits a surprisingly capable level of instruction-following behavior given the size of the training corpus.” The post adds that the effort and expense needed to build powerful AI technologies are “orders of magnitude less than previously imagined.”
Ghodsi said of Dolly’s diminutive size –
“Everyone else wants to go bigger, but we’re actually interested in smaller.”
“Second, it’s high-quality. We looked over all the answers.”
Ghodsi also said that he thinks Dolly 2.0 will start a “snowball” effect, where others in the AI field will join in and build on it. The restriction on commercial use, he explained, was a big obstacle to overcome –
“We’re excited now that we finally found a way around it. I promise you’re going to see people applying the 15,000 questions to every model that exists out there, and they’re going to see how many of these models suddenly become kind of magical, where you can interact with them.”
That concludes our coverage of Databricks’ release of Dolly 2.0, the first open, instruction-following LLM available for commercial use.