Robert Važan

Where are all the specialized LLMs?

Specialized language models can outperform generalist models in their specialized domain while costing a fraction as much to train and use. Training cost grows roughly quadratically with parameter count, because compute scales with parameters times tokens and compute-optimal token counts scale with parameters, so you can train a hundred 7B models for the cost of a single 70B model. It is therefore quite surprising that open 70B models are typically accompanied by only one lonely general-purpose 7B variant. Why wouldn't developers of these models invest even a small fraction of their huge compute budget in training several 7B models that would be highly competitive in their domains of specialization?
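The hundred-to-one ratio falls out of simple arithmetic. Here is a minimal sketch, assuming the standard 6ND estimate of training compute and the Chinchilla rule of thumb of roughly 20 tokens per parameter (both are assumptions of this sketch, not figures from the text):

```python
# Back-of-envelope: training cost is roughly quadratic in parameter count
# when token count scales with model size (Chinchilla-style training).

def train_flops(params: float) -> float:
    """Approximate FLOPs for a compute-optimal training run."""
    tokens = 20 * params          # assumption: ~20 tokens per parameter
    return 6 * params * tokens    # assumption: compute ~ 6 * N * D

cost_7b = train_flops(7e9)
cost_70b = train_flops(70e9)

print(f"7B:  {cost_7b:.2e} FLOPs")
print(f"70B: {cost_70b:.2e} FLOPs")
# A tenfold parameter increase gives a roughly hundredfold compute increase.
print(f"ratio: {cost_70b / cost_7b:.0f}x")
```

Because both factors in the product grow linearly with model size, scaling parameters by 10× scales compute by 100×, which is the whole argument in one line.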

To be fair, models are gradually being specialized by language (Qwen in Chinese, YaLM in Russian, LeoLM in German, Aguila in Spanish, CroissantLLM in French, Polka in Polish, and probably many more). Most of them are bilingual with English. Language family models like SEA-LION for Southeast Asian languages are an interesting option. There are a few models specialized by programming language (the Python variant of Code Llama, SQLCoder) and some specialized by profession (Code Llama and others for coding, Meditron for medicine, and Samantha for psychology). That is, however, surprisingly little given the relatively low cost of smaller models and their wide applicability.

Fine-tuning is not going to cut it. It's useful only if domain data is scarce, because the model then benefits from the general knowledge and transfer learning embedded in the base model. If there's plenty of data in the domain, pretraining will make much better use of the available parameters and compute budget. It is possible to build on top of an existing generalist model after expanding its vocabulary, but it remains to be seen whether this is more or less effective than training from scratch. Prior training can be a handicap if the model has settled into certain patterns and lost its ability to adapt, as the lottery ticket hypothesis would suggest.

My guess is that the lack of systematic specialization is due to several factors. Developers of generalist models see themselves as advancing the state of the art, and they perceive specialized models as scope creep for their project. They instead overtrain their smaller models far beyond Chinchilla-optimal levels. Developers of generalist models usually lack domain-specific knowledge, which limits their ability to curate suitable datasets and monitor quality. Some domains have impoverished training data and need a lot of transfer learning to work well. Finally, instruction tuning of specialized models is tricky, because there isn't enough domain-specific instruction data. It is nevertheless possible that some big player will train a set of specialized models if that becomes the main goal of the project.

Since we are talking about smaller models, it is worth considering how far random hobbyists can go with consumer GPUs. Unfortunately, Chinchilla-optimal 1B models are the limit even if you dedicate a high-end GPU to the job 24/7 for a year. If you want to train a bunch of different models, you are down to 100-300M parameters per model. Maybe my napkin math is slightly off and you could go a bit beyond 1B, but 7B and 13B models are definitely out of reach for hobbyists. Small models still have high value, because they can be embedded everywhere, but successfully competing with generalist models on quality requires utilizing all available inference resources, which means scaling up to 7B or 13B parameters for models targeting consumer hardware.
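The napkin math above can be reproduced with the same 6ND estimate. The sustained-throughput figure below is my own assumption, not a number from the text; real throughput varies enormously with precision, batching, and software stack, and a well-tuned setup on a high-end card can be several times faster:

```python
# Rough sketch: wall-clock time for a Chinchilla-optimal training run
# on a single consumer GPU at an assumed sustained throughput.

def training_days(params: float, sustained_tflops: float) -> float:
    """Days to train a model of the given size at the given throughput."""
    tokens = 20 * params                     # assumption: ~20 tokens/param
    flops = 6 * params * tokens              # assumption: compute ~ 6 * N * D
    return flops / (sustained_tflops * 1e12) / 86400

# Assumed sustained throughput: ~5 TFLOPS (unoptimized single-GPU training).
for p in (100e6, 300e6, 1e9, 7e9):
    print(f"{p / 1e9:.1f}B params: {training_days(p, 5):.0f} days")
```

Under these assumptions, a 1B model lands in the ballpark of a year of continuous training, 100-300M models take days to weeks each, and a 7B model comes out to decades, which is why it is out of reach for a single consumer GPU regardless of the exact throughput you plug in.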

People investing months of compute time on expensive multi-GPU systems are going to want recognition for their work as well as control over the training setup. That's why every specialized model will be one of a kind. Systematic development of a range of related models is unlikely, although there's going to be a lot of knowledge and code sharing among the projects, which will implicitly provide some structure. Model catalogs will provide systematization externally. Growth will be very uneven across domains, but the specialist LLM space will keep expanding, and it will in fact be the main contributor to the increasing output quality of local LLMs.