Wednesday, January 29, 2025

DeepSeek Threats and Advantages for Other Models

DeepSeek, the new open source Large Language Model, is challenging conventional wisdom about what it costs to train an LLM and to run inference on such models, even if there is some debate about the actual cost savings.


By some estimates, DeepSeek creates models that are 20 to 40 times cheaper for inference than competitors such as OpenAI's, and DeepSeek claims its training costs are lower by roughly the same margin.


The truth might be somewhere between the extremes of huge cost advantages in training and inference and parity with existing models on those scores. 


And, as an open source model, DeepSeek’s work can be used by others. Indeed, Meta software engineers already are said to be looking for ways to incorporate DeepSeek methods into Meta’s own open source Llama models. 


DeepSeek has not provided detailed disclosures on hardware utilization, power efficiency, or software optimizations that would justify significant cost reductions compared to leading AI labs such as OpenAI, Google DeepMind, or Anthropic, though it claims costs as much as 20 times lower than those labs'.


Some observers also wonder whether the claimed non-use of advanced Nvidia graphical processing units is substantially true. Some work might have used such advanced GPUs, though the final training run might not have done so. 


Some might suspect “borrowing” of intellectual property as well, which could explain some of the cost advantages. Microsoft and OpenAI also believe that has happened.  


DeepSeek says it uses a Mixture of Experts (MoE) architecture, which allows the model to activate only a small portion of its parameters for any given task. But many existing GenAI models also use MoE. 


By selectively activating only the necessary "experts" within the model, DeepSeek reduces the computational resources required for both training and inference. Still, because existing GenAI models such as Gemini also use MoE, it is unclear how much of the claimed gain lies there, and in principle other models could tweak their approaches in the same way.
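The selective-activation idea can be sketched in a few lines. This is an illustrative toy, not DeepSeek's implementation: real MoE layers sit inside transformer blocks, use learned routers with load balancing, and route per token. The names (`moe_forward`, `top_k`) and the toy linear "experts" are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(token, experts, router_weights, top_k=2):
    """Run a token through only the top_k highest-scoring experts,
    skipping the rest entirely -- the source of the compute savings."""
    scores = softmax(router_weights @ token)      # one score per expert
    chosen = np.argsort(scores)[-top_k:]          # indices of the top_k experts
    # Combine only the chosen experts' outputs, weighted by router score.
    out = sum(scores[i] * experts[i](token) for i in chosen)
    return out / scores[chosen].sum()

# Toy setup: 8 "experts", each a simple linear map; only 2 run per token.
rng = np.random.default_rng(0)
d = 4
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(8)]
router = rng.normal(size=(8, d))
y = moe_forward(rng.normal(size=d), experts, router, top_k=2)
```

With 8 experts and `top_k=2`, only a quarter of the parameters are exercised per token; DeepSeek's reported ratio (37B of 671B parameters active) is far more aggressive.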


An optimized training process might be more important. DeepSeek says it has developed efficient training methods that allow training with fewer computational resources in less time.


All that challenges existing conventional wisdom about how much AI capex and opex (electricity, for example) will be required to fully use AI “everywhere” in an economy. And since much investment in the AI ecosystem has been based on those assumed costs, DeepSeek is causing consternation in many circles about whether investments were an instance of “buy high, sell low.” 


Much of the immediate commentary along those lines has seemingly assumed that lower computation costs (training and inference) will translate somewhat directly into value creation (so-called “AI leadership”). 


Perhaps that is overreaching. One tends to hear the same thing about investment levels in other information technologies, as though the outcomes are directly related to the magnitude of investments, and that is not often true. 


For example, even without clear causal relationships, policymakers always assume that investing in better broadband, or coding skills, or the latest information technology, necessarily drives higher productivity. 


That might be partly true, but only in the context of other variables that arguably also contribute to higher productivity, including the sum total of all other human and institutional capital already built up. If IT infrastructure alone were able to drive productivity, there would not continue to be large gaps between economic output leaders and laggards. 


As we have seen time and again, a mere increase in inputs (IT investment) does not drive productivity and creativity outputs in a linear way. So though lower-cost LLM technology will be helpful, it does not necessarily represent an immediate strategic shift in creativity or productivity.


It might devalue already-made capital investments to some extent, but that also remains to be seen. That might damage some investors, to be sure. 


But DeepSeek might mostly be an intensification of the expected cost reduction cycle that all computing technologies undergo. 


Still, there is both concern and hope in different quarters. DeepSeek, it is said, has significantly reduced model-building costs through several innovative approaches:

  1. Mixture of Experts (MoE) architecture: DeepSeek uses an MoE system that activates only 37 billion of its 671 billion parameters for any given task, dramatically reducing computational costs. Alphabet’s Gemini uses MoE as well. 

  2. Reinforcement Learning (RL): Instead of relying on supervised fine-tuning, DeepSeek applied pure RL to its base model, allowing the AI to self-discover chain-of-thought reasoning through trial-and-error.

  3. Group Relative Policy Optimization (GRPO): This RL algorithm eliminates the need for a separate critic (value) model during reinforcement learning, further cutting down training costs.

  4. Efficient hardware: DeepSeek trained its models on less powerful, cheaper chips (Nvidia H800 GPUs), demonstrating the ability to achieve high performance with modest hardware.

  5. Distillation techniques: DeepSeek used strategies like generating its own training data, which requires more compute but can lead to more efficient models.
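The GRPO idea in item 3 can be made concrete. The core trick is to score each sampled response against the mean of its own group, so no separately trained critic model is needed. The sketch below shows only that advantage step (the full GRPO objective also includes a clipped policy ratio and a KL penalty, omitted here); the function name and example rewards are illustrative.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for a group of responses sampled from
    the same prompt: normalize each reward against the group's own
    mean and standard deviation, replacing a learned value model."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to one prompt, scored by a reward function.
adv = grpo_advantages([1.0, 0.0, 0.5, 0.5])
# Responses above the group mean get positive advantage; below, negative.
```

Because the baseline is just the group mean, the memory and compute that would otherwise go to training and running a critic model are saved.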


Keep in mind that DeepSeek is open source, so other model builders are free to use or modify parts of the system for their own purposes.

