Wednesday, December 27, 2023

LLM Costs Should Drop Over Time: They Almost Have To

One reason bigger firms are likely to have advantages as suppliers and operators of large language models is that LLMs are, at the moment, quite expensive to run compared to search operations. That cost gap matters for LLM business models.


Though costs should change over time, the current cost delta between a single search query and a single inference operation is quite substantial. It is estimated, for example, that a search engine query costs between $0.0001 and $0.001.


In comparison, a single LLM inference operation might cost between $0.01 and $0.10, depending on model size, prompt complexity, and cloud provider pricing.


Costs might also vary substantially between a general-purpose LLM and a specialized, smaller LLM adapted for a single firm or industry. It is not unheard of for a single inference operation using a general-purpose model to cost a few dollars, though costs in the cents per operation are likely more common.


In other words, an LLM inference operation might cost 10 to 100 times what a search query costs. 
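That multiple is simple to check against the estimates above. A back-of-the-envelope sketch, using only the rough per-query figures quoted in the text (not measured numbers):

```python
# Rough cost estimates quoted in the text, in USD per operation.
search_low, search_high = 0.0001, 0.001  # per search query
llm_low, llm_high = 0.01, 0.10           # per LLM inference

# Comparing like with like (low end vs. low end, high end vs. high end)
print(f"{llm_low / search_low:.0f}x")    # 100x
print(f"{llm_high / search_high:.0f}x")  # 100x

# The most favorable case for the LLM: cheap inference vs. costly search
print(f"{llm_low / search_high:.0f}x")   # 10x
```

Like-for-like, the gap is about 100x; only when the cheapest inference is set against the priciest search does it narrow to about 10x.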


Here, for example, are recent list prices from Google Cloud’s Vertex AI service.


All generation prices below are per 1,000 characters. Where offered, supervised tuning and reinforcement learning from human feedback (RLHF) run in us-central1 and europe-west4 and are billed per node hour at Vertex AI custom training pricing.

PaLM 2 for Text (Text Bison)
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning, RLHF

PaLM 2 for Text 32k (Text Bison 32k)
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning

PaLM 2 for Text (Text Unicorn)
  Input (Global): online $0.0025, batch $0.0020
  Output (Global): online $0.007, batch $0.0060

PaLM 2 for Chat (Chat Bison)
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning, RLHF

PaLM 2 for Chat 32k (Chat Bison 32k)
  Input (Global): online $0.00025*
  Output (Global): online $0.0005*
  Tuning: supervised tuning

Embeddings for Text
  Input (Global): online $0.000025, batch $0.00002
  Output (Global): no charge

Codey for Code Generation
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning

Codey for Code Generation 32k
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Chat
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Chat 32k
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Completion
  Input (Global): online $0.00025
  Output (Global): online $0.0005

But training and inference costs could well decline over time, experts argue. Smaller, more efficient models are likely to emerge, built with cost-reduction techniques such as parameter pruning, knowledge distillation, and low-rank factorization.
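Magnitude pruning, the first of those techniques, is easy to illustrate. This is a toy sketch on a flat list of weights, not how production frameworks do it (they typically prune per layer, with masks):

```python
def prune_by_magnitude(weights, keep_fraction=0.5):
    """Zero out all but the largest-magnitude fraction of weights."""
    n_keep = int(len(weights) * keep_fraction)
    # Threshold = magnitude of the smallest weight we keep.
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.9, -0.01, 0.4, 0.002, -0.7, 0.05]
print(prune_by_magnitude(weights, keep_fraction=0.5))
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The zeroed entries need not be stored or multiplied, which is where the cost savings come from, provided the hardware or runtime can exploit the sparsity.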


Sparse training methods, which update only the parts of the model relevant to specific tasks, also will help.


Fine-tuning existing pre-trained models for specific tasks, rather than training from scratch, also can reduce training costs.
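The intuition here, that fine-tuning touches far fewer parameters than pre-training does, can be made concrete with invented numbers; neither figure below comes from any specific model:

```python
# Illustrative only: hypothetical parameter counts.
total_params = 7_000_000_000   # a 7B-parameter pre-trained base model
adapter_params = 4_000_000     # a small adapter trained on top of it

# Fraction of parameters actually updated during fine-tuning
trainable_fraction = adapter_params / total_params
print(f"{trainable_fraction:.4%} of parameters trained")
```

Updating well under a tenth of a percent of the weights requires far less compute, which is why adapting an existing model is so much cheaper than building one.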


Dedicated hardware optimized for LLM workloads already is arriving. In similar fashion, better training algorithms; quantization and pruning (removing unnecessary parameters); automatic model optimization (tools and frameworks that tailor models to specific hardware and inference requirements); and open source development all will help lower costs.
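Of those techniques, quantization is simple to sketch: float weights are mapped to 8-bit integers plus a scale factor, cutting memory roughly 4x versus 32-bit floats. This is a toy symmetric scheme; production systems quantize per tensor or per channel, with calibration:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: returns (int weights, scale)."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in q_weights]

weights = [0.5, -1.27, 0.02, 0.8]
q, scale = quantize_int8(weights)
# Each q value fits in one signed byte; dequantize(q, scale) is close
# to the original weights, at a quarter of the storage.
print(q, dequantize(q, scale))
```

The accuracy cost of the rounding is often small, which is why 8-bit (and even 4-bit) inference has become a standard cost lever.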

