Wednesday, December 27, 2023

LLM Costs Should Drop Over Time: They Almost Have To Do So

One reason bigger firms are likely to have advantages as suppliers and operators of large language models is that LLM operations are, at the moment, quite expensive compared to search operations. That cost gap matters for LLM business models.


Though costs should change over time, the current cost delta between a single search query and a single inference operation is substantial. A search engine query is estimated to cost between $0.0001 and $0.001, for example.


In comparison, a single LLM inference operation might cost between $0.01 and $0.10, depending on model size, prompt complexity, and cloud provider pricing.


Costs also vary substantially between a general-purpose LLM and a specialized, smaller LLM adapted for a single firm or industry. It is not unheard of for a single inference operation using a general-purpose model to cost a few dollars, though costs of a few cents per operation are likely more common.


In other words, an LLM inference operation might cost 10 to 100 times what a search query costs. 
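The arithmetic behind that ratio is simple enough to sketch; the figures below are the illustrative per-operation estimates from the paragraphs above, not measured values:

```python
# Rough ratio check on the per-operation cost estimates cited above
# (illustrative figures, not measured values).
SEARCH_LOW, SEARCH_HIGH = 0.0001, 0.001  # $ per search query
LLM_LOW, LLM_HIGH = 0.01, 0.10           # $ per LLM inference

# Cheapest LLM call vs. the costliest search query:
ratio_best_case = round(LLM_LOW / SEARCH_HIGH)
# Low end vs. low end of each range:
ratio_like_for_like = round(LLM_LOW / SEARCH_LOW)

print(ratio_best_case, ratio_like_for_like)  # 10 100
```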


Here, for example, are prices recently quoted for Google Cloud's Vertex AI service.


Model | Type | Region | Price per 1,000 characters
PaLM 2 for Text (Text Bison) | Input | Global | Online: $0.00025; Batch: $0.00020
PaLM 2 for Text (Text Bison) | Output | Global | Online: $0.0005; Batch: $0.0004
PaLM 2 for Text (Text Bison) | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
PaLM 2 for Text (Text Bison) | Reinforcement Learning from Human Feedback | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
PaLM 2 for Text 32k (Text Bison 32k) | Input | Global | Online: $0.00025; Batch: $0.00020
PaLM 2 for Text 32k (Text Bison 32k) | Output | Global | Online: $0.0005; Batch: $0.0004
PaLM 2 for Text 32k (Text Bison 32k) | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
PaLM 2 for Text (Text Unicorn) | Input | Global | Online: $0.0025; Batch: $0.0020
PaLM 2 for Text (Text Unicorn) | Output | Global | Online: $0.007; Batch: $0.0060
PaLM 2 for Chat (Chat Bison) | Input | Global | Online: $0.00025
PaLM 2 for Chat (Chat Bison) | Output | Global | Online: $0.0005
PaLM 2 for Chat (Chat Bison) | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
PaLM 2 for Chat (Chat Bison) | Reinforcement Learning from Human Feedback | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
PaLM 2 for Chat 32k (Chat Bison 32k) | Input | Global | Online: $0.00025*
PaLM 2 for Chat 32k (Chat Bison 32k) | Output | Global | Online: $0.0005*
PaLM 2 for Chat 32k (Chat Bison 32k) | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
Embeddings for Text | Input | Global | Online: $0.000025; Batch: $0.00002
Embeddings for Text | Output | Global | No charge
Codey for Code Generation | Input | Global | Online: $0.00025; Batch: $0.00020
Codey for Code Generation | Output | Global | Online: $0.0005; Batch: $0.0004
Codey for Code Generation | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
Codey for Code Generation 32k | Input | Global | Online: $0.00025
Codey for Code Generation 32k | Output | Global | Online: $0.0005
Codey for Code Generation 32k | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
Codey for Code Chat | Input | Global | Online: $0.00025
Codey for Code Chat | Output | Global | Online: $0.0005
Codey for Code Chat | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
Codey for Code Chat 32k | Input | Global | Online: $0.00025
Codey for Code Chat 32k | Output | Global | Online: $0.0005
Codey for Code Chat 32k | Supervised Tuning | us-central1, europe-west4 | Per node hour (Vertex AI custom training pricing)
Codey for Code Completion | Input | Global | Online: $0.00025
Codey for Code Completion | Output | Global | Online: $0.0005

But training and inference costs could well decline over time, experts argue. Smaller, more efficient models are likely to emerge, using cost-reduction techniques such as parameter pruning, knowledge distillation, and low-rank factorization.
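Low-rank factorization, for instance, replaces a large weight matrix with the product of two thin matrices, shrinking both parameter count and compute per inference. A minimal NumPy sketch, with matrix sizes and rank chosen arbitrarily for illustration rather than taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

# A dense weight matrix, standing in for one layer of a model.
d_out, d_in, rank = 1024, 1024, 64
W = rng.standard_normal((d_out, d_in))

# Truncated SVD keeps only the top-`rank` singular directions.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]   # shape (d_out, rank)
B = Vt[:rank, :]             # shape (rank, d_in)

full_params = W.size
lowrank_params = A.size + B.size
print(f"parameters: {full_params} -> {lowrank_params} "
      f"({full_params / lowrank_params:.0f}x fewer)")

# Inference becomes two cheap matmuls instead of one large one.
x = rng.standard_normal(d_in)
y_approx = A @ (B @ x)
```

How much accuracy survives the truncation depends on how concentrated the layer's singular values are, which is why this is applied selectively rather than to every layer.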


Sparse training methods that update only the parts of the model relevant to specific tasks also will help.


Use of existing pre-trained models that are fine-tuned for specific tasks also can reduce training costs. 


Dedicated hardware optimized for LLM workloads is already appearing. In similar fashion, optimized training algorithms; quantization and pruning (removing unnecessary parameters); automatic model optimization (tools and frameworks that adapt models to specific hardware and inference requirements); and open-source releases all will help lower costs.
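Quantization illustrates the mechanics: storing weights as 8-bit integers rather than 32-bit floats cuts memory (and often compute) by roughly 4x, at the price of a small rounding error. A minimal NumPy sketch of symmetric int8 quantization, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(1_000).astype(np.float32)  # fp32 "layer"

# Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)    # stored form
dequantized = q.astype(np.float32) * scale       # recovered at inference

memory_ratio = weights.nbytes / q.nbytes         # 4x smaller
max_error = np.abs(weights - dequantized).max()  # bounded by scale / 2
print(f"{memory_ratio:.0f}x smaller, max abs error {max_error:.4f}")
```

Production schemes add refinements (per-channel scales, outlier handling), but the cost logic is the same: fewer bits per parameter means cheaper memory, bandwidth, and inference.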

