Wednesday, December 27, 2023

LLM Costs Should Drop Over Time: They Almost Have To

One reason bigger firms are likely to have advantages as suppliers and operators of large language models is that LLMs are, at the moment, quite expensive to run compared to search operations. That cost gap matters for LLM business models.


Though costs should change over time, the current cost delta between a single search query and a single inference operation is quite substantial. It is estimated, for example, that a search engine query costs between $0.0001 and $0.001.


In comparison, a single LLM inference operation might cost between $0.01 and $0.10, depending on model size, prompt complexity, and cloud provider pricing.


Costs might also vary substantially between a general-purpose LLM and a specialized, smaller LLM adapted for a single firm or industry. It is not unheard of for a single inference operation using a general-purpose model to cost a few dollars, though costs in the cents per operation are likely more common.


In other words, an LLM inference operation might cost 10 to 100 times what a search query costs. 
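That multiple is simple to check against the estimates above. A back-of-the-envelope sketch, using only the rough per-query figures quoted in the text (not measured numbers):

```python
# Rough cost estimates quoted in the text, in USD per operation.
search_low, search_high = 0.0001, 0.001  # per search query
llm_low, llm_high = 0.01, 0.10           # per LLM inference

# Comparing like with like (low end vs. low end, high end vs. high end)
print(f"{llm_low / search_low:.0f}x")    # 100x
print(f"{llm_high / search_high:.0f}x")  # 100x

# The most favorable case for the LLM: cheap inference vs. costly search
print(f"{llm_low / search_high:.0f}x")   # 10x
```

Like-for-like, the gap is about 100x; only when the cheapest inference is set against the priciest search does it narrow to about 10x.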


Here, for example, are recent list prices from Google Cloud’s Vertex AI service.


All generation prices below are per 1,000 characters. Where offered, supervised tuning and reinforcement learning from human feedback (RLHF) run in us-central1 and europe-west4 and are billed per node hour at Vertex AI custom training pricing.

PaLM 2 for Text (Text Bison)
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning, RLHF

PaLM 2 for Text 32k (Text Bison 32k)
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning

PaLM 2 for Text (Text Unicorn)
  Input (Global): online $0.0025, batch $0.0020
  Output (Global): online $0.007, batch $0.0060

PaLM 2 for Chat (Chat Bison)
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning, RLHF

PaLM 2 for Chat 32k (Chat Bison 32k)
  Input (Global): online $0.00025*
  Output (Global): online $0.0005*
  Tuning: supervised tuning

Embeddings for Text
  Input (Global): online $0.000025, batch $0.00002
  Output (Global): no charge

Codey for Code Generation
  Input (Global): online $0.00025, batch $0.00020
  Output (Global): online $0.0005, batch $0.0004
  Tuning: supervised tuning

Codey for Code Generation 32k
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Chat
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Chat 32k
  Input (Global): online $0.00025
  Output (Global): online $0.0005
  Tuning: supervised tuning

Codey for Code Completion
  Input (Global): online $0.00025
  Output (Global): online $0.0005

But training and inference costs could well decline over time, experts argue. Smaller, more efficient models are likely to emerge, built with cost-reduction techniques such as parameter pruning, knowledge distillation, and low-rank factorization.
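Magnitude pruning, the first of those techniques, is easy to illustrate. This is a toy sketch on a flat list of weights, not how production frameworks do it (they typically prune per layer, with masks):

```python
def prune_by_magnitude(weights, keep_fraction=0.5):
    """Zero out all but the largest-magnitude fraction of weights."""
    n_keep = int(len(weights) * keep_fraction)
    # Threshold = magnitude of the smallest weight we keep.
    threshold = sorted((abs(w) for w in weights), reverse=True)[n_keep - 1]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.9, -0.01, 0.4, 0.002, -0.7, 0.05]
print(prune_by_magnitude(weights, keep_fraction=0.5))
# [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

The zeroed entries need not be stored or multiplied, which is where the cost savings come from, provided the hardware or runtime can exploit the sparsity.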


Sparse training methods, which update only the parts of the model relevant to specific tasks, also will help.


Fine-tuning existing pre-trained models for specific tasks, rather than training from scratch, also can reduce training costs.
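The intuition here, that fine-tuning touches far fewer parameters than pre-training does, can be made concrete with invented numbers; neither figure below comes from any specific model:

```python
# Illustrative only: hypothetical parameter counts.
total_params = 7_000_000_000   # a 7B-parameter pre-trained base model
adapter_params = 4_000_000     # a small adapter trained on top of it

# Fraction of parameters actually updated during fine-tuning
trainable_fraction = adapter_params / total_params
print(f"{trainable_fraction:.4%} of parameters trained")
```

Updating well under a tenth of a percent of the weights requires far less compute, which is why adapting an existing model is so much cheaper than building one.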


Dedicated hardware optimized for LLM workloads already is arriving. In similar fashion, better training algorithms; quantization and pruning (removing unnecessary parameters); automatic model optimization (tools and frameworks that tailor models to specific hardware and inference requirements); and open source development all will help lower costs.
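Of those techniques, quantization is simple to sketch: float weights are mapped to 8-bit integers plus a scale factor, cutting memory roughly 4x versus 32-bit floats. This is a toy symmetric scheme; production systems quantize per tensor or per channel, with calibration:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: returns (int weights, scale)."""
    scale = max(abs(w) for w in weights) / 127  # largest weight maps to 127
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    """Recover approximate float weights from the int8 representation."""
    return [q * scale for q in q_weights]

weights = [0.5, -1.27, 0.02, 0.8]
q, scale = quantize_int8(weights)
# Each q value fits in one signed byte; dequantize(q, scale) is close
# to the original weights, at a quarter of the storage.
print(q, dequantize(q, scale))
```

The accuracy cost of the rounding is often small, which is why 8-bit (and even 4-bit) inference has become a standard cost lever.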

