Sunday, March 17, 2024

"Tokens" are the New "FLOPS," "MIPS" or "Gbps"

Modern computing has some virtually universal reference metrics. For Gemini 1.5 and other large language models, tokens are a basic measure of capability.


In the context of LLMs, a token is the basic unit of content (text, for example) that the model processes and generates; throughput is usually measured in "tokens per second."


For a text-based model, tokens can include individual words; subwords (prefixes, suffixes, or individual characters); and special characters such as punctuation marks or spaces.
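A toy example makes this concrete. Real LLM tokenizers learn subword vocabularies from data (byte-pair encoding, for instance); the sketch below is only an illustration of splitting text into word, punctuation, and space tokens:

```python
import re

def simple_tokenize(text):
    """Toy tokenizer: splits text into words, punctuation marks,
    and spaces. Production tokenizers (e.g., byte-pair encoding)
    instead learn subword vocabularies; this is an illustration."""
    return re.findall(r"\w+|[^\w\s]|\s", text)

tokens = simple_tokenize("Tokens are the new FLOPS!")
print(tokens)
# ['Tokens', ' ', 'are', ' ', 'the', ' ', 'new', ' ', 'FLOPS', '!']
print(len(tokens), "tokens")
```

Note that even this short sentence becomes ten tokens; real tokenizers average roughly one token per three or four characters of English text, though the exact ratio depends on the vocabulary.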


For a multimodal LLM, where images, audio, and video must be processed, content is typically divided into smaller units such as patches or regions, which are then processed by the model. Each patch or region can be considered a token.


Audio can be segmented into short time frames or frequency bands, with each segment serving as a token. Videos can be tokenized by dividing them into frames or sequences of frames, with each frame or sequence acting as a token.
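The patch arithmetic is simple to sketch. The 16-pixel patch size below is an assumption for illustration (it is a common choice in ViT-style vision models, but any given multimodal model may differ):

```python
def image_token_count(height, width, patch=16):
    """Number of patch tokens when an image is split into
    non-overlapping patch x patch squares (ViT-style)."""
    return (height // patch) * (width // patch)

def video_token_count(frames, height, width, patch=16):
    """Tokenizing a video as a sequence of per-frame patches."""
    return frames * image_token_count(height, width, patch)

print(image_token_count(224, 224))     # 14 * 14 = 196 tokens
print(video_token_count(8, 224, 224))  # 8 frames -> 1568 tokens
```

This is why multimodal inputs consume token budgets quickly: a few seconds of video can cost thousands of tokens.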


Tokens are not the only metrics used by large and small language models, but tokens are among the few that are relatively easy to quantify.
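Quantifying a token rate is as simple as timing a generation call. The sketch below uses a stand-in `dummy_generate` function rather than a real model call, since model APIs vary:

```python
import time

def tokens_per_second(generate, prompt):
    """Time a generation call and report throughput.
    `generate` is any callable returning a list of output tokens."""
    start = time.perf_counter()
    output_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return len(output_tokens) / elapsed

# Stand-in for a real model call (assumption: no actual LLM here).
def dummy_generate(prompt):
    time.sleep(0.01)  # pretend inference latency
    return prompt.split() * 10

rate = tokens_per_second(dummy_generate, "measure my token rate")
print(f"{rate:.0f} tokens/sec")
```

In practice the same measurement is usually split into time-to-first-token and per-token decode rate, since the two phases stress hardware differently.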


Metric | LLM | SLM
------ | --- | ---
Tokens per second | Important for measuring processing speed | Might be less relevant for real-time applications
Perplexity | Indicates ability to predict next word | Less emphasized due to simpler architecture
Accuracy | Task-specific, measures correctness of outputs | Crucial for specific tasks like sentiment analysis
Fluency and coherence | Essential for generating human-readable text | Still relevant, but might be less complex
Factual correctness | Important to avoid misinformation | Less emphasized due to potentially smaller training data
Diversity | Encourages creativity and avoids repetitive outputs | Might be less crucial depending on the application
Bias and fairness | Critical to address potential biases in outputs | Less emphasized due to simpler models and training data
Efficiency | Resource consumption and processing time are important | Especially crucial for real-time applications on resource-constrained devices

LLMs rely on various techniques to quantify their performance on attributes other than token processing rate. 


Perplexity is measured by calculating the inverse probability of a text sequence, normalized by its length. Lower perplexity indicates better performance, as it signifies the model's ability to accurately predict the next word in the sequence.
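Concretely, perplexity is the exponential of the negative mean log-likelihood the model assigns to each token. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-likelihood. Lower is better."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# A model that assigns every token probability 0.5:
coin_flip = [math.log(0.5)] * 4
print(perplexity(coin_flip))  # 2.0 -- as uncertain as a coin flip
```

The intuition: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k options at each step.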


Accuracy might compare the LLM-generated output with a reference answer. That might include precision (the proportion of the model's predictions that are correct); recall (the proportion of actual correct answers the model identifies); or the F1-score, which combines precision and recall into a single metric.
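One simple way to compute these three numbers, treating outputs as sets of tokens (set overlap is an assumption here; task-specific evaluations define "correct" differently):

```python
def precision_recall_f1(predicted, reference):
    """Token-level precision, recall, and F1 against a reference
    answer, using set overlap (one simple scheme among many)."""
    pred, ref = set(predicted), set(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = precision_recall_f1(
    ["paris", "is", "big"], ["paris", "is", "the", "capital"])
print(round(p, 3), round(r, 3), round(f, 3))
```

The F1-score is the harmonic mean of precision and recall, so a model cannot score well by inflating one at the expense of the other.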


Fluency and coherence are substantially a matter of human review for readability, grammatical correctness, and logical flow.


But automated metrics can help as well: BLEU score (compares the generated text with reference sentences, considering n-gram overlap); ROUGE score (similar to BLEU, but focused on recall of n-grams from reference summaries); and METEOR (which considers synonyms and paraphrases alongside n-gram overlap).
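The core ingredient shared by these metrics is n-gram overlap. The sketch below computes modified n-gram precision, the building block of BLEU (full BLEU also combines several n-gram orders and applies a brevity penalty, which are omitted here):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Modified n-gram precision: the fraction of candidate
    n-grams that also appear in the reference, with counts
    clipped so repeats cannot inflate the score."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(ngram_precision(cand, ref, n=2))  # 3 of 5 bigrams match: 0.6
```

ROUGE flips the denominator (reference n-grams rather than candidate n-grams), which is why it is described as recall-oriented.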


So get used to hearing about token rates, just as we hear about FLOPS, MIPS, Gbps, clock rates or bit error rates.


  • FLOPS (Floating-point operations per second): Measures the number of floating-point operations a processor can perform in one second.

  • MIPS (Millions of instructions per second): Measures the number of processor instructions executed per second, expressed in millions.

  • Bits per second (bps): Measures data transmission rate, commonly expressed as megabits per second (Mbps) or gigabits per second (Gbps).

  • Bit error rate (BER): Measures the proportion of transmitted bits that are received in error.


Token rates are likely to remain a relatively easy-to-understand measure of model performance, compared to the others, much as clock speed (cycles the processor can execute per second) often is the simplest way to describe a processor’s performance, even when there are other metrics. 


Other metrics, such as the number of cores and threads, cache size, instructions per second (IPS), or floating-point operations per second, also are relevant, but are unlikely to be as relatable, for ordinary consumers, as token rates.

