Modern computing has some virtually universal reference metrics. For Gemini 1.5 and other large language models, tokens are a basic measure of capability.
In the context of LLMs, a token is the basic unit of content (text, for example) that the model processes and generates; throughput is usually measured in “tokens per second.”
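Measuring a token rate is straightforward in principle: count the tokens generated and divide by the elapsed time. A minimal sketch, where generate() is a hypothetical stand-in for a real model call:

```python
# Minimal tokens-per-second sketch; generate() is a hypothetical
# stand-in for a real model call, and its output is fabricated.
import time

def generate(prompt):
    time.sleep(0.1)          # pretend the model is working
    return ["token"] * 50    # pretend it produced 50 tokens

start = time.perf_counter()
tokens = generate("Explain tokenization.")
elapsed = time.perf_counter() - start

print(f"{len(tokens) / elapsed:.0f} tokens/second")
```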
For a text-based model, tokens can include individual words, subwords (prefixes, suffixes or single characters) and special characters such as punctuation marks or spaces.
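To see how text breaks into such units, here is a minimal sketch using OpenAI's open-source tiktoken library; the library and encoding are illustrative choices, and Gemini uses its own tokenizer:

```python
# Minimal text-tokenization sketch using the open-source tiktoken
# library (pip install tiktoken); the encoding choice is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenizers split text into words, subwords and punctuation."
ids = enc.encode(text)

print(ids)                             # integer token IDs
print([enc.decode([i]) for i in ids])  # the text piece behind each ID
```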
For a multimodal LLM, which must also process images, audio and video, content is typically divided into smaller units such as patches or regions, which are then processed by the model. Each patch or region can be considered a token.
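For images, the patch-as-token idea can be sketched in a few lines; the 224x224 image and 16-pixel patch size below are illustrative assumptions, not any particular model's values:

```python
# Minimal image-patching sketch: each fixed-size patch becomes one
# "token." The image here is random stand-in data.
import numpy as np

image = np.random.rand(224, 224, 3)   # stand-in for a real RGB image
p = 16                                # patch height/width in pixels

h, w, c = image.shape
patches = (image.reshape(h // p, p, w // p, p, c)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, p, p, c))

print(patches.shape[0])  # 196 patches, i.e. 196 image tokens
```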
Audio can be segmented into short time frames or frequency bands, with each segment serving as a token. Videos can be tokenized by dividing them into frames or sequences of frames, with each frame or sequence acting as a token.
Tokens are not the only metric used by large and small language models, but they are among the few that are relatively easy to quantify.
LLMs rely on various techniques to quantify their performance on attributes other than token processing rate.
Perplexity is calculated from the inverse probability of a text sequence, normalized for sequence length. Lower perplexity indicates better performance, as it signifies the model's ability to accurately predict the next word in the sequence.
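As a concrete sketch, perplexity can be computed from the probabilities a model assigns to each token; the probabilities below are made up for illustration:

```python
# Minimal perplexity sketch: the exponentiated average negative
# log-probability of each token. The probabilities are invented.
import math

token_probs = [0.25, 0.10, 0.60, 0.05]   # p(token | preceding tokens)

avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log)

print(round(perplexity, 2))  # lower is better; 1.0 means perfect prediction
```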
Accuracy might compare the LLM-generated output with a reference answer. Related measures include precision (the proportion of the model's positive predictions that are correct), recall (the proportion of actual correct answers the model identifies) and the F1 score, which combines precision and recall into a single metric.
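The arithmetic behind precision, recall and F1 is simple; the counts in this sketch are invented for illustration:

```python
# Minimal precision/recall/F1 sketch from invented counts.
tp, fp, fn = 8, 2, 4   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # share of the model's answers that were right
recall = tp / (tp + fn)      # share of the right answers the model found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```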
Fluency and coherence are substantially a matter of human review for readability, grammatical correctness and logical flow.
But automated metrics can help: the BLEU score compares generated text with reference sentences based on n-gram overlap; the ROUGE score is similar but focuses on recall of n-grams from reference summaries; and METEOR considers synonyms and paraphrases alongside n-gram overlap.
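Here is a minimal BLEU sketch using NLTK's sentence_bleu; real evaluations score whole corpora against multiple references, and smoothing is applied only because the example sentences are so short:

```python
# Minimal BLEU sketch with NLTK (pip install nltk); sentence-level
# scoring with smoothing, since the example sentences are short.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]

score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)

print(round(score, 3))  # closer to 1.0 means more n-gram overlap
```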
So get used to hearing about token rates, just as we hear about FLOPS, MIPS, Gbps, clock rates or bit error rates.
FLOPS (Floating-point operations per second): Measures the number of floating-point operations a processor can perform in one second.
MIPS (Millions of instructions per second): Measures how many millions of instructions a processor can execute in one second.
Bits per second (bps): Measures data transmission rate, commonly scaled to megabits per second (Mbps) and gigabits per second (Gbps).
Bit error rate (BER): Measures the proportion of transmitted bits that arrive in error.
Token rates are likely to remain a relatively easy-to-understand measure of model performance, compared to the others, much as clock speed (cycles the processor can execute per second) often is the simplest way to describe a processor’s performance, even when there are other metrics.
Other metrics, such as the number of cores and threads, cache size, instructions per second (IPS) or floating-point operations per second, also are relevant, but are unlikely to be as relatable to ordinary consumers as token rates.