While crafting the right prompt sets the direction for a Large Language Model (LLM), understanding and utilizing generation parameters allows you to fine-tune how the model arrives at its response. These parameters act like dials controlling aspects of the text generation process, influencing creativity, randomness, and output length. Mastering them is important for achieving consistent, predictable results in your applications.
Think of the LLM generating text one token (word or sub-word) at a time. At each step, it calculates the probability of every possible next token. Parameters modify how the model chooses the actual next token from this probability distribution.
Temperature is perhaps the most commonly adjusted parameter. It directly controls the randomness of the output.
Low Temperature (e.g., T=0.1 to 0.4): Values closer to 0 make the model more deterministic and focused. It will consistently pick the tokens with the highest probability. This leads to outputs that are more predictable and coherent and that stick closely to the most likely patterns in the training data. Use low temperatures for tasks requiring factual accuracy, consistency, or predictability, such as factual question answering, summarization, or extracting structured information.
High Temperature (e.g., T=0.8 to 1.5 or higher): Values closer to 1 (or even slightly above) increase randomness. The model becomes more likely to pick less probable tokens, leading to more diverse, creative, and sometimes surprising outputs. However, very high temperatures can also result in less coherent or nonsensical text. Use high temperatures for tasks where creativity, exploration, or variability is desired, such as creative writing, brainstorming, or generating multiple distinct alternatives.
Technically, temperature modifies the softmax function applied to the model's raw output scores (logits) before they become probabilities. A low temperature sharpens the probability distribution, concentrating probability mass on the top choices. A high temperature flattens the distribution, making less likely tokens more probable.
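For intuition, here is a minimal sketch, assuming only NumPy and a few made-up logit values, of how dividing the logits by the temperature reshapes the resulting probabilities.

# Temperature-scaled softmax (illustrative sketch, not any provider's implementation)
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities after scaling by temperature."""
    scaled = np.array(logits) / temperature  # divide logits by T
    scaled -= scaled.max()                   # stabilize the exponentials
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [4.0, 3.0, 2.0, 1.0]  # made-up scores for four candidate tokens

for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {np.round(probs, 3)}")
# Low T concentrates probability on the top token; high T spreads it across candidates.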
When using an API, setting the temperature might look like this (specific syntax depends on the API provider):
# Example API call snippet (conceptual)
response = client.completions.create(
    model="some-llm-model",
    prompt="Write a short story about a futuristic city.",
    temperature=0.9,  # Higher temperature for creativity
    max_tokens=150
)

response_factual = client.completions.create(
    model="some-llm-model",
    prompt="What is the capital of France?",
    temperature=0.1,  # Lower temperature for factual recall
    max_tokens=50
)
Top-p, also known as nucleus sampling, provides another way to control randomness by selecting from a smaller, dynamic set of the most probable tokens.
Instead of considering all possible tokens (like temperature does, just with adjusted probabilities), top-p considers only the smallest set of tokens whose cumulative probability exceeds a certain threshold p. The model then samples only from this "nucleus" of high-probability tokens.
With top_p = 0.9, the model sums the probabilities of the most likely tokens in descending order until the sum reaches or exceeds 0.9. It then redistributes the probabilities among only those selected tokens and samples from that subset. If the single most likely token already has a probability of 0.95, then top_p=0.9 would effectively mean only that token is considered. If the top 10 tokens sum to 0.92, then only those 10 are considered.

Top-p is often seen as an alternative or complement to temperature. It adaptively adjusts the number of tokens considered based on the shape of the probability distribution at each step. This can prevent the model from picking highly improbable "tail" tokens (which can happen even with moderate temperatures if the distribution is very flat) while still allowing for diversity when multiple good options exist.
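Here is a minimal sketch of that selection step, assuming only NumPy and an illustrative five-token distribution; real implementations apply this over the full vocabulary inside the decoding loop.

# Nucleus (top-p) filtering at a single decoding step (illustrative sketch)
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p, then renormalize."""
    order = np.argsort(probs)[::-1]              # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # number of tokens needed to reach p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])  # made-up distribution
nucleus = top_p_filter(probs, p=0.9)              # keeps the first four tokens here
next_token = np.random.choice(len(probs), p=nucleus)  # sample from the nucleus only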
Many practitioners recommend using either temperature or top-p, setting the unused parameter to a neutral value (e.g., temperature=1.0 if using top_p, or top_p=1.0 if using temperature). Check the specific API documentation for best practices.
# Example API call snippet (conceptual)
response = client.completions.create(
    model="some-llm-model",
    prompt="Suggest three innovative uses for graphene.",
    top_p=0.85,       # Sample from the top 85% probability mass
    temperature=1.0,  # Set temperature to neutral if primarily using top_p
    max_tokens=100
)
Top-k sampling is a simpler alternative. It restricts the model to sampling only from the k most likely next tokens.
With top_k = 5, the model identifies the five tokens with the highest probabilities for the next step. It then ignores all other tokens and samples only from those top five, usually after redistributing their probabilities.

While easy to understand, top_k does not adapt to the shape of the distribution: it can be too restrictive when the distribution is flat and plausible tokens beyond the k-th are discarded, and not restrictive enough when the probability mass is concentrated in fewer than k tokens, since the remaining slots are filled by unlikely candidates. Top-p often provides more adaptive control.
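A comparable sketch for top-k filtering, under the same assumptions (NumPy only, made-up probabilities), looks like this:

# Top-k filtering at a single decoding step (illustrative sketch)
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens and renormalize."""
    top_indices = np.argsort(probs)[::-1][:k]  # indices of the k highest probabilities
    filtered = np.zeros_like(probs)
    filtered[top_indices] = probs[top_indices]
    return filtered / filtered.sum()

probs = np.array([0.40, 0.20, 0.15, 0.10, 0.08, 0.07])  # made-up distribution
restricted = top_k_filter(probs, k=3)
next_token = np.random.choice(len(probs), p=restricted)  # sample from the top 3 only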
# Example API call snippet (conceptual)
response = client.completions.create(
    model="some-llm-model",
    prompt="List common types of renewable energy.",
    top_k=10,  # Sample only from the 10 most likely next tokens
    max_tokens=75
)
This parameter sets a hard limit on the length of the generated response, measured in tokens. Tokens can be words, parts of words, punctuation, or spaces, depending on the model's tokenizer.
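To see why token counts differ from word counts, here is a small check using the tiktoken library; the library choice and the cl100k_base encoding are assumptions, since the appropriate tokenizer depends on the model you actually call.

# Counting tokens with a tokenizer (illustrative; assumes the tiktoken package)
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # example encoding, not universal

text = "Explain the concept of recursion in programming."
tokens = encoding.encode(text)
print(len(text.split()), "words ->", len(tokens), "tokens")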
Setting max_tokens (or similar names like max_length and max_new_tokens) prevents the model from generating excessively long, rambling, or potentially expensive responses. Check exactly what is counted: some APIs include the prompt tokens in the limit, while others count only the newly generated tokens (often signaled by a name like max_new_tokens). API documentation is your guide here. Newer APIs often favor max_new_tokens.

# Example API call snippet (conceptual)
response = client.completions.create(
    model="some-llm-model",
    prompt="Explain the concept of recursion in programming. Be concise.",
    temperature=0.3,
    max_tokens=100  # Limit the generated explanation to 100 tokens
)
You might encounter a few other parameters. Stop sequences are specific strings, such as ["\n", " Human:", " AI:"], that you pass to the API. When the model generates one of these sequences, it immediately stops generation, even if max_tokens hasn't been reached. This is useful for structuring dialogues or preventing the model from generating unwanted text after a natural endpoint. Frequency and presence penalties, where supported, discourage the model from repeating tokens or topics it has already produced.
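Continuing the conceptual snippets above, a dialogue-style call with stop sequences might look like the following; the stop parameter name is an assumption, as providers name and format it differently.

# Example API call snippet (conceptual)
response = client.completions.create(
    model="some-llm-model",
    prompt="The following is a conversation.\nHuman: What causes tides?\nAI:",
    temperature=0.5,
    max_tokens=100,
    stop=[" Human:", " AI:"]  # Stop before the model writes the next speaker's turn
)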
There's no single "best" set of parameters; the optimal configuration depends heavily on your specific task:

Factual or precise tasks (question answering, summarization, extraction): use a low temperature (e.g., 0.1-0.4) and potentially top_p=1.0.

Creative or exploratory tasks (storytelling, brainstorming): use a higher temperature (e.g., 0.7-1.0) or a moderate top_p (e.g., 0.8-0.95). You might also slightly decrease penalties if some repetition is acceptable.

Balanced or conversational tasks: a moderate temperature (e.g., 0.5-0.7) is often a good starting point.

Experimentation is essential. The best way to understand how these parameters affect output for your specific use case is to try different values and observe the results. Start with the default API settings or the suggested values above, then adjust incrementally based on whether you need more creativity, more focus, less repetition, or different output lengths. Keep track of which combinations work best for different types of prompts.
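One lightweight way to keep track of combinations that work is to store them as named presets and unpack the chosen preset into each call; the preset names and values below are illustrative starting points, not recommendations from any particular provider.

# Illustrative parameter presets (conceptual; values are starting points, not fixed rules)
PRESETS = {
    "factual":  {"temperature": 0.2, "top_p": 1.0, "max_tokens": 150},
    "balanced": {"temperature": 0.6, "top_p": 1.0, "max_tokens": 300},
    "creative": {"temperature": 0.9, "top_p": 1.0, "max_tokens": 400},
}

response = client.completions.create(
    model="some-llm-model",
    prompt="Summarize the main causes of coral bleaching.",
    **PRESETS["factual"],  # Unpack the chosen preset into the call
)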
Understanding these parameters gives you a much finer degree of control over LLM outputs beyond just the prompt itself. In the upcoming hands-on exercise, you'll get to experiment directly with these settings and see their effects firsthand.