Algorithmic and rule-based systems represent some of the earliest approaches to generating text programmatically. While newer machine learning techniques, particularly those involving Large Language Models (LLMs), have gained prominence for their ability to produce more fluent and diverse text, these foundational methods still hold value. They are especially useful in scenarios requiring high degrees of control, predictability, or when dealing with highly structured data. These systems operate on explicitly defined rules, grammars, or procedural logic, rather than learning patterns from vast amounts of data as statistical models do.
One of the most straightforward methods for algorithmic text creation is template-based generation. This technique involves using a fixed sentence structure with specific parts, or "slots," that can be filled with varying words or phrases.
A template is essentially a string with placeholders. These slots are then populated from predefined lists of words or by functions that generate appropriate content for each slot.
For instance, consider generating simple product update notifications:
"New feature: [FeatureName]
is now available! Users can now [Benefit]
."
Here, [FeatureName]
and [Benefit]
are slots. You might have corresponding lists:
FeatureName_options = ["Advanced Search", "Dark Mode", "Collaboration Tools"]
Benefit_options_for_Advanced_Search = ["find information more quickly", "use complex query operators"]
Benefit_options_for_Dark_Mode = ["reduce eye strain in low light", "enjoy a new visual theme"]
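To make this concrete, here is a minimal sketch of the product-update example. The dictionary name feature_benefits and the Collaboration Tools benefit are illustrative additions, not part of the lists above; the point is simply that the options for one slot can depend on the value chosen for another:
import random

# Hypothetical data for the product-update example; the benefit options
# depend on which feature was chosen for the first slot.
feature_benefits = {
    "Advanced Search": ["find information more quickly", "use complex query operators"],
    "Dark Mode": ["reduce eye strain in low light", "enjoy a new visual theme"],
    "Collaboration Tools": ["edit documents together in real time"],  # illustrative entry
}

def product_update_notification():
    feature = random.choice(list(feature_benefits.keys()))
    benefit = random.choice(feature_benefits[feature])
    return f"New feature: {feature} is now available! Users can now {benefit}."

print(product_update_notification())
# e.g. "New feature: Dark Mode is now available! Users can now reduce eye strain in low light."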
Python Example: Simple Template Filling
Let's look at a Python snippet that demonstrates how to fill such templates:
import random

templates = [
    "The {adjective} {noun} {verb} the {object}.",
    "A {noun} often {verb} when it's {adjective}.",
    "Did you see the {adjective} {noun} near the {object}?"
]

word_lists = {
    "adjective": ["quick", "lazy", "sleepy", "noisy", "hungry"],
    "noun": ["fox", "dog", "cat", "rabbit", "bird"],
    "verb": ["jumps", "runs", "sleeps", "eats", "chases"],
    "object": ["log", "fence", "river", "tree", "house"]
}

def fill_template(template, lists):
    output = template
    for slot_key, options in lists.items():
        placeholder = "{" + slot_key + "}"
        # Replace each occurrence of this placeholder; every occurrence
        # receives an independent random choice from the word list.
        while placeholder in output:
            output = output.replace(placeholder, random.choice(options), 1)
    # Any placeholder still present was not covered by the word lists.
    # A robust application would log a warning, supply a default value, or raise an error here.
    return output

# Generate a few sentences
for _ in range(3):
    chosen_template = random.choice(templates)
    print(fill_template(chosen_template, word_lists))
Running this code might produce output like:
The sleepy fox jumps the log.
A dog often eats when it's hungry.
Did you see the quick cat near the fence?
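Because each slot is filled with random.choice, the sentences differ between runs. If you need repeatable output, for example when the generated text feeds automated tests, you can seed Python's random module before generating. A small sketch, reusing fill_template, templates, and word_lists from above:
random.seed(42)  # Fix the random state so repeated runs produce identical sentences
for _ in range(3):
    print(fill_template(random.choice(templates), word_lists))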
Advantages:
Template-based generation gives you complete control over wording and structure, is simple to implement, and guarantees well-formed output within the templates you write. It suits highly structured, repetitive text such as automated weather-style reports (e.g., "[Condition] with a high of [Temp]°C."), simple alerts, or components of form letters.
Disadvantages:
The variety of output is limited to the templates and word lists written by hand, so the text quickly becomes repetitive, and the approach does not scale to open-ended content or adapt to situations the template author did not anticipate.
Grammar-based systems adopt a more structured approach by defining a formal grammar that dictates how sentences can be constructed. Context-Free Grammars (CFGs) are frequently used for this purpose.
A CFG is composed of:
- Non-terminal symbols, which stand for syntactic categories (e.g., S for sentence, NP for noun phrase, VP for verb phrase).
- Terminal symbols, the actual words that appear in the generated text.
- Production rules describing how non-terminals expand; for example, S -> NP VP signifies that a sentence can be formed by a noun phrase followed by a verb phrase.
- A designated start symbol (typically S) from which the generation process begins.
Text generation involves starting with the start symbol and repeatedly applying production rules to expand non-terminals until only terminal symbols remain in the sequence.
Example CFG Production Rules:
S -> NP VP
NP -> Det N | Det Adj N
VP -> V | V NP | V Adv
Det -> "the" | "a" | "another"
N -> "cat" | "dog" | "mouse" | "ball"
Adj -> "big" | "small" | "red"
V -> "chased" | "ate" | "saw" | "played with"
Adv -> "quickly" | "happily"
Using these rules, a system could generate sentences like "the cat chased the ball" or "a small dog ate quickly", both of which follow directly from the productions above.
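As a worked example of the expansion process, one derivation of the first sentence proceeds step by step, each arrow applying a single production rule to the leftmost non-terminal:
S
-> NP VP
-> Det N VP
-> "the" N VP
-> "the cat" VP
-> "the cat" V NP
-> "the cat chased" NP
-> "the cat chased" Det N
-> "the cat chased the" N
-> "the cat chased the ball"
The process stops once only terminal words remain.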
Python Illustration with NLTK
Full implementation of a CFG parser and generator can be quite involved. Fortunately, Python libraries like NLTK (Natural Language Toolkit) provide tools for working with CFGs. Below is an illustrative sketch demonstrating how you might define and use a CFG with NLTK, assuming NLTK is installed (pip install nltk).
import random

from nltk import CFG
# For random sentence generation from a CFG, we use a custom recursive function;
# nltk.parse.generate could be used instead if systematic enumeration suits our needs.

# Define a simple CFG as a string
grammar_rules = """
S -> NP VP
NP -> Det N | Det Adj N PP | Det Adj N
VP -> V | V NP | V PP | V NP PP
PP -> P NP
P -> 'with' | 'on' | 'under' | 'near'
Det -> 'the' | 'a' | 'one' | 'some'
N -> 'cat' | 'dog' | 'mouse' | 'ball' | 'mat' | 'table' | 'park'
Adj -> 'big' | 'small' | 'red' | 'fluffy' | 'quick'
V -> 'chased' | 'saw' | 'played' | 'slept' | 'ate' | 'found'
"""

# Create a CFG object from the string definition
cfg_grammar = CFG.fromstring(grammar_rules)

# Function to randomly generate a sentence from the CFG
def generate_random_sentence_from_cfg(grammar, symbol=None):
    if symbol is None:
        symbol = grammar.start()  # Get the start symbol (e.g., S)
    if isinstance(symbol, str):  # Terminal symbol (a word)
        return [symbol]
    # Non-terminal symbol: choose one of its productions randomly
    productions = grammar.productions(lhs=symbol)
    if not productions:
        # Should not be reached if the grammar is well-formed and
        # every non-terminal can expand to terminals.
        return ["<error_empty_production>"]
    chosen_production = random.choice(productions)
    sentence_parts = []
    for sym_in_rhs in chosen_production.rhs():  # Each symbol on the rule's right-hand side
        sentence_parts.extend(generate_random_sentence_from_cfg(grammar, sym_in_rhs))
    return sentence_parts

print("Generated sentences using CFG and random expansion:")
for _ in range(3):  # Generate 3 random sentences
    sentence_tokens = generate_random_sentence_from_cfg(cfg_grammar)
    print(' '.join(sentence_tokens))
Running this code might produce varied outputs like:
Generated sentences using CFG and random expansion:
a fluffy dog played with a ball
the red mouse slept under one big table
some cat found the small park
Note that the quality and coherence of these sentences depend heavily on the grammar design. More complex grammars can produce more sophisticated, but also potentially more nonsensical, outputs if not carefully constrained.
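If you prefer systematic coverage over random sampling, NLTK also ships a generate helper that enumerates sentences from a grammar up to a depth limit. A brief sketch, reusing cfg_grammar from above; the depth and n values are arbitrary choices for illustration:
from nltk.parse.generate import generate

# Enumerate the first 5 sentences the grammar can produce, limiting recursion depth
# so the mutually recursive PP -> P NP and NP -> ... PP rules cannot expand indefinitely.
for tokens in generate(cfg_grammar, depth=5, n=5):
    print(' '.join(tokens))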
Advantages:
Every generated sentence is guaranteed to follow the syntactic structure defined by the grammar, and even a modest set of rules yields far more combinatorial variety than a fixed list of templates.
Disadvantages:
Writing and maintaining a grammar that covers a realistic range of language is labor-intensive, and because the rules encode syntax rather than meaning, grammatically valid but semantically nonsensical sentences (such as "the fluffy table chased some park") are common.
Although often categorized as a very basic statistical method, text generation using Markov chains can also be viewed as an algorithmic approach. Here, the "rule" for generating the next word is based on transition probabilities between words (or characters) derived from a corpus.
A Markov chain model for text generation predicts the next word in a sequence based only on the n-1 preceding words; the context words plus the predicted word form an n-gram, so these are known as n-gram models. A bigram model (n = 2), for example, conditions each prediction on just the single previous word.
How it works:
1. Tokenize a training corpus into a sequence of words.
2. Count how often each word follows each context of the preceding n-1 words; these counts serve as transition probabilities.
3. Choose a starting word (or context), then repeatedly sample the next word in proportion to the transition counts of the current context, appending it and shifting the context forward.
4. Stop after a fixed number of words or when a designated end state is reached.
Diagram: Bigram Markov Chain for Text
A simplified Markov chain illustrating potential word transitions and their associated probabilities (P). Text generation begins at "START" and proceeds by following paths according to these probabilities until an "END" state is reached.
Python Sketch: Simple Bigram Text Generator
import random
from collections import defaultdict, Counter

# Sample corpus (very small for demonstration)
corpus_text = "the cat sat on the mat the dog ran on the grass the cat ran too the dog sat by the mat"
words = corpus_text.lower().split()  # Normalize to lowercase

# Build bigram frequency model
# bigram_model stores: current_word -> Counter(next_word: frequency)
bigram_model = defaultdict(Counter)
for i in range(len(words) - 1):
    current_word = words[i]
    next_word = words[i + 1]
    bigram_model[current_word][next_word] += 1

def generate_markov_text(start_word, num_words=10, model=bigram_model):
    start_word = start_word.lower()
    if start_word not in model:
        # Fall back to a random start word if the provided one isn't in the model
        if not model:
            return "Model is empty."
        start_word = random.choice(list(model.keys()))
    current_word = start_word
    sentence = [current_word]
    for _ in range(num_words - 1):
        if current_word not in model or not model[current_word]:
            break  # No known next word for the current word
        # Possible next words and their observed frequencies
        next_word_options = model[current_word]
        population = list(next_word_options.keys())
        weights = list(next_word_options.values())
        # Weighted random selection proportional to the observed frequencies
        next_word = random.choices(population, weights=weights, k=1)[0]
        sentence.append(next_word)
        current_word = next_word
        # Simple, arbitrary stop condition (optional)
        if next_word == "mat" and random.random() < 0.3:
            break
    return ' '.join(sentence)

# Generate text
print(f"Generated 1: {generate_markov_text('the', 7)}")
print(f"Generated 2: {generate_markov_text('cat', 5)}")
print(f"Generated 3 (random start): {generate_markov_text('unknown_word', 8)}")  # Picks a random start
Running this code might yield outputs such as:
Generated 1: the cat sat on the mat the
Generated 2: cat ran on the grass
Generated 3 (random start): dog sat by the mat the cat sat
(Note: The exact output will vary due to the random selection.)
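To see where these choices come from, you can inspect the transition counts the model stores for a single context word and convert them to probabilities. In the tiny corpus above, "the" is followed by "cat", "mat", and "dog" twice each and by "grass" once:
# Turn the raw follow-up counts for "the" into a probability distribution
the_counts = bigram_model["the"]   # Counter({'cat': 2, 'mat': 2, 'dog': 2, 'grass': 1})
total = sum(the_counts.values())   # 7 occurrences of "the" followed by another word
probabilities = {word: count / total for word, count in the_counts.items()}
print(probabilities)  # e.g. {'cat': 0.286, 'mat': 0.286, 'dog': 0.286, 'grass': 0.143} (rounded)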
Advantages:
Markov chain generators are simple to build from any text corpus, require no hand-written rules, and are computationally cheap; their output mirrors the local word patterns of the source text.
Disadvantages:
Because each choice depends only on a short window of preceding words, the generated text lacks long-range coherence and grammatical guarantees, and it can only reproduce transitions that actually occur in the training corpus.
Despite their limitations when compared to modern neural models, these foundational techniques remain useful in specific contexts:
- Structured data-to-text reporting, where values are slotted into fixed phrasing (e.g., "... [percentage]% in the [region] region.").
- Simple alerts, notifications, and form letters where predictability matters more than variety.
- Applications that require every possible output to be controllable and auditable, such as test-data generation.
The primary drawbacks of algorithmic and rule-based systems stem from their lack of deep semantic understanding and their heavy reliance on manually crafted rules or simple local statistics. They generally struggle with:
- Producing genuinely novel or varied phrasing beyond what their templates, rules, or observed transitions allow.
- Maintaining coherence and context over longer passages.
- Capturing meaning, so outputs can be syntactically valid yet semantically nonsensical.
- Scaling to open-ended domains without substantial manual effort.
These limitations are precisely what have driven the development of more sophisticated techniques, particularly those based on machine learning and, more recently, Large Language Models. As we progress through this chapter and course, we will explore methods that aim to overcome these challenges by learning complex patterns directly from vast amounts of data. This enables the generation of more fluent, coherent, and contextually appropriate synthetic text. However, understanding these earlier algorithmic and rule-based methods provides a valuable baseline. It highlights the specific problems that more advanced approaches seek to solve, and these foundational techniques also remain a practical choice for certain well-defined tasks where their strengths align with the requirements.