“If you want to know where the future is being made, look for where language is being invented and lawyers are congregating.” – Stewart Brand

This is certainly true for AI. Let’s set [the lawyers] aside for a bit and focus on the words. (Though frequently the law and the words can’t be pulled apart!)

Anyone who writes about AI for a wide audience will find themselves explaining the same terms over and over. You can’t know what a reader brings with them to a piece, and the pace of the field befuddles even experts, who debate the meaning of common terms.

This repetition of definitions inspired the creation of an AI glossary for this site. This is a work in progress with a significant backlog, but today I’d like to share how it got started, using [DSPy][dspy], Claude 3.5 Haiku, and some Jekyll features.

This site is a Jekyll site. Jekyll is a static blogging engine written in Ruby. By “static”, we mean there’s no server – just files. Jekyll generates these files by parsing all my markdown files and HTML templates and assembling them into a complete site.

We’re going to create a small Python script which will prepare these markdown posts and pipe each of them past an LLM to identify potential glossary terms and definitions.

First, let’s pull in the imports we’ll need, set up DSPy, and point it at Claude:

import re
from pathlib import Path
from typing import List

import dspy
import yaml

# Set up DSPy and the LM
lm = dspy.LM('anthropic/claude-3-5-haiku-latest', api_key='YOUR_API_KEY')
dspy.configure(lm=lm)

The last time we used DSPy, we explored how it works and how it generates and optimizes prompts for you given some light structure and definitions. This time, we want a more complex object returned to us: not just a glossary term, but also its definition, synonyms, acronym (if it has one), and expounding details from the post. Complicating this is that we want many terms per post – DSPy needs to return an array of fully defined terms.

Thankfully, DSPy works nicely with Pydantic, a data validation library that lets us define our desired term object:

from pydantic import BaseModel

# Define the Term object we want returned
class Term(BaseModel):
    term: str = dspy.OutputField(desc="A glossary term, like: a technical term specific to the subject matter, a concept crucial to understanding an article's main ideas, a term explicitly defined or explained in a post, or a word or phrase that is frequently used or emphasized in the post. Do not include the abbreviation in the 'term' field.")
    abbreviation: str = dspy.OutputField(desc="Populate the abbreviation field if the term is abbreviated in the article; ensure that it is not pluralized. If there is no abbreviation, populate the abbreviation field with an empty string.")
    definition: str = dspy.OutputField(desc="A definition of the term. Lightly edit the definition so it can stand alone outside the context of the post, but ensure that you do not add any information that is not present in the original text.")
    details: str = dspy.OutputField(desc="Text from the post that expounds a bit on the term, adding texture and details beyond the definition. The 'details' field can be empty if there is no additional context to provide and multiple paragraphs if there is more than one piece of context to provide.")
    synonyms: List[str] = dspy.OutputField(desc="Any synonyms, acronyms, or alternative terms that are used in the post")

Here we’re not only defining the attributes of each Term we want returned, but also lightly describing each attribute. DSPy will notice these types and descriptions and use them in its instructions to the LLM.

At first, this felt needlessly wordy. And if we’re going to get this detailed, why not just fall back to a standard long prompt, complete with example formatting?

For one, I really like the way this breaks the prompt down into its separate components. It’s easier to navigate the Term descriptions than it is to eyeball a wall of triple-quoted strings. Adding or removing an attribute of the Term definition is simple.

Also, DSPy manages the extraction of the structured data from the prompt. By defining my signature like so, I can call it and get back a list of populated Term objects without mucking about with the raw text reply:

# Find key terms for the post and terms where their definition might not be clear to the reader
class ExtractTerms(dspy.Signature):
    """
    Find key terms for the post and terms where their definition might
    not be clear to the reader, from a markdown blog post. Ignore all 
    text between markdown code blocks.
    """

    post: str = dspy.InputField(desc="the markdown blog post")
    terms: List[Term] = dspy.OutputField(desc="Array of glossary terms.")

extractTerms = dspy.Predict(ExtractTerms)

This can then be called with:

terms = extractTerms(post=MY_MARKDOWN_POST_STRING).terms
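
DSPy weaves those field descriptions into its instructions to the model. If you want to see exactly what it sent, you can replay the most recent call – a quick sketch, reusing the `MY_MARKDOWN_POST_STRING` stand-in from above:

# Run the extraction, then look at what DSPy actually sent to Claude
result = extractTerms(post=MY_MARKDOWN_POST_STRING)
for t in result.terms:
    print(f"- {t.term}: {t.definition}")

# Print the most recent prompt/response pair
dspy.inspect_history(n=1)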

Now we can go through each post, get the terms for that post, and note which post the term was found in:

# Get the terms from the posts
posts_path = Path("../_posts")
glossary = []
for post_file in sorted(posts_path.glob('*.md')):
    print(f"Processing {post_file}")
    with open(post_file, 'r') as f:
        post_content = f.read()
    # Remove any YAML frontmatter if it exists
    post_content = re.split(r'\n---\n', post_content, maxsplit=2)[-1]
    try:
        terms = extractTerms(post=post_content)
    except Exception as e:
        print(f"Failed to process {post_file}: {e}")
        continue
    for term in terms.terms:
        # We convert our Term object to a dict so we can
        # save the path of the post it was found in
        term_dict = term.dict()
        # Store the path relative to the site root, without the leading '../'
        path = str(post_file)
        term_dict['path'] = path[3:] if path.startswith('../') else path
        print(f"Adding term {term_dict['term']}")
        glossary.append(term_dict)

If the same terms are identified in multiple posts (and they were), we’re going to have duplicate entries in our glossary list. We can merge them, capturing each post that cited a given term and concatenating their details.

# Compare two term dicts to see if they are the same term
def compare_terms(term1, term2):
    if term1['term'].lower() == term2['term'].lower():
        return True
    if any(syn.lower() in [s.lower() for s in term2['synonyms']] for syn in term1['synonyms']):
        return True
    if term1['term'].lower() in [s.lower() for s in term2['synonyms']]:
        return True
        
    return False
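
A quick sanity check of the matcher, using a pair of made-up term dicts (hypothetical values, purely to illustrate the synonym-overlap case):

a = {'term': 'RLHF', 'synonyms': ['alignment training']}
b = {'term': 'Reinforcement Learning from Human Feedback',
     'synonyms': ['RLHF', 'human-guided AI training']}

# 'rlhf' appears in b's lowercased synonyms, so these count as the same term
print(compare_terms(a, b))  # True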

# Condense the glossary by finding identical terms and merging their definitions, details, and synonyms.
merged_glossary = {}
for term in glossary:
    found = False
    for key in merged_glossary:
        if compare_terms(term, merged_glossary[key]):
            found = True
            merged_glossary[key]['details'] += "\n\n" + term['details']
            merged_glossary[key]['synonyms'] += term['synonyms']
            merged_glossary[key]['pages'].append(term['path'])
            merged_glossary[key]['synonyms'] = list(set(merged_glossary[key]['synonyms']))
            break
    if not found:
        page = term['path']
        term['pages'] = [page]
        merged_glossary[term['term']] = term
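
Before writing anything to disk, it’s worth eyeballing how much the merge collapsed. A small sketch against the structures above:

# How many raw extractions collapsed into each merged entry?
print(f"{len(glossary)} raw terms -> {len(merged_glossary)} merged entries")
for name, entry in list(merged_glossary.items())[:3]:
    print(f"  {name}: cited in {len(entry['pages'])} post(s)")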

Then we sort and save it to the _data directory:

# Sort the merged_glossary by keys
sorted_glossary = dict(sorted(merged_glossary.items()))

# Create the _data directory if it doesn't exist
Path("../_data").mkdir(parents=True, exist_ok=True)

# Write the sorted glossary values to a YAML file
with open('../_data/glossary_gen.yaml', 'w') as yaml_file:
    yaml.dump(list(sorted_glossary.values()), yaml_file, default_flow_style=False, sort_keys=False)

We’re calling it glossary_gen.yaml here because our final glossary will simply be glossary.yaml. We’ll hand-review and edit the generated output, renaming it to the simpler name when we’re done. That way any future generation won’t overwrite our hand-polished file.

YAML files in the _data directory are handled specially by Jekyll. The YAML (or CSV or JSON) is read in as an object which we can reference during the building of our site.
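
It’s easy to sanity-check the generated file from Python before handing it to Jekyll, reading it back with PyYAML (the same library we used to write it):

# Load the data file as the list of entries Jekyll will see
with open('../_data/glossary_gen.yaml') as f:
    entries = yaml.safe_load(f)

print(f"{len(entries)} glossary entries")
print(entries[0]['term'], '-', entries[0]['definition'])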

Our glossary page uses some light templating to render every term.

But even better, we can solve our original problem with Jekyll’s include feature, which is similar to Rails’ partials. Let’s create _includes/term.html like so:

<div class="term">
{% for item in site.data.glossary %}
    {% if item.term == include.term or item.abbreviation == include.term %}
        {% if item.abbreviation == "" %}
            <h1>{{ item.term }}</h1>
        {% else %}
            <h1>{{ item.term }} ({{ item.abbreviation }})</h1>
        {% endif %}
        {% if item.synonyms.size > 0 %}
            <p class="aka"><span class="aka-header">Also known as</span> {{ item.synonyms | join: ", " }}</p>
        {% endif %}
            <div class="definition">
                <p>{{ item.definition }}</p>
            {% if include.show_details == "true" %}
                <p>{{ item.details | markdownify }}</p>
            {% endif %}
            </div>
    {% endif %}
{% endfor %}
</div>

Add some CSS styling and we can add this line to any future post:

{% include term.html term="RLHF" %}

Which yields:

Reinforcement Learning from Human Feedback (RLHF)

Also known as alignment training, human-guided AI training

A training technique where human contractors provide feedback to improve AI model outputs, correcting problematic responses and guiding the model's behavior.

We can add an extra parameter to expound a bit:

{% include term.html term="RLHF" show_details="true" %}

Which gets us:

Reinforcement Learning from Human Feedback (RLHF)

Also known as alignment training, human-guided AI training

A training technique where human contractors provide feedback to improve AI model outputs, correcting problematic responses and guiding the model's behavior.

RLHF is primarily used to make LLMs easier to use. ChatGPT’s breakthrough can partially be chalked up to OpenAI’s use of RLHF to train a base GPT-3 model for chat interactions. Prior to ChatGPT, most LLMs were text-completion models, not conversation models. RLHF is also used to make sure models behave. For example, OpenAI uses RLHF to ensure that ChatGPT doesn’t generate toxic or inappropriate responses. The company has a team of human contractors who provide feedback on the model’s outputs, helping to reduce harmful content and improve the model’s alignment with human expectations. This task can be emotionally taxing, as workers must review violent or sexual content to guide a model’s behavior.

Using an LLM to speed up the assembly of a glossary was a huge help. Our initial YAML output was over 2,000 lines. Pruning off-topic terms and tweaking details took an hour or so.

The initial scripting with DSPy took only a dozen or so minutes. The speed at which DSPy gets you to a proof of concept is impressive: it spares you from playing whack-a-mole with a long prompt and gives you scaffolding for future iteration and optimization.

If you’d like to try this out with your own site, you can find all my code here. Be sure to let me know how it goes!

