Using AI to Boost Accessibility and SEO

A cartoon of a smiling robot looking at a computer screen with an image of trees and a house; it's emitting a speech bubble with a stream of made-up character shapes representing its description.
Image by Annie Ruygt

This article is about putting AI to work for us, making our websites more accessible and helping them perform better in SEO. Fly is an excellent platform for running services that prioritize accessibility and SEO. Get started now!

In the fast-paced world of digital content, the adage “a picture is worth a thousand words” remains as true as ever. This is the case for Sarah, a content manager at a popular online magazine. She is racing to finalize a feature article called “The Hidden Gems of Urban Street Art,” a visually rich showcase of Toronto’s vibrant but oft-overlooked urban art scene. As the deadline looms near, she faces a dilemma: how do we ensure that these compelling images are not only eye-catching but also accessible to all readers and at the same time optimized for SEO?

Here’s an example photo from Sarah’s feature article that needs a description for accessibility and SEO:

Photo by Scott Webb on Unsplash

A colorful example of urban art painted on a support beam of an overpass.

We’ve been tasked with solving this recurring need: context-aware alt text and caption text for the images that accompany an article. “Context-aware” means that the description should relate to how the image is being used, not simply describe the image in isolation. With these two versions of an image description, we improve both the SEO (Search Engine Optimization) and the accessibility of our content.

We’re turning to AI to help automate the process, both for new articles and for bulk processing the hundreds or thousands of images we’ve already published. Our job is to use AI to create decent alt text and caption text for all of those images.

For the AI, we will leverage both OpenAI’s ChatGPT and Anthropic’s Claude LLMs (Large Language Models) through the Elixir LangChain library to perform AI analysis on images. The same process can be applied using other languages and tools.

Let’s first understand more about what makes high quality image description text so we know what we’re aiming for.

Context matters

How the image is being used matters. The usage provides additional context for how the images should be described. Let’s consider the following image when used in different contexts.

Elderly man sitting on a bench by a lakeside, reading a book with mountains in the background.

Description with no context: Elderly man sitting on a bench by a lakeside, reading a book with mountains in the background.

Description in the context of a story about a local author: René Favre reading beside a serene lake with misty mountains in the background.

Description in the context of a travel destination: Older man sitting on a bench overlooking Lake Lucerne with mountains in the background.

How the image is used changes how it should be described. This impacts what aspects of the photo should be called out as relevant or ignored entirely.

Okay, so the description needs to be in the context of our article. What are the current best practices around alt text and captions?

What is alt text?

Alt text, or alternative text, describes images in HTML code. It’s essential for two main reasons:

  1. Accessibility: Screen readers use alt text to convey image content to visually impaired users, ensuring that everyone can fully engage with the webpage.
  2. SEO: Search engines read alt text to understand and index images, enhancing webpage visibility in search results.

In a nutshell, alt text makes images accessible and boosts their discoverability, benefiting both users and search engines. When we make our content more accessible to users and to search engines, it benefits us.

What are captions?

The <figure> and <figcaption> tags are semantic HTML5 elements that work together to provide context and captions for images, diagrams, or other media within a webpage.

ARIA (Accessible Rich Internet Applications) tags are special attributes added to HTML elements to enhance web accessibility.

The Mozilla documentation encourages us to use semantic HTML elements over ARIA tags whenever possible:

If at all possible, you should use the appropriate semantic HTML elements to mark up a figure and its caption — <figure> and <figcaption>.

In HTML, it looks like this:

<figure>
  <img src="image.png" alt="put image description here" />
  <figcaption>Figure 1: The caption</figcaption>
</figure>

On the webpage, the default styling places the caption below the media, although this can be modified through CSS.

We have more freedom with the content displayed in a figcaption; the only rule is that there can be only one per figure. It is self-contained with the media it describes, making it more portable within a page. A general recommendation is to use it to connect the media to the page, providing context for why it is there. Lastly, keep in mind that the figcaption’s contents are displayed on the page, so keep the text concise.

In a nutshell, <figure> and <figcaption> provide visible captions and context to sighted readers, assistive technology, and search engines, effectively connecting media with the surrounding article or content on the page.

How can AI help?

AI can analyze an image and provide a text description. Recent model improvements from both OpenAI and Anthropic make this much easier and more powerful.

However, unless we want the AI to run amok and wax poetic on one image and do something completely different on the next, we need to provide the set of constraints and instructions that will get us what we want. In other words, prompt engineering!

To get consistent, well-performing text, we need to understand a bit more about what makes good alt text so the AI can give us exactly what we need.

This is a list of the relevant recommendations for alt text.

  1. Concise Description: Keep the ALT text concise and to the point. Aim for a simple sentence or a fragment that conveys the essential information about the image. Typically, 125 characters or less is advisable as some screen readers may truncate longer text.
  2. Content Relevance: Focus on the information that the image conveys and its context within the page. Describe what is relevant to the content and functionality of the site. If the image contains text, such as a logo, include the text in the ALT tag.
  3. Avoid Redundancy: Do not include phrases like “image of” or “graphic of,” as screen readers often provide this context to their users. This can be redundant and unnecessarily verbose.
  4. Context Matters: Tailor the ALT text depending on where and how the image is used. The same image might need different ALT text in different contexts.
  5. Localization: When dealing with multilingual sites, make sure to translate the ALT text appropriately, keeping the same considerations in mind as for the original language.
  6. Use Proper Grammar: Even though it’s brief, ensure your ALT text uses correct spelling, grammar, and punctuation. This improves clarity and the overall user experience.

AI can help with all of these points! This dramatically simplifies the task, especially when we need to provide descriptive text for many, many images.
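
The 125-character guideline from point 1 is also easy to enforce mechanically after the fact. Here’s a minimal sketch; the validate_alt_text helper is hypothetical and not part of any library:

# Hypothetical helper (not from any library): flag AI-generated alt text
# that exceeds the 125-character guideline so it can be trimmed or regenerated.
validate_alt_text = fn alt ->
  max_length = 125

  if String.length(alt) <= max_length do
    {:ok, alt}
  else
    {:error, "alt text is #{String.length(alt)} characters; keep it to #{max_length} or fewer"}
  end
end

validate_alt_text.("A vibrant phoenix graffiti on the side of a brick building.")
#=> {:ok, "A vibrant phoenix graffiti on the side of a brick building."}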

As for the figcaption tag, we have more freedom with that and we’ll try to pull in more context from the article. It’ll be fun!

Bring on the AI!

We understand that context matters and we have guidelines for how the alt text should be generated. We’re ready to write code!

To send an image to both OpenAI and Anthropic, we’ll base64 encode the image and submit it that way. In Elixir, that process looks like this:

image_data =
  image_path
  |> File.read!()
  |> Base.encode64()

Let’s start with OpenAI’s ChatGPT service. We need model gpt-4o for the image abilities we are using.

alias LangChain.ChatModels.ChatOpenAI

openai_chat_model = ChatOpenAI.new!(%{model: "gpt-4o"})

Next, we’ll set up the messages we’re sending to the LLM. This is the “prompt engineering” step.

In the following code, notice that the “system” message provides the general context for what we are doing and what we want from the LLM.

The “user” message is made up of two parts:

  • PromptTemplate: supports variable replacement tags using EEx templates. This allows us to easily customize the prompt for each image as we process through a whole batch. This turns into a ContentPart.
  • ContentPart: Makes it easy for us to provide our image directly to the LLM.

We provide the LLM with the context for the task, specific instructions about an image, and an image to analyze with a “vision” enabled model so it can perform the task.

alias LangChain.Message
alias LangChain.Message.ContentPart
alias LangChain.PromptTemplate

messages = [
  Message.new_system!("""
  You are an expert at providing an image description for assistive technology and SEO benefits.

  The image is included in an online article titled "The Hidden Gems of Urban Street Art."

  The article aims to showcase the vibrant and often overlooked artworks that adorn
  the nooks and crannies around the city of Toronto Canada.

  You generate text for two purposes:
  - an HTML img alt text
  - an HTML figure, figcaption text

  ## Alt text format
  Briefly describe the contents of the image where the context is focusing on the urban street art.
  Be concise and limit the description to 125 characters or less.

  Example alt text:
  > A vibrant phoenix graffiti with blazing orange, red, and gold colors on the side of a brick building in an urban setting.

  ## figcaption format
  Image caption descriptions should focus on the urban artwork, providing a description of the appearance,
  style, street address if available, and how it relates to the surroundings. Be concise.

  Example caption text:
  > A vibrant phoenix graffiti on a brick building at Queen St W and Spadina Ave. With wings outstretched, the mural's blazing oranges, reds, and golds contrast sharply against the red brick backdrop. Passersby pause to observe, integrating the artwork into the urban landscape.
  """),
  Message.new_user!([
    PromptTemplate.from_template!("""
    Provide the descriptions for the image. Incorporate relevant information from the following additional details if applicable:

    <%= @extra_image_info %>

    Output in the following JSON format:

    {
      "alt": "generated alt text",
      "caption": "generation caption text"
    }
    """),
    ContentPart.image!(image_data, media: :jpg, detail: "low")
  ])
]

NOTE: Make sure the :media option matches both the image and what is supported by the LLM you are working with.

NOTE: The user Message uses a PromptTemplate with an <%= @extra_image_info %> tag. As we process a whole set of images, data from the system where we get the image is rendered into our prompt, helping to further customize the generated text from the LLM and making it far more specific and relevant.
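
On that first note: to keep the :media option in sync with the actual file, a small helper can derive it from the file’s extension. This is a hypothetical sketch, not part of the LangChain library; check which image types your chosen LLM accepts:

# Hypothetical helper: derive the :media option from the image's file extension.
# Supported image types vary by provider; verify against your LLM's documentation.
media_for = fn image_path ->
  case image_path |> Path.extname() |> String.downcase() do
    ".jpg" -> :jpg
    ".jpeg" -> :jpg
    ".png" -> :png
    ".webp" -> :webp
    other -> raise "unsupported image type: #{other}"
  end
end

ContentPart.image!(image_data, media: media_for.(image_path), detail: "low")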

JSON Output

For each image, we want two pieces of generated content. To make this easy on ourselves, we instruct the LLM to output the two pieces of information in a JSON object.

Specifically, we instruct it to output in the following format:

{
  "alt": "generated alt text",
  "caption": "generation caption text"
}

To make working with that output easier, we’ll use a JsonProcessor for processing messages from the LLM. It has the added ability to return a JSON formatting error to the LLM in the case when it gets it wrong.

We’ll see this next when we put it all together.

Making the request

Everything is ready to make the request!

  • We have the image
  • We set up which LLM we are connecting with
  • We provide context in our prompt and instructions for the type of description we want

Now, we’ll submit the request to the server and review the response. For this example, the image_data_from_other_system is a substitute for a database call or other lookup for additional information we have on the image.

alias LangChain.Chains.LLMChain
alias LangChain.MessageProcessors.JsonProcessor

# This data comes from an external data source per image.
# When we `apply_prompt_templates` below, the data is rendered into the template.
image_data_from_other_system = "image of urban art mural on underpass at 507 King St E"

{:ok, _updated_chain, response} =
  %{llm: openai_chat_model, verbose: true}
  |> LLMChain.new!()
  |> LLMChain.apply_prompt_templates(messages, %{extra_image_info: image_data_from_other_system})
  |> LLMChain.message_processors([JsonProcessor.new!()])
  |> LLMChain.run(mode: :until_success)

response.processed_content

Notice that when running the chain, we use the option mode: :until_success. Some LLMs are better at generating valid JSON than others. Because we included the JsonProcessor, it parses the assistant’s content, converting it into an Elixir map. The converted data is stored on the message’s processed_content.

If the LLM fails to give us valid JSON, the JsonProcessor generates a :user message reporting the issue to the LLM. The :until_success mode repeatedly makes the round-trip requests to the LLM, allowing it to correct any errors. But don’t worry, it won’t run forever! The LLMChain’s max_retry_count makes it give up after a set number of failures, the default being 3.
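
If a particular model needs more room to self-correct, the retry ceiling can be raised when building the chain. A minimal sketch, assuming your version of the library accepts max_retry_count in the chain’s attributes:

# A sketch: allow up to 5 failed JSON attempts before the chain gives up.
{:ok, _updated_chain, response} =
  %{llm: openai_chat_model, max_retry_count: 5, verbose: true}
  |> LLMChain.new!()
  |> LLMChain.apply_prompt_templates(messages, %{extra_image_info: image_data_from_other_system})
  |> LLMChain.message_processors([JsonProcessor.new!()])
  |> LLMChain.run(mode: :until_success)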

Here’s a sample of what was generated when I ran it:

%{
  "alt" => "Colorful mural of a face under a bridge at 507 King St E",

  "caption" => "A captivating face mural under the 507 King St E underpass, featuring vivid hues and expressive eyes. The art adds a pop of color to the urban landscape, drawing the attention of pedestrians as they pass by."
}

Great! We got highly specific image descriptions both for alt text and for use in a caption. Even better, it’s all written in the context of the article about urban art. Importantly, we got the data back in a machine-friendly format that our application can easily work with.
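
As a quick illustration of how that map might flow back into the markup from earlier, here’s a hypothetical helper (not from the article’s code) that renders the <figure> element; the "mural.jpg" filename is just a stand-in. In a real Phoenix app you’d more likely do this in a HEEx template, which also handles HTML escaping:

# Hypothetical helper: turn the parsed map into the <figure> markup shown earlier.
# Values are interpolated directly here; a HEEx template would escape them for you.
figure_html = fn src, %{"alt" => alt, "caption" => caption} ->
  """
  <figure>
    <img src="#{src}" alt="#{alt}" />
    <figcaption>#{caption}</figcaption>
  </figure>
  """
end

figure_html.("mural.jpg", response.processed_content)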

We’ve come this far, let’s see if it works with Anthropic’s Claude LLM!

Anthropic’s Claude

We’ll set up the Anthropic model like this:

alias LangChain.ChatModels.ChatAnthropic

anthropic_chat_model = ChatAnthropic.new!(%{model: "claude-3-opus-20240229"})

Make sure the model you select supports the “vision” ability. The one used here works just fine.

Our code looks exactly the same except we’ve swapped out the LLM chat model we connect with.

alias LangChain.Chains.LLMChain
alias LangChain.MessageProcessors.JsonProcessor

# This data comes from an external data source per image.
# When we `apply_prompt_templates` below, the data is rendered into the template.
image_data_from_other_system = "image of urban art mural on underpass at 507 King St E"

{:ok, _updated_chain, response} =
  %{llm: anthropic_chat_model, verbose: true}
  |> LLMChain.new!()
  |> LLMChain.apply_prompt_templates(messages, %{extra_image_info: image_data_from_other_system})
  |> LLMChain.message_processors([JsonProcessor.new!()])
  |> LLMChain.run(mode: :until_success)

response.processed_content

What do we get?

%{
  "alt" => "Colorful street art mural with a face in vibrant colors on an underpass pillar in an urban setting.",

  "caption" => "A striking urban art mural adorns an underpass pillar at 507 King St E. The artwork features a mesmerizing face composed of vivid, rainbow-like hues and intricate patterns. Framed by the industrial yellow beams above, the mural transforms the concrete structure into a captivating focal point for passersby."
}

Nice! The Elixir LangChain library abstracted away the differences between the two services, and with no code changes we can make a similar request about the image to Anthropic’s Claude LLM as well!

Discussion

That’s a lot! Let’s recap the main points we learned from this.

  • Both OpenAI’s ChatGPT and Anthropic’s Claude models can interpret the meaning of images.
  • AI can create pretty good image descriptions that are context aware.
  • AI can incorporate image-specific details to make the description hyper-specific.
  • AI can return the text descriptions in multiple formats from the one request.
  • AI can be used to batch process thousands of images if needed.
  • AI libraries can abstract away the differences between services, making our code less dependent on any particular service.

There are limits to what AI can do. AI won’t always interpret what’s in an image as well as a human would. However, it works much faster, can be automated, and can cover far more images than is humanly possible. In many instances, that means we get good (if not perfect), context-aware alt text where we might otherwise have none, because we don’t have the people or the time to do the work manually.
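
To give a feel for the batch-processing side mentioned above, here’s a minimal sketch. It assumes a describe_image/2 function that wraps the chain we built earlier and a list of {image_path, extra_info} tuples pulled from the CMS; both names are hypothetical, and the concurrency cap is there to respect API rate limits:

# Hypothetical batch run: describe_image/2 wraps the LLMChain pipeline above and
# returns the parsed %{"alt" => ..., "caption" => ...} map for a single image.
results =
  images
  |> Task.async_stream(
    fn {image_path, extra_info} -> describe_image(image_path, extra_info) end,
    max_concurrency: 4,
    timeout: :timer.minutes(2)
  )
  |> Enum.map(fn {:ok, result} -> result end)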

Finally, a feature like this would likely end up being built into a Content Management System (CMS). In our fictitious scenario, that’s probably what we’d end up doing: building this feature into the company’s software. But now you’ve seen how to do it yourself! Sweet!

Final Thought

AI made our visual content more accessible to people using assistive technology, and our image content will perform better in SEO.

Fly.io ❤️ Elixir

Fly.io is a great place to run your Phoenix apps. It’s easy to get started. You can be running in minutes.

Deploy a Phoenix app today!

Resources

The best practice recommendations we’re using come from these resources: