Parsing Recipes with Robot Help

Image by Annie Ruygt

Fly can run your Laravel apps, workers, and more around the world. Want to spin up a Laravel application? You’ll be up and running in minutes!

This is NOT a story about how JavaScript stole 2 hours of my life, and PHP rescued me - in 6 minutes. That did happen, but I digress.

The real story is: Recipe sites are bloated. Rumor has it that this is because you can’t copyright a recipe, but can copyright the content surrounding a recipe.

In any case, I wanted to see how hard it was to parse recipes and just get the important parts. I had dreams of tokenizing HTML strings and creating a ton of conditionals (I think we used to call that AI) to remove the flowery cruft.

It’s actually a lot easier than that - thanks to SEO.

SEO to the Rescue

Due to Google’s insistence, almost all recipe sites embed standardized Recipe information somewhere in the HTML — most often in the form of a <script type="application/ld+json"> tag that contains some JSON. This JSON is fashioned after whatever schema.org dictates for Recipes.
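For example, a recipe page’s markup will typically contain something like this (heavily trimmed, with made-up values, but following the schema.org Recipe shape):

<script type="application/ld+json">
{
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Weeknight Pasta",
    "author": {"@type": "Person", "name": "Some Author"},
    "recipeYield": "4 servings",
    "totalTime": "PT30M",
    "recipeIngredient": ["1 lb pasta", "2 tbsp olive oil"],
    "recipeInstructions": [
        {"@type": "HowToStep", "text": "Boil the pasta."},
        {"@type": "HowToStep", "text": "Toss with the olive oil."}
    ]
}
</script>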

Still, it’s a bit tricky to parse that JSON, and it took a bit of experimentation to get right.

The basic process is to fetch the HTML from a recipe page, and do some DOM parsing to find the bits that matter for Recipe parsing. Then run that through a parser to pull out the parts you want.

This is all a breeze in PHP, which I mention only because the HTTP client used along with the most common ld+json parser in Node couldn’t return HTML from a site that returned gzipped content. In 2023. Listen, I’m not angry. It just sounds a lot like it.

Parsing Recipe Data

A composer package made this pretty easy: https://github.com/brick/structured-data. This package can read/parse HTML and find any schema data within it. Our job is then to find the Recipe-specific data and grab what we want.
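Installing it is a one-liner:

composer require brick/structured-data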

Here’s the full code base - it has more details than what I’ll mention here.

use App\RecipeParser;
use Brick\StructuredData\Reader\MicrodataReader;
use Brick\StructuredData\Reader\RdfaLiteReader;
use Brick\StructuredData\Reader\JsonLdReader;
use Brick\StructuredData\HTMLReader;
use Illuminate\Support\Facades\Http;

$parsers = [
    new JsonLdReader(),
    new MicrodataReader(),
    new RdfaLiteReader(),
];

$url = request('recipe');

$response = Http::throw()
            ->get($url);

// UTF8 isn't handled correctly otherwise
$html = mb_convert_encoding($response->body(), 'HTML-ENTITIES', "UTF-8");

$recipe = null;
foreach($parsers as $parser) {
    $reader = new HTMLReader($parser);
    $items = $reader->read($html, $url);

    if($recipe = RecipeParser::fromItems($items, $url)) {
        break;
    }
}

// If we didn't get a recipe after
// our 3 different reader attempts...
if (! $recipe) {
    // failure
}

// Do something with the valid recipe
return $recipe;

Then we need our RecipeParser class, which takes the $items and finds Recipe data within, if available.

The JSON (and thus our $items) is in a format I found to be a bit convoluted, but it’s manageable. The hardest part was running the code through a ton of recipe sites and accounting for variations in how they output Recipe schemas.

Our RecipeParser class is a bit mundane; it mostly involves these 2 methods - fromItems() and parse().

The static method fromItems() is the “entrypoint” into the class - it takes the $items (retrieved from the metadata parser) and the $url. It tries to find a Recipe schema within the given data.

If we find a recipe, the parse() method will grab each “section” of a recipe, and see if there’s a corresponding method to parse it out.

// A static function to serve as the entrypoint to the parser
public static function fromItems($items, $url)
{
    foreach($items as $item) {
        // Normalize schema "type", find any that contain "recipe"
        // We often get a Recipe mixed in with other data types (e.g. "article")
        $types = Str::lower(implode(',', $item->getTypes()));
        if(Str::contains($types, 'recipe')) {
            return (new static(url: $url))->parse($item);
        }
    }

    // Sometimes you don't get a "type" defined, but the
    // entire JSON document might be a recipe
    if (count($items) == 1) {
        return (new static(url: $url))->parse(reset($items));
    }

    // No recipe data type found
    return false;
}

// This does the "hard" work of parsing out a recipe
public function parse(Item $item): Recipe
{
    // For each "recipe" property, we check if the class has a related method
    // e.g. `public function parse_ingredientlist() {}`
    foreach($item->getProperties() as $name => $values) {
        $fn = "parse_".Str::replace(['http://schema.org/', 'https://schema.org/'], '', Str::lower($name));
        if(method_exists($this, $fn)) {
            $this->$fn($values);
        }
    }

    return new Recipe(
        $this->title, $this->url, $this->author, 
        $this->ingredients, $this->steps, $this->yield, 
        $this->totalTime, $this->images
    );
}

There’s a bunch of data munging to account for differences in how the various sites present their Recipe metadata.

For example, one method is to get the name of the recipe:

protected function parse_name($values)
{
    // The schema.org "name" property becomes our recipe title
    $this->title = (is_array($values) ? $values[0] : $values);
}

The variable $values is usually an array, but if it’s not, we just assume it’s a string. Other properties might have multiple values, or be instances of Brick\StructuredData\Item, in which case we need to do extra work (essentially, we recursively parse the Item instances).
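As a rough sketch (this helper is hypothetical - it’s not part of the class above), flattening a property’s values into plain strings might look something like this:

// Hypothetical helper: normalize a property's values into an array of strings,
// recursing into nested Items (e.g. HowToStep objects inside recipeInstructions)
protected function flattenValues($values): array
{
    $values = is_array($values) ? $values : [$values];

    $flattened = [];
    foreach ($values as $value) {
        if ($value instanceof \Brick\StructuredData\Item) {
            // A nested Item: flatten each of its own property values
            foreach ($value->getProperties() as $nestedValues) {
                $flattened = array_merge($flattened, $this->flattenValues($nestedValues));
            }
        } else {
            $flattened[] = (string) $value;
        }
    }

    return $flattened;
}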

Fly.io ❤️ Laravel

Need a place in the clouds to fly your app? Deploy your servers close to your users with Fly.io. It’ll be up in minutes!

Deploy your Laravel app!

Sprinkle in AI

If we can’t find relevant metadata in a recipe site, maybe we can use some AI to help us! Frankly, we could probably use AI for every request, but given current pricing and how fast (slow) the OpenAI API is, using this as a fallback makes sense to me.

Thanks to SEO and recipes being a whole industry, most recipes you’ll find on the first few pages of Google results are optimized, and so you don’t need AI to parse them.

However, to add AI, all you need is the OpenAI PHP package. It abstracts OpenAI’s various APIs really nicely.
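It’s another quick composer install (the package name here is openai-php/client):

composer require openai-php/client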

In our case, we use the Chat API, as it allows us to use the gpt-3.5-turbo model. I don’t yet have GPT-4 access!

What I did was add in a 4th “Reader” class that talks to OpenAI:

<?php

namespace App;

use OpenAI;
use DOMDocument;
use App\Recipe;
use Brick\StructuredData\Reader\JsonLdReader;
use Illuminate\Support\Str;

class AIRecipeReader
{
    public static function read($url)
    {
        $prompt = "Extract only the following information from the recipe found here: $url

            - dishName
            - publishDate (in YYYY-MM-DD format)
            - total cook time (in human-readable format)
            - author
            - ingredients 
            - steps (array of strings)
            - servings

            Please generate the output as valid JSON, preferably in ld+json format based on schema.org specification.";

        $client = OpenAI::client(config('ai.open_ai_key'));

        $result = $client->chat()->create([
            'model' => 'gpt-3.5-turbo',
            'messages' => [
                ['role' => 'user', 'content' => $prompt]
            ],
        ]);

        $dom = new DOMDocument;
        $html = mb_convert_encoding('<script type="application/ld+json">'.$result->choices[0]->message->content.'</script>', 'HTML-ENTITIES', "UTF-8");
        $dom->loadHTML($html);
        $items = (new JsonLdReader)->read($dom, $url);
        return RecipeParser::fromItems($items, $url);
    }
}

Implementing this is just a matter of adding in a conditional if a Recipe object isn’t found.
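In the controller code from earlier, that might look something like this (the abort() call is just a stand-in for whatever failure handling you prefer):

// Fall back to the (slower) OpenAI-based reader if the
// structured-data readers came up empty
if (! $recipe) {
    $recipe = \App\AIRecipeReader::read($url);
}

// Still nothing? Bail out.
if (! $recipe) {
    abort(422, 'No recipe found at that URL');
}

return $recipe;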

The fun part here is the prompt. We asked OpenAI to return valid ld+json, and we (usually) get that back! We just feed that into the JsonLdReader class.

That class needs an instance of DOMDocument, so we hack that into place before parsing the result. We also do our usual UTF-8 encoding to ensure we get the correct characters when needed.

Here’s a thing to note though: OpenAI’s API responses are pretty slow! You’ll know you hit OpenAI when the request to get a recipe back feels like it’s taking forever (~10-20 seconds).

You can check out the results at https://recipeplz.fly.dev!