Hands on! Parse your emails with Google's Gemma
Brave the impossible and use only your CPU to transform your emails into machine-readable semantic JSON files which can then be interpreted by personal assistants, finance tools, etc.
As I explained in my previous post, LLMs are not good at everything but they’re particularly good at parsing information and transforming it into another format. It’s a technique we use everywhere in ChatFAQ, for example.
Today we’re going to dive into the code that lets us do this. The goal is simple:
First we’ll classify the email to know what kind of email it is. Yeah, I did say that it’s not a great idea because LLMs are not super reliable at classification. I’ve tried my best: it seems to work well with GPT-4 and more or less decently with Gemma.
And then for each type of email we’re going to extract a JSON which tells us in a machine-readable format the content of that email.
I’ll walk through the main elements of the code; if you want to follow along with the completed project, it’s all on GitHub.
Also, of course there are many libraries and frameworks and whatnots to help you do this in different ways, but we’re here to learn, so today we’ll do it all by hand.
Buckle up and let’s go!
The Flask app
In order to do this, we’re going to make a Flask app which exposes:
A basic page that lets you upload an email in the .eml format (what you get when you “Download this email” from Gmail, for example).
An API which, for a given email, gives you its semantics.
We’ll do that in a very small src/semmail/app.py file.
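Before the views, the top of the file needs the usual Flask setup. Something like this (a minimal sketch; the actual project may organize it a bit differently):
from flask import Flask, jsonify, render_template_string, request

app = Flask(__name__)

# interpret_email() is implemented later in the article; import it (or define
# it) in this module so the upload view can call it.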
First, a super basic view which is just a form that will call the API when submitted.
@app.route("/")
def home():
"""Just a dumb form where you can upload a file to the API"""
return render_template_string(
"""
<!DOCTYPE html>
<html>
<head>
<title>Upload File</title>
</head>
<body>
<h2>Upload File</h2>
<form action="/upload" method="post" enctype="multipart/form-data">
<input type="file" name="email_file">
<input type="submit" value="Upload">
</form>
</body>
</html>
"""
)
Then the API itself. We’ll assume that there is a function which takes an email as input and returns the parsed output.
@app.route("/upload", methods=["POST"])
def upload_file():
"""After a very basic validation of the file, we put it through the LLM
so that we can know what it's about."""
if "email_file" not in request.files:
return jsonify({"error": "No file part"}), 400
file = request.files["email_file"]
if file.filename == "":
return jsonify({"error": "No selected file"}), 400
try:
file_content = file.read()
result = interpret_email(file_content)
return jsonify(result), 200
except ValueError as e:
return jsonify({"error": str(e)}), 400
In the middle of the boilerplate, you’ll spot the interpret_email() function. For now it can just return a static structure; we’re going to implement it in a moment.
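If you want to run the app before the AI part exists, a dumb placeholder like this one does the trick (just a stub, obviously):
def interpret_email(email: bytes) -> dict:
    """Temporary stub so the endpoint works end to end; it gets replaced by
    the real LLM-backed implementation later in the article."""
    return {"email_type": {"chosen": "unknown", "proba": {}}}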
Calling the AI
Because I want the project to have both an OpenAI and a Gemma implementation, there are two modules in the GitHub repo, but I’ll only cover Gemma because it’s the hot new kid that everyone wants to play with.
Make the instance
The first thing you need to do is to go to the model’s HuggingFace page and use your account to accept the license. Then go to your settings and fetch an API key that you’ll need to put in the project’s .env file:
HUGGINGFACE_TOKEN=xxx
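The code below reads this token from the environment, so the .env file has to be loaded at startup one way or another. A common way to do it (my assumption here, not necessarily how the repo does it) is python-dotenv:
from dotenv import load_dotenv

# Reads the .env file and exports its variables, HUGGINGFACE_TOKEN included,
# into the process environment.
load_dotenv()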
This will allow us to create the instance of the tokenizer and the model in our src/semmail/ai/gemma.py file. Put at the root of the module:
from os import environ
from typing import Dict, Sequence

import lazy_object_proxy
import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"

if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

tokenizer = lazy_object_proxy.Proxy(lambda: AutoTokenizer.from_pretrained(model_id))
model = lazy_object_proxy.Proxy(
    lambda: AutoModelForCausalLM.from_pretrained(model_id).to(device)
)

login_done = False


def ensure_login():
    """Makes sure that we're logged into HuggingFace Hub so that we can
    download the LLM (which requires approving a license)."""
    global login_done

    if not login_done:
        login(token=environ["HUGGINGFACE_TOKEN"])
        login_done = True
A bunch of things to unpack:
Everything is wrapped in a lazy_object_proxy, which avoids blowing up the CPU and RAM the moment the module is imported. The loading only happens the first time the object is actually used. You’ll thank me later.
We create an ensure_login() function which lets subsequent functions make sure that we’re logged into huggingface_hub, but does it only once so we don’t repeat the login every time we call the AI.
CUDA is detected conditionally and enabled only when it’s available. You guys tell me if it works; I’m an idiot who didn’t check his GPU’s compatibility before buying it.
Communicate with Gemma
You’ve probably noted the name of the model, google/gemma-2b-it.
The “2b” indicates the size of the model (two billion parameters). I’m using this one and not the bigger one because it uses far fewer resources and can realistically run on a CPU, while the other one cannot.
The “it” suffix tells you that it has been instruction-tuned for chat-like interactions.
So how do you make use of this chat training? It means that your prompt has to follow this structure:
<start_of_turn>user
How does the brain work?<end_of_turn>
<start_of_turn>model
This indicates to the model the alternation between the human and model speakers. Sadly, there is no system prompt to guide the LLM on top of this, but we’ll work around that.
The idea is to use it the following way:
def ask_gemma(instruction: str, this_input: str, max_tokens: int = 1000) -> str:
    ensure_login()

    chat = [
        {
            "role": "user",
            "content": f"# Instructions\n{instruction}\n# Input\n{this_input}",
        },
    ]

    prompt = tokenizer.apply_chat_template(
        chat,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer.encode(
        prompt,
        add_special_tokens=True,
        return_tensors="pt",
    )
    outputs = model.generate(
        input_ids=inputs.to(model.device),
        max_new_tokens=max_tokens,
    )

    # The decoded output contains the whole conversation (prompt included);
    # `parser` is the Lark-based parser described below, which splits it into
    # turns so we can return only the model's reply.
    convo_raw = tokenizer.decode(outputs[0])
    convo: Sequence[Dict] = parser.parse(convo_raw)  # noqa

    return convo[-1]["content"]
What you see here is that we’re using the tools from the transformers lib to generate the chat template prompt and give it to the LLM. Then when it runs, you receive a response that contains both the question and the answer from the bot, and… It becomes a bit confusing.
To be honest I’m not entirely sure how I’m supposed to parse this or if there are utilities in the transformers lib (I didn’t find them), so I’ve written my own parser (which you see used in the code above).
If you’ve outgrown your fear of regular expressions, you may outgrow your fear of parsers as well. Mine is relatively easy to write using the Lark package. It’s outside the scope of this article, so just check the grammar if you’re interested. What matters is that we’ve got a parser!
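To give you an idea of what the parser has to produce, here is a simplified regex-based equivalent (my own illustration, not the project’s Lark grammar): it splits the decoded text into a list of role/content dicts, which is exactly the shape that ask_gemma() indexes with convo[-1]["content"].
import re
from typing import Dict, List

# One turn looks like: <start_of_turn>user\n...<end_of_turn>
TURN_RE = re.compile(
    r"<start_of_turn>(user|model)\n(.*?)(?:<end_of_turn>|<eos>|$)",
    re.DOTALL,
)


def parse_conversation(decoded: str) -> List[Dict[str, str]]:
    """Split Gemma's decoded output into a list of role/content dicts."""
    return [
        {"role": role, "content": content.strip()}
        for role, content in TURN_RE.findall(decoded)
    ]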
Another thing you’ll have noticed is that there are two important parameters:
instruction — Corresponds to the system prompt, tells the bot what to do
this_input — The user input
Since this fine-tuning has no notion of a system prompt, I’m just building a single prompt from those two and hoping that the LLM picks it up (it does).
At this point we have a function that runs the LLM locally on your CPU/GPU. Pretty neat!
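A quick smoke test, just to show the two parameters in action (any instruction/input pair will do, and the first call downloads the model so it takes a while):
print(ask_gemma(
    "Summarize the input in one short sentence.",
    "The meeting was moved from Tuesday to Thursday because half the team is "
    "at a conference, and lunch will be provided.",
))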
Getting the email’s text
Emails might be the oldest and most inconsistent standard in the Internet world. Their encoding is super confusing and while Python has a built-in library that implements all the heavy lifting, it really comes as a kit that you’ve got to assemble yourself (without the instructions).
The strategy is as follows:
Go through the different “parts” of the email, looking for either a plain-text or an HTML part, with a preference for the HTML one, because sometimes you will find a plain-text part that turns out to be bullshit, so sadly only the HTML is reliable.
Because the HTML is pretty fat, if that’s what we’re going for we’ll make sure to convert it into Markdown. This greatly reduces the number of tokens and makes the message much easier for the LLM to understand.
This is all the job of the parse_email() function, which I’m not going to detail because it would be off-topic (you’ll find a rough sketch below if you’re curious). What you need to know is that it outputs the email in a simplified text format which looks like:
From: foo@bar.com
To: someone@example.com
Date: 2024-02-23 20:12:00 +0100
Subject: Some email
Blah blah blah this is the content of the email
It’s something we can easily give to our LLM.
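For the curious, here is a rough sketch of how such a parse_email() could be built with the standard email library plus an HTML-to-Markdown converter like html2text (an assumption on my side; the project’s real implementation differs in the details):
from email import message_from_bytes, policy

import html2text  # assumed converter; any HTML-to-Markdown library would do


def parse_email(raw: bytes) -> str:
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the HTML part, fall back to plain text.
    html_part, text_part = None, None

    for part in msg.walk():
        if part.get_content_type() == "text/html" and html_part is None:
            html_part = part.get_content()
        elif part.get_content_type() == "text/plain" and text_part is None:
            text_part = part.get_content()

    body = html2text.html2text(html_part) if html_part else (text_part or "")

    return (
        f"From: {msg['From']}\n"
        f"To: {msg['To']}\n"
        f"Date: {msg['Date']}\n"
        f"Subject: {msg['Subject']}\n\n"
        f"{body}"
    )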
Plain text to JSON
Now the useful part. The core of this project is to convert plain text into JSON, isn’t it? Let’s do that!
from typing import Any, Optional

import jsonschema
import yaml


def parse_to_json(
    prompt: str, text: str, schema: Any, attempts: int = 3
) -> Optional[Any]:
    for _ in range(attempts):
        # ask() is the LLM call (ask_gemma() in the Gemma implementation).
        parsed_raw = ask(prompt, text)

        # MD_START and MD_END are regexes that strip the Markdown code fences
        # the model sometimes wraps its YAML in (sketched below).
        parsed_raw = MD_START.sub("", parsed_raw)
        parsed_raw = MD_END.sub("", parsed_raw)

        try:
            parsed = yaml.safe_load(parsed_raw)
            jsonschema.validate(parsed, schema)
        except (yaml.YAMLError, jsonschema.ValidationError):
            pass
        else:
            return parsed
What do we see here:
First we clean up the model’s output from any Markdown enclosing. Chat-tuned models sometimes like to wrap their YAML in ```yaml fences, so we make sure to remove them if that happens (that’s what the MD_START and MD_END regexes are for).
Then we try to load the YAML data. Why YAML and not JSON? Easy: JSON is a subset of YAML, so if the model decides to output JSON it will still work, but on the other hand YAML is more permissive and uses fewer tokens than JSON. So it’s both safer and more economical to use.
If all went well, we validate the parsed structure against the provided JSON schema. This ensures that the output corresponds to the constraints that we need to work with.
And if parsing or validation fails, we try again. Most LLMs won’t strictly give the same output for a given input, so it doesn’t hurt to take another shot and see if the next attempt comes out right.
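For completeness, MD_START and MD_END could look something like this (a plausible sketch, not necessarily the exact regexes from the repo):
import re

# The opening fence, e.g. ```yaml or ```json, at the start of the output...
MD_START = re.compile(r"^\s*```[a-z]*\s*", re.IGNORECASE)
# ...and the closing ``` at the end.
MD_END = re.compile(r"\s*```\s*$")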
Prompting
In order to parse the different elements, we’ll use a small library of prompts. For each prompt, we associate a JSON schema which helps validate the output.
I’m not going to go through every single prompt because, if you’re human, you can probably understand them by yourself. I’ll just explain the one I use to classify emails, because it’s the hard one.
My goal here is to determine the probability of each email type by explicitly asking the LLM to give that probability according to different factors that I give it. The idea is that you can then easily pick the email type by checking which category has the highest probability. We rely on the LLM’s feelings, but we use hard Python logic to make the decision.
The prompt goes as follows:
Take a deep breath.
You will analyze an email. For this email you need to determine the
likeliness that this email belongs to a specific category. This works
with a score system. For each category you MUST give a score of 0 if
you are sure that it's not from that category, a score of 1 if you are
sure that it is, or a number between 0 and 1 that reflects how much you
want to give that category to the email.
The categories are:
- Commercial is a prospecting email.
- Bill is an invoice or a bill for a sold service or product.
- Conversation is a regular conversation between humans.
Here are elements to look for in an email. For each element, if I tell
you +X then consider that it's adding points to that aspect and -X is
removing points.
Has few sentences +bill -commercial
Has a total price +bill
Has a list of items sold +bill
Has "bill", "invoice", "order confirmation" or any synonym in the Subject +bill -commercial -conversation
Presents several product benefits +commercial -bill
Is structured in a Hello/Message/Signature way +conversation -bill
Different signatures and quoted mails +conversation -bill -commercial
Now return the following YAML:
bill: x
commercial: x
conversation: x
You need to replace "x" by the score. If you want to give a score of
1 to two or more categories, you need to think harder to make the
difference.
Make sure that the output is pure YAML, not wrapped in Markdown, no sentences.
You can see the structure:
The LLM first receives its general purpose
Then I explain the categories
Then I give the different factors for the different categories so that the LLM knows what to look for (it really doesn’t work if you don’t do this)
Then I give the YAML schema at the end. If the schema appears too early in the prompt, the LLM tends to forget about it
And then some banalities about the output to avoid getting stuff like “Of course, here is your YAML”, which would screw up the parsing
This is matched up by a JSON schema:
{
"type": "object",
"properties": {
"commercial": {"type": "number", "minimum": 0, "maximum": 1},
"bill": {"type": "number", "minimum": 0, "maximum": 1},
"conversation": {"type": "number", "minimum": 0, "maximum": 1},
},
"required": ["commercial", "bill", "conversation"],
}
With this strategy, I’ve written 4 prompts:
The one you’ve just seen to determine the email’s type
If it’s a commercial email, extract the name of the product and the USP
If it’s a bill, extract the total price and the purchased items
If it’s a conversation, make a summary of the conversation
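The only thing that matters about how those prompts are stored is that each entry bundles a prompt with its JSON schema: the interpret_email() function below only relies on a .prompt and a .schema attribute. One possible shape (an assumption, with hypothetical DETERMINE_TYPE_PROMPT and DETERMINE_TYPE_SCHEMA constants; the repo may organize this differently):
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class PromptSpec:
    prompt: str
    schema: Any


# DETERMINE_TYPE pairs the classification prompt with the schema shown above;
# COMMERCIAL_INFO, BILL_INFO and CONVERSATION_INFO follow the same pattern.
DETERMINE_TYPE = PromptSpec(prompt=DETERMINE_TYPE_PROMPT, schema=DETERMINE_TYPE_SCHEMA)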
Deciding
And all this culminates into the interpret_email() function.
def interpret_email(email: bytes) -> Any:
    """Uses a first round of LLM in order to determine the type of message,
    then proceeds to using a specific prompt for that type in order to parse
    the email into a JSON output."""
    parsed_email = parse_email(email)

    email_type_proba = parse_to_json(
        DETERMINE_TYPE.prompt,
        parsed_email,
        DETERMINE_TYPE.schema,
    )
    email_type = max(email_type_proba.items(), key=lambda p: p[1])[0]

    extra = {}

    if email_type == "commercial":
        extra["commercial"] = parse_to_json(
            COMMERCIAL_INFO.prompt,
            parsed_email,
            COMMERCIAL_INFO.schema,
        )
    elif email_type == "bill":
        extra["bill"] = parse_to_json(
            BILL_INFO.prompt,
            parsed_email,
            BILL_INFO.schema,
        )
    elif email_type == "conversation":
        extra["conversation"] = parse_to_json(
            CONVERSATION_INFO.prompt,
            parsed_email,
            CONVERSATION_INFO.schema,
        )

    return dict(
        email_type=dict(
            chosen=email_type,
            proba=email_type_proba,
        ),
        **extra,
    )
Which is very simple:
First we determine the email type, which we get as a JSON object
And then we use one of the three parsers to get the extra information relative to this type
And finally we output a JSON with the extracted information and the decision-making values that we’ve used
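To try the whole pipeline, POST an .eml file to the endpoint, for instance with the requests library (assuming the Flask development server runs on its default port 5000; the file name is just an example):
import requests

with open("amazon-order.eml", "rb") as f:
    resp = requests.post(
        "http://localhost:5000/upload",
        files={"email_file": f},
    )

print(resp.json())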
So if I take my latest Amazon purchase, I’m getting the following output:
{
    "bill": {
        "bought": [
            {
                "label": "L'investisseur eclaire: Cultiver son...",
                "price": [
                    37.46,
                    "EUR"
                ]
            }
        ],
        "total": [
            43.72,
            "EUR"
        ]
    },
    "email_type": {
        "chosen": "bill",
        "proba": {
            "bill": 1,
            "commercial": 0.2,
            "conversation": 0
        }
    }
}
Hooray! It worked!
Conclusion
I’ve showcased two things in this article:
Using very little boilerplate code and pretty conventional tools, I can easily leverage LLMs to parse generic content into usable JSON. It’s something that was completely unthinkable a few months ago!
And I can do so using a local LLM that even runs on “commodity” hardware (be sure to have 10 GiB of RAM before starting the project, or you’ll see how fast your computer can freeze)
This is an exciting time because long-standing problems are finally getting solved. In the near future you can expect to see every single tool out there getting a lot smarter when it comes to understanding human text.