Lever’s Planer: Extract Reply Text from Emails in JavaScript

Anyone involved in hiring understands how challenging it can be to share information effectively. Sometimes this leads to panic, when an interview scheduled for today comes as a surprise to the employee meant to conduct it. That panic and lack of preparation prevent the interviewer from providing a great experience for the candidate.

At Lever we believe that collaborative hiring leads to a better candidate experience and better hiring decisions, so we involve the whole team in our hiring process and have built our product to make sharing information easy. For example, everyone involved in recruiting can see all email communication to and from a candidate in a single place. To make this work, we support two-way sync of emails for our customers who use Google Apps or Microsoft Exchange as their email service.

But the world of email is a messy place. Replies generally have the original message quoted in the body, and, without a clear standard, each email client implements replies differently. It was easy enough to remove quoted content in Gmail: Google adds a handy `gmail_quote` class to the div that contains the quoted original message. But as Lever expanded to support larger companies and Outlook users, we discovered more and more cases where our interface failed to remove the full quoted text of original emails.

[Image: an ugly email with the quoted message displayed]

Displaying a long thread of emails where the only useful text is a single sentence at the top does not make for a good experience, so we sought to improve the way we extract reply text from emails. We use Mailgun at Lever for various purposes, and we had heard about talon, their open source reply extraction library. But alas! It is written in Python, and our system is written in JavaScript and runs in Node.js. Undaunted, I rolled up my sleeves, brushed up on my Parseltongue, and set out to port talon to JavaScript.

The result is Lever’s Planer (at Lever, we have an affinity for simple machines and tools), which we kept open source to align with the motivations of its predecessor, talon.

Using planer is easy. First, install using npm:

$ npm install planer

In your code, use require like any other module:

var planer = require('planer');

If you want to extract reply text from a plain text email body, use the aptly named method:

var replyText = planer.extractFromPlain(emailBodyText);

If you want to extract reply text from an HTML email, you will need to provide a DOMImplementation (for the unit tests in planer, we use jsdom). If your code is running in a browser, you can use the `document` object.

var replyHtml = planer.extractFromHtml(emailBodyHtml, window.document);

We’ve been using Planer in production for almost 9 months. The email you saw earlier now looks like this in our interface:

While there are some emails it does not properly extract, it is generally more robust and full-featured than our previous in-browser implementation of the same process.

The previous implementation worked well with HTML emails from Gmail and removed lines that start with ‘>’ from the plain text versions of emails, and that was about it. With planer, we can now recognize the quoted elements in HTML emails generated by versions of desktop Outlook and Outlook365 as well as Gmail. We can also reliably remove quoted text from both HTML and plain text emails by analyzing the content to find the start of the quote in multiple languages (English, French, and German for now).
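To make the two plain text ideas above concrete, here is a minimal sketch of dropping lines that start with ‘>’ and cutting the message at a line that looks like the start of a quote. It is not planer's actual code: the names `QUOTE_HEADER_PATTERNS`, `isQuoteHeader`, and `extractReplySketch` are hypothetical, and the three patterns are simplified assumptions about what a quote header looks like in English, French, and German.

```javascript
// Illustrative patterns only; these are assumptions, not planer's real rules.
var QUOTE_HEADER_PATTERNS = [
  /^On\s.+\swrote:\s*$/i,       // English: "On Mon, Jan 5, Jane wrote:"
  /^Le\s.+\sa écrit\s*:\s*$/i,  // French:  "Le 5 janv., Jane a écrit :"
  /^Am\s.+\sschrieb.*:\s*$/i    // German:  "Am 05.01. schrieb Jane:"
];

// Returns true when a single line looks like the header that email clients
// insert just above the quoted original message.
function isQuoteHeader(line) {
  var trimmed = line.trim();
  return QUOTE_HEADER_PATTERNS.some(function (pattern) {
    return pattern.test(trimmed);
  });
}

// Hypothetical helper: keeps the text above the first quote header and
// skips any line that begins with '>'.
function extractReplySketch(body) {
  var lines = body.split(/\r?\n/);
  var kept = [];
  for (var i = 0; i < lines.length; i++) {
    if (isQuoteHeader(lines[i])) {
      break; // treat everything from the header onward as quoted text
    }
    if (/^\s*>/.test(lines[i])) {
      continue; // the older approach: drop '>'-prefixed lines
    }
    kept.push(lines[i]);
  }
  return kept.join('\n').trim();
}
```

A sketch like this will misfire on ordinary sentences that happen to match a pattern (for example, a reply line such as "On Tuesday I wrote:"), which is part of why a dedicated library with a real test suite is worth using instead.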

Though we’ve made significant progress, software is never finished. There is a lot of work to do before Planer can extract reply text from every possible email in all languages. Like all good open source projects, Planer welcomes contributions! If you find a reply email that it does not seem to extract properly, please create an issue. Better yet, create a pull request with an improvement and a new test case.