AI wants to make your writing more polite. And what better data set to learn from than the Enron emails? [WITH VIDEO]


So maybe, over time, might it turn “let’s book this” into “let’s put this into an offshore vehicle off the books”?

 

 

By:

Eric De Grasse 
Chief Technology Officer

 

13 July 2020 (Paris, France) – Almost everyone in the e-discovery industry is familiar with Enron, the energy services company that underwent a massive investigation in the early 2000s for wide-scale fraud, which would eventually call into question the accounting practices of many corporations.

The Enron Corpus (more commonly referred to as the Enron Data Set) is legendary in the e-discovery industry. The Corpus was originally created in 2003, when the Federal Energy Regulatory Commission (FERC), which had conducted the initial investigation, released a large data set of Enron documents into the public domain, available for purchase. It was a database of over 600,000 emails generated by 158 employees.

Andrew McCallum, a computer scientist at the University of Massachusetts Amherst, was the first to purchase the set, but his focus was on using it for studies of social networking and computer-mediated communication.

It was not until MIT researcher Leslie Kaelbling purchased the raw files from a government contractor (the format in which FERC released the emails was unusable), deduplicated them, and mapped their structure that the emails became available to the e-discovery community. After her processing, about 200,000 emails remained.

Because of its size, its public status, and its variety of data types, the data set was a rare and valuable tool for experimenting with network analysis, text mining and text classification methods. It became the subject of research by the computer science departments of several universities, including the Massachusetts Institute of Technology and Stanford University. In the summer of 2009, the team at TREC Legal Track, an organization co-sponsored by the U.S. Department of Defense, started conducting research on the Enron Corpus with the purpose of improving large-scale search techniques.

It’s the gift that keeps on giving. While the e-discovery industry has moved on to newer, more modern sets of emails, the Corpus is still used to test next-generation enterprise-scale software solutions.

And now, a research team at Carnegie Mellon University is using it to devise a technique that’s designed to automatically make written communication more polite. Rather than merely scanning text for politeness, as past computational linguistics methods have, this one actually changes directives or requests that use either impolite or neutral language by restructuring them or adding words to make them more well-mannered. “Say that more politely,” for instance, might become “Could you please say that more politely?”

But it’s not just about using words or phrases such as “please” and “thank you”.  Sometimes, it means making language a bit less direct, so that instead of saying “you should do X,” the sentence becomes something like “let us do X.”

All of this research was presented last week at the Association for Computational Linguistics’ annual meeting (held virtually), and many attendees saw immediate potential uses for the work, including emails and chatbots. You can read the full paper by clicking here.

At the heart of their experiment is a dataset of 1.39 million sentences analyzed for politeness and labeled with a politeness score. The team then developed a “tag and generate” approach, which identifies sentences that are outright impolite, or could just use a manners boost, and tweaks them with words and phrases Emily Post would be more approving of.

“Yes, go ahead and remove it” becomes “Yes, we can go ahead and remove it.” Adding “we,” the researchers explain, creates the sense that the burden of the request is shared by speaker and addressee.

“Not yet — I’ll try this weekend” becomes “Sorry, not yet — I’ll try to make sure this weekend,” with the apology politely conveying that the requested action might be something of a burden.
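
For readers who think in code, the two stages of a "tag and generate" pipeline can be sketched roughly as follows. This is only a toy illustration under loud assumptions: the researchers' system learns both stages from labeled sentences, whereas the verb list, the `[TAG]` marker and the function names below are hypothetical and invented for this sketch. It handles just one pattern, a bare command turned into a request.

```python
TAG = "[TAG]"

# Hypothetical rule invented for this sketch: a sentence that opens with one
# of these verbs is treated as a bare command. The real system learns where
# to place tags from data rather than using a hand-written list.
BARE_IMPERATIVES = ("say", "send", "remove", "call", "check")


def tag_sentence(sentence: str) -> str:
    """Stage 1 (tag): mark the spot where a politeness phrase could go."""
    words = sentence.split()
    if words and words[0].lower() in BARE_IMPERATIVES:
        return f"{TAG} {sentence}"
    return sentence  # nothing to soften, leave it untouched


def generate_sentence(tagged: str) -> str:
    """Stage 2 (generate): fill the tag with a polite phrase."""
    if not tagged.startswith(TAG):
        return tagged
    rest = tagged[len(TAG):].strip()
    if not rest:
        return ""
    # Lower-case the first letter and drop the final period or exclamation
    # mark so the command can be rephrased as a question.
    body = (rest[0].lower() + rest[1:]).rstrip(".!")
    return f"Could you please {body}?"


def make_polite(sentence: str) -> str:
    """Run both stages in order."""
    return generate_sentence(tag_sentence(sentence))


print(make_polite("Say that more politely."))
```

Run on the earlier example, the sketch turns "Say that more politely." into "Could you please say that more politely?", and it leaves sentences that do not start with a listed verb unchanged. Splitting the work into two steps means the decision about where a sentence needs softening is kept separate from the choice of words used to soften it.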

These might seem like super-subtle changes. But as anyone who’s ever puzzled over a text message knows, nuance can easily get lost in written communication, leading to misinterpretation.

While politeness plays a crucial role in social and professional interactions, standards of what it looks like vary from culture to culture, so for their work, the team focused on speakers of North American English in a formal setting.

And, as I indicated at the beginning of this piece, the team’s dataset comes from a surprising, though rather appropriate, source: emails exchanged by employees at Enron. Huh. So maybe, over time, might it turn, say, “let’s book this” into “let’s put this into an offshore vehicle off the books”?

The researchers have released their “politeness transfer” dataset on GitHub so others interested in the topic can build on the work.

In April, they produced this video, which explains the process.

 

 
