Google’s quest to end the language barrier and the possible effect on non-English e-discovery document reviews

September 16, 2013

Gregory P. Bufithis, Esq. Founder/CEO

16 September 2013 – We have written about this many times before: quantitative prediction, and how it continues to shake up professional services industries by automating or semi-automating tasks previously performed by experts. How it has already changed the legal services industry is now well-trodden ground.

Humans are not exactly known for their predictive skills if one believes Daniel Kahneman’s argument in Thinking, Fast and Slow or Nassim Nicholas Taleb’s assertions in his book Antifragile: Things That Gain from Disorder. The arrival of computer-assisted review, technology-assisted review and predictive coding in document review has shown that advanced computer analytics can produce more accurate results than reviews that rely only on keyword search and human review. Oh, some debate still lingers over when to use it and when not to, along with a few quibbles about the numbers, but it is being embraced. And the economic downturn in the legal industry, and the associated cost-control pressures from corporate clients, have further increased the speed at which quantitative prediction solutions have been adopted.

Predictive coding has decimated the contract attorney industry. Right now in D.C., for example, there is a document review project involving a Fortune 100 company that, two years ago, would have required 100+ reviewers. This time around it is being handled with 35. Pretty much the same size data universe as two years ago but this time … a predictive coding platform is being used. The technology is faster, better, cheaper. This is the great disruptor in the e-discovery market when it comes to the contract attorney sector. It is the technology driving the change. Staffing agencies, e-discovery vendors and even corporate in-house legal departments are using “data swat-teams” made up of contract attorneys who have both the tech skills and the analytical ability, with a greater emphasis on data search specialists who can conduct complex searches, analyze information and generate reports. But fewer bodies are required.

The only contract attorneys deemed “safe” are those who have fluency in one or more non-English languages. Non-English document reviews were up 38% last year according to our sister company, The Posse List, and comprised 72% of all document review jobs they posted. If you follow their job postings you know the hourly pay rates for non-English reviews are significantly higher, downright astronomical for the CJKs … Chinese, Japanese, Korean.

Ah, but times might just be a-changin’.

While we have come a long way toward making automated language translation easier, faster and more reliable, a world of seamless and immediate translation is still out of our grasp. But it’s getting better and better. Google, Facebook, IBM and Microsoft are all devoting gobs of money to perfecting instant, seamless translation, with Google, IBM and Microsoft creating special legal translation units.

IBM started it all decades ago when it set up the first machine translation architectures based on mathematical models called translation models. Generally speaking, a translation model accounts for all of the elementary operations that govern translation between languages, including the reordering needed to move from the word order of the source language to that of the target language. Translation models are usually enriched with statistical parameters, which help drive the search through the space of all valid transformations of the source sentence into the target sentence. IBM developed specialized algorithms to estimate these parameters automatically.
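To make the idea of automatically estimated translation parameters concrete, here is a minimal sketch of expectation-maximization training for a word-translation table, in the style of the textbook IBM Model 1. It is an illustration under our own assumptions and not IBM's production system: the three-sentence German-English toy corpus, the function names `train_ibm_model1` and `best_translation`, and the decision to leave out a NULL source word are all ours.

```python
from collections import defaultdict

# Hypothetical toy corpus (our assumption): (source tokens, target tokens).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = P(target_word | source_word)."""
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start with no preference: every pairing is equally likely.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word

        # E-step: share each target word among the source words of its sentence,
        # in proportion to the current translation probabilities.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    share = t[(w, s)] / norm
                    count[(w, s)] += share
                    total[s] += share

        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})

    return dict(t)


def best_translation(t, source_word):
    """Return the target word with the highest estimated probability."""
    candidates = [w for (w, s) in t if s == source_word]
    return max(candidates, key=lambda w: t[(w, source_word)])


table = train_ibm_model1(corpus)
for word in ["das", "haus", "buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

On this toy corpus the table comes to favor pairing das with the and buch with book, simply because those words keep appearing together while their neighbors change. Nobody tells the program what any word means, which is the point of estimating the parameters statistically.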

Last week’s Der Spiegel had an interview with Franz Josef Och, the German computer scientist chasing Google’s translation dreams, about the challenges of translating between Google’s 71 supported languages. We link to it at the end of this post. The article highlights the many problems a computer faces in “learning” a language. Word order is one example: in English and in German adjectives precede the noun, whereas in French it’s usually the other way around.

And there are the usual issues: context, syntax, intonation and ambiguity. Because a computer system is not context aware, it could grab the wrong word. Additionally, it doesn’t understand the language at all. It just tries to decode words, instead of decoding the meaning. Many languages are not similar at all: they lack corresponding common words, or they use the corresponding words in different ways.

But the technology is getting much, much better. Last week in Paris we attended a Microsoft Research event for its Natural Language Processing group and we learned about the Machine Translation (MT) project, which is focused on creating MT systems and technologies that cater to the multitude of translation scenarios today, including legal. The key is Statistical Machine Translation (SMT), which breaks down into areas such as syntax-based SMT and phrase-based SMT. Plus there are Word Alignment and Language Modeling technologies.
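For readers wondering what the language modeling piece contributes, here is a minimal sketch of a bigram language model with add-one smoothing that scores candidate outputs for fluency. It is our own illustration and not Microsoft's system: the three training sentences, the function names `train_bigram_lm` and `log_prob`, and the choice of add-one smoothing are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical training text (our assumption): a few English sentences.
training_sentences = [
    "the court granted the motion",
    "the court denied the motion",
    "the judge granted the request",
]


def train_bigram_lm(sentences):
    """Count word pairs and their left-hand contexts, with sentence markers."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed log probability of a sentence under the bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


model = train_bigram_lm(training_sentences)
natural = "the court granted the motion"
scrambled = "the court the motion granted"  # same words, un-English order
print(log_prob(natural, model) > log_prob(scrambled, model))
```

Both candidates contain exactly the same words, yet the model prefers the first because its word pairs were seen in the training text. In an SMT system a score of this kind is typically combined with the translation model's score, so that the output is both faithful to the source and reads like the target language.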

These advanced language modeling toolkits mean that problems with morphology, syntax, semantics and word sense disambiguation are being solved. For the vendors and the multinational companies who need it, the business model is a no-brainer. An automated, instant, seamless translation platform is valuable enough to a corporation that Google, IBM and Microsoft could charge a substantial amount of money for such a tool.

For now, contract attorneys with non-English language skills will stay in demand. But just as predictive coding has rent the English-language document review market (and is being used in more and more non-English-language document reviews), those nasty algorithms are making their way across all languages. As Marc Andreessen said in his prescient essay Software is eating the world, “all of the technology required to transform industries through software is finally working and can be widely delivered at global scale … don’t be on the wrong side of software-based disruption”.

For the Der Spiegel piece we noted above please click here.
