A good perspective on search, and baked-in search bias


Search and retrieval systems manifest bias

BY:

Eric De Grasse
Chief Technology Officer

PROJECT COUNSEL MEDIA

 

12 October 2022 (San Francisco, California) – “Google Cloud Next” is in full swing (it runs 11-13 October) and there are 125 sessions this year. It is a global hybrid event – virtual and physical – and every session is being live streamed or is available on demand. And it is all free.

It’s quite the place to be if you want to hobnob with enterprise search architects and developers, application developers, data and machine learning engineers, data analysts and data scientists, and even DevOps and systems administrators or cybersecurity professionals. I note that this year there are a ton of C-level executives and IT administrators, probably sent by their respective Boards of Directors.

NOTE TO READERS: Of course this year’s edition has special importance because, as our boss, Greg Bufithis, detailed in a post last week, Google is trying to become a more visual, more exploratory search engine. It is trying to blow up how you think about search. It needs to compete in a world where TikTok and Instagram are changing the way the internet works – especially given TikTok has become a search engine. That is part of the reason Google came out with two new AI video systems.

Of course, while it is a “Masterclass in Search” it is not all goodness and light.

I was scanning the comments related to the Hacker News post “Google’s Millions of Search Results Are Not Being Served in the Later Pages Search Results”. Sailfast made this comment:

Yeah – as someone that has run production search clusters before on technologies like Elastic / open search, deep pagination is rarely used and an extremely annoying edge case that takes your cluster memory to zero. I found it best to optimize for whatever is a reasonable but useful for users while also preventing any really seriously resource intensive but low value queries (mostly bots / folks trying to mess with your site) to some number that will work with your server main node memory limits.

More about Elastic below. But the comment outlines a facet of search which is not often discussed.
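To make Sailfast’s point concrete, here is a minimal sketch of the kind of pagination guardrail that comment describes. It assumes the elasticsearch-py 8.x client, a local cluster, and a hypothetical index named “web_pages” – illustrative only, not anyone’s production configuration:

```python
# A sketch of capping deep pagination, per Sailfast's comment. The cluster URL
# and "web_pages" index are hypothetical placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

MAX_OFFSET = 10_000  # mirrors Elasticsearch's default index.max_result_window

def paged_search(query_text: str, page: int, page_size: int = 10):
    """Ordinary from/size paging, capped so deep pages cannot exhaust node memory."""
    offset = page * page_size
    if offset + page_size > MAX_OFFSET:
        # Deep pages are low value and memory-hungry; refuse them outright.
        raise ValueError("pagination depth exceeds the configured limit")
    return es.search(
        index="web_pages",
        query={"match": {"body": query_text}},
        from_=offset,
        size=page_size,
    )

def scan_all(query_text: str, page_size: int = 500):
    """For trusted deep scans: search_after with a stable sort instead of big offsets."""
    search_after = None
    while True:
        resp = es.search(
            index="web_pages",
            query={"match": {"body": query_text}},
            sort=[{"_doc": "asc"}],  # cheapest stable sort order
            size=page_size,
            search_after=search_after,
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break
        yield from hits
        search_after = hits[-1]["sort"]
```

The design choice is the point: somebody picks the cutoff, and everything past it simply does not exist as far as the user is concerned.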

First, search plumbing imposes certain constraints. The idea of “all” information is one that many carry around like a trusted portmanteau. What are the constraints of the actual search system available or in use?

Second, optimization is a fancy word that translates to one or more engineers deciding what to do; for example, change a Bayesian prior assumption, trim content based on server latency, filter results by domain, etc.

Third, manipulation of the search system itself by software scripts or “bots” forces engineers to figure out which signals are okay and which are not. It is possible to inject poisoned numerical strings or phrases into a content stream and manipulate the search system. (Hey, thank you, search engine optimization researchers and information warfare professionals. Great work.)
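To see how those engineering choices become bias in practice, consider a toy ranking function. None of this is any vendor’s real code – the blocked-domain list, the recency half-life, and the result budget are made-up knobs – but every real system has equivalents, and someone chose their values:

```python
# A toy illustration (not any production system) of how engineer-chosen knobs
# quietly shape what "the" results are: domain filters, recency priors, and
# latency-driven truncation are all decisions someone baked in.
from dataclasses import dataclass

@dataclass
class Hit:
    url: str
    relevance: float   # base text-match score
    age_days: int
    domain: str

BLOCKED_DOMAINS = {"spam.example"}   # someone decided what counts as spam
RECENCY_HALF_LIFE = 30.0             # someone decided fresh beats thorough
RESULT_BUDGET = 20                   # someone decided how much latency to spend

def rank(hits: list[Hit]) -> list[Hit]:
    kept = [h for h in hits if h.domain not in BLOCKED_DOMAINS]
    # A recency prior: older documents are discounted even if they match better.
    scored = sorted(
        kept,
        key=lambda h: h.relevance * (0.5 ** (h.age_days / RECENCY_HALF_LIFE)),
        reverse=True,
    )
    return scored[:RESULT_BUDGET]    # truncation for latency, not completeness
```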

When I meet a person (usually much younger than myself) who says, “I am a search expert”, I just shake my head. Even many open source intelligence “experts” reveal that they live in a cloud of unknowing about search. Most of these professionals are unaware that much of their “research” actually comes from Google search and maps. Our veteran OSINT network knows a lot better.

And this is not about ediscovery search. That is fairly primitive search – text analysis and text mining of static databases. Yes, you might be reading some nuanced documents written in highly technical language particular to a specific business, or maybe even complex information referred to in very relaxed, colloquial wording. But go to Legaltech/Legalweek in NYC any year and just pick a search vendor. Ediscovery search is commoditised and any of those search vendors will get you there. Or do it yourself. If you attended the pop-up event on enterprise search and ediscovery search for executives held a few months ago, you met with vendors in the text analysis, network analysis, and text mining area who are building out their own ediscovery engines on top of the Python programming language. Yes, even the legal industry (always the last one to learn) is becoming more sophisticated about applied data analytics, data mining, and business intelligence.

 

BUT NET NET NET … search and retrieval systems manifest bias – from the engineers, from the content itself, from the algorithms, and from the user interfaces themselves. That’s why, yes, life is easier if one just believes everything one encounters online. But thinking in a different way is difficult, requires specialist knowledge when you are dealing with dynamic databases and content, and demands a willingness to verify … everything. Through multiple systems.

 

Which brings me to Elastic, which I referenced above. Elastic is one of our primary search vehicles.

 

It seems like open-source search is under pressure. We learn from SiliconAngle that “Elastic Delivers Strong Revenue Growth and Beats Expectations, but Its Stock is Down”. For anyone unfamiliar with Elastic, the writer Mike Wheatley describes the company’s integral relationship with open-source software:

“The company sells a commercial version of the popular open-source Elasticsearch platform. Elasticsearch is used by enterprises to store, search and analyze massive volumes of structured and unstructured data. It allows them to do this very quickly, in close to real time. The platform serves as the underlying engine for millions of applications that have complex search features and requirements. In addition to Elasticsearch, Elastic also sells application observability tools that help companies to track network performance, as well as threat detection software.”
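For readers who have never touched it, here is roughly what that looks like in practice – a minimal sketch using the elasticsearch-py client, a hypothetical local cluster, and a made-up “app_logs” index, just to show the store-then-search loop Wheatley is describing:

```python
# A minimal index-and-search sketch. The cluster URL, index name, and document
# are placeholders, not a real deployment.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

doc = {"service": "checkout", "level": "ERROR", "message": "payment gateway timeout"}
es.index(index="app_logs", document=doc)
es.indices.refresh(index="app_logs")   # make the document searchable right away

resp = es.search(index="app_logs", query={"match": {"message": "timeout"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["service"], hit["_source"]["message"])
```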

Could it be that recent concerns about open-source security issues are more important to investors than fiscal success? The write-up shares some details from the company’s press release:

“The company reported a loss before certain costs such as stock compensation of 15 cents per share, coming in ahead of Wall Street analysts’ consensus estimate of a 17-cent-per-share loss. Meanwhile, Elastic’s revenue grew by 30% year-over-year, to $250.1 million, beating the consensus estimate of $246.2 million. On a constant currency basis, Elastic’s revenue rose 34%. Altogether, Elastic posted a net loss of $69.6 million, more than double the $34.4 million loss it reported in the year-ago period.”

In a series of press releases, Elastic emphatically accentuates the positive – like the dramatic growth of its cloud-based business and its flourishing subscription base. So now the question is whether the company’s new chief product officer, Ken Exner, can find a way to circumvent open-source’s inherent weaknesses.

By the way, Exner used to work at Amazon overseeing AWS Developer Tools. For those of you who use Amazon Web Services, you know you can use “AWS Elasticsearch” as an open-source analytics and search engine. There is a whole set of use cases that includes clickstream analytics, real-time application monitoring, advanced database analytics, data mining, and log analytics.
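As a rough illustration of that log-analytics use case (the AWS offering is now branded Amazon OpenSearch Service), a typical query is an aggregation rather than a relevance-ranked search. This sketch uses the opensearch-py client; the endpoint, credentials, and “web_logs” index are placeholders:

```python
# A log-analytics sketch: count error lines per service over the last hour.
# Endpoint, credentials, and index are hypothetical.
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),   # or SigV4 request signing in production
    use_ssl=True,
)

resp = client.search(
    index="web_logs",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        },
        "aggs": {"by_service": {"terms": {"field": "service.keyword"}}},
    },
)
for bucket in resp["aggregations"]["by_service"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```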

 

 
