This article explains how to use Postgres joins to do fast text analytics, using an example of a recent project we’ve been working on at Outlandish.

TL;DR

Add a tsvector column to your documents table and a tsquery column to your search-terms table, index the tsvector with a GIN index, and join the two tables with the @@ match operator – Postgres will match tens of thousands of search terms against tens of thousands of documents in a few seconds.

Background

I learnt to code while prototyping a new search system for the BBC, and search has stayed a discipline close to my heart. Outlandish does a lot of work with search and text analytics – everything from simple document repositories like Culture Hive to much more in-depth platforms like PNGi Portal – an anti-corruption site that helps people find dodgy dealings in Papua New Guinea and which has already contributed to the removal of several ministers and senior civil servants.

A lot of the work that we do in this area is fairly quick and dirty – someone spends a few days or weeks doing some exploration that helps move on a journalist’s or academic’s research project. Quite often we’re just trying to give someone some pointers on where to start their research without them having to wade through thousands of documents.

We use a variety of tools for text analysis including the excellent ElasticSearch, custom map/reduce tasks and, increasingly, Postgres’ full text search capabilities.

For those who don’t know, Postgres is a relational database system similar to MySQL/MariaDB or Oracle. The main advantage of Postgres is that it has a large number of extensions that provide some very powerful functionality. For example, it has an indexable binary JSON datatype that makes it suitable for a lot of the flexible workloads that people often use NoSQL databases such as MongoDB for, but with the benefits of joins, transactions, etc.

Postgres provides two main datatypes for fulltext search – tsquery and tsvector – as well as indexes that make lookups using these data formats very fast. I’m not going to go into the whole mechanics of full text search here – the excellent “Postgres search is good enough” article is a good place to start. Instead I’m going to focus on the specific mechanics of using tsquery->tsvector joins to run fast and simple text analytics.

Example: corruption data case study

Our dataset included the names and various other details about 50,000 companies and about 60,000 government documents with an average length of ~11,000 characters (~5 pages). Our clients (a university research team) wanted to know which companies were receiving government support so that they could start looking into those companies in more detail.

The key data we’re interested in are each company’s name and “abbreviated name”, and the body text of each document.

We end up with something like this:
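A minimal sketch of the two tables (the real schema had more columns; these names are assumptions):

    -- Companies we want to find mentions of
    CREATE TABLE companies (
        id               serial PRIMARY KEY,
        registration_id  text,
        legal_name       text,
        abbreviated_name text
    );

    -- Government documents to search through
    CREATE TABLE documents (
        id        serial PRIMARY KEY,
        title     text,
        body_text text
    );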

Next we import all the text and companies from our data and then create indexes so that we can quickly compare the two tables.
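How you import will depend on where your data lives; loading CSV exports with psql’s \copy is one minimal option (the file names here are hypothetical):

    -- Load the source data from CSV exports
    \copy companies (registration_id, legal_name, abbreviated_name) FROM 'companies.csv' CSV HEADER
    \copy documents (title, body_text) FROM 'documents.csv' CSV HEADER

The indexes themselves are created alongside the new columns in the next section.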

Adding magic

The document vector is fairly straightforward:
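Something like this, assuming the body_text column from the schema above:

    -- Add a tsvector column, populate it from the body text using the
    -- "simple" configuration (no stemming), and index it for fast lookups
    ALTER TABLE documents ADD COLUMN body_vector tsvector;

    UPDATE documents
    SET body_vector = to_tsvector('simple', coalesce(body_text, ''));

    CREATE INDEX documents_body_vector_idx
        ON documents USING gin (body_vector);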

The main thing to note is that we’re using the “simple” vector settings rather than the default “english” settings. This is because most dictionaries do some pre-processing called stemming, which converts similar terms into the same token:
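For example, with the “english” configuration, three different forms of the same word collapse into a single token:

    SELECT to_tsvector('english', 'Consulting consultants consulted');
    --   to_tsvector
    -- -----------------
    --  'consult':1,2,3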

This is useful if you’re trying to do some fuzzy matching, but not if you’re trying to identify exact company names.

Fortunately the “simple” settings give us the behaviour we want and don’t apply stemming:
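Running the same phrase through “simple”, each form survives intact:

    SELECT to_tsvector('simple', 'Consulting consultants consulted');
    --                  to_tsvector
    -- ---------------------------------------------
    --  'consultants':2 'consulted':3 'consulting':1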

For what it’s worth, a tsvector is essentially a list of the words that appear in the given text, with numbers to show where each word appeared:
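For example (using “simple” again):

    SELECT to_tsvector('simple', 'the quick brown fox jumped right over the lazy dog');
    --                                    to_tsvector
    -- -----------------------------------------------------------------------------------
    --  'brown':3 'dog':10 'fox':4 'jumped':5 'lazy':9 'over':7 'quick':2 'right':6 'the':1,8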

(note that ‘the’ is the 1st and 8th word in this phrase)

The next thing to do is to add a tsquery column to the companies table:
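Something along these lines (a sketch – the column names and the exact cleaning rules are assumptions):

    -- Build one tsquery per company of the form
    -- "registration id OR legal name OR abbreviated name"
    ALTER TABLE companies ADD COLUMN name_query tsquery;

    UPDATE companies
    SET name_query = to_tsquery('simple', concat_ws(' | ',
        -- CASE skips empty fields so we never build an invalid query fragment
        CASE WHEN coalesce(registration_id, '') <> ''
             THEN plainto_tsquery('simple', registration_id)::text END,
        -- REPLACE strips double quotes that would otherwise break the query
        CASE WHEN coalesce(legal_name, '') <> ''
             THEN phraseto_tsquery('simple', replace(legal_name, '"', ''))::text END,
        CASE WHEN coalesce(abbreviated_name, '') <> ''
             THEN phraseto_tsquery('simple', replace(abbreviated_name, '"', ''))::text END
    ))
    WHERE coalesce(registration_id, '') <> ''
       OR coalesce(legal_name, '') <> ''
       OR coalesce(abbreviated_name, '') <> '';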

This is a bit more complicated because we’re creating a query for each company which is essentially “ID number OR legalName OR abbreviatedName”. We’re using CASE (i.e. IF) expressions because not all of the companies have all of the fields and we’ll get a syntax error if we try to create a query from an empty string. We’re also doing a bit of dirty on-the-fly data cleaning (the REPLACE of ‘”’ with ‘’) to stop quotes in company names breaking the query.

It’s worth noting that the plainto_tsquery and phraseto_tsquery functions are useful helpers. plainto_tsquery does some useful escaping of special characters (including the query operators such as |). phraseto_tsquery turns the phrase you type into a query which filters out results where the words are not consecutive. The phrase query looks like this:
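For example, with an invented company name:

    SELECT phraseto_tsquery('simple', 'acme holdings limited');
    --           phraseto_tsquery
    -- -------------------------------------
    --  'acme' <-> 'holdings' <-> 'limited'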

as opposed to a plain AND query which looks like this:
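The same name again:

    SELECT plainto_tsquery('simple', 'acme holdings limited');
    --        plainto_tsquery
    -- --------------------------------
    --  'acme' & 'holdings' & 'limited'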

Querying the data

With the magic indexed columns in place we can query the data pretty easily. To get a list of all the matches:
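Something like the following, assuming the column names above:

    -- Every (company, document) pair where the company's query
    -- matches the document's text
    SELECT c.id AS company_id, d.id AS document_id
    FROM companies c
    JOIN documents d ON d.body_vector @@ c.name_query;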

Magically, the database will return a list of all the mentions of all of the companies in all of the documents in a few seconds. You can then run all the normal analytic queries on that data (e.g. find the company mentioned in the most documents, or the documents that mention the most companies that are not mentioned in other documents). In Postgres you can make it all a lot faster by creating a materialised view:
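A sketch of the materialised view (again using the assumed column names):

    -- Materialise the expensive join once so the analytic queries
    -- that follow don't re-run the full text match every time
    CREATE MATERIALIZED VIEW company_mentions AS
    SELECT c.id AS company_id, d.id AS document_id
    FROM companies c
    JOIN documents d ON d.body_vector @@ c.name_query;

    -- e.g. the companies mentioned in the most documents
    SELECT company_id, count(*) AS mentions
    FROM company_mentions
    GROUP BY company_id
    ORDER BY mentions DESC
    LIMIT 10;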

Find out more

I was meaning to write a whole load more detail on this but I’ve run out of time, so if you want to know more, get in touch with me at harry@outlandish.com and I’ll write some more.