Postgres Full-Text Search: A Search Engine in a Database
Early in on my SQL journey, I thought that searching for a piece of text in the database mostly involved querying like this:
SELECT col FROM table WHERE col LIKE '%some_value%';
Then I would throw in some wildcard operators or regular expressions if I wanted to get more specific.
Later on, I worked with a client who wanted search functionality in an app, so LIKE
and regex weren't going to cut it. What I had known all along was just pattern matching. It works perfectly fine for certain purposes, but what happens when it's not just a matter of checking for a straightforward pattern in a single text field?
For example, what if you wanted to search across multiple fields? How about returning possible matches even if the search term happens to be misspelled? Also, what if you have very large amounts of data to search on? Sure, you can create indexes for columns on which you want to query for pattern matches, but that will have limitations (for instance, the B-tree index doesn't work for col LIKE '%substring%'
).
So when we say PostgreSQL is the "batteries included database," this is just one reason why. With Postgres, you don't need to immediately look farther than your own database management system for a full-text search solution. If you haven't yet given Postgres' built-in full-text search a try, read on for a simple intro.
Postgres Full-Text Search Basics for the Uninitiated
Core Postgres includes the following full-text search capabilities. To name a few:
- Ignore stop words (common words such as "the" or "an").
- Stemming, where search matches can be based on a "root" form, or stem, of a word (“run” matches “runs” and “running” and even “ran”).
- Weight and rank search matches (so best matches can be sorted to the top of a result list).
Before we go further, let's also get ourselves familiarized with the following concepts:
- A document is a set of data on which you want to carry out your full-text search. In Postgres, this could be built from a single column, or a combination of columns, even from multiple tables.
- The document is parsed into tokens, which are small fragments (e.g. words, phrases, etc) from the document's text. Tokens are then converted to more meaningful units of text called lexemes.
- In Postgres, this conversion is done with dictionaries -- there are built-in ones, but custom dictionaries can be created if necessary. These dictionaries help determine stop words that should get ignored, and whether differently-derived words have the same stem. Most dictionaries are for a specific language (English, German, etc) but you could also have ones that are for a specific domain.
- The sorted list of lexemes from the document is stored in the
tsvector
data type.
Example: Searching Storm Event Details
I have a table that contains storm events data gathered by the U.S. National Weather Service. For simplicity's sake I won't include all possible fields in the statement below, but there's a copy of the data and some further information available in this repository.
CREATE TABLE se_details (
episode_id int,
event_id int primary key,
state text,
event_type text,
begin_date_time timestamp,
episode_narrative text,
event_narrative text,
...
);
Let's also say that we want to carry out a full-text search on the data on the event_narrative
column. We could add a new column to the table to store the preprocessed search document (i.e. the list of lexemes):
ALTER TABLE se_details ADD COLUMN ts tsvector
GENERATED ALWAYS AS (to_tsvector('english', event_narrative)) STORED;
ts is a generated column (new as of Postgres 12), and it's automatically synced with the source data.
We can then create a GIN index on ts:
CREATE INDEX ts_idx ON se_details USING GIN (ts);
And then we can query like so:
SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', 'tornado');
tsquery is the other full-text search data type in Postgres. It represents search terms that have also been processed as lexemes, so we'll pass in our input term to the to_tsquery
function in order to optimize our query for full-text search. (@@
is a match operator.)
What we get with this query are records where "tornado" is somewhere in the text string, but in addition to that, here are a couple of records in the result set where there are also matches for "tornado" as lexeme ("tornado-like" and "tornadoes"):
state | KENTUCKY
begin_date_time | 2018-04-03 18:08:00
event_type | Thunderstorm Wind
event_narrative | A 1.5 mile wide swath of winds gusting to around 95 mph created **tornado-like** damage along Kentucky Highway 259 in Edmons
on County. The winds, extending 3/4 of a mile north and south of Bee Spring, destroyed or heavily damaged several small outbuildings, tore
part of the roof off of one home, uprooted and snapped the trunks of numerous trees, and snapped around a dozen power poles. Several othe
r homes sustained roof damage, and wind-driven hail shredded vinyl siding on a number of buildings.
state | WISCONSIN
begin_date_time | 2018-08-28 15:30:00
event_type | Thunderstorm Wind
event_narrative | A swath of widespread tree and crop damage across the southern portion of the county. Sections of trees and crops compl
etely flattened, and some structural damage from fallen trees or due to the strong downburst winds. Various roads closed due to fallen tre
es. Two semi-trucks were overturned on highway 57 in Waldo. The widespread wind damage and tornadoes caused structural damage to many home
s with 70 homes sustaining affected damage, 3 homes with minor damage, 2 homes with major damage, one home destroyed, and 2 businesses wit
h minor damage.
Searching for Phrases
One way to handle phrases as search terms is to use the &
(AND
) or <->
(FOLLOWED BY
) Boolean operators with the tsquery
.
For example, if we want to search for the phrase "rain of debris":
SELECT state, begin_date_time, event_type, event_narrative
FROM se_details
WHERE ts @@ to_tsquery('english', **'rain & of & debris'**);
The search phrase gets normalized to 'rain' & 'debri'. The order doesn't matter as long as both 'rain' and 'debri' have matches in the document, such as this example:
A debris flow caused by heavy rain on a saturated hillside blocked the Omak River Road one mile south of the intersection with State Route 97.
If we do to_tsquery('english', 'rain <-> of <-> debris')
the tsquery value is 'rain' <2> 'debri'
, meaning it will only match where 'rain' is followed by 'debri' precisely two positions away, such as here:
Heavy rain caused debris flows on the Coal Hollow Fire and Tank Hollow Fire burn scars.
(This was actually the only match, so using the <-> operator is a little bit more restrictive.)
The phraseto_tsquery
function can also parse the phrase itself, and inserts <N>
between lexemes where N is the integer position of the next lexeme when counting from the preceding one. This function doesn't recognize operators unlike to_tsquery; for example, we can just pass in the entire phrase like so:
phraseto_tsquery('english', 'rain of debris')
The tsquery
value is 'rain' <2> 'debri'
like above, so phraseto_tsquery
also accounts for positioning.
Functions for Weighting and Ranking Search Results
One very common use case for assigning different weights and ranking is searching on articles. For example, you may want to merge the article title and abstract or content together for search, but want matches on title to be considered more relevant and thus rank higher.
Going back to our storm events example, our data table also has an episode_narrative
column in addition to event_narrative
. For storm data, an event is an individual type of storm event (e.g. flood, hail), while an episode is an entire storm system and could contain many different types of events.
Let's say we want to be able to carry out a full-text search on event as well as episode narratives, but have decided that the event narrative should weigh more than the episode narratives. We could define the ts column like this instead:
ALTER TABLE se_details ADD COLUMN ts tsvector
GENERATED ALWAYS AS
**(setweight(to_tsvector('english', coalesce(event_narrative, '')), 'A') ||**
**setweight(to_tsvector('english', coalesce(episode_narrative, '')), 'B'))** STORED;
setweight
is a full-text function that assigns a weight to the components of a document. The function takes the characters 'A', 'B', 'C', or 'D' (most weight to least, in that order). We're also using a coalesce here so that the concatenation doesn't result in nulls if either episode_narrative
or event_narrative
contain null values.
You could then use the ts_rank
function in an ORDER BY
clause to return results from most relevant to less.
SELECT …
ORDER BY ts_rank(ts, to_tsquery('english', 'tornado')) DESC;
So, this record is ranked higher in the search results:
state | MISSISSIPPI
begin_date_time | 2018-04-06 22:18:00
event_type | Tornado
event_narrative | This tornado touched down near the Jefferson Davis-Covington County line along Lucas Hollow Road. It continued southeast, crossing the
county line. Some large limbs and trees were snapped and uprooted at this location. It then crossed Lucas Hollow Road again before crossing Leonard Road.
A tornado debris signature was indicated on radar in these locations. The tornado uprooted and snapped many trees in this region. It also overturned a sm
all tractor trailer on Oakvale Road and caused some minor shingle damage to a home. After crossing Oakvale Road twice, the tornado lifted before crossing
Highway 35. The maximum winds in this tornado was 105mph and total path length was 2.91 miles. The maximum path width was 440 yards.
episode_narrative | A warm front was stretched across the region on April 6th. As a disturbance rode along this stalled front, it brought copious amounts
of rain to the region thanks to ample moisture in place. As daytime heating occurred, some storms developed which brought severe weather to the region.
Compared to this, where there is a match for "tornado" in episode_narrative
but not event_narrative
:
state | NEBRASKA
begin_date_time | 2018-06-06 18:10:00
event_type | Hail
event_narrative | Hail predominately penny size with some quarter size hail mixed in.
episode_narrative | Severe storms developed in the Nebraska Panhandle during the early evening hours of Jun
e 6th. As this activity tracked east, a broken line of strong to severe thunderstorms developed. Hail up to
the size of ping pong balls, thunderstorm wind gusts to 70 MPH and a brief tornado touchdown were reported
. Heavy rain also fell leading to flash flooding in western Keith county.
Tip: ts_rank returns a floating-point value, so you could include the expression in your SELECT
to see how these matches score. In my case I get around a 0.890 for the Mississippi event, and 0.243 for the Nebraska event.
Yes, You Can Keep Full-Text Search in Postgres
You can get even deeper and make your Postgres full-text search even more robust, by implementing features such as highlighting results, or writing your own custom dictionaries or functions. You could also look into enabling extensions such as unaccent (remove diacritic signs from lexemes) or pg_trgm (for fuzzy search). Speaking of extensions, those were just two of the extensions supported in Crunchy Bridge. We've built our managed cloud Postgres service such that you can dive right in and take advantage of all these Postgres features.
With all that said: as you can see, you don't need a very involved setup to get started. It's a good idea to try out whether you are just beginning to explore a full-text search solution, or even just reevaluating whether you need to go all out for a dedicated full-text search service, especially if you already have Postgres in your stack.
To be fair, Postgres doesn't have some search features that are available with platforms such as Elasticsearch. But a major advantage is that you won't have to maintain and sync a separate data store. If you don't quite need search at super scale, there might be more for you to gain by minimizing dependencies. Plus, the Postgres query syntax that you already know with the addition of some new functions and operators, can get you pretty far. Got any other questions or thoughts about full-text search with Postgres? We're happy to hear them on @crunchydata.
Related Articles
- Postgres Tuning & Performance for Analytics Data
19 min read
- Running an Async Web Query Queue with Procedures and pg_cron
6 min read
- Name Collision of the Year: Vector
9 min read
- Sidecar Service Meshes with Crunchy Postgres for Kubernetes
12 min read
- pg_incremental: Incremental Data Processing in Postgres
11 min read