Text Search Functions and Operators

Education is not limited to just classrooms. It can be gained anytime, anywhere... - Ravi Ranjan (M.Tech-NIT)

PostgreSQL Text Search Functions and Operators

Introduction

PostgreSQL provides two data types to support full-text search, one is tsvector and anothe is tsquery type. The tsvector type represents a document in a form optimized for text search; the tsquery type similarly represents a text query.

A tsvector value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word . Sorting and duplicate-elimination are done automatically during input, as shown in this example:

SELECT 'a fat rat sat on a mat and a mad cat ate the rat'::tsvector;

                            tsvector
----------------------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mad' 'mat' 'on' 'rat' 'sat' 'the'
(1 row)

Here in the above example, the document converted to the tsvector data type.

To represent lexemes containing whitespace or punctuation, surround them with quotes, here is the following example.

SELECT $$the lexeme '    ' contains spaces$$::tsvector;

                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'
(1 row)

The above example the dollar-quoted string have been used as literals .

To represent lexemes containing whitespace or punctuation, surround them with quotes, here is the following example.

SELECT $$the lexeme '    ' contains spaces$$::tsvector;

                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'
(1 row)

Optionally, integer positions can be attached to lexemes, here is the example below.

SELECT $$the lexeme '    ' contains spaces$$::tsvector;

                                   tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
(1 row)

A position normally indicates the source word's location in the document. Position values can range from 1 to 16383; Duplicate positions for the same lexeme are discarded.

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default and hence is not shown on output:

SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;

          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C
(1 row)

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type. The to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document.

Text Search Operators

Operator Description Example Result
@@ tsvectormatchestsquery ? SELECT to_tsvector('fat cats ate rats') @@ to_tsquery('cat & rat');
 ?column?
----------
 t
(1 row)
@@@ deprecated synonym for@@ SELECT to_tsvector('fat cats ate rats') @@@ to_tsquery('cat & rat');
 ?column?
----------
 t
(1 row)
|| concatenatetsvectors SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
         ?column?
---------------------------
 'a':1 'b':2,5 'c':3 'd':4
(1 row)
&& ANDtsquerys together SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) & 'cat'
(1 row)
|| OR tsquerys together SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
         ?column?
---------------------------
 ( 'fat' | 'rat' ) | 'cat'
(1 row)
!! negate atsquery SELECT !! 'cat'::tsquery;
 ?column?
----------
 !'cat'
(1 row)
@> tsquerycontains another ? SELECT 'cat'::tsquery @> 'cat & rat'::tsquery;
 ?column?
----------
 f
(1 row)
<@ tsquery is contained in ? SELECT 'cat'::tsquery <@ 'cat & rat'::tsquery;
 ?column?
----------
 t
(1 row)

In addition to the operators shown in the table, the ordinary B-tree comparison operators (=, <, etc) are defined for types tsvector and tsquery. These are not very useful for text searching but allow, for example, unique indexes to be built on columns of these types.

Text Search Functions

get_current_ts_config() function

This function is used to get default text search configuration.

Syntax :

get_current_ts_config()

Return Type : regconfig

Example

SELECT get_current_ts_config();

Here is the result.

 get_current_ts_config
-----------------------
 english
(1 row)

length() function

This function is used to get the number of lexemes in tsvector.

Syntax :

length(tsvector)

Return Type : integer

Example

SELECT length('fat:2,4 cat:3 rat:5A mat:1'::tsvector);

Here is the result.

 length
--------
      4
(1 row)

numnode() function

This function is used to get number of lexemes plus operators in tsquery.

Syntax :

numnode(tsquery)

Return Type : integer

Example

SELECT numnode('(mat & cat) | bat'::tsquery);

Here is the result.

 numnode
---------
       5
(1 row)

plainto_tsquery() function

This function is used to produce tsquery ignoring punctuation.

Syntax :

plainto_tsquery([ config regconfig , ] query text)

Return Type : tsquery

Example

SELECT plainto_tsquery('english', 'The Bad Cats');

Here is the result.

 plainto_tsquery
-----------------
 'bad' & 'cat'
(1 row)

querytree() function

This function is used to get indexable part of a tsquery.

Syntax :

querytree(query tsquery)

Return Type : text

Example

SELECT querytree('fat & ! cat'::tsquery);

Here is the result.

 querytree
-----------
 'fat'
(1 row)

setweight() function

This function is used to assign weight to each element of tsvector.

Syntax :

setweight(tsvector, "char")

Return Type : tsvector

Example

SELECT setweight('sat:1,5 bat:3 rat:2B'::tsvector, 'A');

Here is the result.

           setweight
-------------------------------
 'bat':3A 'rat':2A 'sat':1A,5A

Here is another one example.

SELECT setweight('sat:1,5 bat:3 rat:2B'::tsvector, 'D');

Here is the result.

         setweight
---------------------------
 'bat':3 'rat':2 'sat':1,5
(1 row)

strip() function

This function is used to remove positions and weights from tsvector.

Syntax :

strip(tsvector)

Return Type : tsvector

Example

SELECT strip('fat:2,5 cat:1 rat:4A'::tsvector);

Here is the result.

       strip
-------------------
 'cat' 'fat' 'rat'
(1 row)

to_tsquery() function

This function is used to normalize words and convert to tsquery.

Syntax :

to_tsquery([ config regconfig , ] query text)

Return Type : tsquery

Example

SELECT to_tsquery('english', 'The & Pet & Cats');

Here is the result.

  to_tsquery
---------------
 'pet' & 'cat'
(1 row)

to_tsvector() function

This function is used to reduce document text to tsvector.

Syntax :

to_tsvector([ config regconfig , ] document text)

Return Type : tsvector

Example

SELECT to_tsvector('english', 'The Fat Rats');

Here is the result.

   to_tsvector
-----------------
 'fat':2 'rat':3
(1 row)

ts_headline() function

This function is used to display a query match.

Syntax :

ts_headline([ config regconfig, ] document text, query tsquery [, options text ])

Return Type : text

Example

SELECT ts_headline('x y z', 'z'::tsquery);

Here is the result.

 ts_headline
--------------
 x y z
(1 row)

ts_rank() function

This function is used to rank document for a query.

Syntax :

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ])

Return Type : float4

Example

SELECT ts_rank(to_tsvector('The quick brown fox jumps over the lazy dog.'),
               to_tsquery('fox'));

This is a regular query where we feed a string which we convert to a tsvector and a token which is converted to a tsquery. Here is the ranking result.

  ts_rank
-----------
 0.0607927
(1 row)

Here is another one.

SELECT ts_rank(to_tsvector('The quick brown fox jumps over the lazy dog.'),
               to_tsquery('fox & dog'));

Now we want to query the two tokens fox and dog . We chain them together in an AND (&) formation. Here is the ranking result.

 ts_rank
---------
 0.09149
(1 row)

Here is another one with more tokens.

SELECT ts_rank(to_tsvector('The quick brown fox jumps over the lazy dog.'),
               to_tsquery('fox & dog & lazy'));

The more tokens match the string, the higher the ranking will be. Here is the ranking.

 ts_rank
---------
 0.25948
(1 row)

 

ts_rank_cd() function

This function is used to rank document for query using cover density.

Syntax :

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ])

Return Type : float4

Example

SELECT ts_rank_cd('{0.1, 0.2, 0.4, 1.0}', to_tsvector('PostgreSQL full text search is a wonderful methode'), 
to_tsquery('wonderful'));

Here is the result.

 ts_rank_cd
------------
        0.1
(1 row)

ts_rewrite() function

This function is used to searches for the given tsquery in the target tsquery and replaces its occurences with the substitute tsquery.

Syntax :

ts_rewrite(query tsquery, target tsquery, substitute tsquery)

Return Type : tsquery

Example

SELECT ts_rewrite('c & b'::tsquery, 'b'::tsquery, 'bad|bat'::tsquery);

Here is the result.

       ts_rewrite
-------------------------
 ( 'bad' | 'bat' ) & 'c'
(1 row)

ts_rewrite() function

This function is used to replace using targets and substitutes from a SELECT command.

Syntax :

ts_rewrite(query tsquery, select text)

Return Type : tsquery

Example

CREATE TABLE test (col1 tsquery PRIMARY KEY, COL2 tsquery);
INSERT INTO test values ('p','q');
SELECT ts_rewrite('r & q'::tsquery, 'SELECT col1,col2 FROM test');

Here is the result.

 ts_rewrite
------------
 'r' & 'q'
(1 row)

Here is another one.

SELECT ts_rewrite('p & q'::tsquery, 'SELECT col1,col2 FROM test WHERE ''p & q'::tsquery @> col1);

Here is the result.

 ts_rewrite
------------
 'q' & 'q'
(1 row)

tsvector_update_trigger() function

This function triggered function for automatic tsvector column update

Syntax :

tsvector_update_trigger()

Return Type : trigger

Here is the step by step process to execute the example.

At first, create a table mytable.


CREATE TABLE mytable(
title text,
body text,
tsv tsvector);


Here we create a trigger tsvectorupdate.


CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON mytable FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(tsv,'pg_catalog.english',title,body);


Here we insert one row into the mytable.


INSERT INTO mytable VALUES('here is title','here is the body text');


Here is the table mytable:


postgres=# SELECT * FROM mytable;
     title     |         body          |            tsv
---------------+-----------------------+----------------------------
 here is title | here is the body text | 'bodi':7 'text':8 'titl':3
(1 row)


Now see the query:


SELECT title, body FROM mytable WHERE tsv@@ to_tsquery('title & body');

Here is the result.

     title     |         body
---------------+-----------------------
 here is title | here is the body text
(1 row)

Text Search Debugging Functions

The functions are listed separately because they are not usually used in everyday text searching operations. They are helpful for development and debugging of new text search configurations.

ts_debug() function

test a configuration

Syntax :

ts_debug([ config regconfig, ] document text, OUT alias text, OUT description text, OUT token text, OUT dictionaries regdictionary[], OUT dictionary regdictionary, OUT lexemes text[])

Return Type : setof record

Example

SELECT ts_debug('english', 'The earth is round.');

Here is the result.

                                ts_debug
-------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",The,{english_stem},english_stem,{})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",earth,{english_stem},english_stem,{earth})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",is,{english_stem},english_stem,{})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",round,{english_stem},english_stem,{round})
 (blank,"Space symbols",.,{},,)
(8 rows)

ts_lexize() function

This function is used to test a dictionary.

Syntax :

ts_lexize(dict regdictionary, token text)

Return Type : text[]

Example

SELECT ts_lexize('english_stem', 'animals');

Here is the result.

 ts_lexize
-----------
 {anim}
(1 row)

ts_parse() function

test a parser

Syntax :

ts_parse(parser_name text, document text, OUT tokid integer, OUT token text)

Return Type : setof record

Example

SELECT ts_parse('default', 'bad - cat');

Here is the result.

 ts_parse
-----------
 (1,bad)
 (12," ")
 (12,"- ")
 (1,cat)
(4 rows)

ts_parse() function

This function is used to test a parser.

Syntax :

ts_parse(parser_oid oid, document text, OUT tokid integer, OUT token text)

Return Type : setof record

Example

SELECT ts_parse(3722, 'bad - cat');

Here is the result.

 ts_parse
-----------
 (1,bad)
 (12," ")
 (12,"- ")
 (1,cat)
(4 rows)

ts_token_type() function

get token types defined by parser

Syntax :

ts_token_type(parser_name text, OUT tokid integer, OUT alias text, OUT description text)

Return Type : setof record

Example

SELECT ts_token_type('default');

Here is the result.

                        ts_token_type
--------------------------------------------------------------
 (1,asciiword,"Word, all ASCII")
 (2,word,"Word, all letters")
 (3,numword,"Word, letters and digits")
 (4,email,"Email address")
 (5,url,URL)
 (6,host,Host)
 (7,sfloat,"Scientific notation")
 (8,version,"Version number")
 (9,hword_numpart,"Hyphenated word part, letters and digits")
 (10,hword_part,"Hyphenated word part, all letters")
 (11,hword_asciipart,"Hyphenated word part, all ASCII")
 (12,blank,"Space symbols")
 (13,tag,"XML tag")
 (14,protocol,"Protocol head")
 (15,numhword,"Hyphenated word, letters and digits")
 (16,asciihword,"Hyphenated word, all ASCII")
 (17,hword,"Hyphenated word, all letters")
 (18,url_path,"URL path")
 (19,file,"File or path name")
 (20,float,"Decimal notation")
 (21,int,"Signed integer")
 (22,uint,"Unsigned integer")
 (23,entity,"XML entity")
(23 rows)

ts_token_type() function

This function is used to get token types defined by parser.

Syntax :

ts_token_type(parser_oid oid, OUT tokid integer, OUT alias text, OUT description text)

Return Type : setof record

Example

SELECT ts_token_type(3722);

Here is the result.

                        ts_token_type
--------------------------------------------------------------
 (1,asciiword,"Word, all ASCII")
 (2,word,"Word, all letters")
 (3,numword,"Word, letters and digits")
 (4,email,"Email address")
 (5,url,URL)
 (6,host,Host)
 (7,sfloat,"Scientific notation")
 (8,version,"Version number")
 (9,hword_numpart,"Hyphenated word part, letters and digits")
 (10,hword_part,"Hyphenated word part, all letters")
 (11,hword_asciipart,"Hyphenated word part, all ASCII")
 (12,blank,"Space symbols")
 (13,tag,"XML tag")
 (14,protocol,"Protocol head")
 (15,numhword,"Hyphenated word, letters and digits")
 (16,asciihword,"Hyphenated word, all ASCII")
 (17,hword,"Hyphenated word, all letters")
 (18,url_path,"URL path")
 (19,file,"File or path name")
 (20,float,"Decimal notation")
 (21,int,"Signed integer")
 (22,uint,"Unsigned integer")
 (23,entity,"XML entity")
(23 rows)

ts_stat () function

This function is used to get statistics of a tsvector column.

Syntax :

ts_stat(sqlquery text, [ weights text, ] OUT word text, OUT ndoc integer, OUT nentry integer)

Return Type : setof record

Example

Here is the table mytable:


postgres=# SELECT * FROM mytable;
     title     |         body          |            tsv
---------------+-----------------------+----------------------------
 here is title | here is the body text | 'bodi':7 'text':8 'titl':3
(1 row)
SELECT ts_stat('SELECT tsv from mytable');

Here is the result.

  ts_stat
------------
 (titl,1,1)
 (text,1,1)
 (bodi,1,1)
(3 rows)