Empress
Technical News – January 2009
Database Text SearchIndex – Fast Text Data Retrieval – Part 1
 
Introduction
Empress
Ultra Embedded 10.20 offers many additional features to application developers.
One of those features is a Text Search Index capability.
Empress Text
Search Index
Empress
Ultra Embedded V10.20 Text Search Index capability empowers application
developers to implement an efficient search for database records using
keywords/tokens/phrases. The most typical usage would be to associate those
keywords with particular character/text based attribute in a database table.
The
search index capability is developed as an additional set of C calls that are
used in conjunction with Empress C/C++ Kernel Level API – mr Routines. The
search index is a user maintained index – an index not maintained via Empress
database engine calls.
Application
would supply the list (array) of tokens/keywords/phrases on insertion in the
Empress database in order to create a text search index.
Lets take
an example of storing song titles in an Empress TEXT data type attribute. For
the famous Beatles song “I Want To Hold Your Hand” the list of tokens/keywords/phrases could look like as
follows:
Want
Hold
Hand
I
Hold Your
Your Hand
Hold Your Hand
In the
above example the list contains seven tokens. Tokens can contain a single word
or multiple words. Words do not necessarily have to be contained in the
original song title. This could be beneficial when, for example, using either
token Mc or Mac to get a song title “Old Macdonald”.
Once
created an Empress text search index enables very powerful searches using
inserted tokens/keywords.
In the
presence of hundreds of thousands of song titles a search for the song title
with a token Hold could
result in the result set containing the following titles:
You Really Got A Hold on Me (Beatles)
If We Hold On
(Diana Ross)
I Want To Hold Your Hand (Beatles)
Hold Me Back
(Ac/Dc)
Hold Me, Thrill Me, Kill Me (U2)
Retrieval Using
Search Index
The
following example provides the actual program code (select_text.c) using
Empress C/C++ Kernel Level API – mr Routines to show how to retrieve records
from the table songs using the search index on the same table. The
search index C API calls are in bold. The example performs the retrieval of all
the records from the table that contain token “Hold”. When translated into SQL the
example does something like:
SELECT
id, title FROM songs WHERE title HAS “Hold
#include      < mscc.h>
#include 
 
 
< tsi_api.hx>
#ifdef 
WINCE
global_shared_func    (char* format, ...)
#define 
 
 
 
 
 
 
 
swprintf               
#define 
 
 
 
 
 
 
 
printf                      mssio_printf
#endif
#define 
 
 
 
DATABASE 
 
L" karaokedb"
int 
 
 
 
 
msmain
(int 
 
 
argc,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
char*  argv[])
{
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
wchar_t*                                  id_value
 
 
 
 
 
 
 
wchar_t*                                  title_value
 
 
 
 
 
 
 
void*                         
 
 
 
 
 
 
 
long*                         
            char                   
}
              =
mropen (DATABASE, L" songs" , 'r')
 
 
 
 
 
 
 
songs_recdesc = mrmkrec (songs_tabdesc)
 
 
 
 
 
 
 
id_attrdesc
= mrngeta (songs_tabdesc, L" id"
 
 
 
 
 
 
 
title_attrdesc
= mrngeta (songs_tabdesc, L" title" )
 
 
 
 
 
 
 
index_handle=
mstsi_open (songs_tabdesc , title_attrdesc, 'r')
 
 
 
 
 
 
 
id_value =
mrspv (id_attrdesc)
 
 
 
 
 
 
 
title_value =
mrspv (title_attrdesc)
 
 
 
 
 
 
 
printf (" Songs
that contain token Hold\n\n"
 
 
 
 
 
 
 
 
 
  printf (buf, %-12s
%-22s\n"
, " Id" Title" )
 
 
 
 
 
 
 
record_list
= mstsi_get (index_handle, L" Hold" , 0)
 
 
 
 
 
 
 
if (record_list == 0)
 
 
 
 
 
 
 
{
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
printf
("
No songs with token Hold\n"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
return 0
 
 
 
 
 
 
 
}   
 
 
 
 
 
 
 
qual = mrqlst
(songs_tabdesc, record_list)
 
 
 
 
 
 
 
retrieve_desc =
mrgetbegin (qual, songs_recdesc, (void*) 0)
              (mrget (retrieve_desc))
 
 
 
 
 
 
 
{
 
 
 
 
 
 
 
 
 
 
mrcopyv
(songs_recdesc, id_attrdesc, id_value)
 
 
 
 
 
 
 
 
 
 
mrcopyv
(songs_recdesc, title_attrdesc, title_value)
 
 
 
 
 
 
 
 
 
        sprintf (buf, " %-12ls
%-22ls\n"
, id_value, title_value)
 
 
 
 
 
 
 
 
 
        printf (" %s" , buf)
 
 
 
 
 
 
 
}
 
 
 
 
 
 
 
mrgetend
(retrieve_desc)
 
 
 
 
 
 
 
mrfree
(id_value)
 
 
 
        mrfree (title_value)
 
 
 
 
 
 
 
mrfrrec
(songs_recdesc)
 
 
 
 
 
 
 
mstsi_close
(index_handle)
 
 
 
 
 
 
 
 
 
mrclose
(songs_tabdesc)
 
 
 
 
 
 
 
 
 
  printf ( select1_text is done\n" )
 
 
 
 
 
 
 
return 0
} 
Performance Considerations
When
executing the above mr program the performance could be several orders of
magnitude better than when performing the same search using an SQL query, e.g.:
SELECT
id, title FROM songs WHERE title LIKE “%Hold%”
or
SELECT
id, title FROM songs WHERE title SMATCH “*
In another
example a search for a keyword " Embedded" was performed against all titles stored in Empress
database containing Wikipedia data. The table pages contained
7,649,051 titles and related information such as
and author.
Using the text search index, the search completed in 0.077 seconds with 131 titles
in the result set.
For the comparison, an SQL query was executed to do the similar search on the
TEXT attribute title
SELECT
id, title FROM pages WHERE title SMATCH '
It required 35.314 seconds for the SQL query to complete.
The
Empress text search index was about 500 times faster than the SQL query!!
 
EmpressSoftware Inc.
www.empress.com