CTX_DOC
manipulate SQL queries by injecting arbitrary SQL code via the THEMES, GIST, TOKENS, FILTER, HIGHLIGHT, and MARKUP procedures
-----------
Sectioner
The sectioner object is responsible for identifying the containing section(s) for each text unit. Typically, these sections will be predefined HyperText Markup Language (HTML) or eXtensible Markup Language (XML) sections. Optionally,the sectioner can process all tags as sections delimiters.
For example:
<TITLE>XML Handbook</TITLE>. This allows search between tags using the WITHIN operator. Use of the WITHIN is illustrated in the section on XML searching.
Showing posts with label Oracle Text. Show all posts
Showing posts with label Oracle Text. Show all posts
Mar 3, 2009
Mar 2, 2009
Oracle Text - Part 5
Generating XML Output
The INDEX_STATS procedure supports both formatted text and XML output.The following code
creates the INDEX_STATS report in text format:
-- Table to store our report
CREATE TABLE index_report (id NUMBER(10), report CLOB);
DECLARE
v_report CLOB := null;
BEGIN
CTX_REPORT.INDEX_STATS(index_name => 'SONG_INDEX',
report => v_report,part_name => NULL,frag_stats => NULL,
list_size => 20,report_format => NULL);
INSERT INTO index_report (id, report) VALUES (1, v_report);
COMMIT;
DBMS_LOB.FREETEMPORARY(v_report);
END;
/
Then output can be viewed:
SELECT report FROM index_report WHERE id = 1;
To find out the size of the index, we can use the function CTX_REPORT.INDEX_SIZE. This is a function returning a CLOB.
select ctx_report.index_size('SONG_INDEX') from dual;
for a XML report :
SQL> select ctx_report.index_size('SONG_INDEX', null, 'XML') from dual;
This will give us a heap of XML output, including sizes for each object.
Here's a section of the output:
<SIZE_OBJECT_NAME> text_user.SYS_IL0000051565C00006$$
</SIZE_OBJECT_NAME>
<SIZE_OBJECT_TYPE> INDEX (LOB)
</SIZE_OBJECT_TYPE>
<SIZE_TABLE_NAME> text_user.DR$TI$I
</SIZE_TABLE_NAME>
<SIZE_TABLESPACE_NAME> USERS
</SIZE_TABLESPACE_NAME>
<SIZE_BLOCKS_ALLOCATED> 8
</SIZE_BLOCKS_ALLOCATED>
<SIZE_BLOCKS_USED> 4
</SIZE_BLOCKS_USED>
<SIZE_BYTES_ALLOCATED> 65536
</SIZE_BYTES_ALLOCATED>
<SIZE_BYTES_USED> 32768
</SIZE_BYTES_USED>
</SIZE_OBJECT>
<SIZE_OBJECT>
selecting a CLOB, force it into an XMLTYPE value:
select xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')) from dual;
select extract(xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')), '//SIZE_OBJECT') from dual;
We're now using the XML DB "extract" function to fetch all of the XML within tags. The syntax '//SIZE_OBJECT' is an XPATH expression meaning "all the XML within a SIZE_OBJECT element anywhere below the root element".the reader is encouraged to look at the tutorial at w3Schools.
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT') from dual;
Now specify that only information from the table $I is want . this is done with
within a predicate the xpath expression:
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]') from dual;
The size in bytes can be get by adding the XPATH expression:
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED') from dual;
To avoid tags. There's two ways of doing this 1) the text() function within the XPATH, 2) use extractValue rather than extract.
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED/text()') as "Table Size" from dual;
select extractValue(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED')as "Table Size" from dual;
For fetching ALL the sizes, perhaps to add them together to get a summary XMLSequence can use,which returns a collection of XMLType values. This function can use in a TABLE clause to unnest the collection values into multiple rows. Now that we're generating a "table", we no longer need to reference DUAL.
To get just the sizes:
select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT/SIZE_BYTES_USED')));
To get all the object info, one object per row:
select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT')));
Processing information into individual values:
select extractValue(Column_Value, '/SIZE_OBJECT/SIZE_OBJECT_NAME') as "Name",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_TABLESPACE_NAME') as "Tablespace",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED') as "Bytes"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));
To get sum of all the sizes to get an aggragate total size of all objects
used in the index.
select sum(extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED')) "Total"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('ti', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));
INDEX_STATS
Most of the CTX_REPORT procedures have two forms - a function which returns a CLOB (as done above)and an alternate form which requires to pass in a CLOB to be populated. One procedure - INDEX_STATS - does not have a function version. This is because it needs to do a full table scan of the main index table, which is likely to take some time.Therefore when calling INDEX_STATS it is needed to go about things a little differently. First create a table with an XMLType column. Then call a PL/SQL block which passes a temporary CLOB to CTX_REPORT, and then inserts that CLOB into the XMLType column. Then various XML operations can perform on that table.
create table output(report XMLType);
declare
tlob clob;
begin
ctx_report.index_stats(index_name=>'ti', report=>tlob,
list_size=>20, report_format=>'XML');
insert into output values (xmlType (replace(tlob,chr(10),'')) );
commit;
dbms_lob.freetemporary(tlob);
end;
/
There is a single row in table OUTPUT, which contains XML report. First well get the
estimated row fragmentation:
select extractValue(report,'//STAT_STATISTIC[@NAME="estimated row fragmentation"]')
as "Fragmentation" from output;
Now get the top three most frequent tokens.
using the position function:
select extract(value(d), '//STAT_TOKEN_TEXT')
from output, table(xmlsequence(extract(report,'/CTXREPORT/INDEX_STATS/STAT_TOKEN_STATS/STAT_TOKEN_LIST[@NAME="most frequent tokens"]/
STAT_TOKEN[position()<4]'))) d; Conclusion
The XML output mode of CTX_REPORT allows powerful manipulation of index information. In order to make full use of these,a good understanding of the XML features of the Oracle database, such as extract, extractValue, XMLSequence, and of XPATH syntax in general is needed.With the use of XML exploding within the data processing world, such an understanding is likely to be very useful in the future.
The INDEX_STATS procedure supports both formatted text and XML output.The following code
creates the INDEX_STATS report in text format:
-- Table to store our report
CREATE TABLE index_report (id NUMBER(10), report CLOB);
DECLARE
v_report CLOB := null;
BEGIN
CTX_REPORT.INDEX_STATS(index_name => 'SONG_INDEX',
report => v_report,part_name => NULL,frag_stats => NULL,
list_size => 20,report_format => NULL);
INSERT INTO index_report (id, report) VALUES (1, v_report);
COMMIT;
DBMS_LOB.FREETEMPORARY(v_report);
END;
/
Then output can be viewed:
SELECT report FROM index_report WHERE id = 1;
To find out the size of the index, we can use the function CTX_REPORT.INDEX_SIZE. This is a function returning a CLOB.
select ctx_report.index_size('SONG_INDEX') from dual;
for a XML report :
SQL> select ctx_report.index_size('SONG_INDEX', null, 'XML') from dual;
This will give us a heap of XML output, including sizes for each object.
Here's a section of the output:
<SIZE_OBJECT_NAME> text_user.SYS_IL0000051565C00006$$
</SIZE_OBJECT_NAME>
<SIZE_OBJECT_TYPE> INDEX (LOB)
</SIZE_OBJECT_TYPE>
<SIZE_TABLE_NAME> text_user.DR$TI$I
</SIZE_TABLE_NAME>
<SIZE_TABLESPACE_NAME> USERS
</SIZE_TABLESPACE_NAME>
<SIZE_BLOCKS_ALLOCATED> 8
</SIZE_BLOCKS_ALLOCATED>
<SIZE_BLOCKS_USED> 4
</SIZE_BLOCKS_USED>
<SIZE_BYTES_ALLOCATED> 65536
</SIZE_BYTES_ALLOCATED>
<SIZE_BYTES_USED> 32768
</SIZE_BYTES_USED>
</SIZE_OBJECT>
<SIZE_OBJECT>
selecting a CLOB, force it into an XMLTYPE value:
select xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')) from dual;
select extract(xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')), '//SIZE_OBJECT') from dual;
We're now using the XML DB "extract" function to fetch all of the XML within
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT') from dual;
Now specify that only information from the table $I is want . this is done with
within a predicate the xpath expression:
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]') from dual;
The size in bytes can be get by adding the XPATH expression:
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED') from dual;
To avoid tags. There's two ways of doing this 1) the text() function within the XPATH, 2) use extractValue rather than extract.
select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED/text()') as "Table Size" from dual;
select extractValue(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED')as "Table Size" from dual;
For fetching ALL the sizes, perhaps to add them together to get a summary XMLSequence can use,which returns a collection of XMLType values. This function can use in a TABLE clause to unnest the collection values into multiple rows. Now that we're generating a "table", we no longer need to reference DUAL.
To get just the sizes:
select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT/SIZE_BYTES_USED')));
To get all the object info, one object per row:
select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT')));
Processing information into individual values:
select extractValue(Column_Value, '/SIZE_OBJECT/SIZE_OBJECT_NAME') as "Name",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_TABLESPACE_NAME') as "Tablespace",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED') as "Bytes"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));
To get sum of all the sizes to get an aggragate total size of all objects
used in the index.
select sum(extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED')) "Total"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('ti', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));
INDEX_STATS
Most of the CTX_REPORT procedures have two forms - a function which returns a CLOB (as done above)and an alternate form which requires to pass in a CLOB to be populated. One procedure - INDEX_STATS - does not have a function version. This is because it needs to do a full table scan of the main index table, which is likely to take some time.Therefore when calling INDEX_STATS it is needed to go about things a little differently. First create a table with an XMLType column. Then call a PL/SQL block which passes a temporary CLOB to CTX_REPORT, and then inserts that CLOB into the XMLType column. Then various XML operations can perform on that table.
create table output(report XMLType);
declare
tlob clob;
begin
ctx_report.index_stats(index_name=>'ti', report=>tlob,
list_size=>20, report_format=>'XML');
insert into output values (xmlType (replace(tlob,chr(10),'')) );
commit;
dbms_lob.freetemporary(tlob);
end;
/
There is a single row in table OUTPUT, which contains XML report. First well get the
estimated row fragmentation:
select extractValue(report,'//STAT_STATISTIC[@NAME="estimated row fragmentation"]')
as "Fragmentation" from output;
Now get the top three most frequent tokens.
using the position function:
select extract(value(d), '//STAT_TOKEN_TEXT')
from output, table(xmlsequence(extract(report,'/CTXREPORT/INDEX_STATS/STAT_TOKEN_STATS/STAT_TOKEN_LIST[@NAME="most frequent tokens"]/
STAT_TOKEN[position()<4]'))) d; Conclusion
The XML output mode of CTX_REPORT allows powerful manipulation of index information. In order to make full use of these,a good understanding of the XML features of the Oracle database, such as extract, extractValue, XMLSequence, and of XPATH syntax in general is needed.With the use of XML exploding within the data processing world, such an understanding is likely to be very useful in the future.
Oracle Text - Part 4
Progressive Relaxation is a new technique for text searching, available with Oracle 10g.
It assumes a basic working knowledge of Oracle Text, such as the operators used
in query expressions.When Would Use It?
First, let's consider a search scenario.
You have a web site selling books. A user searches for "Michael Crichton" in the "search author" box. OK, easy enough. You do the search, and return the top 10 hits (or whatever) that match the search criteria.
But what if the user mis-spells the firstname, and searches for "Michel Crichton"? In this case, a good strategy for handling this might be to find the top 10 hits from these searches:
1. Any books where the author is exactly "Michel Crichton"
2. Any books containing a fuzzy match of each word, in the right order
3. Any books having either exact word
4. Any books having a fuzzy match of either word.
Of course we can do this in a search like
select book_id from books where contains (author, '(michel crichton) OR (?michel ?crichton) OR (michel OR crichton) OR (?michel OR ?crichton)
But there are two problems with this search:
1. From the user's point of view, hits which are a poor match will be mixed in with hits which are a good match. The user wants to see good matches displayed first.
2. From the system's point of view, the search is inefficient. Even if there were plenty of hits for exactly "Michel Crichton", it would still have to do all the work of the fuzzy expansions and fetch data for all the rows which satisfy the query.
An alternative is to run four separate queries. This way, we can do the exact search first, and then only run the queries with more relaxed criteria if they are needed to get enough hits for the results page.
But apart from the inefficiency of (potentially) running multiple queries, we need to de-duplicate the results. The "relaxed" queries will in many cases hit the rows returned by the exact queries. To avoid duplicates in the result set, the application must screen these hits out, a potentially expensive task in terms of programming and maintenance,if not raw performance.
To solve this problem, Oracle Text in Oracle Database 10g introduces "progressive relaxation".This allows you to specify the different searches to run, and Oracle will run each in turn,returning de-duplicated results until the application stops fetching hits.
The scores returned are manipulated such that if you order by score
Benefits
The benefits of progressive relaxtion is that an application developer can specify operations to be applied to a user query in a declarative manner. It is not necessary to parse the query and apply operators to each search term - the developer just specifies what options should be applied to each term, and how they should be combined (eg AND or OR)
The application also benefits from automatic de-duplication of results at a very early stage in the query (before docid to rowid translation) which is much more efficient than doing it at the final stage, as you would have to if you were running multiple queries.
Query Templates
The actual implementation of progressive relaxation is via the query template mechanism. If you have not come across this before, don't worry - it's quite straightforward and the examples should make it clear. Basically a query template is an XML fragment that is used in the CONTAINS clause in place of a simple query string.
There are some other things that you can do with query templates, such as specifying
language, query grammars and scoring options, but we won't be covering them here.
So - on to our first example:
create table mybooks (title varchar2(20), author varchar2(20));
insert into mybooks values ('Consider the Lillies', 'Ian Crichton Smith');
insert into mybooks values ('Sphere', 'Michael Crichton');
insert into mybooks values ('Stupid White Men', 'Michael Moore');
insert into mybooks values ('Lonely Day', 'Michaela Criton');
insert into mybooks values ('How to Teach Poetry', 'Michaela Morgan');
create index auth_idx on mybooks (author) indextype is ctxsys.context;
SELECT score(1), title, author FROM mybooks WHERE CONTAINS (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0 ORDER BY score(1) DESC;
The output of this query is:
SCORE(1) TITLE AUTHOR
---------- -------------------- --------------------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan
It can see that the first line is an exact match, according to the first entry in our progression sequence. The second line corresponds to a fuzzy match of each term in order - our second criteria. The third and fourth rows come from the exact "micheal OR chrichton" and finally the last row has a single match which is a fuzzy hit on one of the terms.
Obviously in this example, it is fetching all the rows, so there is no major advantage in using progressive relaxation. We can limit it to only return the first two hits, using a PL/SQL cursor:
set serveroutput on format wrapped
declare
max_rows integer := 2;
counter integer := 0;
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
-- loop for the required number of rows
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
dbms_output.put_line(rpad(c.scr, 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/
The output from this is
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
Now theres one more feature an application designer might want. And thats to
stop the search after it has evaluated the first successful search criteria. So in our example, if we get one or more hits on exactly "michael chrichton", we dont want to try any of the other searches. If the exact search fails, we want to try the others only until one of them returns one or more rows.
Theres (currently) no way to do this as part of the query template syntax.However, it is possible to do this at the application level by looking at the scores returned using PLSQL
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan
Now, given that the maximum score of a text query is 100, and that we had four steps in our search, we might be able to notice something here. Specifically, any match on the first step will always score in the top quarter of the possible results - 76% to 100%. The next step will score in the range 51-75%, the next 26-50%, and the final step 1-25%. If we had had five steps in our query, the scores would have been in the ranges 81-100%, 61-80%, 41-60%, 21-40% and 1-20%.
So in order to stop our results after the first valid search, we need to detect the score crossing one of these boundaries. In order to do this we MUST know in advance how many stepsthere are. The PL/SQL for all this is a little more tricky than before:
declare
max_rows integer := 2;
counter integer := 0;
number_of_steps integer := 4;
score_range_size integer; -- 33 for 3 steps, 25 for 4, 20 for 5 etc
this_score_group integer; -- final step is 1, penultimate step is 2 ...
last_score_group integer := 0; -- to compare change
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
score_range_size := 100/number_of_steps;
this_score_group := c.scr/score_range_size;
exit when this_score_group < last_score_group;
last_score_group := this_score_group;
dbms_output.put_line(rpad(c.scr , 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/
The output from this is:
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
We could add a new, stricter step (remembering to increase the number_of_steps variable).This won't actually find anything but will demonstrate that the procedure continues until it does find at least one row:
number of steps number := 5;
...
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>
and output would be:
Score Title Author
----- ----- ------
61 Sphere Michael Crichton
there is a simplification which makes life easier for the application developer.
Generating the full syntax above can be complicated to program. So there is a "shorthand" syntax, known as a "query rewrite template":
<query>
<textquery> michael crichton
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>
</progression>
</textquery>
</query>
This will generate the same four-step syntax as used above. The arguments after TOKENS are as follows:
* Prefix - what to put before each token
* Suffix - what to put after each token
* Connector - an operator to link each token. Space causes a phrase search.
It is usually a good idea to surround each term with braces "{}", in case the user has entered a reserved word like STEM or NT. Beware, though, that adding a wild card after a brace can have strange effects - {dog}% is not the same as dog%.
It assumes a basic working knowledge of Oracle Text, such as the operators used
in query expressions.When Would Use It?
First, let's consider a search scenario.
You have a web site selling books. A user searches for "Michael Crichton" in the "search author" box. OK, easy enough. You do the search, and return the top 10 hits (or whatever) that match the search criteria.
But what if the user mis-spells the firstname, and searches for "Michel Crichton"? In this case, a good strategy for handling this might be to find the top 10 hits from these searches:
1. Any books where the author is exactly "Michel Crichton"
2. Any books containing a fuzzy match of each word, in the right order
3. Any books having either exact word
4. Any books having a fuzzy match of either word.
Of course we can do this in a search like
select book_id from books where contains (author, '(michel crichton) OR (?michel ?crichton) OR (michel OR crichton) OR (?michel OR ?crichton)
But there are two problems with this search:
1. From the user's point of view, hits which are a poor match will be mixed in with hits which are a good match. The user wants to see good matches displayed first.
2. From the system's point of view, the search is inefficient. Even if there were plenty of hits for exactly "Michel Crichton", it would still have to do all the work of the fuzzy expansions and fetch data for all the rows which satisfy the query.
An alternative is to run four separate queries. This way, we can do the exact search first, and then only run the queries with more relaxed criteria if they are needed to get enough hits for the results page.
But apart from the inefficiency of (potentially) running multiple queries, we need to de-duplicate the results. The "relaxed" queries will in many cases hit the rows returned by the exact queries. To avoid duplicates in the result set, the application must screen these hits out, a potentially expensive task in terms of programming and maintenance,if not raw performance.
To solve this problem, Oracle Text in Oracle Database 10g introduces "progressive relaxation".This allows you to specify the different searches to run, and Oracle will run each in turn,returning de-duplicated results until the application stops fetching hits.
The scores returned are manipulated such that if you order by score
Benefits
The benefits of progressive relaxtion is that an application developer can specify operations to be applied to a user query in a declarative manner. It is not necessary to parse the query and apply operators to each search term - the developer just specifies what options should be applied to each term, and how they should be combined (eg AND or OR)
The application also benefits from automatic de-duplication of results at a very early stage in the query (before docid to rowid translation) which is much more efficient than doing it at the final stage, as you would have to if you were running multiple queries.
Query Templates
The actual implementation of progressive relaxation is via the query template mechanism. If you have not come across this before, don't worry - it's quite straightforward and the examples should make it clear. Basically a query template is an XML fragment that is used in the CONTAINS clause in place of a simple query string.
There are some other things that you can do with query templates, such as specifying
language, query grammars and scoring options, but we won't be covering them here.
So - on to our first example:
create table mybooks (title varchar2(20), author varchar2(20));
insert into mybooks values ('Consider the Lillies', 'Ian Crichton Smith');
insert into mybooks values ('Sphere', 'Michael Crichton');
insert into mybooks values ('Stupid White Men', 'Michael Moore');
insert into mybooks values ('Lonely Day', 'Michaela Criton');
insert into mybooks values ('How to Teach Poetry', 'Michaela Morgan');
create index auth_idx on mybooks (author) indextype is ctxsys.context;
SELECT score(1), title, author FROM mybooks WHERE CONTAINS (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0 ORDER BY score(1) DESC;
The output of this query is:
SCORE(1) TITLE AUTHOR
---------- -------------------- --------------------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan
It can see that the first line is an exact match, according to the first
Obviously in this example, it is fetching all the rows, so there is no major advantage in using progressive relaxation. We can limit it to only return the first two hits, using a PL/SQL cursor:
set serveroutput on format wrapped
declare
max_rows integer := 2;
counter integer := 0;
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
-- loop for the required number of rows
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
dbms_output.put_line(rpad(c.scr, 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/
The output from this is
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
Now theres one more feature an application designer might want. And thats to
stop the search after it has evaluated the first successful search criteria. So in our example, if we get one or more hits on exactly "michael chrichton", we dont want to try any of the other searches. If the exact search fails, we want to try the others only until one of them returns one or more rows.
Theres (currently) no way to do this as part of the query template syntax.However, it is possible to do this at the application level by looking at the scores returned using PLSQL
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan
Now, given that the maximum score of a text query is 100, and that we had four
So in order to stop our results after the first valid search, we need to detect the score crossing one of these boundaries. In order to do this we MUST know in advance how many stepsthere are. The PL/SQL for all this is a little more tricky than before:
declare
max_rows integer := 2;
counter integer := 0;
number_of_steps integer := 4;
score_range_size integer; -- 33 for 3 steps, 25 for 4, 20 for 5 etc
this_score_group integer; -- final step is 1, penultimate step is 2 ...
last_score_group integer := 0; -- to compare change
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
score_range_size := 100/number_of_steps;
this_score_group := c.scr/score_range_size;
exit when this_score_group < last_score_group;
last_score_group := this_score_group;
dbms_output.put_line(rpad(c.scr , 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/
The output from this is:
Score Title Author
----- ----- ------
76 Sphere Michael Crichton
We could add a new, stricter step (remembering to increase the number_of_steps variable).This won't actually find anything but will demonstrate that the procedure continues until it does find at least one row:
number of steps number := 5;
...
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>
and output would be:
Score Title Author
----- ----- ------
61 Sphere Michael Crichton
there is a simplification which makes life easier for the application developer.
Generating the full syntax above can be complicated to program. So there is a "shorthand" syntax, known as a "query rewrite template":
<query>
<textquery> michael crichton
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>
</progression>
</textquery>
</query>
This will generate the same four-step syntax as used above. The arguments after TOKENS are as follows:
* Prefix - what to put before each token
* Suffix - what to put after each token
* Connector - an operator to link each token. Space causes a phrase search.
It is usually a good idea to surround each term with braces "{}", in case the user has entered a reserved word like STEM or NT. Beware, though, that adding a wild card after a brace can have strange effects - {dog}% is not the same as dog%.
Feb 27, 2009
Oracle Text - Part 2
Oracle Text offers the CTX_CLS.CLUSTERING package for building clusters.
Optimizer Hints
We can also "hint" the database optimizer to improve query performance if we
know ahead of time what plan is best:
SELECT /*+ index product_information description_idx */
score(1), product_id FROM product_information
WHERE CONTAINS (product_description, 'monitor NEAR "high resolution"', 1) > 0
AND list_price < 500;
Parallel Indexing
Parallel indexing can take advantage of hardware when you have multiple CPUs.
Parallel index creation is useful for
• Performance improvement
• Data Staging
• Rapid initial deployment of applications based on large data collections
• Application testing, when users need to test different index parameters
and schemas while developing an application
The following example creates a text index with degree 3:
CREATE INDEX myindex ON docs(tk) INDEXTYPE IS ctxsys.context PARALLEL 3;
SDATA Sections and Composite Domain Indexes
SDATA New in Oracle Text 11g They are designed to improve the performance of “mixed queries” – queries which have
a text search part and a structured part.
For example, the query:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna') > 0
AND itmtype = 'BOOK' AND price < 10
ORDER BY price DESC
These queries are never likely to be as fast
as a simple text-only query.
Composite Domain Indexes
Composite Domain Indexes use the same underlying technology as SDATA
sections, but in an easier-to-use and more standard fashion.
A ‘domain index’ is a type of index for use with a particular type of data (in our case, textual data). A composite index in normal Oracle terms is an index that covers more than one column. So a Composite Domain Index (CDI, for short) is a extension of the usual domain index to cover multiple columns.
Original query again:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna') > 0
AND itmtype = 'BOOK' AND price < 10
ORDER BY price DESC
To create appropriate indexes for this query in previous versions we may have run
the following SQL commands:
CREATE INDEX typeind ON items (itmtype)
CREATE INDEX priceind ON items (price)
CREATE INDEX descind ON items (description) INDEXTYPE IS
ctxsys.context
In Oracle 11g Release 1 it can do with a single call:
CREATE INDEX compind ON items (description)
INDEXTYPE IS ctxsys.context
FILTER BY itmtype, price SORT BY price
In Oracle Database 10g Release 1 MDATA (for MetaDATA)sections .
These were designed for short character fields which would be indexed
“as a whole” inside the text index. This would allow us to rewrite the query above
as something like:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna and MDATA(itmtype, BOOK') > 0
AND price < 10 ORDER BY price DESC
It will remove many unnecessary docid to rowid resolutions, and the base table access to
evaluate “itmtype=’BOOK’” predicate, since we can get itmtype=BOOK from the text
index. However, it doesn’t solve the problem completely:
• We can only do equality searches, we can’t do “price < 10” with MDATA
• We can’t use it for sorting.
(Structured DATA) sections. These sections are embedded in the text of a document – like field or zone sections – but unlike previous sections they may contain character, numeric or date information and may be searched using operators such as “greater than”, “less than” and “between” as well as equality searches. Here’s an example of a query which makes use of SDATA query operators:
SELECT item_id FROM items WHERE
CONTAINS (description, 'racing and SDATA(itemtype=''BOOK'') and SDATA(price<10)') > 0
ORDER BY price DESC
XML Support
XML features include the operator WITHIN, nested section search, search within
attribute values, mapping multiple tags to the same name, path searching using
INPATH and HASPATH operators.
XML example to demonstrate Oracle Texts features.(Replace + with following operators ie: < or >
+?xml version="1.0"?+
+FAQ OWNER="Billy Text"+
+TITLE+"Oracle Text FAQ"+/TITLE+
+DESCRIPTION+
Everything you always wanted to know about Text"+/DESCRIPTION+"
+QUESTION+"What is Oracle Text?
+/QUESTION++ANSWER+"
Oracle Text uses standard SQL to index search and analyze text and
documents stored in the database files or websites.
"+/ANSWER++/FAQ+"
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Oracle WITHIN QUESTION')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Billy WITHIN FAQ0OWNER')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Oracle INPATH(FAQ/TITLE)')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'HASPATH(FAQ/TITLE/DESCRIPTION)')> 0;
• existsNode() : given an XPath expression, checks if the XPath applied
over the document can return any valid nodes.
• extract() : given an XPath expression, applies the XPath to the
document and returns the fragment as a XMLType.
select f.faq.extract('/FAQ/QUESTION/text()').getStringVal()
from faq f where contains(faq, 'standard or SQL INPATH(FAQ/ANSWER)')>0
Optimizer Hints
We can also "hint" the database optimizer to improve query performance if we
know ahead of time what plan is best:
SELECT /*+ index product_information description_idx */
score(1), product_id FROM product_information
WHERE CONTAINS (product_description, 'monitor NEAR "high resolution"', 1) > 0
AND list_price < 500;
Parallel Indexing
Parallel indexing can take advantage of hardware when you have multiple CPUs.
Parallel index creation is useful for
• Performance improvement
• Data Staging
• Rapid initial deployment of applications based on large data collections
• Application testing, when users need to test different index parameters
and schemas while developing an application
The following example creates a text index with degree 3:
CREATE INDEX myindex ON docs(tk) INDEXTYPE IS ctxsys.context PARALLEL 3;
SDATA Sections and Composite Domain Indexes
SDATA New in Oracle Text 11g They are designed to improve the performance of “mixed queries” – queries which have
a text search part and a structured part.
For example, the query:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna') > 0
AND itmtype = 'BOOK' AND price < 10
ORDER BY price DESC
These queries are never likely to be as fast
as a simple text-only query.
Composite Domain Indexes
Composite Domain Indexes use the same underlying technology as SDATA
sections, but in an easier-to-use and more standard fashion.
A ‘domain index’ is a type of index for use with a particular type of data (in our case, textual data). A composite index in normal Oracle terms is an index that covers more than one column. So a Composite Domain Index (CDI, for short) is a extension of the usual domain index to cover multiple columns.
Original query again:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna') > 0
AND itmtype = 'BOOK' AND price < 10
ORDER BY price DESC
To create appropriate indexes for this query in previous versions we may have run
the following SQL commands:
CREATE INDEX typeind ON items (itmtype)
CREATE INDEX priceind ON items (price)
CREATE INDEX descind ON items (description) INDEXTYPE IS
ctxsys.context
In Oracle 11g Release 1 it can do with a single call:
CREATE INDEX compind ON items (description)
INDEXTYPE IS ctxsys.context
FILTER BY itmtype, price SORT BY price
In Oracle Database 10g Release 1 MDATA (for MetaDATA)sections .
These were designed for short character fields which would be indexed
“as a whole” inside the text index. This would allow us to rewrite the query above
as something like:
SELECT item_id FROM items WHERE
CONTAINS (description, 'madonna and MDATA(itmtype, BOOK') > 0
AND price < 10 ORDER BY price DESC
It will remove many unnecessary docid to rowid resolutions, and the base table access to
evaluate “itmtype=’BOOK’” predicate, since we can get itmtype=BOOK from the text
index. However, it doesn’t solve the problem completely:
• We can only do equality searches, we can’t do “price < 10” with MDATA
• We can’t use it for sorting.
(Structured DATA) sections. These sections are embedded in the text of a document – like field or zone sections – but unlike previous sections they may contain character, numeric or date information and may be searched using operators such as “greater than”, “less than” and “between” as well as equality searches. Here’s an example of a query which makes use of SDATA query operators:
SELECT item_id FROM items WHERE
CONTAINS (description, 'racing and SDATA(itemtype=''BOOK'') and SDATA(price<10)') > 0
ORDER BY price DESC
XML Support
XML features include the operator WITHIN, nested section search, search within
attribute values, mapping multiple tags to the same name, path searching using
INPATH and HASPATH operators.
XML example to demonstrate Oracle Texts features.(Replace + with following operators ie: < or >
+?xml version="1.0"?+
+FAQ OWNER="Billy Text"+
+TITLE+"Oracle Text FAQ"+/TITLE+
+DESCRIPTION+
Everything you always wanted to know about Text"+/DESCRIPTION+"
+QUESTION+"What is Oracle Text?
+/QUESTION++ANSWER+"
Oracle Text uses standard SQL to index search and analyze text and
documents stored in the database files or websites.
"+/ANSWER++/FAQ+"
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Oracle WITHIN QUESTION')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Billy WITHIN FAQ0OWNER')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'Oracle INPATH(FAQ/TITLE)')> 0;
SELECT title description FROM FAQTable
WHERE CONTAINS(text'HASPATH(FAQ/TITLE/DESCRIPTION)')> 0;
• existsNode() : given an XPath expression, checks if the XPath applied
over the document can return any valid nodes.
• extract() : given an XPath expression, applies the XPath to the
document and returns the fragment as a XMLType.
select f.faq.extract('/FAQ/QUESTION/text()').getStringVal()
from faq f where contains(faq, 'standard or SQL INPATH(FAQ/ANSWER)')>0
Oracle Text - Part 1
Oracle Text ie: Context Oracle text (formerly: interMedia text) in 10g
----------------------------------------------------------------
User = CTXSYS
Mainly used for Not basic SQL searching; Its about full retrieval against virtually any datatype (including all LOB types). The indexing & searching abilities of Oracle Text are not restricted to data stored in the database. It can index & search documents stored on the filesystem also and index more than 150 document types, including Microsoft Word, PDF, and XML.
Oracle Text search functionality includes fuzzy, stemming (search for mice and find mouse),wildcard
Demo
CREATE USER text_user IDENTIFIED BY oracle;
GRANT connect, resource, ctxapp TO text_user;
CONN text_user/oracle
CREATE TABLE songs (ID NUMBER(10),
Title VARCHAR2(50), Genre VARCHAR2(50));
INSERT INTO songs (ID, Title, Genre)
VALUES (1, 'The Preble Mice Go Squeak', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (2, 'Benri The Cat', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (3, 'My Mouse Won't Work Blues', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (4, 'My Pen Leaked - Ballad Of The Pocket Protector', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (5, 'The Mechanical Pencil - Get The Lead Out', 'HEAVY METAL');
COMMIT;
Creating an Oracle Text index is completely different from creating any other database index.Before creating the index it is required to determine which features want to use and create the necessary structures to support the index.
Most frequently used type of Oracle Text index is CONTEXT index.
if not need multilingual features so this type of simple index creation is sufficient.The basic index creation is broken into two steps:
Step 1: Create the preferences. Preferences tell Oracle Text how index should be stored,how data should be filtered, the language(s) that will be indexed, and how fuzzy and stemming queries should be treated. For this use the CTX_DDL built-in package to create the preferences and set the attributes. The CTX_DDL package is owned by the user CTXSYS . The TEXT_USER schema (user) employed for the examples has EXECUTE permissions on CTX_DDL, because it was granted the CTXAPP role.
BEGIN
CTX_DDL.CREATE_PREFERENCE('english_lexer','basic_lexer');
CTX_DDL.SET_ATTRIBUTE('english_lexer','index_themes','no');
END;
/
The CREATE_PREFERENCE procedure establishes the name and type of LEXER to be used. A LEXER decides how text is broken apart for indexing, and it can be set for different languages. Here BASIC_LEXER object type is used , which supports most Western European, white-space-delimited languages.
Lexer
The lexers job is to separate the sectioners output into words or tokens.In the
simplest case for a Western European language, the lexer just splits text into
uninterrupted strings of alphanumeric characters. So the string:
Aha! Its the 5:15 train, coming here now!
would be split into the words, minus any punctuation or special symbols:
aha it s the 5 15 train coming here now
The lexer typically removes stopwords,which are common words defined by the
application developer; or taken from a default list. That would likely reduce the
list above to:
aha * * * 5 15 train coming * now
Note the asterisks representing removed stopwords. Although they are not actually
indexed, the presence of a stopword at the position is noted in the index. In a
search, any stopword will match that word when used as part of a phrase. For
example, “kicking the ball” will match “kicking a ball” but will not match “kicking
ball”.The set of stopwords may be specified by the application developer, who can also choose to explicitly define all numbers as stopwords.
The following command shows how to set the Japanese lexer:
ctx_ddl.create_preference('JAPANESE_LEXER','japanese_vgram_lexer')
Set Chinese Lexer
exec ctx_ddl.create_preference('CHINESE_LEXER','chinese_vgram_lexer');
Set Korean Lexer
exec ctx_ddl.create_preference('KOREAN_LEXER','korean_morph_lexer');
If the language of the documents is
not known, the new AUTO_LEXER may be used, which provides automatic language recognition, and extensive segmentation and stemming capabilities for multiple languages.
This are the languages supported by the AUTO_LEXER. Those in
bold support context-sensitive stemming.
Arabic, Catalan,Czech, Danish, Dutch, English, Finnish, French, German, Greek,
Hebrew, Hungarian, Italian,Nynorsk,Polish, Portuguese, Romanian, Russian, Serbian, Slovak,Slovenian, Spanish, Swedish, Simplified Chinese, Traditional Chinese, Croatian,Japanese, Korean, Bokmal, Persian, Thai, Turkish
Set Multi-lexer named GLOBAL_LEXER
exec ctx_ddl.create_preference('GLOBAL_LEXER', 'multi_lexer');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','default','ENGLISH_LEXER');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','traditional chinese','CHINESE_LEXER','chn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','japanese','JAPANESE_LEXER','jpn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','korean','KOREAN_LEXER','kor');
DataStore
The datastore defines from where the text to be indexed should be fetched.Provided datastores allow for text which is stored within a database, on a file system, or accessed remotely via the HTTP protocol (the URL datastore).
Default Datastore
The default datastore is in the database itself. Text may be stored in a VARCHAR2 column (up to 4000 characters), or in a CLOB (Character Large Object) column.Formatted text (such as Word or PDF documents) can be stored in BLOB (Binary Large Object) columns.
File Datastore
Text to be indexed is stored on any file system which is accessible to the database
server. The name or path to the file is stored in the database, typically in a VARCHAR2 column.
URL Datastore
The database contains an HTTP protocol URL, and the text to be indexed is
fetched directly from the URL at indexing time.
SET_ATTRIBUTE procedure is used to instruct Oracle Text not to generate themes
Step 2:
CREATE INDEX song_index ON songs(title) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('LEXER english_lexer STOPLIST ctxsys.default_stoplist')
The first parameter is the LEXER.The second parameter in this example is the STOPLIST.Stoplists provide Oracle with a list of words not to index. Typical words to exclude are of,the, a, is, and so on. In the example, the default stoplist is provided as the parameter value. This stoplist is shipped with Oracle Database and is owned by the CTXSYS user.
The CTXAPP role granted to my user provides permissions to use this stoplist.
Oracle Text provides three types of indexes that cover all text search needs:
standard, catalog, and classification
• Standard index type for traditional full-text retrieval over documents and
web pages. The context index type provides a rich set of text search capabilities for finding
spurious results.
• Catalog index type - the first text index designed specifically for eBusiness
catalogs. The ctxcat catalog index type provides flexible searching and
sorting at web-speed.
• Classification index type for building classification or routing applications.
The ctxrule index type is created on a table of queries, where the queries
define the classification or routing criteria.
Testing the Index
By running a query using the CONTAINS() operator
SELECT score(1), title, genre FROM songs
WHERE CONTAINS(title, 'mice', 1) > 0;
This query should return a single row:
SCORE(1)TITLE GENRE
------- ----------------------- -----
5 The Preble Mice Go Squeak CHILD
SCORE() operator ranks search results by relevance and provides a numeric value in the result set that allows me to determine which values are the best matches to the search criteria. In this query, SCORE() is given a label of 1 (inside the parentheses) that corresponds to the matching CONTAINS() search. The last argument in the CONTAINS() clause has the same label value, 1. If there are multiple CONTAINS() operators, separate labels can be used with SCORE() and CONTAINS() to determine the relevance for each.
Here a wildcard (%) character is not used for searching.This is because Oracle Text
creates tokens in its indexing process. These tokens(words or phrases), generated based on the LEXER defined in the index creation, are words and phrases found in the text. When a query is issued with the CONTAINS() operator, the Oracle Text index tables—not the document where the text originated—are searched for a match.Here tokens (words or phrases)are searched and not blocks of text so a wildcard is not necessary if full term is supplied.If a token matches the keyword, the source document is retrieved as a match.
However, there are occasions when a wildcard character would be used. If there are two documents,one with the word pen and the other with the word pencil, it is needed to search for pen% in order to get both records.
Another type of query that uses the CONTAINS() operator is called stemming
(search for mice and find mouse). Some words, although originating from the same root or stem, are not spelled similarly enough for a wildcard to be of any use. For example, a search for mice% will not return mouse. If root form of a word is known , all variants of that word can be matched
Because mice and mouse share the same stem,
SELECT score(1), title, genre FROM songs WHERE CONTAINS(title, '$mice', 1) > 0;
This query now returns two rows:
SCORE(1)TITLE GENRE
------- ----------------------- ---------------
5 My Mouse Wont Work Blues COMPUTER ENGINEER
5 The Preble Mice Go Squeak CHILD
This query returns both rows with the same score.It can find either form by adding a $ in front of the keyword mice.
Tuning and administering Oracle Text
One of the most useful reports for tuning Oracle Text applications is QUERY_LOG_SUMMARY. This report lets you analyze your indexes and queries to maximize effectiveness by checking how queries match data.
In SONGS table there is a song called "My Mouse Wont Work Blues." If That is a Top 40 hit, so it should be flying off the shelves, but nobody is buying it. One place to look for reasons is the QUERY_LOG_SUMMARY report. Using the QUERY_LOG_SUMMARY report,it can see if people are performing searches for similar titles but not the exact title, meaning that the song is never returned as a match.
The first thing need to do is begin logging the queries which is done with the CTX_OUTPUT package, as follows:
exec ctx_output.start_query_log('text_index.txt');
With the query log turned on, run the same incorrect query 100 times as follows:
DECLARE
v_title VARCHAR2(50);
BEGIN
FOR y IN 1..100 LOOP
BEGIN
SELECT title INTO v_title FROM songs
WHERE ID = 3 AND CONTAINS (title,'mice') > 0;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL;
END;
END LOOP;
END;
/
After the querying has finished,turn logging off, as follows:
exec ctx_output.end_query_log;
Using the QUERY_LOG_SUMMARY procedure,it can evaluate the log file contents using the following:
DECLARE
v_report CTX_REPORT.QUERY_TABLE;
BEGIN
CTX_REPORT.QUERY_LOG_SUMMARY('text_index.txt',
'song_index',v_report,5,TRUE,FALSE);
FOR y in 1..v_report.count LOOP
DBMS_OUTPUT.PUT_LINE('The query for'
||v_report(y).query||' ran and failed'
||v_report(y).times||' times.');
END LOOP;
EXCEPTION WHEN OTHERS
THEN
DBMS_OUTPUT.PUT_LINE(sqlerrm);
END;
/
This returns the following result:
The query for mice ran and failed 100 times.
The reason for the search failure is obvious. ID 3 contains the word mouse,
not mice. Having identified the problem with my keyword search, I can correct the
query by adding stemming ($), as described
SELECT title FROM songs WHERE ID = 3 AND CONTAINS (title, '$mice') > 0;
TITLE
-------------------------
My Mouse Wont Work Blues
This search returns the record what want, so adding stemming to the online stores search— specify that all user searches include the '$' by default. This results in successful searches on mouse and mice, and soon the CDs are selling.
-------------------------------------------
----------------------------------------------------------------
User = CTXSYS
Mainly used for Not basic SQL searching; Its about full retrieval against virtually any datatype (including all LOB types). The indexing & searching abilities of Oracle Text are not restricted to data stored in the database. It can index & search documents stored on the filesystem also and index more than 150 document types, including Microsoft Word, PDF, and XML.
Oracle Text search functionality includes fuzzy, stemming (search for mice and find mouse),wildcard
Demo
CREATE USER text_user IDENTIFIED BY oracle;
GRANT connect, resource, ctxapp TO text_user;
CONN text_user/oracle
CREATE TABLE songs (ID NUMBER(10),
Title VARCHAR2(50), Genre VARCHAR2(50));
INSERT INTO songs (ID, Title, Genre)
VALUES (1, 'The Preble Mice Go Squeak', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (2, 'Benri The Cat', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (3, 'My Mouse Won't Work Blues', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (4, 'My Pen Leaked - Ballad Of The Pocket Protector', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (5, 'The Mechanical Pencil - Get The Lead Out', 'HEAVY METAL');
COMMIT;
Creating an Oracle Text index is completely different from creating any other database index.Before creating the index it is required to determine which features want to use and create the necessary structures to support the index.
Most frequently used type of Oracle Text index is CONTEXT index.
if not need multilingual features so this type of simple index creation is sufficient.The basic index creation is broken into two steps:
Step 1: Create the preferences. Preferences tell Oracle Text how index should be stored,how data should be filtered, the language(s) that will be indexed, and how fuzzy and stemming queries should be treated. For this use the CTX_DDL built-in package to create the preferences and set the attributes. The CTX_DDL package is owned by the user CTXSYS . The TEXT_USER schema (user) employed for the examples has EXECUTE permissions on CTX_DDL, because it was granted the CTXAPP role.
BEGIN
CTX_DDL.CREATE_PREFERENCE('english_lexer','basic_lexer');
CTX_DDL.SET_ATTRIBUTE('english_lexer','index_themes','no');
END;
/
The CREATE_PREFERENCE procedure establishes the name and type of LEXER to be used. A LEXER decides how text is broken apart for indexing, and it can be set for different languages. Here BASIC_LEXER object type is used , which supports most Western European, white-space-delimited languages.
Lexer
The lexers job is to separate the sectioners output into words or tokens.In the
simplest case for a Western European language, the lexer just splits text into
uninterrupted strings of alphanumeric characters. So the string:
Aha! Its the 5:15 train, coming here now!
would be split into the words, minus any punctuation or special symbols:
aha it s the 5 15 train coming here now
The lexer typically removes stopwords,which are common words defined by the
application developer; or taken from a default list. That would likely reduce the
list above to:
aha * * * 5 15 train coming * now
Note the asterisks representing removed stopwords. Although they are not actually
indexed, the presence of a stopword at the position is noted in the index. In a
search, any stopword will match that word when used as part of a phrase. For
example, “kicking the ball” will match “kicking a ball” but will not match “kicking
ball”.The set of stopwords may be specified by the application developer, who can also choose to explicitly define all numbers as stopwords.
The following command shows how to set the Japanese lexer:
ctx_ddl.create_preference('JAPANESE_LEXER','japanese_vgram_lexer')
Set Chinese Lexer
exec ctx_ddl.create_preference('CHINESE_LEXER','chinese_vgram_lexer');
Set Korean Lexer
exec ctx_ddl.create_preference('KOREAN_LEXER','korean_morph_lexer');
If the language of the documents is
not known, the new AUTO_LEXER may be used, which provides automatic language recognition, and extensive segmentation and stemming capabilities for multiple languages.
This are the languages supported by the AUTO_LEXER. Those in
bold support context-sensitive stemming.
Arabic, Catalan,Czech, Danish, Dutch, English, Finnish, French, German, Greek,
Hebrew, Hungarian, Italian,Nynorsk,Polish, Portuguese, Romanian, Russian, Serbian, Slovak,Slovenian, Spanish, Swedish, Simplified Chinese, Traditional Chinese, Croatian,Japanese, Korean, Bokmal, Persian, Thai, Turkish
Set Multi-lexer named GLOBAL_LEXER
exec ctx_ddl.create_preference('GLOBAL_LEXER', 'multi_lexer');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','default','ENGLISH_LEXER');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','traditional chinese','CHINESE_LEXER','chn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','japanese','JAPANESE_LEXER','jpn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','korean','KOREAN_LEXER','kor');
DataStore
The datastore defines from where the text to be indexed should be fetched.Provided datastores allow for text which is stored within a database, on a file system, or accessed remotely via the HTTP protocol (the URL datastore).
Default Datastore
The default datastore is in the database itself. Text may be stored in a VARCHAR2 column (up to 4000 characters), or in a CLOB (Character Large Object) column.Formatted text (such as Word or PDF documents) can be stored in BLOB (Binary Large Object) columns.
File Datastore
Text to be indexed is stored on any file system which is accessible to the database
server. The name or path to the file is stored in the database, typically in a VARCHAR2 column.
URL Datastore
The database contains an HTTP protocol URL, and the text to be indexed is
fetched directly from the URL at indexing time.
SET_ATTRIBUTE procedure is used to instruct Oracle Text not to generate themes
Step 2:
CREATE INDEX song_index ON songs(title) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('LEXER english_lexer STOPLIST ctxsys.default_stoplist')
The first parameter is the LEXER.The second parameter in this example is the STOPLIST.Stoplists provide Oracle with a list of words not to index. Typical words to exclude are of,the, a, is, and so on. In the example, the default stoplist is provided as the parameter value. This stoplist is shipped with Oracle Database and is owned by the CTXSYS user.
The CTXAPP role granted to my user provides permissions to use this stoplist.
Oracle Text provides three types of indexes that cover all text search needs:
standard, catalog, and classification
• Standard index type for traditional full-text retrieval over documents and
web pages. The context index type provides a rich set of text search capabilities for finding
spurious results.
• Catalog index type - the first text index designed specifically for eBusiness
catalogs. The ctxcat catalog index type provides flexible searching and
sorting at web-speed.
• Classification index type for building classification or routing applications.
The ctxrule index type is created on a table of queries, where the queries
define the classification or routing criteria.
Testing the Index
By running a query using the CONTAINS() operator
SELECT score(1), title, genre FROM songs
WHERE CONTAINS(title, 'mice', 1) > 0;
This query should return a single row:
SCORE(1)TITLE GENRE
------- ----------------------- -----
5 The Preble Mice Go Squeak CHILD
SCORE() operator ranks search results by relevance and provides a numeric value in the result set that allows me to determine which values are the best matches to the search criteria. In this query, SCORE() is given a label of 1 (inside the parentheses) that corresponds to the matching CONTAINS() search. The last argument in the CONTAINS() clause has the same label value, 1. If there are multiple CONTAINS() operators, separate labels can be used with SCORE() and CONTAINS() to determine the relevance for each.
Here a wildcard (%) character is not used for searching.This is because Oracle Text
creates tokens in its indexing process. These tokens(words or phrases), generated based on the LEXER defined in the index creation, are words and phrases found in the text. When a query is issued with the CONTAINS() operator, the Oracle Text index tables—not the document where the text originated—are searched for a match.Here tokens (words or phrases)are searched and not blocks of text so a wildcard is not necessary if full term is supplied.If a token matches the keyword, the source document is retrieved as a match.
However, there are occasions when a wildcard character would be used. If there are two documents,one with the word pen and the other with the word pencil, it is needed to search for pen% in order to get both records.
Another type of query that uses the CONTAINS() operator is called stemming
(search for mice and find mouse). Some words, although originating from the same root or stem, are not spelled similarly enough for a wildcard to be of any use. For example, a search for mice% will not return mouse. If root form of a word is known , all variants of that word can be matched
Because mice and mouse share the same stem,
SELECT score(1), title, genre FROM songs WHERE CONTAINS(title, '$mice', 1) > 0;
This query now returns two rows:
SCORE(1)TITLE GENRE
------- ----------------------- ---------------
5 My Mouse Wont Work Blues COMPUTER ENGINEER
5 The Preble Mice Go Squeak CHILD
This query returns both rows with the same score.It can find either form by adding a $ in front of the keyword mice.
Tuning and administering Oracle Text
One of the most useful reports for tuning Oracle Text applications is QUERY_LOG_SUMMARY. This report lets you analyze your indexes and queries to maximize effectiveness by checking how queries match data.
In SONGS table there is a song called "My Mouse Wont Work Blues." If That is a Top 40 hit, so it should be flying off the shelves, but nobody is buying it. One place to look for reasons is the QUERY_LOG_SUMMARY report. Using the QUERY_LOG_SUMMARY report,it can see if people are performing searches for similar titles but not the exact title, meaning that the song is never returned as a match.
The first thing need to do is begin logging the queries which is done with the CTX_OUTPUT package, as follows:
exec ctx_output.start_query_log('text_index.txt');
With the query log turned on, run the same incorrect query 100 times as follows:
DECLARE
v_title VARCHAR2(50);
BEGIN
FOR y IN 1..100 LOOP
BEGIN
SELECT title INTO v_title FROM songs
WHERE ID = 3 AND CONTAINS (title,'mice') > 0;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL;
END;
END LOOP;
END;
/
After the querying has finished,turn logging off, as follows:
exec ctx_output.end_query_log;
Using the QUERY_LOG_SUMMARY procedure,it can evaluate the log file contents using the following:
DECLARE
v_report CTX_REPORT.QUERY_TABLE;
BEGIN
CTX_REPORT.QUERY_LOG_SUMMARY('text_index.txt',
'song_index',v_report,5,TRUE,FALSE);
FOR y in 1..v_report.count LOOP
DBMS_OUTPUT.PUT_LINE('The query for'
||v_report(y).query||' ran and failed'
||v_report(y).times||' times.');
END LOOP;
EXCEPTION WHEN OTHERS
THEN
DBMS_OUTPUT.PUT_LINE(sqlerrm);
END;
/
This returns the following result:
The query for mice ran and failed 100 times.
The reason for the search failure is obvious. ID 3 contains the word mouse,
not mice. Having identified the problem with my keyword search, I can correct the
query by adding stemming ($), as described
SELECT title FROM songs WHERE ID = 3 AND CONTAINS (title, '$mice') > 0;
TITLE
-------------------------
My Mouse Wont Work Blues
This search returns the record what want, so adding stemming to the online stores search— specify that all user searches include the '$' by default. This results in successful searches on mouse and mice, and soon the CDs are selling.
-------------------------------------------
Subscribe to:
Posts (Atom)