Mar 7, 2009

Query Analyzer

To check whether select or DML statements are correct ,here they will not execute (ie: if we check an insert statement it will not insert particular data into that table ) But DDL statement will get executed during parsing itself .

declare
c integer := dbms_sql.open_cursor;
begin
dbms_sql.parse(c, 'select * from scott.emp', dbms_sql.native);
dbms_sql.close_cursor (c);
end;

Mar 3, 2009

Oracle Text - Part 6

CTX_DOC

manipulate SQL queries by injecting arbitrary SQL code via the THEMES, GIST, TOKENS, FILTER, HIGHLIGHT, and MARKUP procedures

-----------
Sectioner

The sectioner object is responsible for identifying the containing section(s) for each text unit. Typically, these sections will be predefined HyperText Markup Language (HTML) or eXtensible Markup Language (XML) sections. Optionally,the sectioner can process all tags as sections delimiters.

For example:

<TITLE>XML Handbook</TITLE>. This allows search between tags using the WITHIN operator. Use of the WITHIN is illustrated in the section on XML searching.

Mar 2, 2009

Oracle Text - Part 5

Generating XML Output

The INDEX_STATS procedure supports both formatted text and XML output.The following code
creates the INDEX_STATS report in text format:

-- Table to store our report
CREATE TABLE index_report (id NUMBER(10), report CLOB);

DECLARE
v_report CLOB := null;
BEGIN
CTX_REPORT.INDEX_STATS(index_name => 'SONG_INDEX',
report => v_report,part_name => NULL,frag_stats => NULL,
list_size => 20,report_format => NULL);
INSERT INTO index_report (id, report) VALUES (1, v_report);
COMMIT;
DBMS_LOB.FREETEMPORARY(v_report);
END;
/

Then output can be viewed:

SELECT report FROM index_report WHERE id = 1;

To find out the size of the index, we can use the function CTX_REPORT.INDEX_SIZE. This is a function returning a CLOB.

select ctx_report.index_size('SONG_INDEX') from dual;

for a XML report :
SQL> select ctx_report.index_size('SONG_INDEX', null, 'XML') from dual;

This will give us a heap of XML output, including sizes for each object.

Here's a section of the output:

<SIZE_OBJECT_NAME> text_user.SYS_IL0000051565C00006$$
</SIZE_OBJECT_NAME>
<SIZE_OBJECT_TYPE> INDEX (LOB)
</SIZE_OBJECT_TYPE>
<SIZE_TABLE_NAME> text_user.DR$TI$I
</SIZE_TABLE_NAME>
<SIZE_TABLESPACE_NAME> USERS
</SIZE_TABLESPACE_NAME>
<SIZE_BLOCKS_ALLOCATED> 8
</SIZE_BLOCKS_ALLOCATED>
<SIZE_BLOCKS_USED> 4
</SIZE_BLOCKS_USED>
<SIZE_BYTES_ALLOCATED> 65536
</SIZE_BYTES_ALLOCATED>
<SIZE_BYTES_USED> 32768
</SIZE_BYTES_USED>
</SIZE_OBJECT>
<SIZE_OBJECT>


selecting a CLOB, force it into an XMLTYPE value:
select xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')) from dual;

select extract(xmltype(ctx_report.index_size('SONG_INDEX', null, 'XML')), '//SIZE_OBJECT') from dual;

We're now using the XML DB "extract" function to fetch all of the XML within tags. The syntax '//SIZE_OBJECT' is an XPATH expression meaning "all the XML within a SIZE_OBJECT element anywhere below the root element".the reader is encouraged to look at the tutorial at w3Schools.

select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT') from dual;

Now specify that only information from the table $I is want . this is done with
within a predicate the xpath expression:

select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]') from dual;

The size in bytes can be get by adding the XPATH expression:

select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED') from dual;

To avoid tags. There's two ways of doing this 1) the text() function within the XPATH, 2) use extractValue rather than extract.

select extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED/text()') as "Table Size" from dual;

select extractValue(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT[SIZE_OBJECT_NAME="text_user.DR$TI$I"]/SIZE_BYTES_USED')as "Table Size" from dual;

For fetching ALL the sizes, perhaps to add them together to get a summary XMLSequence can use,which returns a collection of XMLType values. This function can use in a TABLE clause to unnest the collection values into multiple rows. Now that we're generating a "table", we no longer need to reference DUAL.

To get just the sizes:

select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT/SIZE_BYTES_USED')));

To get all the object info, one object per row:

select * from table(xmlsequence( extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')), '//SIZE_OBJECT')));

Processing information into individual values:

select extractValue(Column_Value, '/SIZE_OBJECT/SIZE_OBJECT_NAME') as "Name",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_TABLESPACE_NAME') as "Tablespace",
extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED') as "Bytes"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('SONG_INDEX', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));

To get sum of all the sizes to get an aggragate total size of all objects
used in the index.

select sum(extractValue(Column_Value, '/SIZE_OBJECT/SIZE_BYTES_USED')) "Total"
from table(xmlsequence(extract(xmltype(replace(ctx_report.index_size('ti', null, 'XML'),chr(10),'')),'//SIZE_OBJECT')));

INDEX_STATS

Most of the CTX_REPORT procedures have two forms - a function which returns a CLOB (as done above)and an alternate form which requires to pass in a CLOB to be populated. One procedure - INDEX_STATS - does not have a function version. This is because it needs to do a full table scan of the main index table, which is likely to take some time.Therefore when calling INDEX_STATS it is needed to go about things a little differently. First create a table with an XMLType column. Then call a PL/SQL block which passes a temporary CLOB to CTX_REPORT, and then inserts that CLOB into the XMLType column. Then various XML operations can perform on that table.

create table output(report XMLType);

declare
tlob clob;
begin
ctx_report.index_stats(index_name=>'ti', report=>tlob,
list_size=>20, report_format=>'XML');
insert into output values (xmlType (replace(tlob,chr(10),'')) );
commit;
dbms_lob.freetemporary(tlob);
end;
/

There is a single row in table OUTPUT, which contains XML report. First well get the
estimated row fragmentation:

select extractValue(report,'//STAT_STATISTIC[@NAME="estimated row fragmentation"]')
as "Fragmentation" from output;

Now get the top three most frequent tokens.

using the position function:

select extract(value(d), '//STAT_TOKEN_TEXT')
from output, table(xmlsequence(extract(report,'/CTXREPORT/INDEX_STATS/STAT_TOKEN_STATS/STAT_TOKEN_LIST[@NAME="most frequent tokens"]/
STAT_TOKEN[position()<4]'))) d; Conclusion
The XML output mode of CTX_REPORT allows powerful manipulation of index information. In order to make full use of these,a good understanding of the XML features of the Oracle database, such as extract, extractValue, XMLSequence, and of XPATH syntax in general is needed.With the use of XML exploding within the data processing world, such an understanding is likely to be very useful in the future.

Oracle Text - Part 4

Progressive Relaxation is a new technique for text searching, available with Oracle 10g.

It assumes a basic working knowledge of Oracle Text, such as the operators used
in query expressions.When Would Use It?

First, let's consider a search scenario.

You have a web site selling books. A user searches for "Michael Crichton" in the "search author" box. OK, easy enough. You do the search, and return the top 10 hits (or whatever) that match the search criteria.

But what if the user mis-spells the firstname, and searches for "Michel Crichton"? In this case, a good strategy for handling this might be to find the top 10 hits from these searches:

1. Any books where the author is exactly "Michel Crichton"
2. Any books containing a fuzzy match of each word, in the right order
3. Any books having either exact word
4. Any books having a fuzzy match of either word.

Of course we can do this in a search like

select book_id from books where contains (author, '(michel crichton) OR (?michel ?crichton) OR (michel OR crichton) OR (?michel OR ?crichton)

But there are two problems with this search:

1. From the user's point of view, hits which are a poor match will be mixed in with hits which are a good match. The user wants to see good matches displayed first.
2. From the system's point of view, the search is inefficient. Even if there were plenty of hits for exactly "Michel Crichton", it would still have to do all the work of the fuzzy expansions and fetch data for all the rows which satisfy the query.

An alternative is to run four separate queries. This way, we can do the exact search first, and then only run the queries with more relaxed criteria if they are needed to get enough hits for the results page.

But apart from the inefficiency of (potentially) running multiple queries, we need to de-duplicate the results. The "relaxed" queries will in many cases hit the rows returned by the exact queries. To avoid duplicates in the result set, the application must screen these hits out, a potentially expensive task in terms of programming and maintenance,if not raw performance.

To solve this problem, Oracle Text in Oracle Database 10g introduces "progressive relaxation".This allows you to specify the different searches to run, and Oracle will run each in turn,returning de-duplicated results until the application stops fetching hits.

The scores returned are manipulated such that if you order by score

Benefits

The benefits of progressive relaxtion is that an application developer can specify operations to be applied to a user query in a declarative manner. It is not necessary to parse the query and apply operators to each search term - the developer just specifies what options should be applied to each term, and how they should be combined (eg AND or OR)

The application also benefits from automatic de-duplication of results at a very early stage in the query (before docid to rowid translation) which is much more efficient than doing it at the final stage, as you would have to if you were running multiple queries.

Query Templates

The actual implementation of progressive relaxation is via the query template mechanism. If you have not come across this before, don't worry - it's quite straightforward and the examples should make it clear. Basically a query template is an XML fragment that is used in the CONTAINS clause in place of a simple query string.

There are some other things that you can do with query templates, such as specifying
language, query grammars and scoring options, but we won't be covering them here.

So - on to our first example:

create table mybooks (title varchar2(20), author varchar2(20));

insert into mybooks values ('Consider the Lillies', 'Ian Crichton Smith');
insert into mybooks values ('Sphere', 'Michael Crichton');
insert into mybooks values ('Stupid White Men', 'Michael Moore');
insert into mybooks values ('Lonely Day', 'Michaela Criton');
insert into mybooks values ('How to Teach Poetry', 'Michaela Morgan');

create index auth_idx on mybooks (author) indextype is ctxsys.context;

SELECT score(1), title, author FROM mybooks WHERE CONTAINS (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0 ORDER BY score(1) DESC;

The output of this query is:

SCORE(1) TITLE AUTHOR
---------- -------------------- --------------------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan

It can see that the first line is an exact match, according to the first entry in our progression sequence. The second line corresponds to a fuzzy match of each term in order - our second criteria. The third and fourth rows come from the exact "micheal OR chrichton" and finally the last row has a single match which is a fuzzy hit on one of the terms.

Obviously in this example, it is fetching all the rows, so there is no major advantage in using progressive relaxation. We can limit it to only return the first two hits, using a PL/SQL cursor:

set serveroutput on format wrapped

declare
max_rows integer := 2;
counter integer := 0;
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
-- loop for the required number of rows
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
dbms_output.put_line(rpad(c.scr, 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/

The output from this is

Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton

Now theres one more feature an application designer might want. And thats to
stop the search after it has evaluated the first successful search criteria. So in our example, if we get one or more hits on exactly "michael chrichton", we dont want to try any of the other searches. If the exact search fails, we want to try the others only until one of them returns one or more rows.

Theres (currently) no way to do this as part of the query template syntax.However, it is possible to do this at the application level by looking at the scores returned using PLSQL

Score Title Author
----- ----- ------
76 Sphere Michael Crichton
51 Lonely Day Michaela Criton
26 Stupid White Men Michael Moore
26 Consider the Lillies Ian Crichton Smith
1 How to Teach Poetry Michaela Morgan

Now, given that the maximum score of a text query is 100, and that we had four steps in our search, we might be able to notice something here. Specifically, any match on the first step will always score in the top quarter of the possible results - 76% to 100%. The next step will score in the range 51-75%, the next 26-50%, and the final step 1-25%. If we had had five steps in our query, the scores would have been in the ranges 81-100%, 61-80%, 41-60%, 21-40% and 1-20%.

So in order to stop our results after the first valid search, we need to detect the score crossing one of these boundaries. In order to do this we MUST know in advance how many stepsthere are. The PL/SQL for all this is a little more tricky than before:

declare
max_rows integer := 2;
counter integer := 0;
number_of_steps integer := 4;
score_range_size integer; -- 33 for 3 steps, 25 for 4, 20 for 5 etc
this_score_group integer; -- final step is 1, penultimate step is 2 ...
last_score_group integer := 0; -- to compare change
begin
-- do the headings
dbms_output.put_line(rpad('Score', 8)||rpad('Title', 20)||rpad('Author', 20));
dbms_output.put_line(rpad('-----', 8)||rpad('-----', 20)||rpad('------', 20));
for c in (select score(1) scr, title, author from mybooks where contains (author, '
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>', 1) > 0) loop
score_range_size := 100/number_of_steps;
this_score_group := c.scr/score_range_size;
exit when this_score_group < last_score_group;
last_score_group := this_score_group;
dbms_output.put_line(rpad(c.scr , 8)||rpad(c.title, 20)||rpad(c.author, 20));
counter := counter + 1;
exit when counter >= max_rows;
end loop;
end;
/

The output from this is:

Score Title Author
----- ----- ------
76 Sphere Michael Crichton

We could add a new, stricter step (remembering to increase the number_of_steps variable).This won't actually find anything but will demonstrate that the procedure continues until it does find at least one row:

number of steps number := 5;
...
<query>
<textquery>
<progression>
<seq>michael crichton </seq>
<seq>?michael ?crichton </seq>
<seq>michael OR crichton </seq>
<seq>?michael OR ?crichton </seq>
</progression>
</textquery>
</query>

and output would be:

Score Title Author
----- ----- ------
61 Sphere Michael Crichton

there is a simplification which makes life easier for the application developer.
Generating the full syntax above can be complicated to program. So there is a "shorthand" syntax, known as a "query rewrite template":

<query>
<textquery> michael crichton
<progression>
<seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", " "))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "{", "}", "OR"))</rewrite></seq>
<seq><rewrite>transform((TOKENS, "?{", "}", "OR"))</rewrite></seq>
</progression>
</textquery>
</query>

This will generate the same four-step syntax as used above. The arguments after TOKENS are as follows:

* Prefix - what to put before each token
* Suffix - what to put after each token
* Connector - an operator to link each token. Space causes a phrase search.

It is usually a good idea to surround each term with braces "{}", in case the user has entered a reserved word like STEM or NT. Beware, though, that adding a wild card after a brace can have strange effects - {dog}% is not the same as dog%.

Oracle Text - Part 3

CTXCAT indexes work best when text is in "small chunks" - maybe a couple of lines maximum - and searches need to restrict and/or sort the result set according to certain structured criteria - usually numbers or dates.

For example,consider an on-line auction site. Each item for sale has a short
description, a current bid price and dates for the start and end of the auction. A
user might want to see all the records with a
current bid price less than $500. Since he's particularly interested in new items, he wants the results sorted by auction start time.

Such a search would be fairly inefficient using a normal CONTEXT index. The kernel would have to find all the records that matched the text search, then restrict the set to those with the correct price (which requires the use of a different index), and then sort the results using a third index.

By including structured information such as price and date within the CTXCAT index, we are able to make this search very much more efficient.When is a CTXCAT index NOT suitable?

The query language with CTXCAT is considerably simpler than for CONTEXT indexes. Basically, you can search for phrases and words (with wild cards if required), using AND and OR operators. If your application needs more complex text retrieval featurs - such as stemming, thesaurus, fuzzy matching and so on, you should be using a CONTEXT index.

There are also differences in the time and space needed to create the index. CTXCAT indexes take quite a bit longer to create - and use considerably more disk space - than CONTEXT indexes. If you are tight on disk space, you should consider carefully whether CTXCAT indexes are appropriate for you.Any other differences?

Yes - in DML processing (updates, inserts and deletes).CTXCAT indexes are transactional.Where a CONTEXT index uses a "deferred indexing" method - CTXCAT indexes work much more like a normal (b-tree) index. When you commit changes, all necesary changes to the CTXCAT indexes take place before the commit returns

CREATE INDEX indexname ON table(column) INDEXTYPE IS CTXCAT;

EG:
create table auction (
item_id NUMBER PRIMARY KEY, -- auction item identifier
item_desc VARCHAR2(80), -- free-form item description
price NUMBER, -- current price of the item
start_time DATE -- end time of the auction
end_time DATE -- end time of the auction
)

A CTXCAT index (like a CONTEXT index) is a "domain" index. Therefore it supports the
"PARAMETERS" clause. A number of possible parameter settings are shared with CONTEXT indexes.These are: LEXER, MEMORY, STOPLIST, STORAGE and WORDLIST (no other CONTEXT parameters are supported). Howevwer, the most important parameter is a new one: INDEX SET.

INDEX SET defines the structured columns that are to be included in the CTXCAT index. if I want to create an index on the item_desc column, but I need to be able to limit my search results by price, and sort by start_time. I do this by creating a new INDEX SET, and adding the structured columns to it.CTX_DDL package allow to do this:

ctx_ddl.create_index_set('auction_set');
ctx_ddl.add_index ('auction_set', 'price');
ctx_ddl.add_index ('auction_set', 'start_time');

Note that the item_desc column is NOT part of the INDEX SET. item_desc is only mentioned when we come to create the actual index:

CREATE INDEX auction_index ON auction (item_desc)
INDEXTYPE IS CTXCAT
PARAMETERS ('INDEX SET auction_set');

how to search ?

By using the CATSEARCH operator instead of the CONTAINS operator used for
a CONTEXT index.If i want to find all auction items which contains the words "toy" and "dog" but not
the phrase "live animal":

SELECT item_id, item_desc FROM auction
WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"', null) > 0;

A few points to note:

* ANDed terms do not need the word AND or even an "&" operator. AND is assumed between
any adjacent terms.
* NOT is represented by "|" (OR, not used here, is represented by "|")
* Parentheses can be used to group terms
* Double quotes are used to surround a phrase (otherwise "live animal" would have been
read as "live AND animal".

The "null" in the query above is a placeholder for a structured clause. There is no default -
if no structured clause is provided "null" Must be used here.

The structured clause allows to restrict, or sort,the results.the query can be extended
above to find only items costing less than $100,

WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"', 'price < 100') > 0

and want to find the results with the newest items first

WHERE CATSEARCH (item_desc, '(toy dog) | "live animal"',
'price < 100 order by start_time desc') > 0

--------------------------------------------
It is worth noting that the creation of the B-Tree indexes will be considerably quicker if the SORT_AREA_SIZE kernel parameter is increased. The default is 64K - a very low figure. Increasing this to 1MB will have a very significant effect - by doing this index-creation times reduce by a factor of 10 . However note the specified amount of memory will be used by EVERY process connecting, so great care should be taken with this parameter if the database is shared with many other users. It might be possible to increase it just for index creation, then reduce it later.