Technical Blog

Feb 27, 2009

Oracle Text - Part 1

Oracle Text ie: Context Oracle text (formerly: interMedia text) in 10g
----------------------------------------------------------------

User = CTXSYS

Mainly used for Not basic SQL searching; Its about full retrieval against virtually any datatype (including all LOB types). The indexing & searching abilities of Oracle Text are not restricted to data stored in the database. It can index & search documents stored on the filesystem also and index more than 150 document types, including Microsoft Word, PDF, and XML.

Oracle Text search functionality includes fuzzy, stemming (search for mice and find mouse),wildcard

Demo

CREATE USER text_user IDENTIFIED BY oracle;
GRANT connect, resource, ctxapp TO text_user;

CONN text_user/oracle

CREATE TABLE songs (ID NUMBER(10),
Title VARCHAR2(50), Genre VARCHAR2(50));

INSERT INTO songs (ID, Title, Genre)
VALUES (1, 'The Preble Mice Go Squeak', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (2, 'Benri The Cat', 'CHILD');
INSERT INTO songs (ID, Title, Genre)
VALUES (3, 'My Mouse Won't Work Blues', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (4, 'My Pen Leaked - Ballad Of The Pocket Protector', 'COMPUTER ENGINEER');
INSERT INTO songs (ID, Title, Genre)
VALUES (5, 'The Mechanical Pencil - Get The Lead Out', 'HEAVY METAL');
COMMIT;

Creating an Oracle Text index is completely different from creating any other database index.Before creating the index it is required to determine which features want to use and create the necessary structures to support the index.

Most frequently used type of Oracle Text index is CONTEXT index.
if not need multilingual features so this type of simple index creation is sufficient.The basic index creation is broken into two steps:

Step 1: Create the preferences. Preferences tell Oracle Text how index should be stored,how data should be filtered, the language(s) that will be indexed, and how fuzzy and stemming queries should be treated. For this use the CTX_DDL built-in package to create the preferences and set the attributes. The CTX_DDL package is owned by the user CTXSYS . The TEXT_USER schema (user) employed for the examples has EXECUTE permissions on CTX_DDL, because it was granted the CTXAPP role.

BEGIN
CTX_DDL.CREATE_PREFERENCE('english_lexer','basic_lexer');
CTX_DDL.SET_ATTRIBUTE('english_lexer','index_themes','no');
END;
/

The CREATE_PREFERENCE procedure establishes the name and type of LEXER to be used. A LEXER decides how text is broken apart for indexing, and it can be set for different languages. Here BASIC_LEXER object type is used , which supports most Western European, white-space-delimited languages.

Lexer

The lexers job is to separate the sectioners output into words or tokens.In the
simplest case for a Western European language, the lexer just splits text into
uninterrupted strings of alphanumeric characters. So the string:
Aha! Its the 5:15 train, coming here now!
would be split into the words, minus any punctuation or special symbols:
aha it s the 5 15 train coming here now
The lexer typically removes stopwords,which are common words defined by the
application developer; or taken from a default list. That would likely reduce the
list above to:
aha * * * 5 15 train coming * now
Note the asterisks representing removed stopwords. Although they are not actually
indexed, the presence of a stopword at the position is noted in the index. In a
search, any stopword will match that word when used as part of a phrase. For
example, “kicking the ball” will match “kicking a ball” but will not match “kicking
ball”.The set of stopwords may be specified by the application developer, who can also choose to explicitly define all numbers as stopwords.

The following command shows how to set the Japanese lexer:
ctx_ddl.create_preference('JAPANESE_LEXER','japanese_vgram_lexer')

Set Chinese Lexer
exec ctx_ddl.create_preference('CHINESE_LEXER','chinese_vgram_lexer');

Set Korean Lexer
exec ctx_ddl.create_preference('KOREAN_LEXER','korean_morph_lexer');

If the language of the documents is
not known, the new AUTO_LEXER may be used, which provides automatic language recognition, and extensive segmentation and stemming capabilities for multiple languages.

This are the languages supported by the AUTO_LEXER. Those in
bold support context-sensitive stemming.
Arabic, Catalan,Czech, Danish, Dutch, English, Finnish, French, German, Greek,
Hebrew, Hungarian, Italian,Nynorsk,Polish, Portuguese, Romanian, Russian, Serbian, Slovak,Slovenian, Spanish, Swedish, Simplified Chinese, Traditional Chinese, Croatian,Japanese, Korean, Bokmal, Persian, Thai, Turkish

Set Multi-lexer named GLOBAL_LEXER

exec ctx_ddl.create_preference('GLOBAL_LEXER', 'multi_lexer');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','default','ENGLISH_LEXER');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','traditional chinese','CHINESE_LEXER','chn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','japanese','JAPANESE_LEXER','jpn');
exec ctx_ddl.add_sub_lexer('GLOBAL_LEXER','korean','KOREAN_LEXER','kor');

DataStore

The datastore defines from where the text to be indexed should be fetched.Provided datastores allow for text which is stored within a database, on a file system, or accessed remotely via the HTTP protocol (the URL datastore).

Default Datastore

The default datastore is in the database itself. Text may be stored in a VARCHAR2 column (up to 4000 characters), or in a CLOB (Character Large Object) column.Formatted text (such as Word or PDF documents) can be stored in BLOB (Binary Large Object) columns.

File Datastore

Text to be indexed is stored on any file system which is accessible to the database
server. The name or path to the file is stored in the database, typically in a VARCHAR2 column.

URL Datastore

The database contains an HTTP protocol URL, and the text to be indexed is
fetched directly from the URL at indexing time.

SET_ATTRIBUTE procedure is used to instruct Oracle Text not to generate themes

Step 2:
CREATE INDEX song_index ON songs(title) INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS('LEXER english_lexer STOPLIST ctxsys.default_stoplist')

The first parameter is the LEXER.The second parameter in this example is the STOPLIST.Stoplists provide Oracle with a list of words not to index. Typical words to exclude are of,the, a, is, and so on. In the example, the default stoplist is provided as the parameter value. This stoplist is shipped with Oracle Database and is owned by the CTXSYS user.
The CTXAPP role granted to my user provides permissions to use this stoplist.

Oracle Text provides three types of indexes that cover all text search needs:
standard, catalog, and classification

• Standard index type for traditional full-text retrieval over documents and
web pages. The context index type provides a rich set of text search capabilities for finding
spurious results.

• Catalog index type - the first text index designed specifically for eBusiness
catalogs. The ctxcat catalog index type provides flexible searching and
sorting at web-speed.

• Classification index type for building classification or routing applications.
The ctxrule index type is created on a table of queries, where the queries
define the classification or routing criteria.

Testing the Index

By running a query using the CONTAINS() operator

SELECT score(1), title, genre FROM songs
WHERE CONTAINS(title, 'mice', 1) > 0;

This query should return a single row:

SCORE(1)TITLE GENRE
------- ----------------------- -----
5 The Preble Mice Go Squeak CHILD

SCORE() operator ranks search results by relevance and provides a numeric value in the result set that allows me to determine which values are the best matches to the search criteria. In this query, SCORE() is given a label of 1 (inside the parentheses) that corresponds to the matching CONTAINS() search. The last argument in the CONTAINS() clause has the same label value, 1. If there are multiple CONTAINS() operators, separate labels can be used with SCORE() and CONTAINS() to determine the relevance for each.

Here a wildcard (%) character is not used for searching.This is because Oracle Text
creates tokens in its indexing process. These tokens(words or phrases), generated based on the LEXER defined in the index creation, are words and phrases found in the text. When a query is issued with the CONTAINS() operator, the Oracle Text index tables—not the document where the text originated—are searched for a match.Here tokens (words or phrases)are searched and not blocks of text so a wildcard is not necessary if full term is supplied.If a token matches the keyword, the source document is retrieved as a match.

However, there are occasions when a wildcard character would be used. If there are two documents,one with the word pen and the other with the word pencil, it is needed to search for pen% in order to get both records.

Another type of query that uses the CONTAINS() operator is called stemming
(search for mice and find mouse). Some words, although originating from the same root or stem, are not spelled similarly enough for a wildcard to be of any use. For example, a search for mice% will not return mouse. If root form of a word is known , all variants of that word can be matched

Because mice and mouse share the same stem,

SELECT score(1), title, genre FROM songs WHERE CONTAINS(title, '$mice', 1) > 0;

This query now returns two rows:

SCORE(1)TITLE GENRE
------- ----------------------- ---------------
5 My Mouse Wont Work Blues COMPUTER ENGINEER
5 The Preble Mice Go Squeak CHILD

This query returns both rows with the same score.It can find either form by adding a $ in front of the keyword mice.

Tuning and administering Oracle Text

One of the most useful reports for tuning Oracle Text applications is QUERY_LOG_SUMMARY. This report lets you analyze your indexes and queries to maximize effectiveness by checking how queries match data.

In SONGS table there is a song called "My Mouse Wont Work Blues." If That is a Top 40 hit, so it should be flying off the shelves, but nobody is buying it. One place to look for reasons is the QUERY_LOG_SUMMARY report. Using the QUERY_LOG_SUMMARY report,it can see if people are performing searches for similar titles but not the exact title, meaning that the song is never returned as a match.

The first thing need to do is begin logging the queries which is done with the CTX_OUTPUT package, as follows:

exec ctx_output.start_query_log('text_index.txt');

With the query log turned on, run the same incorrect query 100 times as follows:

DECLARE
v_title VARCHAR2(50);
BEGIN
FOR y IN 1..100 LOOP
BEGIN
SELECT title INTO v_title FROM songs
WHERE ID = 3 AND CONTAINS (title,'mice') > 0;
EXCEPTION
WHEN NO_DATA_FOUND THEN
NULL;
END;
END LOOP;
END;
/

After the querying has finished,turn logging off, as follows:

exec ctx_output.end_query_log;

Using the QUERY_LOG_SUMMARY procedure,it can evaluate the log file contents using the following:

DECLARE
v_report CTX_REPORT.QUERY_TABLE;
BEGIN
CTX_REPORT.QUERY_LOG_SUMMARY('text_index.txt',
'song_index',v_report,5,TRUE,FALSE);

FOR y in 1..v_report.count LOOP
DBMS_OUTPUT.PUT_LINE('The query for'
||v_report(y).query||' ran and failed'
||v_report(y).times||' times.');
END LOOP;
EXCEPTION WHEN OTHERS
THEN
DBMS_OUTPUT.PUT_LINE(sqlerrm);
END;
/
This returns the following result:

The query for mice ran and failed 100 times.

The reason for the search failure is obvious. ID 3 contains the word mouse,
not mice. Having identified the problem with my keyword search, I can correct the
query by adding stemming ($), as described

SELECT title FROM songs WHERE ID = 3 AND CONTAINS (title, '$mice') > 0;

TITLE
-------------------------
My Mouse Wont Work Blues

This search returns the record what want, so adding stemming to the online stores search— specify that all user searches include the '$' by default. This results in successful searches on mouse and mice, and soon the CDs are selling.
-------------------------------------------

Feb 25, 2009

Oracle Spatial hint

For an optimal execution plan, always specify the /*+ ordered */ hint when the query window
(second argument of a spatial operator) comes from a table. For example, the following query
finds all the chemical plants within 5 miles of contaminated wells with ID values 1 and 2.

SELECT /*+ ORDERED */
b.chemical_plant_name FROM well_table a,chemical_plants b
WHERE sdo_within_distance (b.geom, a.geom, 'distance=5
unit=mile') = ‘TRUE’ AND a.id in (1,2);

Oracle Spatial index

Spatial index creation

-- Create the index without any parameters

CREATE INDEX geod_counties_sidx ON geod_counties(geom)
INDEXTYPE IS MDSYS.SPATIAL_INDEX

The following parameters are recommended when creating spatial indexes (ie: whihc is a R-Tree index).

CREATE INDEX sp_idx ON my_table (location)
INDEXTYPE IS mdsys.spatial_index
PARAMETERS ('tablesapce=tb_name work_tablespace=work_tb_name')

• WORK_TABLESPACE - During spatial index creation, the process creates intermediate
tables that get dropped when the index is complete. The intermediate tables can take up to 2 times the size of the final index. If WORK_TABLESPACE is not specified, the
intermediate tables are created in the same tablespace as the final index, causing
fragmentation, and possible performance degradation.
You can use SDO_TUNE.ESTIMATE_RTREE_INDEX_SIZE, and multiply the result by
2 to provide guidance on sizing the work tablespace. The work tablespace can be re-used to create other spatial indexes.

• LAYER_GTYPE – This parameter is needed especially when working with point-only
layers. If a point-only layer stores its points in the SDO_ORDINATE_ARRAY, you can
still specify LAYER_GTYPE=POINT on spatial index creation. This can help query
performance when performing spatial analysis.

• SDO_NON_LEAF_TBL – This parameter is useful for very large spatial indexes (not
necessary for smaller spatial indexes). This generates two spatial index tables instead of one.The smaller spatial index table is the non-leaf table, which is traversed most often during spatial analysis. It can be beneficial to pin the non-leaf table into the buffer pool, since it is accessed most often. See the example below.

-- Create the index
CREATE INDEX geod_counties_sidx ON geod_counties(geom)
INDEXTYPE IS MDSYS.SPATIAL_INDEX PARAMETERS ('sdo_non_leaf_tbl=TRUE');

-- Find the non leaf index table name
SELECT sdo_nl_index_table FROM user_sdo_index_metadata
WHERE sdo_index_name='GEOD_COUNTIES_SIDX';

Pin the table in memory
ALTER TABLE MDNT_A930$ STORAGE(BUFFER_POOL KEEP);

Feb 24, 2009

11g Features :::- ADVANCED COMPRESSION

ORACLE ADVANCED COMPRESSION
---------------------------

Oracle Database 9i introduced Table Compression several years ago to allow data to be compressed
during bulk load operations such as direct path load, CREATE TABLE AS SELECT…. (CTAS), etc.
This form of compression was ideally suited for data warehousing environments where most data is
loaded in the database using batch processes. Oracle Database 11g introduces a new feature called
OLTP Table Compression that allows data to be compressed during all types of data manipulation
operations, including conventional DML such as INSERT and UPDATE. In addition, the new feature
significantly improves performance by reducing the overhead of write operations making it suitable
for transactional or OLTP environments as well
It may be noted that Table Compression feature introduced in Oracle Database 9i is a base feature
of Enterprise Edition (EE) and continues to be so even in Oracle Database 11g. The new OLTP Table
Compression feature, however, is a part of the Oracle Advanced Compression

Performance

1) Table Scan Performance 2.5x Faster
2) 3% Storage Saving
3) 3% Fast DML Operation

CREATE TABLE emp (
emp_id NUMBER
, first_name VARCHAR2(128)
, last_name VARCHAR2(128)
) COMPRESS FOR ALL OPERATIONS;

Compression for Unstructured Data

SecureFiles, a new feature in Oracle Database 11g,for storing unstructured content, such as documents, spreadsheets and XML files which is a storage format for large object (LOB) data types to improve performance, reduce space usage,and enhance security.

SecureFiles Deduplication is an intelligent technology that eliminates duplicate copies of SecureFiles data.Consider an email application where 10 users receive an email with the same 1MB attachment. Without SecureFiles Deduplication, the system would store one copy of the file for each of the 10 users – requiring 10MB of storage. If the email application in ourexample had used SecureFiles with Deduplication, it would have stored the 1MB attachment just once. That’s a 90% savings in storage requirements. In addition to the storage savings, SecureFiles Deduplication also increases application performance. Specifically, write and copy operations are much more efficient since only references to the SecureFiles image
are written. Further, read operations may improve if duplicate SecureFiles data already exists in the buffer cache.

Deduplication can be enabled for SecureFiles as below:

CREATE TABLE images (image_id NUMBER,
image BLOB)
LOB(image) STORE AS SECUREFILE
(TABLESPACE lob_tbs DEDUPLICATE);

SecureFiles Compression

CREATE TABLE images (image_id NUMBER,
image BLOB)
LOB(image)STORE AS SECUREFILE
(TABLESPACE lob_tbs COMPRESS);

Different create statement for Table creation with LOB datatype

CREATE TABLEdoc_tab (pkey number(10) not null, document clob)

CREATE TABLEdoc_tab (pkey number(10) not null, document clob)
LOB (document) STORE AS ( TABLESPACE TEST )

create table doc_tab (pkey number(10) not null, document clob)
lob(document) store as doc_tab_document_lobseg
(nocache logging retention);

/* Enabling Secure File ,Compress, Deduplicate */

create table doc_tab (pkey number(10) not null, document clob)
lob(document) store as SECUREFILE doc_tab_document_lobseg_sf
(nocache logging retention auto COMPRESS DEDUPLICATE);

/* Enabling Secure File ,Compress, Deduplicate,Encryption */

create table doc_tab (pkey number(10) not null,document clob)
lob(document) store as SECUREFILE doc_tab_document_lobseg
(nocachelogging retention auto COMPRESS DEDUPLICATE ENCRYPT);

Compression for Backup Data

How to Enable Data Pump Compression

Users have the following options to determine which parts of a dump file set should be compressed:
• ALL enables compression for the entire export operation.
• DATA-ONLY results in all data being written to the dump file in compressed format.
• METADATA-ONLY results in all metadata being written to the dump file in compressed format.
This is the default.
• NONE disables compression for the entire export operation.

expdp hr FULL=y DUMPFILE=dpump_dir:full.dmp COMPRESS;

Recovery Manager Compression

Oracle Advanced Compression introduces new RMAN Compression capabilities that improve RMAN performance while still drastically reducing the storage requirements for backups. Based on the industry standard ZLIB compression algorithm, RMAN compressed backups are up to 40% faster than compressed backups in Oracle Database 10g.

How to Enable RMAN Compression

Syntax for Fast RMAN compression is as below:
RMAN> CONFIGURE COMPRESSION ALGORITHM ‘zlib’;

RMAN compression can be done as shown below:
RMAN> backup as COMPRESSED BACKUPSET database archivelog all;

--------------

Feb 21, 2009

Instead of Stragg /// Allow Duplicate

SELECT
deptno,
LTRIM(SYS_CONNECT_BY_PATH(ename, ','), ',') employees
FROM (
SELECT deptno,
ename,
row_number() over(PARTITION BY deptno ORDER BY deptno) rnum,
count(*) over(PARTITION BY deptno) tot
FROM scott.emp
)
WHERE rnum=tot
START WITH rnum = 1
CONNECT BY PRIOR rnum = rnum -1 AND PRIOR deptno = deptno;

select tb.deptno , substr ( sys_connect_by_path( tb.EMPNO, ',' ) , 2) as string
from ( select deptno ,EMPNO ,row_number() over ( partition by deptno
order by EMPNO ) as val_index from
scott.emp WHERE EMPNO IS NOT NULL ) tb where
connect_by_isleaf = 1 connect by val_index = prior val_index + 1 and
deptno = prior deptno start with val_index = 1 ;

WITH tab AS
(
SELECT &strg str
FROM DUAL)
SELECT REGEXP_SUBSTR (str, '[^,]+', 1, LEVEL) RESULT
FROM tab
CONNECT BY LEVEL <=
((SELECT SUM (CASE
WHEN SUBSTR (str, LEVEL, 1) = ','
THEN 1
WHEN LENGTH (str) = LEVEL
THEN 1
ELSE 0
END
)
FROM tab
CONNECT BY LEVEL <= LENGTH (str)));

SELECT REPLACE(&strg ,',',CHR(13)) FROM DUAL;

FROM 10G ONWARDS

SELECT deptno,wmsys.wm_concat(ename) d
FROM scott.emp group BY deptno

------------------------------------------
To remove duplicate from a SQL

SELECT JOB, ENAME FROM
(SELECT DISTINCT ENAME,JOB, row_number () OVER (PARTITION BY ENAME ORDER BY ENAME) rn
FROM scott.emp )
WHERE rn = 1

Technical Blog

Pages

Feb 27, 2009

Oracle Text - Part 1

Feb 25, 2009

Oracle Spatial hint

Oracle Spatial index

Feb 24, 2009

11g Features :::- ADVANCED COMPRESSION

Feb 21, 2009

Instead of Stragg /// Allow Duplicate

Explore ORACLE4U

Labels

Translate

Wikipedia