Feb 5, 2015

Integration of Hadoop with Oracle

Oracle Database lets you use programming language functionality inside the database through user-defined functions known as Table Functions, so you do not have to write complex SQL statements.

The MapReduce programming model can be implemented within Oracle Database using parallel pipelined Table Functions and parallel operations. Parallel processing tasks can be written as database queries that use user-defined Table Functions to aggregate or filter the data. Pipelined Table Functions were introduced in Oracle 9i as a way of embedding procedural logic within a data flow. At a logical level, a Table Function is a user-defined function that appears in the FROM clause of a SQL statement and operates like a table, returning a stream of rows.

This mechanism gives SQL developers a way to perform complex processing procedurally when that processing is not easily expressed in SQL. It also follows the MapReduce paradigm, enabling massively parallel processing within the database.
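The row-streaming idea can be illustrated outside the database. The following is a minimal conceptual sketch in Python, not Oracle code: generators stand in for pipelined Table Functions, and the names map_words and reduce_counts as well as the sample data are hypothetical. Each yield plays the role that returning one row to the consumer plays in a pipelined function.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_words(rows: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper stage: emit one (word, 1) pair per word, a row at a time."""
    for line in rows:
        for word in line.lower().split():
            yield (word, 1)


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
    """Reducer stage: sum the counts per key, then emit the totals as rows."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    for word in sorted(totals):
        yield (word, totals[word])


# Hypothetical input rows; in the database these would come from a table or cursor.
documents = ["Oracle table functions", "table functions stream rows"]

# Chaining the stages resembles nesting table functions in a FROM clause.
result = list(reduce_counts(map_words(documents)))
```

The mapper never materializes its full output; the reducer pulls pairs as it needs them, which is the property that makes pipelining attractive for large data flows.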

Feeding Hadoop Data to the Database for Further Analysis

External tables present data stored in a file system in a table format and can be used transparently in SQL queries. Hadoop data stored in HDFS can be accessed from inside Oracle Database through External Tables by using a FUSE (Filesystem in Userspace) driver, which provides the interface between HDFS and the External Table infrastructure. External Tables make it easier for non-programmers to work with Hadoop data from inside an Oracle Database.
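What a FUSE mount buys you is that HDFS files look like ordinary files to any reader. The sketch below shows that idea in Python under stated assumptions: a temporary directory stands in for the mount point, the file name part-00000 and the column names are hypothetical, and read_external_rows is an illustrative helper, not part of any Oracle or Hadoop API.

```python
import csv
import os
import tempfile
from typing import Dict, Iterator, List


def read_external_rows(path: str, columns: List[str]) -> Iterator[Dict[str, str]]:
    """Read a delimited file and present each record as a named row."""
    with open(path, newline="") as handle:
        for record in csv.reader(handle):
            yield dict(zip(columns, record))


# Assumption: a temporary directory plays the role of the FUSE mount point.
with tempfile.TemporaryDirectory() as mount_point:
    sample = os.path.join(mount_point, "part-00000")  # hypothetical file name
    with open(sample, "w", newline="") as out:
        out.write("1,alice\n2,bob\n")

    # The reader only sees a file path; it does not know the data lives in HDFS.
    rows = list(read_external_rows(sample, ["id", "name"]))
```

An External Table does the same kind of mapping declaratively: the column definitions and the file location are given once, and queries then treat the file as a table.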

Leveraging Hadoop Processing From the Database

If you need to process data in Hadoop before it can be correlated with data from your database, you can control the execution of MapReduce programs through a table function. The table function uses the DBMS_SCHEDULER framework to asynchronously launch an external shell script that submits a Hadoop MapReduce job. The table function and the MapReduce program communicate using Oracle’s Advanced Queuing feature.
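The control flow described here (launch a job asynchronously, then stream its results back through a queue) can be sketched in Python. This is an analogy under stated assumptions, not the Oracle implementation: a thread stands in for the job started through DBMS_SCHEDULER, queue.Queue stands in for Advanced Queuing, and fake_mapreduce_job, hadoop_results, and the sample pairs are hypothetical.

```python
import queue
import threading
from typing import Iterator, Tuple

# Sentinel the producer enqueues to signal that the job has finished.
_DONE = object()


def fake_mapreduce_job(out: queue.Queue) -> None:
    """Hypothetical stand-in for the external MapReduce job: enqueue results."""
    for pair in [("error", 3), ("warn", 7)]:
        out.put(pair)
    out.put(_DONE)


def hadoop_results() -> Iterator[Tuple[str, int]]:
    """Table-function analogue: start the job, then yield rows as they arrive."""
    channel: queue.Queue = queue.Queue()
    worker = threading.Thread(target=fake_mapreduce_job, args=(channel,))
    worker.start()  # asynchronous launch; the caller does not wait for completion
    while True:
        item = channel.get()  # blocks until the job has produced something
        if item is _DONE:
            break
        yield item
    worker.join()


rows = list(hadoop_results())
```

Because the consumer reads from the queue while the job is still running, rows can flow into the query before the whole job has completed.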

By leveraging Hadoop processing from the database, you can use the power of the Oracle RDBMS to simplify the analysis of data stored in a Hadoop cluster, streaming data directly from Hadoop through Oracle Advanced Queuing and a Table Function.