Deep Data Mining Blog: 2014

Monday, December 15, 2014

Replace Consecutive Blanks With a Single Blank

We can use regexp_replace() function to replace multiple consecutive blanks with a single blank. For example, we have the following table.

SQL> select * from tbl_text;

        ID VAL
---------- --------------------------------
         1 hello     world !
         1 how   are     you doing

In the following query, regexp_replace() finds patterns that have 2 or more spaces and replace them with single blank.

SQL> select val, regexp_replace(val, '[[:space:]]{2,}',' ' ') val2 from tbl_text;

VAL                              VAL2
-------------------------------- --------------------------------
hello     world !                hello world !
how   are     you doing          how are you doing

Friday, December 12, 2014

Insert a Record into a Table Every Second

We can using Oracle function dbms_lock.sleep(n_seconds) to run queries at certain interval, such as inserting data into a table every second. Function dbms_lock.sleep(n_seconds) will suspend the session for n_seconds. To illustrate this function, we insert the system time into a table every second 20 times.

SQL> create table tbl_x (id number, dt date);

SQL> begin
for i in 1..20 loop
insert into tbl_x values(i, sysdate);
dbms_lock.sleep(1);
end loop;
commit;
end;
SQL> /

SQL> select * from tbl_x;

 ID DT
---------- -------------------
  1 12-12-2014 12:44:41
  2 12-12-2014 12:44:42
  3 12-12-2014 12:44:43
  4 12-12-2014 12:44:44
  5 12-12-2014 12:44:45
  6 12-12-2014 12:44:46
  7 12-12-2014 12:44:47
  8 12-12-2014 12:44:48
  9 12-12-2014 12:44:49
 10 12-12-2014 12:44:50
 11 12-12-2014 12:44:51
 12 12-12-2014 12:44:52
 13 12-12-2014 12:44:53
 14 12-12-2014 12:44:54
 15 12-12-2014 12:44:55
 16 12-12-2014 12:44:56
 17 12-12-2014 12:44:57
 18 12-12-2014 12:44:58
 19 12-12-2014 12:44:59
 20 12-12-2014 12:45:00

20 rows selected.

As it is shown above, records are inserted every 1 second.

Wednesday, December 10, 2014

Oracle User Privileges for Data Analysts

As data analysts, we often perform various data work within databases such as Oracle. I have found the following user privileges are very helpful. If possible, database administrators should grant us these privileges.


PRIVILEGE
----------------------------------------
CREATE TABLE
CREATE SYNONYM
CREATE VIEW
CREATE SEQUENCE
CREATE DATABASE LINK
CREATE PROCEDURE
CREATE MATERIALIZED VIEW
CREATE ANY DIRECTORY
CREATE JOB
CREATE MINING MODEL

Monday, December 08, 2014

Create Views Based on Nonexistent Tables

In Oracle, it is possible to create views first based on nonexistent tables. After views are created, we can create the underlying tables and the view will work.
To create a view based on tables that do not exist, we using "create force view" as shown below.

SQL> create force view v_tbla as select * from tbla;
Warning: View created with compilation errors.

It is OK that we got the compilation errors. The key word "force" is necessary. Without it, the view will not be created.

SQL> create view v_tbla as select * from tbla;
create view v_tbla as select * from tbla
                                    *
ERROR at line 1:
ORA-00942: table or view does not exist

If we query this view, we will get an error message. This is fine. The error message will disappear after we create the underlying table tbla.

SQL> select * from v_tbla;
select * from v_tbla
              *
ERROR at line 1:
ORA-04063: view "DMUSER.V_TBLA" has errors

Now we create table that the view is based on and populate it with data. As we see, we can query the view!

SQL> create table tbla (id number, val varchar2(32));
Table created.

SQL> insert into tbla values (1,'hello');
1 row created.

SQL> select * from tbla;

 ID VAL
---------- --------------------------------
  1 hello

SQL> select * from v_tbla;

 ID VAL
---------- --------------------------------
  1 hello

Display Long CLOB Text in Sqlplus

In the following example, the display of CLOB text is truncated.

SQL> create table tbl_test (id number, val clob);

Table created.

SQL> insert into tbl_test values(1,'this is a test.ddlkjdfklj dflkjdlkj 
 jdfhi  dfdlkjflkdlkoieuiooiop jdofoidu  baskjhfp  dfdk dkfiegps 
  dfdalkfaidel hdlfjdlflkd dafjdflkdi igod kdigl dfodfud');

1 row created.

SQL> select * from tbl_test;

        ID
----------
VAL
--------------------------------------------------------------------------------
         1
this is a test.ddlkjdfklj  dflkjdlkj  jdfhi  dfdlkjflkdlkoieuiooiop 
jdofoidu  ba

We can use "set long" to specify how many bytes of CLOB to display. In the following example, "set long 2000" tells sqlplus to display up to 2000 bytes of CLOB text.

SQL> set long 2000
SQL> select * from tbl_test;

        ID
----------
VAL
--------------------------------------------------------------------------------
         1
this is a test.ddlkjdfklj  dflkjdlkj  jdfhi  dfdlkjflkdlkoieuiooiop 
jdofoidu  baskjhfp  dfdk  dkfiegps   dfdalkfaidel hdlfjdlflkd dafjdflkdi
 igod kdigl dfodfud

Sunday, December 07, 2014

Dump the Oracle Schema

We can use Oracle data pump utility to dump the whole schema or selected database objects. The following command dump the schema.

$ expdp prod/XXXX schemas=prod directory=ORA_DATA_EBS_BK dumpfile=prod_bk.dmp logfile=prod_bk.log

The above command creates two files under directory ORA_DATA_EBS_BK.

-rw-r--r-- 1 oracle dba       5129 Dec  7 08:07 prod_bk.log
-rw-r----- 1 oracle dba    2650112 Dec  7 08:07 prod_bk.dmp

We can import them back into a database using impdp command.

$ impdp prod/XXXX schemas=prod directory=ORA_DATA_EBS_BK dumpfile=prod_bk.dmp

Thursday, November 20, 2014

SQL Scripts for Oracle Database Link

In posts Get the source code for Oracle database views and materialized view and Get source code for Oracle database objects: user_source view, we show how to get the SQL scripts that create some database objects.

To get the SQL scripts for existing database links, we can take advantage of dbms_metadata.get_ddl().The following query returns the database links created.

SQL> select  owner, db_link from all_db_links order by owner, db_link;

OWNER                          DB_LINK
------------------------------ ----------------
PROD                           DL_DATA
PROD                           DL_EC1

To get the original SQL scripts that create these database links, we use dbms_metadata.get_ddl() as shown below.

SQL> select dbms_metadata.get_ddl('DB_LINK', 'DL_EC1') from dual;

DBMS_METADATA.GET_DDL('DB_LINK','DL_EC1')
--------------------------------------------------------------------------------

  CREATE DATABASE LINK "DL_EC1"    USING '(description=(address=(protocol=TCP)
(host=ec2-xx-xx-xxx-178.compute-1.amazonaws.com)(port=1521))
(connect_data=(sid=XE)))'

Tuesday, November 18, 2014

Drop User in Oracle

Post Drop All Objects In an Oracle Schema uses PL/SQL to iteratively drop database objects. We can simply drop a user and its database objects in Oracle using "drop user.. cascade". This provides an easy way to clean up the database.

We have to drop a user by connecting to another user. If we try to drop a user that we are currently connecting to, we will get error ORA-01940. The Sqlplus session is connected to user1.

SQL> drop user user1 cascade;
drop user user1 cascade
*
ERROR at line 1:
ORA-01940: cannot drop a user that is currently connected

The user that we are connecting to needs to have "drop user" privilege. In the following example, the Sqlplus session is connected to user2.

SQL> drop user user1 cascade;
rop user user1 cascade

RROR at line 1:
RA-01031: insufficient privileges

We connected to sys and grant "drop user" privilege to user2.

SQL> grant drop user to user2;

Grant succeeded.

Now we connect to user2 and are able to drop user1

SQL> drop user user1 cascade;

User dropped.

Monday, October 06, 2014

Five Trends in Big Data Analytics

Big data are described as having big volume, complex structures and being updated frequently. Analytics, the technology that extract meaningful information from the "raw" data to support decision, is the ultimate driver of the value of the big data. After extensive research, deep-data-mining.com has identified the following five trends in big data analytics.

1. SQL Based In-Database Analytics

Data analytics functions are built within relational database engine. Users take advantage of SQL to perform data mining/predictive analytics task. All processes including data extraction, data preparation, predictive model building and validation, and model deployment are done within the database. SQL based in-database analytics will become a trend, particularly in the enterprise environment. This observation is based on the following facts. Most of the core enterprise data are stored in relationship databases. A lot of business logic that support the daily operation of an enterprises are written in SQL. There are huge number of SQL developers around the world and many of who will be able to perform data analytics task without learning other scripting languages.

2. Apache Spark

According to The Apache Software Foundation, Apache is a fast and general engine for large-scale data processing. Apache Spark runs programs much faster. Its machine learning library MLlib include SVM, logistic regression, decision tree and k-means clustering, etc.

3. the Proliferation R Language

R is a statistical analysis language that is much more "natural" to data scientists than other programming languages such as Java, C or SQL. For example, the R scripts to manipulate vector/matrix or build predictive models are very similar to the mathematics equations found in textbooks. For example, the following R script builds a logistic regression model that predicts y based a, b and c.

  glm(y ~ a+ b + c, data = trainset, family = binomial(link = "logit"))

There is a big community of R users who develop R algorithms and share them in the format of R packages. Thus, we can find almost all data analytics algorithms in R packages.

4. Real Time, In Memory Data Analytics

Traditionally, raw data are that are collected in real time in computer memory but are transformed and loaded into disc-based data warehouse periodically. The data are analyzed offline in a delayed fashion. For example, it is not unusual for a large enterprise to take weeks, if not months, to build a predictive model, test and eventually deploy it. Due to the slowness in identifying new useful patterns in the data, opportunities are lost in the case of new sales or risks are realized in the case of fraud prevention. Thus, there will be a trend to shift traditional offline, disk-based data analytics to online, in-mermory, real time environment.

5. Innovative Data Analytics Applications

As we know, the ultimate purpose of big data is to provide data-driven decision support to solve problems. There will be more and more innovative applications of big data analytics. For example, polices will use model to predict potential repeated criminals. Colleges predict in advance if a student will choose to drop from the school based his background and current situation. Human resource department in a large company can design the best career paths for its employees using models. Applications of data analytics is unlimited.

Sunday, September 28, 2014

Update Multiple Columns In a Table

In the post Update a Table Based on Another Table, we showed how to update a column in a table based on the result from a select query. To update multiple columns at the same time, we simply specify multiple columns after "set" and "select".

SQL> select * from tbl_a order by id;

        ID VALUE            VALUE2
---------- ---------------- ----------------
         1 A
         2 B
         3 C
         4 D

SQL> update tbl_a a set (value, value2) 
                        = ( select a.value||'_1', a.value||'_2' 
                            from tbl_a b where a.id=b.id);

4 rows updated.

SQL> select * from tbl_a order by id;

        ID VALUE            VALUE2
---------- ---------------- ----------------
         1 A_1              A_2
         2 B_1              B_2
         3 C_1              C_2
         4 D_1              D_2

Update a Table Based on Another Table

It is a common task to update records in one table based on those in another table. We can use "update" query to do this. Suppose we have two tables as shown below.

SQL> select * from tbl_1 order by id;

        ID VALUE
---------- --------------------------------
         1 A
         2 X
         3 Y
         4 D

SQL> select * from tbl_2 order by id;

        ID VALUE
---------- --------------------------------
         2 B
         3 C

We want to update the val in tbl_1 with the value in tbl_2 based on id. The desirable result for tbl_1 after the update will be:

       ID VALUE
---------- --------------------------------
         1 A
         2 B
         3 C
         4 D

One way to do this is the following query.


SQL> update tbl_1 a 
set value=(select value from tbl_2 b where a.id=b.id) 
where exists ( select 1 from tbl_2 b where a.id=b.id);

2 rows updated.

SQL> select * from tbl_1 order by id;

        ID VAL
---------- --------------------------------
         1 A
         2 B
         3 C
         4 D

Please notice in the update query, only 2 rows are updated due to the "where exists" clause. If we do not include the "where exists" clause, records whose id are not in tbl_2 will have null values as shown below.

SQL> update tbl_1 a set val=(select val from tbl_2 b where a.id=b.id) ;

4 rows updated.

SQL> select * from tbl_1 order by id;

        ID VALUE
---------- --------------------------------
         1
         2 B
         3 C
         4

In the above update query, 4, instead of 2, rows are updates due to the removal of the "where exists" clause. We can change the above update query without using "where exists" to make it work as the following.

SQL> update tbl_1 a set val=nvl((select value from tbl_2 b where a.id=b.id),a.value) ;

4 rows updated.

SQL> select * from tbl_1 order by id;

        ID VALUE
---------- --------------------------------
         1 A
         2 B
         3 C
         4 D

In the above update query, we added function nvl((select ...), a.value) which means if the select query returns no result (NULL), values will be set as the original values. All four rows are updated. The first update query with "where exists" clause is the best because it only update rows whose id in tbl_2.

Saturday, September 13, 2014

Insert Data into an Oracle View

We are able to insert records into a simple view. By doing that, the data are actually inserted into the physical table that the view is based on. The following is an example.


SQL> create table tbl_test (id number, value varchar2(64));

Table created.

SQL> create view v_tbl_test as select * from tbl_test;

View created.

SQL> insert into tbl_test values(1,'Hello');

1 row created.

SQL> insert into v_tbl_test values(2,'World');

1 row created.

SQL> commit;

Commit complete.

SQL> select * from tbl_test order by id;

        ID VALUE
---------- ----------------------------------------------------------------
         1 Hello
         2 World

SQL> select * from v_tbl_test order by id;

        ID VALUE
---------- ----------------------------------------------------------------
         1 Hello
         2 World

Wednesday, August 27, 2014

Oracle Multiple Table Insert

We can insert the selected results into multiple tables using one query "insert all".


SQL> create table t2 (val number);

Table created.

SQL> create table t2 (val number);

Table created.

SQL> insert all into t1 into t2 select 1 from dual;

2 rows created.

SQL> select * from t1;

  VAL
-----
    1

SQL> select * from t2;

  VAL
-----
    1

However, multiple table insert does not work for remote tables. We can only insert data into one remote table at a time.

SQL> insert all into t1 into t3@DL_ANOTHER_DB select 1 from dual;
insert all into t3 into t2@DL_APEX select 1 from dual
                           *
ERROR at line 1:
ORA-02021: DDL operations are not allowed on a remote database

SQL> insert  into t2@DL_APEX select 1 from dual;

1 row created.

Sunday, August 24, 2014

Oracle Character Return in a String

If a string is too long to display, we can put character returns using "chr(10)" as shown below. As you can see, we can format the long string better. When we generate a long message such as an email from PL/SQL, chr(0) is very helpful.

SQL> select 'hello world!' as message from dual;

MESSAGE
------------
hello world!

SQL> select 'hello'||chr(10)||'world!' as message from dual;

MESSAGE
------------
hello
world!

Schedule Oracle Procedure

We may want to run our Oracle procedures on a regular basis. For example, every day at 10PM, I want to get new transactions from a database schema, calculate the risk score and insert the result into a table. I first created a procedure, proc_all, and then I use DBMS_SCHEDULER.create_job() to schedule the job. This procedure will run automatically every day at 10pm.

BEGIN
  DBMS_SCHEDULER.create_job (
    job_name        => 'my_proc',
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN proc_all; END;',
    start_date      => SYSTIMESTAMP,
    repeat_interval => 'freq=daily; byhour=22;',
    end_date        => NULL,
    enabled         => TRUE,
    comments        => 'Job defined entirely by the CREATE JOB procedure.');
END;
/

If we want to stop the job, we can drop it using DBMS_SCHEDULER.drop_job()

begin DBMS_SCHEDULER.drop_job(job_name => 'my_proc'); end;

We can query views DBA_SCHEDULER_JOBS or USER_SCHEDULER_JOBS to get the status of scheduled jobs.

SQL> select JOB_NAME,JOB_ACTION from USER_SCHEDULER_JOBS;

JOB_NAME                         JOB_ACTION
-------------------------- ------------------------------------------------------------------
MY_PROC                   BEGIN proc_all; END;
PROC_CLEAR_TBL            BEGIN execute immediate 'truncate table tbl_tobe_alerted'; END;

Friday, August 22, 2014

Refresh an Oracle Materialized View

An Oracle materialized view is a very useful feature in situations including data refresh on regular basis. When we create a materialized view, it takes a snapshot of query result and store it physically. This is different from a view where only query logic is stored with the view. Once the materialized view is created, its content does not change until it is refreshed. When we need to refresh the materialized view, one way to do it is to use dbms_mview.refresh().

SQL> create materialized view mv_test_refresh as select sysdate dt 
     from dual;

Materialized view created.
SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;

TO_CHAR(DT,'YYYYM
-----------------
20140801:10:42:47

SQL> exec dbms_mview.refresh('mv_test_refresh');

PL/SQL procedure successfully completed.

SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;

TO_CHAR(DT,'YYYYM
-----------------
20140801:10:50:31

Alternatively, we can create a refresh group to include materialized views that we want to refresh. Then we can run dbms_refresh.refresh() to refresh all materialized views included in the group.

SQL> exec dbms_refresh.make(name=>'test_grp', list=>'mv_test_refresh', 
next_date=>sysdate, interval=>'null');

SQL> exec dbms_refresh.refresh('test_grp');

PL/SQL procedure successfully completed.

SQL> select to_char(dt, 'YYYYMMDD:HH24:Mi:SS') from mv_test_refresh;

TO_CHAR(DT,'YYYYM
-----------------
20140801:10:54:35

We can also schedule when a materialized view will be refreshed automatically. That will be the topic of another post.

Tuesday, July 22, 2014

Get Current Date In Oracle and Server SQL

To get the current system date, we use sysdate in Oracle and getdate() in SQL Server, respectively.

Using Oracle sqlplus.

SQL> select sysdate from dual;

SYSDATE
-------------------
07-22-2014 07:21:21

Using SQL Sever Management Studio.

select getdate()
2014-07-22 11:23:23.097

Wednesday, July 16, 2014

More on Business Rules vs Statistical Models

In a post Predictive Modeling vs Intuitive Business Rules, we described advantages of data-driven models over intuitive rules. The limitations of our thinking are nicely described in a free book written by a former CIA veteran, Richards J. Heuer Psychology of Intelligence Analysis. It is one of my favorite books and I highly recommend it.I have found often we are too confident about our perception and intuition and not aware that we are wrong. (For example, some fraud detection analysts use her intuitions/experience to reject statistically derived optimal fraud scores). Data-driven statistical models are powerful tools to resolve the limitations in our mental processes.

Sunday, June 08, 2014

a Summary of Blog Posts by Topics

The following is a summary of the blog posts grouped by topics.

Build Predictive Models

Predictive Models vs Business Rules

Predictive Model Variables

Import Text File Into Oracle

Unique Identifiers in Oracle

Recency Frequency Monetary

Oracle Materialized View

Database Link to Oracle on Amazon

Calculate Percent Using SQL

Calculate Cumulative Percent Using SQL

Calculate cumulative percentage in Oracle

Find Score Cutoff Value Using SQL

Remove Duplicates Using SQL

Calculate Correlation Coefficients Using SQL

Oracle vs SQL Server

NULL Value

Logistic Regression

Random Sampling

Table Insert

Read Only Table

Make Oracle Table Read Only

Clustering

Ranking

Find Most Frequent

Median Value

Calculate Median value in Oracle

Oracle Source Code

Sunday, June 01, 2014

MS SQL Server iff function vs "case when"

In SQL server, people sometimes use iff function, which is also an Excel function, to assign different values based on a condition. For example, the following query returns 1 if the amt is less than 50, and 0 otherwise.

select id, iif(amt <50, 1,0) as small_amt from tbl_transaction

However, using iff function is not the best practice as iff is not a standard SQL function. The above query only runs successfully on SQL Server. To make the query work for any relational database, such as Oracle, it is better to use SQL standard "case when" statement as shown below.

select id, case when amt<50 then 1 else 0 end small_amt from tbl_transacton

The above query will run successfully on both Oracle and SQL Server databases as they support SQL standard.

Saturday, May 10, 2014

Move Big Files to Amazon Cloud

When we want to upload a big file, e.g. a 8G file,to Amazon Cloud storage, it is a good idea to divide the single big file into smaller chunks and upload them one by one. After all chunks are uploaded to Amazon cloud, we can merge them back as a single file.

For example, we have a binary file of about 7.6G, we can split it into 15 chunks. Then we use ftp to upload them to a Linux EC2 Amazon server. As the last step, we merge them back into a single file.

Step 1. Split the file into chunks of 500M.

$ split -b 500m datafile.dat

The above command generates the following 15 smaller files for datafile.dat.

$ ls -ltr
-rw-r--r--+ 1 oradba None  524288000 May  2 18:44 xaa
-rw-r--r--+ 1 oradba None  524288000 May  2 18:44 xab
-rw-r--r--+ 1 oradba None  524288000 May  2 18:45 xac
-rw-r--r--+ 1 oradba None  524288000 May  2 18:46 xad
-rw-r--r--+ 1 oradba None  524288000 May  2 18:46 xae
-rw-r--r--+ 1 oradba None  524288000 May  2 18:47 xaf
-rw-r--r--+ 1 oradba None  524288000 May  2 18:48 xag
-rw-r--r--+ 1 oradba None  524288000 May  2 18:49 xah
-rw-r--r--+ 1 oradba None  524288000 May  2 18:50 xai
-rw-r--r--+ 1 oradba None  524288000 May  2 18:50 xaj
-rw-r--r--+ 1 oradba None  524288000 May  2 18:51 xak
-rw-r--r--+ 1 oradba None  524288000 May  2 18:51 xal
-rw-r--r--+ 1 oradba None  524288000 May  2 18:52 xam
-rw-r--r--+ 1 oradba None  524288000 May  2 18:52 xan
-rw-r--r--+ 1 oradba None  258613760 May  2 18:52 xao

Step 2. Upload those 15 chunks of files to an EBS volume attached to a Linux EC2 server on Amazon.
Step 3. On the Linux EC2 server, merge files into a single file.

$ cat `ls xa*` >> datafile.dat

If you are using Windows, you may install cygwin to take advantage of the above commands.

Friday, May 09, 2014

Oracle Create Table As Select vs SQL Server Select Into

Oracle CTAS (create table as select) is a convenient way to create a new table based on an existing table.

SQL> select * from tbl_test;

       NUM
----------
         1
         2

SQL> create table tbl_test2 as select * from tbl_test;

Table created.

SQL> select * from tbl_test2;

       NUM
----------
         1
         2

In SQL Server, instead of using CTAS, we use "select.. into new_table from old_table" to create a new table based on old table.

select * into tbl_test2 from tbl_test;

Friday, April 25, 2014

Build Best Direct Marketing Predictive Models

A number of years ago, I was applying for a PhD statistician position at an online advertising company. As part of the screening process, I was given a project. The following is a simple description of the problems.

Let's take a look at the following records. ACME sends 10 campaigns through emails to subscriber 1 for 26 weeks. The records are ordered by Week_number. USER_CAT is a demographic code, State_ID the home state where the subscriber is located. CAMPAIGN_ID is a number from 1 to 10. Response 1 means the subscriber responded to the campaign.

SQL> select * from TBL_CAMPAIGN where subscriber_id=1 order by week_number;

WEEK_NUMBER SUBSCRIBER_ID USER_CAT   STATE_ID GENDER CAMPAIGN_ID   RESPONSE
----------- ------------- -------- ---------- ------ ----------- ----------
          1             1 B                 2 M                1          1
          2             1 B                 2 M                2          0
          3             1 B                 2 M                3          0
          4             1 B                 2 M                4          0
          5             1 B                 2 M                5          0
          6             1 B                 2 M                6          1
          7             1 B                 2 M                7          0
          8             1 B                 2 M                8          0
          9             1 B                 2 M                9          0
         10             1 B                 2 M               10          0
         11             1 B                 2 M                1          0
         12             1 B                 2 M                2          0
         13             1 B                 2 M                3          0
         14             1 B                 2 M                4          0
         15             1 B                 2 M                5          0
         16             1 B                 2 M                6          1
         17             1 B                 2 M                7          0
         18             1 B                 2 M                8          0
         19             1 B                 2 M                9          0
         20             1 B                 2 M               10          1
         21             1 B                 2 M                1          0
         22             1 B                 2 M                2          0
         23             1 B                 2 M                3          0
         24             1 B                 2 M                4          1
         25             1 B                 2 M                5          0
         26             1 B                 2 M                6          1

26 rows selected.

There are 26 weeks historical records like the above for 100,000 subscribers. Based on the data, can you answer the following questions?

For week 27, suppose we send emails to only 25% of its subscriber base:
(1) Which subscribers would you send email to?
(2) Which campaign(s) would you deliver to them?
(3) What do you expect the response rate to be?

There are several challenges to solve these problems:
1. It is not simply building predictive models based on static variables such as gender, home state, etc. We need to consider variables that capture the dynamic nature of a subscriber's past responses to campaigns. Things to consider include:
a). Is a subscriber who responded recently more likely to respond?
b). Is the sequence of offering affecting the response rate? For example, if a subscriber is first sent campaign 3, say a new credit care with APR 19%, and then 1 week later a campaign 4, one with APR 6%, we expect that he would more likely to respond to campaign 4.
2. We need to find out which campaign out of ten that a subscriber will most likely to respond to.
3. We need to accurately estimate the response rate. The scores returned by some predictive models are not necessarily probabilities.

I did extensive studies and figured out approaches to solve those problems. There are a number of conclusions that I have drawn from the studies:

1. Finding the best derived variables is the most important step for building a successful model.
2. All models, from simple logistic regression to sophisticated gradient boosting trees, perform reasonable well.
3. Pre-modeling tasks,e.g, data loading, merging, calculating derived variables, have taken more than 85% of the effort.

Monday, April 14, 2014

Drop All Objects In an Oracle Schema

In earlier post, we talk about how to drop all tables in an Oracle Schema. Here we are going to present a script that drop all the database objects in a schema including tables, views, functions, procedures, mining objects, etc. Please see the following PL/SQL script.

               Warning!!!!! The script below will remove all objects
                       under the current schema!!!

begin

for i in 1..3 loop
for r in (select object_name, object_type from user_objects 
          order by object_type, object_name)
  loop
  begin
    if (r.object_type = 'MINING MODEL') then 
     execute immediate ' begin dbms_data_mining.drop_model('||''''||
       r.object_name||''''||'); end;';
    elsif (r.object_type = 'TABLE') then
      execute immediate 'drop table "'||r.object_name
                    ||'" cascade constraints purge';
    else
       execute immediate 'drop '||r.object_type||' "'||r.object_name||'"';
    end if;
    exception when others then null;
  end;
  end loop;
end loop;
end;
/

Warning!!!!! The script above will remove all objects under the current schema!!!
In the above scripts, there are a number of things that we want to mentions.
1. Mining objects are predictive models. A mining object may contain tables with names starting with DM$. 2. The top level loop of for i in 1..3 is to remove all the objects three times. The is because objects may have dependency relationships. Objects may not be dropped if other objects are dependent on them. The simple solution to this is to try to drop all objects several times.
I saved it as file this_will_drop_all_user_objects.sql. I logged in my schema using sqlplus and run the scripts.

$sqlplus myschema/mypassword

SQL> @this_will_drop_all_user_objects 

PL/SQL procedure successfully completed.

All objects are removed under the current schema. Again, please be careful before run the script! It will remove all objects under the current schema.

Wednesday, April 09, 2014

Create Function Based Index on Oracle Table

Function based index is a very useful tool. I took advantage of it when I was involved in a project to deploy predictive models into a production Oracle database. The are two requirements:

1. I needed to write about two million records of two columns, account_num (varchar2(19)) and a code (varchar2(4)) into a table.
2. In production, we need to quickly find the code for each account number.
There is one restriction.
1. There is only one column col_xyz( varchar2(32)) in a table tbl_abc that we can write to.

What I did was to concatenate the account_num and code columns and store them in the column col_xyz. I then create function based index on the first 19 characters of col_xyz, the account_num. The following are SQL scripts involved.
My original table looks like the following.

SQL> select * from MY_TABLE where rownum <5;

ACCOUNT_NUM                            CODE
-------------------------------------- ----
AAAAAAAA00000000984                    3045
AAAAAAAA00000001421                    3045
AAAAAAAA00000002644                    3045
AAAAAAAA00000004569                    3045

I concatenate account_num and code and put them into tbl_abc as column col_xyz.

SQL> insert into tbl_abc (col_xyz) select account_num||code from MY_TABLE;

SQL> select * from tbl_abc where rownum <5;

COL_XYZ
--------------------------------
AAAAAAAA000000009843045
AAAAAAAA000000014213045
AAAAAAAA000000026443045
AAAAAAAA000000045693045

I create function based index on the first 19 characters of col_zyx, the account_num.

SQL> create index tbl_abc_idx on tbl_abc(substr(col_xyz,1,19));

Index created.

Once the function based index is created on substr(col_xyz,1,19), the following query that finds the code for account_num is very fast.

SQL> select col_xyz, substr(col_xyz,1,19) account_num, substr(col_xyz, 20,4) code from tbl_abc where substr(col_xyz,1,19)='AAAAAAAA00000000984';

COL_XYZ                          ACCOUNT_NUM         CODE
-------------------------------- ------------------- ----
AAAAAAAA000000009843045          AAAAAAAA00000000984 3045

Please also see Table Joining Using Function Based Index.

Monday, April 07, 2014

Calculate Stock Price Consecutive Ups Using Pattern Matching

Patter matching SQL support in Oracle 12c is a very powerful. Let take a took at the following stock price table. We want to calculate days when the price goes up for 4 or more consecutive days.

SQL> select * from STOCK order by dat;

SYM DAT            PRICE
--- --------- ----------
XYZ 31-MAY-11         14
XYZ 01-JUN-11         19
XYZ 02-JUN-11         21
XYZ 03-JUN-11         23
XYZ 04-JUN-11         27
XYZ 05-JUN-11         14
XYZ 06-JUN-11         17
XYZ 07-JUN-11         22
XYZ 08-JUN-11         26
XYZ 09-JUN-11         27
XYZ 10-JUN-11         21
XYZ 11-JUN-11         17
XYZ 12-JUN-11         27
XYZ 13-JUN-11         27
XYZ 14-JUN-11         16
XYZ 15-JUN-11         14
XYZ 16-JUN-11         16
XYZ 17-JUN-11         26
XYZ 18-JUN-11         25
XYZ 19-JUN-11         24

To find patterns where the prices goes up for 4 or more consecutive days, we use the following pattern matching query.

SELECT *
FROM stock MATCH_RECOGNIZE (
     PARTITION BY SYMBOL
     ORDER BY dat
     MEASURES  strt.price bal,strt.dat dat,
      last (up.price) as last_up_price,
      last (up.dat) AS last_up_dat
     ONE ROW PER MATCH
     AFTER MATCH SKIP TO LAST UP
     PATTERN (STRT up{4,} )
     DEFINE
        up AS up.price > PREV(up.price)
     ) MR
ORDER BY SYMBOL, dat;


SYM        BAL DAT       LAST_UP_PRICE LAST_UP_D
--- ---------- --------- ------------- ---------
XYZ         14 31-MAY-11            27 04-JUN-11
XYZ         14 05-JUN-11            27 09-JUN-11

In the above query, the pattern is defined as (strt up{4,}), any day followed by 4 days up. It also returns the last up price and day (last (up.price) and last (up.dat)).

Sunday, April 06, 2014

More on Calculating Variables That Predict Customer Churn

In the post Calculate Variables That Predict Customer Churn, we use Oracle analytics function lag() to caculate those accounts with account balances dropping for three consecutive months which may indicate that the customer may leave the bank. This calculation can be done using pattern matching, MATCH_RECOGNIZE, available on Oracle 12c. We use the following table as an example.

SQL> select *from tbl_account1 order by acct_num, dt;

  ACCT_NUM    BALANCE DT
---------- ---------- ---------
     12345       8500 31-JAN-13
     12345       8550 28-FEB-13
     12345       5000 31-MAR-13
     12345       1000 30-APR-13
     12345        200 31-MAY-13
     37688       8800 31-JAN-13
     37688       7000 28-FEB-13
     37688       5300 31-MAR-13
     37688       1300 30-APR-13
     37688        500 31-MAY-13

To calculate those accounts with balances dropping for 3 consecutive months, we use the following MATCH_RECOGNIZE query.

SELECT *
FROM tbl_account1 MATCH_RECOGNIZE (
     PARTITION BY acct_num
     ORDER BY dt
     MEASURES
      last(down.balance) bal,
      last(DOWN.dt) AS dt
     ONE ROW PER MATCH
     AFTER MATCH SKIP TO NEXT ROW
     PATTERN (DOWN{3})
     DEFINE
        DOWN AS DOWN.balance < PREV(DOWN.balance)
     ) MR
ORDER BY acct_num, dt;

  ACCT_NUM        BAL DT
---------- ---------- ---------
     12345        200 31-MAY-13
     37688       1300 30-APR-13
     37688        500 31-MAY-13

In the above query, "PARTITION BY acct_num" specifies the pattern matching will be performed indepdently for each acct_num. "DEFINE DOWN AS DOWN.balance < PREV(DOWN.balance)" defines pattern variable DOWN as a result of comparing the balance from current and previous row (based on "order by dt"). PATTERN (DOWN{3})means finding 3 consecutive downs. The syntax for pattern definition is similar to regular expressions. Last(down.balance) and last(down.dt) keep the last balance and dt in a successful match.

Pattern matching queries can be used to extract useful patterns from time series data. These patterns can then be used as input variables for predictive models. Thus, in Oracle 12c SQL, the combination of pattern matching and predictive models is very powerful.

Saturday, April 05, 2014

More On Zip Code and Predictive Models - Variable of High Cardinality

In the post Zip Code and Predictive Models, we talk about how to use zip codes in predictive models. Zip codes have many distinct or unique values. Many other variables include MCC (Merchant Category Code), credit card transaction terminal ID, and IP address have similar characteristics. Actually, there is a terminology to describe the uniqueness or distinctness of variable, the cardinality. High cardinality variables have many unique values. In extreme case, high cardinality variables are unique for each data record and they practically become unique identifiers. These extremely high cardinality variables are not really useful for being included in predictive model. For example, customer names are very unique. If we include names as one of the input variables to build a predictive model, the model will likely perform extremely well on the training data set by simply memorizing association between the customer name and target variable. However, the model will perform poorly on new data that contain unseen names.

The relatively high cardinality variable such as MCC, credit card transaction terminal ID, and IP address can be handled using the same methodology described in Zip Code and Predictive Models to categorize them into smaller number of groups. For tree-based models, it is not necessary to do this for high cardinality varibles.

Zip Code and Predictive Models

It is obvious that zip codes should be treated as categorical variable instead of numeric. Since zip codes have many distinct values, in some cases they may practically become the unique identifiers for data points. We need to deal with them carefully when building predictive models. Otherwise, the models may perform well on training data set but not so well on new data set due to overlearning problem.

For some predictive models such as decision trees and tree-based ensemble models, we can use zip codes directly. For other models, we may categorize zip codes into smaller number of meaningful groups. For example, we can categorize zip code based on their credit card fraud rates when building fraud dectection model. The following query shows card_id, home_zip and fraud indicator(fraud, 0 normal transaction) of a credit card transaction table.

SQL> select card_id, home_zip, is_fraud from CARD_TXN;

   CARD_ID HOME_ZIP           IS_FRAUD
---------- ---------------- ----------
     12345 012320                    0
     45678 012456                    1
 .......................................

Based on the above table, we can categorize zip code by following the following two steps:
1. Calculate the fraud rate for each zip code in the training data set. For zip code with small number of transactions, the fraud rate will be the overall fraud rate for the whole population.
2. Divide zip codes into groups based on fraud rates. we can calculate fraud rate for each home_zip using the following scripts.

create or replace view v_zip_fraud_rate
as
with
tbl_overall_fraud_r as
(select sum(is_fraud)/count(*) fraud_rate from CARD_TXN ),
tbl_zip_fraud_r as
(select home_zip, count(*) tot_num_txns,
sum(is_fraud) tot_num_fraud, sum(is_fraud)/count(*) fraud_rate
from CARD_TXN group by home_zip )
select home_zip, tot_num_txns, tot_num_fraud,
       case when tot_num_txns<50 then b.fraud_rate
            else b.fraud_rate
       end fraud_rate
from tbl_zip_fraud_r a, tbl_overall_fraud_r b;


SQL> select * from V_ZIP_FRAUD_RATE;

HOME_ZIP         TOT_NUM_TXNS TOT_NUM_FRAUD FRAUD_RATE
---------------- ------------ ------------- ----------
012320                   2000            10         .005
012456                   1000             8         .008
012345                     23             0         .006
..........................

In the above example, zip code "012345" has only no fraud transactions with only 23 transactions in total. We assign the overal average fraud rate to it. In the same fashion, we can categorize zip code based on click through rate, response rate, income, etc.

We can easily group zip code based on fraud rate. For example, we can use "case when" queries as shown below. After we convert the zip code into a small number of categroies, we can use it in a predicitve models(such as a logistic regression model). Again, it is not nessary to categorize zip codes when using decision tree based models.

select HOME_ZIP,
   case when fraud_rate<0.001 then 1
        when fraud_rate<0.003 then 2
        when fraud_rate<0.005 then 3
        when fraud_rate<0.007 then 4
        when fraud_rate<0.008 then 5
        else 6 end zip_segment
from
v_zip_fraud_rate;

Friday, April 04, 2014

Empty All Tables In an Oracle Schema

In stead of removing all tables in our schema as shown in post Drop All Tables In an Oracle Schema , there are situations where we want to keep the tables while empty all the data stored in them. The following PL/SQL script does this.

               Warning!!!!! The script below will empty all tables 
                       under the current schema!!!

begin
for r in (select table_name from user_tables order by table_name)
  loop
  begin
   execute immediate 'truncate table '||r.table_name;
   exception when others then null;
  end;
  end loop;
end;
/

Warning!!!!! The script above will empty all tables under the current schema!!!
I saved it as file this_will_empty_all_user_tables.sql. I logged in my schema using sqlplus and run the scripts.

$sqlplus myschema/mypassword

SQL> @this_will_empty_all_user_tables 

PL/SQL procedure successfully completed.

All data stored in the tables in current schema are gone while the table structure remain unchanged.

Drop All Tables In an Oracle Schema

Sometimes, we want to drop all tables in our schema. To do this, I create the following PL/SQL script file.

               Warning!!!!! The script below will remove all tables 
                       under the current schema!!!

begin
for r in (select table_name from user_tables order by table_name)
  loop
  begin
   execute immediate 'drop table '||r.table_name
                     ||' cascade constraints purge';
   exception when others then null;
  end;
  end loop;
end;
/

Warning!!!!! The script above will remove all tables under the current schema!!!
I saved it as file this_will_drop_all_user_tables.sql. I logged in my schema using sqlplus and run the scripts.

$sqlplus myschema/mypassword

SQL> @this_will_drop_all_user_tables 

PL/SQL procedure successfully completed.

All the tables are removed under the current schema. Again, please be careful before run the script! It will remove all tables under the current schema.

Wednesday, April 02, 2014

Calculate Variables That Predict Customer Churn

Oracle analytics functions such as lag() are very useful in building interesting variables for predictive models. For example, to predict if a bank account will be closed by a customer we may want to look at the account balance history. If the account balance drop consecutively for three months, it may indicate that the customer will leave the bank. Using the following table as an example, we will describe how to build a variable indicating balance dropping for three consecutive months.

SQL> select * from tbl_account1 order by dt desc;

  ACCT_NUM    BALANCE DT
---------- ---------- ---------
     12345        200 31-MAY-13
     12345       1000 30-APR-13
     12345       5000 31-MAR-13
     12345       8550 28-FEB-13
     12345       8500 31-JAN-13

One approach is to use lag() function. In the following query, for each record we calculate the previous 1, 2 and 3 months balances. lag(BALANCE,1), lag(BALANCE,2) and lag(BALANCE,3) indicate the balance in prevous 1, 2 and 3 records based on the order of "dt", respectively. "partition by acct_num" means the calculation is done independently by acct_num.

with
tbl as(
select a.* ,
lag(BALANCE,1) over(partition by acct_num order by dt) last_1,
lag(BALANCE,2) over(partition by acct_num order by dt) last_2,
lag(BALANCE,3) over(partition by acct_num order by dt) last_3
from tbl_account1 a)
select * from tbl order by dt;

  ACCT_NUM    BALANCE DT            LAST_1     LAST_2     LAST_3
---------- ---------- --------- ---------- ---------- ----------
     12345       8500 31-JAN-13
     12345       8550 28-FEB-13       8500
     12345       5000 31-MAR-13       8550       8500
     12345       1000 30-APR-13       5000       8550       8500
     12345        200 31-MAY-13       1000       5000       8550

Once we understand how to calculate the privous month balance using lag() function, we can derive the indictor showing consecutively three months drop in balance using the following query.

with
tbl as(
select a.* ,
lag(BALANCE,1) over(partition by acct_num order by dt) last_1,
lag(BALANCE,2) over(partition by acct_num order by dt) last_2,
lag(BALANCE,3) over(partition by acct_num order by dt) last_3
from tbl_account1 a)
select ACCT_NUM, balance, dt,
    case when balance-last_1 <0 
          and last_1-last_2<0 
          and last_2-last_3<0 then 1
          else 0 end as bal_drop_in_3m
 from tbl order by dt;

  ACCT_NUM    BALANCE DT        BAL_DROP_IN_3M
---------- ---------- --------- --------------
     12345       8500 31-JAN-13              0
     12345       8550 28-FEB-13              0
     12345       5000 31-MAR-13              0
     12345       1000 30-APR-13              0
     12345        200 31-MAY-13              1

We can also use analytic function sum() over() to calcluate the variable as shown below.

with
tbl as(
select a.* ,
decode(sign(balance-lag(BALANCE,1) over(partition by acct_num order by dt)),
        0,0,
        1,0,
        -1) is_drop
from tbl_account1 a
)
select ACCT_NUM, balance, dt,
case when sum(is_drop) over(partition by acct_num 
            order by dt rows 2 preceding) = -3 then 1
     else 0 end BAL_DROP_IN_3M
 from tbl a order by dt;

  ACCT_NUM    BALANCE DT        BAL_DROP_IN_3M
---------- ---------- --------- --------------
     12345       8500 31-JAN-13              0
     12345       8550 28-FEB-13              0
     12345       5000 31-MAR-13              0
     12345       1000 30-APR-13              0
     12345        200 31-MAY-13              1

The above variable BAL_DROP_IN_3M can be used as one of the inputs into a predictive model. We can also use the same method to calculate variables such as stock price going up for five consecutive days, etc.

Why Sometime It Is Not a Good Idea to Export Database Tables as Text Files?

It is a common practice to export a database table as a text file for the purpose of delivering data to another party. However, by converting a database table into a text file we lose many valuable information about data and that may cause problem.
If a column contains special characters such as new line (ASCII code 10 or 0A in hex), the records may not be successfully loaded back into a database as most import utility will treat new line the end of a record. For example, the record id 3 in following table has a new line character between "data" and "analytics". This can be shown using function utl_raw.CAST_TO_RAW().

SQL> select * from tbl_test;

        ID MESSAGE
---------- ----------------------------------------------------------------
         1 Hello
         2 World
         3 data
            analytics
SQL> select message, utl_raw.CAST_TO_RAW(message) ascii_code from tbl_test where id = 3;

MESSAGE                          ASCII_CODE
-------------------------------- ---------------------------------------------------------
data                             64617461200A20616E616C7974696373
 analytics

If we dump this table as a text file, record 3 will not be loaded correctly because the new line normally marks the end of a record. Other characters that make things complicated including comma(,), single quote('), double quote(") etc. Unfortunately, even some ETL tools convert database tables into text files and then load them back into another database. We have experienced problems with these tools.
Then how do we deliver our data to another party if we do not use text files? One approach is to use Oracle utilities such as data pump(expdp/impdp) or exp/imp. We may also use database link to fetch the data from a remote database directly into a table,e.g, create table tbl_abc as select * from tbl_abc@remote_db.

Compare Oracle Tables in Different Databases: Find an ETL Issue

Problem

There are situations where we need to compare tables or views located in different database. For example, a few years ago I was involved in a project to help building a data warehouse to achieve the bank card transactions in production. The card transactions in the production Oracle database are moved through an ETL process to the data warehouse, another Oracle database. The key questions are: how do we know that the data values are the same after the movement through ETL? How do we compare tables in different databases? (We do not automatically assume that the ETL tool does the data movement job correctly. It turned out the some data values actually changes by the ETL.)

One of ways to compare table across Oracle databases is to use database link. Before we create database link using the following command, the user may need to be granted the "create database link " privilege.

SQL> create database link dl_db1 connect to user1 identified by abc123 
using '(description=(address=(protocol=TCP)(host=server1)(port=1521))
(connect_data=(sid=ORCLID)))';

Once the database link is created, we can compare a table the current database to that in another database. To access the table in another database that the database link points to, we specify "@dl_db1" after the table name as shown below. We can query remote table just like local table.

SQL> select count(1) from tbl_123 a, tbl_123@dl_db1 b where a.id=b.id;

In the project that I mentioned above, we compared the tables before and after being moved by ETL and identified that in some cases values changed. Further investigation showed that it was caused by that the ETL tool converted tables in the source database into text files and then loaded them into tables into the destination database. It is usually not a good idea to dump tables text files as some information are lost during the process.

Tuesday, April 01, 2014

Data Mining Model Management: Query user_mining_models View

When we build models in an Oracle database, all those models are database objects that can be queried using SQL. Thus, we can find out the critical information about the models. One of the most useful view is user_mining_models which contains the following information about models:

 MODEL_NAME
 MINING_FUNCTION
 ALGORITHM
 CREATION_DATE
 BUILD_DURATION
 MODEL_SIZE
 COMMENTS

For example, the following query shows models that I have built in my schema.

SQL> select model_name, MINING_FUNCTION, ALGORITHM from user_mining_models;

MODEL_NAME         MINING_FUNCTION                ALGORITHM
------------------ ------------------------------ ------------------------------
ABCOC1M            CLUSTERING                     O_CLUSTER
NB1021             CLASSIFICATION                 NAIVE_BAYES
SVM1029            CLASSIFICATION                 SUPPORT_VECTOR_MACHINES
AR1029             ASSOCIATION_RULES              APRIORI_ASSOCIATION_RULES
AI0929             ATTRIBUTE_IMPORTANCE           MINIMUM_DESCRIPTION_LENGTH
DT1029             CLASSIFICATION                 DECISION_TREE
GLM1031A           CLASSIFICATION                 DECISION_TREE
GLM1031B           CLASSIFICATION                 DECISION_TREE
GLM1031C           CLASSIFICATION                 DECISION_TREE
GLM1031E           CLASSIFICATION                 GENERALIZED_LINEAR_MODEL
KM1031C            CLUSTERING                     KMEANS
OC_SH_CLUS_SAMPLE  CLUSTERING                     O_CLUSTER
KM1211             CLUSTERING                     KMEANS
KM_MODEL_TRY1      CLUSTERING                     KMEANS
GLM0115            CLASSIFICATION                 GENERALIZED_LINEAR_MODEL
KM_MODEL           CLUSTERING                     KMEANS
SVD0119            FEATURE_EXTRACTION             NONNEGATIVE_MATRIX_FACTOR
TMSVD1             FEATURE_EXTRACTION             SINGULAR_VALUE_DECOMP

The following "group by" query summarize types of models that I have built.

SQL> select MINING_FUNCTION, ALGORITHM, count(*) from user_mining_models
 group by MINING_FUNCTION, ALGORITHM order by MINING_FUNCTION, ALGORITHM;

MINING_FUNCTION                ALGORITHM                        COUNT(*)
------------------------------ ------------------------------ ----------
ASSOCIATION_RULES              APRIORI_ASSOCIATION_RULES               1
ATTRIBUTE_IMPORTANCE           MINIMUM_DESCRIPTION_LENGTH              1
CLASSIFICATION                 DECISION_TREE                           4
CLASSIFICATION                 GENERALIZED_LINEAR_MODEL                2
CLASSIFICATION                 NAIVE_BAYES                             1
CLASSIFICATION                 SUPPORT_VECTOR_MACHINES                 1
CLUSTERING                     KMEANS                                  4
CLUSTERING                     O_CLUSTER                               2
FEATURE_EXTRACTION             NONNEGATIVE_MATRIX_FACTOR               1
FEATURE_EXTRACTION             SINGULAR_VALUE_DECOMP                   1

In summary, models are database objects that can be queried using SQL. This provides an efficient ways to manage many models in our database.We can retrieve our models by names, mining functions (classification, regression, etc.), algorithms, build date, comments, etc.

Sunday, March 30, 2014

Oracle On The Fly Model- Clustering

One of the features in Oracle 12c is the ability of building and running models on the fly ( as opposed to the conventional two steps of building a persistent model first and then applying it to the data). On the fly predictive models provides data miners powerful tools for accomplishing sophisticated tasks with easy.
I will use on the fly k-means clustering as an example. Take a look at the following simple consumer data. Cust_ID is the unique identifier for a consumer.

SQL> select * from TBL_CUSTOMER3 where rownum <=10;

   CUST_ID     INCOME  YEARS_RES HOME_VALUE SEX        MARRIED
---------- ---------- ---------- ---------- ---------- ----------
      7422    87499.5          2        117 "M"        "M"
      3356                    15         80 "M"        "U"
      4782    62499.5          3         91 "M"        "M"
      7333    87499.5          7         85 "M"        "M"
       890    42999.5          1         58 "F"        "U"
      6401    87499.5          5        128 "M"        "M"
      2356    87499.5          4         96 "M"        "M"
      1638    87499.5         13        152 "M"        "M"
      6713    62499.5          6         49 "M"        "U"
      3674    87499.5          3        119 "M"        "M"

10 rows selected.

To understand consumers better, usually we want to categorize consumers into small number of groups, instead of dealing with them individually. We can using Oracle predictive function cluster_id() over() to find clusters on the fly as shown below. Here, cluster_id(into 5 using income, YEARS_RES, HOME_VALUE) returns 5 cluster identifiers using the default clustering algorithms (K-means) based on variables income, YEARS_RES, HOME_VALUE. Missing input variables are OK as they will be replaced by the means. Please notice that the numbering of cluster identifier is not important at all,i.e., as long as clusters are different we can number cluster id arbitrarily.


SQL> select * from (select cust_id, 
   cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid 
from TBL_CUSTOMER3) where rownum <=10;

   CUST_ID        CID
---------- ----------
      7422          9
      3356          6
      4782          7
      7333          8
       890          5
      6401          9
      2356          9
      1638          8
      6713          7
      3674          9

10 rows selected.

We can calculate some summary information about each group using the following query to gain insight into the data.

SQL> select cid, count(*), avg(income) avg_income, 
avg(years_res) avg_years_res, 
avg(home_value) avg_home_value from 
(select cust_id, income, years_res, home_value, 
cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid 
from TBL_CUSTOMER3) group by cid order by count(*) desc;

       CID   COUNT(*)  AVG_INCOME AVG_YEARS_RES AVG_HOME_VALUE
---------- ---------- ----------- ------------- --------------
         7        152    62499.50          3.88          92.89
         9        126    87499.50          2.75         117.98
         5        110    47254.05          2.58          67.37
         8         65    87499.50          9.15         105.58
         6         31    55749.50         13.45          82.39

The clustering models are generated on the fly. When the queries finish, the models are gone. Thus it is very "clean" to run queries involving on the fly models.

Please notice that in the above query, we combine predictive model with standard SQL (group by cid). This is extremely powerful in that we seamlessly raise conventional SQL queries up to a higher level of predictive modeling. We instantly turn a SQL developer into a data miner. Bingo! (If you are interested in this point, please also see my early posts From Oracle SQL Developer to Data Miner and How easy is it for a SQL developer/analyst to become a data miner?

Saturday, March 29, 2014

Oracle Exception NO_DATA_FOUND

Problem

We want to select a column value from a table based on a key. When no record is found for the key, a special message will be displayed.

Solution

Taking the following table as an example.

SQL> select * from TBL_0321;

        ID VAL
---------- --------------------------------
         1 ok

We create a text file find_id.sql that contains the following scripts. When an id is found, it will display "value is:...". When an id is not found, use "exception when NO_DATA_FOUND" kicks in and "Id not found" is displayed.

set serveroutput on;
declare
msg varchar2(32);
begin
 select val into msg from TBL_0321 where id=&1;
 dbms_output.put_line('value is:'||msg);
 exception when NO_DATA_FOUND then
   dbms_output.put_line('Id not found'); 
end;
/

The following are the outputs from actual running of the script.

SQL> @find_id 1
old   4:  select val into msg from TBL_0321 where id=&1;
new   4:  select val into msg from TBL_0321 where id=1;
value is:ok

PL/SQL procedure successfully completed.

SQL> @find_id 2
old   4:  select val into msg from TBL_0321 where id=&1;
new   4:  select val into msg from TBL_0321 where id=2;
Id not found

PL/SQL procedure successfully completed.

Add Comments to Oracle Table

It is a good idea to add comments or descriptions to a table that could be lengthier and more descriptive than the table name. We can retrieve the comments later when needed. This is particularly important for a large project that may involve hundreds of tables.


SQL> comment on table TBL_0321 is 'this is a test';

Comment created.

To retrieve the comments, use the following query.

SQL> select comments from user_tab_comments where table_name='TBL_0321';

COMMENTS
----------------------------------------------------------------------------------
this is a test

Tuesday, March 25, 2014

Using Oracle Dump Function

Oracle dump() function is a useful tool for looking into the "real value", not just the displayed. For example, we may be puzzled to see the results of the query below. Message "data analytics" in Record 3 spans two lines.

SQL> select * from tbl_test;

        ID MESSAGE
---------- --------------------------------
         1 Hello
         2 World
         3 data
            analytics

To find out what is going on, we use dump() function to examine the content of message. On record 3, there is a ASCII character 10 which represents a line new.

SQL> select a.*, dump(id), dump(message) from tbl_test a;

        ID MESSAGE
---------- --------------------------------
DUMP(ID)
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
DUMP(MESSAGE)
------------------------------------------------------------------------
------------------------------------------------------------------------
------------------------------------------------------------------------
         1 Hello
Typ=2 Len=2: 193,2
Typ=1 Len=5: 72,101,108,108,111

         2 World
Typ=2 Len=2: 193,3
Typ=1 Len=5: 87,111,114,108,100

         3 data
            analytics
Typ=2 Len=2: 193,4
Typ=1 Len=16: 100,97,116,97,32,10,32,97,110,97,108,121,116,105,99,115

To remove the new line, we use replace() function as shown below. Now the display looks much better.

SQL> select id, replace(message, chr(10), '') from tbl_test a;

        ID REPLACE(MESSAGE,CHR(10),'')
---------- ----------------------------------------------------------------
         1 Hello
         2 World
         3 data  analytics

In addition to dump(), we can also use function utl_raw.CAST_TO_RAW() as described in post Watch out invisible characters.

Make Oracle Table Read Only

It is a good idea to protect a static table from being modified by making it read only. Command "alter table .. read only" does precisely that. The following example shows how it works.

SQL> create table tbl_test (id number, message varchar2(64));

Table created.

SQL> insert into tbl_test values(1,'Hello');

1 row created.

SQL> alter table tbl_test read only;

Table altered.

SQL> insert into tbl_test values(2,'World');
insert into tbl_test values(2,'World')
            *
ERROR at line 1:
ORA-12081: update operation not allowed on table "MYSCHEMA"."TBL_TEST"

As we can see, after we make the table read only, we can not insert data into the table. We can still run the following queries against read only tables.

1. Select from a read only table.

SQL> select * from tbl_test;

        ID MESSAGE
---------- ----------------------------------------------------------------
         1 Hello

2. Create index on a read only table

SQL> create index tbl_test_idx on tbl_test(id);

Index created.

3. Rename a read only table

SQL> rename tbl_test to tbl_test_b;

Table renamed.

4. Drop a read only table

SQL> drop table tbl_test_b;

Table dropped.

Finally, to restore a read only table to a writable table, we use the command "alter table ... read write" as shown below. Once a "ready only" table becomes "read write", we can insert data into it.

SQL> alter table tbl_test read write;

Table altered.

SQL> insert into tbl_test values(2,'World');

1 row created.

Rename an Oracle Table or View

We can use "rename" to rename a table or view.

SQL> rename tbl_test01 to tbl_test01b;

Table renamed.

Alternatively, we can rename a table using "alter table ...rename to...".


SQL> alter table tbl_test01c rename to tbl_test01d;

Table altered.

Sunday, March 09, 2014

Flush Shared Pool and Buffer Cache

In the post Display How Long to Run a Query, we talk about how to display the time elapsed runing a query. When we run a "fresh" or new query, the database needs to do a lot of work to parse the SQL text and decide the execution plan. Parsed SQL query is stored in memory and may be reused in the future if the same SQL text is issued. Thus as we query the databases, information about our past queries are buffered in the memory. If we run the same query again, it may be faster because the database can reuse information in the memory. We can run the following SQL commands to flush the share pool (information about SQL text) and buffer cache(table data buffered in memory). After that, we can more accurately measure how long it takes to run a fresh query .

SQL>alter system flush shared_pool;
SQL>alter system flush buffer_cache;

Saturday, March 01, 2014

Oracle Direct Path Insert

Problem

When there are large amount of data to be inserted into a table, it is better to use direct path insert. To do direct path insert, we can specify hit /*+ append */ as shown below.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;

However, if we do another direct path insert into the same table, there will be error, ORA-12838: cannot read/modify an object after modifying it in parallel.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;
 insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a
*
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel

To do another direct path insert into the same table, we need to commit the first one first.


SQL> commit;

Commit complete.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;

4 rows created.

Conventional insert does not have the issue as shown below.

SQL>  insert   into tblabc select a.id, a.a from TBL_TEST3 a;

4 rows created.

SQL>  insert   into tblabc select a.id, a.a from TBL_TEST3 a;

4 rows created.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;

4 rows created.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;
 insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a
*
ERROR at line 1:
ORA-12838: cannot read/modify an object after modifying it in parallel


SQL> commit;

Commit complete.

SQL>  insert  /*+ append */ into tblabc select a.id, a.a from TBL_TEST3 a;

4 rows created.

Handle Divided By Zero in Oracle

Problem

Ratio is very useful variable. For example, the ratio between credit card account balance and credit limit is a good measurement of the default risk. However, when calculating ratio, it the denominator is zero, Oracle raises an error,ORA-01476: divisor is equal to zero, and stops the query. This is inconvenient as shown below.

SQL> select * from tbl_account;

CUSTOMER_ID ACCT_BALANCE CREDIT_LIMIT
----------- ------------ ------------
          1         2000         5000
          2            0            0
          3         6000        12000

SQL> select a.*, ACCT_BALANCE/CREDIT_LIMIT r from tbl_account a;
ERROR:
ORA-01476: divisor is equal to zero
no rows selected

Solution

One solution to calculating ratio or division is to build a function that returns NULL when the denominator is zero and normal value when it is not zero. The script below builds such a function div(num1, num2) that calculates num1/num2. It will return NULL when num2 is zero and continue the execution.

create or replace FUNCTION div (
   num1 number,
   num2 number
)
   RETURN NUMBER IS
   v_number   NUMBER;
BEGIN
   BEGIN
      v_number:= num1/num2 ;
   EXCEPTION
      WHEN OTHERS THEN
         RETURN NULL;
   END;

   RETURN v_number;
END div;
/

After we run the above script to build function div(), we can use it to replace division "/".

SQL> select a.*, div(ACCT_BALANCE,CREDIT_LIMIT) r from tbl_account a;

CUSTOMER_ID ACCT_BALANCE CREDIT_LIMIT          R
----------- ------------ ------------ ----------
          1         2000         5000         .4
          2            0            0
          3         6000        12000         .5

The following are other outputs of using div()function.

SQL> select div(0,0) r from dual;

         R
----------


SQL> select div(NULL,0) r from dual;

         R
----------


SQL> select div(NULL,NULL) r from dual;

         R
----------

Wednesday, February 26, 2014

NULL Value and Order By

By default, when we order the records by columns, the null values are ranked last.

SQL> select * from TBL_TEST3 order by a, b;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         4          2          2
         3

We can place rows with nulls first by specifying "nulls fist".

SQL> select * from TBL_TEST3 order by a nulls first, b;

        ID          A          B
---------- ---------- ----------
         3
         1          1          2
         2          1
         4          2          2

NULL Value and Logical Comparison

Logical comparisons in where clause of queries invloving NULL values will be ignored. We use the following table to illustrate this.

SQL> select * from tbl_test3;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         3
         4          2          2

SQL> select * from tbl_test3 where a=b;

        ID          A          B
---------- ---------- ----------
         4          2          2

SQL> select * from tbl_test3 where a<>b;

        ID          A          B
---------- ---------- ----------
         1          1          2

It is interesting to notice that 1. NULL=NULL is not true; 2. NULL<>NULL is not true either.

NULL Value and Arithmetic Operations

Any arithmetic operations invlove NULL values will return NULL. We use the following table to illustrate this.

SQL> select * from tbl_test3;

        ID          A          B
---------- ---------- ----------
         1          1          2
         2          1
         3
         4          2          2

SQL> select t.*, a+b from tbl_test3 t;

        ID          A          B        A+B
---------- ---------- ---------- ----------
         1          1          2          3
         2          1
         3
         4          2          2          4

Tuesday, February 25, 2014

Four ways of Calculating Average Value

In the post Calculate Average Value - Watch Out NULL Values, we mentioned that we need to pay attention to how the NULL values are handled by the database. Using the following table as an example (that has two records with NULL values), we will show four ways of calculating average.

SQL> select * from tbl_test2;

        ID      VALUE
---------- ----------
         1          2
         2          3
         3          4
         4          5
         5
         6

6 rows selected.

Using sum(value)/count(*) will return the wrong result.

SQL> select sum(value)/count(*) from tbl_test2; (wrong!)

SUM(VALUE)/COUNT(*)
-------------------
         2.33333333

The following four queries are correct.
Method 1. Using avg() function.

SQL> select avg(value) from tbl_test2;

AVG(VALUE)
----------
       3.5

Method 2. Using sum(value)/count(value) function. Count(value) will ignore NULL values.

SQL> select sum(value)/count(value) from tbl_test2;

SUM(VALUE)/COUNT(VALUE)
-----------------------
                    3.5

Method 3. Using sum(value)/sum(case when value is null then 0 else 1 end).

SQL> select sum(value)/sum(case when value is null then 0 else 1 end) from tbl_test2;

SUM(VALUE)/SUM(CASEWHENVALUEISNULLTHEN0ELSE1END)
------------------------------------------------
                                             3.5

Method 4. Using sum(value)/count(*) and put "not null" condition in where clause.

SQL> select sum(value)/count(*) from tbl_test2 where value is not null;

SUM(VALUE)/COUNT(*)
-------------------
                3.5

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Monday, December 15, 2014

Friday, December 12, 2014

Wednesday, December 10, 2014

Monday, December 08, 2014

Sunday, December 07, 2014

Thursday, November 20, 2014

Tuesday, November 18, 2014

Monday, October 06, 2014

1. SQL Based In-Database Analytics

2. Apache Spark

3. the Proliferation R Language

4. Real Time, In Memory Data Analytics

5. Innovative Data Analytics Applications

Sunday, September 28, 2014

Saturday, September 13, 2014

Wednesday, August 27, 2014

Sunday, August 24, 2014

Friday, August 22, 2014

Tuesday, July 22, 2014

Wednesday, July 16, 2014

Sunday, June 08, 2014

Predictive Models vs Business Rules

Predictive Model Variables

Import Text File Into Oracle

Unique Identifiers in Oracle

Recency Frequency Monetary

Oracle Materialized View

Database Link to Oracle on Amazon

Calculate Percent Using SQL

Calculate Cumulative Percent Using SQL

Find Score Cutoff Value Using SQL

Remove Duplicates Using SQL

Calculate Correlation Coefficients Using SQL

Oracle vs SQL Server

NULL Value

Logistic Regression

Random Sampling

Table Insert

Read Only Table

Clustering

Ranking

Find Most Frequent

Median Value

Oracle Source Code

Debug PL/SQL

Hide PL/SQL Scripts

Repair Views

Dump Schema

Move Big Files to Amazon

Sunday, June 01, 2014

Saturday, May 10, 2014

Friday, May 09, 2014

Friday, April 25, 2014

Monday, April 14, 2014

Wednesday, April 09, 2014

Monday, April 07, 2014

Sunday, April 06, 2014

Saturday, April 05, 2014

Friday, April 04, 2014

Wednesday, April 02, 2014

Problem

Tuesday, April 01, 2014

Sunday, March 30, 2014

Saturday, March 29, 2014

Problem

Solution

Tuesday, March 25, 2014

Sunday, March 09, 2014

Saturday, March 01, 2014

Problem

Problem

Solution

Wednesday, February 26, 2014

Tuesday, February 25, 2014