Friday, April 25, 2014
Build Best Direct Marketing Predictive Models
A number of years ago, I applied for a PhD statistician position at an online advertising company. As part of the screening process, I was given a project. The following is a simplified description of the problem.
Let's take a look at the following records. ACME sends 10 email campaigns to subscriber 1 over 26 weeks. The records are ordered by WEEK_NUMBER. USER_CAT is a demographic code, and STATE_ID is the home state where the subscriber is located. CAMPAIGN_ID is a number from 1 to 10. A RESPONSE of 1 means the subscriber responded to the campaign.
SQL> select * from TBL_CAMPAIGN where subscriber_id=1 order by week_number;
WEEK_NUMBER SUBSCRIBER_ID USER_CAT STATE_ID GENDER CAMPAIGN_ID RESPONSE
----------- ------------- -------- ---------- ------ ----------- ----------
1 1 B 2 M 1 1
2 1 B 2 M 2 0
3 1 B 2 M 3 0
4 1 B 2 M 4 0
5 1 B 2 M 5 0
6 1 B 2 M 6 1
7 1 B 2 M 7 0
8 1 B 2 M 8 0
9 1 B 2 M 9 0
10 1 B 2 M 10 0
11 1 B 2 M 1 0
12 1 B 2 M 2 0
13 1 B 2 M 3 0
14 1 B 2 M 4 0
15 1 B 2 M 5 0
16 1 B 2 M 6 1
17 1 B 2 M 7 0
18 1 B 2 M 8 0
19 1 B 2 M 9 0
20 1 B 2 M 10 1
21 1 B 2 M 1 0
22 1 B 2 M 2 0
23 1 B 2 M 3 0
24 1 B 2 M 4 1
25 1 B 2 M 5 0
26 1 B 2 M 6 1
26 rows selected.
There are 26 weeks of historical records like the above for each of 100,000 subscribers. Based on the data, can you answer the following questions?
For week 27, suppose ACME sends emails to only 25% of its subscriber base:
(1) Which subscribers would you send email to?
(2) Which campaign(s) would you deliver to them?
(3) What do you expect the response rate to be?
There are several challenges in solving these problems:
1. It is not simply a matter of building predictive models on static variables such as gender or home state. We need variables that capture the dynamic nature of a subscriber's past responses to campaigns (see the sketch after this list). Things to consider include:
a). Is a subscriber who responded recently more likely to respond?
b). Does the sequence of offers affect the response rate? For example, if a subscriber is first sent campaign 3, say a new credit card with a 19% APR, and one week later campaign 4, one with a 6% APR, we expect that he would be more likely to respond to campaign 4.
2. We need to find out which of the ten campaigns a subscriber is most likely to respond to.
3. We need to accurately estimate the response rate. The scores returned by some predictive models are not necessarily probabilities.
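The sketch below shows one way such dynamic variables might be derived directly in SQL with analytic functions. It is a minimal example, assuming the TBL_CAMPAIGN layout shown above; the derived column names (RESP_LAST_4W, WEEKS_SINCE_RESP) are hypothetical, not from the original project.

-- Derived variables per subscriber and week (assumes TBL_CAMPAIGN as shown above)
select subscriber_id,
       week_number,
       campaign_id,
       response,
       -- number of responses in the previous 4 weeks
       nvl(sum(response) over (partition by subscriber_id
                               order by week_number
                               rows between 4 preceding and 1 preceding), 0) resp_last_4w,
       -- weeks elapsed since the most recent response (null if no prior response)
       week_number - max(case when response = 1 then week_number end)
                       over (partition by subscriber_id
                             order by week_number
                             rows between unbounded preceding and 1 preceding) weeks_since_resp
  from TBL_CAMPAIGN;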
I did extensive studies and worked out approaches to solve these problems. There are a number of conclusions that I drew from the studies:
1. Finding the best derived variables is the most important step for building a successful model.
2. All models, from simple logistic regression to sophisticated gradient boosting trees, perform reasonably well.
3. Pre-modeling tasks, e.g., data loading, merging, and calculating derived variables, took more than 85% of the effort.
Monday, April 14, 2014
Drop All Objects In an Oracle Schema
In an earlier post, we talked about how to drop all tables in an Oracle schema. Here we present a script that drops all the database objects in a schema, including tables, views, functions, procedures, mining objects, etc. Please see the following PL/SQL script.
Warning!!!!! The script below will remove all objects
under the current schema!!!
begin
  for i in 1..3 loop
    for r in (select object_name, object_type
                from user_objects
               order by object_type, object_name) loop
      begin
        if (r.object_type = 'MINING MODEL') then
          execute immediate ' begin dbms_data_mining.drop_model('||''''||r.object_name||''''||'); end;';
        elsif (r.object_type = 'TABLE') then
          execute immediate 'drop table "'||r.object_name||'" cascade constraints purge';
        else
          execute immediate 'drop '||r.object_type||' "'||r.object_name||'"';
        end if;
      exception
        when others then null;
      end;
    end loop;
  end loop;
end;
/
Warning!!!!! The script above will remove all objects under the current schema!!!
There are a few things worth mentioning about the script:
1. Mining objects are predictive models. A mining model may have associated tables with names starting with DM$.
2. The top-level loop, for i in 1..3, tries to remove all objects three times. This is because objects may have dependency relationships: an object cannot be dropped while other objects still depend on it. A simple workaround is to attempt to drop all objects several times.
I saved the script as the file this_will_drop_all_user_objects.sql, logged into my schema using sqlplus, and ran it.
$ sqlplus myschema/mypassword
SQL> @this_will_drop_all_user_objects
PL/SQL procedure successfully completed.
All objects under the current schema are now removed. Again, please be careful before running the script! It will remove all objects under the current schema.
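As a quick sanity check (not part of the original post), an emptied schema should report no remaining objects:

SQL> select count(*) from user_objects;
-- expected to return 0 once every object has been dropped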
Wednesday, April 09, 2014
Create Function Based Index on Oracle Table
A function-based index is a very useful tool. I took advantage of it when I was involved in a project to deploy predictive models into a production Oracle database. There were two requirements: 1. I needed to write about two million records of two columns, account_num (varchar2(19)) and a code (varchar2(4)), into a table. 2. In production, we needed to quickly find the code for each account number. There was also one restriction: there was only one column, col_xyz (varchar2(32)), in the table tbl_abc that we could write to. What I did was to concatenate the account_num and code columns and store the result in column col_xyz. I then created a function-based index on the first 19 characters of col_xyz, the account_num. The following are the SQL scripts involved. My original table looks like the following.
SQL> select * from MY_TABLE where rownum <5;

ACCOUNT_NUM                            CODE
-------------------------------------- ----
AAAAAAAA00000000984                    3045
AAAAAAAA00000001421                    3045
AAAAAAAA00000002644                    3045
AAAAAAAA00000004569                    3045

I concatenate account_num and code and put them into tbl_abc as column col_xyz.
SQL> insert into tbl_abc (col_xyz) select account_num||code from MY_TABLE;

SQL> select * from tbl_abc where rownum <5;

COL_XYZ
--------------------------------
AAAAAAAA000000009843045
AAAAAAAA000000014213045
AAAAAAAA000000026443045
AAAAAAAA000000045693045

I create a function-based index on the first 19 characters of col_xyz, the account_num.
SQL> create index tbl_abc_idx on tbl_abc(substr(col_xyz,1,19));

Index created.

Once the function-based index is created on substr(col_xyz,1,19), the following query that finds the code for an account_num is very fast.
SQL> select col_xyz, substr(col_xyz,1,19) account_num, substr(col_xyz, 20,4) code
     from tbl_abc where substr(col_xyz,1,19)='AAAAAAAA00000000984';

COL_XYZ                          ACCOUNT_NUM         CODE
-------------------------------- ------------------- ----
AAAAAAAA000000009843045          AAAAAAAA00000000984 3045

Please also see Table Joining Using Function Based Index.
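Incidentally, one way to confirm that the optimizer actually uses the function-based index is to look at the execution plan. The statements below are a sketch (not from the original post), assuming the tbl_abc table and tbl_abc_idx index created above.

-- Check the plan for the lookup query
explain plan for
  select substr(col_xyz, 20, 4) code
    from tbl_abc
   where substr(col_xyz, 1, 19) = 'AAAAAAAA00000000984';

select * from table(dbms_xplan.display);
-- The plan should show an index range scan on TBL_ABC_IDX rather than a full table scan.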
Monday, April 07, 2014
Calculate Stock Price Consecutive Ups Using Pattern Matching
Pattern matching support in Oracle 12c SQL is very powerful. Let's take a look at the following stock price table. We want to find the days when the price goes up for 4 or more consecutive days.
SQL> select * from STOCK order by dat;

SYM DAT            PRICE
--- --------- ----------
XYZ 31-MAY-11         14
XYZ 01-JUN-11         19
XYZ 02-JUN-11         21
XYZ 03-JUN-11         23
XYZ 04-JUN-11         27
XYZ 05-JUN-11         14
XYZ 06-JUN-11         17
XYZ 07-JUN-11         22
XYZ 08-JUN-11         26
XYZ 09-JUN-11         27
XYZ 10-JUN-11         21
XYZ 11-JUN-11         17
XYZ 12-JUN-11         27
XYZ 13-JUN-11         27
XYZ 14-JUN-11         16
XYZ 15-JUN-11         14
XYZ 16-JUN-11         16
XYZ 17-JUN-11         26
XYZ 18-JUN-11         25
XYZ 19-JUN-11         24

To find patterns where the price goes up for 4 or more consecutive days, we use the following pattern matching query.
SELECT *
  FROM stock MATCH_RECOGNIZE (
         PARTITION BY symbol
         ORDER BY dat
         MEASURES strt.price bal, strt.dat dat,
                  last(up.price) AS last_up_price,
                  last(up.dat)   AS last_up_dat
         ONE ROW PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (strt up{4,})
         DEFINE up AS up.price > PREV(up.price)
       ) MR
 ORDER BY symbol, dat;

SYM        BAL DAT       LAST_UP_PRICE LAST_UP_D
--- ---------- --------- ------------- ---------
XYZ         14 31-MAY-11            27 04-JUN-11
XYZ         14 05-JUN-11            27 09-JUN-11

In the above query, the pattern is defined as (strt up{4,}): any day followed by 4 or more up days. The query also returns the last up price and day (last(up.price) and last(up.dat)).
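If you want to see every day inside each run rather than one summary row per match, MATCH_RECOGNIZE also supports ALL ROWS PER MATCH. The query below is a sketch along those lines, assuming the same STOCK table and symbol column; it is not from the original post.

SELECT symbol, dat, price, cls
  FROM stock MATCH_RECOGNIZE (
         PARTITION BY symbol
         ORDER BY dat
         MEASURES classifier() AS cls   -- STRT for the starting day, UP for each up day
         ALL ROWS PER MATCH
         AFTER MATCH SKIP TO LAST UP
         PATTERN (strt up{4,})
         DEFINE up AS up.price > PREV(up.price)
       )
 ORDER BY symbol, dat;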
Sunday, April 06, 2014
More on Calculating Variables That Predict Customer Churn
In the post Calculate Variables That Predict Customer Churn, we used the Oracle analytic function lag() to find accounts whose balances dropped for three consecutive months, which may indicate that the customer is about to leave the bank. The same calculation can be done using pattern matching, MATCH_RECOGNIZE, available in Oracle 12c. We use the following table as an example.
SQL> select * from tbl_account1 order by acct_num, dt;

  ACCT_NUM    BALANCE DT
---------- ---------- ---------
     12345       8500 31-JAN-13
     12345       8550 28-FEB-13
     12345       5000 31-MAR-13
     12345       1000 30-APR-13
     12345        200 31-MAY-13
     37688       8800 31-JAN-13
     37688       7000 28-FEB-13
     37688       5300 31-MAR-13
     37688       1300 30-APR-13
     37688        500 31-MAY-13

To find the accounts with balances dropping for 3 consecutive months, we use the following MATCH_RECOGNIZE query.
SELECT *
  FROM tbl_account1 MATCH_RECOGNIZE (
         PARTITION BY acct_num
         ORDER BY dt
         MEASURES last(DOWN.balance) bal, last(DOWN.dt) AS dt
         ONE ROW PER MATCH
         AFTER MATCH SKIP TO NEXT ROW
         PATTERN (DOWN{3})
         DEFINE DOWN AS DOWN.balance < PREV(DOWN.balance)
       ) MR
 ORDER BY acct_num, dt;

  ACCT_NUM        BAL DT
---------- ---------- ---------
     12345        200 31-MAY-13
     37688       1300 30-APR-13
     37688        500 31-MAY-13

In the above query, "PARTITION BY acct_num" specifies that the pattern matching is performed independently for each acct_num. "DEFINE DOWN AS DOWN.balance < PREV(DOWN.balance)" defines the pattern variable DOWN by comparing the balance of the current row with that of the previous row (based on "order by dt"). PATTERN (DOWN{3}) means finding 3 consecutive downs; the syntax for pattern definitions is similar to regular expressions. last(DOWN.balance) and last(DOWN.dt) keep the last balance and dt of a successful match. Pattern matching queries can be used to extract useful patterns from time series data, and these patterns can then be used as input variables for predictive models. Thus, in Oracle 12c SQL, the combination of pattern matching and predictive models is very powerful.
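As a follow-on sketch (not in the original post), the match output can be turned into a 0/1 input variable for a churn model, for example by flagging every account that had at least one three-month drop. The query below assumes the tbl_account1 table above; the flag name bal_drop_in_3m is just illustrative.

with drops as (
  select acct_num
    from tbl_account1 MATCH_RECOGNIZE (
           PARTITION BY acct_num
           ORDER BY dt
           ONE ROW PER MATCH
           AFTER MATCH SKIP TO NEXT ROW
           PATTERN (DOWN{3})
           DEFINE DOWN AS DOWN.balance < PREV(DOWN.balance)
         )
)
select a.acct_num,
       case when d.acct_num is not null then 1 else 0 end bal_drop_in_3m
  from (select distinct acct_num from tbl_account1) a
  left join (select distinct acct_num from drops) d
    on a.acct_num = d.acct_num;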
Saturday, April 05, 2014
More On Zip Code and Predictive Models - Variable of High Cardinality
Zip Code and Predictive Models
It is obvious that zip codes should be treated as a categorical variable instead of a numeric one. Since zip codes have many distinct values, in some cases they may practically become unique identifiers for data points. We need to handle them carefully when building predictive models; otherwise, the models may perform well on the training data set but not so well on new data due to overfitting. For some predictive models, such as decision trees and tree-based ensemble models, we can use zip codes directly. For other models, we may categorize zip codes into a smaller number of meaningful groups. For example, we can categorize zip codes based on their credit card fraud rates when building a fraud detection model. The following query shows card_id, home_zip and the fraud indicator (1 fraud, 0 normal transaction) of a credit card transaction table.
SQL> select card_id, home_zip, is_fraud from CARD_TXN;

   CARD_ID HOME_ZIP           IS_FRAUD
---------- ---------------- ----------
     12345 012320                    0
     45678 012456                    1
.......................................

Based on the above table, we can categorize zip codes in two steps:
1. Calculate the fraud rate for each zip code in the training data set. For zip codes with a small number of transactions, use the overall fraud rate of the whole population instead.
2. Divide zip codes into groups based on their fraud rates.
We can calculate the fraud rate for each home_zip using the following script.
create or replace view v_zip_fraud_rate as
with tbl_overall_fraud_r as
  (select sum(is_fraud)/count(*) fraud_rate from CARD_TXN),
tbl_zip_fraud_r as
  (select home_zip,
          count(*) tot_num_txns,
          sum(is_fraud) tot_num_fraud,
          sum(is_fraud)/count(*) fraud_rate
     from CARD_TXN
    group by home_zip)
select home_zip,
       tot_num_txns,
       tot_num_fraud,
       case when tot_num_txns < 50 then b.fraud_rate else a.fraud_rate end fraud_rate
  from tbl_zip_fraud_r a, tbl_overall_fraud_r b;

SQL> select * from V_ZIP_FRAUD_RATE;

HOME_ZIP         TOT_NUM_TXNS TOT_NUM_FRAUD FRAUD_RATE
---------------- ------------ ------------- ----------
012320                   2000            10       .005
012456                   1000             8       .008
012345                     23             0       .006
..........................

In the above example, zip code "012345" has no fraud transactions and only 23 transactions in total, so we assign it the overall average fraud rate. In the same fashion, we can categorize zip codes based on click-through rate, response rate, income, etc. We can easily group zip codes based on fraud rate, for example with a "case when" query as shown below. After we convert the zip code into a small number of categories, we can use it in a predictive model (such as a logistic regression model). Again, it is not necessary to categorize zip codes when using decision-tree based models.
select HOME_ZIP,
       case when fraud_rate < 0.001 then 1
            when fraud_rate < 0.003 then 2
            when fraud_rate < 0.005 then 3
            when fraud_rate < 0.007 then 4
            when fraud_rate < 0.008 then 5
            else 6
       end zip_segment
  from v_zip_fraud_rate;
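To actually feed the segment into a model, the grouping can be joined back to the transaction table. The query below is a sketch of that step (not from the original post), assuming the CARD_TXN table and v_zip_fraud_rate view defined above.

-- Attach the zip segment to each transaction as a model input
with zip_seg as (
  select home_zip,
         case when fraud_rate < 0.001 then 1
              when fraud_rate < 0.003 then 2
              when fraud_rate < 0.005 then 3
              when fraud_rate < 0.007 then 4
              when fraud_rate < 0.008 then 5
              else 6
         end zip_segment
    from v_zip_fraud_rate
)
select t.card_id, t.is_fraud, z.zip_segment
  from CARD_TXN t
  join zip_seg z
    on z.home_zip = t.home_zip;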
Friday, April 04, 2014
Empty All Tables In an Oracle Schema
Instead of removing all tables in our schema as shown in the post Drop All Tables In an Oracle Schema, there are situations where we want to keep the tables while emptying all the data stored in them. The following PL/SQL script does this.
Warning!!!!! The script below will empty all tables
under the current schema!!!
begin
  for r in (select table_name from user_tables order by table_name) loop
    begin
      execute immediate 'truncate table '||r.table_name;
    exception
      when others then null;
    end;
  end loop;
end;
/
Warning!!!!! The script above will empty all tables under the current schema!!!
I saved the script as the file this_will_empty_all_user_tables.sql, logged into my schema using sqlplus, and ran it.
$ sqlplus myschema/mypassword
SQL> @this_will_empty_all_user_tables
PL/SQL procedure successfully completed.
All data stored in the tables in the current schema is gone, while the table structures remain unchanged.
Drop All Tables In an Oracle Schema
Sometimes, we want to drop all tables in our schema. To do this, I created the following PL/SQL script file.
Warning!!!!! The script below will remove all tables
under the current schema!!!
begin
  for r in (select table_name from user_tables order by table_name) loop
    begin
      execute immediate 'drop table '||r.table_name||' cascade constraints purge';
    exception
      when others then null;
    end;
  end loop;
end;
/
Warning!!!!! The script above will remove all tables under the current schema!!!
I saved the script as the file this_will_drop_all_user_tables.sql, logged into my schema using sqlplus, and ran it.
$ sqlplus myschema/mypassword
SQL> @this_will_drop_all_user_tables
PL/SQL procedure successfully completed.
All the tables under the current schema are removed. Again, please be careful before running the script! It will remove all tables under the current schema.
Wednesday, April 02, 2014
Calculate Variables That Predict Customer Churn
Oracle analytic functions such as lag() are very useful for building interesting variables for predictive models. For example, to predict whether a bank account will be closed by a customer, we may want to look at the account balance history. If the account balance drops for three consecutive months, it may indicate that the customer will leave the bank. Using the following table as an example, we will describe how to build a variable indicating a balance drop for three consecutive months.
SQL> select * from tbl_account1 order by dt desc;

  ACCT_NUM    BALANCE DT
---------- ---------- ---------
     12345        200 31-MAY-13
     12345       1000 30-APR-13
     12345       5000 31-MAR-13
     12345       8550 28-FEB-13
     12345       8500 31-JAN-13

One approach is to use the lag() function. In the following query, for each record we calculate the balances of the previous 1, 2 and 3 months. lag(BALANCE,1), lag(BALANCE,2) and lag(BALANCE,3) return the balance from the previous 1, 2 and 3 records based on the order of "dt", respectively. "partition by acct_num" means the calculation is done independently for each acct_num.
with tbl as (
  select a.*,
         lag(BALANCE,1) over(partition by acct_num order by dt) last_1,
         lag(BALANCE,2) over(partition by acct_num order by dt) last_2,
         lag(BALANCE,3) over(partition by acct_num order by dt) last_3
    from tbl_account1 a)
select * from tbl order by dt;

  ACCT_NUM    BALANCE DT            LAST_1     LAST_2     LAST_3
---------- ---------- --------- ---------- ---------- ----------
     12345       8500 31-JAN-13
     12345       8550 28-FEB-13       8500
     12345       5000 31-MAR-13       8550       8500
     12345       1000 30-APR-13       5000       8550       8500
     12345        200 31-MAY-13       1000       5000       8550

Once we understand how to calculate the previous months' balances using the lag() function, we can derive the indicator showing a three-month consecutive drop in balance using the following query.
with tbl as (
  select a.*,
         lag(BALANCE,1) over(partition by acct_num order by dt) last_1,
         lag(BALANCE,2) over(partition by acct_num order by dt) last_2,
         lag(BALANCE,3) over(partition by acct_num order by dt) last_3
    from tbl_account1 a)
select ACCT_NUM, balance, dt,
       case when balance-last_1 < 0 and last_1-last_2 < 0 and last_2-last_3 < 0
            then 1 else 0 end as bal_drop_in_3m
  from tbl order by dt;

  ACCT_NUM    BALANCE DT        BAL_DROP_IN_3M
---------- ---------- --------- --------------
     12345       8500 31-JAN-13              0
     12345       8550 28-FEB-13              0
     12345       5000 31-MAR-13              0
     12345       1000 30-APR-13              0
     12345        200 31-MAY-13              1

We can also use the analytic function sum() over() to calculate the variable, as shown below.
with tbl as (
  select a.*,
         decode(sign(balance - lag(BALANCE,1) over(partition by acct_num order by dt)),
                0,0, 1,0, -1) is_drop
    from tbl_account1 a)
select ACCT_NUM, balance, dt,
       case when sum(is_drop) over(partition by acct_num order by dt rows 2 preceding) = -3
            then 1 else 0 end BAL_DROP_IN_3M
  from tbl a order by dt;

  ACCT_NUM    BALANCE DT        BAL_DROP_IN_3M
---------- ---------- --------- --------------
     12345       8500 31-JAN-13              0
     12345       8550 28-FEB-13              0
     12345       5000 31-MAR-13              0
     12345       1000 30-APR-13              0
     12345        200 31-MAY-13              1

The above variable BAL_DROP_IN_3M can be used as one of the inputs to a predictive model. We can also use the same method to calculate variables such as a stock price going up for five consecutive days, etc., as sketched below.
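The sketch below applies the same lag() technique to a stock price table, hypothetically the STOCK table from the pattern matching post with columns symbol, dat and price, to flag days that end a run of five consecutive up days. It is an illustration under those assumptions, not part of the original post.

-- Flag days where the price has gone up for five consecutive days
with t as (
  select s.*,
         lag(price,1) over(partition by symbol order by dat) p1,
         lag(price,2) over(partition by symbol order by dat) p2,
         lag(price,3) over(partition by symbol order by dat) p3,
         lag(price,4) over(partition by symbol order by dat) p4,
         lag(price,5) over(partition by symbol order by dat) p5
    from stock s)
select symbol, dat, price,
       case when price > p1 and p1 > p2 and p2 > p3 and p3 > p4 and p4 > p5
            then 1 else 0 end up_5_days
  from t
 order by symbol, dat;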
Why It Is Sometimes Not a Good Idea to Export Database Tables as Text Files
It is a common practice to export a database table as a text file for the purpose of delivering data to another party. However, by converting a database table into a text file we lose valuable information about the data, and that may cause problems. If a column contains special characters such as a new line (ASCII code 10, or 0A in hex), the records may not be successfully loaded back into a database, as most import utilities will treat the new line as the end of a record. For example, the record with id 3 in the following table has a new line character between "data" and "analytics". This can be shown using the function utl_raw.CAST_TO_RAW().
SQL> select * from tbl_test;

        ID MESSAGE
---------- ----------------------------------------------------------------
         1 Hello
         2 World
         3 data
            analytics

SQL> select message, utl_raw.CAST_TO_RAW(message) ascii_code from tbl_test where id = 3;

MESSAGE                          ASCII_CODE
-------------------------------- ---------------------------------------------------------
data                             64617461200A20616E616C7974696373
 analytics

If we dump this table as a text file, record 3 will not be loaded correctly because the new line normally marks the end of a record. Other characters that make things complicated include the comma (,), single quote (') and double quote ("). Unfortunately, some ETL tools convert database tables into text files and then load them back into another database, and we have experienced problems with these tools. Then how do we deliver our data to another party if we do not use text files? One approach is to use Oracle utilities such as Data Pump (expdp/impdp) or exp/imp. We may also use a database link to fetch the data from a remote database directly into a table, e.g., create table tbl_abc as select * from tbl_abc@remote_db.
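As an illustration of the Data Pump alternative, the commands below are a sketch, not from the original post; the user names, dump file name and the default DATA_PUMP_DIR directory object are assumptions. The first command exports the table on the source database, the second imports the dump file on the destination database, without ever flattening the data to text.

$ expdp user1/password1 tables=TBL_TEST directory=DATA_PUMP_DIR dumpfile=tbl_test.dmp logfile=tbl_test_exp.log
$ impdp user2/password2 tables=TBL_TEST directory=DATA_PUMP_DIR dumpfile=tbl_test.dmp logfile=tbl_test_imp.log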
Compare Oracle Tables in Different Databases: Find an ETL Issue
Problem
There are situations where we need to compare tables or views located in different databases. For example, a few years ago I was involved in a project to help build a data warehouse to archive the bank card transactions in production. The card transactions in the production Oracle database were moved through an ETL process to the data warehouse, another Oracle database. The key questions are: how do we know that the data values are the same after the movement through ETL? How do we compare tables in different databases? (We do not automatically assume that the ETL tool does the data movement job correctly. It turned out that some data values were actually changed by the ETL.)
One way to compare tables across Oracle databases is to use a database link. Before we create a database link using the following command, the user may need to be granted the "create database link" privilege.
SQL> create database link dl_db1 connect to user1 identified by abc123
     using '(description=(address=(protocol=TCP)(host=server1)(port=1521))
            (connect_data=(sid=ORCLID)))';

Once the database link is created, we can compare a table in the current database to that in another database. To access the table in the other database that the database link points to, we specify "@dl_db1" after the table name as shown below. We can query the remote table just like a local table.
SQL> select count(1) from tbl_123 a, tbl_123@dl_db1 b where a.id=b.id;

In the project I mentioned above, we compared the tables before and after they were moved by the ETL and identified that in some cases values changed. Further investigation showed that this was caused by the ETL tool converting tables in the source database into text files and then loading them into the destination database. It is usually not a good idea to dump tables to text files, as some information is lost during the process.
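Beyond a simple join count, a fuller comparison is possible with set operators over the database link. The queries below are a sketch, assuming the two tables have identical column lists; any rows returned by either MINUS indicate values that differ between the databases.

-- Rows in the local table that do not appear identically in the remote table
select * from tbl_123
minus
select * from tbl_123@dl_db1;

-- Rows in the remote table that do not appear identically in the local table
select * from tbl_123@dl_db1
minus
select * from tbl_123;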
Tuesday, April 01, 2014
Data Mining Model Management: Query user_mining_models View
When we build models in an Oracle database, those models are database objects that can be queried using SQL. Thus, we can find critical information about the models. One of the most useful views is user_mining_models, which contains the following information about each model:
MODEL_NAME, MINING_FUNCTION, ALGORITHM, CREATION_DATE, BUILD_DURATION, MODEL_SIZE, COMMENTS

For example, the following query shows the models that I have built in my schema.
SQL> select model_name, MINING_FUNCTION, ALGORITHM from user_mining_models;

MODEL_NAME         MINING_FUNCTION                ALGORITHM
------------------ ------------------------------ ------------------------------
ABCOC1M            CLUSTERING                     O_CLUSTER
NB1021             CLASSIFICATION                 NAIVE_BAYES
SVM1029            CLASSIFICATION                 SUPPORT_VECTOR_MACHINES
AR1029             ASSOCIATION_RULES              APRIORI_ASSOCIATION_RULES
AI0929             ATTRIBUTE_IMPORTANCE           MINIMUM_DESCRIPTION_LENGTH
DT1029             CLASSIFICATION                 DECISION_TREE
GLM1031A           CLASSIFICATION                 DECISION_TREE
GLM1031B           CLASSIFICATION                 DECISION_TREE
GLM1031C           CLASSIFICATION                 DECISION_TREE
GLM1031E           CLASSIFICATION                 GENERALIZED_LINEAR_MODEL
KM1031C            CLUSTERING                     KMEANS
OC_SH_CLUS_SAMPLE  CLUSTERING                     O_CLUSTER
KM1211             CLUSTERING                     KMEANS
KM_MODEL_TRY1      CLUSTERING                     KMEANS
GLM0115            CLASSIFICATION                 GENERALIZED_LINEAR_MODEL
KM_MODEL           CLUSTERING                     KMEANS
SVD0119            FEATURE_EXTRACTION             NONNEGATIVE_MATRIX_FACTOR
TMSVD1             FEATURE_EXTRACTION             SINGULAR_VALUE_DECOMP

The following "group by" query summarizes the types of models that I have built.
SQL> select MINING_FUNCTION, ALGORITHM, count(*) from user_mining_models
     group by MINING_FUNCTION, ALGORITHM order by MINING_FUNCTION, ALGORITHM;

MINING_FUNCTION                ALGORITHM                        COUNT(*)
------------------------------ ------------------------------ ----------
ASSOCIATION_RULES              APRIORI_ASSOCIATION_RULES               1
ATTRIBUTE_IMPORTANCE           MINIMUM_DESCRIPTION_LENGTH              1
CLASSIFICATION                 DECISION_TREE                           4
CLASSIFICATION                 GENERALIZED_LINEAR_MODEL                2
CLASSIFICATION                 NAIVE_BAYES                             1
CLASSIFICATION                 SUPPORT_VECTOR_MACHINES                 1
CLUSTERING                     KMEANS                                  4
CLUSTERING                     O_CLUSTER                               2
FEATURE_EXTRACTION             NONNEGATIVE_MATRIX_FACTOR               1
FEATURE_EXTRACTION             SINGULAR_VALUE_DECOMP                   1

In summary, models are database objects that can be queried using SQL. This provides an efficient way to manage many models in our database. We can retrieve our models by name, mining function (classification, regression, etc.), algorithm, build date, comments, etc.
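For instance, using the CREATION_DATE column listed above, a query along the following lines retrieves recently built classification models. This is a sketch; the 90-day window is an arbitrary example.

select model_name, algorithm, creation_date
  from user_mining_models
 where mining_function = 'CLASSIFICATION'
   and creation_date > sysdate - 90
 order by creation_date desc;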