Sunday, November 10, 2013

More on How to Find the Most Important Variables for a Predictive Model

Problem

To make a predictive model, we need independent variables as inputs and a single variable as the target. Typically, both independent and target variables are stored in a single table. Often there are many independent variables, say 50 or 200 of them, such as age, sex, annual income, credit limits, and transaction variables etc. How do we select a small number of variables that are most predicative of the target variable and use them to build a model that is robust?

Solution

In the early post Find the Most Important Variables In Predictive Models, we described that there is a drawback to justify the importance of variable individually. Ideally we should take a set of variables as a whole into consideration. One of the good approaches is Oracle’s Attribute Importance model.

I built a credit card transaction data that contains column is_fraud as the target, id as the unique record identifier and other independent variables that I want to analyze. The data set is v_training_set. I write the following PL/SQL script to build an attribute importance model.

begin
DBMS_DATA_MINING.CREATE_MODEL(
model_name => 'VAR_IMPORTANCE',
mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
data_table_name => 'v_training_set',
case_id_column_name => 'id',
target_column_name => 'is_fraud');
END;

When the attribute importance model,VAR_IMPORTANCE, is done, all independent, i.e., all variables except target and record identifiers, are assigned an importance value. The higher the value of a variable, the more important it is in predicting the target. We can review our result using the following SQL. (I deliberately masked the independent variable names because I think fraud detection is an sensitive matter and we do not want to give away too much information to fraudsters who might be reading this blog post.)

SQL> select attribute_name, IMPORTANCE_VALUE, rank from TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('VAR_IMPORTANCE')) order by rank;

ATTRIBUTE_NAME IMPORTANCE_VALUE RANK
VAR__EMV___ .024118198 1
VAR__TER___ .014412195 2
VAR__RET___ .013569008 3
VAR__PT____ .008679484 4
VAR__TRA___ .008009003 5
VAR__TER___ .007207152 6
VAR__CHK___ .006322795 7
VAR__NRT___ .00591138 8
VAR__NRT___ .005564262 9
VAR__MSG___ .005332518 10
VAR__TRA___ .004457798 11
VAR__NRT___ .00409855 12
VAR__PIN___ .003852347 13
VAR__PRO___ .001177829 14
VAR__RES___ .000911889 15
VAR__TER___ .000767663 16
VAR__FIL___ .000448031 17
VAR__ACC___ .000172331 18
VAR__PRM___ .00009502 19
VAR__AUT___ .00009502 19
VAR__FRW___ .00009502 19
VAR__AUT___ .000092784 20
VAR__ACC___ .000012229 21
VAR__PT____ .000003078 22
VAR__ACC___ -6.946E-06 23
VAR__TRA___ -.02799773 24
VAR__TER___ -.16968261 25
VAR__TRA___ -.39228472 26
VAR__TIE___ -.57372309 27

Conclusion

Oracle’s Attribute Importance function ranks variables based on their importance in predicting the target. It is a great tool for selecting a small number of input variables out of many before we build a predictive model.

1 comment:

Deepesh "DJ" said...

Hi, Is the attribute importance function in oracle similar to the Information Value feature used commonly in analytics?