Problem
To make a predictive model, we need independent variables as inputs and a single variable as the target. Typically, both independent and target variables are stored in a single table. Often there are many independent variables, say 50 or 200 of them, such as age, sex, annual income, credit limits, and transaction variables etc. How do we select a small number of variables that are most predicative of the target variable and use them to build a model that is robust?
Solution
In the early post Find the Most Important Variables In Predictive Models, we described that there is a drawback to justify the importance of variable individually. Ideally we should take a set of variables as a whole into consideration. One of the good approaches is Oracle’s Attribute Importance model.
I built a credit card transaction data that contains column is_fraud as the target, id as the unique record identifier and other independent variables that I want to analyze. The data set is v_training_set. I write the following PL/SQL script to build an attribute importance model.
beginDBMS_DATA_MINING.CREATE_MODEL(
model_name => 'VAR_IMPORTANCE',
mining_function => DBMS_DATA_MINING.ATTRIBUTE_IMPORTANCE,
data_table_name => 'v_training_set',
case_id_column_name => 'id',
target_column_name => 'is_fraud');
END;
When the attribute importance model,VAR_IMPORTANCE, is done, all independent, i.e., all variables except target and record identifiers, are assigned an importance value. The higher the value of a variable, the more important it is in predicting the target. We can review our result using the following SQL. (I deliberately masked the independent variable names because I think fraud detection is an sensitive matter and we do not want to give away too much information to fraudsters who might be reading this blog post.)
SQL> select attribute_name, IMPORTANCE_VALUE, rank from TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('VAR_IMPORTANCE')) order by rank;
ATTRIBUTE_NAME | IMPORTANCE_VALUE | RANK |
---|---|---|
VAR__EMV___ | .024118198 | 1 |
VAR__TER___ | .014412195 | 2 |
VAR__RET___ | .013569008 | 3 |
VAR__PT____ | .008679484 | 4 |
VAR__TRA___ | .008009003 | 5 |
VAR__TER___ | .007207152 | 6 |
VAR__CHK___ | .006322795 | 7 |
VAR__NRT___ | .00591138 | 8 |
VAR__NRT___ | .005564262 | 9 |
VAR__MSG___ | .005332518 | 10 |
VAR__TRA___ | .004457798 | 11 |
VAR__NRT___ | .00409855 | 12 |
VAR__PIN___ | .003852347 | 13 |
VAR__PRO___ | .001177829 | 14 |
VAR__RES___ | .000911889 | 15 |
VAR__TER___ | .000767663 | 16 |
VAR__FIL___ | .000448031 | 17 |
VAR__ACC___ | .000172331 | 18 |
VAR__PRM___ | .00009502 | 19 |
VAR__AUT___ | .00009502 | 19 |
VAR__FRW___ | .00009502 | 19 |
VAR__AUT___ | .000092784 | 20 |
VAR__ACC___ | .000012229 | 21 |
VAR__PT____ | .000003078 | 22 |
VAR__ACC___ | -6.946E-06 | 23 |
VAR__TRA___ | -.02799773 | 24 |
VAR__TER___ | -.16968261 | 25 |
VAR__TRA___ | -.39228472 | 26 |
VAR__TIE___ | -.57372309 | 27 |
Conclusion
Oracle’s Attribute Importance function ranks variables based on their importance in predicting the target. It is a great tool for selecting a small number of input variables out of many before we build a predictive model.
1 comment:
Hi, Is the attribute importance function in oracle similar to the Information Value feature used commonly in analytics?
Post a Comment