Data in transactional format
USER_ID CAT STATE_CD CAMPAIGN_DATE CAMPAIGN_CD RESPONSE
1001 A MA 01-MAY-12 A N
1001 A MA 08-MAY-12 A N
1001 A MA 15-MAY-12 A Y
1001 A MA 22-MAY-12 A N
1001 A MA 29-MAY-12 A N
1001 A MA 06-JUN-12 A N
1001 B CT 06-JUN-12 A N
1002 B CT 01-MAY-12 A N
1002 B CT 08-MAY-12 A N
1002 B CT 15-MAY-12 A Y
1002 B CT 22-MAY-12 A N
1002 B CT 29-MAY-12 A Y
If all the independent variables are categorical, we can convert the transactional data into a more compact format by summarizing it with a SQL script similar to the following. We count the number of responses and non-responses for each unique combination of independent variable values. Continuous variables can first be transformed into categorical ones using techniques such as binning.
select cat, state_cd, campaign_cd,
sum(case when response='Y' then 1 else 0 end) num_response,
sum(case when response='N' then 1 else 0 end) num_no_response
from tbl_txn group by cat, state_cd, campaign_cd;
Data in the summary format
CAT STATE_CD CAMPAIGN_CD NUM_RESPONSE NUM_NO_RESPONSE
A MA A 125 1025
B CT C 75 2133
..........................
Summarizing the data first can greatly reduce its size and save memory when building the model. This is particularly useful if we use memory-based modeling tools such as R.
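The binning and summarizing steps can also be done in R itself. The following is only a minimal sketch: the data frame txn, the continuous column AGE, the derived column AGE_BAND, and the break points are all invented for illustration and are not part of the data shown above.

```r
# Hypothetical transactional data; AGE is an invented continuous variable.
txn <- data.frame(
  CAT      = c("A", "A", "A", "B", "B", "B"),
  AGE      = c(23, 37, 41, 58, 62, 29),
  RESPONSE = c("N", "Y", "N", "Y", "N", "N")
)

# Bin AGE into three categories; the break points are arbitrary.
# right = FALSE makes the intervals [0,30), [30,50), [50,Inf).
txn$AGE_BAND <- cut(txn$AGE, breaks = c(0, 30, 50, Inf),
                    labels = c("under30", "30to49", "50plus"),
                    right = FALSE)

# Flag each row, then sum the flags per combination of categorical values,
# mirroring the case-when sums in the SQL script.
txn$NUM_RESPONSE    <- as.integer(txn$RESPONSE == "Y")
txn$NUM_NO_RESPONSE <- as.integer(txn$RESPONSE == "N")
summ <- aggregate(cbind(NUM_RESPONSE, NUM_NO_RESPONSE) ~ CAT + AGE_BAND,
                  data = txn, FUN = sum)
print(summ)
```

For large tables it is usually better to summarize in the database, as in the SQL script, so that only the compact result is loaded into R.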
If we use R to build the logistic regression model, the script for training data in transactional format is similar to the following.
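# RESPONSE should be a factor (levels N, Y) or coded 0/1;
# with a factor, glm treats the first level as "no response".
glm(formula = RESPONSE ~ CAT + STATE_CD + CAMPAIGN_CD,
    data = train.set1, family = binomial(link = "logit")) -> model1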
The R script for building a logistic model based on summary data is shown below.
glm(formula=
cbind(NUM_RESPONSE,NUM_NO_RESPONSE) ~CAT+STATE_CD+CAMPAIGN_CD,
data=train.set1,family = binomial(link = "logit")) ->model2
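To check that the two forms describe the same model, we can fit both on a small synthetic data set and compare the coefficients. This is a sketch under stated assumptions: the data are randomly generated, and the names txn, summ, m_txn and m_sum are invented for the example.

```r
# Small synthetic data set; values are invented for illustration only.
set.seed(1)
n <- 500
txn <- data.frame(
  CAT      = sample(c("A", "B"), n, replace = TRUE),
  STATE_CD = sample(c("MA", "CT"), n, replace = TRUE)
)
# Response coded 0/1, with a higher response rate for CAT "A".
txn$RESPONSE <- rbinom(n, 1, ifelse(txn$CAT == "A", 0.3, 0.1))

# Model on transactional rows (one row per observation).
m_txn <- glm(RESPONSE ~ CAT + STATE_CD, data = txn,
             family = binomial(link = "logit"))

# Summarize to counts of responses and non-responses per combination.
txn$NUM_RESPONSE    <- txn$RESPONSE
txn$NUM_NO_RESPONSE <- 1 - txn$RESPONSE
summ <- aggregate(cbind(NUM_RESPONSE, NUM_NO_RESPONSE) ~ CAT + STATE_CD,
                  data = txn, FUN = sum)

# Model on summary rows: the response is a (successes, failures) matrix.
m_sum <- glm(cbind(NUM_RESPONSE, NUM_NO_RESPONSE) ~ CAT + STATE_CD,
             data = summ, family = binomial(link = "logit"))

# The coefficient estimates should agree up to numerical tolerance.
print(all.equal(coef(m_txn), coef(m_sum), tolerance = 1e-5))
```

The coefficient estimates and their standard errors should match between the two fits. The residual deviance and degrees of freedom will differ, because the summary fit treats each row as one binomial group rather than as many individual observations.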