For example, to find out whether credit score is a good indicator of account default, we calculate the default rate for each credit class as shown below (or we may perform a statistical test, such as a Chi-Square test of independence). As we can see, the low credit score class has a default rate of 18%, versus 6% for the high credit score class. Thus, credit class is considered a good variable for building a default model.
Credit score and account default.
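A minimal sketch of this calculation, using invented counts that reproduce the stated rates (18% vs. 6%, assuming 100 accounts per class), together with a hand-rolled Chi-Square test of independence:

```python
# Hypothetical 2x2 table: counts are invented to match the 18% / 6% rates above.
observed = {
    ("low", "default"): 18, ("low", "ok"): 82,
    ("high", "default"): 6, ("high", "ok"): 94,
}

def default_rate(credit_class):
    """Fraction of accounts in the given credit class that defaulted."""
    d = observed[(credit_class, "default")]
    n = d + observed[(credit_class, "ok")]
    return d / n

print(default_rate("low"), default_rate("high"))  # 0.18 0.06

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected counts assume credit class and default are independent.
rows, cols = ("low", "high"), ("default", "ok")
total = sum(observed.values())
chi2 = 0.0
for r in rows:
    for c in cols:
        row_sum = sum(observed[(r, k)] for k in cols)
        col_sum = sum(observed[(k, c)] for k in rows)
        expected = row_sum * col_sum / total
        chi2 += (observed[(r, c)] - expected) ** 2 / expected

print(round(chi2, 2))  # 6.82 -- above the 3.84 critical value (1 df, 5% level)
```

With real data you would typically call a library routine such as `scipy.stats.chi2_contingency` instead of computing the statistic by hand.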
However, this approach has its limits because it does not take the relationships among variables into consideration. This can be illustrated with an imaginary example. We want to assess whether people's height and weight are indicative of getting a disease. We can calculate the following tables for height and weight, respectively. Since short and tall people have the same percentage of sick people (2%), we may conclude that height is not relevant to predicting the disease. Similarly, we may conclude that weight is not important either.
Height and Disease
Weight and Disease
Height/Weight and Disease
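A tiny sketch with invented counts can reproduce this situation: each single variable shows a flat 2% sick rate, yet the combined height/weight cells clearly separate the risky groups from the safe ones (all numbers are hypothetical, chosen to mimic the tables above):

```python
# Hypothetical counts chosen so each variable alone looks useless:
# (height, weight) -> (number sick, group size)
cells = {
    ("short", "light"): (0, 100),
    ("short", "heavy"): (4, 100),
    ("tall",  "light"): (4, 100),
    ("tall",  "heavy"): (0, 100),
}

def marginal_rate(var_index, value):
    """Sick rate for one value of one variable (0 = height, 1 = weight)."""
    sick  = sum(s for k, (s, n) in cells.items() if k[var_index] == value)
    total = sum(n for k, (s, n) in cells.items() if k[var_index] == value)
    return sick / total

# Looked at one at a time, both variables show the same flat 2% rate...
print(marginal_rate(0, "short"), marginal_rate(0, "tall"))   # 0.02 0.02
print(marginal_rate(1, "light"), marginal_rate(1, "heavy"))  # 0.02 0.02

# ...but the combined cells range from 0% to 4%:
for k, (s, n) in cells.items():
    print(k, s / n)
```

This is an XOR-like pattern: the signal lives entirely in the interaction between the two variables, so any single-variable screen misses it.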
As we have seen, this approach may exclude variables that are actually good predictors. When we determine the most important variables for building a predictive model, ideally we should consider a set of variables as a whole. More often than not, it is the relationships between variables that provide the best predictive power. How to find or generate the most useful variables for predictive models is so crucial that we will talk more about it in upcoming blog posts. I have written another post, More on How to Find the Most Important Variables for a Predictive Model, using the Oracle Attribute Importance function.
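One concrete way to take a set of variables as a whole is to score variable subsets by information gain rather than scoring each variable alone. A minimal sketch, reusing the height/weight disease example with invented counts (2% marginal rates, 0%/4% in the combined cells):

```python
import math

def entropy(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Invented counts from the height/weight example: (height, weight) -> (sick, total)
cells = {
    ("short", "light"): (0, 100),
    ("short", "heavy"): (4, 100),
    ("tall",  "light"): (4, 100),
    ("tall",  "heavy"): (0, 100),
}

def info_gain(group_key):
    """Information gain about the disease when records are grouped by
    group_key, a function mapping a (height, weight) cell to a group label."""
    total = sum(n for _, n in cells.values())
    sick = sum(s for s, _ in cells.values())
    groups = {}
    for cell, (s, n) in cells.items():
        g = group_key(cell)
        gs, gn = groups.get(g, (0, 0))
        groups[g] = (gs + s, gn + n)
    cond = sum((gn / total) * entropy(gs / gn) for gs, gn in groups.values())
    return entropy(sick / total) - cond

print(info_gain(lambda c: c[0]))  # height alone: 0.0
print(info_gain(lambda c: c[1]))  # weight alone: 0.0
print(info_gain(lambda c: c))     # both together: ~0.02 bits
```

Each variable on its own carries zero information about the disease, while the pair carries a positive amount. This is the same idea behind subset-search attribute selection in tools like WEKA, just stripped to its core.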
Comments:
Hi
This is a simple and nice example to get the point. And it makes the case for checking groups of variables in terms of combined Information Gain, for instance using WEKA search methods (classes inheriting from ASSearch).
However I must say that in Text Mining problems, when you are forced to handle thousands of variables, examining the predictive power of groups of variables can be very costly. I believe you may suggest using algebraic methods for feature extraction like Singular Value Decomposition for those cases...
Thanks for the post and regards
Jose,
Thank you for your comments. You are absolutely right. Methods like Singular Value Decomposition are great ways to generate feature variables. I will talk about them in another post.
Jay