Deep Data Mining Blog: Dealing with missing values in data mining

Sunday, September 02, 2012

Dealing with missing values in data mining

We have found that when analyzing large amount of data such as financial transactions, medical claims, cell phone calls, and credit information, missing values are common. Simply ignoring data points with missing values is recommended in most cases. There are a number of ways to handle missing values for categorical and continuous variables.

The following examples illustrate how to replace missing values in categorical variables.
Original data:
Record_ID Variable_X
1 A
2 B
3 A
........................
95 Missing

Method 1. Replace missing value with a special value (e.g. X)
Record_ID Variable_X
1 A
2 B
3 A
........................
95 X

In Oracle, this can be done easily using function nvl(). The following is the SQL scripts to perform the replacement.

select nvl(Variable_X,'X') as Variable_X, ..... from original_table;

Method 2. Replace missing value with the most frequent value.

Record_ID Variable_X
1 A
2 B
3 A
........................
95 A
The following is the SQL scripts to perform the replacement.

select nvl(Variable_X,'A') as Variable_X, ..... from original_table;

The following examples illustrate how to replace missing values in continuous variables.
Original data:
Record_ID Variable_X
1 1.3
2 0.5
3 2.9
........................
95 Missing

Method 3. Replace missing value with mean or median value.
Record_ID Variable_X
1 1.3
2 0.5
3 2.9
........................
95 1.56
The following is the SQL scripts to perform the replacement.
select nvl(Variable_X, 1.56) as Variable_X, ..... from original_table;

Method 4. We first convert continuous variable into categorical variable. This is also called binning. I will discuss binning in another post. We define missing as a special category.
The following is the SQL scripts to perform the above logic.
case when Variable_X <=1.2 then 'A'
when Variable_X <=1.8 then 'B'
when Variable_X > 1.8 then 'C'
else 'Z' end Variable_X_BIN

New Variable Variable_X_BIN.
Record_ID Variable_X --> Variable_X_BIN
1 1.3 ---> B
2 0.5 ---> A
3 2.9 --->B
........................
95 missing ---->Z

There are more sophisticated methods to impute missing values using multiple variable models. But they are rarely used in real world applications.

10 Most Influential People	Text Files and Oracle DB	Predictive Model vs Rule	Build Predictive Model	About Predictive Model Variable	Logistic Regression
Recency Frequency Monetary Analysis	Unique Identifier in Oracle	Materialized View	Database Link	Calculate Percentage Using SQL	Handle NULL Value
Calculate Cumulative Perentage	Find Score Cutoff Value	Remove Duplicates	Calculate Correlation Coefficients	Oracle vs SQL Server	Random Sampling
Table Insert	Read Only Table	Clustering	Ranking	Find Most Frequent	Median Value
Oracle Source Code	Debug PL/SQL	Hide PL/SQL Scripts	Repair Views	Dump Schema	Move Big Files to Amazon

Popular Topics

Popular Topics

Sunday, September 02, 2012

Dealing with missing values in data mining

No comments: