The following examples illustrate how to replace missing values in categorical variables.
Original data:
Record_ID Variable_X
1 A
2 B
3 A
........................
95 Missing
Method 1. Replace missing value with a special value (e.g. X)
Record_ID Variable_X
1 A
2 B
3 A
........................
95 X
In Oracle, this can be done easily using function nvl(). The following is the SQL scripts to perform the replacement.
select nvl(Variable_X,'X') as Variable_X, ..... from original_table;
Method 2. Replace missing value with the most frequent value.
Record_ID Variable_X1 A
2 B
3 A
........................
95 A
The following is the SQL scripts to perform the replacement.
select nvl(Variable_X,'A') as Variable_X, ..... from original_table;
The following examples illustrate how to replace missing values in continuous variables.
Original data:
Record_ID Variable_X
1 1.3
2 0.5
3 2.9
........................
95 Missing
Method 3. Replace missing value with mean or median value.
Record_ID Variable_X
1 1.3
2 0.5
3 2.9
........................
95 1.56
The following is the SQL scripts to perform the replacement.
select nvl(Variable_X, 1.56) as Variable_X, ..... from original_table;
Method 4. We first convert continuous variable into categorical variable. This is also called binning. I will discuss binning in another post. We define missing as a special category.
The following is the SQL scripts to perform the above logic.
case when Variable_X <=1.2 then 'A'
when Variable_X <=1.8 then 'B'
when Variable_X > 1.8 then 'C'
else 'Z' end Variable_X_BIN
New Variable Variable_X_BIN.
Record_ID Variable_X --> Variable_X_BIN
1 1.3 ---> B
2 0.5 ---> A
3 2.9 --->B
........................
95 missing ---->Z
There are more sophisticated methods to impute missing values using multiple variable models. But they are rarely used in real world applications.
No comments:
Post a Comment