We need to be aware of how NULL/missing values are handled in SQL query so that we will not be surprised by query results that appear "wrong". This is descried using the following simple table as an example. The fifth record has a NULL value.
SQL> select id, value from tbl_data order by id;
ID | VALUE |
---|---|
1 | -1 |
2 | 0 |
3 | 1 |
4 | 2 |
5 |
If we calculate the total number of records in the table, number of records with values>=0 and values <0, they are 5, 3 and 1, respectively, as shown below. As we can see, the number of records for values >=0 (3) plus that <0 (1) is less than the total number of records (5). This is because NULL values appear in the SQL where clause exclude records from the consideration.
SQL> select count(*) from tbl_data;
COUNT(1) |
---|
5 |
SQL> select count(*) from tbl_data where value>=0;
COUNT(1) |
---|
3 |
SQL> select count(*) from tbl_data where value<0;
COUNT(1) |
---|
1 |
A better way to calcluate this kind of statisitcs is to use "case when" instead of where clause as shown below.
SQL> select count(*) total, sum(case when value>=0 then 1 else 0 end) non_negative, sum(case when value<0 then 1 else 0 end) negative, sum(case whe n value is null then 1 else 0 end) n_missing from tbl_data;
TOTAL | NON_NEGATIVE | NEGATIVE | N_MISSING |
---|---|---|---|
5 | 3 | 1 | 1 |
Or even better, we combine "case when" with "group by" to calculate the statistics. "Group by" is one of my favorites as it gives the complete picture (including NULL values) about the data.
SQL> select v, count(*) from (select case when value>=0 then 'non-negative' when value <0 then 'negative' else 'missing' end v from tbl_data) group by v;
V | COUNT(1) |
---|---|
negative | 1 |
non-negative | 3 |
missing | 1 |
In summary, if we are aware of how NULL values are handles in the database, we will not be surprised by query results that appear "wrong".
No comments:
Post a Comment