Problem
How do we randomly sample, say 20 records, from the table below that contains 107 employees?SQL> select EMPLOYEE_ID, FIRST_NAME, LAST_NAME from EMPLOYEES where rownum <=10; EMPLOYEE_ID FIRST_NAME LAST_NAME ----------- -------------------- ------------------------- 100 Steven King 101 Neena Kochhar 102 Lex De Haan 103 Alexander Hunold 104 Bruce Ernst 105 David Austin 106 Valli Pataballa 107 Diana Lorentz 108 Nancy Greenberg 109 Daniel Faviet
Solution
We use the method describe in Randomly Sample a Precise Number of Records and More on random sampling in Oracle. Let's follow the steps.
1. Verify that employee_id is indeed the unique identifier.
SQL> select count(*), count(distinct employee_id) from EMPLOYEES; COUNT(*) COUNT(DISTINCTEMPLOYEE_ID) ---------- -------------------------- 107 1072. Generate a random number for each employee_id using dbms_random.value. Function dbms_random.value generates uniformly distributed random number greater than or equal to 0 and less than 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, dbms_random.value rnd from EMPLOYEES; Table created.The following is what tbl_emp_id_rnd looks like.
SQL> select * from tbl_emp_id_rnd where rownum <=10; EMPLOYEE_ID RND ----------- ---------- 100 .466996031 101 .325172718 102 .643593904 103 .822225992 104 .657242181 105 .244060518 106 .446914037 107 .423664122 108 .033736378 109 .405546964 10 rows selected.3. Generate unique rank based on the random number for each record.
SQL> create table tbl_emp_id_rnd_rnk as select a.*, row_number() over(order by rnd) rnk from tbl_emp_id_rnd a; Table created.The following is what the table with rank looks like. To get the randomly sampled 20 records, we simply select records with ranks less or equal 20. The trick is based on that the rank of random numbers are also random.
SQL> select * from tbl_emp_id_rnd_rnk where rnk <=20 order by rnk; EMPLOYEE_ID RND RNK ----------- ---------- ---------- 177 .001160678 1 121 .001293396 2 151 .013037966 3 126 .018864319 4 108 .033736378 5 158 .09466332 6 168 .095874461 7 165 .116139108 8 166 .116236033 9 149 .121777964 10 136 .126158543 11 116 .140726813 12 193 .147478226 13 143 .164627124 14 194 .175347727 15 145 .206125327 16 176 .207418722 17 160 .207918621 18 201 .21827842 19 183 .218636576 20 20 rows selected.
Conclusions
We use Oracle dbms_random.value to first generate a random number of each record in the table and then row_number functions to calculate the rank of the random number. We then use the rank to select the desired number of records. Please notice that it is not a good practice to use the employee_id in the table directly for sampling purpose because we are not sure if it is randomly generated.
No comments:
Post a Comment