Problem
Random sampling is a very common task in a data analytics project. Sometime we want to sample precise number of records such as described in the post Take Randomly Sampled Records From the Table. Sometime, we randomly split a data set into training and testing sets at 70/30 (or whatever ratio). How do we efficiently perform multiple random sampling tasks?
Solution
I have found it is convenient to create a permanent table that contains the unique record id (or account id, card number, etc.) and a uniformly distributed random number. The following is an example of creating such a table described in Take Randomly Sampled Records From the Table. dbms_random.value returns uniformly distributed random number ranging from 0 to 1.
SQL> create table tbl_emp_id_rnd as select EMPLOYEE_ID, dbms_random.value rnd from EMPLOYEES; Table created. SQL> select * from tbl_emp_id_rnd where rownum <=10; EMPLOYEE_ID RND ----------- ---------- 100 .466996031 101 .325172718 102 .643593904 103 .822225992 104 .657242181 105 .244060518 106 .446914037 107 .423664122 108 .033736378 109 .405546964 10 rows selected.We can perform many types of random sampling based on such a table whenever we need to. For example, we can split the records into 70% training and 30% testing set. And we can change the ratio of 70/30 easily.
SQL> create view v_taining_emp_id as select EMPLOYEE_ID from tbl_emp_id_rnd where rnd <=0.7; View created. SQL> create view v_testing_emp_id as select EMPLOYEE_ID from tbl_emp_id_rnd where rnd >0.7; View created.I have found such a table is extremely helpful in a project where randomly sampling is performed multiple times.
No comments:
Post a Comment