Problem
We want to calculate the correlation coefficient between two variables, such as the length of employment and salary within a company. If the data are Oracle database table, how do we do it using SQL?SQL> select employee_id, department_id, hire_date, salary from EMPLOYEES where rownum <=20; EMPLOYEE_ID DEPARTMENT_ID HIRE_DATE SALARY ----------- ------------- --------- -------- 100 90 17-JUN-87 24000.0 101 90 21-SEP-89 17000.0 102 90 13-JAN-93 17000.0 103 60 03-JAN-90 9000.0 104 60 21-MAY-91 6000.0 105 60 25-JUN-97 4800.0 106 60 05-FEB-98 4800.0 107 60 07-FEB-99 4200.0 108 100 17-AUG-94 12000.0 109 100 16-AUG-94 9000.0 110 100 28-SEP-97 8200.0 111 100 30-SEP-97 7700.0 112 100 07-MAR-98 7800.0 113 100 07-DEC-99 6900.0 114 30 07-DEC-94 11000.0 115 30 18-MAY-95 3100.0 116 30 24-DEC-97 2900.0 117 30 24-JUL-97 2800.0 118 30 15-NOV-98 2600.0 119 30 10-AUG-99 2500.0 20 rows selected.
Solution
We use Oracle corr function to calculate the correlation coefficient. The following is the query (since the data is old, we assume that the current date is May 30, 2000.)
SQL> select corr(salary, to_date('20000530','YYYYMMDD')-hire_date) cor from EMPLOYEES; COR ---------- .497838041We can verify the correlation coefficient using the following query based on the equation for calculating correlation coefficient.
SQL> with tbl as (select avg(salary) m1, stddev(salary) s1, avg(to_date('20000530','YYYYMMDD')-hire_date) m2, stddev(to_date('20000530','YYYYMMDD')-hire_date) s2 , count(1) n from EMPLOYEES), tbl2 as (select salary, to_date('20000530','YYYYMMDD')-hire_date l from employees) select sum( (salary-m1)*(l-m2)/(n-1)/(s1*s2) ) from tbl, tbl2; SUM((SALARY-M1)*(L-M2)/(N-1)/(S1*S2)) ------------------------------------- .497838041
No comments:
Post a Comment