One of the features in Oracle 12c is the ability of building and running models on the fly ( as opposed to the conventional two steps of building a persistent model first and then applying it to the data). On the fly predictive models provides data miners powerful tools for accomplishing sophisticated tasks with easy. I will use on the fly k-means clustering as an example. Take a look at the following simple consumer data. Cust_ID is the unique identifier for a consumer.
SQL> select * from TBL_CUSTOMER3 where rownum <=10; CUST_ID INCOME YEARS_RES HOME_VALUE SEX MARRIED ---------- ---------- ---------- ---------- ---------- ---------- 7422 87499.5 2 117 "M" "M" 3356 15 80 "M" "U" 4782 62499.5 3 91 "M" "M" 7333 87499.5 7 85 "M" "M" 890 42999.5 1 58 "F" "U" 6401 87499.5 5 128 "M" "M" 2356 87499.5 4 96 "M" "M" 1638 87499.5 13 152 "M" "M" 6713 62499.5 6 49 "M" "U" 3674 87499.5 3 119 "M" "M" 10 rows selected.To understand consumers better, usually we want to categorize consumers into small number of groups, instead of dealing with them individually. We can using Oracle predictive function cluster_id() over() to find clusters on the fly as shown below. Here, cluster_id(into 5 using income, YEARS_RES, HOME_VALUE) returns 5 cluster identifiers using the default clustering algorithms (K-means) based on variables income, YEARS_RES, HOME_VALUE. Missing input variables are OK as they will be replaced by the means. Please notice that the numbering of cluster identifier is not important at all,i.e., as long as clusters are different we can number cluster id arbitrarily.
SQL> select * from (select cust_id, cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid from TBL_CUSTOMER3) where rownum <=10; CUST_ID CID ---------- ---------- 7422 9 3356 6 4782 7 7333 8 890 5 6401 9 2356 9 1638 8 6713 7 3674 9 10 rows selected.We can calculate some summary information about each group using the following query to gain insight into the data.
SQL> select cid, count(*), avg(income) avg_income, avg(years_res) avg_years_res, avg(home_value) avg_home_value from (select cust_id, income, years_res, home_value, cluster_id(into 5 using income, YEARS_RES, HOME_VALUE ) over() cid from TBL_CUSTOMER3) group by cid order by count(*) desc; CID COUNT(*) AVG_INCOME AVG_YEARS_RES AVG_HOME_VALUE ---------- ---------- ----------- ------------- -------------- 7 152 62499.50 3.88 92.89 9 126 87499.50 2.75 117.98 5 110 47254.05 2.58 67.37 8 65 87499.50 9.15 105.58 6 31 55749.50 13.45 82.39The clustering models are generated on the fly. When the queries finish, the models are gone. Thus it is very "clean" to run queries involving on the fly models. Please notice that in the above query, we combine predictive model with standard SQL (group by cid). This is extremely powerful in that we seamlessly raise conventional SQL queries up to a higher level of predictive modeling. We instantly turn a SQL developer into a data miner. Bingo! (If you are interested in this point, please also see my early posts From Oracle SQL Developer to Data Miner and How easy is it for a SQL developer/analyst to become a data miner?