Sunday, November 15, 2009

Comparative Window Functions...

I've been known as a huge fan of Analytic functions (as evidenced by the Rock and Roll linkability!)

And - they could be getting better in the near future. Read this document for a proposal to allow analytics to access the current row value to be compared against any other row value in a defined window.

I've already supplied them with my feedback (which started with "this is an awesome idea") - and you can too - by posting it here. They'll be checking back to see what you say.

Also, this is being proposed as well:

Another window function extension, not contained in the attached proposal, is the notion of VALUE based windows. Currently, we have ROW based (or physical) and RANGE based (logical) windows. RANGE window has limitation in that there can only be one sort key in window ORDER BY. On the other hand, ROW based window is agnostic to column value and can be non-deterministic.

The new VALUE based window allows one to include all rows with "n" values before or after the current row's value. For example, VALUE 2 PRECEDING and 3 FOLLOWING would include all rows with 2 values that are prior to current row's value and all rows with 3 values that come after the current row's value in sort order.

ticker txndate volume
orcl 1 10
orcl 2 10 <--------------------------- start of window for (orcl,6,12)
orcl 2 11
orcl 2 11
orcl 3 11
orcl 6 12 <=== assume this is current row
orcl 7 12
orcl 11 11
orcl 11 12
orcl 11 12
orcl 13 11 <------------------------- end of window for (orcl,6,12)


Similar RANGE window would have rows [orcl,6,12] through [orcl,7,12]. Similar
ROW window would include rows [orcl,3,1] through [orcl,11,11].

The VALUE based window would find usefulness when there are gaps in the dataset. For example, a query like "find the intra-day maximum for a stock in the past three trading days". Today, to do this one has to aggregate on trading date and then compute the moving max (in the past 3 days).

VALUE based window can have multiple keys in ORDER BY.

Thanks in advance for any feedback or ideas you might have on this.
POST A COMMENT

28 Comments:

Blogger VA said....

Doesn't the MODEL clause do this sort of thing already?

Sun Nov 15, 10:38:00 AM EST  

Blogger Thomas Kyte said....

@VA

there are common attributes between analytics and the model clause.

what I like about analytics is, well, they are definitely more readily understood.

And analytics are becoming widely implemented in various databases.

Also, connect by and recursive with subquery factoring - duplicate sorts of functionality. decode and case. X and Y, are always lots of ways to do something - some are just easier than others.


How about this for a comparision, if you say the model clause does this sort of thing, that means that model can do all/many/most analytics - meaning, why have them (analytics) at all?

Sun Nov 15, 10:54:00 AM EST  

Blogger VA said....

Tom - Don't get me wrong, I love analytics. MODEL is frankly too complex (for me) so if analytics is enhanced to do more things using the same intuitive, familiar syntax, that will be great!

Sun Nov 15, 11:14:00 AM EST  

Anonymous Anonymous said....

Somehow I have the idea this will make people even more confused about the terms null and index.

Sun Nov 15, 01:11:00 PM EST  

Blogger SeanMacGC said....

Agreed, this is indeed an awesome prospect. I think the idea of 'VALUE based windows' represents a very valuable extension too.

As to the authors' question of whether CURRENT_ROW might be a better term than ANCHOR_ROW, I would tend to agree with that, and would have a preference (slightly) for the former.

Mon Nov 16, 05:18:00 AM EST  

Anonymous Matthias Rogel said....

one point I do not understand:
in Section 3.4 they say
"The semantics of markers and offsets as described in the previous section for physical windows (ROWS specification) would work for logical windows (RANGE specification) when there are no duplicates."

However, in physical windows, there might also be duplicates ! In their examples, there are no duplicates because of footnote 1: "(ticker, day) primary key".

However, what if the order-by-clause is not deterministic ?

For example:
create table t(s varchar2(10), i int, j int, k int);
SQL> insert into t(s, i, j, k) values('abc', 1,1,3);
SQL> insert into t(s, i, j, k) values('abc', 1,1,4);
SQL> insert into t(s, i, j, k) values('abc', 2,8,9);
SQL> select * from t;

S I J K
---------- ---------- ---------- ----------
abc 1 1 3
abc 1 1 4
abc 2 8 9

what would

select x
from
(
select i ,
sum(case when 3=INDEX(k, first_row) then 1 end)
over (partition by s order by i, j rows between 2 preceding and 1 preceding) as x
from t
)
where i=2

return ?

of course, now we are also indeterministic:
SQL> select t.*, sum(k) over (partition by id order by i, j rows between 2 preceding and 1 preceding) as x
2 from (select * from t order by dbms_random.value) t;

ID I J K X
---------- ---------- ---------- ---------- ----------
abc 1 1 3
abc 1 1 4 3
abc 2 8 9 7
SQL> /

ID I J K X
---------- ---------- ---------- ---------- ----------
abc 1 1 3
abc 1 1 4 3
abc 2 8 9 7
SQL> /

S I J K X
---------- ---------- ---------- ---------- ----------
abc 1 1 4
abc 1 1 3 4
abc 2 8 9 7

Mon Nov 16, 06:31:00 AM EST  

OpenID craigc said....

IsPresent stuck out as a little awkward for me. I see that microsoft has an IsNull function (not a boolean function though - is it in the standard?).

... is it possible to to use Null in the name - for ease of assimilating the contruct. Oracle SQL would naturally use IsNVL(?)

Mon Nov 16, 05:26:00 PM EST  

Blogger ryelli said....

I couldn't agree more with extending the analytics clauses. I have not had time to digest the model clause, but it took very little time to understand the analytics syntax and usage.

I've been solving many problems that are either complex to solve or, while not complex, are excessively verbose with ease and elegance using analytics.

Any and all additions to the analytics suite is wonderful. Even if the new stuff gets too complex to understand, many can still comprehend and effectively use the basic syntax.

Tue Nov 17, 08:08:00 AM EST  

Anonymous Alberto Dell'Era said....

It is definetely an excellent idea, I remember having to write some queries very similar to Q2, and I would welcome very much such an improvement in both expressive power and performance.

I'm wondering about syntax - why not using vector-style square brackets, I mean

price[ ANCHOR_ROW ]

instead of

INDEX( price, ANCHOR_ROW )

or

max( sal[ FIRST_GROUP) ] )

instead of

INDEX( max(sal), FIRST_GROUP)

it looks more intuitive and readable to me.

Tue Nov 17, 09:31:00 AM EST  

Blogger Mark Stewart said....

I love the idea of value range windows; always seemed a more natural and useful semantic than the current row and range windows. But then I'm just learning this whole analytics stuff. Anything beyond a lag() is beyond me! :)

Wed Nov 18, 02:03:00 PM EST  

Anonymous Log Buffer said....

"Tom Kyte wants your opinions on comparative window functions: [...]"

Log Buffer #170

Fri Nov 20, 03:00:00 PM EST  

Blogger Mishra said....

Hi TOM,
Can you please suggest any good book for better understanding of Oracle Analytical functions as I'm a datawarehouse developer and need these functions regularly but never sure which one to use...or sometimes don't even know their existence as well...as in the case of PRECEDING..never knew this existed...

Thanks in advance.

Wed Nov 25, 06:16:00 AM EST  

Blogger Thomas Kyte said....

@Mishra

have you read the data warehousing guide? It is pretty comprehensive on this topic:

http://download.oracle.com/docs/cd/B19306_01/server.102/b14223/analysis.htm#i1007779

Wed Nov 25, 08:52:00 AM EST  

Blogger maninder said....

nice blog
Cloud Services
Linux

Thu Nov 26, 07:45:00 AM EST  

Blogger Charlie 1 木匠 said....

In what future Oracle database release can we use these new analytic functions?

Fri Nov 27, 04:03:00 PM EST  

Blogger Thomas Kyte said....

@Charlie

sometime after 11g Release 2 ;) sorry, that is about as specific as we can be...

Fri Nov 27, 04:38:00 PM EST  

Blogger xuhui said....

Hi Tom,
Is there a nice way using analytic functions to get this?

CREATE TABLE T (ID NUMBER PRIMARY KEY, VALUE VARCHAR2(10));

INSERT INTO T VALUES (1,'a');
INSERT INTO T VALUES (2,'b');
INSERT INTO T VALUES (3,'c');
INSERT INTO T VALUES (4,'a');
INSERT INTO T VALUES (5,'b');
INSERT INTO T VALUES (6,'c');

SELECT id,value, COUNT(DISTINCT VALUE) OVER(ORDER BY ID) FROM T;

ERROR at line 1:
ORA-30487: ORDER BY not allowed here

What I'm expected:

SELECT id,value, (SELECT COUNT(DISTINCT VALUE) FROM t WHERE id<=t1.id)
FROM t t1;

1,'a',1
2,'b',2
3,'c',3
4,'a',3
5,'b',3
6,'c',3

Thank you!

Sat Nov 28, 09:23:00 PM EST  

Anonymous Sokrates said....

xuhui,

would be a nice enhancement, I think.
There seems to be no obvious reason for this restraint

Sun Nov 29, 04:12:00 AM EST  

Anonymous Centinul said....

xuhui --

Interesting problem for analytical functions. It looks like you are after a cumulative count capability.

I haven't done any testing outside of the sample data set you provided.

Here is what I came up with:

SELECT ID
, VALUE
, SUM(LAG_CHECK) OVER (ORDER BY ID, VALUE) AS CUMULATIVE_COUNT
FROM
(
SELECT ID
, VALUE
, (CASE WHEN LAG(VALUE) OVER (PARTITION BY VALUE ORDER BY ID) = VALUE THEN 0 ELSE 1 END) AS LAG_CHECK
FROM T
)
ORDER BY 1, 2

I haven't done any performance comparisons against your non-analytical solution either.

Mon Nov 30, 11:03:00 PM EST  

Blogger xuhui said....

Thank you Centinul! That is a nice solution.

Fri Dec 04, 01:05:00 PM EST  

Blogger Dan said....

I think pretty much any extension to the analytic functions would be great. I'm looking for the ability to obtain a column which corresponds to a value obtained by an analytic function with window.

i.e. If I have a table with, say, test results and dates. I can obtain the maximum test result in a moving daily window using the current 'range' function. What I can't do, is obtain the date at which the maximum occurred. Is there a way to do this currently, or would this be a useful addition?

Thanks,

Dan Scott
http://danieljamesscott.org

Tue Jan 12, 09:59:00 AM EST  

Blogger Thomas Kyte said....

@Dan

a trick can get that, encode the data so that it is sortable, and then extract what you need.

find the max hiredate of the set of people that have the maximum salary in EMP partitioned by job and sorted by ename (hint: the words "of the set of" should indicate why what you ask for isn't necessarily possible - there are MANY rows that have the max value in many cases!)


ops$tkyte%ORA10GR2> select ename, job, sal, hiredate,
2 to_date( substr(
3 max( to_char( nvl(sal,0), 'fm00000.00' ) || to_char( hiredate, 'yyyymmddhh24miss' ))
4 over ( partition by job order by ename ) , 9 ), 'yyyymmddhh24miss' )
5 from scott.emp
6 /

ENAME JOB SAL HIREDATE TO_DATE(SUBSTR(MAX(T
---------- --------- ---------- -------------------- --------------------
FORD ANALYST 3000 03-dec-1981 00:00:00 03-dec-1981 00:00:00
SCOTT ANALYST 6727.52 19-apr-1987 00:00:00 19-apr-1987 00:00:00
ADAMS CLERK 1100 23-may-1987 00:00:00 23-may-1987 00:00:00
JAMES CLERK 950 03-dec-1981 00:00:00 23-may-1987 00:00:00
MILLER CLERK 1300 23-jan-1982 00:00:00 23-jan-1982 00:00:00
SMITH CLERK 800 17-dec-1980 00:00:00 23-jan-1982 00:00:00
BLAKE MANAGER 2850 01-may-1981 00:00:00 01-may-1981 00:00:00
CLARK MANAGER 2450 09-jun-1981 00:00:00 01-may-1981 00:00:00
JONES MANAGER 2975 02-apr-1981 00:00:00 02-apr-1981 00:00:00
KING PRESIDENT 5000 17-nov-1981 00:00:00 17-nov-1981 00:00:00
ALLEN SALESMAN 1600 20-feb-1981 00:00:00 20-feb-1981 00:00:00
MARTIN SALESMAN 1250 28-sep-1981 00:00:00 20-feb-1981 00:00:00
TURNER SALESMAN 1500 08-sep-1981 00:00:00 20-feb-1981 00:00:00
WARD SALESMAN 1250 22-feb-1981 00:00:00 20-feb-1981 00:00:00

14 rows selected.

Tue Jan 12, 12:26:00 PM EST  

Blogger Dan said....

Hi,

Excellent, thanks for that. That will certainly work, but it is a little fragile due to the datatype in the result column being encoded in the sort string.

I have found the FIRST function which appears to be exactly what I'm looking for. However, this function only accepts 'PARTITION BY', it won't accept a 'RANGE' which is a shame. My alternative would be to do a self join (yuck) to obtain the 2 day window, and then use analytic functions/GROUP BY to obtain the desired data.

I agree, your 'set of' comment does indicate an ambiguity in what I'm asking. In this case, it would be nice to nest aggregate functions. i.e. If there are multiple dates which have the MAX result, then use another aggregate function to select the desired value (FIRST_VALUE, COUNT, AVG, MAX, etc). Similar to the GROUP BY clause.

Thanks again for your solution,

Dan

Tue Jan 12, 01:00:00 PM EST  

Blogger Thomas Kyte said....

@Dan

can you expand on your fragile comment? How is this fragile?


We encode the data in a non-fragile fashion. Please give a 'fragile' example

Tue Jan 12, 01:09:00 PM EST  

Blogger Dan said....

Hi,

I just meant that the code is dependent on the datatype of the column and we have to be careful to ensure that the encoding/decoding is correct.

If we want to extract the MAX value as well as the date from the concatenated string, then it must match the format 'fm00000.00' used in the encoding.

Each time this code is used, then we have to make sure that the format string is altered in two places to match the datatype of the column.

So, if sal is an integer, then it would have one format, if it is a date, another must be used, if it is a floating point number, another again, etc.

It just feels a little 'fragile' to have datatype encoding/decoding in the code because it cannot easily be reused. :)

Thanks,

Dan

Tue Jan 12, 01:38:00 PM EST  

Blogger THE RAJESH BLOG said....

Tom,

I am trying to make use of INDEX()& ANCHOR_ROW demonstration. but i ended up with error.

hr@10G> SELECT EMPNO,
2 ename,
3 JOB,
4 mgr,
5 SAL,
6 deptno,
7 sum( case when sal >= index(sal,anchor_row) then sal else 0 end) over(order by sal rows between 4 PRECEDING and current row) as rnk
8 FROM EMP
9 WHERE DEPTNO = 20
10 ORDER BY deptno
11 /
sum( case when sal >= index(sal,anchor_row) then sal else 0 end) over(order by sal rows between 4 PRECEDING and current row) as rnk
*
ERROR at line 7:
ORA-00936: missing expression

It that not supported in 10gR2?

Sat Jan 23, 06:36:00 AM EST  

Blogger Thomas Kyte said....

@Rajesh

this was a proposal, an idea, being thrown out to get feedback. It currently isn't available with any release of the software.

Sat Jan 23, 09:27:00 AM EST  

Blogger THE RAJESH BLOG said....

Tom,

I know " Analytics are the coolest thing to happen to SQL since the keyword Select ". If these features were included then Analytics will become " Most Coolest thing " :)

Mon Jan 25, 01:15:00 AM EST  

POST A COMMENT

<< Home