Tuesday, April 03, 2007

So, in your opinion...

I was asked to review a SQL feature under consideration. It looks pretty neat - pattern matching over partitioned, ordered rows. It can answer questions like "find me the stocks that are exhibiting a 'W' pattern" (trying to catch them on the up-swing). Or maybe "find me things that started out expensive, but have (rapidly) over time become very cheap". Or network usage / audit trail analysis (looking for specific patterns of behavior).

So, I got the paper (you can get it here) and read it. I gave my feedback, but I'd like to hear yours - specifically:

  • Do you think you'd find a SQL feature like this useful, truly useful?
  • Do you think the suggested implementation is "complete" enough?
  • Anything else you'd like to say.

I can assure you the people working on this will be reading your feedback - definitely.

A word of warning though - the paper is a bit "dense"; it might well take more than a single reading to 'get it'. A working knowledge of the existing set of analytic functions and their syntax would be really useful. If I didn't know analytics as well as I do - I'm not sure how far I'd make it through the paper and still be able to understand it.

I can say - I've seen the SQL needed to recognize a "W" using analytics... It is not "pretty"; this proposed language feature would make it pretty easy - once you 'got it'. (For a taste of the proposed syntax, see the sketch below.)
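As a minimal sketch only: here is roughly what a "W" query might look like under the paper's proposed MATCH_RECOGNIZE syntax. The stocks table and its columns are invented for illustration, and the exact spelling of the clauses may differ from what the paper (or any eventual release) specifies:

SELECT ticker, w_start, w_end
FROM stocks MATCH_RECOGNIZE(
       PARTITION BY ticker ORDER BY trade_date
       ONE ROW PER MATCH
       MEASURES first(down1.trade_date) AS w_start,
                last(up2.trade_date)    AS w_end
       -- a "W": down, up, down, up
       PATTERN (down1+ up1+ down2+ up2+)
       DEFINE
         down1 AS down1.closing_value < PREV(down1.closing_value),
         up1   AS up1.closing_value   > PREV(up1.closing_value),
         down2 AS down2.closing_value < PREV(down2.closing_value),
         up2   AS up2.closing_value   > PREV(up2.closing_value)
     );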

40 Comments:

Anonymous Anonymous said....

I have one general comment. Can Oracle make sure to document new SQL features better and provide examples? The MODEL clause is poorly documented and does not have that many examples. Just saying it provides Excel-like functionality isn't enough.

several examples really help you to get it.

this might help financial companies doing data mining for patterns in stocks. I would bet they have a lot of complicated software right now to do this. If they actually have good database people they could move it to SQL.

Tue Apr 03, 05:30:00 PM EDT  

Blogger Niall Litchfield said....

that is not a paper to read at twenty to eleven at night! Zzzzz. I guess I'll read it another day.

Tue Apr 03, 05:40:00 PM EDT  

Anonymous T French said....

One application might be for use with determining runs, trends and outliers in the context of statistical process control (SPC).
The quality of manufactured products is always subject to a certain amount of variation as a result of chance. Some stable "system of chance causes" is inherent in any production process, and variation within this stable pattern is inevitable. Statistical process control is used to detect those situations that are outside this normal variability, allowing one to act on true anomalies and NOT act on others; acting on the latter results in 'tampering' and will likely make a stable process unstable. Consequently, this approach is used in many industries to improve quality and reduce costs. It involves detecting patterns in a series of summary values (usually a time series) that should only occur rarely. A set of rules is often followed that defines what an anomalous situation might be.

Some of the usual rules (a few of which are often ignored) are as follows:

a) 1 point above Upper Spec
b) 1 point below Lower Spec
A) 1 point above Zone A
B) 1 point below Zone A
C) 2 of 3 successive points in upper Zone A or beyond
D) 2 of 3 successive points in lower Zone A or beyond
E) 4 of 5 successive points in upper Zone B or beyond
F) 4 of 5 successive points in lower Zone B or beyond
G) 8 points in a row above center line
H) 8 points in a row below center line
I) 15 points in a row in Zone C (above and below center)

J) 8 points on both sides of center with 0 in Zone C
K) 14 points in a row alternating up and down
L) 6 points in a row steadily increasing or decreasing

Zones A, B, and C refer, for example, to the overall population mean +/- 1, 2, or 3 standard errors. (For a feel of how one of these rules might map onto the proposed feature, see the sketch below.)
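Purely as an illustration, rule G ("8 points in a row above center line") might come out something like this under the paper's proposed syntax; the spc_readings table, its columns, and the precomputed center_line value are assumptions, not anything from the paper:

SELECT process_id, run_start, run_end
FROM spc_readings MATCH_RECOGNIZE(
       PARTITION BY process_id ORDER BY sample_time
       ONE ROW PER MATCH
       MEASURES first(abv.sample_time) AS run_start,
                last(abv.sample_time)  AS run_end
       -- rule G: eight consecutive points above the center line
       PATTERN (abv abv abv abv abv abv abv abv)
       DEFINE
         abv AS abv.reading > abv.center_line
     );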


T. French

Tue Apr 03, 05:53:00 PM EDT  

Anonymous Chris Poole said....

Hi Tom,

All I can say is yes please and more of the same. I am currently working on SQL (using analytics) to do *exactly this* for stock market trend analysis. I'm sure I'm not the only one out there.

I'm just wondering if you would be able to share the SQL you allude to in your post? The one that recognises a 'W'? That's a classic 'double bottom' and normally signals a trend reversal! :)

Cheers

Chris


Word verification: krbqmb
I'm sure that generated an OERI 600 for me once!

Tue Apr 03, 07:36:00 PM EDT  

Anonymous Piero said....

Incredible!
That's what I've always desired. Instead of writing VBA code in MS Access (after some digging with analytics), I would be able to find new patterns and write down the old ones just by writing SQL statements.
Moreover I would throw away all that code: just SQL

Tue Apr 03, 11:42:00 PM EDT  

Anonymous darcy said....

I just learned the table ( cast ....... ) thingy and it's really great because I can't use a global temporary table. Schmancey has its time and place.

But you know what I would really like?

A boolean datatype for table columns et al.

If it's already been announced for 11, colour me HAPPY.

Wed Apr 04, 01:09:00 AM EDT  

Blogger Milo said....

Would Oracle truly be able to do this faster/better than, say, a bit of procedural code retrieving the data and then looping over it to detect trends?

I'm all for "do it all in a single SQL statement", but sometimes that just isn't appropriate. I'm honestly wondering if there is a benefit to making Oracle do this.

Wed Apr 04, 09:53:00 AM EDT  

Blogger Thomas Kyte said....

Would Oracle truly be able to do this faster/better than, say, a bit of procedural code retrieving the data and then looping over it to detect trends?

ABSOLUTELY it would

Just like analytics

Just like lots of stuff built in to sql

absolutely

Wed Apr 04, 09:55:00 AM EDT  

Blogger Milo said....

In that case I would say: Yes this is useful. Unfortunately, it's like regular expressions, which are extremely useful, only you can't always tell ahead of time when you'll need them.

I think this will actually hit a lot of the problems that regular expressions have:
* Syntax scaring people away
* Easy to introduce bugs that are hard to spot
* Not everyone's an expert

I don't see a purpose for MATCH_RECOGNIZE for myself right now, but that doesn't mean that I wouldn't end up using it if it was available...

Wed Apr 04, 10:07:00 AM EDT  

Blogger Steve said....

I can definitely see utility in the new feature.

As our society digests the mountain of information we create, there is an increasing desire to be "alerted" when certain conditions occur. In the past, the conditions were simple: tell me when sales are 10% above last quarter.

However, that is no longer meaningful or actionable. As T French mentioned, many companies are now looking for more complex pattern recognition to drive decisions and action. For example: tell me when sales have been above average for 3 quarters in a row (something like the sketch below).
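As a rough sketch only (the quarterly_sales table, its columns, and the precomputed avg_sales value are invented for illustration), that might look like:

SELECT region, streak_start
FROM quarterly_sales MATCH_RECOGNIZE(
       PARTITION BY region ORDER BY quarter
       ONE ROW PER MATCH
       MEASURES first(abv.quarter) AS streak_start
       -- three consecutive quarters above the average
       PATTERN (abv abv abv)
       DEFINE
         abv AS abv.sales > abv.avg_sales
     );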

Of course, companies have tools like statistics programs, BI tools, etc. that will all produce more complex alerting. But we know, just as with constraint checking, that the closer this is handled to the database, the better off we are.

By allowing easier pattern recognition in the SQL, we could make the world of information more reliable. Why? Because now the business logic would live with the data and not with the application. That opens up the possibility that many specialized applications could look at the same information and give the same answers/alerts.

I'm salivating already.

Wed Apr 04, 10:13:00 AM EDT  

Anonymous Mark said....

I don't have a current need for such a feature, but it sounds like a nice tool to have in a database "swiss army knife" so to speak...

Mark

Wed Apr 04, 10:40:00 AM EDT  

Anonymous Anonymous said....

I would think Oracle could make it faster because they have access to the code that runs the database engine. So they can go in and build a low-level engine optimized for certain types of data retrieval.

Is this new feature for version 11? I thought version 11 was due out this month at the earliest?

Wed Apr 04, 12:18:00 PM EDT  

Blogger Thomas Kyte said....

this would not be an 11g R1 feature; it is still under discussion.

Just trying to get a feel for the "do you think this is useful" factor

Wed Apr 04, 12:23:00 PM EDT  

Anonymous Anonymous said....

Enhancement in an excellent direction, I must agree. The question I would have is: how would this data be queried? Consider ticker information pouring in at the rate of 1000s per second. It would not work if we have to save that data into a database and then query it; the latency at that point would not be acceptable. If these "window'd" functions could be run directly on data pouring in, it would be impressive. Something like data pouring into, e.g., a TimesTen db (or any in-memory db) and then being queried there to reduce the latency. From my point of view, any decision that would have to be made based on these new functions would have to be fast! So, all in all, excellent ideas in the paper, but it would be nice to go a few extra steps and have a way to query using these calls in real time.

Wed Apr 04, 01:20:00 PM EDT  

Blogger Eriks said....

Key words are once you 'got it'. Definitely not an easy feature to learn. I can only imagine scratching my head, trying to figure out what the previous guy/girl had in mind when writing this and why it does not work. Apart from that, it looks useful, even though I haven't had to accomplish anything like this, even in a non-SQL way.

Wed Apr 04, 02:03:00 PM EDT  

Blogger Alberto Dell'Era said....

I disagree on the syntax being ugly; I am far from completely "getting" all of it, but after a couple of peeks I have already understood the general structure (something that doesn't happen to me very often, I must add).

An equivalent implementation in PL/SQL or any other language would be, I'm sure, a couple of orders of magnitude harder to understand.

Wed Apr 04, 03:25:00 PM EDT  

Anonymous kevin balfe said....

I can hardly say that I understand ALL of it, but the syntax seems remarkably transparent, considering what it does. The DEFINE clause strikes me as a model of clarity and good functionality. I think there would be many, many "data mining" uses for this. (Would be nice if the Oracle documentation would explain it as clearly and completely as this paper.)

Wed Apr 04, 05:13:00 PM EDT  

Blogger Jeff Moss said....

...not read much of the article yet but I can certainly say that after reading the first paragraph, where it says "We see this often in Complex Event Processing where business processes are driven from sequence of events...", my current client would definitely gain a significant benefit from this...heck, I've just spent the last few months building code to do similar things using analytics, as you say...I look forward to "rewriting" my code at some stage!

Thu Apr 05, 01:37:00 AM EDT  

Anonymous J. Scheidtmann said....

This would be a very useful feature for any time series data, though I do not have an immediate application.

Here are my technical comments:

- The paper does not specify what happens when the pattern matches the empty set. The authors should therefore at least add "non-empty" at a couple of places. (A small regexp demonstration of the empty-match pitfall follows this list.)

- The example with ((A* B*) | (B* A*)) is rather difficult to follow, as the situation is complicated by the pattern matching an empty set. Make that ((A+ B*) | (B+ A*)) and it makes more sense.

- As I read the specs, the set of maximal non-empty matches for this example should be (on page 9): {a1,b1,b2} {b1,b2,a2,a3,a4} {b2,a2,a3,a4} {a2,a3,a4,b2,b4} {a3,a4,b3,b4} {a4,b3,b4} {b3,b4} {b4}, as SKIP TO NEXT ROW is specified and Preferment 1 applies, and I see no reason to discard the additional matches according to section 2.13.

- The discussion of incremental matching jumps from the set-oriented definition to a streaming implementation. For example, visible partition [a1,b1,b2,a2] has match {b1,b2,a2}, but this is output only after applying SKIP TO in the next step.

- Is INCREMENTAL equivalent to matching ^((A* B*) | (B* A*))$ on the visible window?
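As an aside, the empty-match behavior is easy to demonstrate with ordinary regular expressions (an analogy only; the proposal's PATTERN clause may of course behave differently):

-- (a*b*)|(b*a*) can match the empty string, so even a string with
-- no a's or b's produces a zero-length match at position 1
select regexp_instr('x', '(a*b*)|(b*a*)') as match_pos from dual;  -- returns 1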

Thu Apr 05, 07:13:00 AM EDT  

Anonymous Anonymous said....

" I can say - I've seen the SQL needed to recognize a "W" using analytics... It is not "pretty", this proposed language feature would make it pretty easy - once you 'got it'. "


Daft question alert - what's a "W"? Are you referring to a trend graph (down, up, down, up)??

Thu Apr 05, 09:30:00 AM EDT  

Blogger Thomas Kyte said....

W is a down, up, down, up, ... yes

Thu Apr 05, 10:32:00 AM EDT  

Anonymous evden eve nakliyat said....

very nice information, thank you very much. and very nice blog, i will come here all the time. thanks...

Thu Apr 05, 03:50:00 PM EDT  

Anonymous Mark Brady said....

Check out LIM -- it's a time series database. They show off this kind of functionality frequently, although their language makes it very simple. Answering questions like "show me a W pattern in the price of oil within 30 days of an OPEC meeting" is trivial.

Would you do this type of pattern matching on anything other than a time series? I think probably not, and if that's the case, to make this really, really useful, I'd like to see this feature combined with some other time series features...

Thu Apr 05, 05:22:00 PM EDT  

Anonymous Anonymous said....

Why not put in the ability to match any arbitrary curve to any arbitrary degree of accuracy while we're at it?

Thu Apr 05, 08:23:00 PM EDT  

Blogger Tony said....

analysis (looking for specific patterns of behavior).

We issue permits to facilities for controlling the pollutants put into waterbodies. Each permit has a five-year life-cycle, during which time the facilities have to collect data about what they are actually putting into the water. At the end of the five years we issue a new permit, generally with tougher guidelines.

What type of trend analysis and results would this feature provide that would help us build new guidelines? Basically we want/need to look at bad behavior and attempt to ratchet it down over time.

Fri Apr 06, 05:08:00 PM EDT  

Anonymous James T. said....

Hi all,

I'm James, one of the Oracle developers involved in the pattern recognition project. I really appreciate the interest you all are showing, and I want to respond in particular to J. Scheidtmann's technical comments on this post.

> - The paper does not specify what happens when the pattern matches empty set.
> The authors should therefore at least add "non-empty" at a couple of places.

If I understand what you mean by "empty set," you're suggesting an empty PATTERN() clause, correct? If so, this is a good point. We'll add a description of that functionality.

> - The example with ((A* B*) | (B* A*)) is rather difficult to follow as the
> situation is complicated by the pattern matching an empty set. Make that ((A+
> B*) | (B+ A*)) and it makes more sense.

That's probably fair. We'll improve our examples.

> - As I read the specs, the set of maximal non-empty matches for this example
> should be (on page 9): {a1,b1,b2} {b1,b2,a2,a3,a4} {b2,a2,a3,a4}
> {a2,a3,a4,b2,b4} {a3,a4,b3,b4} {a4,b3,b4} {b3,b4} {b4}, as SKIP TO NEXT ROW
> is specified and Preferment 1 applies and I see no reason to discard the
> additional matches according to section 2.13.

Nice catch. I believe that's a mistake.

> - The discussion of incremental matching jumps from the set oriented
> definition to a streaming implementation. For example visible partition
> [a1,b1,b2,a2] has match {b1,b2,a2} but this is output only after applying
> SKIP TO in the next step.

I don't quite understand what you mean here. Could you possibly rephrase the comment?

> - Is INCREMENTAL equivalent to matching ^((A* B*) | (B* A*))$ on the visible
> window?

I'm not completely confident I know which example you're referring to, but whenever ^ and $ are both used, incremental and maximal should have the same results.

Cheers,
James T.

Mon Apr 09, 06:41:00 AM EDT  

Anonymous Anonymous said....

Here are a few examples illustrating pattern recognition, borrowed from "Efficient Support for Time Series Queries in Data Stream Management Systems" by Y. Bai, et al. (from the UCLA CS department).

Assume an incoming stream of readings from highway speed sensors. The schema for the stream is: speed_sensors(location, speed, time). Sensors at individual locations send their data every minute.

Assume that we want to discover traffic jams. A jam is defined as a series of decreasing speeds that leads to a more than 70% speed reduction within a time span of at most 6 minutes and at least 3 minutes. (Remember that we record one measurement per minute.) Consider these sensor readings around LAX (not a good location, in general, for fast travel).

location speed time
LAX 60 10:00
LAX 70 10:01
LAX 72 10:02
LAX 60 10:03
LAX 30 10:04
LAX 20 10:05
LAX 10 10:06
LAX 10 10:07
LAX 20 10:08
LAX 50 10:09
LAX 50 10:10
LAX 20 10:11
LAX 20 10:12
LAX 50 10:13
LAX 60 10:14
LAX 10 10:15
LAX 10 10:16
LAX 10 10:17
LAX 10 10:18
LAX 10 10:19
LAX 10 10:20
LAX 60 10:21
LAX 60 10:22
LAX 60 10:23

Our jam-detecting query is:

SELECT location, start_jam, start_speed
FROM speed_sensors MATCH_RECOGNIZE(
       PARTITION BY location ORDER BY time
       ONE ROW PER MATCH
       MAXIMAL MATCH
       MEASURES speed_50.time AS start_jam, speed_50.speed AS start_speed
       PATTERN (speed_50 speed_decr+ speed_70pct+)
       DEFINE
         speed_50    AS speed > 50,                               -- speed starting above 50
         speed_decr  AS speed_decr.speed < PREV(speed_decr.speed) -- decreasing...
                        AND count(speed_decr.*) < 6,              -- ...for at most 6 minutes
         speed_70pct AS speed_70pct.speed < 0.3*speed_50.speed    -- last speed below 30% of start
                        AND count(speed_decr.*) > 3               -- after at least 3 minutes decreasing
     )

On the above data we get these jams:

location start_jam start_speed
LAX 10:01 72
LAX 10:14 60


Observe that the slowdown that started at time 10:10 is not considered a jam, since it didn't last at least 3 minutes.

Suppose that we also want to detect the end of a jam. Assume that a jam continues as long as the speed stays below 30% of the initial speed. We simply change the pattern to:

SELECT location, start_jam, start_speed, end_jam, end_speed
FROM speed_sensors MATCH_RECOGNIZE(
       PARTITION BY location ORDER BY time
       ONE ROW PER MATCH
       MAXIMAL MATCH
       MEASURES speed_50.time AS start_jam, speed_50.speed AS start_speed,
                speed_end.time AS end_jam, speed_end.speed AS end_speed
       PATTERN (speed_50 speed_decr+ speed_70pct+ speed_end)
       DEFINE
         speed_50    AS speed > 50,                               -- speed starting above 50
         speed_decr  AS speed_decr.speed < PREV(speed_decr.speed) -- decreasing...
                        AND count(speed_decr.*) < 6,              -- ...for at most 6 minutes
         speed_70pct AS speed_70pct.speed < 0.3*speed_50.speed    -- speed below 30% of start
                        AND count(speed_decr.*) > 3,              -- after at least 3 minutes decreasing
         speed_end   AS speed_end.speed > 0.3*speed_50.speed      -- speed climbed back above 30% of start
     )

We get these results:

location start_jam start_speed end_jam end_speed
LAX 10:01 72 10:09 50
LAX 10:14 60 10:21 60

Mon Apr 09, 09:04:00 PM EDT  


Blogger J. Scheidtmann said....

Hi James,

You said:
> I said:
[...]
> > - The paper does not specify what happens when the pattern matches empty set.
> > The authors should therefore at least add "non-empty" at a couple of places.
>
> If I understand what you mean by "empty set," you're
> suggesting an empty PATTERN() clause, correct? If so, this is
> a good point. We'll add a description of that functionality.

In principle yes, but realised by a non-empty PATTERN():
Suppose "x" as the string to match and the pattern ((a*b*)|(b*a*)).
This pattern has two matches: the empty string before "x" and the
empty string after "x". For PATTERN()s of rows: what happens when
the pattern "matches" the empty row set before each non-matching
row, between non-matching rows, and after the last
non-matching row?

> > - The example with ((A* B*) | (B* A*)) is rather difficult to follow as the
> > situation is complicated by the pattern matching an empty set. Make that ((A+
> > B*) | (B+ A*)) and it makes more sense.
>
> That's probably fair. We'll improve our examples.
>
> > - As I read the specs, the set of maximal non-empty matches for this example
> > should be (on page 9): {a1,b1,b2} {b1,b2,a2,a3,a4} {b2,a2,a3,a4}
> > {a2,a3,a4,b2,b4} {a3,a4,b3,b4} {a4,b3,b4} {b3,b4} {b4}, as SKIP TO NEXT ROW

Last line should be: {a2,a3,a4,b*3*,b4} {...

[...]
> > - The discussion of incremental matching jumps from the set oriented
> > definition to a streaming implementation. For example visible partition
> > [a1,b1,b2,a2] has match {b1,b2,a2} but this is output only after applying
> > SKIP TO in the next step.
>
> I don't quite understand what you mean here. Could you possibly rephrase the comment?

I was puzzled that [a1,b1,b2,a2] is not matched again after
outputting the first match ("There is no match, hence ..." on page
10, line 5), and that {b1,b2,a2} is output only after applying
the SKIP TO clause. This makes perfect sense when considering
a streaming implementation which restricts itself to matches
anchored on the first element of the sequence (as you will get
the other matches later). If you had pointed that out, it would
have been easier for me to read.

> > - Is INCREMENTAL equivalent to matching ^((A* B*) | (B* A*))$ on the visible
> > window?
>
> I'm not completely confident I know which example you're
> referring to, but whenever ^ and $ are both used, incremental
> and maximal should have the same results.

I was referring to section 2.9, pages 9 and 10.

Suppose PATTERN (^(A+)$) SKIP TO NEXT ROW, matching this partition:

row number, row id, attribute
1, a1, a
2, a2, a

Expected result with start and end markers of true partition (^^
and $$) and visible partition (^ and $):

ALL: {^^,a1,a2,$$}
MAXIMAL: {^^,a1,a2,$$} (a subset of ALL - see page 8)

Should MAXIMAL output {^,a2,$$} because of the SKIP TO clause?

INCREMENTAL:
{^^,a1,$} {^^,a1,a2,$$} {^,a2,$$} (union of all maximal matches
after every tuple, see page 8 or according to the sequence of
sets of visible rows in the partition on page 10)
or {^^,a1,a2,$$} alone?

Thanks in advance,

Jens Scheidtmann

Tue Apr 10, 09:03:00 AM EDT  

Anonymous Anonymous said....

>>> A jam is defined as a series of decreasing speeds that leads to a more than 70% speed reduction within a time span of at most 6 minutes and at least 3 minutes. <<<

Wouldn't it be easier to just redefine a traffic jam as:

A jam is a continued series of speed points, each evaluated to less than 30% of the daily average, with a time length of at least 3 minutes.
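Under the proposed syntax, that simpler definition might come out roughly like this (the daily_avg column is assumed to be precomputed; all names here are illustrative):

SELECT location, jam_start
FROM speed_sensors MATCH_RECOGNIZE(
       PARTITION BY location ORDER BY time
       ONE ROW PER MATCH
       MEASURES first(slow.time) AS jam_start
       -- at least three consecutive readings (one per minute)
       -- below 30% of the daily average
       PATTERN (slow slow slow+)
       DEFINE
         slow AS slow.speed < 0.3 * slow.daily_avg
     );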

Tue Apr 10, 01:16:00 PM EDT  

Blogger Srikanth said....

Here is how the W pattern can be obtained using Oracle's analytic functions. The view wpat_v47 returns the starting rows of matches of the W pattern. I use the terms window function and analytic function interchangeably.

/*create sample tables and views*/

drop table stocks;
create table stocks(
ticker varchar2(5),
trade_date date,
closing_value number);


/*
mark slopes of W as either:
** 0 for downward slopes
** 1 for upward slopes
** 2 for horizontal
** 0 for last row
*/

create or replace view wpat_v41 as
select
tk,td,val,rn,
(case
when nval < val then 0
when nval = val then 2
when nval is NULL then 0
else 1 end) mrk
from (
select
ticker tk,
trade_date td,
closing_value val,
lead(closing_value) over (partition by ticker order by trade_date) nval,
row_number() over (partition by ticker order by trade_date) rn
from stocks
);

/*
I now find the transition points
(eg, point at which stock trends
from downward to upward).
Then I assign state value to rows:
** non-null for transitioning rows
** null for others
*/
create or replace view wpat_v42 as
select
tk,td,val,rn,
mrk,pmrk,
(case
when pmrk is NULL then NULL
when pmrk = mrk then NULL
when pmrk = 0 and mrk = 1 then 'a'
when pmrk = 0 and mrk = 2 then 'b'
when pmrk = 1 and mrk = 0 then 'c'
when pmrk = 1 and mrk = 2 then 'd'
when pmrk = 2 and mrk = 0 then 'e'
when pmrk = 2 and mrk = 1 then 'f'
else NULL end) state
from (
select
tk,td,val,rn,
mrk,
lag(mrk) over (partition by tk order by td) pmrk
from wpat_v41);

/*
Now, look ahead and form a pattern based on first N non-null state values.
For W pattern, we need 3 state values.
** one for downward to upward trend
** one for upward to downward
** one for downward to upward

If Oracle had an analytic function that gives us the Nth non-NULL value, this would be much easier. Since Oracle doesn't have one, I do it the following way:
*/


/*
get the first non-null state value
*/
create or replace view wpat_v43 as
select
tk,td,val,rn,
mrk,pmrk,state,
first_value(state ignore nulls) over (
partition by tk order by td rows between 1 following and unbounded following) t1,
first_value(case when state is NULL then NULL else rn end ignore nulls) over (
partition by tk order by td rows between 1 following and unbounded following) - rn +1 r1
from wpat_v42;

/*
get the second non-null state value
*/
create or replace view wpat_v44 as
select
tk,td,val,rn,
mrk,pmrk,state,t1,r1,
first_value(state ignore nulls) over (
partition by tk order by td rows between r1 following and unbounded following) t2,
first_value(case when state is NULL then NULL else rn end ignore nulls) over (
partition by tk order by td rows between r1 following and unbounded following) - rn + 1 r2
from wpat_v43;

/*
get the third non-null state value
*/
create or replace view wpat_v45 as
select
tk,td,val,rn,
mrk,pmrk,state,t1,r1,t2,r2,
first_value(state ignore nulls) over (
partition by tk order by td rows between r2 following and unbounded following) t3,
first_value(case when state is NULL then NULL else rn end ignore nulls) over (
partition by tk order by td rows between r2 following and unbounded following) - rn +1 r3
from wpat_v44;

/*
form the pattern and the end point
*/
create or replace view wpat_v46 as
select
tk,td,val,rn,
mrk,pmrk,state,
t1||t2||t3 patt,
first_value(td) over (
partition by tk order by td rows between r3 following and unbounded following) ed
from wpat_v45;

/*
final query
*/
create or replace view wpat_v47 as
select tk,td,val,ed
from wpat_v46
where patt = 'aca';

/*
sample data
*/

insert into stocks values('ABCD', to_date('01-Mar-06', 'DD-MON-YY'), 10);
insert into stocks values('ABCD', to_date('02-Mar-06', 'DD-MON-YY'), 11);
insert into stocks values('ABCD', to_date('03-Mar-06', 'DD-MON-YY'), 12);
insert into stocks values('ABCD', to_date('04-Mar-06', 'DD-MON-YY'), 11);
insert into stocks values('ABCD', to_date('05-Mar-06', 'DD-MON-YY'), 10);
insert into stocks values('ABCD', to_date('06-Mar-06', 'DD-MON-YY'), 9);
insert into stocks values('ABCD', to_date('07-Mar-06', 'DD-MON-YY'), 10);
insert into stocks values('ABCD', to_date('08-Mar-06', 'DD-MON-YY'), 11);
insert into stocks values('ABCD', to_date('09-Mar-06', 'DD-MON-YY'), 12);
insert into stocks values('ABCD', to_date('10-Mar-06', 'DD-MON-YY'), 9);
insert into stocks values('ABCD', to_date('11-Mar-06', 'DD-MON-YY'), 10);
insert into stocks values('ABCD', to_date('12-Mar-06', 'DD-MON-YY'), 9);

select *
from wpat_v47;

The SQL given above finds overlapping W patterns. It can easily be changed to obtain non-overlapping W patterns.

Tue Apr 10, 04:28:00 PM EDT  

Anonymous Anonymous said....

I would love to have this for trend reporting. However, that functionality would be much more powerful if it could be applied to data as it is being inserted.

A use case would be statistical process control: you would want to trigger some sort of asynchronous event as soon as a pattern is recognised in a continuous incoming dataflow.

This functionality as described would require a polling process to query the same data over and over to allow for near-real-time (a couple of minutes latency, max) reaction to patterns.

Thu Apr 12, 02:26:00 PM EDT  

Blogger SnippetyJoe said....

I've never done the kind of pattern matching described in the paper, so excuse my ignorance if I get anything wrong here, but it sounds to me like this new feature would effectively provide the ability to do regular expression pattern matching over rows of values.

If that's the case then, instead of reinventing the regular expression wheel why not simply aggregate a column of pattern symbols into a single string and then use existing regular expression functions to search the string for specific patterns?

It's too long to show here, but I have some examples of how to do this on SQL Snippets at Drafts: Pattern Matching Over Rows. Finding a "W" pattern in a series of stock prices boils down to a query like this:

select stock_symbol
from stock_price_patterns
where day = 11
and regexp_like( pattern_string_uda, 'D+U+D+U+' ) ;
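For reference, pattern_string_uda above is a per-stock string with one 'U' or 'D' symbol per day. Oracle has no built-in string aggregation function today, but one possible way to build such a string (sketched against an assumed stocks(stock_symbol, day, closing_value) table) is SYS_CONNECT_BY_PATH:

select stock_symbol,
       replace(sys_connect_by_path(symbol, '/'), '/') as pattern_string
from (select stock_symbol, day,
             -- 'U' if the close rose from the prior day, else 'D'
             case when closing_value > lag(closing_value)
                       over (partition by stock_symbol order by day)
                  then 'U' else 'D'
             end as symbol,
             row_number() over (partition by stock_symbol order by day) as rn
      from stocks)
where connect_by_isleaf = 1
start with rn = 1
connect by prior stock_symbol = stock_symbol and prior rn = rn - 1;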

Assuming I'm not missing something, I think Oracle's efforts would be better spent developing a built-in string aggregation function than adding another feature for doing pattern matching.

HTH.

Thu Apr 12, 04:52:00 PM EDT  

Blogger Alberto Dell'Era said....

>Assuming I'm not missing something, then I think Oracle's efforts would
>be better spent developing a built in string aggregation function than
>adding another feature for doing pattern matching.

But the paper's enhancement would not only detect the pattern, it would also (and most importantly) enable you to compute measures on it (location, max/min values, etc.).

Also, I've noticed that they propose to emit the measures as soon as the pattern(s) is (are) discovered, instead of first aggregating (pivoting) all the rows and then searching for the pattern(s), which would be a dramatic performance improvement for big data sets.

Sat Apr 14, 09:03:00 AM EDT  

Anonymous Natti Bar-On said....

Seeing this makes me go "uch", since I'm currently in the process of writing an inter-row rules engine that is meant to do exactly this (yes, using analytics etc.).

With the business need perfectly clear, I think the syntax is quite clear as well.

I vote to put it in 11gR1!
:-)

Tue Apr 17, 04:29:00 AM EDT  

Blogger Natti Bar-On said....

In addition - how about "find me the entities that DON'T match a common pattern" functionality?
(thinking of the performance implications of that...)
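One way to approximate that with the feature as proposed would be an anti-join against the pattern query's output; here w_matches stands for a (hypothetical) view returning one row per entity that matched:

-- entities with no match for the common pattern
select ticker from stocks
minus
select ticker from w_matches;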

Tue Apr 17, 04:31:00 AM EDT  

Blogger SnippetyJoe said....

Alberto Dell'Era said....
But the paper's enhancement would not only detect the pattern, it would also (and most importantly) enable you to compute measures on it (location, max/min values, etc.).


Good point. Calculating location, max, min, etc. values is, however, easily accomplished with my string aggregation approach. I demonstrate how in a new section called "Take Two" at the end of Drafts - Pattern Matching Over Rows.

Alberto Dell'Era said....
Also, I've noticed that they propose to emit the measures as soon as the pattern(s) is (are) discovered, instead of first aggregating (pivoting) all the rows and then searching for the pattern(s), which would be a dramatic performance improvement for big data sets.


Emitting measures as soon as a pattern is discovered may not necessarily yield dramatic performance improvements over the string aggregation approach. In many situations, e.g. stock patterns or fraud detection, the target pattern will only be found in a small percentage of cases. The majority of cases will involve full scans of the data series without emitting any measures at all. Even in the cases where the target pattern does exist, there is a 50/50 chance the pattern will appear in the second half of the data, so the savings will not be dramatic here either.

Consider also that, to emit measures as soon as a pattern is discovered, the proposed approach may need to maintain intermediate results for each measure after each data point is fetched, just in case a match is found. When no match is found (which can be most of the time) the results of the intermediate calculations will simply be discarded, and all that overhead will have been performed for nothing.

The string aggregation approach, on the other hand, only looks for measures *after* a pattern match is found.

One last point: the string aggregation approach uses REGEXP_LIKE to find a pattern match. I expect REGEXP_LIKE stops scanning when a match is found, just as the new feature would.

Wed Apr 25, 11:56:00 AM EDT  

Anonymous Dan Kefford said....

Tom, James T.,

Are any of the developers maintaining some sort of blog on the development of this project? Will any proof-of-concept or alpha release be made available for anyone outside Oracle to experiment with? Is there any ETA on when we might see this in a future Oracle DB release?

Fri May 25, 10:26:00 AM EDT  

Anonymous Anonymous said....

This will be extremely useful for all sorts of fraud detection patterns.
Is there any indication that this will be implemented?

Tue Jun 19, 10:27:00 AM EDT  

Blogger Dave said....

In the example on page 3, shouldn't the measures be...

nvl(max(c.tstamp), b.tstamp),
nvl(last(c.price), b.price),

since C can occur zero or more times?

Tue Jul 10, 10:57:00 AM EDT  
