Friday, August 25, 2006

Back from a time out...

I just got back from a short week in the woods:

IMG_3427

We were very much bandwidth challenged - I was using my Verizon wireless card but connectivity was spotty to say the least. The best place to receive a signal was not a very convenient one:

IMG_3469

We used the second floor balcony and perched the laptop on the railing as far away from the cottage as possible. There we achieve blazing speeds of up to 80/90 kbps! (yes, that is about 10 to 11 Kilobytes per second). So, once I realized that - I just checked out. I would check my email in the morning and the evening - attend to anything that needed immediate attention and either said "sorry, on vacation, but lots of forums out there to try!" or filed it away for later (now being later - I have a scroll bar on my inbox and that drives me nuts). My goal is to achieve this "nirvana" again.

Anyway, it was a very secluded place

IMG_0757

and for some of the time the only cottage with anyone in it (there are only three to begin with) was ours. That is my idea of down time. The best part of this vacation - I did not have to pull my wallet out of my back pocket all week - the nearest store was far far away and I wasn't leaving the grounds.

Anyway - was just digging through the email and got a question from someone about how to do something (imagine that). It piqued my curiosity (meaning, it was interesting enough that I didn't just reply "sorry, please don't email me" :) In short, they needed a function that would accept two strings and return Y if the second string was an anagram of the first string, N otherwise. They sent their code with the comment "I think this can be done better".

So I gave it a try. Wow, did it feel good to write a piece of code after a week of not doing much at all computerwise. It woke me right up. Here is my solution:

Now, one up it - any easier approaches you can think of?

ops$tkyte%ORA10GR2> create or replace package anagram_pkg
2 as
3 type anagram_array is table of varchar2(1);
4 function compare
5 ( p_src in varchar2, p_tgt in varchar2 )
6 return varchar2
7 deterministic;
8 end;
9 /

Package created.

ops$tkyte%ORA10GR2> create or replace package body anagram_pkg
2 as
3
4 function compare
5 ( p_src in varchar2, p_tgt in varchar2 )
6 return varchar2
7 deterministic
8 is
9 l_data1 anagram_array default anagram_array();
10 l_data2 anagram_array default anagram_array();
11 l_len number default length(p_src);
12 l_return varchar2(1);
13 begin
14 if ( l_len <> length(p_tgt)
15 or p_src is null
16 or p_tgt is null )
17 then
18 l_return := 'N';
19 else
20 l_data1.extend(l_len);
21 l_data2.extend(l_len);
22 for i in 1 .. l_len
23 loop
24 l_data1(i) := upper(substr( p_src,i,1 ));
25 l_data2(i) := upper(substr( p_tgt,i,1 ));
26 end loop;
27
28 if ( l_data1 = l_data2 )
29 then
30 l_return := 'Y';
31 else
32 l_return := 'N';
33 end if;
34 end if;
35 return l_return;
36 end compare;
37
38 end;
39 /

Package body created.
ops$tkyte%ORA10GR2> select a, b, substr(anagram_pkg.compare(a,b),1,1)
2 from t;

A B S
------------ ------------ -
a N
N
b N
Richard Rihcard Y
Richard RihcaDR Y
Richard Rihhard N
Thomas Kyte They AskTom Y
abba baab Y
abba baaa N

9 rows selected.


In Oracle 9i and before, you can use this query:


    select decode( count(*), l_len, 'Y', 'N' )
into l_return
from (select column_value a,
row_number()
over (order by column_value) rn
from TABLE(cast(l_data1 as anagram_array) )
) src,
(select column_value a,
row_number()
over (order by column_value) rn
from TABLE(cast(l_data2 as anagram_array) )
) tgt
where src.rn = tgt.rn
and src.a = tgt.a;


to compare the two collections (lines 28..33). In order to use the query, you will use "create or replace type anagram_array as ..." and create a SQL type. That is, the anagram_type will be a SQL object type - not a PL/SQL type.

POST A COMMENT

18 Comments:

Anonymous Anonymous said....

"Thomas Kyte" "They AskTom " are anagrams: I found this one wonderful !

Pierre Forstmann

Fri Aug 25, 05:05:00 PM EDT  

Anonymous Mark from NY said....

Wow, I didn't know you could do "if ( l_data1 = l_data2 )"...That certainly made your code simple.

I can't imagine a neater way, but if you're going for performance, this algorithm might be the way to go - http://www.gatago.com/comp/programming/17888209.html

This algorithm gives you linear complexity, whereas the sort Oracle must have to do to compare the two collections is worse (the best sorting algorithms are n log n).

But I'd love to see a benchmark. ;)

Fri Aug 25, 06:40:00 PM EDT  

Anonymous David said....

Fun!

I learned something new with your way... here's my stab at it using basic functions. I have no idea on the performance:

CREATE OR REPLACE FUNCTION is_anagram(
p_src IN VARCHAR2
, p_tgt IN VARCHAR2)
RETURN VARCHAR2 DETERMINISTIC IS
l_src VARCHAR2(32767) DEFAULT UPPER(REPLACE(p_src, ' '));
l_tgt VARCHAR2(32767) DEFAULT UPPER(REPLACE(p_tgt, ' '));
l_return VARCHAR2(1) DEFAULT 'N';
BEGIN
WHILE l_src IS NOT NULL
AND l_tgt IS NOT NULL LOOP
EXIT WHEN LENGTH(l_src) <> LENGTH(l_tgt);
l_src := REPLACE(l_src, SUBSTR(l_tgt, 1, 1));
l_tgt := REPLACE(l_tgt, SUBSTR(l_tgt, 1, 1));

IF l_src IS NULL
AND l_tgt IS NULL THEN
l_return := 'Y';
END IF;
END LOOP;

RETURN l_return;
END is_anagram;

Fri Aug 25, 06:56:00 PM EDT  

Anonymous Mr. Ed said....

var a varchar2(30);
var b varchar2(30);

exec :a := 'Thomas Kyte'; :b := 'They AskTom';

select decode(count(*), 0,'YES', 'NO') is_anagram
from (
select c
from (
select upper(substr(:a,rownum,1)) c, 1 a, null b
from dual connect by level <= length(:a)
union all
select upper(substr(:b,rownum,1)) c, null, 2
from dual connect by level <= length(:b)
)
where c <> ' '
group by c
having count(a) <> count(b)
);

Fri Aug 25, 09:33:00 PM EDT  

Anonymous Anonymous said....

Wouldn't it be easier to simply convert each letter in the source and target strings using ASCII(UPPER()) and add up the numbers? If the numbers match, the 2 strings are anagrams of each other.

Fri Aug 25, 11:09:00 PM EDT  

Anonymous Bryan said....

Anonymous - Adding up the ascii values of letters only works for two letter words. For example, 1+2+5=1+3+4, but 1,2 and 5 is not an anagram of 1,3 and 4. The sum of x numbers does not uniquely determine the collection of x numbers. However, if you mapped letters to prime numbers (A-2, B-3, C-5, ...) and multiplied the numbers together, that would work becaus of the fundamental theorem of arithmetic (http://en.wikipedia.org/wiki/Fundamental_theorem_of_arithmetic). I wouldn't recommend this, because the numbers are going to get way too big!

Fri Aug 25, 11:27:00 PM EDT  

Anonymous Mr. Ed said....

Bryan, product of primes is too big, but the log of the product isn't too bad:

create or replace function anagram_num (s varchar2)
return float
as
type P_type is table of integer index by binary_integer;
P P_type;
n float default 0.0;
begin
if s is null then return 0; end if;

P(1) := 2; P(2) := 3; P(3) := 5; P(4) := 7; P(5) := 11; P(6) := 13; P(7) := 17;
P(8) := 19; P(9) := 23; P(10) := 29; P(11) := 31; P(12) := 37; P(13) := 41; P(14) := 43;
P(15) := 47; P(16) := 53; P(17) := 59; P(18) := 61; P(19) := 67; P(20) := 71; P(21) := 73;
P(22) := 79; P(23) := 83; P(24) := 89; P(25) := 97; P(26) := 101; P(27) := 103;
P(28) := 107; P(29) := 109; P(30) := 113; P(31) := 127; P(32) := 131; P(33) := 137;
P(34) := 139; P(35) := 149; P(36) := 151; P(37) := 157; P(38) := 163; P(39) := 167;
P(40) := 173; P(41) := 179; P(42) := 181; P(43) := 191; P(44) := 193; P(45) := 197;
P(46) := 199; P(47) := 211; P(48) := 223; P(49) := 227; P(50) := 229; P(51) := 233;
P(52) := 239; P(53) := 241; P(54) := 251; P(55) := 257; P(56) := 263;
P(57) := 269; P(58) := 271; P(59) := 277; P(60) := 281; P(61) := 283; P(62) := 293;

for i in 1..length(s) loop
if substr(s, i, 1) <> ' ' then
n := n + log( 10, P(ascii(upper(substr(s, i, 1))) - 33 + 1) );
end if;
end loop;
return n;
end;
/

select decode(anagram_num('ORACLE WORLD'), anagram_num('CLEAR OLD ROW'),'YES', 'NO') is_anagram from dual;

Sat Aug 26, 03:21:00 AM EDT  

Blogger Thomas Kyte said....

quick test shows that

anagram_pkg used about 20% more cpu than is_anagram (50,000 calls took about 6 cpu seconds for anagram_pkg and about 5 cpu seconds for is_anagram using ALL_OBJECTS and comparing object_name to itself)

anagram_num however used over 80 cpu seconds to do the same.

Sat Aug 26, 09:44:00 AM EDT  

Blogger Thomas Kyte said....

although for anagram_num, we can get the cpu down to about 10 cpu seconds simply by:

type P_type is table of binary_double index by binary_integer;
P P_type;
n binary_double default 0.0;

using native floating point numbers for the table type instead of integers...

and remember with floats - we might need to use

abs( a-b ) <= some_tolerance

instead of just equals.... Using the original anagram_num (with integers)

ops$tkyte%ORA10GR2> variable letters varchar2(52)
ops$tkyte%ORA10GR2> column l format a20 word_wrap
ops$tkyte%ORA10GR2> column rl format a20 word_wrap
ops$tkyte%ORA10GR2> exec :letters := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

PL/SQL procedure successfully completed.

ops$tkyte%ORA10GR2> select :letters l,
2 reverse(:letters) rl,
3 anagram_num( :letters ) - anagram_num( reverse(:letters) ) diff,
4 decode( anagram_num( :letters ) ,
5 anagram_num( reverse(:letters) ),
6 'Y', 'N' )
7 from dual;

L RL DIFF D
-------------------- -------------------- ---------- -
abcdefghijklmnopqrst ZYXWVUTSRQPONMLKJIHG 1.0000E-36 N
uvwxyzABCDEFGHIJKLMN FEDCBAzyxwvutsrqponm
OPQRSTUVWXYZ lkjihgfedcba


ops$tkyte%ORA10GR2>
ops$tkyte%ORA10GR2> select anagram_pkg.compare(:letters,reverse(:letters)),
2 is_anagram(:letters,reverse(:letters))
3 from dual;

ANAGRAM_PKG.COMPARE(:LETTERS,REVERSE(:LETTERS))
-------------------------------------------------------------------------------
IS_ANAGRAM(:LETTERS,REVERSE(:LETTERS))
-------------------------------------------------------------------------------
Y
Y

those strings appear different.

Sat Aug 26, 09:59:00 AM EDT  

Anonymous Anonymous said....

CREATE OR REPLACE FUNCTION ARE_ANAGRAMS
(p_str1 VARCHAR2, p_str2 VARCHAR2)
RETURN VARCHAR2 AS
l_len number := length(p_str1);
l_str varchar2(32767) := upper(p_str2);
BEGIN
if length(p_str2) <> l_len or p_str1 is null or p_str2 is null then
return 'N';
end if;

for i in 1..l_len loop
l_str := translate (l_str,upper(substr(p_str1,i,1)),'x');
end loop;

if rpad ('x',l_len,'x') = l_str then
return 'Y';
else
return 'N';
end if;

END;

Sat Aug 26, 04:03:00 PM EDT  

Anonymous Anonymous said....

A variation:

CREATE OR REPLACE FUNCTION are_anagrams2(p_str1 VARCHAR2, p_str2 VARCHAR2) RETURN VARCHAR2 AS
l_len NUMBER := LENGTH(p_str1);
BEGIN

IF l_len <> LENGTH(p_str2) OR p_str1 IS NULL OR p_str2 IS NULL THEN
RETURN 'N';
END IF;

IF rpad(TRANSLATE(upper(p_str1), upper(p_str2), '+'), l_len, '+') = rpad('+', l_len, '+') THEN
RETURN 'Y';
ELSE
RETURN 'N';
END IF;

END;

Albert Nelson A

Sat Aug 26, 04:36:00 PM EDT  

Blogger Thomas Kyte said....

Translate won't be extremely useful here - neither are_anagrams* work correctly:

SQL> select anagram_pkg.compare( 'baa', 'aaa' ) a,
2 are_anagrams( 'baa', 'aaa' ) b
3 from dual;

A B
- -
N Y

SQL> select anagram_pkg.compare( 'bbab', 'abba' ) a,
2 are_anagrams2( 'bbab', 'abba' ) b
3 from dual;

A B
- -
N Y

Sat Aug 26, 04:46:00 PM EDT  

Anonymous Matthias Rogel said....

nice discussion.

Nobody thought of a recursive
solution,
which should be simple enough like the following
(am on holidays without any Oracle
and cannot test it though):

create function is_rec_anagram
(s1 in varchar2, s2 in varchar2) return varchar2 deterministic is
n integer;
begin
if s1 is null then
return
case when s2 is null
then 'Y' -- '' and '' are anagrams I would say
else 'N'
end;
end if;
n := instr(upper(s2),
upper(substr(s1,1,1)));
return
case n when 0 then 'N'
else
is_rec_anagram(
substr(s1,2),
concat(substr(s2,1,n-1) /* probably false if n=1, to test */,
substr(s2,n+1)))
end;
end is_rec_anagram;

Sun Aug 27, 05:07:00 AM EDT  

Blogger William Robertson said....

Are those <pre> tags?

Sun Aug 27, 06:17:00 PM EDT  

Blogger Thomas Kyte said....

yes, they are pre tags :)

but only because live writer didn't change the spaces into  

technically did not *need* the pre tag, but I agree it makes it easier.

Sun Aug 27, 06:26:00 PM EDT  

Blogger Jayadas Chelur said....

Another approach ... not sure about the efficiency, though.

CREATE OR REPLACE FUNCTION anagram_test( p_str1 IN VARCHAR2, p_str2 IN VARCHAR2 )
RETURN VARCHAR2
IS
l_count NUMBER;
l_retval CHAR(1) := 'N';
BEGIN
IF (p_str1 IS NOT NULL) AND
(p_str2 IS NOT NULL) AND
(LENGTH(p_str1) = LENGTH(p_str2))
THEN
l_count := 0;
FOR i IN 1..LENGTH(p_str1) LOOP
l_count := l_count + NVL(LENGTH(REPLACE(p_str2,SUBSTR(p_str1,i,1),'')),0)
- NVL(LENGTH(REPLACE(p_str1,SUBSTR(p_str1,i,1),'')),0);
END LOOP;

IF l_count = 0 THEN
l_retval := 'Y';
END IF;
END IF;

RETURN l_retval;
END;

Mon Aug 28, 01:33:00 PM EDT  

Anonymous Cameron said....

This is both interesting and fun.

One of my interests is in fuzzy address matching technology. You can use soundex, double metaphone, Levenshtein distance, etc. I really like using the Levenshtein distance, but one limitation is that I have rarely seen one that also counts adjacent character swaps as a distance of one instead of two. I wonder if some of the code used for this anagram function could be adapted and incorporated into some of the code used for LD to create a better LD function?

Tue Aug 29, 11:10:00 AM EDT  

Anonymous Barbara Boehmer said....

I just stumbled upon this old thread and saw Cameron's comment about,

"I really like using the Levenshtein distance, but one limitation is that I have rarely seen one that also counts adjacent character swaps as a distance of one instead of two."

There is a Damerau-Levenshtein distance algorithm that does that:

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

I took the pseudo-code from the link above and applied it to my existing ld function, changing most of the variable names for easy comparison, and here is the result:

SCOTT@orcl_11g> CREATE OR REPLACE FUNCTION dld -- Damerau–Levenshtein distance
2 (str1 VARCHAR2 DEFAULT NULL,
3 str2 VARCHAR2 DEFAULT NULL)
4 RETURN NUMBER DETERMINISTIC
5 AS
6 lenStr1 NUMBER := NVL (LENGTH (str1), 0);
7 lenStr2 NUMBER := NVL (LENGTH (str2), 0);
8 TYPE mytabtype IS TABLE OF NUMBER INDEX BY BINARY_INTEGER;
9 TYPE myarray IS TABLE OF mytabtype INDEX BY BINARY_INTEGER;
10 d myarray;
11 cost NUMBER := 0;
12 BEGIN
13 IF str1 = str2 THEN
14 RETURN 0;
15 ELSIF lenStr1 = 0 OR lenStr2 = 0 THEN
16 RETURN GREATEST (lenStr1, lenStr2);
17 ELSIF lenStr1 = 1 AND lenStr2 = 1 AND str1 <> str2 THEN
18 RETURN 1;
19 ELSE
20 FOR j in 0 .. lenStr2 LOOP
21 d (0) (j) := j;
22 END LOOP;
23 FOR i in 1 .. lenStr1 LOOP
24 d (i) (0) := i;
25 FOR j in 1 .. lenStr2 LOOP
26 IF SUBSTR (str1, i, 1) = SUBSTR (str2, j, 1) THEN
27 cost := 0;
28 ELSE
29 cost := 1;
30 END IF;
31 d (i) (j) := LEAST (
32 d (i-1) (j) + 1, -- deletion
33 d (i) (j-1) + 1, -- insertion
34 d (i-1) (j-1) + cost -- substitution
35 );
36 IF i > 1 AND j > 1
37 AND SUBSTR (str1, i, 1) = SUBSTR (str2, j-1, 1)
38 AND SUBSTR (str1, i-1, 1) = SUBSTR (str2, j, 1)
39 THEN
40 d (i) (j) := LEAST (
41 d (i) (j),
42 d (i-2) (J-2) + cost -- transposition
43 );
44 END IF;
45 END LOOP;
46 END LOOP;
47 RETURN d (lenStr1) (lenStr2);
48 END IF;
49 END dld;
50 /

Function created.

SCOTT@orcl_11g> SELECT ld ('Kyte', 'Ktye'),
2 dld ('Kyte', 'Ktye')
3 FROM DUAL
4 /

LD('KYTE','KTYE') DLD('KYTE','KTYE')
----------------- ------------------
2 1

SCOTT@orcl_11g>

Fri May 30, 06:37:00 PM EDT  

POST A COMMENT

<< Home