Sunday, August 06, 2006

An interesting data set...

An interesting data set. Remember not too long ago – a big deal was made out of the US government request of the big search engines for their search details (what the search strings where). Well, now AOL is giving that away for all.

Of course, I just had to download it and put it into Oracle to take a peek at it. It was a 500mb download, a tgz file (tar gzipped). Surprise surprise, the tgz file contains – 10 gzipped files! One wonders why they tgz’ed them – when just tar would have resulted in a smaller file (compressing compressed data makes it bigger in most cases). Anyway, their readme said it had 36,389,567 rows of data – and all but 2 of them loaded up (bad data). I used an external table and a simple create table as select to load it up in a couple of seconds:

CREATE TABLE "AOL_ET"
(
"ANONID" NUMBER,
"QUERY_STR" VARCHAR2(4000),
"QUERY_TIME" DATE,
"ITEM_RANK" NUMBER,
"CLICK_URL" VARCHAR2(4000)
)
ORGANIZATION external
(
TYPE oracle_loader
DEFAULT DIRECTORY SYS_SQLLDR_XT_TMPDIR_00000
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
BADFILE 'SYS_SQLLDR_XT_TMPDIR_00000':'test_%p.bad'
LOGFILE 'test.log_xt'
READSIZE 1048576
FIELDS TERMINATED BY 0x'09' LDRTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
(
"ANONID" CHAR(255)
TERMINATED BY 0x'09',
"QUERY_STR" CHAR(4000)
TERMINATED BY 0x'09',
"QUERY_TIME" CHAR(255)
TERMINATED BY 0x'09'
DATE_FORMAT DATE MASK "yyyy-mm-dd hh24:mi:ss",
"ITEM_RANK" CHAR(255)
TERMINATED BY 0x'09',
"CLICK_URL" CHAR(4000)
TERMINATED BY 0x'09'
)
)
location
(
'user-ct-test-collection-01.txt',
'user-ct-test-collection-02.txt',
'user-ct-test-collection-03.txt',
'user-ct-test-collection-04.txt',
'user-ct-test-collection-05.txt',
'user-ct-test-collection-06.txt',
'user-ct-test-collection-07.txt',
'user-ct-test-collection-08.txt',
'user-ct-test-collection-09.txt',
'user-ct-test-collection-10.txt'
)
)REJECT LIMIT UNLIMITED
parallel
/

And now I have a new set of test data to play with. Some of the query strings are plain strange, interesting to read sometimes.

Of course, I had to see how many of them lead to asktom – AOL users are not big asktom users apparently – only 2 searches out of the 36 million plus led to my site. The two searches used where:

random number generator errors windows xp
mean mod median

I was curious where most of these searches did lead, so a simple SQL query:

aol%ORA10GR2> l
1 select click_url, count(*)
2 from aol_data
3 group by click_url
4 having count(*) > 500
5* order by 2

showed me:


http://www.ask.com 37752
http://www.hotmail.com 38391
http://www.geocities.com 40547
http://www.bankofamerica.com 48534
http://mail.yahoo.com 53856
http://www.ebay.com 77947
http://www.mapquest.com 96136
http://www.imdb.com 98549
http://www.amazon.com 106119
http://en.wikipedia.org 122539
http://www.yahoo.com 161082
http://www.myspace.com 167070
http://www.google.com 366623
16946938

3932 rows selected.

Almost half of the searches led nowhere (that’s disappointing), the user never clicked on anything. By far – the most destination from AOL search - google! The AOL search leads people to google frequently. I was wondering why this might be – so I peeked at the QUERY_STR used often that resulted in a hit on the google site:

aol%ORA10GR2> select query_str, count(*)
2 from aol_data
3 where click_url = 'http://www.google.com'
4 group by query_str
5 having count(*) >= 100
  1. order by 2;

QUERY_STR COUNT(*)
------------------------------ ----------
google.cm 101
www. google.com 110
google' 116
www google 118
www google com 118
goggle 126
googl search 128
www google search com 132
http www.google.com 134
googles 139
enable cookies 144
www.google .com 147
www.google com 154
www.porn.com 191
www.googl.com 205
googl.com 235
google .com 380
google. com 412
google. 587
www google.com 846
google search 1377
google com 1795
www.google 1961
- 5918
www.google.com 26009
google.com 51489
google 257402

27 rows selected.

Guess that answers that – people most often use AOL search to find out where google is… That search term that was used 191 times above – that is a recurring theme in this data set by the way. Believe it or not.

Does make me wonder if this is a breach of privacy/trust here. Not sure that I’m expecting my ISP/service provider to publish stuff like this. They say it is anonymized, but some of the search strings could contain identifying information of some sort.

updated about 7 hours later...


Yeah, I wasn't sure it was a good move on their part - I don't use AOL myself - my family does. I'm not sure I like the data being released - and some others, well they really don't.

Too late to put the cat back into the bag, when I downloaded it - 195 others preceded me, now it is well over 700 and climbing.

Funny to see that others have noticed that the biggest click away from AOL search is.... google.

16:42 by anonymous: Interesting Results, quite a few (366623) searches resulted in a user going to google. Ha.

Updated a day later...


It seems they pulled the content after about 1,000 downloads. Easy come, easy go.
POST A COMMENT

14 Comments:

Anonymous Robbert said....

In all fairness, I'm surprised that the "enable cookies" and "www.porn.com" would lead to google from AOL.

Sun Aug 06, 04:51:00 PM EDT  

Anonymous Alex Gorbachev said....

Tom,
Just want to say thanks. Indeed, it's really nice data to play with. :-)
Cheers, Alex

Sun Aug 06, 06:19:00 PM EDT  

Anonymous Greg said....

It looks to me like a lot of the searches for google will be from the users browser. Depending on what toolbars you have installed in IE, if you enter a URL (such as www google com) that the browser can't resolve it will use a search engine to find the result for you.

Sun Aug 06, 08:39:00 PM EDT  

Blogger Rama Nalam said....

Now, it is showing a blank page. AOL blocked the download.

Sun Aug 06, 10:57:00 PM EDT  

Anonymous Tom Fox said....

Use Google and punch in the URL. Then you can view Google's cache of it. Better yet, just use this link:

http://research.aol.com/pmwiki/pmwiki.php?n=Research.Research?action=downloadman&upname=500kusers.tgz

Sun Aug 06, 11:08:00 PM EDT  

Anonymous Anonymous said....

That link is gone as well. Who got the file and has enough bandwith? ;)

Mon Aug 07, 05:36:00 AM EDT  

Blogger Noons said....

Funny, I asked for a copy of it earlier today but the message is not even showing here!... :-)

Still, if anyone has the file and is willing to burn it in a CD for me, I'll pay for P&P.

Mon Aug 07, 06:50:00 AM EDT  

Anonymous Cameron said....

I would assume that the number of search results leading to Google suggests something other than people are simply searching for it. Consider the following scenario. I think it probably goes something like this:

1. Open your default browser, whose homepage AOL kindly reset to itself when you install AOL.
2. Type "www.google.com", because you want to go search for something. Hit enter.
3. Curse because you realized that the focus was on the search textbox, not the address bar of the browser.
4. Shrug and go ahead and click on the link to Google instead of typing it in again, because your hand is already back on the mouse.

Results: inflated click counts to google, simply because the browser didn't default to focus on the address bar. In fact, several of the sites in that list are probably the same way - mail.yahoo.com, etc...

Mon Aug 07, 09:25:00 AM EDT  

Anonymous Anonymous said....

I second Cameron's scenario. In Netscape 7.2 (at least), if you paste an address into the address bar and press return, it searches. You have to click again on the address bar to go there instead of searching.

Mon Aug 07, 09:52:00 AM EDT  

Anonymous Anonymous said....

Ok, so AOL removed the file altogether. Fortunately, some people have mirrored the file:

http://www.gregsadetsky.com/aol-data/

Mon Aug 07, 02:42:00 PM EDT  

Blogger Robert said....

Tom, this data is not going nto bust XE's limit, is it ? assuming there is still say 80% free

Mon Aug 07, 09:49:00 PM EDT  

Blogger Rich said....

When using your external table syntax on my windoz box I got the "KUP-0420" error (KUP-0420: found record longer than buffer size supported).

So like any good Tom Kyte fan would do I found the fix on asktom.

here is my modified sql for the creation of the external table:

CREATE TABLE AOL_ET
(
"ANONID" NUMBER,
"QUERY_STR" VARCHAR2(4000),
"QUERY_TIME" DATE,
"ITEM_RANK" NUMBER,
"CLICK_URL" VARCHAR2(4000)
)
ORGANIZATION external
(
TYPE oracle_loader
DEFAULT DIRECTORY UTL_DIR
ACCESS PARAMETERS
(
RECORDS DELIMITED BY 0x'0a' CHARACTERSET US7ASCII
BADFILE 'UTL_DIR':'test_%p.bad'
LOGFILE 'test.log_xt'
READSIZE 1048576
FIELDS TERMINATED BY 0x'09' LDRTRIM
MISSING FIELD VALUES ARE NULL
REJECT ROWS WITH ALL NULL FIELDS
(
"ANONID" CHAR(255)
TERMINATED BY 0x'09',
"QUERY_STR" CHAR(4000)
TERMINATED BY 0x'09',
"QUERY_TIME" CHAR(255)
TERMINATED BY 0x'09'
DATE_FORMAT DATE MASK "yyyy-mm-dd hh24:mi:ss",
"ITEM_RANK" CHAR(255)
TERMINATED BY 0x'09',
"CLICK_URL" CHAR(4000)
TERMINATED BY 0x'09'
)
)
location
(
'user-ct-test-collection-01.txt',
'user-ct-test-collection-02.txt',
'user-ct-test-collection-03.txt',
'user-ct-test-collection-04.txt',
'user-ct-test-collection-05.txt',
'user-ct-test-collection-06.txt',
'user-ct-test-collection-07.txt',
'user-ct-test-collection-08.txt',
'user-ct-test-collection-09.txt',
'user-ct-test-collection-10.txt'
)
)REJECT LIMIT UNLIMITED
parallel
/

Thanks Tom!
Best regards...Rich

Tue Aug 08, 11:09:00 AM EDT  

Anonymous Anonymous said....

You rock!! it is totally there..

Wed Aug 09, 03:46:00 PM EDT  

Anonymous Anonymous said....

Cameron said it well.

I have observed that the AOL browser will "bump" you from the url address bar to their search bar -- twice usually. At other times I have finished typing the url and then the browser "wipes it out" and defaults down again to the search bar.

The Department of Justice or FTC ought to look -- this appears to be a conscious (and deceptive) practice to inflate AOL search engine numbers.

Thu Aug 10, 12:42:00 AM EDT  

POST A COMMENT

<< Home