See, you just cannot make this stuff up...
Real email, received today from another Oracle person - asking me a question:
A developer wants to represent all data types as Varchar2(2000) so that they won't have to change sizes in the future. This includes Numbers and Dates. The DBA wants to know if there are any management, performance, or indexing implications.
Ouch (wonder why they stopped at 2000 with the varchar2?). That hurts, doesn't it...
Here was my response:
They are suggesting one of the worst ideas known to data processing people. I cannot overstate how BAD AN IDEA THIS IS.
Funny thing - dates are 7 bytes; you cannot "under-" or "over-" size them (you never CHANGE their size). Timestamps are fixed sizes as well - 7, 11, or 13 bytes, depending on fractional-second and timezone needs.
Funny thing part two: if someone defines a Number(2), we can "alter table t modify X number(3)" to increase the size. But, if someone were to MAX OUT all numbers - we CANNOT shrink them!!!
Funny thing part three: same with varchar2 - we can make them bigger anytime we need to - immediately, absolutely. We cannot, however, fix an oversized varchar2 in the future once the column has data - we can GROW, we cannot SHRINK.
Funny thing part four: I spend 20 minutes on this topic - using the right datatype - in all of my seminars, over and over and over again.
Here is a cut and paste from my book Effective Oracle by Design.
Use the Correct Datatype
Using the correct datatype seems like common sense, but virtually every system I look at does one of the following:
• Uses a string to store dates or times
• Uses a string to store numbers
• Uses VARCHAR2(4000) to store all strings
• Uses CHAR(2000) to store all strings, wasting tons of space and forcing the use of a lot of TRIM function calls
• Puts text in a BLOB (raw) type
I have a very simple rule: Put dates in dates, numbers in numbers, and strings in strings. Never use a datatype to store something other than what it was designed for, and use the most specific type possible. Furthermore, only compare dates to dates, strings to strings, and numbers to numbers. When dates and numbers are stored in strings, or stored using inappropriate lengths, your system suffers:
• You lose the edit upon insertion into the database that verifies your dates are actual dates and your numbers are valid numbers.
• You lose performance.
• You potentially increase storage needs.
• You definitely decrease data integrity.
How many of you know what ORA-01722 or ORA-01858 errors are off the top of your head? I bet many of you do, because they are so prevalent in systems where numbers are stored in strings (ORA-01722: invalid number) and dates in strings (ORA-01858: a non-numeric character was found where a numeric was expected).
How Data Integrity Decreases
Using an incorrect datatype is wrong for many reasons, but the first and foremost is data integrity. Systems that use strings for dates or numbers will have some records with dates that are not valid and numbers that are not numbers. It is just the nature of the game here. If you permit any string in your date field, at some point, you will get dirty data in there.
Without data-integrity rules in place, the integrity of your data is questionable. I’ve needed to write the functions to convert strings to dates but return NULL when the date won’t convert. I’ve also needed to try one of five date formats to see if I can get the date to convert. Can you look at 01/02/03 and tell what date that is? Is that yy/mm/dd, dd/mm/yy, or something else?
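The kind of defensive helper described above can be sketched in Python (illustrative only - the helper name `to_date_or_none` is mine, standing in for the PL/SQL functions I had to write):

```python
from datetime import datetime

def to_date_or_none(s, fmt):
    """Convert a string to a date, returning None when it won't parse."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

# A string column happily holds '20009950'; a real date type rejects it.
print(to_date_or_none("20001231", "%Y%m%d"))  # 2000-12-31
print(to_date_or_none("20009950", "%Y%m%d"))  # None

# And '01/02/03' is genuinely ambiguous - each format yields a different day:
for fmt in ("%y/%m/%d", "%d/%m/%y", "%m/%d/%y"):
    print(fmt, to_date_or_none("01/02/03", fmt))
```

Three plausible formats, three different dates - and nothing in the string itself tells you which one was intended.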
How Performance Suffers
Beyond the obvious data-integrity issues associated with incorrect datatypes, there are other subtle issues. To demonstrate, we’ll use an example of a table with two date columns. One will be stored in a string using YYYYMMDD and the other as a DATE type. We will index these values and analyze the tables completely.
ops$tkyte@ORA920> create table t
2 as
3 select to_char( to_date('01-jan-1995','dd-mon-yyyy')+rownum, 'yyyymmdd' ) str_date,
4 to_date('01-jan-1995','dd-mon-yyyy')+rownum date_date
5 from all_objects;
ops$tkyte@ORA920> create index t_str_date_idx on t(str_date);
ops$tkyte@ORA920> create index t_date_date_idx on t(date_date);
ops$tkyte@ORA920> exec dbms_stats.gather_table_stats( user, 'T', method_opt=> 'for all indexed columns size 254', cascade=> true );
Now, let’s see what happens when we query this table using the string date column and the real date column. Pay close attention to the Cost and Card= component of the plan:
ops$tkyte@ORA920> set autotrace on explain
ops$tkyte@ORA920> select * from t
2 where str_date between '20001231' and '20010101';
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=5 Card=406 Bytes=6090)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=5 Card=406 Bytes=6090)
2 1 INDEX (RANGE SCAN) OF 'T_STR_DATE_IDX' (NON-UNIQUE)
ops$tkyte@ORA920> select * from t where date_date between to_date('20001231','yyyymmdd') and to_date('20010101','yyyymmdd');
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=3 Card=1 Bytes=15)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=3 Card=1 Bytes=15)
2 1 INDEX (RANGE SCAN) OF 'T_DATE_DATE_IDX' (NON-UNIQUE)
So, what happened there? Well, the optimizer understands VARCHAR2 types and it understands DATE types. The optimizer knows that between the two DATE items December 31, 2000, and January 1, 2001, there is only one day. The optimizer also thinks that between the two string items ‘20001231’ and ‘20010101’ there are a whole bunch of values. The cardinality is thrown off.
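A quick back-of-envelope sketch of why the string guess is so far off (this is just illustrative arithmetic, not the optimizer's actual formula):

```python
from datetime import date

# Between the two DATE endpoints there is exactly one day:
days = (date(2001, 1, 1) - date(2000, 12, 31)).days
print(days)  # 1

# But lexically, thousands of 8-digit strings sort between the endpoints -
# '20001232', '20001299', '20005555', ... all look "in range" when the
# column is a string. Counting just the numeric ones, inclusive:
strings_in_range = 20010101 - 20001231 + 1
print(strings_in_range)  # 8871
```

One day versus thousands of candidate values - no wonder the cardinality estimate for the string column is thrown off.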
But, so what? What do we care if the cardinality is wrong? It won’t affect our output—the answer. That is correct, but it could have some impact on our overall performance. Consider a different query against the same data, asking for effectively the same result set:
ops$tkyte@ORA920> select * from t
2 where str_date between '20001231' and '20060101';
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=2034 Bytes=30510)
1 0 TABLE ACCESS (FULL) OF 'T' (Cost=12 Card=2034 Bytes=30510)
ops$tkyte@ORA920> select * from t where date_date between to_date('20001231','yyyymmdd') and to_date('20060101','yyyymmdd');
0 SELECT STATEMENT Optimizer=CHOOSE (Cost=12 Card=1823 Bytes=27345)
1 0 TABLE ACCESS (BY INDEX ROWID) OF 'T' (Cost=12 Card=1823 Bytes=27345)
2 1 INDEX (RANGE SCAN) OF 'T_DATE_DATE_IDX' (NON-UNIQUE) (Cost=6 Card=1823)
Note: As discussed in Chapter 6, different database parameter settings will influence the cost of various operations. You may need to increase the range of dates to see the same effect as shown in this example, but it will happen at some threshold.
This time, the fact that we hid a date in a string has a serious side effect. Our query plan has changed. We are now full-scanning for the string date but index-range scanning for the DATE type date. So, besides the fact that there is nothing stopping someone from inserting 20009950 as a date value into our field, the use of a string has withheld valuable information from the database. We lose all around.
How You Might Increase Your Storage Requirements
In addition to using the proper base datatype such as number, date, or string, you should also use the most specific type you can. For example, use VARCHAR2(30) for a field that is up to 30 characters in length; do not use VARCHAR2(4000).
“I work with a modelers group. My modeler would like to define every VARCHAR2 field with the maximum length, which means that a table with 20 VARCHAR2 fields will all be defined with a maximum of 2000 or 4000 bytes. I tried to talk to him about the reasons we identify data with correct lengths and names in order to understand what we have in our database. He told me that it doesn’t matter, since Oracle just stores the length, etc., and there is no overhead. I don’t believe this is true, but have been jumping between so many databases that I cannot find a document on the internals of Oracle storage. Can you help me out here with this question?”
My gut response was, “This is your data modeler? My goodness!” They are the ones who are supposed to be telling you that it is vital to use the appropriate length when defining fields! That is their job. Let’s just forget about things like storage for a minute; why don’t we ask him:
• What is going to happen when users pull this up in a query tool that formats each field based on the width of the column in the database? They’ll see one column and need to scroll way over to see the second, the third, and so on.
• Say the code prepares a query that selects ten columns that are VARCHAR2. The developers, for performance, would like to array fetch (very important). They would like to array fetch say 100 rows (very typical). So, you have 4,000 × 10 × 100 = almost 4MB of RAM the developers must allocate! Now, consider if that were ten VARCHAR2(80) fields (it’s probably much smaller than that). That’s 80 × 10 × 100 - about 78KB. Ask the data modeler how much RAM he is willing to kick in for this system.
• Now, the developers start to build a data-entry screen to put data into the database. Wow, that code field can be 4,000 characters long and that first name can be 4,000 characters long. How the heck is anyone going to know what sort of data can really go in there?
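The array-fetch arithmetic above is easy to check (a sketch; 80 bytes is simply a stand-in for a sensibly sized column):

```python
cols, rows = 10, 100  # 10 VARCHAR2 columns, array fetch of 100 rows

# Buffer each client must size for the worst case the column allows:
oversized  = 4000 * cols * rows   # every column declared VARCHAR2(4000)
right_size = 80 * cols * rows     # the same columns declared VARCHAR2(80)

print(oversized / 1024 / 1024)    # ~3.8 MB per open query
print(right_size / 1024)          # ~78 KB
```

And remember this is per open cursor - multiply by the number of concurrent connections on an application server and the oversized columns get expensive fast.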
Tell your data modeler group members that they need to consider the length as a constraint. Just as they use primary and foreign keys, they should use the proper and correct length on fields. You can always expand a field via a command like alter table t modify c varchar2(bigger_number). There is no reason to use the maximum length everywhere. It will hurt the applications you develop, because they will mistakenly allocate many megabytes of RAM. Just think of the array fetch example with an application server. Now, it’s not just 4MB; it’s 4MB × the number of connections. You are talking some real memory here for a single query, and you’ll be doing a lot of them at the same time.
End Ask Tom
A CHAR(2000) will consume 2,000 bytes of storage whether you put in the letter a, the string ‘hello world’, or 2,000 characters. A CHAR is always blank-padded. Additionally, are you thinking about using an occasional index in your system? If so, beware of storage-related problems. Consider the following:
tkyte@ORA817.US.ORACLE.COM> create table t ( a varchar2(4000), b varchar2(4000));
tkyte@ORA817.US.ORACLE.COM> create index t_idx on t(a);
create index t_idx on t(a)
ERROR at line 1:
ORA-01450: maximum key length (3218) exceeded
NOTE: In Oracle9i, the maximum key length is larger – but the restriction still exists. For example, an index on T(a,b) would raise: ORA-01450: maximum key length (6398) exceeded in Oracle9i.
My system has an 8KB block size. I would need to use at least a 16KB block size to index even a single column, and even then, if I tried to create a concatenated index on T(A,B), it would fail there too!
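The arithmetic behind those failures is simple (the limits here are taken straight from the error messages above):

```python
# Maximum index key lengths quoted in the ORA-01450 errors above
max_key_8k_8i = 3218   # 8KB block size, Oracle8i
max_key_8k_9i = 6398   # 8KB block size, Oracle9i

# One VARCHAR2(4000) column already exceeds the 8i limit:
print(4000 > max_key_8k_8i)         # True

# And a concatenated key on two such columns exceeds the 9i limit as well:
print(4000 + 4000 > max_key_8k_9i)  # True
```

The declared maximum width of the columns, not the data actually in them, is what the index key length check uses - another hidden cost of "just make everything 4000."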
The same holds true for your numbers and the new Oracle9i TIMESTAMP datatypes: When appropriate, use scales and precisions on those fields in order to better define your data integrity and to give applications that much more information about the data itself.
In short, never be tempted to use anything other than a DATE or TIMESTAMP datatype to hold a date value, and never be tempted to use a VARCHAR2 to hold a number. Use the appropriate and correct type for each to ensure maximum performance, and more important, to protect your data integrity.