LessThanDot

This is day two of the SQL Advent 2012 series of blog posts. Today we are going to take a look at how data types can have an impact in queries and also the size of your database.

Char vs NChar

SQL Server has two data types to store character data1, both of them come in fixed and variable length sizes. The char and varchar data type uses one byte of store to store one character, the nchar and nvarchar data type uses two bytes of store to store one character. The nchar and nvarchar data types are used to store unicode of data

Let’s think about that for a second, what we are saying is that the char and varchar data type can store twice the number of characters in the same amount of store as the nchar and nvarchar data type. Why does this matter, space is cheap right? True, space is getting cheaper but we are also storing more and more data every year.

Now think about what happens you have everything stored as unicode data

What happens to your backup and restore process, will it be faster or slower, will the files be bigger if not compressed?
What about when transferring the results to and from your database server, are the packets able to store the same number of characters.
What about the amount of data on a page, what does this do to indexes and index lookups, how does it affect index maintenance?

If you don’t need it, then don’t use unicode data.

Some examples of what I have seen stored in nchar and nvarchar when realy you shouldn’t:

Zip Code for US addresses

US addresses

Social Security Numbers (which were stored in plain text none the less)

Integer data (enforced by constraints or the app layer to make sure these were only digits)

Let’s take a quick look by running some T-SQL

First create these two tables

CREATE TABLE TestChar (SomeCol char(10))
GO

CREATE TABLE TestNChar (SomeCol nchar(10))
GO

Now populate each with some data

INSERT TestChar
SELECT TOP 1000000 '1234567890'
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4
GO

INSERT TestNChar
SELECT TOP 1000000 '1234567890'
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4
GO

Let’s see how much space is used by both tables

EXEC sp_spaceused 'TestChar'

EXEC sp_spaceused 'TestNChar'

18,824 KB

28,744 KB

If you looked at the reserved column, you will see that the nchar data is using 10 MB more than the char data

Implicit conversions

Besides the storage increase there is also a problem when querying for data that looks like varchar but is stored as unicode. Run the code below.

SET SHOWPLAN_TEXT ON
GO
DECLARE @v varchar(10) = '0123456789'

SELECT * FROM TestChar WHERE SomeCol LIKE  @v +'%'
GO

SET SHOWPLAN_TEXT OFF
GO

Here is the plan for that query

|–Table Scan(OBJECT:([tempdb].[dbo].[TestChar]),

WHERE:([tempdb].[dbo].[TestChar].[SomeCol] like [@v]+‘%’))

If we look at the plan we can see that this looks pretty good

Usually people will sometimes change the datatype of a column but will not change any code that access this column. Let’s now change the data type of the column to nchar

ALTER TABLE TestChar ALTER COLUMN SomeCol nchar(10)
GO

Run the query that gives you the plan again

SET SHOWPLAN_TEXT ON
GO
DECLARE @v varchar(10) = '0123456789'

SELECT * FROM TestChar WHERE SomeCol LIKE  @v +'%'
GO

SET SHOWPLAN_TEXT OFF
GO

Here is the plan

|–Table Scan(OBJECT:([tempdb].[dbo].[TestChar]),

WHERE:([tempdb].[dbo].[TestChar].[SomeCol] like CONVERT_IMPLICIT(nvarchar(11),[@v]+‘%’,0)))

As you can see, there is a conversion going on right now.

In order to get rid of the conversion, use the correct data types

SET SHOWPLAN_TEXT ON
GO
DECLARE @v nvarchar(10) = '0123456789'

SELECT * FROM TestChar WHERE SomeCol LIKE  @v +'%'
GO

SET SHOWPLAN_TEXT OFF
GO

Using larger datatypes when it is not needed

I see this problem mostly with the integer data types. Below is a list of the integer data types together with their storage size and range

tinyint

Storage size is 1 byte. Integer data from 0 through 255.

smallint

Storage size is 2 bytes. Integer data from -2^15 (-32,768) through 2^15 – 1 (32,767).

int

Storage size is 4 bytes. Integer data from -2^31 (-2,147,483,648) through 2^31 – 1 (2,147,483,647).

bigint

Storage size is 8 bytes. Integer data from -2^63 (-9,223,372,036,854,775,808) through 2^63-1 (9,223,372,036,854,775,807).

Now imagine facebook with a billion users decided to use bigint as CountryID in their Country table, this key is then uses as a foreign key in the user demographics table. This is wasteful,either use a smallint since we won’t go through 32 thousand countries in the forseeable feature or use the 2 or 3 character ISO code. The problem is even worse if you have a compound 6 column key and it is used as a foreign key in tons of other tables…that was real fun to clean up….use a surrogate 1 column key in that case…but be sure to test….normalize till it hurts then denormalize till it works….I will cover normalization in another post…just wanted to mention it

That is all for day two of the SQL Advent 2012 series, come back tomorrow for the next one, you can also check out all the posts from last year here: SQL Advent 2011 Recap

1 I know there is text and ntext but hose are deprecated

SQL Advent 2012 Day 2: Data types storage differences

Char vs NChar

Implicit conversions

Using larger datatypes when it is not needed

About SQLDenis