Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « Book Review: Microsoft SQL Server 2012 T-SQL FundamentalsSQL Advent 2012, here is what is coming »
    comments

    When it comes to importing data from an Excel sheet with SSIS, Excel has quite a reputation. And not a terribly good one. Well deserved, to be honest, because numerous issues can rise when dealing with this piece of software. This blog post will deal with the issue I encounter the most on forums, which is the issue of the mixed data types in one column, also known as “Why is some of my data read as NULL?”.

    There are already a numerous blog posts, articles and forum posts written on this subject, but most of them only deal with parts of the issue. I would like a single, comprehensive reference I can point to in the forums, hence this blog post. And because, you know, world domination has to start somewhere…

    Excel: the ugly duckling of source files

    Excel: the ugly duckling of all import files (source)

    The issue

    You have an Excel sheet with data that you would like to import with SSIS. One of the columns has both text data as numeric data. Take this sample as an example:

    The postal code in Belgium usually consists of 4 digits. However, sometimes it is prefixed with “B-“, which makes the data alphanumeric. The house number column has a value containing the word "bus", so it becomes alphanumeric as well. When we pull this data into the SSIS dataflow, we get the following:

     

     

     

     

     

     

     

     

     

    All the alphanumeric data is read as NULL, while the numeric values come in untouched. The problem is not with SSIS, but with the JET OLE DB provider – used to read Excel 2003 or older – or with the ACE OLE DB provider, which is used to read Excel 2007 or newer. Both providers sample the data in a column to determine the data type. The data type with the most occurrences in the sample wins and is selected as the source data type in SSIS. In our sample, the resulting data type is numeric. Since the alphanumeric postal codes cannot be converted to a numeric value, those values are replaced with NULL. Not exactly what we want when extracting data for our ETL process.

    The solution

    The answer to the issue is to add IMEX=1 to the extended properties of the connection string. This tells the provider if intermixed data types are found, the data type specified in the ImportMixedTypes registry setting is taken, which has the default of text. (You can find a list of all the registry paths at the bottom of this blog post). A typical connection string would look like this:

    Provider=Microsoft.Jet.OLEDB.4.0; Data Source=C:\MyExcel.xls;Extended Properties="Excel 8.0;HDR=Yes;IMEX=1";

    You can find more about connection strings at www.connectionstrings.com. In SSIS, you can only modify the Excel connection string manually in the properties window of the Excel connection manager. It cannot be changed through the GUI by double clicking the connection manager. It is possible that you get a warning at the source component, telling you the metadata of some columns have changed. This is normal, as our offending columns have now changed to a string data type.

    After adding IMEX=1, we get this result:

    That looks a little bit better, doesn’t it?

    However…

    The NULL in the house number column disappeared and made place for the actual value, but the NULL in the zip code column is still present. What did just happen here? Remember when I said earlier that the providers sample the data to determine the data type? The JET and ACE OLE DB provider take the first x number of rows, where x is the number specified in the TypeGuessRows registry setting, which has the default of 8. If the first occurrence of intermixed data types happens after this sample, the providers take the data type found in the sample. Again, big trouble for us as we’re still losing data.

    The solution part II

    Luckily the solution is fairly simple: set either the TypeGuessRows registry setting to a higher number (a maximum of 16) or set it to 0. A very popular myth is that setting it to 0 causes the providers to scan all data. However, this is not true as indicated by this KB article. Only the first 16384 rows are scanned, which is in my opinion a pretty high number. Be aware that a performance hit might be possible with larger Excel workbooks as the provider needs to scan more data.

    Make sure you change the registry property in the right version of the providers. The ACE OLE DB provider used in Office 2010 is version 14.0, but the SSIS connection string mentions version 12.0. A big thanks to Valentino (blog | twitter) who pointed me at this problem. You can find more information in his article (look at the bottom for the important update).

    After updating the registry we finally get all our values:

     

     

     

     

     

     

     

     

    But what if…

    What if the first occurrence of intermixed data types happens after row number 16384? (You just don’t have any luck, don’t you?) Or what if the administrator doesn’t allow you to modify the registry? (Always the same with those paranoia people) If you can control the template of the Excel, you can do the following: add a dummy row at the beginning of the sheet, which has alphanumeric data for the violating columns. This way you’ll be sure string data is always included in the sample and that the resulting data type is always alphanumeric. Hide the dummy row in the template and get rid of it in the SSIS dataflow using a conditional split.

    Another alternative you could try is to switch the cell formatting from General to Text, if possible.

    But what if you can’t control the template? Sometimes the Excel files are automatically generated or supplied by a third party. Suck it up and try convincing the supplier of the Excel files to use a file type that is actually intended to store flat file data, such as .csv files or ragged right flat files, instead of bug-inducing spreadsheets. Believe me; it will make your life easier.

    The truncated text problem

    A very similar issue is when you have a text column in your Excel sheets that sometimes has more than 255 characters in a single cell. A common example is a “comments” column. When you try to import this, you can get this error in SSIS:

    [Excel Source [5]] Error: There was an error with Excel Source.Outputs[Excel Source Output].Columns[Comments] on Excel Source.Outputs[Excel Source Output]. The column status returned was: "Text was truncated or one or more characters had no match in the target code page.".

    Bob really hated that customer service...

    The solution part III

    The problem is the same as before: the provider scans the first 8 rows and finds only text of normal length. SSIS takes the default data type of (DT_WSTR,255) for all string data. Our comments are sometimes larger than 255 characters, causing the truncation error.

    If your comments have a maximum length lower than the 4000 character limit of DT_WSTR, you can simply adjust the length of the column in the advanced editor of the source component or you can configure the source to ignore the failure. The last option does give you truncated data though. This blog post by  Bradley Schacht (blog | twitter) describes how you can make those changes.

    If the length exceeds 4000 characters, we have a totally different problem, since it doesn’t fit into the DT_WSTR data type. We need the DT_NTEXT data type – which is a LOB data type – to store the data into the pipeline. This corresponds with a text or nvarchar(max) data type in SQL Server. The solution is the same as before: adjust the TypeGuessRows registry setting so the provider will pick up those lengthy comments and determine the correct data type to store the values.

    The values are shown as <Long Text> as the data viewer can't show LOB data.

    Sometimes SSIS and Excel stubbornly refuse to play nice together and won't adjust the column data type. In that case you’ll have to adjust the data type yourself in the advanced editor of the Excel source.

    Conclusion

    This blog post deals with one of the most common issue found when working with SSIS and Excel: the determination of the source data types by the JET and ACE OLE DB providers. Most issues can easily be solved by adjusting some registry settings. Follow these easy steps to import your Excel data with the least amount of issues:

    1. Add IMEX=1 to the connection string.
    2. Set the TypeGuessRows registry setting to 0.
    3. Double check the data type in the advanced editor of the source component.

    Addendum: the registry settings

    Provider

    Values

    Path

    JET OLE DB

    TypeGuessRows
    ImportMixedTypes

    HKEY_LOCAL_MACHINE\Software\Microsoft\Jet\4.0\Engines\Excel

    ACE OLE DB

    TypeGuessRows
    ImportMixedTypes

    HKEY_LOCAL_MACHINE\Software\Microsoft\Office\12.0\Access Connectivity Engine\Engines\Excel (*)

    (*) For newer versions of Office, you need to replace 12.0 with the correct version number. It took me only 1 hour of debugging to figure that one out.

    Note: in 64-bit systems, the values for the 32-bit providers should be located in the Wow6432Node.

    About the Author

    User bio imageKoen Verbeeck is a Microsoft Business Intelligence consultant at Ordina Belgium, helping clients to get insight in their data. Koen has a comprehensive knowledge of the SQL Server BI stack, with a particular love for Integration Services. He's also a speaker at various conferences.
    Social SitingsTwitterLinkedInLTD RSS Feed
    6208 views
    InstapaperVote on HN

    13 comments

    Comment from: Peter Schott [Visitor] Email · http://schottsql.blogspot.com
    Peter Schott On a related note, you have to be careful with IMEX=1 because it treats all columns as nvarchar(255) for purposes of importing, which plays havoc with those comment columns mentioned above. Excel as a data source is definitely a beating when it comes to trying to import the data through SSIS or anything else that tries to use its column "types". I wish there was something MS could do to allow us to override those guesses and force column types on columns for the source, but your suggestions are probably the best we have for now short of exporting the data and re-importing it from some other file.
    11/27/12 @ 06:32
    Comment from: Koen Verbeeck [Member] Email
    Hi Peter.
    With Excel, you have to be careful about everything. :) While making the screenshots for this blog post, Visual Studio crashed at least 3 times. And I had to restart BIDS 5 times because it would each time put a lock on my Excel file, so I couldn't open it anymore.

    I encountered the issue with the nvarchar(255) and the comments column while I was writing that section. That's why I added "Double check the data type in the advanced editor of the source component."
    11/27/12 @ 06:44
    Comment from: David Forck (thirster42) [Member]
    Awesome post! I've encountered some of these before and don't think I ever got a clear solution.
    11/27/12 @ 07:39
    Comment from: Bill Anton [Visitor] · http://www.byobi.com
    Bill Anton Koen - great post!

    Definitely spent time in SSIS-Excel hell years ago...at the time, I was able to find the "IMEX=1" hack, but still ran into issues because I didn't know about the registry setting regarding the number of rows scanned (which would have solved my problem)...ended up crying mercy and forcing source to switch to CSV. Thanks for the info!
    11/27/12 @ 14:36
    Comment from: Koen Verbeeck [Member] Email
    @Bill: thanks for reading.
    I think you were better off with the .csv source :)
    11/28/12 @ 00:13
    Comment from: Zahid Hanif [Visitor] Email
    Zahid Hanif Thanks Koen for bringing all the issues into a concise post. My current solution converts all my XLS/XLSX files to CSV on the fly so I can avoid all these issues. Getting a registry change through our change control is a nightmare!
    11/28/12 @ 03:31
    Comment from: Furrukh Baig [Visitor] Email · http://furrukhbaig.wordpress.com/
    Furrukh Baig Thanks, Nice read. Microsoft should have improve integration between their own products. I guess you forgot to mention the issues with 64 bit environment. its nightmare.
    11/28/12 @ 07:54
    Mike Unlike SSIS, our software, Advanced ETL Processor works currently with Excel all the time.

    1.Works directly with Excel 3.0-2010
    2.No ODBC, OleDB or MS Jet Required
    3.Works correctly with mixed data types
    4.Works correctly with cells with more 255 characters
    5.No need for IMEX=1, HDR=Yes or Registry hacks (TypeGuessRows)
    6.Works with ranges
    7.Loads data correctly all the time + no need to edit 8.Excel file
    9.Can create Excel files in Excel 3.0-2010 format
    10.Can insert data starting from specific cell
    11.Can clear area before adding data into Excel
    12.Can add headers

    http://etl-tools.com/etl-tools/advanced-etl-pocessor-news/say-no-to-typeguessrows0-imex1-and-excel-odbc-bugs.html
    11/28/12 @ 12:52
    Mike @Bill: thanks for reading.
    I think you were better off with the .csv source :)

    Last time I've checked CSV did not work either

    Following file

    F1,F2,F3
    1,2,3
    12,
    1,3

    Will not work
    May be it is better now
    11/28/12 @ 12:57
    Comment from: Koen Verbeeck [Member] Email
    @Mike: I mentioned it explicitly in the article to avoid confusion: the issues are caused by the OLE DB provides (be it JET or ACE), not by SSIS.

    IMEX and HDR are just connection string properties. I don't see any issue with modifying those.

    By the way, those providers can work with ranges as well.

    Regarding the .csv file: this works perfectly in SSIS 2012. In earlier versions this could easily be solved with a script component.

    Final question: is your ETL tool free as well?
    11/28/12 @ 13:36
    Comment from: Koen Verbeeck [Member] Email
    @Furrukh: thanks for reading and posting a comment. I did not forget about the 64-bit issues, I choose not to include them :) The article was already long enough with the mixed data types issue without me venturing into the crazy world of SSIS, Excel and 64-bitness. Maybe for another blog post :)
    11/28/12 @ 13:54
    Comment from: chopstik [Member]
    chopstik @Koen, this proved to be a very timely post and I had it bookmarked because I knew I would need it soon for an upcoming project. Took me a little bit but at least it's now working and doing what it should be doing. Thanks!
    12/12/12 @ 12:20
    Comment from: Koen Verbeeck [Member] Email
    @chopstik, thanks for reading and taking the time to write a comment. I'm glad my post was useful to at least one person :)
    12/13/12 @ 03:24

    Leave a comment


    Your email address will not be revealed on this site.

    To mislead the spambots.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)