We all have perfectly normalized tables, with perfectly scrubbed data, right? I wish! Sometimes we are stuck with dirty data in legacy applications. What’s worse is that we are sometimes expected to do interesting things with dirty data. In this blog, I will show you how to extract a number from a varchar column that contains letters and numbers.
First, let’s take a look at what some of that data might look like:
2.1 miles
4 miles
Approximately 6.5 miles
3.9
7.2miles
Suppose we wanted to extract just the numeric portion of the text. How would we do this? My first reaction was to use PatIndex to find the first non-numeric character. Unfortunately, this won’t work because of the 3rd row (Approximately 6.5 miles). Then, I thought about CharIndex, knowing there is an optional 3rd parameter that allows us to pick the starting location for the search. Unfortunately, CharIndex doesn’t allow pattern matching, and PatIndex doesn’t accommodate starting somewhere other than the beginning.
How do we do this? Well… it get’s tricky.
The first thing we need to do is to find the character that is a number. For this, we can use PatIndex:
Select PatIndex(‘%[0-9.-]%’, Data)
Next, Let’s remove any characters from the beginning of the string, like this:
Select SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000)
This allows us to accommodate any characters that appear before the numbers. The substring result forces the numbers to the beginning of the string. Next step is to determine where the numbers end. For that, we can use PatIndex again.
Select PatIndex(‘%[^0-9.-]%’, SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000))
PatIndex returns an integer for the first character that matches. We’ll want to use this in a LEFT function, there there is a potential problem. If there’s no match, PatIndex will return 0. So, we should make sure that PatIndex will always return at least one. We can trick the system by appending a character before running the PatIndex function. Like this:
Select PatIndex(‘%[^0-9.-]%’, SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000) + ‘X’)
By including something we know will match the pattern, we guarantee that we get a number greater than zero from the PatIndex function.
Next step is to get the left part of the substring, which will return just the numbers. Like this:
Select Left(SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000), PatIndex(‘%[^0-9.-]%’, SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000) + ‘X’)-1)
Now, I will admit that this is a very ugly formula to use. However, there is a high probability that it is re-usable. Because of this, it would make a nice little function to have in our SQL Arsenal.
Create Function dbo.GetNumbers(@Data VarChar(8000)) Returns VarChar(8000) AS Begin Return Left( SubString(@Data, PatIndex('%[0-9.-]%', @Data), 8000), PatIndex('%[^0-9.-]%', SubString(@Data, PatIndex('%[0-9.-]%', @Data), 8000) + 'X')-1) End
Now, we can use this function wherever we need it. As a final step, let’s test it on our sample data. Before doing this, let’s think about other data that could potentially cause us problems. We should test an empty string, NULL, a string without any numbers, and a string that contains multiple numbers separated by characters.
Declare @Temp Table(Data VarChar(60))
Insert Into @Temp Values('2.1 miles') Insert Into @Temp Values('4 miles') Insert Into @Temp Values('Approximately 6.5 miles') Insert Into @Temp Values('3.9') Insert Into @Temp Values('7.2miles') Insert Into @Temp Values('') Insert Into @Temp Values(NULL) Insert Into @Temp Values('No Numbers Here') Insert Into @Temp Values('approximately 2.5 miles, but less than 3') Select Data, dbo.GetNumbers(Data) From @Temp
The results are:
<span style="color:blue;"> Data Numbers ---------------------------------------- ------- 2.1 miles 2.1 4 miles 4 Approximately 6.5 miles 6.5 3.9 3.9 7.2miles 7.2 NULL NULL No Numbers Here approximately 2.5 miles, but less than 3 2.5 </span>
I wish we could count on our data being perfect, but we really can’t. The best we can do is to use the data we are given in a way that works.
51 Comments
This is excellent and I owe you thanks again 😉
Simple way to do this in Oracle is
SELECT TRANSLATE(‘blah123blah’, ‘0123456789abcdefghijklmnopqrstuvwxyz’,’0123456789′)
FROM dual;
The translate function looks for a character by character translation. If it only gets a match on numerics it will strip out the remaining characters. Might need to add uppercase characters to 2nd arg, not sure.
how do decimal values get handled in your example? It looks like a basic pattern match function to me and not really what the author was getting at. In your comment a basic LIKE regex pattern is the equal in T-SQL and not much to do with the actual blog
This worked great, it is going to be a real life saver.
This is an excellent function. I have included in mine as well as clients databases with aknowledgement for the author.
Massive thanks, I looked in numerous places for such an answer. Many thanks.
I needed just this today 🙂
George,
I was looking for something like this the other day. Found it via
tek-tips. Thanks again!
THANKS! Great function for deailing with very poor database design. You have saved me a lot of time. Thanks.
How can i extract percentages only?
Eg: 0.35 from “Up to 0.35% pa”
First off, great! We have people who submit data via cellphone and I needed a way to get just the numbers out of the messages that they send. I can’t rely only on this function to do all the cleaning that’s required, and I’ve written some of my own code to bulk it out, but it certainly helped!
I did some playing and I want to ask if this is not just as good/better than your final query above:
SELECT SUBSTRING(@NumTrim, PATINDEX(‘%[0-9.-]%’, @NumTrim), PATINDEX(‘%[^0-9.-]%’, SUBSTRING(@NumTrim, PATINDEX(‘%[0-9.-]%’, @NumTrim), 8000)+’X’)-1)
I’ve simply removed the LEFT() expression and used the same info to feed one substring. It works with the data I’ve fed it, but am I missing something?
The article is excellent. Liked a lot !
Maureen:
the code works fine for extracting 0.35 from “Up to 0.35% pa”
This article is good but one question, but for the string
“approximately 2.5 miles, but less than 3” function returned only 2.5.
Does it mean to find only the first occurence of the numbers?
Thanks — Vijaya Kadiyala
http://www.DotNetVJ.com
Vijaya,
Yes. This is only meant to find the first occurrence of a number within a string. If you have a different requirement than this, please post a question in the forum here, and I (and others) will be glad to help.
http://forum.lessthandot.com/viewforum.php?f=17
Thanks, great article!
Here a function to create series based on this one:
t-sql Increase series number
Excellent article and very helpful!
Thanks
CTE to extract all numbers from string:
Declare @txt varchar(4000) =’ab123sdf45.6dfgh. 6fgjh7df12345.67gh9a123,456.79aa8′
Declare @num varchar(10) = ‘%[0-9.,]%’
Declare @oth varchar(10) = ‘%[^0-9.,]%’
Set @txt+=’X’
;With a as (
Select 1 as i,
Substring(@txt,Patindex(@num,@txt),patindex(@oth,Substring(@txt,Patindex(@num,@txt),4000))-1) as num,
substring(@txt,Patindex(@num,@txt)+patindex(@oth,Substring(@txt,Patindex(@num,@txt),4000))-1,4000) as txt
Union All
Select i+1,
Substring(txt,Patindex(@num,txt),patindex(@oth,Substring(txt,Patindex(@num,txt),4000))-1),
substring(txt,Patindex(@num,txt)+patindex(@oth,Substring(txt,Patindex(@num,txt),4000))-1,4000)
from a
Where txt like ‘%[0-9]%’
)
Select i, num from a
Where num like ‘%[0-9]%’
Here is another interesting approach to keep only numbers (all numbers) and remove anything else from the string – Alex Kuznetsov idea
from this thread Extracting numbers from string
Hi,
Thanks, great article!
I want extract first numeric values of the string, like:
Ex.
Input output
—— ———
# 1 0
6 patches 6
– 0
1 15 gm tube 1
1 BOX-8 1
1/2pak 1
1`5 gm 1
16 packs/7e 16
one box(100) 0
1-28 day pak 1
input
———-
# 1
6 patches
–
1 15 gm tube
1 BOX-8
1/2pak
1`5 gm
16 packs/7e
one box(100)
1-28 day pak
output
———
0
6
0
1
1
1
1
16
0
1
Try
declare @t table(Col varchar(100))
insert into @t
select
‘# 1’
union all select
‘6 patches’
union all select
‘- ‘
union all select
‘1 15 gm tube’
union all select
‘1 BOX-8’
union all select
‘1/2pak’
union all select
‘1`5 gm’
union all select
’16 packs/7e’
union all select
‘one box(100)’
union all select
‘1-28 day pak’
select COL, F.Pos, F2.Number
from @t
cross apply (select PATINDEX (‘%[^0-9]%’,Col) as Pos) F
cross apply (select Case when F.Pos > 1 then cast(left(Col, Pos – 1) as int) else 0 end as [Number]) F2
Naomi,
Your query fails when the string is all numbers. For example, add “union all select ‘128’” to the end of your sample data.
Srr, you will probably have better luck if you try this:
Convert(Int, Left(YourColumnName, PatIndex(‘%[^0-9]%’, YourColumnName + ‘ ‘)-1))
Yes. This is also simpler than cross apply and checking Pos.
Thanks George Mastros,
But i want like 14.5 values also.
I want in numeric(12,2) not in int,i tried but getting error.
Convert(numeric(12,2), Left(col, PatIndex(‘%[^0-9.]%’, col + ‘ ‘)-1))
from table
Thanks in advance,
Srr
Srr,
You are getting an error because you CAN convert an empty string to an int, but not to a decimal.
Try this:
Coalesce(Convert(Decimal(12,2), NullIf(Left(Col, PatIndex(‘%[^0-9.]%’, Col + ‘ ‘)-1), ”)), 0)
Instead of directly converting to decimal, we use the NULLIF function to convert empty strings to nulls. Then we convert, and finally use coalesce to convert the nulls to 0.
Hi George Mastros,
It is very nice.U helped me alot,my problem solved.
In my query i used like this:
case charindex(‘.’,col)
when 1 then case isnumeric(col) when 1 then col else ‘0’ end
else Coalesce(Convert(numeric(12,2),NullIf(Left(ltrim(col),PatIndex(‘%[^0-9.]%’,ltrim(col) + ‘ ‘)-1), ”)),0)
end
Thanks,
Srr
Hi,
I have table like below,
first_nm last_nm Gender
Raj s F
Rajp e M
Kar Sh M
De re M
Sw ko F
Kal la F
Sa Sa M
now i need to update my table each row last_nm with another record last_nm within gender(means female-female,male-male)
example:
first_nm last_nm Gender
Rajp re M
Kar e M
De Sh M
Sw s F
Kal ko F
Raj la F
Thanks,
Srr
Awesome code. I was looking for the most efficient way to get the next increment of invoice numbers, where there can be an alphanumeric prepender and post, where the max number is not guranteed to be the last record. This code works flawlessly. Thank you.
I want to re-write the query given above:
Select Left(SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000), PatIndex(‘%[^0-9.-]%’, SubString(Data, PatIndex(‘%[0-9.-]%’, Data), 8000) + ‘X’)-1)
but in Teradata syntax. Some part the about query does not work the way it should in Teradata. For example, ‘%[0-9.-]%’ is not working in Teradata.
please help
declare @debid bigint
declare @address varchar(200) = null
set @debid=(select debtor_id from account where debtor_id=34691)
set @address = (Select Left(SubString(information, PatIndex(‘%[0-9.-]%’, information), 8000), PatIndex(‘%[^0-9.-]%’, SubString(information, PatIndex(‘%[0-9.-]%’, information), 8000) + ‘X’)-1)
from debtor_info where description = ‘Mailing Address’ and score > 0 and debtor_id=@debid)
print @address
I have this address–(30 JLN SAM HUP HUT 34900 PANTAIREMIS PERAK)
i need to pik pincode like -34900 from the abow data can u please send me query ,im using abow query but it reterns only first digit of the dada…
@kasinathpatil,
The code in this blog is meant to find any number. It appears as though you are looking for a 5 digit number, which does change this a bit. As such, I would encourage you to try this instead.
————————————————-
Declare @Information VarChar(200)
Set @Information = ’30 JLN SAM HUP HUT 34900 PANTAIREMIS PERAK’
Select Case When PatIndex(‘%[0-9.-][0-9.-][0-9.-][0-9.-][0-9.-]%’, @information) > 0
Then SubString(@Information, PatIndex(‘%[0-9.-][0-9.-][0-9.-][0-9.-][0-9.-]%’, @information), 5)
Else ” End
————————————————-
If a 5 digit number is found, it will be returned. If there is no 5 digit number, this code will return an empty string.
One word of caution, if you have a string of numbers longer than 5 digits, this code will return the first 5 digits of that number. For example, if you have a 5 (or more) digit house number, this code will return that instead of the 5 digit number.
Simple but brilliant. Thank you!
very helpful
thanks
Nice!! That looks really cool!. I could find another options to do that but all of them involve the creation of a function, and i need to do this inside a simple query. Here is the complication i have: How can i do the same using MySql??. i can’t find a replacement for the Patindex function…
Thank you so much for this function. It has been very helpful.
I am curious if it would be possible to modify the function to handle commas when they appear within a number -e.g. 1,340. 1,340 returns a value of 1 when I run it.
Any assistance would be greatly appreciated.
In the function I show in the code you will see 2 occurrences of ‘%[0-9.-]%’
Replace both of them with:
‘%[0-9,.-]%’
** Note that I added a comma in the string.
Hello George,
I really enjoyed your solution above, but I still have a question.
George what about if I’m to repeat the search of numbers within the string and each number must be assigned to a fixed column.
eg. “car256 bicycle1245 motor48 train453” and in this string which is already in column in my database, I’m to further select only the numbers aspect and place them in other table with different columns for car, bicycle, motor and train
hello
very nice
thanks
Maximillian,
I would suggest that you find a good split function and split on the spaces. Then you can use the extract numbers function to separate the “car” from the “256” and so on.
George,
I’m really grateful for the swift response you my post. actually I want the numerical section as my data to be inserted then,the alphabets thus car, motor and so to be my condition of insert, thus WHERE column name = car, or motor depending on where it was extracted form.
Please kindly gie a sample of the solution thank yoou
Maximillian
George,
how would you handle your own example if one of the data was like ‘Approximately 6.5 miles and 5.2 miles’ all on the same line.
Can you suggest a fix that allows the minus symbol to appear away from the number? You code fails with the text:
‘tip-toe 3 steps’
(The function returns just the minus sign, not the number 3)
The function also fails when a dot appears then text then the number:
‘I found a problem. This text generates a dot instead of a 3.’
I coded an ugly solution. Perhaps you have a clean fix?
[…] changed format of a string pattern our SQL was expecting. I came across a function written by George Mastros that addressed my SQL needs. Long story short we have a string with text then a number and some […]
I love what you’ve done here… but using sql server and RegExMatch is powerful and looks a lot neater. I found the following at https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/
The equivalent to your Patindex is:
Select dbo.RegExMatch(‘\d+’, YourColumnName,1) as StripNumber
If I can follow his instructions anyone can!!
thank you so much!
Hi, This works great in all cases except where I have a comma as a thousand separator. Where I have ‘1,000’ I get ‘1’ as the result – how can I fix this?
…Answered my own question…
alter Function dbo.GetNumbers(@Data VarChar(8000))
Returns VarChar(8000)
AS
Begin
declare @NoComma varchar(8000) = replace(@data,’,’,”)
Return
Left(
SubString(@NoComma, PatIndex(‘%[0-9.-]%’, @NoComma), 8000),
PatIndex(‘%[^0-9.-]%’, SubString(@NoComma, PatIndex(‘%[0-9.-]%’, @NoComma), 8000) + ‘X’)-1)
End
[…] Extracting numbers with SQL Server […]
“We all have perfectly normalized tables, with perfectly scrubbed data, right? I wish! Sometimes we are stuck with dirty data in legacy applications. What’s worse is that we are sometimes expected to do interesting things with dirty data.”
..That is excatly where i am. Thank you so much sir, this is what i’ve been looking for the days.