Geeks With Blogs
Rahul Anand's Blog If my mind can conceive it, and my heart can believe it, I know I can achieve it.

Regular expressions work wonderfully when you want to extract matches of a pattern within a string. You can also use a regular expression to replace a pattern with some other text, or to split a string using the matching patterns as delimiters.
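The three uses mentioned above (extract, replace, split) can be sketched in Python with the standard re module. The sample text and the separator pattern are made up for illustration only.

```python
import re

text = "apple, banana;cherry"

# Extract: find every run of lower-case letters.
words = re.findall(r"[a-z]+", text)

# Replace: swap each separator (a comma or semicolon plus optional spaces) for " | ".
replaced = re.sub(r"[,;]\s*", " | ", text)

# Split: use the same separator pattern as the delimiter.
parts = re.split(r"[,;]\s*", text)

print(words)     # ['apple', 'banana', 'cherry']
print(replaced)  # apple | banana | cherry
print(parts)     # ['apple', 'banana', 'cherry']
```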

 

Is it confusing? Are you wondering what “pattern” means?

 

OK, a pattern is just a set of characters and meta-characters that describes all the text you are interested in. The ordinary characters stand for themselves (a literal pattern), while the meta-characters give your pattern the power to match more than one string. Meta-characters are similar to, but much more powerful than, the wildcard characters you have probably used with the dir (DOS) or ls (UNIX) command to list several files at once. For example, to match all files with the extension “exe” you write “*.exe”, and to match all files whose names contain ‘a’ and ‘b’ separated by any one character you write “*a?b*”.
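To make the wildcard comparison concrete, here is a small Python sketch that runs the two wildcards from the paragraph above next to roughly equivalent regular expressions. The file names are invented for the example, and the regular expressions are one possible translation, not the only one.

```python
import fnmatch
import re

names = ["setup.exe", "notes.txt", "xaybz.dat", "ab.dat"]

# Shell-style wildcards: * is any run of characters, ? is exactly one character.
exe_wild = [n for n in names if fnmatch.fnmatchcase(n, "*.exe")]
ab_wild = [n for n in names if fnmatch.fnmatchcase(n, "*a?b*")]

# Roughly equivalent regular expressions: .* is any run, . is one character,
# and a literal dot has to be escaped as \.
exe_re = [n for n in names if re.fullmatch(r".*\.exe", n)]
ab_re = [n for n in names if re.fullmatch(r".*a.b.*", n)]

print(exe_wild, exe_re)  # ['setup.exe'] ['setup.exe']
print(ab_wild, ab_re)    # ['xaybz.dat'] ['xaybz.dat']
```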

 

List of Meta-characters used in patterns:

 

\        general escape character with several uses

^       assert start of subject (or line, in multi-line mode)

$        assert end of subject (or line, in multi-line mode)

.         match any character except newline (by default)

[        start of “character class” definition

]        end of “character class” definition

|        start of alternative branch

(        start sub-pattern

)        end sub-pattern

?        extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer

*        0 or more quantifier

+        1 or more quantifier

{        start min/max quantifier

}        end min/max quantifier
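A few of the meta-characters listed above, tried out in Python. The test strings are made up; each assertion holds with Python's standard re module, and other regex engines may differ in small details.

```python
import re

# ^ and $ anchor the pattern to the start and end of the subject.
assert re.search(r"^cat$", "cat")
assert not re.search(r"^cat$", "concat")

# . matches any single character except a newline.
dot = re.findall(r"c.t", "cat cot c\nt")
assert dot == ["cat", "cot"]

# | separates alternatives; ( ) groups a sub-pattern.
# (?: ... ) is one way ? "extends the meaning of (": it groups without capturing.
alt = re.findall(r"gr(?:a|e)y", "gray grey griy")
assert alt == ["gray", "grey"]

# ? * + and {min,max} say how often the preceding element may repeat.
assert re.findall(r"ab?c", "ac abc abbc") == ["ac", "abc"]
assert re.findall(r"ab*c", "ac abc abbc") == ["ac", "abc", "abbc"]
assert re.findall(r"ab+c", "ac abc abbc") == ["abc", "abbc"]
assert re.findall(r"ab{2,3}c", "abc abbc abbbc abbbbc") == ["abbc", "abbbc"]

# A ? placed after a quantifier minimizes it (matches as little as possible).
greedy = re.findall(r"<.+>", "<a><b>")
lazy = re.findall(r"<.+?>", "<a><b>")
assert greedy == ["<a><b>"]
assert lazy == ["<a>", "<b>"]
```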

 

Note: The part of pattern enclosed by square brackets is called character class

 

List of meta-characters in a "character class":

 

\        general escape character

^       negate the class, but only if the first character

-        indicates character range

]        terminates the character class

 

Please note that to match any of these meta-characters literally, you must escape it with a backslash to suppress its special meaning.
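The character class rules and the escaping rule can be checked with a short Python sketch. The sample strings are arbitrary; re.escape is a helper from Python's standard library that adds the backslashes for you.

```python
import re

# A character class matches exactly one character from the set.
vowels = re.findall(r"[aeiou]", "regex")
assert vowels == ["e", "e"]

# ^ as the first character negates the class.
non_vowels = re.findall(r"[^aeiou]", "regex")
assert non_vowels == ["r", "g", "x"]

# - between two characters denotes a range; escaped, it is a literal hyphen.
in_range = re.findall(r"[a-c]", "abc-")
literal_hyphen = re.findall(r"[a\-c]", "abc-")
assert in_range == ["a", "b", "c"]
assert literal_hyphen == ["a", "c", "-"]

# Outside a class, an unescaped + is a quantifier; escaped, it is a plain '+'.
unescaped = re.findall(r"1+1", "1+1=2 and 11")
escaped = re.findall(r"1\+1", "1+1=2 and 11")
assert unescaped == ["11"]
assert escaped == ["1+1"]

# re.escape builds the escaped form of a literal string.
assert re.escape("a.b") == r"a\.b"
```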

 

Backslash is further used to specify generic character types and non-printing characters:

 

\d       any decimal digit

\D       any character not covered by \d

\s       any whitespace character

\S       any character not covered by \s

\w      any "word" character

\W      any character not covered by \w

\r       carriage return

\n       new line

\t       tab character
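A quick Python sketch of these backslash sequences on a made-up sample string. One caveat that is specific to Python 3: with ordinary (str) patterns, \d, \w and \s also match non-ASCII digits, letters and whitespace unless the re.ASCII flag is passed.

```python
import re

sample = "Room 42,\tfloor 7\n"

digits = re.findall(r"\d", sample)         # single decimal digits
numbers = re.findall(r"\d+", sample)       # runs of digits
words = re.findall(r"\w+", sample)         # runs of "word" characters
spaces = re.findall(r"\s", sample)         # space, tab and newline all count
non_words = re.findall(r"\W", sample)      # everything \w does not cover
digits_only = re.sub(r"\D", "", sample)    # delete everything \d does not cover

print(digits)       # ['4', '2', '7']
print(numbers)      # ['42', '7']
print(words)        # ['Room', '42', 'floor', '7']
print(digits_only)  # 427
```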

 

Backslash is also used to specify simple assertions:

 

\b       word boundary

\B       not a word boundary
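The two assertions match a position rather than a character. A small Python sketch, using an invented word list, shows the difference:

```python
import re

text = "cat concat catalog bobcat"

# \b on both sides: 'cat' only as a whole word.
whole_word = re.findall(r"\bcat\b", text)

# \b in front only: words that start with 'cat'.
starts_with = re.findall(r"\bcat\w*", text)

# \B in front: 'cat' must NOT start the word, so it is a word ending here.
ends_with = re.findall(r"\w*\Bcat\b", text)

print(whole_word)   # ['cat']
print(starts_with)  # ['cat', 'catalog']
print(ends_with)    # ['concat', 'bobcat']
```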

 

We have now learned enough concepts about regular expressions, so let us start writing our own patterns.

 

Suppose my string is:

 

This is a test string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and ITM_4321 which represents some useful code of very interesting items. These item codes are also just for testing and all such codes start with ITM_ and then followed by some numeric digits.

 

The simplest example is finding exact literal matches.

Example 1:

Objective:    Find every occurrence of ‘test’ in the given string.

Pattern:       test

 

The matches found are (shown in square brackets):

This is a [test] string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and ITM_4321 which represents some useful code of very interesting items. These item codes are also just for [test]ing and all such codes start with ITM_ and then followed by some numeric digits.

 

Now we apply some meta-characters to extract only ‘test’ and not ‘testing’.

Example 2:

Objective:    Find only ‘test’, not ‘testing’, in the given string.

Pattern:       \stest\s

 

The match found is (shown in square brackets; note that it includes the surrounding spaces matched by \s):

This is a[ test ]string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and ITM_4321 which represents some useful code of very interesting items. These item codes are also just for testing and all such codes start with ITM_ and then followed by some numeric digits.
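Examples 1 and 2 can be reproduced with a few lines of Python. The third pattern, \btest\b, is not part of the original examples; it is added here as an assumption-free alternative that uses the word boundary assertion described earlier, so the match does not include the spaces.

```python
import re

text = ("This is a test string in which we will find the matching patterns. "
        "This string also contains few numeric text like ITM_2345, and ITM_4321 "
        "which represents some useful code of very interesting items. "
        "These item codes are also just for testing and all such codes start "
        "with ITM_ and then followed by some numeric digits.")

# Example 1: the literal pattern also matches inside 'testing'.
example1 = re.findall(r"test", text)

# Example 2: \s on both sides requires whitespace, and that whitespace
# becomes part of the match.
example2 = re.findall(r"\stest\s", text)

# Alternative: \b asserts a word boundary without consuming a character.
word_only = re.findall(r"\btest\b", text)

print(example1)   # ['test', 'test']
print(example2)   # [' test ']
print(word_only)  # ['test']
```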

 

Going a step further, we will now write a pattern to extract all item codes from this string.

Example 3:

Objective:    Find all item codes in the given string.

Pattern:       \sITM_\d*\s

 

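The matches found are (shown in square brackets, including the surrounding spaces matched by \s):

This is a test string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and[ ITM_4321 ]which represents some useful code of very interesting items. These item codes are also just for testing and all such codes start with[ ITM_ ]and then followed by some numeric digits.

Note that the bare prefix ITM_ is matched, because \d* also accepts zero digits, and that ITM_2345 is not matched, because it is followed by a comma rather than by whitespace.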

 

We now improve the pattern so that only valid item codes (the prefix followed by at least one digit) are extracted.

Example 4:

Objective:    Find all valid item codes in the given string.

Pattern:       \sITM_\d+\s

 

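The match found is (shown in square brackets, including the surrounding spaces matched by \s):

This is a test string in which we will find the matching patterns. This string also contains few numeric text like ITM_2345, and[ ITM_4321 ]which represents some useful code of very interesting items. These item codes are also just for testing and all such codes start with ITM_ and then followed by some numeric digits.

The bare prefix ITM_ is no longer matched, because \d+ requires at least one digit. ITM_2345 is still missed because of the comma that follows it; replacing \s with the word boundary \b, as in \bITM_\d+\b, matches both codes.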
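Examples 3 and 4 can be checked the same way in Python. The last pattern, \bITM_\d+\b, is a suggested variation rather than one of the original examples: it swaps \s for the word boundary \b so that a code followed by punctuation is also found.

```python
import re

text = ("This is a test string in which we will find the matching patterns. "
        "This string also contains few numeric text like ITM_2345, and ITM_4321 "
        "which represents some useful code of very interesting items. "
        "These item codes are also just for testing and all such codes start "
        "with ITM_ and then followed by some numeric digits.")

# Example 3: \d* allows zero digits, so the bare prefix 'ITM_' matches too.
# 'ITM_2345,' is skipped because a comma, not whitespace, follows the digits.
example3 = re.findall(r"\sITM_\d*\s", text)

# Example 4: \d+ requires at least one digit.
example4 = re.findall(r"\sITM_\d+\s", text)

# Variation: word boundaries instead of whitespace.
bounded = re.findall(r"\bITM_\d+\b", text)

print(example3)  # [' ITM_4321 ', ' ITM_ ']
print(example4)  # [' ITM_4321 ']
print(bounded)   # ['ITM_2345', 'ITM_4321']
```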

 

OK. We now understand enough about regular expressions to tackle real-world situations. Let us write sample patterns to match some useful information, such as a ZIP code or a phone number.

 

1    US ZIP Code         ^\d{5}(-\d{4})?$

2    US Phone Number     \(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}

3    HTML Tag            </?\w+\s*[^>]*>

4    Email Address       [A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}

 

Note: The patterns provided here are only samples and will not correctly match every possible valid value.

 

I will try to explain these sample patterns:

 

US ZIP Code:

 

^       Beginning of line.

\d       Any numeric character.

{5}     5 occurrences. Only a single number is given, so the preceding element must occur exactly that many times.

(        Start sub-pattern.

-        Match a ‘-’ character.

\d       Any numeric character.

{4}     4 occurrences.

)        End sub-pattern

?        0 or 1 occurrences of sub-pattern

$        End of line

 

Example strings:

12345

12345-1234
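A minimal Python sketch of the ZIP code pattern used as a validator. The function name is_zip is hypothetical, and the rejected inputs are made up to show what the anchors and quantifiers rule out.

```python
import re

ZIP = re.compile(r"^\d{5}(-\d{4})?$")

def is_zip(value):
    """Return True if value looks like a 5-digit or ZIP+4 code."""
    # Python detail: $ also matches just before a trailing newline;
    # ZIP.fullmatch(value) would be the stricter choice.
    return ZIP.search(value) is not None

print(is_zip("12345"))       # True
print(is_zip("12345-1234"))  # True
print(is_zip("1234"))        # False: only 4 digits
print(is_zip("12345-12"))    # False: the optional part needs exactly 4 digits
print(is_zip("123456"))      # False: $ rejects the extra digit
```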

 

US Phone Number:

 

\(?      Match ‘(‘ 0 or 1 time (making it optional)

\d{3}  3 Numeric Characters

\)?      Match ‘)‘ 0 or 1 time (making it optional)

[-\s.]? Match ‘-‘ or space or ‘.’ 0 or 1 time (making it optional), character class is used to provide a set of possible characters.

\d{3}  3 Numeric Characters

[-.]     Match 1 occurrence of ‘-‘ or ‘.’

\d{4}  4 Numeric Characters

 

Example strings:

(123)123-1234

123-456-7890

123.456.7890

123 456-7890

123456-7890
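The example strings above can be run against the phone number pattern in Python. The two extra inputs at the end are invented to illustrate the limits of a sample pattern like this one.

```python
import re

PHONE = re.compile(r"\(?\d{3}\)?[-\s.]?\d{3}[-.]\d{4}")

samples = [
    "(123)123-1234",
    "123-456-7890",
    "123.456.7890",
    "123 456-7890",
    "123456-7890",
]

# fullmatch requires the whole string to fit the pattern.
results = [PHONE.fullmatch(s) is not None for s in samples]
print(results)  # [True, True, True, True, True]

# The last separator must be '-' or '.', so a space there is rejected.
space_before_last = PHONE.fullmatch("123 456 7890") is not None
print(space_before_last)  # False

# The two parentheses are optional independently, so an unbalanced one slips through.
unbalanced = PHONE.fullmatch("(123-456-7890") is not None
print(unbalanced)  # True
```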

 

HTML Tag:

 

<        Match 1 occurrence of ‘<’

/?       Match 0 or 1 occurrence of ‘/’

\w+    Word characters 1 or more occurrences

\s*     Space characters 0 or more occurrences

[^>]* Matches any character other than ‘>’ 0 or more occurrences, ^ negates the character class.

>        Match 1 occurrence of ‘>’

 

Example strings:

<B>

</B>

<img src="abc.jpg">

<input type='text' value='Test Value' >

</script>
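Putting the pieces of the breakdown above together gives the pattern </?\w+\s*[^>]*>. The Python sketch below applies it to a made-up HTML fragment. It is only a sample: a regular expression is not a real HTML parser, as the last case shows.

```python
import re

# Assembled from the element-by-element breakdown above.
TAG = re.compile(r"</?\w+\s*[^>]*>")

html = '<p>Hello <b>world</b></p><img src="abc.jpg">'
tags = TAG.findall(html)
print(tags)  # ['<p>', '<b>', '</b>', '</p>', '<img src="abc.jpg">']

# Limitation: a '>' inside an attribute value ends the match too early.
truncated = TAG.findall('<a title="a>b">')
print(truncated)  # ['<a title="a>']
```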

 

Email Address:

 

[A-Za-z0-9._%-]+    Match 1 or more characters from this character class. ‘-’ between two characters specifies a range: ‘A-Z’ means any character from A to Z, inclusive. The final ‘-’ is a literal hyphen because it is the last character in the class. There is no need to escape ‘.’ because inside a character class it has no meta-character meaning.

@                          Match ‘@’ character

[A-Za-z0-9._%-]+    Match any characters from this character class, 1 or more occurrence.

\.                           Match ‘.’ Backslash is used to escape the meta character meaning.

[A-Za-z]{2,4}          Any letter, upper case or lower case, occurring 2 to 4 times.

 

Example strings:

abc@abc.com

abc@yahoo.co.in

abc.def_007@test.ca
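A short Python check of the email pattern against the example strings. The two rejected addresses are invented to show where this sample pattern is stricter than real-world addresses.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%-]+@[A-Za-z0-9._%-]+\.[A-Za-z]{2,4}")

samples = ["abc@abc.com", "abc@yahoo.co.in", "abc.def_007@test.ca"]
results = [EMAIL.fullmatch(s) is not None for s in samples]
print(results)  # [True, True, True]

# The final part allows only 2 to 4 letters, so longer endings are rejected.
long_ending = EMAIL.fullmatch("abc@example.museum") is not None
print(long_ending)  # False

# '+' is not in the character class, so this address is rejected as well.
plus_sign = EMAIL.fullmatch("a+b@abc.com") is not None
print(plus_sign)  # False
```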

 

For more on Regular Expressions >> Regular Expression in C#

 

Posted on Thursday, August 4, 2005 5:27 AM


Comments on this post: Basics of Regular Expression

# re: Basics of Regular Expression
Thanks a ton - am a rookie, and this has helped me a lot.
Left by A. Muni on May 16, 2006 6:19 AM

# re: Basics of Regular Expression
thanks alot ,i have question if i want the tested string NOT part of it to match exactly the regular expression one time
Left by doaa on Aug 19, 2006 6:02 AM

# re: Basics of Regular Expression
helpful site :)
Thanks
Left by Yashvebdra Singh on May 11, 2008 4:58 AM

# re: Basics of Regular Expression
Very Handy. Thanks a lot.
Left by Lakshmi on Dec 20, 2008 6:24 AM

# re: Basics of Regular Expression
thanks!!! this turned out to be of great help.
Left by M.S.S on Aug 04, 2009 5:18 AM

# re: Basics of Regular Expression
Good Read! Concisely and clearly explained!
Left by Titan on Jan 05, 2014 3:33 PM



Copyright © Rahul Anand | Powered by: GeeksWithBlogs.net