
Regular expression

Regular expression (regex) introduction.

Regular Expression Editor with links to other editors

A pattern like [^abc] matches any single character that is not in the set abc; inside the brackets, a leading  ^  means not.

Conversely, [abc] matches only a or b or c.

So [ ]  is a one-character window: it always matches exactly one character.

The set can be listed explicitly, or defined by ranges and character classes.

[a-c] or [a-h]
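As a minimal sketch of the three forms above, using Python's standard `re` module (the sample string and variable names are only illustrative):

```python
import re

text = "abcdefgh"

# [abc] matches exactly one character that is a, b or c.
in_set = re.findall(r"[abc]", text)

# [^abc] matches exactly one character that is NOT a, b or c.
not_in_set = re.findall(r"[^abc]", text)

# [a-c] is a range and is equivalent to [abc].
in_range = re.findall(r"[a-c]", text)

print(in_set)      # ['a', 'b', 'c']
print(not_in_set)  # ['d', 'e', 'f', 'g', 'h']
print(in_range)    # ['a', 'b', 'c']
```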

Quantifiers control how many times something may occur:  ?, *, +, {n}, {n,m}.  Each one applies to the preceding subpattern (atom).

?   zero or one occurrence of the preceding atom

*   zero or more occurrences of the preceding atom

+   one or more occurrences of the preceding atom

{n}   exactly n occurrences

{n,m}   from n to m occurrences

To check a five-digit postal code it is possible to use something like  [0-9]{5}

There are some predefined character classes (aliases) in regular expressions.

So a postal code can be represented by \d{5}
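The quantifiers and the postal code check can be tried out with a short Python sketch. The helper name `is_postal_code` and the sample words are assumptions made for illustration only; `re.fullmatch` is used so that the whole string has to match.

```python
import re

def is_postal_code(value: str) -> bool:
    # The whole string must consist of exactly five digits.
    return re.fullmatch(r"\d{5}", value) is not None

# Each quantifier applies only to the atom right before it.
optional = re.fullmatch(r"colou?r", "color") is not None  # ? : the 'u' may be absent
star = re.fullmatch(r"ab*", "a") is not None              # * : zero 'b' is fine
plus = re.fullmatch(r"ab+", "a") is not None              # + : at least one 'b' is needed
bounded = re.fullmatch(r"a{2,3}", "aaaa") is not None     # {2,3} : four 'a' is too many

print(is_postal_code("20100"), is_postal_code("2010"))  # True False
print(optional, star, plus, bounded)                    # True True False False
```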

.    the dot matches any character except the newline by default (the newline sequence can differ between platforms).  Like [^\n]

\d   (d)igit, equivalent to [0-9]
\D   non-digit, equivalent to [^0-9]  ( that is  [^\d] )

\w   represents “word characters” (letters, digits and underscore): [a-zA-Z0-9_]
\W   represents “non-word characters”   ( that is  [^\w] )

\s    whitespace characters such as space, tab, newline and carriage return: [ \t\r\n]
\S    non-whitespace characters    ( that is  [^\s] )

.*    matches any sequence of characters ( the empty string included )
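A small Python sketch of these shorthand classes follows. The sample string is made up for the purpose; note that the final `.*` stops before the newline, as described for the dot.

```python
import re

sample = "id_42 = 3.14\tok\n"

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits and underscore
spaces = re.findall(r"\s", sample)         # single whitespace characters
dot_run = re.match(r".*", sample).group()  # everything up to, not including, the newline

print(digits)         # ['42', '3', '14']
print(words)          # ['id_42', '3', '14', 'ok']
print(spaces)         # [' ', ' ', '\t', '\n']
print(repr(dot_run))  # 'id_42 = 3.14\tok'
```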

Escaping

\      escapes a non-alphanumeric character so that it is taken literally, for example  . ,  /  or  ?
so  \., \/,  \?  and likewise  \+,  \*
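To see the difference an escape makes, here is a minimal Python sketch; the test strings are invented for illustration.

```python
import re

# An unescaped dot matches any character, so "a.c" also accepts "abc".
loose = re.fullmatch(r"a.c", "abc") is not None

# An escaped dot matches only a literal ".".
strict_miss = re.fullmatch(r"a\.c", "abc") is not None
strict_hit = re.fullmatch(r"a\.c", "a.c") is not None

# \. and \? together: a decimal number followed by a literal question mark.
price = re.search(r"\d+\.\d+\?", "is it 9.99? yes").group()

print(loose, strict_miss, strict_hit)  # True False True
print(price)                           # 9.99?
```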

Grouping, subpattern

To check a web domain such as   hpc2.eurotech.com

(“The characters allowed in a label are a subset of the ASCII character set, and includes the characters a through z, A through Z, digits 0 through 9, and the hyphen. This rule is known as the LDH rule (letters, digits, hyphen). Domain names are interpreted in case-independent manner. Labels may not start or end with a hyphen”,  from Wikipedia)

So it is possible to use

^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$

or

^([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}$

with fewer characters,
but  \w  includes _ , which is not permitted in a label

or

^(\w[-.\w]*\w\.\w{2,9})$

with even fewer characters  (but again  \w  includes _ , which is not permitted)
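The trade-off between the strict and the shorter patterns can be checked with a quick Python sketch. It uses the length-bounded form of the first pattern (the one listed under "Domain Name Regexp") and the second, `\w`-based pattern; the host names other than hpc2.eurotech.com are made up for the test.

```python
import re

# Strict LDH form: letters, digits and hyphen only, no leading or trailing hyphen.
STRICT = re.compile(r"^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$")

# Shorter form: \w also lets the underscore through.
LOOSE = re.compile(r"^([0-9a-zA-Z][-\w]*[0-9a-zA-Z]\.)+[a-zA-Z]{2,6}$")

results = {
    name: (bool(STRICT.match(name)), bool(LOOSE.match(name)))
    for name in ["hpc2.eurotech.com", "my_host.example.com", "-bad.example.com"]
}

for name, (strict_ok, loose_ok) in results.items():
    print(f"{name:22} strict={strict_ok} loose={loose_ok}")
```

Only the name with an underscore is treated differently: the strict pattern rejects it and the shorter one accepts it.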

( )  groups part of the pattern and creates a subpattern

(.*)       matches any sequence of characters ( the empty string included )

([^/]+)   any non-empty sequence of characters that contains no /

^  means NOT only when it is the first character inside [ ] brackets. Outside brackets it is an anchor that matches the beginning of the string (or line).

$   is an anchor that matches the end of the string (or line)

so  ^m   matches a string that begins with m,    m$   a string that ends with m,    m   an m anywhere

\b is the word boundary,   e.g.   \b(max|min)\b
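A short Python sketch of the anchors and of the word boundary; the sample words are chosen only to show where each pattern does and does not match.

```python
import re

starts = re.search(r"^m", "max") is not None      # the string begins with m
ends = re.search(r"m$", "maximum") is not None    # the string ends with m
anywhere = re.search(r"m", "lamp") is not None    # an m somewhere inside
not_start = re.search(r"^m", "lamp") is not None  # m is present, but not at the start

# \b keeps max and min from matching inside longer words.
whole_words = re.findall(r"\b(max|min)\b", "max minimum min maxed")

print(starts, ends, anywhere, not_start)  # True True True False
print(whole_words)                        # ['max', 'min']
```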

(?!...)      when ?! appears as the first two characters within parens, it denotes what is known as a “non-capturing negative lookahead”: the match succeeds only if the enclosed pattern does not match at that position.

(?i)    ignore case

or  operator 

^(\w[-.\w]*\w\.(com|info|biz))

So domain names are restricted to com, info or biz

|   works like the logical operator OR
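As a sketch of the alternation in Python: the version below adds two things that are not in the pattern above, a leading `(?i)` to ignore case and a trailing `$` so that a name such as example.community is not accepted on the strength of its first three letters. The function name `allowed_tld` is hypothetical.

```python
import re

# (?i) and the final $ are additions made for this example.
TLD = re.compile(r"(?i)^(\w[-.\w]*\w\.(com|info|biz))$")

def allowed_tld(host):
    """Return the matched top-level domain in lower case, or None."""
    m = TLD.match(host)
    return m.group(2).lower() if m else None

print(allowed_tld("eurotech.com"))       # com
print(allowed_tld("Example.INFO"))       # info
print(allowed_tld("example.org"))        # None
print(allowed_tld("example.community"))  # None
```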

Attention to greediness

* is a greedy operator: it matches as much as it can. So to match only a single tag we have to use the lazy form

<.*?>   not   <.*>
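The effect is easy to see in a small Python sketch (the HTML fragment is invented for the example):

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last > in the string, so there is one long match.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first > it can, so each tag is matched on its own.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```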

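\Q .. \E   quotes a substring, so that everything between them is taken literally (supported by Perl, PCRE and Java, among others)

\Qabc$xyz\E   matches   abc$xyz
\Qabc\$xyz\E   matches   abc\$xyz
\Qabc\E\$\Qxyz\E   matches   abc$xyz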
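Python's `re` module does not understand \Q .. \E, so the sketch below swaps in `re.escape`, which is the closest equivalent there: it returns the string with the special characters escaped.

```python
import re

literal = "abc$xyz"

# re.escape plays the role of \Q..\E: the $ is escaped for us.
pattern = re.escape(literal)
matched = re.fullmatch(pattern, "abc$xyz") is not None

# Without quoting, $ is an end-of-string anchor and the match fails.
unescaped = re.search("abc$xyz", "abc$xyz") is not None

print(pattern)             # abc\$xyz
print(matched, unescaped)  # True False
```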

Useful summary and examples applied to urls

^   beginning-of-line (except inside a bracketed set, where a leading ^ means not)

$   end-of-line

(https?)     ( ) is a capturing group.   It matches http or https:   the ?  means zero or one  s

[^/]+     one or more characters, none of which are slash.

[^\.]+     one or more characters, none of which are dot

[^/\.]+\.jpg   matches one or more characters, none of which are dot or slash, followed by .jpg

^/products/?$      matches  /products or /products/  – with or without a trailing slash

\.html?    matches either .htm or .html

^/(?!index\.aspx)(.*)$    any URL path that does not begin with /index.aspx

^/([^/]+)/([^/]*)(?<!\.aspx)$       The (?<!\.aspx) that ends the pattern is a non-capturing negative look-behind subpattern, which says “the string must not end in .aspx“.
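Both lookaround patterns can be exercised with a brief Python sketch. The paths are made-up examples and `first_group` is a hypothetical helper.

```python
import re

NOT_INDEX = re.compile(r"^/(?!index\.aspx)(.*)$")        # negative lookahead
NOT_ASPX = re.compile(r"^/([^/]+)/([^/]*)(?<!\.aspx)$")  # negative look-behind

def first_group(pattern, path):
    """Return capture group 1, or None when the path is rejected."""
    m = pattern.match(path)
    return m.group(1) if m else None

print(first_group(NOT_INDEX, "/index.aspx"))        # None
print(first_group(NOT_INDEX, "/products/list"))     # products/list
print(NOT_ASPX.match("/shop/item.aspx"))            # None
print(NOT_ASPX.match("/shop/item.html").groups())   # ('shop', 'item.html')
```

Neither lookaround adds a capture group of its own, which is why the second pattern still reports exactly two groups.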

((?:en|fi)[0-9]{2})   The sequence ?: – when it appears as the first two characters within parens – makes the group a non-capturing group.
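A quick Python comparison shows what non-capturing means in practice. The second pattern, with a plain inner group, is not from the text above and is there only for contrast; "fi42" is an invented input.

```python
import re

# The outer group captures; the inner (?:en|fi) only groups the alternation.
non_capturing = re.match(r"((?:en|fi)[0-9]{2})", "fi42").groups()

# Same pattern with an ordinary inner group: it shows up as group 2.
capturing = re.match(r"((en|fi)[0-9]{2})", "fi42").groups()

print(non_capturing)  # ('fi42',)
print(capturing)      # ('fi42', 'fi')
```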

^(https?://www\.siagri\.net/)     This pattern can be used in a RewriteCond applied against HTTP_REFERER to prevent image leeching (hotlinking).

^(?!https?://www\.siagri\.net/)  This matches the opposite of the prior example: anything that begins with neither http://www.siagri.net/ nor https://www.siagri.net/ . (Editor's note, IIRF: this pattern can be used in a RewriteCond applied to the HTTP_HOST to rewrite if the hostname is NOT matched by the pattern.)

^(?!www\.)([^\.]+)\.siagri\.net    matches any hostname in the siagri.net domain, except for ‘www.siagri.net’.
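A minimal Python sketch of the hostname rule. It writes the lookahead with an escaped dot, (?!www\.), so that only the label www is excluded; with an unescaped dot a host such as wwwx.siagri.net would be rejected as well. The host names other than www.siagri.net are invented, and `subdomain` is a hypothetical helper.

```python
import re

HOST = re.compile(r"^(?!www\.)([^\.]+)\.siagri\.net")

def subdomain(host):
    """Return the first label, or None for www or a foreign domain."""
    m = HOST.match(host)
    return m.group(1) if m else None

print(subdomain("blog.siagri.net"))  # blog
print(subdomain("www.siagri.net"))   # None
print(subdomain("wwwx.siagri.net"))  # wwwx
print(subdomain("blog.example.com")) # None
```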

(https?)://([^/]+)(/([^\?]+(\?(.*))?)?)?      https://eurotech.com/      three groups take part in the match    ->    1 https ;  2 eurotech.com;  3 /

(https?)://([^/]+)(/([^\?]+(\?(.*))?)?)?      http://eurotech.com/a/b/c.aspx?p1=foo      six groups    ->    1 http ;  2 eurotech.com;   3 /a/b/c.aspx?p1=foo;  4 a/b/c.aspx?p1=foo;    5 ?p1=foo;  6 p1=foo;
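The group numbering can be verified with a few lines of Python. Groups that do not take part in a match are reported as None.

```python
import re

URL = re.compile(r"(https?)://([^/]+)(/([^\?]+(\?(.*))?)?)?")

short = URL.match("https://eurotech.com/").groups()
full = URL.match("http://eurotech.com/a/b/c.aspx?p1=foo").groups()

print(short)
# ('https', 'eurotech.com', '/', None, None, None)
print(full)
# ('http', 'eurotech.com', '/a/b/c.aspx?p1=foo', 'a/b/c.aspx?p1=foo', '?p1=foo', 'p1=foo')
```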

Examples of regular expressions, from “Example of Regular Expression”

IP Address Regexp

\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
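A brief Python sketch of this pattern at work, pulling addresses out of a line of text (the sample line is invented). Since every group is non-capturing, `findall` returns the whole matches.

```python
import re

IPV4 = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
)

# 256 is out of range for an octet, so the middle address is skipped.
found = IPV4.findall("gateway 192.168.1.1, bogus 256.1.1.1, dns 8.8.8.8")

print(found)  # ['192.168.1.1', '8.8.8.8']
```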

MAC Address Regexp

^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$
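A minimal Python check of the MAC pattern; the addresses are made up and `is_mac` is a hypothetical helper. Note that the pattern accepts only the colon-separated notation.

```python
import re

MAC = re.compile(r"^([0-9a-fA-F][0-9a-fA-F]:){5}([0-9a-fA-F][0-9a-fA-F])$")

def is_mac(value):
    return MAC.match(value) is not None

print(is_mac("00:1A:2b:3C:4d:5E"))  # True
print(is_mac("00:1A:2b:3C:4d"))     # False, only five pairs
print(is_mac("00-1A-2b-3C-4d-5E"))  # False, hyphens are not accepted
print(is_mac("00:1A:2b:3C:4d:5G"))  # False, G is not a hex digit
```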

Domain Name Regexp

^([a-zA-Z0-9]([a-zA-Z0-9\-]{0,61}[a-zA-Z0-9])?\.)+[a-zA-Z]{2,6}$

Windows File Name Regexp

(?i)^(?!^(PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d|\..*)(\..+)?$)[^\\\./:\*\?\"<>\|][^\\/:\*\?\"<>\|]{0,254}$

Float Number Regexp

[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+\b)(?:[eE][-+]?[0-9]+\b)?

Roman Number Regexp

^(?i:(?=[MDCLXVI])((M{0,3})((C[DM])|(D?C{0,3}))?((X[LC])|(L?XX{0,2})|L)?((I[VX])|(V?(II{0,2}))|V)?))$

Date in format yyyy-MM-dd

(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])
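The backreference \2 in the date pattern forces the second separator to be the same as the first. A short Python sketch, with invented dates and a hypothetical helper `looks_like_date`, also shows a limit of the approach: the pattern checks the shape of the date, not the calendar.

```python
import re

DATE = re.compile(r"(19|20)\d\d([- /.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])")

def looks_like_date(value):
    return DATE.fullmatch(value) is not None

print(looks_like_date("2024-02-29"))  # True
print(looks_like_date("2024-02/29"))  # False, the separators differ
print(looks_like_date("2024-13-01"))  # False, there is no month 13
print(looks_like_date("2024-02-31"))  # True, although February has no day 31
```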

