Regex Revelry | Critical Start
Prepare your business with On-Demand Breach Response
Webinar Series | Once More unto the Breach | Lessons Learned from Billion Dollar Breaches

Regex Revelry


Regular Expressions (Regex) are used to identify strings that defy simple search terms, which infosec and technology professionals use for things like input validation, searching and scripting. Unfortunately, the syntax can be intimidating and the learning curve steep for beginners. Throw in a handful of different flavors and the confusion grows. While it can be a lot of information upfront, once you’ve spent some time on it, the syntax becomes natural and you may even start wishing all search bars allowed regex! This article offers a short guide for anyone interested in learning the syntax or adding a few new tools to their regex belt.

Where can I use Regex?

  • Command line/grep
  • Scripting
  • Burp
  • Text editors designed to handle code
    • Notepad++, Sublime, etc.
  • Input validation

How?

  • Practice
  • Really practice
  • Attempt things multiple times
  • Come up with ways to work it into your routine
  • Time to practice
  • It’s really just practice
  • Create code which uses regex
  • Enter into a world of wonder and practice

A note about exercises:

The exercises included in this guide use publicly available web pages which are subject to change and may cause a different result from the example presented here. There are also many different right answers to most regex problems; the ones provided here correspond to the material covered in a given section to illustrate how a regex could be formed using the techniques presented.

 

A Quick Pit Stop to Refresh Our Shortcut Knowledge

1 – The Basics of Find and hReplace:

  • Almost everything, such as browsers, Excel, and notes, uses this keyboard shortcut that opens a search box to find the text: CTRL-F
    • Bonus: double-click
      • Sometimes you can double click on a word or chunk of text and it will highlight all instances thereof and/or auto-populate the ‘find’ field if followed by a find or find and replace shortcut. Example from Notepad++:

 

    • Find and Replace: CTRL-H
      • This works on most editors, including Word.

Notepad++ Options

  1. CTRL-F
  2. CTRL-H
  3. Permanently highlight all instances of the item in the find field.
  4. Find field
  5. If you highlight a chunk of text followed by CTRL-F/H, you can then search within only that chunk as opposed to the whole file.
  6. Count how many instances there are of the given search term
  7. Different modes for different search types
    1. Normal – just like your standard browser or Word editor; it searches for the exact string you provided.
    2. Extended – standard + you can use certain special characters such as newline, carriage return, tab, etc.
    3. Regular expression – regex
      1. ‘.’ matches newline – use with caution.

 

Try it at Home – Regexercise 1: Simple Searching

Find all instances of the word “user” on this page: https://www.criticalstart.com/cisco-umbrella-enterprise-roaming-client-and-enterprise-roaming-module-privilege-escalation-vulnerability/

Answer: 20 at the time of publication

 

Note the differences between instances. How would you look for only “user” and not “users”? What if you wanted only the cases where it’s the last word in a sentence? We’ll look at problems like this throughout this post.

Example:

  • user[space]
  • user[period]

 

Back to Regex
2 – Dates: Letters, Numbers, Groups, Logical OR and NOT, Character Sets, and Quantifiers

2.1, 2.2 – Letters and Numbers

\w – “word” characters (letters), e.g. a-z and A-Z. \w represents one single character.

\d – “digit” characters (numbers), e.g. 0-9. \d is a single character and not understood as an actual number. For example, 3 can be represented by \d, but 10 cannot; instead, 10 would be represented as \d\d.

 

Try it at Home – Regexercise 2.1: Find all dates of a given format

  • Copy the text from this Wikipedia article into a text editor which allows regex searching.
  • Find every instance of a date in this format: Tue, 02 Apr 2019 10:04:53 GMT

Answer: 8

Example Regex:

\w\w\w, \d\d \w\w\w \d\d\d\d \d\d:\d\d:\d\d \w\w\w

 

Try it at Home – Regexercise 2.2: Find all Expires headers

  • Example
    • Expires: Tue, 02 Apr 2019 10:04:53 GMT
  • Count and compare to total number of dates

Answer: 1

Example Regex:

Expires: \w\w\w, \d\d \w\w\w \d\d\d\d \d\d:\d\d:\d\d \w\w\w

 

2.3 – Groups and ‘OR’

Groups “()” retain order and precise subsections, as opposed to sets.

| = ‘OR’ = logical XOR

  • Example of a normal group which will look for either “normal group” or “regular group”:
    • (normal|regular) group
  • Groups are ‘capturing’ by default, e.g. the text in the parentheses is intended to be part of the final string, while a non-capturing group (shown below) is excluded from the final string.
    • (?:non-capturing group)
    • Example:
      • Input Strings:
        • “The final result will contain the 1st group, but not the second.”
        • “The final result will contain the first group, but not the second.”
        • “The final result will contain the 1st group, but not the 2nd.”
        • “The final result will contain the first group, but not the 2nd.”
        • “The final result will contain the 1st group, but not the 3rd.”
        • “The final result will contain the first group, but not the third.”
      • Your Regex:
        • “The final result will contain the (first|1st) group, but not the (?:second|2nd).”
      • Valid Results:
        • “The final result will contain the first group, but not the .”
        • “The final result will contain the 1st group, but not the .”
      • None of the sentences with 3rd or third in them are valid because 2nd or second must be present.

 

Try it at Home – Regexercise 2.3: Find all types of date headers except Expires

Example format:

  • Name-of-Header: Tue, 02 Apr 2019 10:04:53 GMT

Answer:

  • Accept-Datetime: Thu, 31 May 2007 20:35:00 GMT
  • Date: Tue, 15 Nov 1994 08:12:31 GMT
  • If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  • If-Unmodified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  • Date: Tue, 15 Nov 1994 08:12:31 GMT
  • Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
  • Retry-After: Fri, 07 Nov 2014 23:59:59 GMT

Example:

(Accept-Datetime\Date\I-Modified-Since\If-Unmodified-Since\Last-Modified\Retry-After): \w\w\w, \d\d \w\w\w \d\d\d\d \d\d:\d\d:\d\d \w\w\w

 

2.4 – Character Sets, “NOT”, and Quantifiers

  • Character set [] examples:
    • [A-Z] – any uppercase character in the alphabet
    • [a-z] – any lowercase character in the alphabet
    • [0-9] – any single digit
      • [0-6] – any single digit between and including 0 and 6
    • [\w\d] – any single character that is a letter or a number (no whitespace or special characters)
      • Equivalent to (\w|\d)
  • ^ – NOT
    • [^abc] – any character (including whitespace, special characters, and numbers) that is not a, b, or c.
    • ^ is used for other commands too, use with caution.
  • Quantifiers {} * + ?
    • Quantifiers follow characters, character sets, and groups to define how many should exist.
      • Example: \w{4} is exactly 4 letters
    • {,100} – any number of characters from 0 through 100
    • {1,} – any number of characters from 1 through infinity
    • {4} – exactly 4 characters
    • {2,6} – any number of characters from 2 through 6, e.g. a{2,6} would be aa, aaa, aaaa, aaaaa, aaaaaa
    • * – any number of characters from 0 through infinity
      • Equivalent to {0,}
    • + – at least one character, but can be as many as infinity
      • Equivalent to {1,}
    • ? – has 2 uses: optional and lazy
      • Optional? – the character preceding the ? either appears once or not at all.
        • Equivalent to {0,1}
        • Optional? matches both Optional and Optiona
      • (lazy)+? – is the quantifier preceding the ? must choose the smallest number of instances possible.
        • Example: (lazy){3,}? matches lazylazylazy, but not lazylazylazylazy

 

Try it at Home – Regexercise 2.4: Find all date headers of this format using quantifiers

Example format:

  • Name-of-Header: Tue, 02 Apr 2019 10:04:53 GMT

Expected Results:

  • Accept-Datetime: Thu, 31 May 2007 20:35:00 GMT
  • Date: Tue, 15 Nov 1994 08:12:31 GMT
  • If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  • If-Unmodified-Since: Sat, 29 Oct 1994 19:43:31 GMT
  • Date: Tue, 15 Nov 1994 08:12:31 GMT
  • Expires: Thu, 01 Dec 1994 16:00:00 GMT
  • Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT
  • Retry-After: Fri, 07 Nov 2014 23:59:59 GMT

Example Regex:

[\w-]+: \w{3}, \d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2} \w{3}

 

3 – Special Characters, Escapes, Exclusions, and Lookarounds

3.1 – Special Characters and Escapes

  • \W, \S, \D
    • Not word (letter), not whitespace (space, tab, etc.), not digit (number)
  • \s, \r, \n, \t, \0
    • Whitespace, carriage return (Windows), newline (Windows, Nix), tab, null character
  • +*?^$\.[]{}()|/
    • Need to escape with \, e.g. \*
  • There are more options for escaping/encoding, such as unicode
    • Only use for more complicated tasks, such as scripting

 

Try it at Home – Regexercise 3.1: Find all full HTTPS URLs

  1. https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
    • Navigate to this page
    • Right-click > View page source (or CTRL-U)
    • Copy and paste the entire contents of the view-source:… page into a text editor which allows regex use.
  2. Find all HTTPS (not HTTP) URLs and their contents, e.g. https://myurl.wikipedia.com/directory?param=value&param2=value2#locallink

Expected Results:

Example Regex:

https://[\.\w]+[\/\w\?\=\-\&\d\.\:\;\#\%\(\)\$]*

 

3.2 – Exclusions and Lookarounds

Get some stuff, but not other stuff

  • def[^abc]ghi – “def” followed by any single character (including whitespace, special characters, and numbers) that is not a, b, or c followed by “ghi”.
    • Match examples:
      • Match – def ghi
      • Match – def2ghi
      • Non-match – defbghi
    • ^ is used for other regexing too; use with caution and only inside square brackets unless you know what you’re doing.
  • def(abc)ghi – capture group (see above).
    • Match and result examples:
      • Match – defabcghi, Result – defabcghi
      • Non-match – anything else
  • def(?:abc)ghi – non-capturing group (see above).
    • Match and result examples:
      • Match – defabcghi, Result – defghi

Lookarounds are non-capturing by default, meaning the items in the group must exist, but they will not be returned in the result.

  • def(?=abc)positive lookahead: requires a string containing “def” followed by “abc”, but only “def” is returned as a result.
  • def(?!abc)negative lookahead: requires a string containing “def” followed by anything BUT “abc”, and only “def” is returned.
    • Match and result examples:
      • Match: defqrs, Result – def
      • Non-match: defabc, Result – no result
      • Match: def%^&*(, Result – def
  • (?<=abc)ghi – positive lookbehind: requires a string containing “ghi” preceded by “abc”, but only “ghi” is returned as a result.
  • (?<!abc)ghi – negative lookbehind: requires a string containing “ghi” preceded by anything BUT “abc”, and only “ghi” is returned.

Note: There are different interpretations of regex out there, especially when you get into this more complicated territory, so if something’s not working, google it and check out the resources below.

 

Try it at Home – Regexercise 3.2: Find only the part of the URL that appears after the domain/IP.

Bonus problem – find this part of the URL for the following:

  • HTTP URLs (with or without the prefix)
  • HTTP OR HTTPS
  • HTTP AND HTTPS
  • URIs in HTML tags

Example Regex:

(?:https://[\d\w\.\-]*)/[^\"\s\<\>]*

 

4 – Grepping

4.1 – Grep and Count

  • grep –G “myregex” myfile.txt
    • grep – command in Linux to search for lines containing a given expression
    • -G – flag saying you’re grepping. There are several other valid flags available as well.
    • “myregex” – a regular expression surrounded by double quotes, e.g. “[\w\d]*\.”
    • txt – the file through which you’re grepping
  • grep –G “myregex” myfile.txt -c
    • -c – count how many matches to myregex there are in myfile

 

Try it at Home – Regexercise 4.1: Find line given port

After performing a legal, safe scan on something you own, practice grepping only the lines which have a given port open. A common example is also provided below.

For the following example, collect all the lines with port 80 open. Bonus: collect all the lines with port 80 open and not port 8080.

 

Example Nmap greppable output:

Host: 10.10.10.3 ()      Status: Up

Host: 10.10.10.3 ()      Ports: 80/open/tcp//http///, 443/open/tcp//https/// Ignored State: filtered (998)

Host: 10.10.11.10 ()    Status: Up

Host: 10.10.11.10 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///, 8080/open/tcp//http///

Host: 10.10.12.65 ()    Status: Up

Host: 10.10.12.65 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///

Host: 10.10.14.46 ()    Status: Up

Host: 10.10.14.46 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.10.10 ()    Status: Up

Host: 10.10.10.10 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https/// Ignored State: filtered (998)

Host: 10.10.17.127 ()  Status: Up

Host: 10.10.17.127 ()  Ports: 21/open/tcp//ftp///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///, 1032/open/tcp//postgresql///

Host: 10.10.16.78 ()    Status: Up

Host: 10.10.16.78 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 443/open/tcp//https///, 8443/open/tcp//http///

Host: 10.10.16.19 ()    Status: Up

Host: 10.10.16.19 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///, 4103/open/tcp//unicall///

Host: 10.10.23.43 ()    Status: Up

Host: 10.10.23.43 ()    Ports: 21/open/tcp//ftp///, 25/open/tcp//smtp///, 53/open/tcp//domain///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///            Ignored State: filtered (989)

Host: 10.10.25.10 ()    Status: Up

Host: 10.10.25.10 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///

Host: 10.10.10.8 ()      Status: Up

Host: 10.10.10.8 ()      Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.24.35 ()    Status: Up

Host: 10.10.24.35 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.119.204 ()            Status: Up

Host: 10.10.119.204 ()            Ports: 22/open/tcp//ssh///, 80/open/tcp//http///, 443/open/tcp//https///     Ignored State: filtered (997)

 

Example Result:

Host: 10.10.10.3 ()      Ports: 80/open/tcp//http///, 443/open/tcp//https/// Ignored State: filtered (998)

Host: 10.10.11.10 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///, 8080/open/tcp//http///

Host: 10.10.12.65 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///

Host: 10.10.14.46 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.10.10 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https/// Ignored State: filtered (998)

Host: 10.10.17.127 ()  Ports: 21/open/tcp//ftp///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///, 1032/open/tcp//postgresql///

Host: 10.10.16.19 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///, 4103/open/tcp//unicall///

Host: 10.10.23.43 ()    Ports: 21/open/tcp//ftp///, 25/open/tcp//smtp///, 53/open/tcp//domain///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///            Ignored State: filtered (989)

Host: 10.10.25.10 ()    Ports: 21/open/tcp//ftp///, 22/open/tcp//ssh///, 25/open/tcp//smtp///, 80/open/tcp//http///, 10/open/tcp//pop3///, 103/open/tcp//imap///, 443/open/tcp//https///, 465/open/tcp//smtps///, 522/open/tcp//submission///, 993/open/tcp//imaps///, 995/open/tcp//pop3s///, 3306/open/tcp//mysql///

Host: 10.10.10.8 ()      Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.24.35 ()    Ports: 80/open/tcp//http///, 443/open/tcp//https///

Host: 10.10.119.204 ()            Ports: 22/open/tcp//ssh///, 80/open/tcp//http///, 443/open/tcp//https///     Ignored State: filtered (997)

 

Example command:

grep -G 80/o myfile.txt

 

4.2 – Cut and Sort

  • grep –G “myregex” myfile.txt | cut –d ‘:’ –f2 | sort –u
    • | cut -d ‘:’ -f2
      • send the output to the next command |, cut up each matched line using a colon as the -delimiter, and select only the 2nd -f
    • | sort -u
      • send the output to the next command |, sort the output alphabetically, then remove duplicates leaving a list of unique lines -u.

Try it at Home – Regexercise 4.2: Find a unique list of IPs with a given port open.

From your (safe, legal) greppable output or the example above, collect a unique list of IPs with port 80 open.

Expected output:

  • 10.10.10.10
  • 10.10.10.30
  • 10.10.10.8
  • 10.10.11.10
  • 10.10.119.204
  • 10.10.12.65
  • 10.10.14.46
  • 10.10.16.19
  • 10.10. 17.127
  • 10.10.23.43
  • 10.10.24.35
  • 10.10.25.10

Example command:

grep -G 80/o myfile.txt | cut -d ‘ ‘ -f2 | sort -u

 

5 – References and Substitutions

5.1 – Numeric References

  • (abc)def\1
    • Create a group (abc), then refer to the group by the order in which it was created \1.
    • Match: abcdefabc
  • Regex: (abc)def(ghi|\1)jkl\2
    • Matches: abcdefghijklghi, abcdefghijklabc, abcdefabcjklghi, abcdefghijklabc

 

Try it at Home – Regexercise 5.1: HTML script tags

Back to the source HTML of this page: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

Find all instances of <script>Some text here</script>. Bonus: find all instances of <script some text here>some code here</script>

 

Expected Count: 4

Example Regex:

<(script)>[^\<]*</\1>

 

5.2 – Named Groups

  • (?:abc)(?<beans>([\d]ef)(?=ghi))jkl\g<beans>
  • (?:abc)(?P<green>([\d]ef)(?=ghi))jkl(?P=green)

 

Try it at Home – Regexercise 5.2: HTML script tags

Back to the source HTML of this page: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields

Perform Regexercise 5.1, but this time use named groups. Your text editor may use a different regex flavor; identifying that is part of this exercise.

 

6 – Miscellaneous

Python Example

import rereg_head = re.compile(‘\w*(-\w*):\s.’)  # header line in a requestif reg_head.match(line): #…

  • Sed, Awk, Xargs – these things exist
  • Different regex flavors – these too
  • Excel – is beauteous

 

7 – Links

A note about resources

You are responsible for the safety of your information and CRITICALSTART does not control third-party sites. Be aware that sites may store your information and sanitize any input accordingly.

 

New course offered at BlackHat 2020:

To help sharpen the skills of penetration testers and threat hunting teams, TEAMARES will be offering an onsite training course at BlackHat USA 2020 in “Adversary Emulation and Active Defense.”  This course will provide information security concepts utilized in both offense and defense. Attendees will learn skills that can be applied to increase capability from both sides including exploitation, circumventing defenses and lateral movement from an attacker’s perspective. The course will also cover key techniques for detection, threat hunting and mitigation to counter an attacker’s toolbox.

Follow us on Twitter @TeamAresSec and @CRITICALSTART to stay up to date

 

Credit:

By @DoraTexplorer17 | Adversarial Engineer, TEAMARES at CRITICALSTART

Let's Talk

Get in Touch
PREVIOUS RESOURCE
Path 11 Copy 2 Created with Sketch.
NEXT RESOURCE

Related Content

Categories