Day 2: regular & relational expressions, flow-control statements

AWK supports Extended Regular Expressions (ERE), as used by grep -E / egrep(1) & sed -E

unlike sed & grep, AWK does *NOT* have back-reference support
 => see 'Challenge Exercise'

two AWK RegEx forms:  /regex/  &  "regex" (good for dynamic concatenation)
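
a quick sketch of the "regex" string form, assuming a made-up shell
function 'match_proto', variable 'proto' and sample input; the pattern is
concatenated at run time instead of being fixed between slashes:

```shell
# pattern built by concatenation: "^" proto ":" acts like /^gopher:/ below
match_proto() {
    awk -v proto="$1" '$0 ~ ("^" proto ":") { print }'
}

printf '%s\n' 'http://a.example' 'gopher://b.example' 'ftp://c.example' |
    match_proto gopher
```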

# AWK ERE meta-character basics:
#  "." => any char ; "^" => match start of string ; "$" => match end of string
#  "?" => match 0 or one of preceding ; "+" => match 1 or more of preceding
#  "*" => match 0 or more of preceding ; "A-Z", "a-z", "0-9" => ranges in "[]"
#  "[...]" => match any char contained; can be single char or range sets
#  "[^..]" => match any char NOT contained (complement of above)
#  "{n,m}" => match n-m number of preceding ; "{n,}" => matches n or more
# 
# some common POSIX character classes - see re_format(7) for details:
#   [:alpha:], [:alnum:], [:digit:], [:upper:], [:cntrl:], [:space:]
# 
# metachars (usually) lose specialness inside "[]" ; escape w/ "\" otherwise
#   ex.  "[.+?*]" => matches ".", "+", "?" or "*"
#        "[\\]" => matches "\"  ; "[\]]" => matches "]"
# 
#  some special cases:
#     "[[]" => matches "[" (literal anywhere in brackets)
#     "[]]" => matches "]" *IF* first char in brackets
#     "[-]" => matches "-" *IF* first (or last) char in brackets
# 

complement ( "[^c..]" ) won't break 1st char special interpretation
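
a small sketch of bracket-expression behaviour, using made-up sample
strings; the variable names 'set_out' and 'neg_out' are only for illustration:

```shell
# "[.+?*]" is a set of four literal chars, none act as metachars here
set_out=$(printf '%s\n' 'a+b' 'a,b' 'a*b' 'a.b' 'ab' | awk '/[.+?*]/')
echo "$set_out"        # a+b, a*b, a.b

# "]" first and "-" last stay literal, even after the "^"
neg_out=$(printf '%s\n' ']-]' 'a-]' | awk '/[^]-]/')
echo "$neg_out"        # a-]
```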

# examples of typical ERE patterns:
#    "^[sS]erver"  =>  matches str starting w/ "Server" or "server"
#         "addr$"  =>  matches str ending w/ "addr"
#   "(http|goph)"  =>  matches str containing "http" or "goph"
#      "^[a-z]+$"  =>  matches str ONLY containing a-z
#        "[^sdf]"  =>  matches str containing any char other than s, d, or f
#          "t{2}"  =>  matches str containing "tt"
#         "[qv]+"  =>  matches str with one or more "q" or "v"
# 
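
to try a few of these patterns, a sketch against invented sample lines
(the 'sample' function and its data are assumptions, not course material):

```shell
# hypothetical sample data, one string per line
sample() { printf '%s\n' 'Server up' 'server addr' 'gopher link' 'qqq' 'sdf'; }

sample | awk '/^[sS]erver/'   # "Server up", "server addr"
sample | awk '/addr$/'        # "server addr"
sample | awk '/^[a-z]+$/'     # "qqq", "sdf"
sample | awk '/[^sdf]/'       # every line except "sdf"
```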

boolean "&&" and "||" between patterns, along w/ "(|)" inside a regex, allow more complex matching

#  ex.  ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$
#  => matches strings w/ any of the following characteristics:
#       - begins w/ "GOPH" or "SERV"
#       - contains 2 or 3 "_" chars
#       - ends w/ "ADDR" or "NAME"
# 
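
a sketch of combining whole patterns w/ "&&" and "||", assuming a made-up
'names' function as input:

```shell
# hypothetical input strings
names() { printf '%s\n' 'GOPH_ADDR' 'SERV__NAME' 'my_NAME' 'GOPHER'; }

names | awk '/^(GOPH|SERV)/ && /(ADDR|NAME)$/'    # both must hold
names | awk '/^my/ || /ER$/'                      # either may hold
```

the first prints GOPH_ADDR and SERV__NAME, the second my_NAME and GOPHER.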

... on to AWK relational expressions ...

AWK relational operators: >, <, >=, <=, !=, ==, !~, ~

#  some examples:
#   '$3 > $1' => lines w/ field 3 > field 1
#   'NF >= 5' => lines w/ 5+ fields ; 'NR != 1' => NOT line 1
#   'length < 72' => length w/o arg is quasi-var == length of $0
# 
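
the same expressions run against three invented lines (the 'data'
function is an assumption for illustration):

```shell
# hypothetical input: three records of whitespace-separated fields
data() { printf '%s\n' '1 x 5' '9 x 2' '3 x 4 y z'; }

data | awk '$3 > $1'      # records 1 and 3
data | awk 'NF >= 5'      # "3 x 4 y z"
data | awk 'NR != 1'      # records 2 and 3
data | awk 'length < 6'   # "1 x 5", "9 x 2"
```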

AWK falls back to string comparison when either operand is not numeric
  eg. '42 <= "42abc"' => TRUE! (compared as strings) try to avoid this sort of thing..
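
a sketch of how such mixed comparisons come out, and one common way to
force a numeric comparison (adding 0); the 'cmp_demo' name is made up:

```shell
cmp_demo() {
    awk 'BEGIN {
        print (42 <= "42abc")      # 1: compared as strings, "42" sorts first
        print (10 <  "9abc")       # 1: still strings, "1" sorts before "9"
        print (10 <  "9abc" + 0)   # 0: "9abc" + 0 is the number 9
    }'
}
cmp_demo
```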


matching can be done as a range: '/pattern_start/ , /pattern_end/'
  eg. 'NR==10, NR==20' => lines 10-20

AWK lacks a way to directly reference the last line of data

work-around for last-line: eg. 'NR==42 , EOF'  => lines 42-last
  => works if EOF remains unassigned (0) ; or just use '0'
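
a sketch of the range forms above, using seq(1) to generate 50 numbered
lines (line numbers chosen arbitrarily):

```shell
seq 1 50 | awk 'NR==10, NR==12'    # lines 10-12
seq 1 50 | awk 'NR==48, EOF'       # lines 48-50, EOF is just an unset var
seq 1 50 | awk 'NR==48, 0'         # same, w/ a literal false end pattern
```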


... on to flow-control statements ...

AWK statements can be separated by newlines or ';', or continued by '\'
#   ex.  Foo = "fu" ; Bar = "bar"
#        Aws = 42
#        BigStr = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa" \
#                 "bbbbbbbbbbbbbbbbbbbbbbbbbbbb" \
#                 "cccccccccccccccccccccccccccc"
# 

branching & looping: if();else if();else, while(), do{}while(), various for()

details in sect. 9.7 of Classic Shell Scripting, Chap. 9 reference
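
a compact sketch exercising each construct once; the 'loop_demo' name,
the input '3 4 5' and the big/mid/small labels are invented for illustration:

```shell
loop_demo() {
    awk '{
        # for(): visit every field
        for (i = 1; i <= NF; i++)
            sum += $i

        # while(): test comes first, body may run zero times
        i = NF
        while (i > 0)
            rev = rev $(i--)

        # do-while(): body runs at least once
        i = 0
        do {
            i++
        } while (i < NF)

        if (sum > 10)
            print "big", sum, rev, i
        else if (sum > 5)
            print "mid", sum, rev, i
        else
            print "small", sum, rev, i
    }'
}
echo '3 4 5' | loop_demo    # big 12 543 3
```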

for basic increment/decrement of vars one can use 'i++' or 'i--'

note '++i' increments *BEFORE* use, 'i++' increments *AFTER*
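
a two-line sketch of the difference (variable names arbitrary):

```shell
incr_demo() {
    awk 'BEGIN {
        i = 5; a = i++    # a gets 5, then i becomes 6
        j = 5; b = ++j    # j becomes 6 first, then b gets 6
        print a, i, b, j
    }'
}
incr_demo    # 5 6 6 6
```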

other compact variable assignments: +=, -=, *=, /=, %=, ^=, =

see POSIX AWK std sheet in reference materials for details
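
a sketch walking one variable through each compound operator (values
chosen so every step stays an integer):

```shell
assign_demo() {
    awk 'BEGIN {
        x = 10
        x += 5     # 15
        x -= 3     # 12
        x *= 2     # 24
        x /= 4     # 6
        x %= 4     # 2
        x ^= 3     # 8
        print x
    }'
}
assign_demo    # 8
```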

F-C statements used in BEGIN/END, body {action} blocks, & defined funcs
#   ex.
#      # processing cmd line args (ARGV is built-in arg array):
#      BEGIN {
#        if (ARGV[1])
#            for (i=1 ; i < ARGC ; i++)
#                if (ARGV[i] ~ /^-([hH]|-help)/)
#                    show_usage()
#                else
#                    ...
#      }
# 
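
a runnable sketch of the same idea; the 'check_args' wrapper, the usage
text, and the body of show_usage() are all assumptions for illustration:

```shell
check_args() {
    awk '
    # hypothetical helper, stands in for a real usage message
    function show_usage() { print "usage: prog [-h] file ..." }

    BEGIN {
        for (i = 1; i < ARGC; i++)
            if (ARGV[i] ~ /^-([hH]|-help)/) {
                show_usage()
                exit 0
            }
        print ARGC - 1, "args, no help flag"
    }' "$@"
}

check_args data.txt -h
```

args after the program text land in ARGV untouched, so the BEGIN block
can inspect them before any input is read.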

AWK not too picky regarding "{}", single-statement blocks don't need them
#   ex.
#       # all one block => no '{}'s required...
#       for (i=1 ; i <= NF ; i++)
#           if ( $i !~ /^[0-9]+$/ )
#               if ( length ($i) < 72 )
#                   print $i
#               else
#                   fold($i)  # usr-defined func.
# 
#       # same as above but w/ maximum '{}'s...
#       for (i=1 ; i <= NF ; i++) {
#           if ( $i !~ /^[0-9]+$/ ) {
#               if ( length ($i) < 72 ) {
#                   print $i
#               } else {
#                   fold($i)  # usr-defined func.
#               }
#           }
#       }
#

C-style ternary conditional operator:  expr1 ? expr2 : expr3
#   ex.  print even / odd:
#     $ seq 1 9 |awk '{print $1, ($1%2 == 0 ? "even":"odd")}'
#     1 odd
#     2 even
#     3 odd
#     ...
# 

ternary operator nestable => quickly becomes hard to read..
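
a sketch of a single level of nesting, w/ invented labels; the 'sign_demo'
name is made up and the inner parentheses are optional but help:

```shell
sign_demo() {
    awk '{ print $1, ($1 < 0 ? "neg" : ($1 == 0 ? "zero" : "pos")) }'
}
printf '%s\n' -3 0 7 | sign_demo
```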

control-flow interruption: next, break, continue, exit

'break' & 'continue' used in for(), while(), do-while() loops

for nested loops 'break' & 'continue' interrupt the innermost loop

'next' ceases matching of current record & moves to next record

'next' only meaningful in main body { action } blocks, not BEGIN/END

'exit' w/ opt. expression, eg. 'exit 1', stops input processing & quits (END block still runs)
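
a sketch showing all four in one script; the 'flow_demo' name, the
input lines and the "STOP" / "END" markers are invented for illustration:

```shell
flow_demo() {
    printf '%s\n' '# comment' 'a b STOP c' 'x y z' 'END' 'never seen' | awk '
    /^#/        { next }       # skip record, no later rules run for it
    $1 == "END" { exit 3 }     # stop reading input, exit status 3
    {
        for (i = 1; i <= NF; i++) {
            if ($i == "b")    continue   # skip this field only
            if ($i == "STOP") break      # leave the for() loop
            print NR ": " $i
        }
    }'
}

flow_demo || echo "exit status: $?"
```

this prints "2: a", then "3: x", "3: y", "3: z", then reports exit status 3.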

--

# Challenge Exercise!
#  Both sed(1) and egrep(1) can use back-references to match
#  strings with multiple adjacent identical characters, e.g.
#  finding dictionary words w/ 3+ identical chars in a row:
# 
#     $ nice sed -n '/\(.\)\1\{2,\}/p' /usr/share/dict/words
#     bossship
#     demigoddessship
#     goddessship
#     headmistressship
#     patronessship
#     wallless
#     whenceeer
# 
#     $ nice egrep '(.)\1{2,}' /usr/share/dict/words
#     bossship
#     demigoddessship
#     goddessship
#     headmistressship
#     patronessship
#     wallless
#     whenceeer
# 
#  => how would you do this in AWK ?
#