Day 2: regular & relational expressions, flow-control statements

AWK supports Extended Regular Expressions (ERE), similar to sed(1) & grep(1)
unlike sed & grep, AWK does *NOT* have back-reference support
  => see 'Challenge Exercise'
two AWK RegEx forms: /regex/ & "regex" (good for dynamic concatenation)

# AWK ERE meta-character basics:
# "."     => any char ; "^" => match start of string ; "$" => match end of string
# "?"     => match 0 or 1 of preceding ; "+" => match 1 or more of preceding
# "*"     => match 0 or more of preceding ; "A-Z", "a-z", "0-9" => range set
# "[...]" => match any char contained; can be single char or range sets
# "[^..]" => match any char NOT contained (complement of above)
# "{n,m}" => match n-m number of preceding ; "{n,}" => matches n or more
#
# some common POSIX character classes - see re_format(7) for details:
# [:alpha:], [:alnum:], [:digit:], [:upper:], [:cntrl:], [:space:]
#
# metachars (usually) lose specialness inside "[]" ; escape w/ "\" otherwise
# ex. "[.+?*]" => matches ".", "+", "?" or "*"
#     "[\\]"   => matches "\" ; "[\]]" => matches "]"
#
# some special cases:
# "[[]" => matches "[" *IF* first char in brackets
# "[-]" => matches "-" *IF* first char in brackets
# complement ( "[^c..]" ) won't break 1st char special interpretation

# examples of typical ERE patterns:
# "^[sS]erver"  => matches str starting w/ "Server" or "server"
# "addr$"       => matches str ending w/ "addr"
# "(http|goph)" => matches str containing "http" or "goph"
# "^[a-z]+$"    => matches str ONLY containing a-z
# "[^sdf]"      => matches str containing at least one char other than s, d, or f
# "t{2}"        => matches str containing "tt"
# "[qv]+"       => matches str with one or more "q" or "v"

# boolean "&&" and "||" along w/ "(|)" allows more complex matching
# ex. ^(GOPH|SERV)|[_]{2,3}|(ADDR|NAME)$
#   => matches strings w/ any of the following characteristics:
#      - begins w/ "GOPH" or "SERV"
#      - contains a run of 2+ "_" chars (unanchored, so longer runs match too)
#      - ends w/ "ADDR" or "NAME"

... on to AWK relational expressions ...
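Before moving on, a minimal sketch of the two RegEx forms described above. The sample input lines and the variable name "proto" are made up for illustration; the string form is handy because the pattern can be assembled from a variable at run time.

```shell
# /regex/ form: static pattern, matched against $0 by default
printf 'Server01\nserver02\nmyserver\n' | awk '/^[sS]erver/'
# Server01
# server02

# "regex" form: pattern built by concatenation (hypothetical var "proto")
printf 'http://a\ngopher://b\nftp://c\n' |
    awk -v proto="goph" '$0 ~ ("^(http|" proto ")")'
# http://a
# gopher://b
```

The parentheses around the concatenation keep the whole string on the right-hand side of "~".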
AWK relational operators: >, <, >=, <=, !=, ==, !~, ~

# some examples:
# '$3 > $1'     => lines w/ field 3 > field 1
# 'NF >= 5'     => lines w/ 5+ fields ; 'NR != 1' => NOT line 1
# 'length < 72' => length w/o arg is quasi-var == length of $0
#

AWK compares numerically only when both operands look numeric,
otherwise it quietly falls back to string comparison
  eg. '42 <= "42abc"' => TRUE! (compared as strings)
try to avoid this sort of thing..

matching can be done as a range: '/pattern_start/ , /pattern_end/'
  eg. 'NR==10, NR==20' => lines 10-20
AWK lacks a way to directly reference the last line of data
work-around for last-line: eg. 'NR==42 , EOF' => lines 42-last
  => works if EOF remains unassigned (0) ; or just use '0'

... on to flow-control statements ...

AWK statements can be separated by newlines or ';', or continued by '\'

# ex. Foo = "fu" ; Bar = "bar"
#     Aws = 42
#     BigStr = "aaaaaaaaaaaaaaaaaaaaaaaaaaaa" \
#              "bbbbbbbbbbbbbbbbbbbbbbbbbbbb" \
#              "cccccccccccccccccccccccccccc"
#

branching & looping: if();else if();else, while(), do{}while(), various for()
details in sect. 9.7 of Classic Shell Scripting, Chap. 9 reference
for basic increment/decrement of vars one can use 'i++' or 'i--'
note '++i' increments *BEFORE* use, 'i++' increments *AFTER*
other compact variable assignments: +=, -=, *=, /=, %=, ^=, =
see POSIX AWK std sheet in reference materials for details
F-C statements used in BEGIN/END, body {action} blocks, & defined funcs

# ex.
#   # processing cmd line args (ARGV is built-in arg array):
#   BEGIN {
#       if (ARGV[1])
#           for (i=1 ; i < ARGC ; i++)
#               if (ARGV[i] ~ /^-([hH]|-help)/)
#                   show_usage()
#               else
#                   ...
#   }

# AWK not too picky regarding "{}", single-statement blocks don't need them
# ex.
#   # all one block => no '{}'s required...
#   for (i=1 ; i <= NF ; i++)
#       if ( $i !~ /^[0-9]+$/ )
#           if ( length ($i) < 72 )
#               print $i
#           else
#               fold($i)    # usr-defined func.
#
#   # same as above but w/ maximum '{}'s...
#   for (i=1 ; i <= NF ; i++) {
#       if ( $i !~ /^[0-9]+$/ ) {
#           if ( length ($i) < 72 ) {
#               print $i
#           } else {
#               fold($i)    # usr-defined func.
#           }
#       }
#   }

# C-style ternary conditional operator: expr1 ? expr2 : expr3
# ex. print even / odd:
#   $ seq 1 9 |awk '{print $1, ($1%2 == 0 ? "even":"odd")}'
#   1 odd
#   2 even
#   3 odd
#   ...
# ternary operator nestable => quickly becomes hard to read..

control-flow interruption: next, break, continue, exit
'break' & 'continue' used in for(), while(), do-while() loops
for nested loops 'break' & 'continue' interrupt the innermost loop
'next' ceases matching of current record & moves to next record
'next' only meaningful in main body action { expression }
'exit' w/ opt. expression, ie. 'exit 1', quits script completely
  (when called outside END, any END block still runs first)

--

# Challenge Exercise!
# Both sed(1) and egrep(1) can use back references to match
# strings with multiple adjacent identical characters, i.e.
# finding dictionary words w/ 3+ identical chars in a row:
#
# $ nice sed -n '/\(.\)\1\{2,\}/p' /usr/share/dict/words
# bossship
# demigoddessship
# goddessship
# headmistressship
# patronessship
# wallless
# whenceeer
#
# $ nice egrep '(.)\1{2,}' /usr/share/dict/words
# bossship
# demigoddessship
# goddessship
# headmistressship
# patronessship
# wallless
# whenceeer
#
# => how would you do this in AWK ?
#
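One possible approach to the exercise, offered as a sketch and not as the intended answer: since AWK has no back references, walk each word char by char and count the length of the current run of identical chars. The sample words piped in below are made up for a quick check; the variable name "run" is an arbitrary choice.

```shell
# hypothetical sample input; the exercise itself reads /usr/share/dict/words
printf 'bossship\nbookkeeper\nwallless\naaa\nab\n' |
awk '{
    run = 1                          # length of the current run of identical chars
    for (i = 2; i <= length($0); i++) {
        if (substr($0, i, 1) == substr($0, i-1, 1))
            run++
        else
            run = 1                  # char changed, start a new run
        if (run >= 3) {              # 3+ identical chars in a row
            print
            break                    # one hit per word is enough
        }
    }
}'
# bossship
# wallless
# aaa
```

To run it against the dictionary, drop the printf pipeline and pass /usr/share/dict/words as the file argument to awk.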