What a very nicely done demonstration to awaken (or in my case reawaken) a love for awk! A little taste of a variable, regex in a pattern, and a function. Sweet!!!
Now that that love has been reawakened, I'd like to drop by, at least once a week, to leave an example. This could easily carry on for the next 5 years - and I'd love it!
Anyway, I'll start by dropping off an example that piggybacks on one of the examples that was given in the video. It simply demonstrates the logical NOT operator ( ! )
# awk '!/foo/ {count++} END {print "num_foos:", count}' foo.txt
Not that it's necessary, but if we were to add one more line to the file "foo.txt", for example: chan bar8
Executing the above command will produce the following output:
num_foos: 1
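For anyone who wants to try the NOT operator without the foo.txt from the video, here is a minimal sketch. The three-line sample file and the label non_foo_lines are made up for illustration; they are assumptions, not the original data.

```shell
# Hypothetical sample data; NOT the foo.txt used in the video.
demo=$(mktemp)
printf 'foo bar1\nfoo bar2\nchan bar8\n' > "$demo"

# !/foo/ selects every line that does NOT contain "foo".
not_foo=$(awk '!/foo/ {count++} END {print "non_foo_lines:", count}' "$demo")
echo "$not_foo"    # non_foo_lines: 1

rm -f "$demo"
```

Only the "chan bar8" line lacks "foo", so the counter ends at 1.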
Thanks for the very succinct and powerful demonstration of the amazing filtering tool - awk
@Trevor good initiative and a nice addition too! Inspired by you, I will add an even simpler awk usage:
The toupper() function capitalizes the first letter. The substr() function in awk extracts a substring from a string; its syntax is substr(string, start, length). The substr($0, 2) expression extracts the rest of the line after the first character, and the tolower() function lowercases those letters.
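The one-liner being described did not survive in the post, so what follows is only a hedged reconstruction from the description above. The use of substr($0, 1, 1) for the first letter and the sample input are assumptions.

```shell
# Reconstruction from the description; the original command is not shown.
# substr($0, 1, 1) -> first character, substr($0, 2) -> everything after it.
cap=$(printf 'hELLO wORLD\nawk\n' |
      awk '{ print toupper(substr($0, 1, 1)) tolower(substr($0, 2)) }')
echo "$cap"
# Hello world
# Awk
```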
Chetan, another great contribution by you that demonstrates your value to the learning community. I'd give you 2 thumbs up (kudos) on this one if I could! A (seemingly) simple use of the awk tool, but indeed a powerful demonstration of its functionality and capability.
As always, thank you for the kind comments, and for continuing to raise the bar!!!
Like any good tool that offers programmability, awk makes use of variables. awk includes built-in variables, and also includes the ability to define (declare) variables.
In this post, I'll only mention the variables - no examples. In subsequent postings, I'll serve up an example (or examples) involving each variable. As I mentioned previously, this will be a marathon in terms of coverage, and not a sprint!
Below is a list of the built-in variables in awk, along with a brief explanation of each one. There are some others that are not in my list, only because they are variables of awk's cousin - GNU awk (aka gawk).
Variable          Description
$0                Whole line
$1, $2 ... $NF    First field, second field, ... last field
NR                Number of Records
NF                Number of Fields
OFS               Output Field Separator (default " ")
FS                Input Field Separator (default " ")
ORS               Output Record Separator (default "\n")
RS                Input Record Separator (default "\n")
FILENAME          Name of the file
ARGC              Number of arguments
ARGV              Array of arguments
FNR               File Number of Records
OFMT              Format for numbers (default "%.6g")
RSTART            Location of the match in the string
RLENGTH           Length of match
SUBSEP            Multi-dimensional array separator (default "\034")
Each awk statement consists of a pattern (or selection_criteria) with an associated action.
I feel that the {action} piece is what gives awk its real muscle. However, the selection_criteria definitely makes a significant contribution to the overall power provided by this tool.
Patterns (selection_criteria) in awk control the execution of rules - a rule is executed when its pattern matches the current input record. What is a record? Simply a line in the file that awk is running against.
Note: A rule contains a pattern and its associated action.
There are different ways/approaches to constructing patterns. I would like to expound on these.
The kinds of patterns supported by awk are:
- / regular expression /
- expression
- pattern1, pattern2
- BEGIN
- END
- empty
/ regular expression / - this will match content in a record when the text of the input record fits the regular expression
expression - this will match content in a record when the expression is non-zero (a number) or non-null (a string)
pattern1, pattern2 - A pair of patterns, separated by a comma, specifying a range of records. The range includes both the initial record that matches pattern1, and the final record that matches pattern2
BEGIN END - The two keywords represent special patterns. - These are NOT used to match input records. Instead, they supply start-up or clean-up actions for the awk script.
empty - The empty pattern matches every input record.
Let's look at some simple examples of each:
/ regular expression /
Example 1:
/trevor/ - match every line (input record) that contains "trevor"
Example 2:
/trevor|lee|chandler/ - match every line (input record) that contains either "trevor", "lee", or "chandler"
Example 3:
/^t/ - match every line (input record) that begins with the letter 't'
Example 4:
/t[aei]/ - match every line (input record) that has the pattern 'ta', 'te', or 'ti' somewhere on the line
Example 5:
/^t[ou]/ - match every line (input record) that begins with the pattern 'to' or 'tu'
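To see a few of these patterns run, here is a small sketch. The six-name file is invented for the demonstration (an assumption, not a file from this thread), and each pattern is used without an action so that matching lines are simply printed.

```shell
# Hypothetical sample file, one name per line.
names=$(mktemp)
printf 'trevor\nlee\nchandler\ntina\npeter\ntom\n' > "$names"

begins_t=$(awk '/^t/' "$names")          # lines beginning with t
has_t_vowel=$(awk '/t[aei]/' "$names")   # ta, te, or ti anywhere on the line
begins_to_tu=$(awk '/^t[ou]/' "$names")  # lines beginning with to or tu

echo "$begins_t"       # trevor, tina, tom
echo "$has_t_vowel"    # tina, peter
echo "$begins_to_tu"   # tom

rm -f "$names"
```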
I'll stop here with the examples for now. If you have some familiarity with the use of regular expressions in other commands, you know that the patterns can get pretty exotic! Down the road, I do intend to get pretty exotic!
expression
Here, I'll simply refer you to Chetan's previous post, where he uses the pattern NR == 2. This is a very common kind of expression, and is known as a comparison expression. Makes sense, huh? We're using a comparison operator (that everyone is familiar with) to test whether the value of the variable NR is equal to 2.
pattern1, pattern2
Example:
$1 == 2022, $1 == 2023 - match the first line (record) whose first field is 2022, and continue matching all lines after that until a line whose first field is 2023 is reached. So, if there is a line with 2022 in its first field, and the line with 2023 is 15 lines below that, the lines that are matched are: 1) the line with 2022 on it 2) the 14 lines following the line with 2022 on it, and finally 3) the line with 2023 on it. All this verbiage is why examples are so critical!!!
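Since a range is easier to see than to describe, here is a minimal sketch of the range pattern. The five-line file is made up for illustration, and the numbers are written without quotes so that the comparison is valid inside a single-quoted awk program.

```shell
# Hypothetical data: the first field is a year or a plain word.
years=$(mktemp)
printf '2021 old\n2022 start\nnote middle\n2023 end\n2024 later\n' > "$years"

# The range opens on the record whose first field is 2022
# and closes on the record whose first field is 2023 (both included).
in_range=$(awk '$1 == 2022, $1 == 2023' "$years")
echo "$in_range"
# 2022 start
# note middle
# 2023 end

rm -f "$years"
```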
I'm going to save the BEGIN/END patterns for a later post. These two keywords alone can demand a chapter!!! Again, this is going to be marathon coverage!
empty
The empty pattern is essentially an awk statement without a pattern:
Example:
awk '{ print $6 }' data - this will match every line (record) in "data", printing the 6th field of each line (record)
I won't cover actions in this post. However, I'll leave you with this little tease:
* The purpose of the action is to tell awk what to do once a line matches a pattern
* There MUST BE curly braces { } for each action
* If there is no action specified in an awk statement, the curly braces { } can be omitted, and the equivalent action is ' {print $0}'
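The last bullet can be checked quickly. This is just a sketch with three invented input lines: the same pattern is run once with an explicit { print $0 } and once with no action at all.

```shell
# Same pattern, with and without an explicit action.
with_action=$(printf 'lee\nearl\nwayne\n' | awk '/e$/ { print $0 }')
without_action=$(printf 'lee\nearl\nwayne\n' | awk '/e$/')

echo "$with_action"      # lee, wayne
echo "$without_action"   # lee, wayne (identical)
```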
Okay, I'll stop here. There's such a long way to go before this journey is complete. I feel as though I've only taken 6 steps on this yellow-brick road! In the next episode, in all likelihood, I'll continue with more examples on patterns.
I'm running a little bit behind on my posts. No excuses, but I've been a little busy. That's a typical excuse. I may as well say that the dog ate my computer.
Okay, this episode will be brief. I will continue with the "pattern" component of the awk tool.
Here's the content of my file named /tmp/sample:
trevor
lee
chandler
joseph
jerome
jessie
wayne
earl
Now for some examples:
$ awk '/[LiNuX]/ {print $0}' /tmp/sample
jessie
You'll recall that a pattern of the form [abcde] instructs awk to match any line in the file that contains either the letter 'a', 'b', 'c', 'd', or 'e'. So, in my actual command, the letters that are to be matched are 'L', 'i', 'N', 'u', 'X'. So, why is there only one line appearing in the output?
There are definitely lines with the letter 'l'. Why don't they appear in the output? The pattern is specifying an uppercase 'L'. The letter 'l' in the names "lee" and "chandler" is lowercase. Okay, you already know the takeaway here - case is significant!
$ awk ' /e/ {print $0 }' /tmp/sample
trevor
lee
chandler
joseph
jerome
jessie
wayne
earl
No surprises - hopefully - in the output that appears, based on this command. The lowercase letter 'e' appears somewhere on each line in the file.
Now, how do I go about displaying only the lines that end with the letter 'e'? Easy as jumping rope:
$ awk ' /e$/ {print $0} ' /tmp/sample
lee
jerome
jessie
wayne
Okay, that's it for this episode. I did say it would be brief, although I failed to mention that no heavy lifting would be required. All that was featured was:
1) case is significant
2) The '$' is the special character to anchor expressions to the end of a line. If you've completed the RH124 course, you've seen this in action before
1. Case sensitivity matters in patterns.
2. Use /[...]/ to match any character within the brackets.
3. Use /e/ to match any occurrence of 'e' in a line.
4. Use /e$/ to match lines ending with 'e'.
Okay, we've looked at the all-important "pattern" component of the awk tool. Now, let's go in another direction, and have a look at how awk can make use of variables.
What we'll see over the next few postings is that awk can make use of 3 categories of variables: 1) built-in 2) user-defined 3) shell variables
Note: I used the term "categories" in making reference to variables, and not the term "types", because awk doesn't have types of variables like some programming languages and other applications have. With awk, a variable is either a string OR number.
Built-in variables, as the name suggests, are variables that are built-in, predefined, in the awk tool. They come ready to be used in a predefined way - a predefined utility if you will.
Built-in variables have values already defined in awk, but we can also alter those values.
Here's a list of the variables that are built into the awk tool, along with a brief explanation of what each built-in variable represents:
CONVFMT This string controls conversion of numbers to strings. It works by being passed, in effect, as the first argument to the sprintf function. Its default value is "%.6g".
FS FS is the input field separator. The value is a single-character string or a multi-character regular expression that matches the separations between fields in an input record. If the value is the null string (""), then each character in the record becomes a separate field. The default value is " ", a string consisting of a single space. As a special exception, this value means that any sequence of spaces and tabs is a single separator. It also causes spaces and tabs at the beginning and end of a record to be ignored. You can set the value of FS on the command line using the `-F' option:
awk -F, 'program' input-files
OFMT This string controls conversion of numbers to strings for printing with the print statement. It works by being passed, in effect, as the first argument to the sprintf function. Its default value is "%.6g".
OFS This is the output field separator. It is output between the fields output by a print statement. Its default value is " ", a string consisting of a single space.
ORS This is the output record separator. It is output at the end of every print statement. Its default value is "\n".
RS This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.
SUBSEP SUBSEP is the subscript separator. It has the default value of "\034", and is used to separate the parts of the indices of a multi-dimensional array. Thus, the expression foo["A", "B"] really accesses foo["A\034B"].
ARGC ARGV The command-line arguments available to awk programs are stored in an array called ARGV. ARGC is the number of command-line arguments present. Unlike most awk arrays, ARGV is indexed from zero to ARGC - 1.
FILENAME This is the name of the file that awk is currently reading. When no data files are listed on the command line, awk reads from the standard input, and FILENAME is set to "-". FILENAME is changed each time a new file is read
FNR FNR is the current record number in the current file. FNR is incremented each time a new record is read (see section Explicit Input with getline). It is reinitialized to zero each time a new input file is started.
NF NF is the number of fields in the current input record. NF is set each time a new record is read, when a new field is created, or when $0 changes (see section Examining Fields).
NR This is the number of input records awk has processed since the beginning of the program's execution. NR is set each time a new record is read.
RLENGTH RLENGTH is the length of the substring matched by the match function. RLENGTH is set by invoking the match function. Its value is the length of the matched string, or -1 if no match was found.
RSTART RSTART is the start-index in characters of the substring matched by the match function. RSTART is set by invoking the match function. Its value is the position of the string where the matched substring starts, or zero if no match was found.
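As a quick illustration of the last two entries, here is a sketch of match() setting RSTART and RLENGTH. The input string and the patterns are arbitrary choices made for this example.

```shell
# A successful match: "lee" starts at character 8 and is 3 characters long.
found=$(echo 'trevor lee chandler' |
        awk '{ if (match($0, /lee/)) print RSTART, RLENGTH }')

# A failed match: RSTART becomes 0 and RLENGTH becomes -1.
missing=$(echo 'trevor lee chandler' |
          awk '{ match($0, /xyz/); print RSTART, RLENGTH }')

echo "$found"     # 8 3
echo "$missing"   # 0 -1
```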
Note: Regarding the NR and FNR variables, awk simply increments both of these variables each time it reads a record, instead of setting them to the absolute value of the number of records read. This means that your program can change these variables, and their new values will be incremented for each record. Okay, don't worry, I've got examples coming for this - and all of the other variables. Remember, this is a marathon, and not a sprint. Are you tired of seeing that comment in my posts?
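The difference between NR and FNR only shows up when awk reads more than one file, so here is a small sketch with two invented two-line files (their names and contents are assumptions made for the demonstration).

```shell
# Two hypothetical input files.
a=$(mktemp); b=$(mktemp)
printf 'one\ntwo\n' > "$a"
printf 'three\nfour\n' > "$b"

# NR keeps counting across files; FNR restarts for each new file.
counts=$(awk '{ print NR, FNR, $0 }' "$a" "$b")
echo "$counts"
# 1 1 one
# 2 2 two
# 3 1 three
# 4 2 four

rm -f "$a" "$b"
```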
I'll conclude this post with just one example, and I'll pick on one of the more simple built-in variables: NR
I've got a file named "post2", whose contents are shown below:
trevor lee chandler
joseph older chandler
jessie younger chandler
lonnie father chandler
laura mother chandler
As you can see, there are 5 lines in the file, with each containing 3 fields (firstname, mid-name, lastname) of information.
$ awk '{ print NR }' post2
1
2
3
4
5
As you can see, the output is simply a number that represents the record (i.e. line) that awk was reading. The first line in the file is record 1, the second line in the file is record 2, etc. Each time a line is read by the awk tool, the NR variable is incremented by 1. Complicated, huh?
Something very noteworthy that I'll mention right now is that there is no dollar-sign ($) preceding/prepending the variable name NR (i.e. $NR). To simply reference the current record number, no prepending of a '$' is needed.
Now, if you ran that same command, only this time prepending the NR variable with a '$', notice what the output looks like:
$ awk '{ print $NR }' post2
trevor
older
chandler
(blank)
(blank)
On those last 2 lines, (blank) is not actually printed to the screen. I simply put that there for the last 2 lines because those lines are blank, and you wouldn't be able to see those blank lines in my sample output.
I'm going to leave it to your investigation to discover why that output appears. If you can see why this is the output, you should stand up and take a bow, pat yourself on the back, and just feel real good about your understanding of what's going on here. Oh yes, there's much, much more on this cross-country journey of our look at awk, but with there being nothing overly intuitive about why the output is what it is, you've earned a reason to celebrate your effort.
I'll close this episode by saying something about my mentioning of the sprintf function, when I provided the explanation about the CONVFMT variable. I simply wanted to say that there's no need to be concerned about this item at this point. My coverage for that is way down the road. This is a cross-country journey, going from East coast to West coast, and right now we're only in North Carolina. Let's pace ourselves!!!
Okay, this episode on awk will be very abbreviated, with a focus on the sources of input for awk.
awk can take its input from 3 sources: 1) a file 2) a pipe 3) standard input
All of the examples that we've looked at up to this point have shown the input coming from a file: awk options 'pattern { action } ' input-file
If an input-file named "student" contains the line "Trevor Lee Chandler", the command
awk '{ print $2 }' student
will display Lee on the screen. Nothing new there. By now, we're well acquainted with awk enough to have expected that output on the screen.
Now, let's look at an example involving an old friend - the pipe mechanism.
cat student | awk '{ print $2 }'
What's the output? You guessed it - Lee. The information that awk is processing is the same. The sole difference with this command is where the information is coming from: via a pipe vs a filename.
The last source of input for awk involves input from stdin, which is indicated by use of the dash/hyphen character. The dash/hyphen character ( - ) tells awk to expect its input from the keyboard (i.e. the input to awk will be entered manually by the user, and is not read from a file or provided via a pipe). Let's look at an example:
awk '{ print $2 }' -
Trevor Lee Chandler    # Content that is manually input
Lee                    # This is the output that will appear on the screen
Control-C              # This is used to terminate the execution of the awk command
Note: When using the dash/hyphen character for input to awk, awk will continue to expect content after each line is input. So, in the example above, after "Trevor Lee Chandler" is input, the awk command processes that line, outputting "Lee". You will then be provided a blank line, indicating that another line of input is expected, or you can terminate the execution of the awk command by inputting Control-C - you know that old Ctrl-C key combination. Going through an example on your own will certainly demonstrate what I'm attempting to convey here.
To recap, the awk command can receive its input from 3 sources: 1) filename 2) pipe 3) stdin (standard input)
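The recap can be tried in one go with the sketch below. The temporary file stands in for the "student" file, and, because a script cannot type at the keyboard, the stdin case redirects the file into awk; that redirection is an assumption made only so the example runs unattended.

```shell
# Stand-in for the "student" file from the post.
student=$(mktemp)
echo 'Trevor Lee Chandler' > "$student"

from_file=$(awk '{ print $2 }' "$student")            # 1) filename
from_pipe=$(cat "$student" | awk '{ print $2 }')      # 2) pipe
from_stdin=$(awk '{ print $2 }' - < "$student")       # 3) stdin via "-"

echo "$from_file $from_pipe $from_stdin"   # Lee Lee Lee

rm -f "$student"
```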
As promised, this would be an abbreviated episode. That's all folks!!!
Chetan's last lesson was for the PhD's among you. I'm going to provide one for the undergraduates.
In an earlier session, we looked at the function/purpose of the $ (dollar sign) in awk. You will recall that it is used only when accessing the contents of a field in a line that was read as input by awk -- the $ is NOT used to access the value of variables, as is the case in the shell (e.g., Bash).
Okay, now on to today's feature: the FS variable. Quite often, when a character is to be specified for this variable, it is done so on the command line, using the -F option. Again, from a previous lesson, we learned that the default value of the FS variable is the space character (" ").
Let's go ahead and dig into an example to see this variable in action. The file ("names") that I will be using for my examples contains the following content:
First:Middle:Last
Lonnie:Sr:Chandler
Laura:Ethel:Chandler
Joseph:Jerome:Chandler
Trevor:Lee:Chandler
Jessie:Wayne:Chandler
Because I'm the author of the file, I know that I'm intending to have the colon (:) serve as the character separating the fields of information. With that being the case, we can readily see that there are three fields of information: first name, middle name, and last name.
Example 1: $ awk '{ print $1 }' names
First:Middle:Last
Lonnie:Sr:Chandler
Laura:Ethel:Chandler
Joseph:Jerome:Chandler
Trevor:Lee:Chandler
Jessie:Wayne:Chandler
Whoa! The command is specifying that ONLY field 1 is to be printed. What's going on here? The command is performing exactly as it should. Using the default value of a space character as the field separator, all of the content on each of those lines in the output represents field 1.
If each field in this file is to be referenced, a colon must be used as the field separator.
Example 2: $ awk -F: '{ print $1 }' names
First
Lonnie
Laura
Joseph
Trevor
Jessie
Eureka! Just what the doctor ordered! My output is comprised only of the content in field 1.
Example 3: $ awk -F : '{ print $2 }' names
Middle
Sr
Ethel
Jerome
Lee
Wayne
Voila! Only the information in field 2 is output.
Note: Let me point out something that is very subtle in this example. Notice the position of the colon - there is a space between the -F and the colon. This is nothing major. The only intent here is to let you see that the character to be specified as the field separator does not have to immediately follow the 'F'.
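Here is a short sketch comparing the two spellings of -F from the note, plus a third way of getting the same result by assigning FS in a BEGIN block. The BEGIN form is not covered in this post, so treat it as an extra shown for comparison; the input line is taken from the "names" file.

```shell
line='Joseph:Jerome:Chandler'

attached=$(echo "$line" | awk -F: '{ print $2 }')    # -F: (no space)
spaced=$(echo "$line" | awk -F : '{ print $2 }')     # -F : (with space)
in_begin=$(echo "$line" | awk 'BEGIN { FS = ":" } { print $2 }')

echo "$attached $spaced $in_begin"   # Jerome Jerome Jerome
```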
Okay, that will conclude this post. The coverage should be brief because there was only one item that was featured: Input Field Separator
@Trevor What a nice significant addition and a wonderful note about Input Field Separator.
Here is a continuation of this:
FPAT="[^,]+": This statement sets the field pattern (FPAT) to a regular expression that matches one or more characters that are not commas. This means that each field is a run of non-comma characters, so the commas end up acting as the separators between fields.
Note that FPAT is a GNU awk (gawk) extension. Where FS describes what separates the fields, FPAT describes what a field itself looks like.
In this post, I'd like to have a look at a couple of Regex operators used in awk, known as anchors:
- ^ - $
The ^ will match some pattern that is at the beginning of a line.
The $ will match some pattern that is at the end of a line.
In the examples that will follow, I'll use another regex operator, but this is not what is being featured in this post. One reason is because we've had a look at this in a previous post. The regex operator I make reference to is [...]. As was stated previously, this is called a bracket expression. Its purpose is to match any one of the characters within the square brackets. For example, [eiEI] will match either a lowercase e, lowercase i, uppercase E, or an uppercase I.
Okay, let's get to the examples.
The files I will be using for my examples are comprised of the following content:
file1:
a begins this line
e begins this line
i begins this line
o begins this line
u begins this line
A begins this line
E begins this line
I begins this line
O begins this line
U begins this line
file2:
line that ends with a
line that ends with e
line that ends with i
line that ends with o
line that ends with u
line that ends with A
line that ends with E
line that ends with I
line that ends with O
line that ends with U
Example 1: In this example, the only lines that will be displayed are the ones that begin with either a lowercase e, a lowercase i, an uppercase E, or an uppercase I.
$ awk '/^[eiEI]/ { print }' file1
e begins this line
i begins this line
E begins this line
I begins this line
Example 2: In this example, the only lines that will be displayed are the ones that end with either a lowercase e, a lowercase i, an uppercase E, or an uppercase I.
$ awk '/[eiEI]$/ { print }' file2
line that ends with e
line that ends with i
line that ends with E
line that ends with I
Nothing overly challenging, right? And for that reason, this will be another brief demonstration - a snack.
Let me close by serving up a command, to see where your thinking is on the two regex operators featured in this posting:
$ awk '/^[eiEI]$/ { print }' some-filename
Question: What lines do you think might be output when this awk command is executed? I'm not asking the question based on either of the files that I've used in this posting.
In this episode of "Parsing Output with AWK", I want to talk about functions that are built into awk. Those built-in functions fall into three categories:
- numeric
- string
- I/O
The built-in function that I want to demonstrate in this post is in the string category: toupper(string). The "toupper" function takes one argument - a string. The function returns a copy of the string, with all lower-case characters converted to upper case characters. Let's look at a short example to demonstrate the utility of this function.
The file that will be used in my example, "schools", has the following content:
Booker T Washington
Evan E Worthing
John H Yates
Phillis W Peters
James D Ryan
Carter G Woodson
Bennie C Elmore
$ awk '{ print $3 }' schools
Washington
Worthing
Yates
Peters
Ryan
Woodson
Elmore
No surprises in this output. The content in the 3rd column is displayed.
Now, let's see what the output looks like when we deploy the "toupper" built-in function:
$ awk '{ print toupper($3) }' schools
WASHINGTON
WORTHING
YATES
PETERS
RYAN
WOODSON
ELMORE
I'm sure you saw this coming - all of the lower-case characters in the 3rd column were converted to uppercase. The "toupper" function performed exactly as it was described above.
Okay, that's all for this post. There will be many more to follow, demonstrating some of the other built-in awk functions.
This command reads each line from the file "data.txt", splits the line based on the comma delimiter (",") into an array named "arr", and then prints the first element (arr[1]) of the array.
sum=0: Initializes a variable sum to store the running total.
n = split($0, arr, ","): Splits the line and stores the number of elements in n.
for (i=1; i<=n; i++) { ... }: This loop iterates through each element in the arr array:
- i: This is the loop counter, starting from 1 and iterating until it reaches the value of n.
- sum += arr[i]; : This statement adds the current element of the arr array (arr[i]) to the running total (sum).
- printf "%s%s", arr[i], (i< n ? " + " :""); : This statement prints the current element (arr[i]) and a "+" symbol if it's not the last element:
  - %s: This format specifier tells printf to print a string.
  - (i< n ? " + " :"") : This is a ternary operator that checks if i is less than n. If it is, it prints a "+" symbol; otherwise, it prints an empty string.
printf " = %d\n", sum; : This statement prints the final sum stored in the variable sum:
- %d: This format specifier tells printf to print an integer.
- \n: This prints a newline character.
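The command being walked through is not shown in the post, so the following is only a reconstruction assembled from the pieces explained above. The way the statements are put together and the two sample input lines are assumptions.

```shell
# Reconstructed from the explanation; not the original command.
sums=$(printf '1,2,3\n10,20\n' | awk '{
    sum = 0                               # reset the running total per line
    n = split($0, arr, ",")               # n = number of comma-separated items
    for (i = 1; i <= n; i++) {
        sum += arr[i]
        printf "%s%s", arr[i], (i < n ? " + " : "")
    }
    printf " = %d\n", sum
}')
echo "$sums"
# 1 + 2 + 3 = 6
# 10 + 20 = 30
```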
I agree, but nowadays it seems a sort of "Forgotten Art/Craft". I started working in the early 90s. Back then AWK/SED/GREP were the usual tools for most automation tasks in shell scripts.