CS46B Lab 10

Copyright © Cay S. Horstmann 2012 Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Objectives

A. Writing a Script

When you submit your homework, some instructors will ask you to zip up all homework files. A beginner would use a program such as WinZip for this task. You know the drill. Click on the WinZip icon. Click, click, click, until you have added each file. Click, click, click, type the zip file name, click, click, and you are done. Did you need to make a last-minute change to a file? Do it all over again. And again.

You can do better than that. The command

jar -cf homework.zip *.java

makes a zip file called homework.zip that contains all Java files in the current directory. (The c option creates a new archive, and the f option says that the next argument is the name of the archive file.)

Maybe you also need to include some other files (such as Javadoc documentation and UML diagrams). Then the command gets longer.

jar -cvf homework.zip *.java *.html *.png

(The v option produces verbose output, listing the actions that the jar command takes.)

  1. Try it out. Open a bash shell. Using cd, change to a directory that contains a couple of Java files. Type
    jar -cvf homework.zip *.java

    What output do you get? (If you get a complaint that the jar command is not found, check your PATH—see Lab 2 Section I 6. You need to fix this before going on with the lab.)

  2. Using Emacs, make a file ~/zipup (that is, a file with name zipup in your home directory). Place the following inside:
    #!/bin/bash
    # Zip up homework source, HTML, images
    jar -cvf homework.zip *.java *.html *.png 

    Save the file and exit Emacs.

    The lines starting with # are comments. The first line indicates that this is a bash shell script.

    Back in the shell window, type

    chmod +x ~/zipup

    This makes the file executable.

    Now type

    ~/zipup

    What happens?

  3. Your instructor may ask you to run the Javadoc program to extract document comments before you submit your homework. The command is
    javadoc *.java

    Enhance the zipup file to automatically execute the javadoc program. What are the contents of the file now?
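
The create, chmod, run cycle from step 2 can be tried on a throwaway script first. The following is a minimal sketch, not part of the lab files: the scratch directory and the script name hello are made up for illustration, and the script only prints a line so that it works even where jar is not installed.

```shell
# Work in a scratch directory so nothing in the home directory is touched.
scratch=$(mktemp -d)
script="$scratch/hello"        # hypothetical script name

# Write a two-line script: the #! line plus one command.
cat > "$script" <<'EOF'
#!/bin/bash
# Print a greeting
echo "hello from a script"
EOF

# Record whether the file is executable before and after chmod.
if [ -x "$script" ]; then before=yes; else before=no; fi
chmod +x "$script"
if [ -x "$script" ]; then after=yes; else after=no; fi

# Now the file can be run by giving its path, just like ~/zipup.
output=$("$script")
echo "executable before chmod: $before, after chmod: $after"
echo "$output"
```

The -x test used here asks whether a file is executable, which is the property that chmod +x turns on.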

B. Shell Script Arguments

Suppose you sometimes want to zip up your homework to hw1.zip, then to hw2.zip, or even to hw1_1728.zip. This can be easily done by modifying zipup.

Change the filename homework.zip to $1.zip:

jar -cvf $1.zip *.java *.html *.png

When executing the shell script, you now need to supply the name of the zip file (without the .zip extension which is appended in the shell script). For example,

~/zipup hw1 

Then the $1 is replaced by hw1. $1 is the first argument of the shell script, $2 is the second, and so on.

  1. How do you run the shell script to save your homework to hw1_1728.zip?
  2. Run zipup without an argument. What happens?
  3. Let's modify the shell script to give an error message if no command line argument is specified. Add the following lines before the jar command.
    if [ -z "$1" ]
    then
      echo "Usage: zipup nameOfFile"
      exit
    fi

    Now what happens when you run the script without an argument?

  4. What's fi?
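
To experiment with $1, $2, and the emptiness test without touching zipup, you can use a shell function, since positional parameters work the same way inside a function as inside a script file. This is only a sketch: describe_args is a hypothetical name, and $# (the number of arguments) does not appear in the lab text.

```shell
# Hypothetical function standing in for a script such as zipup.
describe_args() {
  # -z is true when the string is empty, that is, when no argument was given.
  if [ -z "$1" ]
  then
    echo "Usage: describe_args first [second]"
    return 1
  fi
  echo "count=$# first=$1 second=$2"
}

with_args=$(describe_args hw1 extra)
no_args=$(describe_args) || true   # the usage branch returns a nonzero status
echo "$with_args"
echo "$no_args"
```

Quoting "$1" keeps the test well formed when the argument is missing or contains spaces.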

C. Redirection

Imagine you work as a grader for your professor. You get a bunch of zip files, one from each student. For each of the zip files, you want to go through the same steps: unzip the file, compile the Java source, run the program, and capture the output.

Why capture the output? You can then look at it when it is convenient, or email it to the professor or student. Let's design such a shell script. We'll call it grade. It takes two parameters: the name of the zip file (without the extension), and the class with the main method. For example,
grade hw1_1728 BankAccountTest

  1. What are the commands to unzip the file hw1_1728.zip, compile the Java source, run the BankAccountTest program, and capture the output in the file hw1_1728.txt? Hint: Use redirection: java ... > ...
  2. Now write the grade file. Put it in your home directory again. Simply take the results from the preceding exercise and replace hw1_1728 with $1 and BankAccountTest with $2. What are the contents of grade?
  3. Put this file in your current directory: bank.zip.

    Then run

    ~/grade bank BankAccountTest

    What output file was created? What are the contents of that file?

  4. Another useful redirection operator is >>. It appends the output of a program to a file.

    Modify your script so that it first writes the contents of $2.java to $1.txt, then appends the result of running the program. What is your script now?

  5. It would be nice to separate the file contents and the program run by a line ===Program Run===. How can you do that? (Hint: echo, >>)
  6. Finally, let us capture the compiler output in the homework report as well. Unfortunately, now we have a problem. Compiler errors are reported to the standard error stream, not the standard output stream. The > and >> operators only redirect the standard output stream. The remedy is to use the 2> operator, which redirects the standard error stream. (For historical reasons, that stream is also known as stream #2.)
    javac *.java 2> homework.txt

    Add this enhancement to the grade file. What is your file now?

  7. Introduce an error into BankAccountTest.java and run the grade script. How does the error show up in the report file?
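
The difference between the two streams is easy to see with a command that writes to both. The sketch below is not the grade script: talk is a hypothetical function, the file names are invented, and 2>> (append standard error) is an operator that the lab text does not mention but that bash provides alongside 2>.

```shell
scratch=$(mktemp -d)
report="$scratch/report.txt"

# Hypothetical command that writes one line to each stream.
talk() {
  echo "to stdout"
  echo "to stderr" >&2
}

# > creates (or overwrites) the file with standard output only.
talk > "$report" 2> /dev/null

# >> appends to the file instead of overwriting it.
echo "--- marker ---" >> "$report"

# 2>> appends standard error only; standard output is discarded here.
talk 2>> "$report" > /dev/null

cat "$report"
```

Note that a plain 2> would have replaced the existing report rather than adding to it.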

D. Loops

You can do simple programming in shell scripts. You have already seen the if ... then ... fi construct. Now let's turn to loops.  The syntax is

for var in ...
do
  command $var
done

You probably expected rof, not done at the end, but bash isn't plagued by a foolish consistency.

Here is an example:

for f in *.java
do
   javac $f
done

Or, all on one line:

for f in *.java ; do javac $f ; done

Note the semicolons before do and done.

Now let us put this technique to work in our grading shell script. We want to feed all input files of the form input*.txt to the program to be graded. That way, the grader can prepare an arbitrary number of files input1.txt, input2.txt, etc.

for f in input*.txt ; do ( java $2 < $f >> $1.txt ) ; done

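Note the () around the java command. The parentheses run the command in a subshell, which makes the scope of the redirection explicit. (They are not strictly required here: a < or >> written directly after a command applies to that command only, whereas a redirection written after done applies to the whole loop.)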
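
You can try the same loop shape without a Java program. In this sketch, tr (which converts its standard input to upper case) stands in for the program being graded, and the input files and the output file name are made up for illustration.

```shell
scratch=$(mktemp -d)
printf 'one two\n' > "$scratch/input1.txt"
printf 'three four five\n' > "$scratch/input2.txt"
out="$scratch/result.txt"

# Each pass feeds one input file to the command and appends its output.
for f in "$scratch"/input*.txt ; do ( tr 'a-z' 'A-Z' < "$f" >> "$out" ) ; done

cat "$out"
```

The shell expands input*.txt in sorted order, so the output of input1.txt comes before that of input2.txt.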

  1. Add this enhancement to your grading shell script. What is your script now?
  2. Unzip this zip file. How did you do it, using just the command line?
  3. Produce two files input1.txt and input2.txt with different inputs for the SavingsAccountTest program. (Run it to see what inputs are expected.) What are your input files?
  4. What were the contents of interest.txt after you ran ~/grade interest SavingsAccountTest?
  5. How can you improve the grade script so that the outputs that belong to different program runs are separated from each other? (Hint: echo)

E. Regular Expressions

  1. Can you think of an English word that contains the consecutive letters ea, followed by another letter, followed by the letters ou?

    There are actually several, such as zealous and whereabouts.

    How about a word that contains the consecutive letters ea, followed by another letter, followed by the letters io?

    (If you don't know one, just write “I don't know”. You'll learn in this lab how to find the answer.)

  2. A regular expression describes a set of strings that have a “regular” structure. For example, here is such a set

    {eaaou, eabou, eacou, eadou, eaeou, eafou, eagou, eahou, ..., eazou}

    This set is described by the regular expression ea[a-z]ou.

    Generally, a character matches itself (such as ea and ou in the example above). But the expression [a-z] matches any lowercase letter from a to z. [a-zA-Z] matches any upper- or lowercase letter, and [aeiou] matches any vowel.

    Using the same syntax, write an expression for “any letter or digit”.

  3. The egrep command prints out all lines in a file that contain a match for a regular expression. Type
    egrep 'ea[a-z]ou' /usr/share/dict/words

    What happens?

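    When you browse the web, you may run into tutorials where the same command is written as grep ea[a-z]ou /usr/share/dict/words. Plain grep uses an older regular expression syntax in which characters such as +, ?, |, and () are not operators unless they are escaped, so use egrep (or its equivalent, grep -E) for the expressions in this lab. (Recent versions of GNU grep print a warning when you run egrep and recommend grep -E instead.) Also, get into the habit of enclosing the regular expression inside single quotes ''. Then the command shell won't intercept characters such as $ and \ inside your regular expressions.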

  4. You can form the complement of a letter set with the [^...] syntax. For example, [^aeiou] means “anything but a vowel”.

    Find all words in /usr/share/dict/words that have a q or Q followed by a letter other than u. What is your call to egrep?

  5. The ^ character matches the beginning of the line, and $ matches the end of the line. For example, ^oo matches all words that begin with oo, such as ooze or oodles.

    The | operator separates alternative matches. For example, a(vv|x)y matches savvy or waxy.

    How do you find all words that begin or end with oo, using a single call to egrep?

  6. To match a repetition, use one of the following operators:
    * 0 or more
    + 1 or more
    ? 0 or 1
    {n} n times
    {n,} at least n times
    {,n} at most n times
    {m,n} between m and n times

    Find all words in /usr/share/dict/words that contain five or more consecutive letters, each of which is an a, b, or c (such as cabbage). What is your call to egrep? What matches did you find?

  7. A period matches any character. This is most often used in a repetition. .* means zero or more characters, and .+ means at least one character. For example, o.+o.+o matches tomorrow but not zoology.

    How do you find all words that contain oo twice, such as foolproof or voodoo?

  8. /usr/share/dict/words is a bit special since it has one word per line. By default, egrep lists all lines that contain a match. What happens when you run
    egrep '[A-Za-z]+' BankAccount.java
  9. If you want to have only the matches and not the lines containing them, use the -o option:
    egrep -o '[A-Za-z]+' BankAccount.java

    What happens when you try that?
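
If your system has no /usr/share/dict/words, you can still check the example patterns from this section against a tiny word list of your own. The sketch below uses only the sample words mentioned above. The file name is invented, and grep -E is used as the equivalent spelling of egrep.

```shell
scratch=$(mktemp -d)
words="$scratch/words.txt"

# One word per line, like /usr/share/dict/words.
printf '%s\n' zealous whereabouts ooze oodles savvy waxy tomorrow zoology > "$words"

m1=$(grep -E 'ea[a-z]ou' "$words")   # ea, any lowercase letter, ou
m2=$(grep -E '^oo' "$words")         # oo at the start of the line
m3=$(grep -E 'a(vv|x)y' "$words")    # a, then vv or x, then y
m4=$(grep -E 'o.+o.+o' "$words")     # three o's with something in between

# The unquoted variables print each list of matches on one line.
echo ea[a-z]ou: $m1
echo ^oo: $m2
echo 'a(vv|x)y:' $m3
echo o.+o.+o: $m4
```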

F. Pipes

How many words contain oo twice? You can count the output of the preceding call to egrep, but that's tedious. Instead, you can use the wc program that counts words.

  1. Run wc < BankAccount.java

    What output do you get? What do you think the numbers mean?

    If you can't figure it out, try running with this file instead:

    Hello World
    Goodbye
  2. Re-run the egrep command from step E7 and save the output to a file temp.txt. Then use wc to count the words in temp.txt. What were your commands?
  3. Because this combination is so frequent, there is a shortcut for it, called a pipe. The command
    egrep 'your pattern' /usr/share/dict/words | wc

    feeds the standard output of egrep into the standard input of wc, without the need to make a temporary file.

    What command do you call to find out how many words contain oo twice?

  4. It's a bit unpleasant that we get those three numbers when we just want the first one. Use another pipe and egrep -o '^[ ]*[0-9]*' to only grab the first number. What is your command?
  5. What is the output?
  6. Explain that egrep command. What does it match?
  7. There is a simpler way of getting just the word count. What is it? (Hint: wc --help)
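
To convince yourself that a pipe really replaces the temporary file, you can run both variants side by side. This is a sketch with an invented four-word file and the ^oo pattern from section E. grep -E is the equivalent spelling of egrep.

```shell
scratch=$(mktemp -d)
words="$scratch/words.txt"
printf '%s\n' ooze oodles savvy waxy > "$words"

# Two steps, with a temporary file in between.
grep -E '^oo' "$words" > "$scratch/temp.txt"
via_file=$(wc < "$scratch/temp.txt")

# One step, with a pipe.
via_pipe=$(grep -E '^oo' "$words" | wc)

# wc may pad its columns differently in the two cases, so the unquoted
# variables let the shell collapse the extra spaces before printing.
echo $via_file
echo $via_pipe
```

Both lines show the same three numbers.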

G. Example: Timing Program Runs

  1. Download this class file, move it to the current directory, and run
    java Median1 10000

    What output do you get?

  2. Now we want to time it:
    time java Median1 10000

    What output do you get?

  3. We'd like that for more than one input size:
    for f in 10000 20000 30000 ; do ( time java Median1 $f ) ; done

    What output do you get?

  4. We only care about the real (elapsed) times. Use grep to only show those lines:
    ( for f in 10000 20000 30000 ; do ( time java Median1 $f ) ; done ) | grep real

    What happens? Why?

  5. Apparently, time sends its output to stderr, not stdout. The syntax to pipe stderr is a bit bizarre. It is
    command1 2>&1 | command2

    Fix up the command of the preceding step. What is your command? What is your output?

  6. We don't want just three timings. We want more. Run this command:
    seq 10000 5000 100000

    What is the output?

  7. What do the arguments of seq mean? (Try seq --help)
  8. We want the seq output as the parameters of the for loop. In bash, you can splice the output of one command into another with the backticks `...`

    for f in `seq 10000 5000 100000` ; do

    Try this. What command did you use? What is your output?

    Be sure to use the backtick (to the left of the 1 key) and not the single quote.
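
The two tricks from this section, 2>&1 and backticks, can be tried without Java. In the sketch below, fake_timer and sizes are hypothetical functions: the first imitates a command whose interesting line goes to stderr, and the second supplies the values for the loop. Bash also accepts $(...) as a more readable alternative to backticks.

```shell
# Hypothetical stand-in for a timed program: one line per stream.
fake_timer() {
  echo "program output"
  echo "real 0m0.100s" >&2
}

# Hypothetical helper whose output becomes the loop values.
sizes() {
  echo 10 20 30
}

# Without 2>&1, only stdout enters the pipe, so grep finds nothing.
plain=$(fake_timer 2> /dev/null | grep real || true)

# With 2>&1, stderr is merged into stdout before the pipe.
merged=$(fake_timer 2>&1 | grep real || true)

echo "without 2>&1: '$plain'"
echo "with 2>&1: '$merged'"

# Backticks splice the output of sizes into the for loop.
list=""
for n in `sizes` ; do list="$list<$n>" ; done
echo "$list"
```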

H. Example: Cleaning up Data (Optional)

  1. Go to this site and download the file called “A subset of about 1700 labeled email messages”. This is a different kind of archive, not a ZIP file. To see what's inside, use the command
    tar tvfz enron_with_categories.tar.gz

    How many files are inside? (Hint: Run it again and pipe into wc.)

  2. To extract the file, run
    tar xvfz enron_with_categories.tar.gz

    Afterwards, run

    find enron_with_categories | wc

    What number of lines do you get? Why?

  3. Run
    egrep -o -h -r '[^0-9][0-9]{3}[-) ]+[0-9]{3}[- ]+[0-9]{4}' enron_with_categories

    What output do you get?

  4. What are the meanings of the -o, -h, -r options? (Hint: egrep --help)
  5. Explain the regular expression.
  6. As you can see, people are not very consistent with formatting telephone numbers.
    -888-271-0949
     800-283-1805
     609 279 4094
    (415) 777 -0220
     415 781 0701

    Imagine you are an intern working for a congressional committee. Your task is to clean these up for your neat-freak boss: Put them all into the format (888) 271-0949. There are several hundred of them, so you don't want to do it by hand.

    The first step is easy: put them all in a single text file. How do you do that?

  7. Load that file into Emacs.

    A programmer's editor can match regular expressions. (Notepad—not so much.) In Emacs, the command is Edit -> Search -> Incremental search -> Forward regexp or C-M-s (i.e. Control+Alt+s). Enter that command. Type [0-9]+. Then type C-s repeatedly to go from one match to the next.

    What happened when you typed in [0-9]+?

  8. Now let's try a regex replace. In Emacs, the command is Edit -> Replace -> Regexp replace or C-M-% (i.e. Control+Alt+Shift+5, because % is a shifted 5, at least in the US keyboard layout). Replace all [0-9]+ with the # symbol. In Emacs, you type a ! to indicate global replacement. That is, type C-M-%, then [0-9]+ and Enter, then # and Enter, and finally !.

    What happens?

  9. Ok, you didn't want that. Undo it. (Ctrl+Z in Emacs provided you activated CUA mode)

    Let's try matching each of the numbers. A suitable regular expression is

    .[0-9]+[^0-9]+[0-9]+[^0-9]+[0-9]+

    Explain this expression in English.

  10. Now we need an added twist. You can mark matching subexpressions. In the Emacs syntax, you use \( and \) to mark the groups:
    .\([0-9]+\)[^0-9]+\([0-9]+\)[^0-9]+\([0-9]+\)

    Then you use \1 \2 \3 to refer to the match for the first, second, and third group.

    Try it out: Replace each phone number with

    (\1) \2-\3

    What output did you get?
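
The same grouping idea also works outside Emacs. As a sketch, here is the replacement applied to the five sample numbers with sed -E. Using sed is an assumption on our part, since the lab itself does this step in Emacs. In this extended syntax the groups are written with plain ( and ) instead of \( and \).

```shell
# The five sample numbers, each with the one leading character that
# the egrep pattern from step 3 picked up.
cleaned=$(printf '%s\n' \
    '-888-271-0949' \
    ' 800-283-1805' \
    ' 609 279 4094' \
    '(415) 777 -0220' \
    ' 415 781 0701' |
  sed -E 's/.([0-9]+)[^0-9]+([0-9]+)[^0-9]+([0-9]+)/(\1) \2-\3/')

echo "$cleaned"
```

Each output line has the form (888) 271-0949.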

Congratulations! If you were a congressional staffer, you could now take the afternoon off instead of laboriously formatting each phone number.

Hopefully, these exercises have given you a feel for the power of automation. While it is undeniably challenging to automate a task for the first time, the effort is repaid handsomely. It is fun to watch the computer do the same boring tasks over and over, particularly if you consider how much time it would have taken you to do it by hand.

In your programming and testing process, you carry out lots of repetitive steps. Automate them, and you will become more productive. You will also find that you would never attempt certain tasks without automation. For example, consider the task of testing your programs. Whenever you change a program, you should really test it again with a bunch of inputs. Do you do that? Probably not. What could be more tedious than typing in the same inputs over and over again? You now know that you can automate that task. Put a bunch of test inputs into files and write a shell script that automatically feeds them into your program. Test automation leads to higher quality programs.

For those reasons, all professional programmers are serious about automating their build and test processes. You have just learned how to use the command shell for basic automation tasks.