CS46B Lab 3

Copyright © Cay S. Horstmann 2010-2015 Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.

Modified by:

Instructions

Working in Pairs

Objectives

Learning Outcomes

A. Command Line Arguments

  1. Get the lab2 from your buddy who worked on it last week.

    Remember our goal -- we don't want to click through a file chooser when we are testing. Fortunately, the program will switch to a console UI if we specify the name of the file from the command line:

    java -classpath /path/to/classes AddressBookDemo deptdir.txt

    When main starts, args[0] is the first argument after the program name, i.e. deptdir.txt.

    If you've ever wondered what the String[] args is good for in public static void main(String[] args), now you know. It is an array containing the command line arguments.

    How does the AddressBookDemo program switch between GUI and console mode? Look in AddressBookDemo.java.

  2. Now run
    java -classpath /path/to/classes AddressBookDemo

    Again, replace the /path/to with the actual path.

    What happens? Why? (Click Cancel to stop the program)

  3. Ok, we need to supply a file name if we want to specify the file on the command line.

    java -classpath /path/to/classes AddressBookDemo deptdir.txt

    What happens?

    For now, type option 5.
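
The role of args described in step 1 can be sketched with a tiny program. This is only an illustration: the class name ArgsDemo and the method chooseMode are hypothetical and are not taken from AddressBookDemo.java, which may well make its decision differently.

```java
// Hypothetical sketch; this is not the actual AddressBookDemo code.
class ArgsDemo
{
   // Pick a mode based on whether a command line argument was supplied.
   static String chooseMode(String[] args)
   {
      if (args.length == 0)
      {
         return "gui"; // no file name on the command line
      }
      return "console: " + args[0]; // args[0] is the first argument after the class name
   }

   public static void main(String[] args)
   {
      System.out.println(chooseMode(args));
   }
}
```

Running java ArgsDemo deptdir.txt prints console: deptdir.txt, and running java ArgsDemo with no argument prints gui.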

B. Input Redirection

Run the program from the console again (with two keystrokes ☺).

Select option 2, then enter the name Horstmann and the key Phone. Then select option 5 to quit:

1: Add/Change Entry
2: Look Up Entry
3: Remove Entry
4: Save Directory
5: Exit
Enter command: 2
Enter name: Horstmann
Enter key: Phone
Value: (408) 924-5085
1: Add/Change Entry
2: Look Up Entry
3: Remove Entry
4: Save Directory
5: Exit
Enter command: 5

Now we don't even want to type the inputs! Using Emacs, make a file input.txt (Click File → Visit New File) containing the four lines

2
Horstmann
Phone
5

Put in the four lines, save the file in your home directory, and quit the text editor. Now run

java -classpath /path/to/classes AddressBookDemo deptdir.txt < input.txt

As always, remember to hit ↑ and just add the < input.txt to the end of the command line.

What happens? Can the program find the phone number?

The < symbol means “read keystrokes from a file”, or, more formally “redirect System.in to a file”.
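
A program that reads System.in cannot tell whether the characters come from the keyboard or from a file. The sketch below illustrates this under stated assumptions: RedirectDemo and readLines are hypothetical names, and an in-memory stream stands in for the shell's redirection so that the example runs without any files.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch; not part of the AddressBookDemo program.
class RedirectDemo
{
   // Read every line from the stream, the way a console program reads its input.
   static List<String> readLines(InputStream in)
   {
      List<String> lines = new ArrayList<>();
      Scanner scanner = new Scanner(in, "UTF-8");
      while (scanner.hasNextLine())
      {
         lines.add(scanner.nextLine());
      }
      return lines;
   }

   public static void main(String[] args)
   {
      // Stand-in for "< input.txt": the same four lines, supplied from memory.
      String fake = "2\nHorstmann\nPhone\n5\n";
      System.setIn(new ByteArrayInputStream(fake.getBytes(StandardCharsets.UTF_8)));
      System.out.println(readLines(System.in)); // prints [2, Horstmann, Phone, 5]
   }
}
```

The reading code never changes. Only the source behind System.in does, which is what the shell arranges when you add < input.txt.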

Now edit input.txt to add a lookup for Diaz.

  1. What is the content of input.txt now?

  2. What happens when you run the program with this input.txt?

C. Output Redirection

Now we are ready to do some serious test automation. We want to test that adding an entry works correctly. Here is the plan:

  1. What input.txt file tests this scenario?

  2. Now let's capture the output. Run this command.

    java -classpath /path/to/classes AddressBookDemo deptdir.txt < input.txt > output.txt

    Remember to hit ↑ and just add the > output.txt to the end of the command line...

    What are the contents of output.txt? How do you know?

  3. Another way of checking contents of a file is the cat command. Type

    cat output.txt

    What happens?

  4. Here is a good reason for saving the output. It often happens that you make a change to a program, and you want to run a test case again to check that it still works. Save the output file

    cp -v output.txt expected.txt

    What are the contents of expected.txt? How did you check it?

  5. Run the program again and capture its output in output.txt. Then compare the two:

    diff output.txt expected.txt

    The diff command compares two files and prints their differences. If there aren't any, it prints nothing.

    Run the AddressBookDemo program, and the diff command as described. What happens? Why?

  6. Change input.txt by adding a 4 before the last line, i.e.

    ...
    2
    Diaz
    Phone
    4
    5

    With the new input.txt, run the AddressBookDemo program and the diff command as described. Then do it again. What happens? Why?

    We'll do more test automation when we implement removal in the next lab.

  7. One last question:

    For each of these commands, give a one-sentence description of what it does: ls, pwd, cp, cat, diff.
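
The idea behind comparing output.txt with expected.txt can also be sketched in Java. This is a much simplified stand-in for diff, not a reimplementation: it reports only the first differing line, and the names CompareDemo and firstDifference are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch of a regression check; far simpler than the real diff command.
class CompareDemo
{
   // Return the 1-based number of the first line that differs, or 0 if there is none.
   static int firstDifference(List<String> actual, List<String> expected)
   {
      int common = Math.min(actual.size(), expected.size());
      for (int i = 0; i < common; i++)
      {
         if (!actual.get(i).equals(expected.get(i)))
         {
            return i + 1;
         }
      }
      // If all shared lines match, one file may still be longer than the other.
      return actual.size() == expected.size() ? 0 : common + 1;
   }

   public static void main(String[] args) throws IOException
   {
      if (args.length != 2)
      {
         System.out.println("Usage: java CompareDemo output.txt expected.txt");
         return;
      }
      List<String> actual = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
      List<String> expected = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
      int line = firstDifference(actual, expected);
      System.out.println(line == 0 ? "no differences" : "first difference at line " + line);
   }
}
```

Unlike diff, which prints nothing when the files match, this sketch prints a short message in either case.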

D. Tabs

  1. Look at this file (ExhibitA.txt) in Emacs. What do you notice about the way that the code lines up?
  2. That looks terrible, right? The culprit is the tab character. Many programming editors insert tab characters to line up code. For example,
    if (x > y)
    {
       y = x;
    }

    is actually

    if (x > y)
    {
    \ty = x;
    }

    where \t denotes a tab.

    When the file is displayed, the tab is shown as some number of spaces. How many spaces? That's the problem—nobody agrees. Eclipse thinks it should be 4. Windows Notepad thinks it should be 8.

    Here is how you can see the tabs. Load the file into Emacs and type

    Alt-X hexl-mode Enter

    That is, first type Alt-X. [You might need to hit the esc (escape) key, and then x, for Alt-X. You can hit-and-release the esc key, and then hit lower-case x.]

    The text will appear in the status line at the bottom of the window. Once you have gotten to the status line, type the 9-character string hexl-mode there, and then hit Enter.

    Never seen one of these? Congratulations, you've reached level 3.

    You are seeing the hexadecimal encoding of each byte in the file. To the right, you see the characters, and if you look carefully as you move the arrow keys, you can see how they correspond.

    For example, move the cursor on the lowercase b in public on the right hand side.  In the hex display, you will see 62. That's the code for b.

    What is the code for lowercase c? How do you know?

  3. Hexadecimal is like decimal, but for people with 16 fingers. It has extra digits A B C D E F with decimal values 10 11 12 13 14 15. And 62 isn't 6 x 10 + 2 but 6 x 16 + 2 or 98 in decimal. It's used for showing byte values because it's more compact than decimal. The range 0 - 255 turns into the range 00 - FF in hex: FF is 15 x 16 + 15 = 255. You don't have to worry about the details. What matters is that the hex dump shows truthfully what is in the file, not what the editor wants you to see.

    Look for spaces (with code 20) and tabs (with code 09). What is the first line of code in which you see each?

  4. You don't want tabs. In Eclipse, here is how you turn them off. Select Project -> Properties. On the left panel, choose Java Code Style -> Formatter

    In the right panel on the left side, check Enable project specific settings, then click the New... button to create a new profile. Name it SJSU.

    Choose the Indentation tab and set Tab policy to Spaces only.

    How do you turn off tabs in Emacs? Hint: Look into your ~/.emacs file.

    Don't use tabs. There is no advantage and only pain. I am not the only one who thinks this. Some people say that you should only use tabs and never spaces. In theory, that would work—it's the mixture of tabs and spaces that causes the problem. But how confident are you that you and your collaborators won't ever mix them? BTW, I didn't say “Don't use the Tab key”. The Tab key is fine. Just tell your editor to insert spaces when you press it.

    If you click the Braces tab in the Formatter, you can tell Eclipse to align the braces vertically. That is the style you see in the textbook. I suggest setting all except the last two to Next line.
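
The hexadecimal arithmetic from step 3 and the tab code from the example in step 2 can be checked with a few lines of Java. This is a minimal sketch: HexDemo and toHex are hypothetical names, and the sketch assumes plain ASCII text, where each character code equals the byte stored in the file.

```java
// Hypothetical sketch; assumes ASCII text so that character codes equal byte values.
class HexDemo
{
   // Format each character code as two hex digits, similar to one row of a hex dump.
   static String toHex(String text)
   {
      StringBuilder result = new StringBuilder();
      for (char c : text.toCharArray())
      {
         if (result.length() > 0)
         {
            result.append(' ');
         }
         result.append(String.format("%02x", (int) c));
      }
      return result.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(Integer.parseInt("62", 16)); // 6 x 16 + 2, prints 98
      System.out.println(Integer.parseInt("FF", 16)); // 15 x 16 + 15, prints 255
      System.out.println(toHex("\ty = x;")); // prints 09 79 20 3d 20 78 3b
   }
}
```

The leading 09 in the last line of output is the tab, and each 20 is a space.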

E. Line Endings

  1. Now look at this file (ExhibitB.txt) in Notepad in Windows. If you don't have Windows, peek at the laptop of someone who does. This file shows a different problem: line endings. In most operating systems, the end of a line is denoted by a single character, the newline with code 0A. In Windows, however, two characters are expected: a "carriage return" with code 0D and then a newline with code 0A.

    What's a carriage return?

    Remember these? Maybe not.

    In the olden days when dinosaurs roamed the earth, you had to move the "carriage" back to the left of the paper, and then advance the paper one line. Or, you could not advance the paper and print over the same line multiple times, for example to strike out characters.

    Just in case you ever need to run Windows on a typewriter, every line must end with 0D 0A. It is totally useless and a major pain, but it is the Windows way.

    Look at ExhibitB.txt in hexl-mode. What do you see at the end of each line?

  2. Now you know why the lines didn't move back to the left in Notepad. You'd think that Microsoft could figure out how to fix this, but apparently not.

    And they are not the only culprit. Download this file (ExhibitC.txt) and remember where you put it. Now look at it in hexl-mode. This file was created in Notepad. What do you see at the end of each line?

  3. Now try running this at the command line:
    sh /path/to/ExhibitC.txt

    What happens?

  4. That's not what should be happening. The commands in the file should be executed. The shell gets confused by those extra 0D that it doesn't expect. You'd think that it could figure out how to ignore them, but it doesn't.

    To fix this, run

    dos2unix /path/to/ExhibitC.txt

    DOS is the precursor to Windows.

    Your system might have you install dos2unix before you can run it. If so, on the command line in the virtual machine, enter:

    sudo apt-get install dos2unix

    Now look at the file in hexl-mode again. (Close the old one and reload it.)

    What happens?

  5. Now run
    sh /path/to/ExhibitC.txt

    What happens? Why does it work now?

    Always use Unix-style line endings. The Emacs configuration that I gave you takes care of that. And don't use Notepad.

  6. How can you fix ExhibitB.txt so that Notepad won't choke?
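
The essence of the conversion in step 4 is to replace each 0D 0A pair with a single 0A. Here is a minimal sketch of that idea in Java. It is not how dos2unix is implemented, the names LineEndingDemo, countCarriageReturns and toUnix are hypothetical, and the sample text is made up rather than taken from ExhibitC.txt.

```java
// Hypothetical sketch of line ending conversion; not the dos2unix implementation.
class LineEndingDemo
{
   // Count the carriage returns (code 0D) in the text.
   static int countCarriageReturns(String text)
   {
      int count = 0;
      for (char c : text.toCharArray())
      {
         if (c == '\r')
         {
            count++;
         }
      }
      return count;
   }

   // Replace each Windows line ending (0D 0A) with a Unix line ending (0A).
   static String toUnix(String text)
   {
      return text.replace("\r\n", "\n");
   }

   public static void main(String[] args)
   {
      String windows = "line one\r\nline two\r\n"; // made-up sample text
      System.out.println(countCarriageReturns(windows)); // prints 2
      System.out.println(countCarriageReturns(toUnix(windows))); // prints 0
   }
}
```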

F. Character Encodings

  1. One byte can encode 256 different values. There are tens of thousands of different characters in the different alphabets used on our fair planet, so one needs to use more than one byte to encode all of them. Unfortunately, there are different encoding schemes. Generally, the most useful one is the so-called UTF-8 encoding. Here is a file (ExhibitD.txt) that encodes San José in UTF-8.

    What is the UTF-8 encoding for é? (Hint: it is 2 bytes)

  2. Here is a file (ExhibitE.txt) that encodes San José in ISO 8859-1, another popular encoding that can only represent 256 characters (the 128 ASCII characters and a selection of accented characters that are useful for Western European languages).

    What is the ISO 8859-1 encoding for é? (Just one byte this time)

  3. Now type at the command line:
    cat ExhibitD.txt
    cat ExhibitE.txt

    Which one looks correct? (This depends entirely on how your system is configured.)

  4. Why can't your system pick the correct encoding for each file?

    Always use UTF-8 for your files unless you have an ironclad reason not to do so. (“I didn't know” isn't such a reason.) The Emacs configuration that I gave you makes UTF-8 the default. When you read a file in a Java program, always open the scanner with UTF-8: new Scanner(file, "UTF-8"). Otherwise, your program will use the character encoding of the grader's operating system, and you don't know what that is.
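
The difference between the two encodings shows up in the number of bytes they produce. The following sketch counts them and also shows what goes wrong when bytes are decoded with the wrong encoding. EncodingDemo and its methods are hypothetical names. The é is written as the escape \u00e9 so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch comparing two encodings of the same string.
class EncodingDemo
{
   // Number of bytes needed to store the string in UTF-8.
   static int utf8Length(String s)
   {
      return s.getBytes(StandardCharsets.UTF_8).length;
   }

   // Number of bytes needed to store the string in ISO 8859-1.
   static int latin1Length(String s)
   {
      return s.getBytes(StandardCharsets.ISO_8859_1).length;
   }

   // Encode as UTF-8, then decode as if the bytes were ISO 8859-1.
   static String misread(String s)
   {
      return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
   }

   public static void main(String[] args)
   {
      String city = "San Jos\u00e9";
      System.out.println(utf8Length(city)); // prints 9
      System.out.println(latin1Length(city)); // prints 8
      System.out.println(misread(city).equals(city)); // prints false
   }
}
```

The misread string has nine characters instead of eight, because each of the two UTF-8 bytes for é is taken to be a separate character.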

G. The Byte Order Mark (Optional)

  1. Having sung the praises of UTF-8, I must admit that it is not without pitfalls either. Open up the following file: test1.out. What is the first line?
  2. Now compile this program
    import java.util.*;
    
    public class Test1
    {
       public static void main(String[] args)
       {
          Scanner in = new Scanner(System.in);
          String line = in.nextLine();
          if (line.startsWith(args[0]))
             System.out.println("match");
          else
             System.out.println("no match");
       }
    }

    What do you expect to happen when you run

    java Test1 Bahrain < test1.out
  3. What actually happens? (Get the program to run first—when you compile and run it in the same directory as the one containing test1.out, the program will run and print something.)
  4. Open up the test1.out file in Emacs with hexl-mode. What are the first five bytes?
  5. What letters are denoted by the fourth and fifth byte?
  6. The first three bytes are the “byte order mark” U+FEFF in UTF-8. This requires some explanation. There are 16-bit encodings of Unicode, where each character is encoded as a sequence of one or more 16-bit quantities (values between 0 and 65535). Each of them is in turn represented by two 8-bit bytes.
    xxxx xxxx | xxxx xxxx
    <-byte1->   <-byte0->

    There are two possible ways of saving these two bytes in a file: byte1 first and then byte0, or byte0 first and then byte1.

    Which one do you think is more reasonable?

  7. You are right, of course, but unfortunately, both ways occur in practice. To distinguish the two, the following clever scheme is used by 16-bit encodings in Unicode. Make the file start with FEFF, the byte order mark, which is required to be ignored. The flipped value FFFE is not a valid Unicode character.

    Why does that help with reading a file with a 16-bit encoding?

  8. The first three bytes that you have seen are the UTF-8 encoding of the byte order mark. Why is a byte order mark not actually needed in a UTF-8 file?
  9. Nevertheless, there it is. Microsoft likes to put it into UTF-8 files as an indicator that they are, well, UTF-8. It's perfectly legal, and it's not a bad idea. Explain how this might help distinguish a UTF-8 file from a file encoded in, say, ISO 8859-1 or UTF-16.
  10. Why does the Java program fail? Modify Test1.java so that the last line reads:
    System.out.printf("no match: %x\n", Integer.valueOf(line.charAt(0)));

    Run

    java Test1 Bahrain < test1.out

    again. What happens?

  11. The Unicode standard requires that a program ignore the byte order mark at the beginning of a file. Java fails to ignore it. Check out this and this bug report. Vote for getting them fixed! What simple fix should Oracle make to the Scanner class?
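
The two byte orders from steps 6 and 7 can be seen directly in Java, which offers both 16-bit variants as standard charsets. The sketch below encodes the byte order mark U+FEFF in each of them. ByteOrderDemo and hex are hypothetical names chosen for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch showing the two possible byte orders of a 16-bit value.
class ByteOrderDemo
{
   // Format the bytes as uppercase hex digits separated by spaces.
   static String hex(byte[] bytes)
   {
      StringBuilder result = new StringBuilder();
      for (byte b : bytes)
      {
         if (result.length() > 0)
         {
            result.append(' ');
         }
         result.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
      }
      return result.toString();
   }

   public static void main(String[] args)
   {
      String mark = "\uFEFF";
      System.out.println(hex(mark.getBytes(StandardCharsets.UTF_16BE))); // prints FE FF
      System.out.println(hex(mark.getBytes(StandardCharsets.UTF_16LE))); // prints FF FE
   }
}
```

A reader that sees FF FE at the start of a file can conclude that the bytes of every 16-bit quantity are stored in the flipped order.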