Lesson 3.2 Characters
Java Characters and Strings
Numbers are one important kind of value that is manipulated by computers. Most of us, however, use characters much more often than numbers. Characters and groups of characters, [sentences, paragraphs, chapters, books, libraries] form the basis for human language and, arguably, civilization.
The first computer applications were, interestingly enough, all about words, specifically the secret codes used during World War II. The first research into "computing" as we know it was carried out by the mathematician Dr. Alan Turing and his colleagues at Bletchley Park in England, where they built Colossus, a code-breaking computer.
Once the war was over, however, computers moved out of the academic and espionage communities and into the corporate arena where they were seen as glorified accounting machines. It wouldn't be until the 1970s with the Wang word processor that businesses realized that computers could deal with words as well as numbers.
Storing Characters
The problem with storing characters in a computer is that, to put it bluntly, you can't. Computers can only store and process binary numbers.
Converting between real-world integers and binary numbers is fairly natural, and converting between real-world floating-point numbers and binary numbers is not much more difficult. But how do you convert between binary numbers and characters?
The Codes
The answer, of course, is simple. You use an agreed-upon correspondence between a number and a character, like Morse code, for instance. [You do remember "SOS", don't you?]
EBCDIC
Rather than use Morse code, invented for the telegraph, computer makers invented new codes. One of these was EBCDIC, used by IBM on their mainframe computers.
An alternative was ASCII [The American Standard Code for Information Interchange], which has been, more or less, the standard in the PC and Unix worlds.
ASCII
ASCII uses 7 bits to store its characters. In ASCII, for instance, the number 65 stands for the capital A. [Chart: the standard ASCII characters and their numeric representations.]
Because ASCII only uses 7 bits, each character can be stored in a single 8-bit byte, with one bit left over for some additional information. This additional bit has been put to a variety of uses. The early CP/M and PC word processor WordStar used the extra bit to embed "soft" line breaks. With the introduction of the IBM PC, the extra bit was used to provide an additional 128 characters, including some widely used line-drawing characters.
Many, however, found that they didn't need that extra bit after all, and that seven bits were more than enough. Do a search on ASCII art using your favorite search engine, and you'll see what I mean. [The ASCII image that follows is supposed to be a cow.]
( )
~(^^^^)~
) @@ \~_ |\
/ | \ \~ /
( 0 0 ) \ | | Hey
---___/~ \ | | Hiya
/'__/ | ~-_____/ | Doin?
o _ ~----~ ___---~
O // | |
((~\ _| -| Oops! I mean MOOOOOOO
o O //-_ \/ | ~ |
^ \_ / ~ |
| ~ |
| / ~ |
| ( |
\ \ /\
/ -_____-\ \ ~~-*
| / \ \ .==.
/ / / / | |
/~ | //~ | |__|
~~~~ ~~~~
Although ASCII met a real need for a quasi-universal character encoding during the first half-century of the computer age, today ASCII is really showing its age. The computer world is now much larger than it was in the past, especially with the Internet explosion. Not only are computer users using non-English languages, but their languages are often written using non-Roman alphabets: alphabets that won't fit into ASCII's seven or eight bits, no matter how much they are shoe-horned.
Several alternative character encoding schemes have been suggested to replace ASCII, and as a programmer you might have to deal with some of them. By now, however, it looks like the winner in the race to succeed ASCII is the Unicode character set, which is, as you might suspect, used by Java. You can find more about Unicode at http://www.unicode.org.
The char Data Type
Java stores characters using the char data type. Unlike the integer and floating-point types, the char family has only one member. Each char variable [or literal] takes two bytes of storage and is interpreted by using the Unicode character set.
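To make "two bytes, interpreted as Unicode" a little more concrete, here is a minimal sketch. The class name CharSize and its helper methods are made up for this illustration; only Character.SIZE and the casts between char and int come from Java itself.

```java
// Illustrative sketch; the class and method names are not part of the lesson.
class CharSize {
    // Character.SIZE is the number of bits in a char, so divide by 8 for bytes.
    static int bytesPerChar() {
        return Character.SIZE / 8;
    }

    // The Unicode number of a character (identical to ASCII for values 0-127).
    static int codeOf(char c) {
        return (int) c;
    }

    // The character that a given code number stands for.
    static char charOf(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerChar()); // 2
        System.out.println(codeOf('A'));    // 65
        System.out.println(charOf(66));     // B
    }
}
```

The casts do no real work at run time; a char is simply a 16-bit number that Java agrees to display as a character.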
By now, the pattern you use to create a char variable should look familiar; it's the same one you used to create ints and floats. You create char variables like this:
    char middleInitial = <char literal here>;
    char controlA = <char literal>;
    char hTab = <char literal>;
    char backSlash = <char literal>;
    char copyRight = <char literal>;
A Literal D
Let's start with the first and easiest of these examples. How do you write a literal D? [That's my middle initial.] If you try simply inserting the character--which is what you did with literal integers and floating-point numbers--you end up with this:
    char middleInitial = D;
At first blush, this looks OK, but it turns out to have a fatal flaw. How does the compiler know whether you want the character D or the variable named D? Since D is a valid name for a variable, the compiler goes searching for it, comes up empty and issues a stern rebuke.
The solution is easy. To placate javac, we simply have to add some punctuation to our character to let the compiler know that it is, in fact, a character literal. The punctuation you use is the single quote ['] character. With this new-found knowledge, it's obvious that our definition should be written as:
    char middleInitial = 'D';
When you write a character literal, you can only enclose a single character at a time, and make sure you use the single quotes, not the double quotes. These are both illegal:
    char initials = 'SDG';   // Too many characters
    char initials = "SDG";   // String, not char
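For contrast, here is a small sketch of the legal counterparts: one character in single quotes for a char, and double quotes with a String when you need several characters. The class and method names are invented for this example.

```java
// Illustrative sketch; Initials and its methods are hypothetical names.
class Initials {
    static char firstInitial() {
        char first = 'S';          // exactly one character, in single quotes
        return first;
    }

    static String allInitials() {
        String initials = "SDG";   // several characters need a String, in double quotes
        return initials;
    }

    public static void main(String[] args) {
        System.out.println(firstInitial()); // S
        System.out.println(allInitials());  // SDG
    }
}
```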
Other Literals
Writing a literal D proved to be quite simple, but how about the other characters? Let's start with controlA.
Control Characters
The first 128 characters in the Unicode character set are the same as those used by ASCII. In ASCII and in Unicode, the first 32 characters are called control characters. These are non-printing characters that were given the name control characters because they were used to control terminals and printers during the early days of computing.
To produce a control character, you hold down the Ctrl key on your keyboard [much like you would with the Shift key], and press one of the alphabetic characters A-Z. This yields the control characters whose ASCII/Unicode values are 1-26.
As you may have guessed, you can initialize controlA by just using its integer Unicode value like this:
    char controlA = 1;
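If you want to try this out, the following sketch initializes controlA from an integer and checks that Java really does regard it as a control character. The helper controlFor is a hypothetical addition that applies the Ctrl-A through Ctrl-Z rule described above (codes 1 through 26); it is not something the lesson defines.

```java
// Illustrative sketch; ControlChars and controlFor are hypothetical names.
class ControlChars {
    static char controlA() {
        char controlA = 1;   // the constant 1 fits in a char, so no cast is needed
        return controlA;
    }

    // Assumes the letter is A-Z (or a-z): Ctrl-A is 1, Ctrl-B is 2, ... Ctrl-Z is 26.
    static char controlFor(char letter) {
        return (char) (Character.toUpperCase(letter) - 'A' + 1);
    }

    public static void main(String[] args) {
        System.out.println((int) controlA());                     // 1
        System.out.println(Character.isISOControl(controlA()));   // true
        System.out.println((int) controlFor('G'));                // 7
    }
}
```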
"Special" Control Characters
The next character we need to initialize is called hTab. If you look back at the ASCII chart, or at the first page of the Unicode character charts, you'll see that many of the control characters have names, which allude to their functions under traditional computer systems. For instance, the character Unicode 7 [Ctrl-G] is also called the BEL character because it produced audible output.
Using the same scheme, Unicode character 8 was the BS [back-space] character and Unicode character 9 was the horizontal tab character, or HTAB.
Rather than remembering which code belonged to which key, however, programmers came up with a shorthand for these keys called an escape sequence. Here's a short explanation:
What is an Escape Sequence?
An escape sequence is a method of writing "difficult" characters, that is, characters that are invisible or that are normally used to delimit other characters.
The first step in defining an escape sequence is to designate one character as the "escape" character. When that character is encountered--in a character literal or in a group of characters, called a String--the compiler treats it as a flag that means "Look out! The next character is special."
In Java, the escape character is the backslash. Whenever a backslash appears in a character constant, the following character is set aside for special treatment.
Java represents several of the common control characters as escape sequences. These include:
    \n    The newline character [Unicode 10]
    \r    The carriage-return (ENTER) [Unicode 13]
    \t    The horizontal tab character [Unicode 9]
    \f    The form-feed character [Unicode 12]
    \b    The backspace character [Unicode 8]
With this information, it's easy to see how to initialize the hTab variable, using one of the special escape characters:
    char hTab = '\t';
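As a quick check on the escape sequences listed above, this sketch converts each one to its numeric code. The class name EscapeCodes and the method codes are made up for the example.

```java
// Illustrative sketch; EscapeCodes is a hypothetical class name.
class EscapeCodes {
    // Codes for newline, carriage-return, tab, form-feed and backspace, in that order.
    static int[] codes() {
        return new int[] { '\n', '\r', '\t', '\f', '\b' };
    }

    public static void main(String[] args) {
        char hTab = '\t';
        System.out.println((int) hTab);   // 9
        for (int code : codes()) {
            System.out.println(code);     // 10, 13, 9, 12, 8 (one per line)
        }
    }
}
```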
Other Escapes
In addition to using one of the special escape values shown here, you can use an escape sequence to represent any character in the Unicode character set. A Unicode escape sequence starts with a backslash-u, like this:
\u
and is followed by the four-digit, hexadecimal Unicode value, which you can look up in the code charts at the Unicode Consortium. [These are the PDF charts; I usually prefer the GIF charts at http://www.unicode.org/charts/web.html] Once you have the completed value, you put the whole thing in single quotes, just like a regular character literal.
Here are some examples:
    char tradeMark = '\u2122';
    char copyRight = '\u00A9';
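Here is a small sketch that confirms the hexadecimal digits in a Unicode escape are just another way of writing the character's number. The class name UnicodeEscapes is invented; the example prints numbers rather than the symbols themselves, because whether the symbols display correctly depends on your console.

```java
// Illustrative sketch; UnicodeEscapes is a hypothetical class name.
class UnicodeEscapes {
    static char tradeMark() {
        return '\u2122';
    }

    static char copyRight() {
        return '\u00A9';
    }

    public static void main(String[] args) {
        System.out.println((int) copyRight());                  // 169, which is hex A9
        System.out.println(Integer.toHexString(tradeMark()));   // 2122
    }
}
```

One caution: the compiler translates Unicode escapes very early, even inside comments, so a stray backslash-u followed by something other than four hex digits will stop your program from compiling.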
Escaping Punctuation
In addition to the characters you've already seen, sometimes you'll need to write the single quote ['], the backslash [\], or the double-quote ["] as a character constant.
Because the single quote is used to surround characters, you can't use it inside a character literal without taking special precautions. If you attempt to write the literal ''' [three single quotes in a row], the compiler will complain because the second quote acts as a delimiter instead of as the value for your character. You can easily fix this, however, by preceding the second quote with the escape character, like this: '\''
You can do the same thing for the double quote [the delimiter used with Strings] and even the backslash itself, like this: '\"' and '\\'
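Put together, the three escaped punctuation characters might be declared as in the sketch below. The name backSlash comes from the list of variables earlier in the lesson; the class name and the other method names are assumptions made for the example.

```java
// Illustrative sketch; QuoteEscapes, singleQuote and doubleQuote are hypothetical names.
class QuoteEscapes {
    static char singleQuote() {
        return '\'';   // escaped, because ' normally ends the literal
    }

    static char doubleQuote() {
        return '\"';   // the escape is optional in a char literal, but harmless
    }

    static char backSlash() {
        return '\\';   // escaped, because \ normally starts an escape sequence
    }

    public static void main(String[] args) {
        System.out.println((int) singleQuote()); // 39
        System.out.println((int) doubleQuote()); // 34
        System.out.println((int) backSlash());   // 92
    }
}
```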
Meet the Strings
Even Gary Cooper, a man of notoriously few words, was not reduced to speaking in single characters. Most communication requires groups of characters. In human speech we call these phrases and sentences. In Java, we store phrases and sentences in Strings.
What's A String?
In Java, a String is:
an immutable sequence of 0..n characters.
There are two things to notice about this definition that make Java Strings different from characters and different from strings in other programming languages.
- First, Java Strings are immutable or constant. Just as the character literal 'A' cannot suddenly morph into 'B', so a String, once created, can never be changed. It will always be made up of exactly the same sequence of characters.
- Second, Java Strings can hold 0..n characters, while a char variable must always hold exactly one character.
Object or Primitive?
Because Java has a strong syntactical resemblance to the C programming language, programmers often expect Java Strings to act like null-terminated arrays of char, but they don't.
In fact, Java Strings act a little bit like objects and a little bit like primitive types. Strings, for instance, are the only object type that allows you to initialize them using literals, without using the new operator. Furthermore, unlike the other object types, Strings can be manipulated with operators.
Strings are not primitive types, however. Unlike primitive types, String variables hold references to Strings, not the actual character values. That means that two String variables can refer to the same actual String object, for example when both refer to the String "Hello Dolly".
As you can imagine, if one of those String variables were allowed to change the contents of the String "Hello Dolly", it would have a serious impact on the other. That's the reason that Strings are immutable in Java.
Which is not to say that you can't manipulate a String. Unlike the numeric and character primitives, Strings have a rich variety of methods that can be used to control them; they are not restricted to operator manipulation.
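The sketch below tries to show both ideas at once: two variables sharing one String object, and String methods that hand back a new String rather than changing the original. The class name, method names and variable names are invented for illustration; length, charAt and toUpperCase are standard String methods.

```java
// Illustrative sketch; SharedStrings and its variable names are hypothetical.
class SharedStrings {
    // Two variables, one String object: == compares the references.
    static boolean sameObject() {
        String first = "Hello Dolly";
        String second = first;
        return first == second;
    }

    // toUpperCase builds a new String; the one that first refers to is untouched.
    static String originalAfterUpperCase() {
        String first = "Hello Dolly";
        String second = first;
        second = second.toUpperCase();   // second now refers to a different object
        return first;
    }

    public static void main(String[] args) {
        System.out.println(sameObject());                  // true
        System.out.println(originalAfterUpperCase());      // Hello Dolly
        System.out.println("Hello Dolly".length());        // 11
        System.out.println("Hello Dolly".charAt(0));       // H
    }
}
```

Because no method can alter the shared object, handing the same String to two variables is always safe.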
String Literals
To initialize a String variable using a literal, just form your String using regular characters, and enclose the whole thing in double quotes, like this:
    String greeting = "G'day Mate!";
(Note that you don't have to use the escape character with the single quote, when the single quote is used inside a String literal.)
An embedded escape sequence inside a String will be expanded in Java. In a Unicode enabled environment, for instance, you could write:
    String message = "\u00BC is \u00BD of \u00BD";
    // Says "1/4 is 1/2 of 1/2"
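To see that each escape sequence inside a String literal becomes exactly one character, you could try a sketch like this one. The names greeting and message come from the examples above; the class name and the quoted method are assumptions added for illustration.

```java
// Illustrative sketch; StringLiterals and quoted are hypothetical names.
class StringLiterals {
    static String greeting() {
        return "G'day Mate!";                      // single quote needs no escape here
    }

    static String message() {
        return "\u00BC is \u00BD of \u00BD";       // each Unicode escape is one character
    }

    static String quoted() {
        return "She said \"Hi\"";                  // double quotes must be escaped in a String
    }

    public static void main(String[] args) {
        System.out.println(greeting().length());   // 11
        System.out.println(message().length());    // 11
        System.out.println(quoted());              // She said "Hi"
    }
}
```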
The null String
A String variable that points to "nothing" is called the null String. You cannot send any messages to a null String or apply any operations to a null String, except for assignment.
Here are two field definitions that illustrate the null String:
    String s;           // Not initialized; value is null
    String s2 = null;
In the first case, the String s is not assigned a value, so it is given the value null by default. In the second example, the special constant value null is assigned to the String s2.
The Empty String
A null String does not refer to anything. The empty String, in contrast, refers to a String object that contains zero characters. Unlike the null String, the empty String responds to messages and may be used in String expressions, which you'll meet in the Lesson "Expressions."
To create an empty String you use a pair of adjacent double-quotes with no intervening spaces like this:
    ""
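The difference between null and empty is easiest to see by sending each one a message. In this sketch the fields s and s2 follow the definitions shown earlier; the class name, the field empty and the method throwsOnNull are assumptions made for the example.

```java
// Illustrative sketch; NullVersusEmpty, empty and throwsOnNull are hypothetical names.
class NullVersusEmpty {
    static String s;             // field, not initialized: null by default
    static String s2 = null;     // explicitly null
    static String empty = "";    // a real String object holding zero characters

    // Sending a message to a null String fails with a NullPointerException.
    static boolean throwsOnNull() {
        try {
            s.length();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(s == null);        // true
        System.out.println(empty.length());   // 0
        System.out.println(throwsOnNull());   // true
    }
}
```

Note that the default value of null applies to fields; a local variable declared inside a method must be assigned before the compiler will let you use it.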
Something to Talk About
Here are some questions and exercises you can try to test your understanding of the char and String types:
- What will appear on the screen if the following character variable is displayed?
- What about this one? [This is very tricky. You'll have to read your book carefully to understand it.]
- Write a declaration for a String which contains each word in your name, separated by the tab character.
Please continue to the next section of this lesson.