
J D C   T E C H   T I P S
TIPS, TECHNIQUES, AND SAMPLE CODE

WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips,
April 23, 2002. This issue covers:
* Pattern Matching
* Creating a HelpSet with JavaHelp(tm) software
These tips were developed using Java 2 SDK, Standard Edition,
v 1.4.
This issue of the JDC Tech Tips is written by John Zukowski,
president of JZ Ventures, Inc. (http://www.jzventures.com).
You can view this issue of the Tech Tips on the Web at
http://java.sun.com/jdc/JDCTechTips/2002/tt0423.html
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
PATTERN MATCHING
The java.util.regex package is a new package in Java 2 Platform,
Standard Edition version 1.4. The package provides a regular
expression library. A regular expression is a pattern of
characters that describes a set of strings, and is often used in
pattern matching. The classes in the java.util.regex package
let you match sequences of characters against a regular
expression. These classes, which make up the regular expression
library, use the Perl 5 regular expression pattern syntax, and
provide a much more powerful way of parsing text than was
previously available with the java.io.StreamTokenizer and the
java.util.StringTokenizer classes.
The regular expression library has three classes: Pattern,
Matcher, and PatternSyntaxException. Ignoring the exception class,
what you really have is one class to define the regular
expression you want to match (the Pattern), and another class
(the Matcher) for searching for that pattern in a given string.
Most of the work of using the regular expression library is
understanding its pattern syntax. The actual parsing is the easy
part. So let's look at what makes up a regular expression.
The simplest kind of regular expression is a literal. A literal
is a character that stands for itself, that is, a character that
is not part of some special grouping or expression within the
regular expression.
For instance, the literal "x" is a regular expression. Using the
literal, a matcher, and a string, you can ask "Does the regular
expression 'x' match the entire string?" Here's an expression that
asks the question:
boolean b = Pattern.matches("x", someString);
If the string referenced by someString is exactly "x", then b is
true. Otherwise, b is false. By themselves, literals are not that
complicated to understand. Notice that the matches method used
here is defined by the Pattern class, not the Matcher class. The
Pattern class provides it as a convenience for when a regular
expression is used just once. Normally, you would create a
Pattern object, get a Matcher object for the Pattern, and then
use the matches method defined by the Matcher class:
Pattern p = Pattern.compile("x");
Matcher m = p.matcher("sometext");
boolean b = m.matches();
The tip will cover those steps later.
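Here's a minimal sketch of the two approaches just described. The
class name LiteralMatch and its helper method names are made up
for illustration. Only Pattern.matches, Pattern.compile,
Pattern.matcher, and Matcher.matches come from the library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralMatch {
    // One-off check: the pattern is compiled on every call.
    static boolean matchesOnce(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Reusable form: the caller compiles once and passes the Pattern in.
    static boolean matchesCompiled(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("x");
        System.out.println(matchesOnce("x", "x"));        // true
        System.out.println(matchesOnce("x", "sometext")); // false: the whole string must match
        System.out.println(matchesCompiled(p, "x"));      // true
    }
}
```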
Of course, regular expressions can be more complex than literals.
Adding to the complexity are wildcards and quantifiers. There is
only one wildcard used in regular expressions. It is the period
(.) character. A wildcard matches any single character. By
default it does not match a line terminator. It does so only if
the pattern is compiled with the Pattern.DOTALL flag. The
quantifier characters are the + and *. (Technically, the question
mark is also a quantifier character.) The + character placed
after a regular expression allows that expression to be matched
one or more times. The * is like the + character, but allows zero
or more times. For instance, if you want to find a string with a
j at the beginning, a z at the end, and at least one character
between the two, you use the expression "j.+z". If there don't
have to be any characters between the j and the z, you use
"j.*z" instead.
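The difference between the two quantifiers can be sketched as
follows. The class and method names are hypothetical. The two
patterns are the ones from the paragraph above.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // "j.+z": a j, at least one character of any kind, then a z.
    static boolean atLeastOneBetween(String s) {
        return Pattern.matches("j.+z", s);
    }

    // "j.*z": a j, zero or more characters of any kind, then a z.
    static boolean anyNumberBetween(String s) {
        return Pattern.matches("j.*z", s);
    }

    public static void main(String[] args) {
        System.out.println(atLeastOneBetween("jaz")); // true
        System.out.println(atLeastOneBetween("jz"));  // false: nothing between j and z
        System.out.println(anyNumberBetween("jz"));   // true
        System.out.println(anyNumberBetween("jazzy")); // false: string does not end in z
    }
}
```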
Note that pattern matching tries to find the largest possible
"hit" within a string. So if you request a match against the
pattern "j.*z", using the string "jazjazjazjaz", it returns the
entire string, not just a single "jaz". This is called "greedy
behavior." It is the default in a regular expression unless you
specify otherwise.
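To see greedy behavior in action, here is a small sketch. The
class name GreedyDemo and the helper firstMatch are made up for
this example. The second pattern uses the reluctant quantifier
*?, which this tip does not otherwise cover. It is shown only as
one way to "specify otherwise."

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
    // Returns the first substring of input that matches regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "jazjazjazjaz";
        // Greedy: .* grabs as much as it can before the final z.
        System.out.println(firstMatch("j.*z", text));  // jazjazjazjaz
        // Reluctant: .*? stops at the first z that completes a match.
        System.out.println(firstMatch("j.*?z", text)); // jaz
    }
}
```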
Now let's get a little more complex. By placing multiple
expressions in parentheses, you can request a match against
multi-character patterns. For instance, to match a j followed by
a z, you can use the "(jz)" pattern. By itself, that doesn't buy
you much. It is the same as "jz". But by using parentheses, you
can apply a quantifier to the whole group and match one or more
consecutive "jz" patterns: "(jz)+".
Another way of working with patterns is through character
classes. With character classes, you specify a range of possible
characters instead of specifying individual characters. For
instance, if you want to match against any letter from j to z,
you specify the range j-z in square brackets: "[j-z]". You could
also attach a quantifier to the expression, for example,
"[j-z]+", to get an expression matching one or more characters
between j and z, inclusive.
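The last two ideas, grouping with parentheses and character
classes with ranges, can be sketched together. The class and
method names here are hypothetical. The patterns "(jz)+" and
"[j-z]+" are the ones described above.

```java
import java.util.regex.Pattern;

public class GroupAndClassDemo {
    // "(jz)+": the two-character sequence jz, repeated one or more times.
    static boolean repeatedJz(String s) {
        return Pattern.matches("(jz)+", s);
    }

    // "[j-z]+": one or more characters, each in the range j through z.
    static boolean onlyJToZ(String s) {
        return Pattern.matches("[j-z]+", s);
    }

    public static void main(String[] args) {
        System.out.println(repeatedJz("jzjzjz"));             // true
        // Without the parentheses, + applies to the z alone.
        System.out.println(Pattern.matches("jz+", "jzjzjz")); // false
        System.out.println(onlyJToZ("jolt"));                 // true
        System.out.println(onlyJToZ("jazz"));                 // false: a is outside j-z
    }
}
```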
Certain character classes are predefined. These represent classes
that are common, and so they have a common shorthand. Some of the
predefined character classes are:
  \d    A digit: [0-9]
  \D    A non-digit: [^0-9]
  \s    A whitespace character: [ \t\n\x0B\f\r]
  \S    A non-whitespace character: [^\s]
  \w    A word character: [a-zA-Z_0-9]
  \W    A non-word character: [^\w]

Notice that for character classes, ^ is used for negation of an
expression.
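Here is a brief sketch of the predefined classes and of negation.
The class and method names are made up for illustration. Note
that each \ is doubled because the patterns are written as Java
string literals.

```java
import java.util.regex.Pattern;

public class PredefinedClassDemo {
    // \d+ : one or more digits.
    static boolean isAllDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    // \w+\s\d+ : a word, one whitespace character, then a number.
    static boolean isWordThenNumber(String s) {
        return Pattern.matches("\\w+\\s\\d+", s);
    }

    // [^0-9]+ : one or more characters that are not digits (same as \D+).
    static boolean hasNoDigits(String s) {
        return Pattern.matches("[^0-9]+", s);
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("2002"));          // true
        System.out.println(isAllDigits("v1.4"));          // false
        System.out.println(isWordThenNumber("April 23")); // true
        System.out.println(hasNoDigits("April"));         // true
        System.out.println(hasNoDigits("April 23"));      // false
    }
}
```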
There is a second set of predefined character classes, called
POSIX character classes. These are taken from the POSIX
specification, and work with US-ASCII characters only:
  \p{Lower}   A lower-case alphabetic character: [a-z]
  \p{Upper}   An upper-case alphabetic character: [A-Z]
  \p{ASCII}   All ASCII: [\x00-\x7F]
  \p{Alpha}   An alphabetic character: [\p{Lower}\p{Upper}]
  \p{Digit}   A decimal digit: [0-9]
  \p{Alnum}   An alphanumeric character: [\p{Alpha}\p{Digit}]
  \p{Punct}   Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  \p{Graph}   A visible character: [\p{Alnum}\p{Punct}]
  \p{Print}   A printable character: [\p{Graph}]
  \p{Blank}   A space or a tab: [ \t]
  \p{Cntrl}   A control character: [\x00-\x1F\x7F]
  \p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
  \p{Space}   A whitespace character: [ \t\n\x0B\f\r]
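A few of the POSIX classes in use, as a hedged sketch. The class
and method names are hypothetical. The last call in main
illustrates the US-ASCII restriction, assuming the pattern is
compiled with no special flags.

```java
import java.util.regex.Pattern;

public class PosixClassDemo {
    // An upper-case letter followed by one or more lower-case letters.
    static boolean isCapitalizedWord(String s) {
        return Pattern.matches("\\p{Upper}\\p{Lower}+", s);
    }

    // One or more hexadecimal digits.
    static boolean isHex(String s) {
        return Pattern.matches("\\p{XDigit}+", s);
    }

    // One or more US-ASCII letters.
    static boolean isAsciiLetters(String s) {
        return Pattern.matches("\\p{Alpha}+", s);
    }

    public static void main(String[] args) {
        System.out.println(isCapitalizedWord("Java"));  // true
        System.out.println(isCapitalizedWord("java"));  // false
        System.out.println(isHex("CAFEbabe"));          // true
        System.out.println(isAsciiLetters("cafe"));     // true
        // \u00e9 (e with an acute accent) is a letter, but not a US-ASCII one.
        System.out.println(isAsciiLetters("caf\u00e9")); // false
    }
}
```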

The final set of constructs listed here is the boundary matchers.
Rather than matching a character, these match a position: the
beginning or end of a line, a word, the input, or the previous
match.
  ^     The beginning of a line
  $     The end of a line
  \b    A word boundary
  \B    A non-word boundary
  \A    The beginning of the input
  \G    The end of the previous match
  \Z    The end of the input but for the final terminator, if any
  \z    The end of the input
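Boundary matchers are easiest to understand by example. In this
sketch the class and method names are made up, and the word
"java" is just a sample search term. The find method used here is
described later in this tip. It looks for a match anywhere in the
string.

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // \b on both sides: "java" must appear as a whole word.
    static boolean containsWord(String text) {
        return Pattern.compile("\\bjava\\b").matcher(text).find();
    }

    // \A anchors the match to the beginning of the input.
    static boolean startsWithJava(String text) {
        return Pattern.compile("\\Ajava").matcher(text).find();
    }

    // \z anchors the match to the end of the input.
    static boolean endsWithJava(String text) {
        return Pattern.compile("java\\z").matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("I like java"));   // true
        System.out.println(containsWord("javascript"));    // false: no boundary after "java"
        System.out.println(startsWithJava("java rocks"));  // true
        System.out.println(startsWithJava("I like java")); // false
        System.out.println(endsWithJava("I like java"));   // true
    }
}
```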

The key thing to understand about all the character class
expressions is the use of the \. When you compose a regular
expression as a Java string, you must escape the \ character.
Otherwise, the character following the \ will be treated as
special by the javac compiler. To escape the \ character, specify
a double \\. By placing a double \\ in the string, you are saying
you want the actual \ character there. For instance, if you want
to use a pattern for any string of alphanumeric characters,
simply having a string containing \p{Alnum}* is not sufficient.
You must escape the \ as follows:
boolean b = Pattern.matches("\\p{Alnum}*", someString);
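The following sketch makes the escaping rule concrete. The class
name EscapeDemo, the constant ALNUM, and the method names are
hypothetical. It shows that the doubled \\ in the source becomes
a single \ in the actual string, and that matching a literal \
therefore takes four of them in source code.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Written with \\ in source; the string itself holds \p{Alnum}*
    static final String ALNUM = "\\p{Alnum}*";

    static boolean isAlnum(String s) {
        return Pattern.matches(ALNUM, s);
    }

    // The regular expression \\ (two characters) matches one literal \.
    // As a Java string literal, that regular expression is "\\\\".
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(ALNUM);                   // \p{Alnum}*
        System.out.println(ALNUM.length());          // 10
        System.out.println(isAlnum("abc123"));       // true
        System.out.println(isAlnum("abc 123"));      // false: the space is not alphanumeric
        System.out.println(isSingleBackslash("\\")); // true
    }
}
```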
As the name implies, the Pattern class is for defining patterns,
that is, it defines the regular expression you want to match.
Instead of using matches to see if a pattern matches the whole
string, what normally happens is you check to see if a pattern
matches the next part of the string.
To use a pattern you must compile it. You do this with the
compile method.
Pattern pattern = Pattern.compile(somePattern);

Pattern compilation can take some time, so doing it once is
wise. The matches method of the Pattern class compiles the
pattern with each call. If you want to use a pattern many
times, you can avoid repeated compilation by compiling the
Pattern once and then getting a Matcher from it as needed.
After you compile the pattern, you can request to get a Matcher
for a specific string.
Matcher matcher = pattern.matcher(someString);
The Matcher provides a matches method that checks against the
entire string. The class also provides a find() method that tries
to find the next sequence, possibly not at the beginning of the
string, that matches the pattern.
After you know you have a match, you can get the match with the
group method:
if (matcher.find()) {
    System.out.println(matcher.group());
}
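Because find picks up where the previous match ended, you can
call it in a loop to collect every match. Here's a minimal
sketch. The class name FindAllDemo and the helper findAll are
made up for this example. The Matcher methods find, group, start,
and end are part of the library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAllDemo {
    // Returns every non-overlapping match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<String>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\d+", "April 23, 2002")); // [23, 2002]

        // start() and end() report where each match was found.
        Matcher m = Pattern.compile("\\d+").matcher("April 23, 2002");
        while (m.find()) {
            System.out.println(m.group() + " at " + m.start() + "-" + m.end());
        }
    }
}
```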
You can also use the matcher as a search and replace mechanism.
For instance, to replace all occurrences of a pattern within
a string, you use the following expression:
String newString = matcher.replaceAll("replacement words");
Here, all occurrences of the pattern in question would be
replaced by the replacement words.
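As a quick sketch of search and replace, the following class
collapses runs of whitespace into a single space. The class and
method names are hypothetical. The example also uses
replaceFirst, a companion method on Matcher that replaces only
the first occurrence.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    // Compiled once and reused for every call.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replaces every run of whitespace with one space.
    static String collapseAll(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    // Replaces only the first run of whitespace with one space.
    static String collapseFirst(String s) {
        return WHITESPACE.matcher(s).replaceFirst(" ");
    }

    public static void main(String[] args) {
        String text = "JDC   Tech \t Tips";
        System.out.println(collapseAll(text));   // JDC Tech Tips
        System.out.println(collapseFirst(text)); // only the first gap is collapsed
    }
}
```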
Here's a demonstration of pattern matching. The following program
takes three command line arguments. The first argument is a
string to search. The second is a pattern for the search. The
third is the replacement string. The replacement string replaces
each occurrence of the pattern found in the search string.
import java.util.regex.*;

public class MyMatch {
    public static void main(String[] args) {
        if (args.length != 3) {
            System.out.println(
                "Pass in source string, pattern, " +
                "and replacement string");
            System.exit(-1);
        }
        String sourceString = args[0];
        String thePattern = args[1];
        String replacementString = args[2];

        // Compile the pattern once, then bind it to the source string.
        Pattern pattern = Pattern.compile(thePattern);
        Matcher match = pattern.matcher(sourceString);

        // Print the result only if the pattern occurs at least once.
        if (match.find()) {
            System.out.println(
                match.replaceAll(replacementString));
        }
    }
}
For example, if you compile the program, and then run it like
this:
java MyMatch "I want to be in lectures" "lect" "pict"
It returns:
I want to be in pictures
Notice that when you run the program, it is unnecessary to
escape the \ character on the command line. That's because
command-line arguments are not Java source code, so the javac
compiler never processes them. For example, if the search string
is:
"I want to be in lectures\I want to be a star"
and you run the program with the same pattern ("lect") and
replacement string ("pict"), it returns:
I want to be in pictures\I want to be a star
For more information about pattern matching and regular
expressions, see the technical article Regular Expressions and
the Java Programming Language
(http://java.sun.com/jdc/technicalArticles/releases/1.4regex/).
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CREATING A HELPSET WITH JAVAHELP SOFTWARE
JavaHelp software allows you to add online help to any system
that has a Java Runtime Environment (JRE). With JavaHelp
software, you can embed online documentation inside your
client-side programs. This includes the obvious applets and
applications, but you can also use JavaHelp software with
JavaBeans(tm) technology components or as standalone help
for third-party systems.
Getting started with the JavaHelp software is easy. Just go to
http://java.sun.com/products/javahelp/download_binary.html. You
can download either the user-version, with a JRE, or a
developer-centric version (as a Zip or self-extracting
executable). There is also a JavaHelp User Guide that comes with
the software downloads. If you view the JavaHelp User Guide,
you'll see the JavaHelp system in action.
Once started, the Swing-based help viewer for JavaHelp presents
information in a series of views. You'll find a Table of
Contents, index of topics, and search. These three features
combined are called the HelpSet and may include multiple help
topic files. Essentially, it is your job to create the help topic
files and the navigation files, mapping the help topic to the file
with the necessary information. The topic files are basic HTML,
and the navigation files are formatted in XML. You can, however,
use a third-party tool to automatically produce the necessary
files. For example, a tool such as RoboHELP generates the
necessary files in the JavaHelp format. See the list of tools
supporting the JavaHelp software format at
http://java.sun.com/products/javahelp/industry.html.
To demonstrate the JavaHelp system in action, let's create a
"Hello, JavaHelp" HelpSet. To do this, you'll need to configure
a special directory structure. It helps if you work in
a subdirectory to start, so that you don't mix up the HelpSet
files with any others. Navigation files go in the top-level
directory, and topic and image files in subdirectories.
To get started, create a directory named help. Under help,
create a directory named Hello.
In the Hello directory, you create subdirectories for subtopics
to hold the actual help files. For the "Hello, JavaHelp"
demonstration, create one directory named First and another Last.
Once the directory structure is created, you can start creating
the navigation and help files. The directory structure now looks
as follows:
+ help
  + Hello
    + First
    + Last
The DTD for the main HelpSet file is contained in
http://java.sun.com/products/javahelp/helpset_1_0.dtd. In the
HelpSet file, you create entries for the map as well as for the
table of contents and index views. There is really no magic in
the filenames. Just be sure the HelpSet file ends with the
extension .hs. Here's what the HelpSet file, hello.hs, might look
like, where the map is in Map.jhm, the table of contents is in
toc.xml, and the index is in index.xml. Create this hello.hs file
in the help directory.
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE helpset
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp HelpSet Version 1.0//EN"
  "http://java.sun.com/products/javahelp/helpset_1_0.dtd">
<helpset version="1.0">
  <title>Hello, JavaHelp</title>
  <maps>
    <mapref location="Map.jhm"/>
    <homeID>overview</homeID>
  </maps>
  <view>
    <name>TOC</name>
    <label>TOC</label>
    <type>javax.help.TOCView</type>
    <data>toc.xml</data>
  </view>
  <view>
    <name>Index</name>
    <label>Index</label>
    <type>javax.help.IndexView</type>
    <data>index.xml</data>
  </view>
</helpset>

For the map file, you need to create a mapping from map ID to
files, similar to the following:
<mapID target="one" url="Hello/First/one.htm" />
Be sure the help files are specified as relative locations from
the HelpSet. You could hard code complete paths, but then as soon
as you JAR up the HelpSet, all paths would be wrong. Of course,
these could be complete URLs to resources on the Web. If you want
to have one "overview" help file at the top, and two help files
in each of the First and Last directories, your XML mapping might
appear as follows. Create this Map.jhm file in the help directory.
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE map
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp Map Version 1.0//EN"
  "http://java.sun.com/products/javahelp/map_1_0.dtd">
<map version="1.0">
  <mapID target="overview" url="Hello/overview.htm" />
  <mapID target="one" url="Hello/First/one.htm" />
  <mapID target="two" url="Hello/First/two.htm" />
  <mapID target="three" url="Hello/Last/three.htm" />
  <mapID target="four" url="Hello/Last/four.htm" />
</map>
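If you want to double-check a map file, a few lines of Java can
list its ID-to-file mappings. This is only a sketch using the
standard XML parser in the JDK, not part of the JavaHelp
software. The class name MapFileReader and the method readMap are
made up for this example. The entity resolver substitutes an
empty DTD so that the parser does not try to fetch the DTD over
the network.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MapFileReader {
    // Parses map-file XML and returns target -> url, in document order.
    static Map<String, String> readMap(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: validation is not needed, so supply an empty DTD.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList ids = doc.getElementsByTagName("mapID");
            Map<String, String> result = new LinkedHashMap<String, String>();
            for (int i = 0; i < ids.getLength(); i++) {
                Element id = (Element) ids.item(i);
                result.put(id.getAttribute("target"), id.getAttribute("url"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse map file", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<?xml version='1.0' encoding='ISO-8859-1' ?>\n"
            + "<!DOCTYPE map PUBLIC"
            + " \"-//Sun Microsystems Inc.//DTD JavaHelp Map Version 1.0//EN\""
            + " \"http://java.sun.com/products/javahelp/map_1_0.dtd\">\n"
            + "<map version=\"1.0\">\n"
            + "  <mapID target=\"overview\" url=\"Hello/overview.htm\" />\n"
            + "  <mapID target=\"one\" url=\"Hello/First/one.htm\" />\n"
            + "</map>";
        for (Map.Entry<String, String> entry : readMap(xml).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```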
The table of contents and index files are next. These provide
alternate means of working through the various help files. Again,
these are described in XML files.
For the table of contents, each target from the map is mapped to
text to appear in the table of contents. Create this toc.xml file
in the help directory.
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE toc
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp TOC Version 1.0//EN"
  "http://java.sun.com/products/javahelp/toc_1_0.dtd">
<toc version="1.0">
  <tocitem image="toplevelfolder" target="overview" text="Hello, JavaHelp">
    <tocitem text="First Stuff">
      <tocitem target="one" text="The One"/>
      <tocitem target="two" text="The Second"/>
    </tocitem>
    <tocitem text="Last Stuff">
      <tocitem target="three" text="What's Third?"/>
      <tocitem target="four" text="The End"/>
    </tocitem>
  </tocitem>
</toc>
The index is just another way of presenting the data. As you
create the index.xml file, you must list the terms in the order
you want them presented (alphabetically, for example). Simply
create the XML file with a set of hierarchical <indexitem>
entries. In each <indexitem> entry, provide a value for the text
attribute and a value for the target attribute. The value for the
text attribute specifies what to display to the user in the
index. The value for the target attribute specifies what help to
display. Create this index.xml file in the help directory.
<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE index
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp Index Version 1.0//EN"
  "http://java.sun.com/products/javahelp/index_1_0.dtd">
<index version="1.0">
  <indexitem text="The First?">
    <indexitem target="one" text="I'm One"/>
    <indexitem target="two" text="I'm Second"/>
  </indexitem>
  <indexitem text="The Last?">
    <indexitem target="three" text="We're Third!"/>
    <indexitem target="four" text="We're Last"/>
  </indexitem>
  <indexitem target="overview" text="Overview!!!"/>
</index>
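Every target used in toc.xml or index.xml should be an ID defined
in the map file. Here's a hedged sketch of how you might check
that with the JDK's XML parser. The class TargetCheck and its
methods are hypothetical and are not part of the JavaHelp
software. As an assumption, the DTD is replaced with an empty one
so that nothing is fetched over the network.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class TargetCheck {
    // Collects the non-empty target attributes of all elements named tagName.
    static Set<String> targets(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: no validation needed, so supply an empty DTD.
            builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
            NodeList items = builder.parse(new InputSource(new StringReader(xml)))
                                    .getElementsByTagName(tagName);
            Set<String> result = new TreeSet<String>();
            for (int i = 0; i < items.getLength(); i++) {
                String target = ((Element) items.item(i)).getAttribute("target");
                if (!target.isEmpty()) {
                    result.add(target); // entries without a target are headings only
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse navigation file", e);
        }
    }

    // Returns the targets that are used but not defined in the map.
    static Set<String> undefinedTargets(Set<String> used, Set<String> defined) {
        Set<String> result = new TreeSet<String>(used);
        result.removeAll(defined);
        return result;
    }

    public static void main(String[] args) {
        String index =
            "<index version=\"1.0\">"
            + "<indexitem text=\"The First?\">"
            + "<indexitem target=\"one\" text=\"I'm One\"/>"
            + "<indexitem target=\"two\" text=\"I'm Second\"/>"
            + "</indexitem>"
            + "</index>";
        Set<String> defined = new TreeSet<String>(Arrays.asList("overview", "one"));
        // Prints the targets the map would still need to define.
        System.out.println(undefinedTargets(targets(index, "indexitem"), defined));
    }
}
```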
The map file mentions five HTML files:
Hello/overview.htm
Hello/First/one.htm
Hello/First/two.htm
Hello/Last/three.htm
Hello/Last/four.htm
So you must create them. Make sure to create the files in the
appropriate Hello directory or subdirectory. Try to create the
files with something interesting in them, for example, a few
sentences of overview information in the overview.htm file. The
whole directory structure now looks like this:
+ help
    hello.hs
    index.xml
    Map.jhm
    toc.xml
  + Hello
      overview.htm
    + First
        one.htm
        two.htm
    + Last
        three.htm
        four.htm
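If you would rather not create the topic files by hand, the
following sketch writes a stub page for each one and reports any
that are still missing. The class TopicFiles, its methods, and
the stub HTML are made up for illustration. Replace the stubs
with real help text. The default directory name help matches the
structure above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopicFiles {
    // The five topic files named in Map.jhm, relative to the help directory.
    static final List<String> URLS = Arrays.asList(
        "Hello/overview.htm",
        "Hello/First/one.htm",
        "Hello/First/two.htm",
        "Hello/Last/three.htm",
        "Hello/Last/four.htm");

    // Returns the entries of urls that do not exist as files under helpDir.
    static List<String> missing(Path helpDir, List<String> urls) {
        List<String> result = new ArrayList<String>();
        for (String url : urls) {
            if (!Files.isRegularFile(helpDir.resolve(url))) {
                result.add(url);
            }
        }
        return result;
    }

    // Writes a stub page for each missing topic, creating directories as needed.
    static void createPlaceholders(Path helpDir, List<String> urls) {
        for (String url : missing(helpDir, urls)) {
            Path file = helpDir.resolve(url);
            String html = "<html><body><h1>" + url + "</h1></body></html>\n";
            try {
                Files.createDirectories(file.getParent());
                Files.write(file, html.getBytes(StandardCharsets.ISO_8859_1));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    public static void main(String[] args) {
        Path helpDir = Paths.get(args.length > 0 ? args[0] : "help");
        createPlaceholders(helpDir, URLS);
        System.out.println("Still missing: " + missing(helpDir, URLS));
    }
}
```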
To test if you have everything connected properly, run the
hsviewer utility that comes with the JavaHelp software, and have
it load the hello.hs file. You can find the utility in the
demos/bin (Unix) or demos\bin (Windows) subdirectory of your
JavaHelp installation directory. For example, on Unix,
change to the demos/bin subdirectory, and enter:
hsviewer -helpset hello.hs -classpath path
Replace "path" with the path to the hello.hs HelpSet.
After starting up hsviewer, click on the Browse button to locate
the hello.hs file. Then click on the Display button to bring up
the help viewer.
Because hello.hs has two <view> tags, you'll find two tabs on
the left side: one for the TOC and one for the index. The right
side will display the HTML associated with the item selected on
the left.

You can also add a search tab. To do this, run the jhindexer
program and add another <view> to the HelpSet. Enter the
jhindexer command as follows in the directory that contains the
hello.hs file.
jhindexer Hello
If the command isn't in your path, you'll need to prefix the
command with its full path. You can find the command in the
javahelp/bin (Unix) or javahelp\bin (Windows) subdirectory of
your JavaHelp installation directory.
Here's the <view> tag you need to add to hello.hs. JavaHelpSearch
is the name of the directory in which the search index support
files are saved.
<view>
  <name>Search</name>
  <label>Word Search</label>
  <type>javax.help.SearchView</type>
  <data engine="com.sun.java.help.search.DefaultSearchEngine">
    JavaHelpSearch
  </data>
</view>
For more information about JavaHelp software, see the JavaHelp
software page (http://java.sun.com/products/javahelp/).
. . . . . . . . . . . . . . . . . . . . . . .
IMPORTANT: Please read our Terms of Use, Privacy, and Licensing
policies:
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/privacy/
http://developer.java.sun.com/berkeley_license.html
* FEEDBACK
Comments? Send your feedback on the JDC Tech Tips to:
jdc-webmaster@sun.com
* SUBSCRIBE/UNSUBSCRIBE
- To subscribe, go to the subscriptions page,
(http://developer.java.sun.com/subscription/), choose
the newsletters you want to subscribe to and click "Update".
- To unsubscribe, go to the subscriptions page,
(http://developer.java.sun.com/subscription/), uncheck the
appropriate checkbox, and click "Update".
- To use our one-click unsubscribe facility, see the link at
the end of this email:
* ARCHIVES
You'll find the JDC Tech Tips archives at:
http://java.sun.com/jdc/TechTips/index.html

* COPYRIGHT
Copyright 2002 Sun Microsystems, Inc. All rights reserved.
901 San Antonio Road, Palo Alto, California 94303 USA.
This document is protected by copyright. For more information, see:
http://java.sun.com/jdc/copyright.html
JDC Tech Tips
April 23, 2002
Sun, Sun Microsystems, Java, Java Developer Connection, JavaHelp,
and JavaBeans are trademarks or registered trademarks of
Sun Microsystems, Inc. in the United States and other countries.
