
REGULAR EXPRESSIONS AND THEIR APPLICATIONS

Definition
Regular expressions provide an efficient way to perform string pattern matching. They are widely used on UNIX systems, and occasionally on personal computers as well, and they provide a very powerful, but also rather obtuse, set of tools for finding particular words or combinations of characters in strings.
On first reading this all seems rather complicated and of little use over and above the standard string matching provided in the Edit Filters dialog (Word matching, for example). In actual fact, in these cases NewsWatcher converts your string matching criteria into a regular expression when applying filters to articles.
However, you can use some of the simpler matching criteria with ease (some examples are suggested below), and gradually build up the complexity of the regular expressions that you use.
One point to note is that regular expressions are not wildcards. The regular expression 'c*t' does not mean 'match "cat", "cot"', etc. It means 'match zero or more 'c' characters followed by a t', so it would match 't', 'ct', 'cccct', and so on.
A regular expression is specified using two kinds of characters:

Metacharacters: operators that specify the search algorithm to use.

Literals: the characters the user is actually looking for in the text.

A regular expression can define complex patterns of character sequences. For example, the regular expression given below looks for the literal f or ht, followed by the literal t, the literal p, which may or may not be followed by the literal s, and the closing literal ( : ):
(f|ht)tps?:
The parentheses here are metacharacters used to group a number of pattern elements into a single element; the ( | ) symbol provides OR functionality, allowing either of the alternatives in the group to match. The ( ? ) is also used here as a metacharacter, indicating that the preceding s literal is optional. Hence the above regular expression successfully finds the strings http:, https:, ftp:, and ftps:.

Some Chronology
Regular Expressions were introduced by S. C. Kleene to describe the McCulloch and Pitts 1943 finite automata model of neurons ("Representation of Events in Nerve Nets", pp. 3-40 in Claude Shannon/John McCarthy, "Automata Studies", 1956).
The first application of Regular Expressions to editor search/replace (in the QED editor) was by Ken Thompson, who published a Regular Expression-to-NFA algorithm in 1968 ("Regular Expression Search Algorithm", CACM 11:6, pp. 419-422).
Ken Thompson went on to re-implement this in the Unix ed editor, which Bill Joy turned into the vi editor. Ken Thompson adapted the ed code for grep and sed. (Some years after its creation, Emacs eventually borrowed the idea of Regular Expressions, but not the code, directly from these Unix editors -- RMS, private communication.)
Steve Johnson (prior to, and building towards, his Unix yacc tool) and Mike Lesk (in the Unix lex tool) did some of the earliest applications of Regular Expressions to compiler lexical analyzers via automated DFA-building tools.
Awk is a scripting language/command line tool derived directly from this Unix Culture of
Regular Expressions; it is no coincidence that the language most famous for Regular
Expressions today, perl, was developed in a Unix environment, inspired by awk and other
Unix Regular Expression tools.
Regular Expressions were thus widespread in Unix tools of all sorts from the beginning,
years to decades before this technology was widespread elsewhere (although obviously there
were exceptions), and Regular Expressions have always been an extremely important (albeit
under-acknowledged) part of Unix Culture.

How Are They Useful?


Regular expressions serve as a powerful text processing component of programming
languages such as PERL and Java. For example, a PERL script can process each HTML file
in a directory, read its contents into a scalar variable as a single string, and then use regular
expressions to search for URLs in the string. One reason that many developers write in PERL
is for its robust pattern matching functionality.
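The same idea in a minimal Python sketch (the directory name and the deliberately simplified URL pattern are assumptions for illustration only):

    import re
    from pathlib import Path

    # Deliberately simple URL pattern, just for illustration
    url_pattern = re.compile(r"https?://[^\s\"'<>]+")

    # Read each HTML file in a directory into a single string and scan it
    for html_file in Path("pages").glob("*.html"):
        text = html_file.read_text(encoding="utf-8", errors="ignore")
        for url in url_pattern.findall(text):
            print(html_file.name, url)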
Oracle Database support of regular expressions enables developers to implement complex
match logic in the database. This technique is useful for the following reasons:

By centralizing match logic in Oracle Database, you avoid intensive string processing of SQL result sets by middle-tier applications. For example, life science customers often rely on PERL to do pattern analysis on bioinformatics data stored in huge databases of DNA and protein sequences. Previously, finding a match for a protein sequence such as [AG].{4}GK[ST] was handled in the middle tier. The SQL regular expression functions move the processing logic closer to the data, thereby providing a more efficient solution (a small sketch of this kind of match appears after this list).

Prior to Oracle Database 10g, developers often coded data validation logic on the
client, requiring the same validation logic to be duplicated for multiple clients. Using
server-side regular expressions to enforce constraints solves this problem.

The built-in SQL and PL/SQL regular expression functions and conditions make
string manipulations more powerful and less cumbersome than in previous releases of
Oracle Database.
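To make the protein-sequence example concrete, here is a minimal Python sketch of the same match logic (the sample sequences are invented; in Oracle Database the pattern would instead be evaluated server-side by the SQL regular expression functions):

    import re

    # The motif from the example above: A or G, any four residues, then G, K, and S or T
    motif = re.compile(r"[AG].{4}GK[ST]")

    # Invented sample sequences, purely for illustration
    sequences = ["MAVLKAGKSQ", "MKTAYIAKQR", "GPQRSTGKT"]

    for seq in sequences:
        if motif.search(seq):
            print("match:", seq)   # prints only the first sequence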

Applications
1. Regular Expressions in Web Search Engines
One use of regular expressions that used to be very common was in web search engines.
Archie, one of the first search engines, used regular expressions exclusively to search through
a database of filenames on public FTP servers[1]. Once the World Wide Web started to take
form, the first search engines for it also used regular expressions to search through their
indexes. Regular expressions were chosen for these early search engines because of both their
power and easy implementation. It is a fairly trivial task to convert search strings into regular
expressions that accept only strings that have some relevance to the query. In the case of a
search engine, the strings input to the regular expression would be either whole web pages or
a pre-computed index of a web page that holds only the most important information from that
web page. A query such as regular expression could be translated into the following regular expression:
(Σ* regular Σ* expression Σ*) ∪ (Σ* expression Σ* regular Σ*)
Σ, then, of course, would be the set of all characters in the character encoding used with this
search engine. The results returned to the user would be the set of web pages that were
accepted by this regular expression. Many other features commonly seen in search engines
are also easy to convert into regular expressions. One example of this is adding quotes around
a query to search for the whole string. The query "regular expression" could be converted into
the following regular expression: (Σ* regular expression Σ*)
Most of the other common features can also be easily converted into regular expressions.
Regular expressions are no longer used in the large web search engines because, with the growth of the web, it became impossibly slow to use them. They are, however, still used in many smaller search tools, such as the find/replace feature of a text editor or command-line tools such as grep.
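A minimal Python sketch of the query-to-regular-expression translation described above (here '.*' plays the role of Σ*, and the query and documents are invented examples):

    import re
    from itertools import permutations

    def query_to_regex(query):
        # Accept any document containing every keyword, in any order:
        # one alternative per keyword ordering, joined by '|' (the union above)
        words = [re.escape(w) for w in query.split()]
        alternatives = [".*" + ".*".join(p) + ".*" for p in permutations(words)]
        return re.compile("|".join(alternatives), re.DOTALL | re.IGNORECASE)

    pattern = query_to_regex("regular expression")
    docs = ["An expression is called regular if ...", "A page about cooking"]
    print([bool(pattern.fullmatch(d)) for d in docs])   # [True, False]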

2. Regular Expressions in Software Engineering


Software applications are nowadays built from components for reusability. A problem with component-based applications is how to define precisely the interplay at the component interfaces. Since a programmer does not have complete control over the components being integrated, this can lead to unpredictable behavior. Solving this problem makes it easier both to understand the specifications of the interfaces and to judge whether an implementation is correct, that is, whether it adheres to those specifications. A precise, formal specification of component behavior is a necessity in order to automate black-box testing. From textual specifications a finite state machine is produced, with its transitions labeled by messages, their data, and the constraints on that data. Regular languages are useful here because they deal in the finite, which matches the finite nature of a computer.
Regular expressions are also used in test case characterizations. These characterizations relate either to the programs directly or to their corresponding models. A generic framework is developed in which test cases are characterized and coverage criteria are defined for test sets. Coverage analysis can then be performed, and test cases and test sets can be generalized. Regular language theory is used to handle paths, with regular expressions written over the terminals and non-terminals of the paths; these are called regular path expressions. Operations on the paths are restricted to the level of regular expressions, and the only paths of interest are the regular path expressions that are feasible. A regular expression is sufficient to describe a test case or a class of test cases abstractly, but sets of expressions require their own criteria. Regular language theory also simplifies coverage analysis and test set generation. Most of the expenditure on software stems from maintaining it rather than from its original development, and much of that expenditure goes to testing. Regression testing is an important part of the software development cycle: it is the process used to determine whether a modified program still meets its specifications or whether new errors have been introduced. Research is being done to make regression testing more efficient and, specifically, more economical. Regular language theory does not play a huge part in regression testing, but at the integration level a relation can be established to finite automata and regular languages and their properties.
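To make the interface-specification idea concrete, here is a minimal Python sketch (the protocol, message names, and traces are hypothetical): the legal call sequences of a component interface are described by a regular expression over message names, and recorded traces are checked against it as a simple form of black-box conformance testing.

    import re

    # Hypothetical interface protocol: open, then any mix of read/write, then close
    protocol = re.compile(r"open( (read|write))* close")

    def conforms(trace):
        # A trace is a list of message names; join them and require a full match
        return bool(protocol.fullmatch(" ".join(trace)))

    print(conforms(["open", "read", "write", "close"]))   # True
    print(conforms(["read", "close"]))                    # False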

3. Regular Expressions in Lexical Analysis


Lexical analysis is the process of tokenizing a sequence of symbols to eventually parse a
string. To perform lexical analysis, two components are required: a scanner and a tokenizer. A
token is simply a block of symbols, also known as a lexeme. The purpose of tokenization is
to categorize the lexemes found in a string to sort them by meaning. For example, the C
programming language could contain tokens such as numbers, string constants, characters,
identifiers (variable names), keywords, or operators. The best way to define a token is by a
regular expression. We can simply define a set of regular expressions, each matching the
valid set of lexemes that belong to this token type. This is the process of scanning. Often, this
process can be quite complex and may require more than one pass to complete. Another
option is to use a process known as backtracking, that is, rereading an earlier part of a string
to see if it matches a regular expression based on some information that could only be
obtained by analyzing a later part of the string. It is important to note, however, that the
process of scanning does not produce the set of tokens in the document; it simply produces a
set of lexemes. The tokenizer must assign these lexemes to tokens.
In tokenization, we generally use a finite state machine to define the lexical grammar of the
language we are analyzing. To generate this finite state machine, we again turn to regular
expressions to define which tokens may be composed of which lexemes. For example, to
determine if a lexeme is a valid identifier in C, we could use the following regular
expression:
[a-zA-Z_][a-zA-Z_0-9]*
This regular expression says that identifiers must begin with a Roman letter or an underscore
and may be followed by any number of letters, underscores, or numbers. However, there is
one problem with the process of tokenization: we are unable to use regular expressions to
match complex recursive patterns, such as matching opening and closing parentheses, for
example. This is because these strings are not in a regular language and therefore cannot be
matched by a regular expression. To deal with this problem, we must invoke the use of a
parser; this is beyond the scope of this document. After we have our text broken up into a set
of tokens, we must pass the tokens on to the parser so that it can continue to analyze the text.
Numerous programs exist to automate this process. For example, we could use yacc to convert BNF-like grammar specifications into a parser which can be used to deal with the tokens produced through lexical analysis. Similarly, many lexical analyzers (often called simply lexers) exist to automate the process of scanning and tokenization; one of the best known is lex.
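As a small illustration of scanning and tokenization (a minimal Python sketch; the token set is a tiny, made-up subset of a C-like language rather than a complete lexer):

    import re

    # Each token type is defined by a regular expression, tried in order
    token_spec = [
        ("NUMBER",     r"\d+"),
        ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*"),
        ("OPERATOR",   r"[+\-*/=]"),
        ("SKIP",       r"\s+"),
    ]
    master_pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in token_spec))

    def tokenize(text):
        for match in master_pattern.finditer(text):
            if match.lastgroup != "SKIP":
                yield match.lastgroup, match.group()

    print(list(tokenize("count_2 = count_2 + 10")))
    # [('IDENTIFIER', 'count_2'), ('OPERATOR', '='), ('IDENTIFIER', 'count_2'),
    #  ('OPERATOR', '+'), ('NUMBER', '10')]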

4. Regular Expression in Web Activity Analysis


Google Analytics is a freemium web analytics service offered by Google that tracks and reports web application traffic; it is a very useful tool for keeping track of a website's statistics. Today there are specialists who use this tool exclusively in their day-to-day work. However, many do not leverage the power of the regular expressions that live inside Google Analytics. Regular expressions are quite easy to use in Analytics, where they give the user the ability to create segments, goals, and filters. For example, a website owner viewing a report within Google Analytics may find that the URLs are not in a friendly format. Let's say a long URL such as http://example.com/some-info?id=1234&account=4567 should instead display as http://example.com/some-info/1234. Using an Advanced Filter with a regular expression, one can easily find and replace the URLs that have the undesired format.
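The underlying find-and-replace resembles this minimal Python sketch (the pattern and replacement mirror the example URL above; this is not the actual Google Analytics filter syntax):

    import re

    # Capture the path and the id parameter, then rebuild a friendlier URL
    pattern = re.compile(r"(http://example\.com/some-info)\?id=(\d+)&account=\d+")

    url = "http://example.com/some-info?id=1234&account=4567"
    friendly = pattern.sub(r"\1/\2", url)
    print(friendly)   # http://example.com/some-info/1234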
