
Webserver Basics

Key Concepts

The web server that ships with Red Hat Enterprise Linux is the Apache
webserver.
In general terms, web servers map URL requests onto files within the local
directory, using the Document Root (/var/www/html/) as the base of the
translation.
The web server associates meta-data with requested files, such as content types.
When a client requests a directory instead of a file, Apache serves the file
index.html (if it exists), generates a directory listing on the fly (if it's
allowed to), or returns an access denied error.
Web servers and web clients communicate using the HTTP protocol.
Often, the information served from a web server is structured using the HTML
markup language.

Table 1. The Apache Web Server

Packages       httpd (with apr and httpd-suexec dependencies), plus other
               modules (usually starting mod_...), and httpd-manual
Service        httpd
Daemon         /usr/sbin/httpd
Config Files   /etc/httpd/conf/httpd.conf, /etc/httpd/conf.d/*
Logging        /var/log/httpd/{access,error}_log
Ports          80/tcp (http), 443/tcp (https)

Discussion

Web Servers

This lesson focuses on installing and starting the Apache web server, and publishing
information using the default configuration. We also introduce some of the basics of the
HTTP protocol and the HTML markup language, for those who are interested.

Installing the Apache Web Server

In Red Hat Enterprise Linux, the Apache web server is easy to install and start in its
default configuration, using the conventional trio of commands to install the httpd
package and start the httpd service: yum install ...; service ... start;
chkconfig ... on.

[root@station ~]# yum install httpd
...
Dependencies Resolved
=============================================================================
 Package                 Arch       Version           Repository        Size
=============================================================================
Installing:
 httpd                   i386       2.2.3-6.el5       rha-rhel         1.1 M
...
Installed: httpd.i386 0:2.2.3-6.el5
Complete!

The httpd service can now be started and "chkconfiged on".

[root@station ~]# service httpd start
Starting httpd:                                            [  OK  ]
[root@station ~]# chkconfig httpd on

The availability of the Web Server can be confirmed by using any Web browser to
reference http://localhost. The following example uses elinks, but the firefox browser
could have been used just as easily.

[root@station ~]# elinks -dump http://localhost

                     Red Hat Enterprise Linux Test Page

   This page is used to test the proper operation of the Apache HTTP server
   after it has been installed. If you can read this page, it means that the
   Apache HTTP server installed at this site is working properly.

...

Web Server Layout

Once installed, an rpm query to list files (rpm -ql) always serves as a good introduction
to the layout of a new product.

[root@station ~]# rpm -ql httpd


/etc/httpd
/etc/httpd/conf
/etc/httpd/conf.d
/etc/httpd/conf.d/README
...

Skimming the output, the following relevant files and directories can be seen.

Table 1. Web Server Filesystem Layout

Directory                  Purpose
/etc/httpd/                Configuration files, including /etc/httpd/conf/httpd.conf.
/usr/lib/httpd/modules/    Dynamically loaded modules.
/var/log/httpd/            Log files, including access_log and error_log.
/var/www/html/             The Web Server Document Root (more on this in a moment).

The Document Root: /var/www/html/

The purpose of the Web Server is to serve information. Usually, this involves reading a
file from the file system and transferring it to a web browser, which then displays or
renders the file.

As an arbitrary example, the file /etc/sysctl.conf can be copied to the document
root (/var/www/html) directory. Any web browser referencing
http://localhost/sysctl.conf should display the contents of the file just as could be done
with the cat command. (Some web browsers may mangle the whitespace within the file,
essentially placing the entire contents of the file on one line. This issue arises because of
misguided "Content Type" negotiations. More on this later.)

[root@station ~]# cp /etc/sysctl.conf /var/www/html/
[root@station ~]# elinks http://localhost/sysctl.conf
[root@station ~]# elinks -source http://localhost/sysctl.conf
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.

# Controls IP packet forwarding


net.ipv4.ip_forward = 0
...

Instead of a single file, entire directory trees can be copied into the /var/www/html
directory.

[root@station ~]# cp -a /etc/sysconfig /var/www/html/

Now, by accessing http://localhost/sysconfig with a web browser, the contents of the
directory should be visible, with "clickable" file and subdirectory links.

Figure 1. Browsing the sysconfig Directory


Notice the shift in perspective. What we would call the directory
/var/www/html/sysconfig, the web server refers to as just /sysconfig. This
translation is the essence of the term "Document Root".

Web browsers request information using "Uniform Resource Locators", or more
commonly just "URLs". Web-related URLs are usually composed of a hostname and a
file path.

http://hostname/dir1/dir2/filename

The hostname is simply the hostname or IP address of the host running the server,
while the dir1/dir2/filename is thought of as being a path to a particular file on the
server. When locating the file, the web server assumes that the root of the "URL
Namespace" is the document root directory (/var/www/html).

The http portion of the URL is the protocol, which tells the web browser both which
port to connect to, and what "language" to expect to speak to whomever is listening on
that port. For web servers, the port is 80, and the language is known as the Hypertext
Transfer Protocol, or HTTP.
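The translation described above can be pictured as simple string concatenation. The paths below are the ones used in the text; the sketch deliberately ignores refinements such as aliases and other URL remappings, which later lessons cover.

```shell
# Sketch: how the path portion of a URL maps onto the filesystem.
# /var/www/html is Red Hat's default Document Root; real servers also
# apply alias remappings before touching the filesystem.
docroot=/var/www/html
url_path=/sysconfig/network      # path portion of http://hostname/sysconfig/network
echo "${docroot}${url_path}"     # -> /var/www/html/sysconfig/network
```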

Of course, it's not a machine's configuration files that one usually chooses to publish to
the world. We'll move on to more interesting content.

Content Types

The purpose of the web server is to serve the content of files, but web clients seem to
learn not just the content of the file, but how to interpret that content as well. As an
example, consider a text file such as /etc/hosts, an HTML file such as
/usr/share/doc/samba-version/htmldocs/manpages/net.8.html, and an image
file such as /usr/share/backgrounds/tiles/neurons.png, each of which is
copied to a web server's document root.

[root@station ~]# mkdir /var/www/html/example
[root@station ~]# cd /var/www/html/example
[root@station example]# cp /etc/hosts .
[root@station example]# cp /usr/share/doc/samba-*/htmldocs/manpages/net.8.html .
[root@station example]# cp /usr/share/backgrounds/tiles/neurons.png .
[root@station example]# ls
hosts  net.8.html  neurons.png

How does a web client handle each of these? If you're sitting at a student workstation,
try for yourself. (Of course, you will first need to perform the above commands to put
the files in place.)

http://localhost/example/hosts
http://localhost/example/net.8.html
http://localhost/example/neurons.png
Note
Make sure to create or copy files underneath the /var/www/html directory as the
root user. Do not move already existing files into the directory. If you're having
trouble, give it a pass for now, until you read the section "But What Could Go
Wrong?" below.

All of the files should have been treated reasonably by the client: the hosts file as a
simple text file, the net.8.html file as a marked up man page, complete with bolded
titles, italics, and hyperlinks, and neurons.png as a picture of blue blobs.

Now let's shake things up a bit.

[root@station example]# cp hosts hosts.html


[root@station example]# cp net.8.html net.8.txt
[root@station example]# cp hosts hosts.png
[root@station example]# cp neurons.png neurons.txt

Again, if at a student workstation, try the following.

http://localhost/example/hosts.html
http://localhost/example/net.8.txt
http://localhost/example/hosts.png
http://localhost/example/neurons.txt

For those not able to follow along, hosts.html lost all of its formatting, net.8.txt
dumped what you would see if you catted the file directly, hosts.png caused the
browser to complain about a malformed image, and neurons.txt showed a bunch of
glyphs representing binary data.

There are obviously expectations on the part of the browser about how to interpret
the data it receives: text to dump, marked up text (HTML) to format, or an image to
render. The expectation about what type of data the client is receiving is known as the
data's content type.
Apparently, the content type is determined by the file's filename extension. We still
don't know if the extension is being interpreted into a content type by the server (before
the file's content is transmitted) or by the client (after the content is received). The
answer is the server, and the server communicates that content type, as well as a lot of
other meta-data about the transfer, using the HTTP protocol.
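The extension-to-type mapping can be pictured with a small sketch. The pairs below reproduce the behavior observed in the experiment above; the function itself is purely illustrative, since Apache actually derives its map from files such as /etc/mime.types.

```shell
# Illustrative only: map a filename extension to a content type, as the
# server does before transmitting (Apache really consults /etc/mime.types).
content_type() {
  case "${1##*.}" in
    html) echo "text/html" ;;
    png)  echo "image/png" ;;
    *)    echo "text/plain" ;;
  esac
}

content_type hosts.png    # -> image/png (hence the "malformed image" complaint)
content_type net.8.txt    # -> text/plain
```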

Directories

We've seen how the web server responds when a client requests a file: it returns
the contents of the file to the client. How does the web server handle directories? In
general, a web server responds in one of three ways.

First, the web server checks to see if an index file (a file named index.html) exists in
the directory. If so, the web server returns the contents of that file, as if the request for
http://localhost/example were for http://localhost/example/index.html.

Second, if no index file exists, the web server checks to see if the Indexes option is
enabled. If so, the web server returns a dynamically generated directory listing.
Otherwise, the web server returns an error to the client. (How the Indexes option is
set or not set will be covered in a following lesson. In Red Hat Enterprise Linux, the
option is set by default.)

Table 1. Web Server Responses to Directory Requests

Configuration                        Response
index.html exists                    Return the contents of index.html
no index.html, Indexes enabled       Return a dynamically generated directory listing
no index.html, Indexes disabled      Return error 403 ("Access Denied")
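The three-way decision can be sketched in shell. This is an illustration of the logic only, not Apache code; the second argument stands in for the Indexes option.

```shell
# Illustrative sketch of how a directory request is answered.
serve_directory() {
  dir="$1"; indexes="$2"
  if [ -f "$dir/index.html" ]; then
    echo "return contents of $dir/index.html"
  elif [ "$indexes" = "on" ]; then
    echo "return generated directory listing"
  else
    echo "return 403 (Access Denied)"
  fi
}

d=$(mktemp -d)
serve_directory "$d" on      # -> return generated directory listing
serve_directory "$d" off     # -> return 403 (Access Denied)
touch "$d/index.html"
serve_directory "$d" off     # -> return contents of .../index.html
rm -r "$d"
```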

Assuming you followed along above, create the file
/var/www/html/example/index.html with the following content (you should be able
to cut and paste directly from the browser).

<h1>Examples</h1>
[<a href="hosts">hosts</a>]
[<a href="net.8.html">net man page</a>]
[<a href="neurons.png">picture of neurons</a>]

What happens when you now view http://localhost/example? You should see the
marked up contents of the index file. Is the effect any different if you view
http://localhost/example/index.html directly? (It shouldn't be.)

Figure 1. Contents of http://localhost/example


What about the file /var/www/html/example/hosts.html? Is it still available? You
should be able to access it by manually entering the URL
http://localhost/example/hosts.html, but there is no way to click to it directly (except
from this page, of course). Content behind an index file, which is not referenced
directly, is obscured, but still available if someone knows it's there.

Web Server Logging: /var/log/httpd/{access,error}_log

The Apache web server logs information about every request it handles to the file
/var/log/httpd/access_log. A sample of the log file's contents follows.

[root@station ~]# tail -3 /var/log/httpd/access_log
127.0.0.1 - - [13/Jul/2005:06:34:24 -0400] "GET /example/net.8.html HTTP/1.1" 200 26196 "http://localhost/rhasb/curr/rha230/html-instructor-classroom/rha230_httpd_http.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6"
127.0.0.1 - - [13/Jul/2005:06:34:24 -0400] "GET /example/samba.css HTTP/1.1" 404 290 "http://localhost/example/net.8.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6"
127.0.0.1 - - [13/Jul/2005:06:34:25 -0400] "GET /favicon.ico HTTP/1.1" 404 284 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050720 Fedora/1.0.6-1.1.fc4 Firefox/1.0.6"

Within any line, we find the following information:

The IP address of the client that made the request.
A timestamp of when the request occurred.
The response code associated with the request. A response code of 200 implies
success; anything else is usually some type of failure.
The length of the content returned, not to be confused with the response code
which precedes it.
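Because the access_log keeps these fields in fixed positions, standard tools can pull them apart. A sketch using awk on a line shaped like the samples above (the field numbers assume the default combined log format):

```shell
# Print client IP, response code, and content length from one log line.
# In the combined format, $1 is the client address, $9 the response code,
# and $10 the content length.
line='127.0.0.1 - - [13/Jul/2005:06:34:24 -0400] "GET /example/net.8.html HTTP/1.1" 200 26196 "-" "Mozilla/5.0"'
echo "$line" | awk '{print $1, $9, $10}'    # -> 127.0.0.1 200 26196
```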
Any request that does not complete successfully (i.e., whose response code is not 200)
also generates information in the error_log.

[root@station ~]# tail -3 /var/log/httpd/error_log
[Tue Jul 13 06:34:24 2005] [error] [client 127.0.0.1] File does not exist: /var/www/html/example/samba.css, referer: http://localhost/example/net.8.html
[Tue Jul 13 06:34:25 2005] [error] [client 127.0.0.1] File does not exist: /var/www/html/favicon.ico

The access_log and the error_log are among the first places an administrator should
look when trying to figure out why something doesn't seem to be working. The
following table itemizes some of the return codes associated with various errors (or
successes).

Table 1. HTTP return codes

Code  Meaning
200   Success
401   Authorization Required
403   Access Denied
404   File Not Found
500   Internal Server Error

There are many others, but these tend to be the most common. (In general, the HTTP
protocol follows a response code convention used by many network services:
preliminary or informational responses are in the 100's, successes in the 200's,
redirections or intermediate results in the 300's, client errors in the 400's, and server
errors in the 500's. Watch closely the output the next time you use the simple ftp
client, for example.)

But What Could Go Wrong?

In it's default configuration, there's really only two things that could cause problems:
permissions, and SELinux.

First, files must be readable by the system user apache. The httpd process, like any
other process, must have the right permissions to access a file. For security reasons,
the web server runs as the user apache. Therefore, any file served by the web server
must be readable by the user apache.

Secondly, the Apache web server is one of the services constrained by the Red Hat
Enterprise Linux SELinux targeted policy. Therefore, any file served by the Apache
web server must have an appropriate SELinux context. For now, the context of the
/var/www/html directory (httpd_sys_content_t) will suffice. Any file created in this
directory (including subdirectories) should inherit this context, and be fine. The problem
occurs when files are created somewhere else, and moved to this directory - they then
retain their original (inappropriate) SELinux context.
At any rate, whenever the web server complains in its log file that it cannot access a file
you think it should be able to, try the following commands to set appropriate
permissions and SELinux context.

[root@station ~]# chmod a+r filename


[root@station ~]# chcon --reference /var/www/html filename

or

[root@station ~]# restorecon /var/www/html/filename

The Anatomy of a Web Request: the HTTP Protocol (Optional, but Interesting)

This section introduces the HTTP protocol. The intent is not to be thorough, but instead
to give students an impression of what is meant when people use terms such as HTTP
headers, GET, and Response Code. For those who don't get enough, all of the details
can be found at the World Wide Web Consortium's website.

In order to introduce the HTTP protocol, it's easiest to start with an example. The entire
conversation between a web client and a web server can be captured using the
wireshark network analyzer. If not already installed, yum install wireshark-gnome
should do the trick. A capture is started by opening wireshark, choosing
Capture:Start... from the menu, specifying a capture filter of (in this case) port 80, and
"OK"ing. (Enabling "Update list of packets in real time" and "Automatic scroll in live
capture" tends to make things more interesting for small captures, as well.)

Figure 1. Specifying a Wireshark Capture filter


Once Wireshark is capturing packets, any conversations between a web client and a web
server which occur on the local machine should be captured. For example, the following
displays a conversation between a web client requesting
http://station53.rosemont.wlan/example/hosts and a web server providing the
answer. Once wireshark has been stopped, the individual IP packets can be browsed
from a list.

Figure 2. A Wireshark Capture Packet List


More interestingly for our purposes, wireshark can easily assemble the payload from
each of the individual packets which compose a TCP/IP conversation by right clicking
on any packet, and choosing Follow TCP Stream.

Figure 3. Viewing a TCP Conversation with Wireshark


The web client, in red, is making a request of the web server, in blue. The "language"
the client and server use is the HTTP protocol.

The HTTP Protocol: the Request (Client to Server)

A web request is composed of three parts: a request line, a series of HTTP headers, and
the "body" (or content).

Note
In the following, some portions of the text have been replaced with "..." for
readability. The same convention is used many places in the text.
GET /example/hosts HTTP/1.1
Host: station53.rosemont.wlan
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Geck...

Accept:
text/xml,...text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

The entire first line is known as the Request-Line, and contains exactly three pieces of
information in a specified order.

The request method, which for our purposes can be thought of as being either a GET
or a POST. With a GET, the client is requesting information. With a POST, the
client is submitting information.
The URI, or "Uniform Resource Identifier". Think of this as the path portion of a
URL. (The server portion has already been used to open the TCP/IP connection.)
The exact protocol that the client is speaking. Only two protocols are generally
considered, HTTP/1.0 and HTTP/1.1, and any modern client should be using the
latter.

The next series of lines, which all have the form header: data, are known as the
HTTP headers. These are used to associate any metadata with the request. Some HTTP
request headers relevant to our discussion are the following.

Host: The content of the host portion of the URL requested by the client.
User-Agent: The User Agent is the client software. In this case, the client is the
Firefox web browser, which identifies itself as a variant of Mozilla.
Accept: A list of the content types that the browser is willing to accept. This
browser prefers to receive text/xml or text/html, but will also handle
text/plain. For images, the browser prefers image/png, but in the end, the
browser will accept */*, or anything the server will throw at it.

After a blank line, indicating the end of the HTTP headers, the content of the request
would follow. For GET requests, such as this one, there is no content.
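A request like the one above can be composed by hand. The sketch below only builds the text (note the CRLF line endings and the blank-line terminator); if the server from the text were running, the output could be piped to a raw TCP client such as nc localhost 80. The path and hostname are the ones from the capture.

```shell
# Build a minimal HTTP/1.1 GET request by hand. The final bare \r\n is the
# blank line that marks the end of the headers; a GET carries no body.
http_request() {
  printf 'GET %s HTTP/1.1\r\n' "$1"
  printf 'Host: %s\r\n' "$2"
  printf 'Connection: close\r\n'
  printf '\r\n'
}

http_request /example/hosts station53.rosemont.wlan
```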

The HTTP Protocol: the Response (Server to Client)

The server responds with the following, which is again composed of three parts: a
response line, a set of response HTTP headers, and the response "body" (or content).

HTTP/1.1 200 OK
Date: Sat, 13 Aug 2005 11:09:51 GMT
Server: Apache/2.0.54 (Fedora)
Last-Modified: Sat, 13 Aug 2005 10:26:31 GMT
ETag: "406ee-104-105723c0"
Accept-Ranges: bytes
Content-Length: 260
Connection: close
Content-Type: text/plain; charset=UTF-8

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1.localhost.localdomain.localhost
192.168.218.254.rosemont.
#192.168.218.254.s.
#192.168.218.53.w.
192.168.0.5.s.
192.168.0.6.w.
192.168.201.254 rw

The Response-Line, like the Request-Line, is composed of three ordered parts. In the
case of the response, however, the third part is merely a human-readable restatement
of the second.

The exact protocol the server is using.
The response code of the transaction, which is used to imply success, or qualify a
type of failure. In this case, the response code 200 implies success. (More on these
later.)
A text representation of the response code. This is supplied only for diagnostic
(debugging) purposes, as the response code is what's really important. The text OK
is associated with the response code of 200.

Again, the next series of lines, which all have the form header: data, are known as the
HTTP headers. We will only focus on one of the HTTP response headers.

Content-Type: The server is providing the client with the type of the content, so
the browser can render the data appropriately. For this response, the content type is
text/plain, so the browser will display the content "as is", preserving whitespace.
Other content types could include image/png, text/html, or application/msword.

After a blank line, indicating the end of the HTTP headers, the content of the response
follows.

For this response, the content is a simple text file. (In the output above, tabs have
been replaced with periods, an artifact of how wireshark displays non-printing
characters.)

The Hyper Text Markup Language (HTML) (Optional)

This workbook is about managing the Apache web server as a system administrator, not
about designing web content. However, during this workbook you will encounter files
which use HTML to mark up their content, so a brief introduction will be useful. Again,
those who do not get enough can find more at the World Wide Web Consortium's
website.

Fundamentally, HTML provides three things.

1. Structure: HTML allows text to be identified as titles or inlined quotes, or
organized into lists and tables.
2. Embedded Media: HTML allows authors to embed media into their text,
usually in the form of images, but also as videos and sound.
3. Links: HTML allows authors to easily reference other information, so that
anyone reading the text can locate the other information with the click of a
mouse.

All three of the above capabilities rely on embedding HTML tags into the text, where a
tag is any text embedded between angle brackets, such as <table>, <img>, or <a>.

Because the brackets are now considered syntax, there needs to be some way to
represent a literal bracket. This is done using HTML entities, which begin with an
ampersand (&) and end with a semicolon. For example, the entity for a left angle
bracket is &lt; (for "less than"), and the entity for a right angle bracket is &gt; (for
"greater than"). Entities are also used for glyphs not often found on keyboards, such as
the copyright symbol. Since the ampersand starts entities, there must also be some way
of representing it, and the answer is itself an entity: &amp;.
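The three escapes can be applied mechanically; a sketch with sed follows. The ampersand must be replaced first, or the substitutions for the brackets would have their own ampersands re-escaped by a later pass.

```shell
# Escape the characters HTML treats as syntax. The ampersand is handled
# first so the entities introduced by the other substitutions survive.
echo 'if a < b && b > c' | \
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
# -> if a &lt; b &amp;&amp; b &gt; c
```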

Rather than provide a full introduction to HTML in the text, a sample document is
provided at http://rha-server/pub/rha/rha230/sample.html. Students are encouraged to
examine this document, both as it is rendered by a web browser and the underlying text
(which can usually be viewed in a browser by right clicking and choosing view page
source).

Exercises

Lab Exercise
Objective: Install, start, and contribute content to an Apache web site.

Estimated Time: 45 mins.

This exercise has you download and install material for your web server, using the web
server's default configuration. The material consists of three texts which are not
optimally organized for the Apache web server. The lab has you perform some simple
renamings and repositioning of the material so that it is more naturally viewed using a
web browser.

Specification

1. If the httpd package is not already installed on your machine, install it now.
2. Start the httpd service (if it is not already started), and configure the service to
be started by default upon reboots.
3. Download a copy of the file http://rha-server/pub/rha/rha230/readings.tgz, and
extract its contents into your web server's document root directory
(/var/www/html/). Properly extracting the contents should result in a new
/var/www/html/readings directory.
4. Using a web browser, browse the http://localhost/readings directory. You should
be able to view the HTML files the_god_of_mars.html and
war_of_the_worlds appropriately.
5. Correct a misnamed index file.
a. Again using a web browser, examine the contents of the
http://localhost/readings/relat10h/ subdirectory. You should discover the
file index.htm. Try examining this file through the web browser:
http://localhost/readings/relat10h/index.htm.
b. Apparently, the intent of the author was that this page should serve as an
index page, but the file is named incorrectly for Apache's default
configuration. In the /var/www/html/readings/relat10h/ directory,
create a link of index.htm named index.html (either hard or soft).
c. Using a browser, again view the URL http://localhost/readings/relat10h/.
You should now see the contents of the index page.
d. To make life a little easier for anyone browsing your site, in the
/var/www/html/readings directory, create a symlink to the relat10h
directory called relativity.
e. Confirm that you may now access the content of the file index.htm
using http://localhost/readings/relativity/.
6. Correct a misnamed directory.
a. If you can stomach the physics (and, in fact, even if you cannot), skim
the first appendix to Einstein's theory of relativity, either by following
the link from the main page, or by referencing
http://localhost/readings/relativity/ap01.htm directly.
b. You might notice that many of the equations, such as equation 29,
equation 30, etc., are missing. Examine the end of
/var/log/httpd/access_log, and note the many requested image
files which received a 404 ("File Not Found") response code.
c. Examine the end of the file /var/log/httpd/error_log, and you will
discover more helpful messages.

   [root@station ~]# tail /var/log/httpd/error_log
   [Tue Jul 20 16:53:14 2005] [error] [client 127.0.0.1] File does not exist: /var/www/html/readings/relat10h/pics, referer: http://localhost/readings/relat10h/ap01.htm
   ...

d. Examining the log messages closely, you may discover the problem. All
of the web pages are expecting images to be in a directory named pics,
but this directory does not exist.
e. Through a simple directory renaming, or perhaps another symlink, solve
the problem, so that all of the images of equations are properly
displayed.
7. Now that you have completed the hard work, relax a little, by deriving the
equation for the Lorentz transformation, following the steps in chapter 11. Place
your results in a file titled that_was_easy in your academy user's home
directory. (Just kidding.)

Deliverables

1. An installed and running httpd service, configured to start by default on
bootup.
2. The text of three books, browsable from the URL http://localhost/readings.
3. The table of contents of Einstein's theory of relativity at
http://localhost/readings/relat10h.
4. The table of contents of Einstein's theory of relativity, also at
http://localhost/readings/relativity.
5. The images of equations in appendix 1 (found at
http://localhost/readings/relativity/ap01.htm) are displayed properly.

Clean Up

You will want to leave the /var/www/html/readings directory in place, as you will
need it in the next section.

Apache Configuration
Key Concepts

The Apache server is configured using the /etc/httpd/conf/httpd.conf and
/etc/httpd/conf.d/*.conf configuration files.
The configuration file is informally divided into the Global, Main, and Virtual
Server sections.
The Global section defines aspects which pertain to the server as a whole,
including client connection dynamics, server pool parameters, binding address,
and which modules to load.
The Main section defines aspects which may be redefined by any virtual server,
such as the document root, logging behavior, and URL namespace remappings.
Comprehensive documentation is provided by the httpd-manual package, which,
when installed, can be accessed at http://localhost/manual.

Discussion

Apache Configuration: /etc/httpd/conf/httpd.conf

The Apache web server is configured with text configuration files which are read upon
startup. The primary configuration file is /etc/httpd/conf/httpd.conf, but the files
/etc/httpd/conf.d/*.conf are "slurped up" into the configuration, as well.

[root@station ~]# ls /etc/httpd/conf /etc/httpd/conf.d/


/etc/httpd/conf:
httpd.conf magic

/etc/httpd/conf.d/:
README welcome.conf

The Apache configuration file syntax is straightforward, and tends to be well
documented (both as comments in the default configuration file, and in a separate
manual to be discussed later). A sample of the configuration file's syntax follows.

#
# DocumentRoot: The directory out of which you will serve your
# documents. By default, all requests are taken from this directory, but
# symbolic links and aliases may be used to point to other locations.
#
DocumentRoot "/var/www/html"

#
# Each directory to which Apache has access can be configured with respect
# to which services and features are allowed and/or disabled in that
# directory (and its subdirectories).
# ...
<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

Any empty line, or line which begins with a hash ("#"), is considered a comment.
Any line which is not a comment generally starts with a keyword referred to as a
directive. Directives are not case sensitive, but of course spelling is important. The
syntax of the remainder of the line depends on the directive, but all of a directive's
arguments must occur on a single line.
The only other way a line can begin is with an XML-like tag, which begins a
container. Containers end with an XML-like closing tag. Generally, all directives
found within a container only take effect within the scope of the container. We will
discuss the effects of different types of containers in a later lesson.

The file is thought of as occurring in three sections, although the syntax does not
formally enforce them.

1. The Global Section: This section contains configuration which applies to the
web server as a whole, including any virtual servers.
2. The Main Section: Configuration which applies to the main server (as opposed
to any virtual servers) belongs in this section. Any configuration in this section
can be overridden by a virtual server.
3. Virtual Servers: The Apache web server can take on the appearance of being
multiple distinct servers. Virtual servers will be discussed in more detail in the
next lesson.

We begin by examining configuration relevant to the server as a whole. You might want
to open the file /etc/httpd/conf/httpd.conf in a pager or text editor and follow
along as you read the following sections. (You should consider setting the editor into a
"read only" mode, or making a backup of the file and browsing it).

The Global Section

The Global section of the configuration file includes configuration that affects the
server as a whole.

Figure 1. /etc/httpd/conf/httpd.conf

    ### Section 1: Global Environment
    #
35  # The directives in this section affect the overall operation of Apache,
    # such as the number of concurrent requests it can handle or where it
    # can find its configuration files.

Configuration Context: ServerRoot

The <directive>ServerRoot</directive> directive establishes a home base for all of the


remaining server context, while the second directive is a simple example of making use
of this home base.

Figure 1. /etc/httpd/conf/httpd.conf

46  #
    # ServerRoot: The top of the directory tree under which the server's
    # configuration, error, and log files are kept.
    ...
    #
55  # Do NOT add a slash at the end of the directory path.
    #
    ServerRoot "/etc/httpd"

    #
60  # PidFile: The file in which the server should record its process
    # identification number when it starts.
    #
    PidFile run/httpd.pid

The <directive>ServerRoot</directive> directive establishes context for future file
references within the configuration file. Any relative file reference (one that does
not begin with a "/") will be relative to the <directive>ServerRoot</directive>,
which in Red Hat Enterprise Linux is /etc/httpd.

In Unix, daemons traditionally record the fact that they are running by creating a
file in the filesystem which contains their process id, called a pid file. The
<directive>PidFile</directive> directive specifies where this file should be located.

Examining the /etc/httpd directory, we find it's populated with several symbolic
links.

[root@station ~]# ls -l /etc/httpd
total 28
drwxr-xr-x 4 root root 4096 Jul 25 06:33 conf
drwxr-xr-x 2 root root 4096 Jul 25 06:33 conf.d
lrwxrwxrwx 1 root root   19 Jul 25 06:33 logs -> ../../var/log/httpd
lrwxrwxrwx 1 root root   27 Jul 25 06:33 modules -> ../../usr/lib/httpd/modules
lrwxrwxrwx 1 root root   13 Jul 25 06:33 run -> ../../var/run

In the httpd.conf configuration file, file references that begin with logs/, modules/, or
run/ are mapped to the relevant directories. Can you convince yourself that the
daemon's pid file would be found at /var/run/httpd.pid?

It's important to understand the role of the ServerRoot directive, and the use of the
symbolic links in the /etc/httpd directory, but there's seldom any reason to change
these values.
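As a quick sanity check, the resolution rule can be sketched in shell. The paths mirror the Red Hat Enterprise Linux defaults discussed above; this only mimics the server's logic, it does not invoke httpd itself.

```shell
# Sketch: how httpd resolves a relative PidFile path against ServerRoot.
server_root="/etc/httpd"
pid_file="run/httpd.pid"
case "$pid_file" in
  /*) resolved="$pid_file" ;;                # absolute path: used as-is
  *)  resolved="$server_root/$pid_file" ;;   # relative path: prefixed with ServerRoot
esac
echo "$resolved"    # /etc/httpd/run/httpd.pid
```

Since /etc/httpd/run is a symlink to /var/run, the resolved path ultimately lands on /var/run/httpd.pid.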

Client Connection Dynamics: Timeout and KeepAlive

The following directives control how long the server will wait on badly behaved clients.

Figure 1. /etc/httpd/conf/httpd.conf

#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 120

#
# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
#
KeepAlive Off

#
# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
#
MaxKeepAliveRequests 100

#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 15

A particular httpd process can only communicate with one client at a time. A badly
behaved client, which opens a TCP/IP connection but never uses it, could therefore
tie up a server indefinitely. The <directive>Timeout</directive> directive specifies
how long, in seconds, before the server terminates a connection with such a badly
behaved client.

The remaining directives decide if the server honors "Keep Alive" requests from a
client, how many requests can be made over a "Keep Alive" connection, and how long
before an inactive connection should time out.

The HTTP protocol is termed a "stateless" protocol, meaning that the server doesn't
record any information about the client between one request and the next. In the original
HTTP/1.0 protocol, clients are required to open a new socket for every request.
Downloading a web page with 10 images, therefore, would require the client to open 11
sockets (one for the page, and one for each referenced image).

The HTTP/1.1 protocol tried to improve efficiency by allowing a client to leave a single
socket open for "follow up" requests. Such a persistent socket is called a "Keep Alive"
socket. Clients are more likely to abuse such persistent connections, however, by
leaving them open but not making any followup requests, so stricter timeout values are
usually assigned to such connections.
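For example, honoring Keep-Alive requests while closing idle persistent connections quickly might look like the following; the values shown are illustrative, not the shipped defaults.

```apache
# Honor persistent connections, but close them promptly when idle.
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5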

Managing the Server Pool: StartServers, {Min,Max}SpareServers, MaxClients, and MaxRequestsPerChild

Recall that most Unix daemons use a forking model. Upon receiving a new client
connection, the server process forks (duplicates itself), dedicating the new child to the
newly connected client, while the parent returns to listening for new connections.

In order to gain efficiency, the Apache web server takes the uncommon approach of
"pre-forking" child daemons to handle client connections, before the clients ever arrive.
Even on an unused web server, several httpd processes exist. The parent daemon is
generally run as the user <username>root</username>, and the pre-forked child
daemons as the user <username>apache</username>. The collection of httpd processes
is often referred to as the "server pool".

[root@station ~]# ps aux | grep httpd
root      2334  0.0  2.0  19504 10488 ?      Ss  05:57  0:00 /usr/sbin/httpd
apache    2359  0.0  2.0  19504 10624 ?      S   05:57  0:00 /usr/sbin/httpd
apache    2360  0.0  2.0  19504 10624 ?      S   05:57  0:00 /usr/sbin/httpd
apache    5248  0.0  2.0  19504 10628 ?      S   07:04  0:00 /usr/sbin/httpd
root      7636  0.0  0.1   3768   716 pts/5  S+  08:56  0:00 grep httpd

The following directives manage the dynamics of the server pool.

Figure 1. /etc/httpd/conf/httpd.conf

# prefork MPM
# StartServers: number of server processes to start
# MinSpareServers: minimum number of server processes which are kept spare
# MaxSpareServers: maximum number of server processes which are kept spare
# ServerLimit: maximum value for MaxClients for the lifetime of the server
# MaxClients: maximum number of server processes allowed to start
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule prefork.c>
StartServers          8
MinSpareServers       5
MaxSpareServers      20
ServerLimit         256
MaxClients          256
MaxRequestsPerChild 4000
</IfModule>

StartServers: The initial size of the server pool (in number of processes).

{Min,Max}SpareServers: The server pool scales dynamically. If a web server gets
blitzed with many requests, more child daemons will be started. If things go quiet,
unused child daemons will be killed. These directives place bounds on the server
pool size.

ServerLimit, MaxClients: The number of concurrent requests can be limited.
Connection requests above this limit will be greeted with a quick "I'm busy... come
back later", rather than actually handled. The distinction between the
<directive>ServerLimit</directive> and <directive>MaxClients</directive>
directives is subtle, and in practice they are usually set together to the same value.

MaxRequestsPerChild: In order to improve stability, a given child daemon will
only serve so many requests before it kills itself, and a new daemon must be started.
(This suicide helps curtail memory leaks in poorly written libraries and CGI
executables.)

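Tying these directives together, a deliberately small pool (for a lightly loaded or memory constrained host) might be configured as follows. The values are illustrative only:

```apache
<IfModule prefork.c>
StartServers          2
MinSpareServers       2
MaxSpareServers       4
ServerLimit          64
MaxClients           64
MaxRequestsPerChild 4000
</IfModule>
```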
Controlling the Server Address: Listen

Figure 1. /etc/httpd/conf/httpd.conf

#
# Listen: Allows you to bind Apache to specific IP addresses and/or
# ports, in addition to the default. See also the <VirtualHost>
# directive.
#
# Change this to Listen on specific IP addresses as shown below to
# prevent Apache from glomming onto all bound IP addresses (0.0.0.0)
#
#Listen 12.34.56.78:80
Listen 80

The <directive>Listen</directive> directive controls which address the server binds to.
In the default configuration (above), the server binds to the IP address 0.0.0.0
(implying every active interface), port 80. Multiple <directive>Listen</directive> lines
can be used to specify that the daemon should bind to multiple ports and/or addresses.
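For example, binding to port 80 on all interfaces and additionally to port 8888 on the loopback address only would look like the following (illustrative; as noted in the exercises, SELinux may need adjusting before Apache can bind to nonstandard ports):

```apache
Listen 80
Listen 127.0.0.1:8888
```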

Extending the Web Server: LoadModule

The Apache web server is modular by design. The core web server is actually fairly
minimal, with various modules providing much of the interesting behavior. Modules
may either be "static", meaning that they're part of the core executable and can never be
removed, or "dynamic", meaning that an administrator can control if the module is
loaded or not during startup.

Apache dynamic modules are located in the /usr/lib/httpd/modules directory, and are
loaded using the <directive>LoadModule</directive> directive.

Figure 1. /etc/httpd/conf/httpd.conf

#
# Dynamic Shared Object (DSO) Support
#
# To be able to use the functionality of a module which was built as a DSO you
# have to place corresponding `LoadModule' lines at this location so the
# directives contained in it are actually available _before_ they are used.
# Statically compiled modules (those listed by `httpd -l') do not need
# to be loaded here.
#
# Example:
# LoadModule foo_module modules/mod_foo.so
#
LoadModule auth_basic_module modules/mod_auth_basic.so
LoadModule auth_digest_module modules/mod_auth_digest.so
LoadModule authn_file_module modules/mod_authn_file.so
LoadModule authn_alias_module modules/mod_authn_alias.so
LoadModule authn_anon_module modules/mod_authn_anon.so
...
LoadModule include_module modules/mod_include.so
LoadModule log_config_module modules/mod_log_config.so
LoadModule logio_module modules/mod_logio.so
LoadModule env_module modules/mod_env.so
LoadModule ext_filter_module modules/mod_ext_filter.so
LoadModule mime_magic_module modules/mod_mime_magic.so
...
#
# Load config files from the config directory "/etc/httpd/conf.d".
#
Include conf.d/*.conf

The various modules tend to introduce new configuration directives to modify their
behavior. For example, the log_config_module provides the
<directive>LogFormat</directive> directive, which we will encounter later. In the
configuration file, the module must be loaded (with
<directive>LoadModule</directive>) before any directives it provides are
encountered.

In order to ease the distribution of modules using a package managed system (such
as RPM), the <directive>Include</directive> directive specifies external
configuration files to include, either directly or by using pathname expansion (file
globbing).
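The effect of the glob can be sketched in shell, using a scratch directory with illustrative filenames; note that only names matching *.conf are picked up, which is why drop-in files must end in .conf to take effect.

```shell
# Sketch: what "Include conf.d/*.conf" matches. Filenames are illustrative.
tmp=$(mktemp -d)
mkdir "$tmp/conf.d"
touch "$tmp/conf.d/ssl.conf" "$tmp/conf.d/welcome.conf" "$tmp/conf.d/README"
# Expand the glob the way the Include directive would (relative to ServerRoot).
included=$(cd "$tmp" && ls conf.d/*.conf)
printf '%s\n' "$included"
rm -rf "$tmp"
```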

The Main Section

The Main section of the configuration file includes configuration that affects the
primary server, but directives in this section can be overridden by any virtual server.

Figure 1. /etc/httpd/conf/httpd.conf

### Section 2: 'Main' server configuration

#
# The directives in this section set up the values used by the 'main'
# server, which responds to any requests that aren't handled by a
# <VirtualHost> definition. These values also provide defaults for
# any <VirtualHost> containers you may define later in the file.
#
# All of these directives may appear inside <VirtualHost> containers,
# in which case these default settings will be overridden for the
# virtual host being defined.

Server Identity: ServerName and ServerAdmin

The first two directives in the main section help establish the identity of the server.

Figure 1. /etc/httpd/conf/httpd.conf

#
# ServerAdmin: Your address, where problems with the server should be
# e-mailed. This address appears on some server-generated pages, such
# as error documents. e.g. admin@your-domain.com
#
ServerAdmin root@localhost

#
# ServerName gives the name and port that the server uses to identify itself.
# This can often be determined automatically, but we recommend you specify
# it explicitly to prevent problems during startup.
...
#ServerName www.example.com:80

The <directive>ServerAdmin</directive> directive is mainly cosmetic. The email
address is listed in the footer of the default error pages.

For simple hosts, with a single external interface and therefore a clear concept of a
hostname, the <directive>ServerName</directive> can be automatically
determined. If in doubt, however, it should be specified manually. (For example, if
the server is bound to multiple interfaces, the preferred name should be configured
explicitly).

Server Content: the DocumentRoot

The <directive>DocumentRoot</directive> directive, one of the most fundamentally
important, identifies where in the filesystem the information to be served is found.
Recall that when the file portion of a URL is translated to a file in the filesystem, the
document root provides the base of that translation. This directive is probably the one
most often overridden by a Virtual Host.

The following default specifies the Red Hat Enterprise Linux document root as
/var/www/html.

Figure 1. /etc/httpd/conf/httpd.conf

#
# DocumentRoot: The directory out of which you will serve your
# documents. By default, all requests are taken from this directory, but
# symbolic links and aliases may be used to point to other locations.
#
DocumentRoot "/var/www/html"
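The basic translation is simple concatenation of the document root and the file portion of the URL, which can be sketched in shell (paths are the Red Hat Enterprise Linux defaults; this mimics, rather than invokes, the server's logic):

```shell
# Sketch: translating the file portion of a URL to a filename.
document_root="/var/www/html"
url_path="/readings/index.html"    # file portion of http://host/readings/index.html
filename="${document_root}${url_path}"
echo "$filename"    # /var/www/html/readings/index.html
```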

Specifying the Directory Index File: DirectoryIndex

In a previous lesson, we discussed the role of an index file, called index.html. We now
see that the name of the file is configurable.

Figure 1. /etc/httpd/conf/httpd.conf

#
# DirectoryIndex: sets the file that Apache will serve if a directory
# is requested.
#
# The index.html.var file (a type-map) is used to deliver content-
# negotiated documents. The MultiViews Option can be used for the
# same purpose, but it is much slower.
#
DirectoryIndex index.html index.html.var

Notice that if multiple file names are specified, each will be searched for in sequence.
Specifying too many alternatives, however, could lead to poor performance.

For example, if migrating content from a Microsoft based server, setting
<directive>DirectoryIndex</directive> to the following would be easier than renaming
every file named index.htm to index.html.

DirectoryIndex index.html index.htm
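The in-order search can be sketched in shell, using a scratch directory containing only an index.htm (again mimicking, not invoking, the server's logic):

```shell
# Sketch: the server tries each DirectoryIndex candidate in order and
# serves the first one that exists.
dir=$(mktemp -d)
touch "$dir/index.htm"      # only the Microsoft-style name is present
found=""
for name in index.html index.htm; do
  if [ -f "$dir/$name" ]; then
    found="$name"
    break
  fi
done
echo "$found"    # index.htm
rm -rf "$dir"
```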


Tip
Index files can even be specified as an absolute reference. What do you think
would be the effect of a configuration such as the following?

DirectoryIndex index.html /cgi-bin/index.cgi

Collecting Client Identities: HostnameLookups

Buried deep within the configuration file is an important directive called
<directive>HostnameLookups</directive>.

Figure 1. /etc/httpd/conf/httpd.conf

#
# HostnameLookups: Log the names of clients or just their IP addresses
# e.g., www.apache.org (on) or 204.62.129.132 (off).
# The default is off because it'd be overall better for the net if people
# had to knowingly turn this feature on, since enabling it means that
# each client request will result in AT LEAST one lookup request to the
# nameserver.
#
HostnameLookups Off

The web server can easily determine the IP address of any client which is making a web
request: it's part of the request's IP protocol header. In order to determine the hostname
of the client, however, the web server must work harder: it must perform a reverse DNS
lookup on the client's IP address. This reverse lookup increases both time and network
traffic on the part of the server, so by default, it's disabled. As a result, all logging and
access control lists are implemented by IP address, not by hostname.

If you desire logs and access control lists to use client hostnames instead of IP
addresses, and are willing to pay the price in performance,
<directive>HostnameLookups</directive> can be set to <directive>on</directive>.

Logging: ErrorLog, LogLevel, LogFormat, and CustomLog

The Apache web server maintains two types of logs: transaction logs and error logs.
Transaction logging occurs with every web request ("hit"), and is highly configurable,
potentially logging to multiple files. In contrast, there is only one error log, and only
two questions associated with it: where, and how much. We start with the simpler of the
two.

Error Logging: ErrorLog and LogLevel

Figure 1. /etc/httpd/conf/httpd.conf

#
# ErrorLog: The location of the error log file.
# If you do not specify an ErrorLog directive within a <VirtualHost>
# container, error messages relating to that virtual host will be
# logged here. If you *do* define an error logfile for a <VirtualHost>
# container, that host's errors will be logged there and not here.
#
ErrorLog logs/error_log

#
# LogLevel: Control the number of messages logged to the error_log.
# Possible values include: debug, info, notice, warn, error, crit,
# alert, emerg.
#
LogLevel warn

By default, the web server logs to the file /var/log/httpd/error_log (recall the role
of the <directive>ServerRoot</directive> directive, and the /etc/httpd/logs
symlink). For the main server, it's hard to think of a reason to ever change it, though
virtual hosts often override it.

More interesting is the <directive>LogLevel</directive>, which determines how much
information is logged. The vocabulary draws directly from the syslog service. When
troubleshooting, an administrator often ratchets up the logging by setting the
<directive>LogLevel</directive> to <directive>debug</directive>, for example. Of
course, more copious logging slows down overall performance, so once a problem has
been resolved, logging is returned to a more suitable default.

Transaction Logging: LogFormat and CustomLog

For every web request, there is a large amount of information that an administrator can
choose to log (or not). Such transaction logs are often referred to as "access logs". The
<directive>LogFormat</directive> directive allows administrators to assign names to
collections of information, so that they are easy to refer to later. This is all
<directive>LogFormat</directive> does, however. In order to use one of the formats,
they must be associated with a <directive>CustomLog</directive>.

Figure 1. /etc/httpd/conf/httpd.conf

#
# The following directives define some format nicknames for use with
# a CustomLog directive (see below).
#
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{Referer}i -> %U" referer
LogFormat "%{User-agent}i" agent

# "combinedio" includes actual counts of actual bytes received (%I) and sent (%O);
# this requires the mod_logio module to be loaded.
#LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio

The following table illustrates some of the parameters most commonly used in access
logs.

Table 1. Apache Log Parameters

Parameter   References                              Example
%h          Remote host (IP or hostname)            127.0.0.1
%u          Remote user (for HTTP authentication)   elvis
%t          Timestamp                               [15/Jul/2005:06:55:44 -0400]
%r          Request line (from HTTP protocol)       GET /icons/compressed.gif HTTP/1.1
%s          HTTP response status code               200
%b          Response size (in bytes)                1079
%{name}i    HTTP header name                        (depends on name)

Many more exist as well. As usual, with all of this flexibility comes the need for
convention. Two commonly used conventions are the common format and the combined
format, which are the first two formats defined above. The common format records IP
address, username (if any), timestamp, request line, response status, and number of
bytes transferred. [1]

The combined format adds the identity of the client application, and the referring page
(if any). While the combined format is used by default in Red Hat Enterprise Linux,
administrators could well choose to drop back to the common format to save space and
improve performance.

Many external log analysis utilities (such as webalizer) rely on logs being in a standard
format, so an administrator should consider the consequences before changing the log
format arbitrarily.
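The common format is easy to pick apart with standard tools. A sketch, using a fabricated sample line rather than output from a live server:

```shell
# Parse a sample common-format log line into a few fields.
# (Sample data only; field positions follow "%h %l %u %t \"%r\" %>s %b".)
line='127.0.0.1 - elvis [15/Jul/2005:06:55:44 -0400] "GET /icons/compressed.gif HTTP/1.1" 200 1079'
host=$(printf '%s\n' "$line" | awk '{print $1}')
status=$(printf '%s\n' "$line" | awk '{print $(NF-1)}')   # next-to-last field
bytes=$(printf '%s\n' "$line" | awk '{print $NF}')        # last field
echo "host=$host status=$status bytes=$bytes"
```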

Finally, once a format has been decided, it can be associated with a log file using the
<directive>CustomLog</directive> directive.

Figure 2. /etc/httpd/conf/httpd.conf

# The location and format of the access logfile (Common Logfile Format).
# If you do not define any access logfiles within a <VirtualHost>
# container, they will be logged here. Contrariwise, if you *do*
# define per-<VirtualHost> access logfiles, transactions will be
# logged therein and *not* in this file.
#
#CustomLog logs/access_log common

#
# If you would like to have separate agent and referer logfiles, uncomment
# the following directives.
#
#CustomLog logs/referer_log referer
#CustomLog logs/agent_log agent

#
# For a single logfile with access, agent, and referer information
# (Combined Logfile Format), use the following directive:
#
CustomLog logs/access_log combined

As the above configuration suggests, multiple log files, each containing different
information, could be updated with each hit, though of course performance is a
consideration. By default, Red Hat Enterprise Linux only updates the single file
/var/log/httpd/access_log, using the combined format.
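As a sketch, logging every hit to a second file in the common format, in addition to the default combined log, would only require one more <directive>CustomLog</directive> line (the common_log filename here is arbitrary):

```apache
CustomLog logs/access_log combined
CustomLog logs/common_log common
```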

Remapping the URL Namespace: Alias

Up until now, we have had a very clean concept of the URL namespace: the file portion
of a URL maps directly to a file which exists underneath the document root directory.
The <directive>Alias</directive> directive allows administrators to make arbitrary
mappings from a portion of the URL namespace to any directory in the filesystem.

Figure 1. /etc/httpd/conf/httpd.conf

# Aliases: Add here as many aliases as you need (with no limit). The format is
# Alias fakename realname
#
# Note that if you include a trailing / on fakename then the server will
# require it to be present in the URL. So "/icons" isn't aliased in this
# example, only "/icons/". If the fakename is slash-terminated, then the
# realname must also be slash terminated, and if the fakename omits the
# trailing slash, the realname must also omit it.
#
# We include the /icons/ alias for FancyIndexed directory listings. If you
# do not use FancyIndexing, you may comment this out.
#
Alias /icons/ "/var/www/icons/"

As an example, the default Red Hat Enterprise Linux configuration aliases
http://localhost/icons/ to the directory /var/www/icons/, which is not underneath the
document root, but a sibling of it. The remapping should be easy enough to confirm by
following the above link, and taking an ls of the icons directory.
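The remapping can be sketched in shell; this mimics the server's logic for just this one alias, with paths taken from the Red Hat Enterprise Linux defaults:

```shell
# Sketch: URLs under /icons/ map outside the DocumentRoot;
# everything else maps beneath it.
url="/icons/compressed.gif"
case "$url" in
  /icons/*) filename="/var/www/icons/${url#/icons/}" ;;
  *)        filename="/var/www/html${url}" ;;
esac
echo "$filename"    # /var/www/icons/compressed.gif
```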

For better or for worse, we now have a way to expose portions of our filesystem which
are not under the document root. Another option is the use of symbolic links, which will
be discussed in more detail shortly.

Also, notice the comments about trailing slashes, which have often been a source of
confusion. The Apache webserver automatically redirects clients which refer to
directories without the trailing slash to an equivalent URL which does (watch closely as
you access http://localhost/example, and note that the browser ends up showing the
omitted trailing slash). This redirect causes directory-related configuration which
omits the trailing slash to be interpreted twice, which can cause confusion.

The Answer Book: http://localhost/manual

By now you could well be bewildered by the many different configuration directives,
and in many ways we've just touched the tip of the iceberg. This seems a good time to
introduce the manual, which in Red Hat Enterprise Linux ships as the separate
httpd-manual package. Once installed, the manual can be accessed at http://localhost/manual.

[root@station ~]# yum install httpd-manual
...
=============================================================================
 Package            Arch    Version        Repository    Size
=============================================================================
Installing:
 httpd-manual       i386    2.2.3-6.el5    rha-rhel      831 k
...
Installed: httpd-manual.i386 0:2.2.3-6.el5
Complete!
[root@station ~]# service httpd restart
Stopping httpd:                                            [  OK  ]
Starting httpd:                                            [  OK  ]

The manual provides comprehensive documentation, organized by directive name,
module name, or by topic (such as "Log Files" or "Virtual Hosts"). Anyone wishing to
quickly refresh memories, or learn more about Apache configuration, should definitely
load the manual as well.

Exercises

Lab Exercise
Objective: Configure the Apache web server.

Estimated Time: 45 mins.

Specification

You will probably want to make a backup of the main Apache configuration file
(/etc/httpd/conf/httpd.conf) before starting this exercise, so that you can later
restore the default configuration. If you have not already downloaded
http://rha-server/pub/rha/rha230/readings.tgz and extracted its contents into the
/var/www/html directory (as specified in the previous exercise), do so now.

Edit your Apache configuration so that the server meets the following specifications.
The suggested technique is to duplicate the relevant lines of your configuration file,
comment out the original configuration, and edit the new line to make your changes.
You will probably want to make incremental changes, checking your configuration as
you go.

1. Configure the Apache webserver so that it accepts HTTP/1.1 KeepAlive
requests, but will only wait 3 seconds for a followup request before closing the
connection.

Hint: you can confirm this configuration by capturing a transaction between the
Firefox browser and your webserver with ethereal, and examining the HTTP
headers of both the request and response.

2. Manage the bounds of the server pool, such that there are always between 2 and
4 (inclusive) child daemons present.
3. The Apache server should be bound to port 8888 (on at least the loopback
address), in addition to port 80 (on all interfaces). (Note: you will need to drop
SELinux into permissive mode in order to allow Apache to bind to a port other
than 80 and 443).
4. Configure the web server such that index.htm is recognized as an index file, as
well as index.html. Confirm your configuration by removing the file
/var/www/html/readings/relat10h/index.html that you created in the
previous exercise, if it exists, leaving the original
/var/www/html/readings/relat10h/index.htm, and referencing
http://localhost/readings/relat10h/.
5. Configure the server so that clients are logged by hostname (when available) as
opposed to IP address. (Hint: You are not expected to need to edit any
<directive>LogFormat</directive> directives).
6. Set the log level for the error log to <directive>debug</directive>.
7. In addition to the default logging, have every web request logged to the file
/var/log/httpd/common_log, using what is commonly referred to as the
common format.
8. In the separate configuration file /etc/httpd/conf.d/rha.conf, establish an
Alias, so that the URL http://localhost/images/ refers to the directory
/var/www/html/readings/relat10h/pics. (If the relevant directory is still
named picts, rename it or symlink it to pics).

Deliverables

1. A running Apache webserver that accepts Keep-Alive requests, but closes
connections after 3 seconds of inactivity.
2. The server should maintain a server pool of between 2 and 4 pre-forked child
daemons.
3. The server should be bound to the loopback address's port 8888, in addition to
the normal port 80.
4. The server should treat files named index.htm as index files, in addition to
the standard index.html.
5. Transaction logging should log clients by hostname, if available.
6. The error log should log all messages with <directive>debug</directive> and
higher priority.
7. In addition to the standard access_log, a transaction log named
/var/log/httpd/common_log should be kept, logging in the common
format.
8. The URL http://localhost/images/ should resolve to
/var/www/html/readings/relat10h/pics, due to an alias established in
the /etc/httpd/conf.d/rha.conf configuration file.

Apache Configuration: Containers

Key Concepts

The Apache web server allows context dependent configuration through the use
of <directive>Directory</directive>, <directive>Location</directive>,
<directive>Files</directive>, and <directive>VirtualHost</directive>
containers.
Often, the <directive>Options</directive> directive is used within containers to
allow or disallow symbolic link resolution (with
<directive>FollowSymLinks</directive>) and dynamic directory generation
(with <directive>Indexes</directive>), among other parameters.
Often, the <directive>Order</directive>, <directive>allow from</directive>,
and <directive>deny from</directive> directives are used within containers to
implement access control based on the client's IP address or hostname.
The default Red Hat Enterprise Linux configuration allows the resolution of
symbolic links almost everywhere, but limits the generation of dynamic indexes
to the intended document root directory.
Dynamic information about the Apache webserver can be obtained using custom
handlers which are conventionally associated with the /server-status and
/server-info locations.

Discussion

Tailoring Customization to Particular Content: Containers

The Apache webserver allows configuration to be customized to particular files or
directories using containers. Containers start with an XML-like opening tag, such as
<Directory ...>, and end with an XML-like closing tag, such as </Directory>.
Directives found within the container only affect files which fall under the container's
scope.

There are essentially four types of scoping containers, which are exemplified below and
itemized in the following table.

Figure 1. Sample Apache Containers

<Directory "/var/www/icons">
Options Indexes MultiViews
AllowOverride None
Order allow,deny
Allow from all
</Directory>

<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from .example.com
</Location>

<Files ~ "\.hide$">
Order allow,deny
Deny from all
</Files>

<VirtualHost *:80>
ServerAdmin webmaster@dummy-host.example.com
DocumentRoot /www/docs/dummy-host.example.com
ServerName dummy-host.example.com
ErrorLog logs/dummy-host.example.com-error_log
CustomLog logs/dummy-host.example.com-access_log common
</VirtualHost>

Table 1. Apache Scoping Containers

Directive                            Scope
<directive>Directory</directive>     All files which exist in or underneath the
                                     specified directory in the filesystem, after
                                     URL to filename translation occurs.
<directive>Location</directive>      All files which exist in or underneath the
                                     specified location in the URL namespace,
                                     before URL to filename translation occurs.
<directive>Files</directive>         All files which match the specified pattern,
                                     no matter where they exist in the filesystem
                                     or URL namespace.
<directive>VirtualHost</directive>   All files served by a particular virtual
                                     server. Virtual hosts will be covered in
                                     detail in a later lesson.

The argument to the opening tag specifies the relevant file or directory (or, in the case
of <directive>VirtualHost</directive>, IP address). The filename may either be explicit,
or shell-like pathname expansion (file globbing) can be used.

Common Container Configuration

Skimming the containers exemplified above, one finds that container configuration
often involves the following three concepts.

1. Options: Various capabilities of the web server are grouped under a general
<directive>Options</directive> directive.
2. ACLs: The web server allows access control lists (or ACLs, informally
pronounced "Ack-uls") to specify which clients are allowed to access
information, using the <directive>Order</directive>,
<directive>Allow</directive>, and <directive>Deny</directive> directives.
(Access control can also be based on authenticated users, unfortunately a topic
beyond the scope of the current course).
3. Overrides: If allowed with the <directive>AllowOverride</directive> directive,
local configuration files intermixed with webserver content can dynamically
override the startup configuration.

We look at each of these syntaxes in turn.

General Options: Options

The Apache server supports the following options, which are specified as arguments to
the <directive>Options</directive> directive, usually within a scoping container. Of
these, the first two are most commonly used.

Table 1. Apache Options

<directive>Indexes</directive>
    When a URL references a directory (as opposed to a regular file), no
    index.html file is present (more on this in a bit), and this option is
    enabled, the web server will return an automatically generated directory
    listing. If <directive>Indexes</directive> is disabled, a 403 error page
    (Access Forbidden) will be returned to the client.

<directive>FollowSymLinks</directive>
    This option must be enabled in order for the webserver to resolve (follow)
    a symbolic link. The option is only relevant within a Directory container.

<directive>SymLinksIfOwnerMatch</directive>
    A qualification of the <directive>FollowSymLinks</directive> option, where
    the symlink will only be followed if the file owner of the resulting file
    is the same as the file owner of the link itself.

<directive>ExecCGI</directive>
    Allow CGI executables to be executed from within this scope. (More on
    these later).

<directive>Includes</directive>, <directive>IncludesNOEXEC</directive>
    Server side includes are allowed (or, in the latter case, mostly allowed)
    from within this scope. Server side includes are beyond the scope of this
    course.

<directive>Multiviews</directive>
    If enabled, content negotiation between the client and the server is
    supported. This allows a server to serve a document in the most
    appropriate of multiple languages, for example. Further discussion of
    <directive>Multiviews</directive> is beyond the scope of this course.

<directive>All</directive>
    This option refers to all of the previous options collectively, with the
    exception of <directive>Multiviews</directive>. Unless otherwise
    specified, this is the default configuration. (Recall that in Red Hat
    Enterprise Linux, however, a different policy applies to the root
    directory, effectively establishing a different default.)

Why not Indexes?

The decision to allow the web server to automatically generate indexes or not is really a
matter of control. If indexes are automatically generated, then merely locating a file
underneath the document root allows anyone to view it or copy it (often with automated
command line clients such as wget), unless an index.html file is created to hide files
within a particular directory. In contrast, if indexes are not allowed, files must be
explicitly linked from other files (index.html or otherwise) to be easily discovered.

Many low maintenance, public web sites leave indexes on (such as the official Linux
kernel repository). Other web sites, hoping for a more professional look or more refined
control of information, do not.

Why not resolve Symbolic Links?

Again, the decision to allow symlink resolution is basically one of control. If symlinks
are not allowed, an administrator has a clear concept of what portions of the file system
are exposed through the web server (only files underneath the document root). If
symlinks are resolved, however, a symlink underneath the document root could expose
any other part of the filesystem.

More subtly, the decision to not resolve symlinks can degrade performance. When
resolving a path to reference a file, the kernel automatically resolves symlinks. (If you
were to cat the file /foo/biz/baz/buzz, you do not need to worry if the directory biz
or baz is actually a symlink). If symlinks are disabled, however, the web server must
make a system call on each of the nodes within a file path, asking "is it a symlink? is it a
symlink? is it a symlink?" This degradation is one of the reasons why the default Red
Hat Enterprise Linux configuration leaves <directive>FollowSymLinks</directive>
enabled.

Options Syntax

The <directive>Options</directive> directive takes effect for the scope specified by its
enclosing container. For example, the following container would enable indexes and
symlink resolution for all files underneath the directory /var/www/html.

<Directory /var/www/html>
Options FollowSymLinks Indexes
</Directory>

The following container, however, would enable indexes and server side includes
underneath /var/www/html/widgets.

<Directory /var/www/html/widgets>
Options Indexes Includes
</Directory>

The directory /var/www/html/widgets does not inherit its options from
/var/www/html, but instead gets its configuration entirely from the new
<directive>Options</directive> line. Because <directive>FollowSymLinks</directive>
is not mentioned, symlinks underneath /var/www/html/widgets will not be resolved.

In contrast, options can be preceded by a "+" or "-", implying that options should be
inherited from the enclosing scope, with the simple addition or stripping of a particular
option. Consider rewriting the above container as follows.
<Directory /var/www/html/widgets>
Options +Includes
</Directory>

In this case, the /var/www/html/widgets directory would have
<directive>Includes</directive>, <directive>Indexes</directive>, and
<directive>FollowSymLinks</directive> enabled (the latter two inherited from
/var/www/html).

Similarly, the following container would leave /var/www/html/widgets with only the
<directive>FollowSymLinks</directive> option enabled.

<Directory /var/www/html/widgets>
Options -Indexes
</Directory>
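Because the replace-versus-adjust behavior of <directive>Options</directive> is a common source of confusion, the following sketch (illustrative Python, not Apache's implementation) models the two evaluation modes described above: a list of bare options replaces the inherited set, while a list of "+"/"-" prefixed options adjusts it.

```python
# Illustrative model (NOT Apache source code) of how an Options line is
# evaluated against the options inherited from the enclosing scope. A list
# of bare options replaces the inherited set; a list of all-prefixed
# options ("+"/"-") adjusts it. (Real Apache rejects a mix of the two
# forms at startup.)
def effective_options(inherited, new):
    if new and all(opt[0] in "+-" for opt in new):
        result = set(inherited)
        for opt in new:
            if opt.startswith("+"):
                result.add(opt[1:])      # e.g. Options +Includes
            else:
                result.discard(opt[1:])  # e.g. Options -Indexes
        return result
    return set(new)  # bare options: replace the inherited set entirely

parent = {"Indexes", "FollowSymLinks"}   # as set for <Directory /var/www/html>
print(sorted(effective_options(parent, ["Indexes", "Includes"])))  # ['Includes', 'Indexes']
print(sorted(effective_options(parent, ["+Includes"])))  # ['FollowSymLinks', 'Includes', 'Indexes']
print(sorted(effective_options(parent, ["-Indexes"])))   # ['FollowSymLinks']
```

The three calls correspond to the three widgets containers shown above: the bare list drops <directive>FollowSymLinks</directive>, while the prefixed forms keep the inherited options and add or strip one.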

Client Access Control: Order, Allow, Deny

The Apache web server allows an administrator to impose access control restrictions on
a directory by directory (or even file by file) basis using access control
lists. These ACLs are composed of the following directives.

The Allow Directive

The <directive>Allow</directive> directive uses the following syntax to specify
which clients are allowed to connect to a given resource.

Allow from client_specification

The client_specification is composed of a whitespace separated list of any of
the following elements.

Table 1. Apache ACL client specification

<directive>ALL</directive> (example: <directive>ALL</directive>)
    All clients.

Full IP address (example: 192.168.0.3)
    The specified client.

Partial IP address (example: 172.63.)
    All clients whose IP address begins as specified.

Network/Netmask notation (example: 192.168.1.64/255.255.255.192)
    All clients who belong to the specified subnet.

CIDR notation (example: 192.168.1.64/26)
    All clients who belong to the specified subnet (this example is
    completely equivalent to the preceding example).

A full or partial domain name (example: .example.com)
    All clients whose reverse lookup domain name ends as specified
    (reverse lookups must be enabled with
    <directive>HostnameLookups</directive>).

The Deny Directive

The <directive>Deny</directive> directive uses an identical syntax to specify
which clients are not allowed to connect to a given resource.

Deny from client_specification

The client_specification is composed of the same elements as for the
<directive>Allow</directive> directive.

The Order directive

Here's where things get interesting. Whenever client ACLs are specified with the
<directive>Allow</directive> and <directive>Deny</directive> directives, the order of
precedence must be specified with the <directive>Order</directive> directive.

The <directive>Order</directive> directive usually comes in one of two forms.

Order Allow,Deny

In this case, any clients which are unspecified (not matching any rule) or over specified
(they match both an allow and deny rule) are denied.

Order Deny,Allow

In this case, any clients which are unspecified or over specified are allowed.
Surprisingly, no spaces are allowed around the comma in either case.

Some examples are in order.

Example 1
<Directory /some/sensitive/content>
Order Deny,Allow
Deny from All
Allow from 192.168.0.
</Directory>

In this case, only clients from within the 192.168.0.0/255.255.255.0 subnet are allowed
to access files underneath /some/sensitive/content.

Example 2
<Directory /keep/them/out>
Order Allow,Deny
Allow from 192.168.0.
Deny from 192.168.0.4
</Directory>

In this case, clients from within the 192.168.0.0/255.255.255.0 subnet are allowed to
access files underneath /keep/them/out, except for client 192.168.0.4. All clients
outside of the subnet are not allowed access.

Example 3
<Directory /only/for/example>
HostnameLookups on
Order Allow,Deny
Allow from .example.com
</Directory>

In this case, clients from within the example.com domain are allowed to access
files underneath /only/for/example.

If you are having trouble figuring out how the term "order" applies to the
effect of the <directive>Order</directive> directive, your author sympathizes.
With a little experience, however, the syntax begins to make a certain sense.
Until then, make sure that you confirm any ACLs by actually trying to access
the material from the appropriate clients.
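Until that experience arrives, one way to internalize the rule is to model it. The following sketch (illustrative Python, not Apache's implementation) reduces the <directive>Order</directive> directive to its effect on matched, unmatched, and doubly matched clients.

```python
# Illustrative model of Apache's Order directive (a sketch, NOT Apache
# source). The two booleans say whether the client matched any Allow rule
# and any Deny rule within the container.
def allowed(order, matches_allow, matches_deny):
    if order == "Allow,Deny":
        # Default deny: the client must match an Allow rule and no Deny rule.
        return matches_allow and not matches_deny
    if order == "Deny,Allow":
        # Default allow: denied only when matching a Deny rule and no Allow rule.
        return not matches_deny or matches_allow
    raise ValueError("unrecognized Order value")

# Example 1 above: Order Deny,Allow; Deny from All; Allow from 192.168.0.
print(allowed("Deny,Allow", matches_allow=True,  matches_deny=True))   # True (subnet client)
print(allowed("Deny,Allow", matches_allow=False, matches_deny=True))   # False (everyone else)
# Unspecified and over specified clients under Allow,Deny are denied:
print(allowed("Allow,Deny", matches_allow=False, matches_deny=False))  # False
print(allowed("Allow,Deny", matches_allow=True,  matches_deny=True))   # False
```

Note how the subnet client in Example 1 matches both rules (it matches "Deny from All" and the Allow rule) yet is admitted, because over specified clients are allowed under Deny,Allow.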

Red Hat Enterprise Linux Default Configuration

Now that we know a little about containers, we're ready to examine some of the
containers that come in the default Red Hat Enterprise Linux Apache configuration. The
first container encountered establishes a fairly paranoid default policy.

Figure 1. /etc/httpd/conf/httpd.conf

#
# Each directory to which Apache has access can be configured with respect
# to which services and features are allowed and/or disabled in that
# directory (and its subdirectories).
#
# First, we configure the "default" to be a very restrictive set of
# features.
#
<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

In this case, the "/" in the opening tag is not syntax, but a reference to the
root directory. So from the root directory on down (i.e., everywhere), the
specified policies apply. Specifically, the only enabled option is
<directive>FollowSymLinks</directive>, and no overrides are allowed.

The next container loosens things up a bit for the directory /var/www/html. (Why was
this directory picked for special attention?)

Figure 2. /etc/httpd/conf/httpd.conf
<Directory "/var/www/html">

    #
    # Possible values for the Options directive are "None", "All",
    # or any combination of:
    #   Indexes Includes FollowSymLinks SymLinksifOwnerMatch ExecCGI MultiViews
    #
    # Note that "MultiViews" must be named *explicitly* --- "Options All"
    # doesn't give it to you.
    #
    # The Options directive is both complicated and important.  Please see
    # http://httpd.apache.org/docs-2.0/mod/core.html#options
    # for more information.
    #
    Options Indexes FollowSymLinks

    #
    # AllowOverride controls what directives may be placed in .htaccess files.
    # It can be "All", "None", or any combination of the keywords:
    #   Options FileInfo AuthConfig Limit
    #
    AllowOverride None

    #
    # Controls who can get stuff from this server.
    #
    Order allow,deny
    Allow from all

</Directory>

In answer to the above question, access to content beneath /var/www/html is
loosened a bit because that directory contains the expected content to be
served from the webserver. The container also contains some client access
control configuration, but only as an example, as the effect of the
configuration is to allow everyone.

Location Containers: server-status and server-info

We find the following two examples of <directive>Location</directive>
containers within the default configuration file, both commented out.

Figure 1. /etc/httpd/conf/httpd.conf

#
# Allow server status reports generated by mod_status,
# with the URL of http://servername/server-status
# Change the ".example.com" to match your domain to enable.
#
#<Location /server-status>
#    SetHandler server-status
#    Order deny,allow
#    Deny from all
#    Allow from .example.com
#</Location>

#
# Allow remote server configuration reports, with the URL of
#  http://servername/server-info (requires that mod_info.c be loaded).
# Change the ".example.com" to match your domain to enable.
#
#<Location /server-info>
#    SetHandler server-info
#    Order deny,allow
#    Deny from all
#    Allow from .example.com
#</Location>

Both of these provide examples of virtual locations, in that, if enabled (and
customized a bit), the server would respond to requests for
http://localhost/server-info and http://localhost/server-status. The URLs do
not map to any particular directory on the filesystem, however, so a
<directive>Directory</directive> container would have been inappropriate.

Each of these containers implements a custom handler using the
<directive>SetHandler</directive> directive. A thorough discussion of the
concept of a handler is beyond the scope of the current class, but essentially
a handler determines how the server responds to a request. The default
handler, which returns the contents of the referenced file to the client, is
the only handler we've encountered so far. Other handlers allow the web server
to respond differently to requests.

The server-status Handler

The server-status handler, when invoked, returns a page of status information
(formatted as HTML) back to the client. The following configuration would
attach this handler to the http://localhost/server-status URL, but restrict
access to 127.0.0.1.

<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>

The Apache web server responds to http://localhost/server-status with a page
of status information similar to the following.

Figure 1. Apache Web Server Status Page


The server-info Handler

Similarly, the server-info handler returns a dynamically generated page which reports
the web server's current configuration.

<Location /server-info>
SetHandler server-info
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>

With this configuration active, the Apache web server responds to
http://localhost/server-info with a page of configuration information similar
to the following.

Figure 1. Apache Web Server Info Page


Virtual Hosts

One of the reasons for the popularity of the Apache web server is that it can
easily take on the personality of any of multiple web servers, each of which
is referred to as a virtual host.

As a pre-requisite to virtual hosting, DNS (domain name service) must resolve
multiple hostnames to the single machine which is running the Apache web
server. As you will discover in the workbook on DNS, this is not difficult to
arrange. In our current discussion, however, we will assume that DNS is
appropriately configured.

There are two approaches to virtual hosting supported by the Apache web server: IP
based virtual hosting, and name based virtual hosting. We look at each of these in turn.

IP Based Virtual Hosting

For IP based virtual hosting, the machine running the Apache server must be assigned
multiple IP addresses. These addresses could either be a result of multiple Ethernet
cards (and thus multiple distinct network interfaces), or the result of a Linux trick called
IP aliasing, which assigns multiple IP addresses to a single Ethernet card.
For IP based virtual hosting, distinguishing the virtual hosts of the web server is trivial.
The web server merely needs to examine the server IP address which is part of the
incoming client request TCP/IP packet. Consider the machine which answers to the
hostname <hostname>www.republican.pol</hostname>, with an IP address of
192.168.0.1, and the hostname <hostname>www.democrat.pol</hostname>, with an IP
address of 192.168.0.2. (No, there is no top level domain <hostname>.pol</hostname> -
this is just an example).

<VirtualHost 192.168.0.1>
ServerAdmin webmaster@republican.pol
ServerName www.republican.pol
DocumentRoot /var/www/republican.pol
ErrorLog logs/republican.pol-error_log
CustomLog logs/republican.pol-access_log common
</VirtualHost>

<VirtualHost 192.168.0.2>
ServerAdmin webmaster@democrat.pol
ServerName www.democrat.pol
DocumentRoot /var/www/democrat.pol
ErrorLog logs/democrat.pol-error_log
CustomLog logs/democrat.pol-access_log common
</VirtualHost>

Now, requests for http://www.republican.pol/propaganda.html would be mapped to
the file /var/www/republican.pol/propaganda.html, and similarly, requests for
http://www.democrat.pol/propaganda.html would be mapped to the file
/var/www/democrat.pol/propaganda.html. The same web server would be serving
both web sites, but the client has no way of knowing. To the client, they seem
to be completely independent sites.

What configuration can be found within a <directive>VirtualHost</directive>
container? Anything found within the Main section of the configuration file.
The example above has the two hosts using distinct document roots and logs.
Just as easily, they could add distinct <directive>Alias</directive>es,
<directive>Options</directive>, ACLs, and a host of other configuration.

Name Based Virtual Hosts

While IP based virtual hosting is simple, it suffers from the fact that each distinct virtual
host must be assigned a distinct IP address, while publicly routable IP addresses are
often a precious resource. For this reason, name based virtual hosting was developed.

With name based virtual hosting, multiple hostnames resolve to the same IP address.
For example, the hostnames <hostname>www.democrat.pol</hostname>,
<hostname>www.libertarian.pol</hostname>, and
<hostname>www.green.pol</hostname> could all resolve to the IP address 192.168.0.2.
In this case, however, the web server has a harder time distinguishing the various hosts,
because the IP address of the server in the TCP/IP request packet for each is the same.

The solution is that the web server needs to "dig deeper" into the HTTP
request itself. Starting with HTTP/1.1, clients are required to supply a
Host: HTTP header with every web request, which identifies the hostname of
the requested site. The server can then attempt to match the supplied
hostname with the <directive>ServerName</directive> of the requested site.
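As a sketch of what the client actually sends, the following builds the raw HTTP/1.1 request for a page from www.democrat.pol (hostname from the running example; the request is assembled by hand here purely for illustration).

```python
# Build the raw HTTP/1.1 request a browser would send to a name based
# virtual host. Both vhosts share one IP address; only the Host header
# tells the server which site the client wants.
def build_request(host, path="/"):
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        f"Connection: close\r\n"
        f"\r\n"
    )

req = build_request("www.democrat.pol", "/propaganda.html")
print(req.splitlines()[1])  # Host: www.democrat.pol
```

An HTTP/1.0 client omitting the Host header is exactly the legacy case discussed later: the server then has nothing to match against and falls back to its default virtual host.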

In order to configure the Apache web server to "dig deeper" into the HTTP protocol in
this manner, the <directive>NameVirtualHost</directive> directive must be used to
identify a server IP address as one which is being used for name based virtual hosting.
Consider the following extension to the example above.

<VirtualHost 192.168.0.1>
ServerAdmin webmaster@republican.pol
ServerName www.republican.pol
DocumentRoot /var/www/republican.pol
ErrorLog logs/republican.pol-error_log
CustomLog logs/republican.pol-access_log common
</VirtualHost>

NameVirtualHost 192.168.0.2

<VirtualHost 192.168.0.2>
ServerAdmin webmaster@democrat.pol
ServerName www.democrat.pol
DocumentRoot /var/www/democrat.pol
ErrorLog logs/democrat.pol-error_log
CustomLog logs/democrat.pol-access_log common
</VirtualHost>

<VirtualHost 192.168.0.2>
ServerAdmin webmaster@libertarian.pol
ServerName www.libertarian.pol
DocumentRoot /var/www/libertarian.pol
ErrorLog logs/libertarian.pol-error_log
CustomLog logs/libertarian.pol-access_log common
</VirtualHost>

<VirtualHost 192.168.0.2>
ServerAdmin webmaster@green.pol
ServerName www.green.pol
DocumentRoot /var/www/green.pol
ErrorLog logs/green.pol-error_log
CustomLog logs/green.pol-access_log common
</VirtualHost>

NameVirtualHost: The IP address 192.168.0.2 has now been identified as an
address for which the server is implementing name based virtual hosting. Any
request received over this IP address will now have its HTTP headers examined for
the name of the server.
ServerName: The hostname supplied by the HTTP headers will be matched against
the <directive>ServerName</directive> directive of all virtual hosts which share
the relevant IP address. The <directive>ServerName</directive> directive now
takes on new importance.

What if the same virtual host should answer to more than one hostname (such as
<hostname>www.democrat.pol</hostname> and just
<hostname>democrat.pol</hostname>)? The <directive>ServerAlias</directive>
directive can be used to add multiple names to consider when attempting to
find a matching virtual host, as in the following example.

<VirtualHost 192.168.0.2>
ServerAdmin webmaster@democrat.pol
ServerName www.democrat.pol
ServerAlias democrat.pol democrat www.donkey.pol donkey.pol donkey
DocumentRoot /var/www/democrat.pol
ErrorLog logs/democrat.pol-error_log
CustomLog logs/democrat.pol-access_log common
</VirtualHost>

What if, probably due to a misconfiguration, a match is not found amongst the
various 192.168.0.2 virtual hosts? The answer is that Apache defaults to the
first defined server on that IP address, in this case,
<hostname>www.democrat.pol</hostname>. Once a virtual host has been defined
for a <directive>NameVirtualHost</directive> IP address, requests over that IP
address will never fall through to the main server.

Notice that, in the example above, the server is really simultaneously implementing IP
based virtual hosting (over IP address 192.168.0.1) and name based virtual hosting
(over IP address 192.168.0.2).
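The matching and fallback behavior described above can be modeled as follows (a sketch, not Apache's implementation; the vhost data comes from the running example).

```python
# Model name based virtual host selection: match the request's Host header
# against each vhost's ServerName and ServerAlias values, in definition
# order; an unmatched request falls through to the FIRST vhost defined
# for that NameVirtualHost address.
def select_vhost(vhosts, host_header):
    for vh in vhosts:
        if host_header == vh["ServerName"] or host_header in vh.get("ServerAlias", []):
            return vh["ServerName"]
    return vhosts[0]["ServerName"]  # default: first defined virtual host

vhosts_192_168_0_2 = [
    {"ServerName": "www.democrat.pol", "ServerAlias": ["democrat.pol", "donkey"]},
    {"ServerName": "www.libertarian.pol"},
    {"ServerName": "www.green.pol"},
]
print(select_vhost(vhosts_192_168_0_2, "www.green.pol"))  # www.green.pol
print(select_vhost(vhosts_192_168_0_2, "donkey"))         # www.democrat.pol (alias match)
print(select_vhost(vhosts_192_168_0_2, "www.typo.pol"))   # www.democrat.pol (fallback)
```

The third call illustrates the misconfiguration case: an unrecognized hostname lands on www.democrat.pol simply because it was defined first.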

Exercises

Lab Exercise
Objective: Configure Apache virtual hosts

Estimated Time: 45 mins.

Specification

This lab will consist of setting up virtual hosts for four distinct trade
organizations which are all sharing a common web server. The various virtual
hosts will be bound to variants of the loopback address, so all configuration
will be local to your machine. The skills required to configure a "real world"
external web server would be nearly identical; only the IP addresses would
need to change.

1. Create appropriate DNS entries.

As a prerequisite, DNS should be configured to resolve all relevant hostnames
appropriately. For our purposes, simply adding the following entries to your
local /etc/hosts file will suffice.

127.1.1.1 www.peanutbutterisgood.rha
127.1.1.2 www.jellyisgood.rha
127.1.1.2 www.jamisgood.rha
127.1.1.2 www.marmaladeisgood.rha

If you have configured the file correctly, you should be able to individually ping
each of the hostnames, and confirm that they resolve correctly. (Don't be
concerned that there's not really a top level domain called rha. We'll fix that in
an upcoming workbook.)
2. Four advocacy organizations, one each promoting peanut butter, jelly, jam, and
marmalade, want to use common infrastructure to support what looks like four
independent sites. You are to configure your web server so that it serves four
virtual hosts, with the following parameters. In the following table, all document
roots are relative to the directory /var/www/vhostlab, represented by '...'.
You will probably have to create this directory.

Hostname                    IP Address  Type        Document Root
www.peanutbutterisgood.rha  127.1.1.1   IP based    .../pb_root
www.jellyisgood.rha         127.1.1.2   Name based  .../namevhost/jelly_root
www.jamisgood.rha           127.1.1.2   Name based  .../namevhost/jam_root
www.marmaladeisgood.rha     127.1.1.2   Name based  .../namevhost/marmalade_root

3. The content for the various websites can be found at
http://rha-server/pub/rha/rha230/pbandj_website.tgz. Each site consists of a
single index.html file found in the relevantly named directory. Each
index.html file also references a background image as /images/some_name.jpg.
a. Extract the tar archive, and position the index.html files so that they are
located within the appropriate document roots.
b. Within the tar archive, all four images are found in a single images
directory. Install this directory on your web server as the directory
/var/www/vhostlab/images. Configure your web server so each virtual
host can reference images in this directory using a URL of the form
<hostname>http://vhostname/images/some_name.jpg</hostname>. You
may use whatever method you like, as long as the images are not moved
(or copied) from the images directory, and you do not modify the
index.html files.

If installed correctly, your site should have the following minimum structure.
(You may have added some additional links or whatnot to solve the image
directory problem).

/var/www/vhostlab/
|-- images
| |-- jam.jpg
| |-- jelly.jpg
| |-- marmalade.jpg
| `-- peanutbutter.jpg
|-- namevhost
| |-- jam_root
| | `-- index.html
| |-- jelly_root
| | `-- index.html
| `-- marmalade_root
| `-- index.html
`-- pb_root
`-- index.html

4. Set options such that clients accessing
http://www.peanutbutterisgood.rha/images receive a dynamically generated
index, but dynamically generated indexes for http://www.jamisgood.rha/images,
http://www.jellyisgood.rha/images, and
http://www.marmaladeisgood.rha/images are prohibited.
5. The site http://www.peanutbutterisgood.rha should log hits (client access) to the
file /var/log/httpd/pb_access_log, using the common format. The three
name based virtual hosts should all log hits to the file
/var/log/httpd/fruity_access_log, again using the common format.
6. Older web clients use the HTTP/1.0 protocol, instead of the HTTP/1.1 protocol,
and do not always provide the HTTP <directive>Host:</directive> header
required to resolve name based virtual hosts. As a result, when accessing a site
which uses name based virtual hosting, they are always bound to the default
(first defined) virtual host.

In order to accommodate these older clients, create a new name based virtual
host, with a <directive>ServerName</directive> of
<hostname>DummyPlaceholder</hostname>, and assign it a document root of
/var/www/vhostlab/namevhost. Make sure that its definition occurs before
any other virtual host definitions for IP address 127.1.1.2.

Create the file /var/www/vhostlab/namevhost/index.html, with the following
content.

<p>Which of the following high quality sites are you trying to
access?</p>
<ul>
<li><a href="/jelly_root">www.jellyisgood.rha</a></li>
<li><a href="/jam_root">www.jamisgood.rha</a></li>
<li><a href="/marmalade_root">www.marmaladeisgood.rha</a></li>
</ul>

You may confirm your configuration by accessing the web server by IP address,
instead of hostname: http://127.1.1.2. Make sure that pages accessed through
this new (unnamed) virtual host resolve images correctly.

Deliverables

1. A local DNS configuration which resolves
<hostname>www.peanutbutterisgood.rha</hostname> to 127.1.1.1, and each
of <hostname>www.jellyisgood.rha</hostname>,
<hostname>www.jamisgood.rha</hostname>, and
<hostname>www.marmaladeisgood.rha</hostname> to 127.1.1.2.
2. An IP based virtual host on 127.1.1.1, with a document root of
/var/www/vhostlab/pb_root, with the specified content, which logs hits to
/var/log/httpd/pb_access_log using the common format.
3. Three name based virtual hosts
(<hostname>www.jellyisgood.rha</hostname>,
<hostname>www.jamisgood.rha</hostname>, and
<hostname>www.marmaladeisgood.rha</hostname>) which all share the IP
address 127.1.1.2, mapped to the document roots
/var/www/vhostlab/namevhost/jelly_root,
/var/www/vhostlab/namevhost/jam_root, and
/var/www/vhostlab/namevhost/marmalade_root, respectively, with the
specified content.
4. Each name based host logs hits to the shared log file
/var/log/httpd/fruity_access_log using the common format.
5. Requests for all four virtual hosts should resolve the URL /images to the
directory /var/www/vhostlab/images.
6. For the IP based virtual host 127.1.1.1, requests to the URL /images should
result in a dynamically generated index. For all name based virtual hosts,
dynamic index generation of /images should be disabled.
7. In order to support legacy clients, all requests which resolve to the host
127.1.1.2 which do not directly reference one of the specified name based
virtual hosts by name should resolve to the document root
/var/www/vhostlab/namevhost, which contains the file index.html with the
specified content.

The Squid Proxy Server

Discussion

Proxy Servers

A proxy server acts as a middleman between a client and a server. The use of a proxy
server usually involves the following.

Figure 1. The Role of a Proxy Server

1. A client is configured to use the proxy server. This is a one time configuration,
which usually requires the IP address and port of the proxy server.
2. When asked to connect to a service, instead of connecting directly, the client
instead connects to the proxy server.
3. The proxy server accepts the request as if it were the server, but sends nothing
back to the client immediately. Instead, the proxy server initiates the request to
the real service, as if it were the client.
4. The true service receives the connection, and returns a response to the proxy
server.
5. The proxy server then resends the response it received from the server to the
client, as if it were the server.
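To make step 3 concrete: a web proxy receives requests whose request line names a full URL, and before forwarding it must rewrite the request into the form the origin server expects. The following sketch (illustrative only, not how Squid is actually implemented) performs that rewrite.

```python
from urllib.parse import urlsplit

# A proxy-style request line names the full URL ("GET http://host/path HTTP/1.1");
# the origin server expects only the path, with the hostname moved into the
# Host header. This rewrite is the heart of step 3 above.
def rewrite_for_origin(proxy_request_line):
    method, url, version = proxy_request_line.split()
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return f"{method} {path} {version}\r\nHost: {parts.netloc}\r\n"

print(rewrite_for_origin("GET http://www.example.com/index.html HTTP/1.1"))
# GET /index.html HTTP/1.1
# Host: www.example.com
```

A caching proxy (see below) would additionally inspect the rewritten request and its stored responses before contacting the origin server at all.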

Why would anyone want to use such a convoluted scheme? The answer usually
involves one of the following.

Access. The client may be on a machine that does not have a direct connection
to the Internet, so it needs the services of a proxy server which does. In the
scenario diagrammed above, the client is on a 192.168.0.0/24 private subnet,
which by convention should not be routed directly to the Internet.
Caching. The proxy server may store the response of the server, as well as
return it to the client. If the client (or another client) asks for the same
information again, the proxy server merely needs to ask the real server "has
your information changed?" If not, the proxy server can return the local copy,
reducing traffic between the proxy server and the true service.
Filtering. The proxy server becomes a single control point for all clients which
it serves. Therefore, traffic can be filtered or logged for later auditing at the
proxy server.

Although our figure diagrams a web proxy server, our discussion has been intentionally
vague about what client and what service we're talking about, because the idea of a
proxy server is a general concept. The service in question could be a web server, an FTP
server, or even an LDAP server, and the same concepts would apply.

The squid Proxy Server

Most often, if people use the term proxy server without elaboration, they are
referring to an HTTP (web) proxy server. Red Hat Enterprise Linux ships with a
full featured and sophisticated proxy server, known as Squid. Squid supports
FTP, gopher, and HTTP requests, SSL encapsulation, robust caching, extensive
access controls, and full transaction logging. Much like the Apache web
server, a whole course could be devoted to deploying and maintaining the
squid proxy server.

Like most Red Hat Enterprise Linux packaged products, however, the out-of-the-box
configuration makes it fairly easy to set up and use the proxy server in a basic
configuration. We will cover how to install the server, define which port it should bind
to, and specify which clients are able to connect to the service.

The proxy server is packaged in the squid package, and is managed as the
<service>squid</service> service. Therefore, the standard techniques can be used for
installing the software and starting the service in its default configuration.

[root@station ~]# yum install squid
...
======================================================================
=======
Package Arch Version Repository
Size
======================================================================
=======
Installing:
squid i386 7:2.6.STABLE6-3.el5 rha-rhel
1.2 M
...
Installed: squid.i386 7:2.6.STABLE6-3.el5
Complete!
[root@station1 ~]# service squid start
init_cache_dir /var/spool/squid... Starting squid: . [ OK ]
[root@station1 ~]# chkconfig squid on

The out-of-the-box configuration is not useful directly, however, as the default access
control lists do not let any useful clients connect.

Squid Configuration: /etc/squid/squid.conf

Upon startup, the squid daemon reads the /etc/squid/squid.conf configuration file
for its configuration. The configuration file follows a very traditional Linux (and Unix)
syntax.

All white lines (lines which are empty or contain only white space) are ignored,
as are all comment lines that begin with a "#".
All other lines begin with a keyword, referred to as a "TAG". The syntax for
arguments after the tag depends on the tag, but all arguments must occur on
the same line.

Like many Red Hat Enterprise Linux default configuration files, the file attempts to be
self documenting and provides copious comments with default configuration values
commented out. Usually, changing a value to something other than the default involves
uncommenting the default line, and changing its value (perhaps first duplicating the line
to preserve documentation of the default value).

While the default configuration file is intimidating, weighing in at over 4300 lines, the
relevant configuration is a mere 25 lines, as illustrated below.

[root@station ~]# wc /etc/squid/squid.conf
4325 24616 148129 /etc/squid/squid.conf
[root@station ~]# grep -v \# /etc/squid/squid.conf | sed "/^$/d" | wc
25 91 756

For our purposes, we are only going to examine three relevant tags:
<directive>http_port</directive>, <directive>acl</directive>, and
<directive>http_access</directive>.
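As a preview of how these three tags combine, a minimal local policy might look like the following sketch (the acl name localnet is arbitrary, and the subnet is an assumption; adapt both to your network).

```
# /etc/squid/squid.conf (fragment; a sketch, not the shipped default)
http_port 3128

# Define an ACL naming our local clients, then allow them and deny all others.
acl localnet src 192.168.0.0/24
http_access allow localnet
http_access deny all
```

Note the pattern: <directive>acl</directive> lines only define named sets of clients; nothing is permitted or refused until an <directive>http_access</directive> line references them.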

The server's identity: http_port

Opening the /etc/squid/squid.conf configuration file with any text editor, you
should be able to quickly find the first configuration tag,
<directive>http_port</directive>.

Figure 1. /etc/squid/squid.conf: <directive>http_port</directive>


#  NETWORK OPTIONS
#  -----------------------------------------------------------------------------

#  TAG: http_port
#       Usage:  port
#               hostname:port
#               1.2.3.4:port
#
#       The socket addresses where Squid will listen for HTTP client
#       requests.  You may specify multiple socket addresses.
#       There are three forms: port alone, hostname with port, and
#       IP address with port.  If you specify a hostname or IP
#       address, Squid binds the socket to that specific
#       address.  This replaces the old 'tcp_incoming_address'
#       option.  Most likely, you do not need to bind to a specific
#       address, so you can use the port number alone.
...
# Squid normally listens to port 3128
http_port 3128

By default, squid binds to port 3128, although by convention, HTTP proxy servers
usually use port 8000 or 8080. An administrator could well want to add a line akin
to the following.

http_port 8080

Note that, as the comment says, multiple <directive>http_port</directive> lines can be
added, causing squid to bind to more than one port or interface, if necessary.
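For example, a configuration that listens on the default port on every interface, and additionally on port 8080 of one particular address, might look like the following sketch. (The address 192.168.0.254 is a hypothetical internal interface, invented for illustration.)

```
# listen on the default port, on all interfaces
http_port 3128
# also listen on port 8080, but only on a (hypothetical) internal address
http_port 192.168.0.254:8080
```

Binding to a single internal address like this is a common way to ensure the proxy is not reachable from an external interface at all, independent of any acl rules.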

Squid Access Control Lists: acl and http_access

More interestingly, we also explore manipulating the client access control
configuration. Finding the access control configuration can be difficult, as the relevant
configuration is found deep within the rather long file. Searching for the term acl,
however, and pounding on Find Next about 10 times, you should be able to discover the
following.

Figure 1. /etc/squid/squid.conf: <directive>acl</directive> Documentation

#  ACCESS CONTROLS
#  -----------------------------------------------------------------------------

#  TAG: acl
#       Defining an Access List
#
#       acl aclname acltype string1 ...
#       acl aclname acltype "file" ...
#
#       when using "file", the file should contain one item per line
#
#       acltype is one of the types described below
#
#       By default, regular expressions are CASE-SENSITIVE.  To make
#       them case-insensitive, use the -i option.
#
#       acl aclname src       ip-address/netmask ...  (clients IP address)
#       acl aclname src       addr1-addr2/netmask ... (range of addresses)
#       acl aclname dst       ip-address/netmask ...  (URL host's IP address)
#       acl aclname myip      ip-address/netmask ...  (local socket IP address)
#
...
#
#       acl aclname srcdomain .foo.com ...  # reverse lookup, client IP
#       acl aclname dstdomain .foo.com ...  # Destination server from URL

The <directive>acl</directive> tag assigns a name to a specification. The tag itself has
no observable effect, but may instead be referenced by other tags (such as
<directive>http_access</directive>, below). Skimming the comments here and in the
file itself, we find that <directive>acl</directive> specifications can involve a wide
range of parameters, including the following.

Table 1. Squid <directive>acl</directive> Specifications

Keyword                               Parameter
<directive>src</directive>            Requesting client's IP address
<directive>dst</directive>            Real server's IP address
<directive>port</directive>           Real server's port
<directive>myip</directive>           squid's IP address
<directive>srcdomain</directive>      Requesting client's domain name
<directive>dstdomain</directive>      Real server's domain name
<directive>time</directive>           Time of day
<directive>url_regex</directive>      Regular expression matched against the requested URL
<directive>proto</directive>          Proxied protocol (HTTP, FTP, etc.)
<directive>reqheader</directive>      Regular expression matched against HTTP request headers
<directive>repheader</directive>      Regular expression matched against HTTP response headers
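As a sketch of the syntax (the names and values below are invented for illustration, and are not part of the default configuration), acl definitions using a few of these types might look like the following.

```
# clients in an (invented) local subnet
acl labnet src 192.168.100.0/255.255.255.0
# requests destined for an (invented) domain
acl badsites dstdomain .example.com
# business hours, Monday through Friday
acl workhours time MTWHF 09:00-17:00
# any URL containing the string "streaming", case-insensitively
acl streams url_regex -i streaming
```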

And these are only some of the parameters that can be specified. Obviously, squid is
highly configurable in terms of who it will let connect, and what content it is willing to
proxy. We now turn our attention to the default configuration, which consists of the
uncommented values found a few lines below.
Figure 2. /etc/squid/squid.conf: <directive>acl</directive>

#Recommended minimum configuration:
acl all src 0.0.0.0/0.0.0.0
acl manager proto cache_object
acl localhost src 127.0.0.1/255.255.255.255
acl to_localhost dst 127.0.0.0/8
acl SSL_ports port 443
acl Safe_ports port 80          # http
acl Safe_ports port 21          # ftp
acl Safe_ports port 443         # https
acl Safe_ports port 70          # gopher
acl Safe_ports port 210         # wais
acl Safe_ports port 1025-65535  # unregistered ports
acl Safe_ports port 280         # http-mgmt
acl Safe_ports port 488         # gss-http
acl Safe_ports port 591         # filemaker
acl Safe_ports port 777         # multiling http
acl CONNECT method CONNECT

These lines define the following names, which can be referred to later.

Table 2. Default squid <directive>acl</directive> Definitions

Name Members
all All requests
manager squid internal cache management requests
localhost All requests originating from the loopback address
to_localhost All requests to the loopback address
Safe_ports All requests to the well known ports of services squid is willing to proxy
CONNECT All requests to initiate an SSL encapsulated connection

As the Safe_ports <directive>acl</directive> illustrates, a name may be assigned
multiple times, resulting in the values being "or"ed together (i.e., a match on any of the
individual values is considered a match on the <directive>acl</directive> as a whole).

Lastly, an access control policy is defined using multiple
<directive>http_access</directive> tags which reference the
<directive>acl</directive>'s defined above. On any client request, squid will use a "stop
on first match" policy while searching the following list of
<directive>http_access</directive> controls. Order is important. Once squid finds a
specification that matches the client request, it stops searching and immediately
implements the specified <directive>allow</directive> or <directive>deny</directive>
policy.

Figure 3. /etc/squid/squid.conf: <directive>http_access</directive>

#  TAG: http_access
#       Allowing or Denying access based on defined access lists
#
#       Access to the HTTP port:
#       http_access allow|deny [!]aclname ...
#
...
#Recommended minimum configuration:
#
# Only allow cachemgr access from localhost
http_access allow manager localhost
http_access deny manager
# Deny requests to unknown ports
http_access deny !Safe_ports
...
# Example rule allowing access from your local networks. Adapt
# to list your (internal) IP networks from where browsing should
# be allowed
#acl our_networks src 192.168.1.0/24 192.168.2.0/24
#http_access allow our_networks
http_access allow localhost

# And finally deny all other access to this proxy
http_access deny all

To the experienced eye, the comments leave little more to add, but we'll walk through
these lines just in case.

The first argument to the <directive>http_access</directive> tag is either the
keyword <directive>allow</directive> or <directive>deny</directive>, followed
by one or more <directive>acl</directive> names, each possibly preceded by a "!".
The <directive>acl</directive> names are effectively "and"ed - all must apply to
the client request for the <directive>http_access</directive> policy to apply. The
presence of a "!" inverts the meaning of the <directive>acl</directive>.

In the figure above, the first uncommented line allows management requests, but only
from the loopback address (i.e., from processes running on the proxy server). Notice
that both the manager and localhost <directive>acl</directive>'s must apply for the
policy to take effect. The second line denies management requests from all other
sources. Next, any request for a port other than one for which squid is willing to proxy
is denied. (Notice the convenient use of "!" to invert the meaning of the Safe_ports
<directive>acl</directive>.) The commented out our_networks lines are where the good
guys are defined; more on this in a second. Any requests from the loopback interface
are considered good. Finally, any request not meeting the above policies is prohibited
by <directive>deny all</directive>.
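The stop-on-first-match semantics can be sketched in a few lines of Python. This is a simplified model for illustration only (real squid evaluates many more acl types, and its default when no rule matches is the opposite of the last rule's action); the acl definitions below mirror the default configuration discussed above.

```python
# Simplified model of squid's http_access evaluation. Each rule is a
# tuple (action, [acl names]); an acl name prefixed with "!" is negated.
def http_access(rules, acls, request):
    """Return "allow" or "deny" for a request (a dict of request facts)."""
    for action, names in rules:
        matched = True
        for name in names:
            negate = name.startswith("!")
            hit = acls[name.lstrip("!")](request)
            if hit == negate:        # acl must match (or not match, if negated)
                matched = False
                break
        if matched:                  # stop on first match
            return action
    return "deny"                    # simplification: deny anything unmatched

# acls are modeled as predicates over the request
acls = {
    "all": lambda r: True,
    "localhost": lambda r: r["src"] == "127.0.0.1",
    "Safe_ports": lambda r: r["port"] in {80, 21, 443, 70, 210, 280, 488, 591, 777}
                            or 1025 <= r["port"] <= 65535,
}

# the tail of the default policy chain, in order
rules = [
    ("deny", ["!Safe_ports"]),
    ("allow", ["localhost"]),
    ("deny", ["all"]),
]

print(http_access(rules, acls, {"src": "127.0.0.1", "port": 80}))     # allow
print(http_access(rules, acls, {"src": "192.168.0.25", "port": 80}))  # deny
print(http_access(rules, acls, {"src": "127.0.0.1", "port": 25}))     # deny
```

Note how the loopback client requesting an unsafe port (25) is rejected by the first rule before the allow localhost rule is ever consulted: order is important.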

Once we work our way through the default configuration, we realize that it only allows
connections from the loopback address! If the proxy server is to be useful, the identities
of the intended clients need to be specified. How to do so should be evident from the
comments. First, define the our_networks <directive>acl</directive> to match requests
from the clients for whom squid should be willing to proxy. Second, add an
<directive>http_access</directive> rule allowing connections that match the
our_networks <directive>acl</directive>. (Of course, some name other than
our_networks could have been used.)
Order is important. The matching rule should occur after requests for bad ports are
filtered out, but before the <directive>deny all</directive> sledgehammer. For
example, to allow clients to connect from the 192.168.0.0/24 subnet, we could add the
following lines just beneath the our_networks comments.

acl our_networks src 192.168.0.0/255.255.255.0
http_access allow our_networks

Or, the equivalent CIDR notation 192.168.0.0/24 could have been used. Of
course, after modifying the configuration file, the <service>squid</service> service
should be restarted.

[root@station ~]# service squid restart
Stopping squid: .                                          [  OK  ]
Starting squid: .                                          [  OK  ]
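Incidentally, the netmask arithmetic behind an acl such as our_networks can be sanity-checked with Python's standard ipaddress module. This is just a verification aid on the side, not part of squid, which performs this matching internally.

```python
import ipaddress

# The acl "our_networks src 192.168.0.0/255.255.255.0", in CIDR form
our_networks = ipaddress.ip_network("192.168.0.0/24")

# a client on the subnet matches; one on a neighboring subnet does not
print(ipaddress.ip_address("192.168.0.25") in our_networks)  # True
print(ipaddress.ip_address("192.168.1.25") in our_networks)  # False
```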

It took a while to understand why, but in the end, configuring squid to allow clients
involves only a two line edit, both lines of which can be easily deduced from existing
comments: one to define who the good guys are, and another to modify the access
control list chain to let them in.

Configuring Proxies for Web Clients

Once a proxy server is up and running, clients must be configured to use it. The details
will vary from client to client, but the essence is the same. Somehow, the client needs to
be configured with the IP address and port number of the proxy server.

Configuring Firefox

The firefox web browser's proxy configuration is found by choosing the Connection
Settings... button from Preferences Dialog, which is opened by choosing the
Edit:Preferences menu item.

Figure 1. Firefox Proxy Configuration


Once open, the dialog allows you to specify an independent proxy server for each of
several protocols, or, conveniently, to set all protocols to use the same server. A list of
domains and IP addresses for which the client should not proxy can also be specified,
which is very useful for maintaining access to servers the proxy server might not be
aware of (such as <hostname>localhost</hostname> or <hostname>rha-
server</hostname>).

Configuring curl

Command line web clients are often configured to use proxy servers through command
line switches or environment variables. Opening the curl man page, for example, and
searching for proxy, one can (eventually) find the following.

-x/--proxy <proxyhost[:port]>
       Use specified HTTP proxy. If the port number is not specified,
       it is assumed at port 1080.

       This option overrides existing environment variables that sets
       proxy to use. If there's an environment variable setting a
       proxy, you can set proxy to "" to override it.

And, a little further down, the following.


ENVIRONMENT
       http_proxy [protocol://]<host>[:port]
              Sets proxy server to use for HTTP.

       HTTPS_PROXY [protocol://]<host>[:port]
              Sets proxy server to use for HTTPS.
       ...

       NO_PROXY <comma-separated list of hosts>
              list of host names that shouldn't go through any proxy.
              If set to a asterisk '*' only, it matches all hosts.

For example, to download the Red Hat home page using the proxy server defined
above, either of the following techniques could be used.

[root@station ~]# curl -x http://station:8080 http://www.redhat.com

[root@station ~]# export http_proxy=http://station:8080
[root@station ~]# curl http://www.redhat.com

Squid Logging: /var/log/squid/access.log

Like the Apache web server, squid maintains a transaction log, found at
/var/log/squid/access.log.

Squid uses its own log format, which displays details more pertinent to a proxy server
than the standard common format used by web servers. The
<directive>emulate_httpd_log</directive> directive can be set to use the traditional
common format instead, though information will be lost.

Table 1. Squid Log Format

Position  Example                Content
1         1124596159.068         A Unix standard timestamp. [a]
2         60355                  Request duration, in milliseconds.
3         192.168.0.25           Client IP address
4         TCP_MISS/200           Squid result code
5         1381                   Number of bytes transferred to the client
6         GET                    The request method
7         http://www.redhat.com  The requested URL
8-10      ...                    Parameters relevant to the internal cache

[a] The Unix world (including Linux) conventionally records timestamps internally using
"seconds since the epoch", with the epoch being January 1st, 1970. Using a signed 32bit
integer, this conveniently records times from around 1900 until around 2038. The Unix
world was not concerned about "Y2K" problems, but instead worries about "Y2038"
problems. Your author feels this would be the perfect time to come out of retirement
and consult for legacy Linux systems.
Three sample log messages are found below.

1124596032.120      2 192.168.1.1 TCP_DENIED/403 1355 GET http://www.redhat.com/ - NONE/- text/html
1124596159.068  60355 192.168.0.25 TCP_MISS/200 1381 GET http://www.redhat.com/ - DIRECT/209.132.177.50 text/html
1124596167.650      1 192.168.0.25 TCP_HIT/200 12115 GET http://www.redhat.com/ - NONE/- text/html

The first is from a client which was not accepted by the client access control
configuration, and so received a TCP_DENIED. The second is a request from a client for
data not already in the cache, a TCP_MISS. The third is a followup request (perhaps from
a reload of the same page), whose data was already cached locally, generating a
TCP_HIT. Notice that the only request which took a significant amount of time to fulfill
was the cache miss, which consumed around 60000 milliseconds (about a minute), as
opposed to 1 or 2 milliseconds for the others.
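A log line in this format can be picked apart with a few lines of Python, following the field layout of Table 1 above. (This is a simple illustration, not a squid tool; it also demonstrates converting the "seconds since the epoch" timestamp to a human-readable date.)

```python
from datetime import datetime, timezone

# the TCP_MISS sample line from above
line = ("1124596159.068  60355 192.168.0.25 TCP_MISS/200 1381 GET "
        "http://www.redhat.com/ - DIRECT/209.132.177.50 text/html")

fields = line.split()
timestamp = float(fields[0])        # Unix standard timestamp
duration_ms = int(fields[1])        # request duration, milliseconds
client = fields[2]                  # client IP address
result = fields[3]                  # squid result code
nbytes = int(fields[4])             # bytes transferred to the client
method, url = fields[5], fields[6]

# "seconds since the epoch" converts to a calendar date
when = datetime.fromtimestamp(timestamp, tz=timezone.utc)
print(when.date())                  # 2005-08-21
print(duration_ms / 1000)           # the cache miss took about 60 seconds
print(client, result, method, url)
```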

Finding Out More

We have only touched upon a few of Squid's basics. Those interested in more, such as
using squid as a transparent proxy server (or "accelerator"), can consult the FAQs
(which reads almost like a manual) found at /usr/share/doc/squid-version/FAQ-
html, or consult the Squid home page.

Exercises

Lab Exercise
Objective: Configure the Squid Proxy Server

Estimated Time: 10 mins.

Specification

This lab will have you install, configure, and use the squid proxy server. A "real world"
use of squid would require three machines: one to host the web server, one to host the
proxy server, and of course the client machine running a web browser.

Figure 1. Standard Squid Proxy Server Configuration


The machine hosting the web server would need a publicly accessible IP address, as
would the proxy server. It could well be the case, however, that the client machine does
not, with squid running on a multi-homed host. The squid application would receive
requests from clients over a private IP address, and forward them to the Internet through
its public IP address.

For our lab, we will instead run the web server, the client, and the squid proxy server on
the same machine. The concepts map directly to the real world scenario. In the
following diagram, 192.168.0.1 should be replaced with your eth0 IP address.

Figure 2. Lab Squid Proxy Server Configuration

1. Configure the squid proxy server.
a. Ensure that the squid package is installed.
b. As a precaution, make a backup of the file /etc/squid/squid.conf,
copying it to /etc/squid/squid.conf.orig, for example.
c. In the file /etc/squid/squid.conf, search for the
<directive>http_port</directive> option, around line 54. Set the
<directive>http_port</directive> to 8080.
   d. In the file /etc/squid/squid.conf, search for the term our_networks,
      around line 1860(!). Administrators are expected to set local access
      control policies at this location. Following the commented out examples,
      define an <directive>acl</directive> our_networks which matches all
      requests sourced from your eth0 interface. For example, if ifconfig eth0
      reports your IP address as 192.168.0.5 and your network mask as
      255.255.255.0, then the following lines would be appropriate. (If in
      doubt, you can specify your IP address directly, with a mask of
      255.255.255.255.) Once the acl is defined, add an
      <directive>http_access</directive> directive which allows it.

      acl our_networks src 192.168.0.0/255.255.255.0
      http_access allow our_networks

   e. Use the standard service and chkconfig commands to start the
      <service>squid</service> service, and enable the service to start
      automatically on reboots. You might want to use the netstat command to
      confirm that squid is LISTENing for connections on port 8080.
g. Use the standard service and chkconfig commands to start the
<service>squid</service> service, and enable the service to start
automatically on reboots. You might want to use the netstat command to
confirm that squid is LISTENing for connections on port 8080.
2. Monitor squid and httpd requests. In two separate windows (or two separate
virtual consoles), use less to open the files /var/log/httpd/access_log and
/var/log/squid/access.log, respectively. Within less, hit SHIFT-F to enter
"follow" mode. As new requests are made for each service, you should see a log
line generated within the respective file. (Pressing CTRL-C will return less to
normal browsing mode.)
3. Configure firefox to use the proxy server. Using the firefox browser, open the
Edit: Preferences dialog, and follow the path to General and Connection
Settings.... In the resulting dialog, choose Manual proxy configuration, and set
the HTTP Proxy to be your eth0 IP address, port 8080. Also, remove any text
from the No Proxy For text entry. OK your way out of the various dialogs.
4. Browse your webserver. Now use firefox to browse the content of your
   webserver. If some of your previous labs are still in place, you may try
   http://localhost/relativity, http://www.peanutbutterisgood.rha, or
   http://www.jamisgood.rha. Otherwise, simply create a file in your document root
   directory, and reference it. With each request, you should see a line similar to
   the following in your /var/log/squid/access.log file.

   1132416737.269 699 192.168.0.1 TCP_MISS/304 200 GET http://localhost/readings/the_god_of_mars.html - DIRECT/127.0.0.1 -

If not, make sure you reload a page from within the browser. If the page is in the
browser's cache, then it will not actually generate a request.

Deliverables


1. A running squid server, bound to port 8080, which allows requests over the
   IP address assigned to the eth0 interface.
2. The <service>squid</service> service is configured to start automatically
upon reboot.

Challenge Exercises

1. Assuming your neighbors have set access control configuration appropriately,
   you should be able to use your proxy server to browse a neighbor's website, or a
   neighbor's proxy server to browse your website, or a neighbor's proxy server to
   browse another neighbor's website. Explore.
2. Configure your access control specifications so that one particular neighbor may
access your squid proxy server, but another may not.
3. Notice the following lines in the /etc/squid/squid.conf configuration file.

   # We strongly recommend the following be uncommented to protect innocent
   # web applications running on the proxy server who think the only
   # one who can access services on "localhost" is a local user
   #http_access deny to_localhost

What concern is this addressing? In order to convince yourself that denying the
<directive>to_localhost</directive> acl is a good idea, enable the
<directive>/server-status</directive> location within your Apache web server,
but take the precaution of only allowing requests from the loopback address
127.0.0.1. Then have a neighbor use your proxy server to access
http://localhost/server-status from their machine.

(Realize, of course, that fixing this security hole by denying requests matching
the <directive>to_localhost</directive> acl would break the original lab.)
