
Text Streams and Filters

Objectives: Learn to manipulate text at the command line using the GNU textutils; send text files and output streams through text utility filters to modify their output; and introduce the sed editor for scripting changes to text files.

Introduction This lesson focuses primarily on filters, which are used to build complex pipelines for manipulating text. We will discuss how to display text, sort it, count words and lines, and translate characters. The sed editor will also be introduced.

Text Filtering Text filtering is the process of taking an input stream of text and performing some conversion on the text before sending it to an output stream. In Linux and UNIX environments, filtering is most often done by constructing a pipeline of commands in which the output from one command is piped or redirected to be used as input to the next.

Piping with | Input can come from arguments you supply to commands, and output can be displayed on your terminal. Many text processing commands (filters) can take input either from the standard input stream or from a file. To use the output of one command, command1, as input to a filter, command2, you connect the commands with the pipe operator ( | ). Example 1:
$ echo -e "apple\npear\nbanana" | sort
apple
banana
pear

You can also use | to redirect the output of the second command in the pipeline to a third command, and so on. Constructing long pipelines of commands that each have a limited capability is a common Linux and UNIX way of accomplishing tasks. You will also sometimes see a hyphen ( - ) used in place of a filename as an argument to a command, meaning the input should come from stdin rather than a file.
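As a small illustration of both ideas, here is a sketch of a longer pipeline and of the hyphen standing in for stdin; the file names access.log and notes.txt are hypothetical:
$ sort access.log | uniq -c | sort -rn | head -n 5    # the five most frequent lines in the file
$ echo "extra line" | cat notes.txt -                 # the hyphen makes cat read stdin after notes.txt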

Output Redirection with > While it is nice to be able to create a pipeline of several commands and see the output on your terminal, there are times when you want to save the output in a file. You do this with the output redirection operator ( > ). Example 2:

$ echo -e "1 apple\n2 pear\n3 banana" > text1.dat

cat and split Now that we have created the text1.dat file, the next thing to do is to check what is in it. Use the cat command (short for concatenate) to display the contents of a file on stdout. Example 3:
$ cat text1.dat
1 apple
2 pear
3 banana

You can also use cat to concatenate several files together for display. Example 4: $ cat *.dat

Our example files are very small, but sometimes you will have large files that you need to split into smaller pieces. For example, you might want to break a large file into CD-sized chunks so you can write it to CDs and mail them to someone who could create a DVD for you. The split command will do this in such a way that the cat command can be used to recreate the file easily. By default, the files resulting from the split command have a name prefix of x followed by a suffix of aa, ab, ac, ..., ba, bb, and so on. Syntax: split [ options ] [ input ] [ output-prefix ] Use -l n to split a file into n-line chunks $ split -l 2 file.doc short_

Use -b n to split into chunks of n bytes each $ split -b 17 file.doc chunk_
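For instance, a minimal sketch of splitting a file and putting it back together, assuming a file named bigfile.dat exists:
$ split -b 512 bigfile.dat part_            # produces part_aa, part_ab, part_ac, ...
$ cat part_* > rebuilt.dat                  # the shell expands part_* in sorted order
$ cmp bigfile.dat rebuilt.dat && echo "files match"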

wc, head and tail cat displays the whole file. That is fine for small files, but suppose you have a large file. First you might want to use the wc (word count) command to see how big the file is. The wc command displays the number of lines, words and bytes in a file. Syntax: wc [ options ] [ input-file ] Use -c to output the byte count $ wc -c file.doc

Use -l to output the line count $ wc -l poem.txt

Use -w to output the word count $ wc -w myfile.txt

Two commands allow you to display either the first part (head) or the last part (tail) of a file: the head and tail commands. They can be used as filters, or they can take a filename as an argument. By default they display the first (or last) 10 lines of the file or stream. Example 5: Print the first two lines of a text file (two alternatives) $ head -n 2 notes.php $ head -2 notes.php
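Combining the two filters lets you pull a range out of the middle of a file. A sketch, assuming notes.php has at least 20 lines:
$ head -n 20 notes.php | tail -n 5          # lines 16 through 20: take the first 20, then the last 5 of those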

Another common use of tail is to follow a file using the -f option, usually with a line count of 1. You might use this when you have a background process that is generating output in a file and you want to check in and see how it is doing. In this mode, tail runs until you cancel it (using Ctrl+C), displaying lines as they are written to the file. Example 6: $ tail -n 5 notes.txt
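A sketch of the follow mode described above, assuming a background job is appending to a hypothetical file named build.log:
$ tail -f -n 1 build.log                    # start from the last line, then print new lines as they arrive
Press Ctrl+C to stop following.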

expand Sometimes you may want to swap tabs for spaces (expand) or spaces for tabs (unexpand). The -t option for both commands allows you to set the tab stops. (The default tab size is 8.) Example 7: Change all tabs in document.txt to three spaces and display it on the screen (two alternatives) $ expand -t 3 document.txt $ expand -3 document.txt
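A sketch of the round trip between tabs and spaces, reusing document.txt from the example above; spaces.txt and tabs.txt are hypothetical output names:
$ expand -t 3 document.txt > spaces.txt     # tabs become runs of spaces at 3-column tab stops
$ unexpand -t 3 spaces.txt > tabs.txt       # unexpand converts runs of spaces back to tabs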

Translating Sets of Characters with tr The tr command translates one character set to another. Syntax: tr start-set end-set Options: -d deletes characters in start-set instead of translating them -s replaces sequences of identical characters with just one

The tr command replaces characters in start-set with the corresponding characters in end-set. It is also important to note that tr cannot accept a file as an argument; it uses only standard input and output. Example 8: Replace all uppercase characters in input-file with lowercase characters (two alternatives) $ cat input-file | tr A-Z a-z $ tr A-Z a-z < input-file

Example 9: Delete all occurrences of z in story.txt $ cat story.txt | tr -d z
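The -s option is not shown above; a short sketch, reusing story.txt:
$ tr -s ' ' < story.txt                     # squeeze runs of spaces down to a single space
$ tr -d '\r' < story.txt                    # -d again: strip carriage returns from a DOS-format file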

Pattern Matching and Wildcards Wildcards are pattern matching characters commonly used to match file names or text within a file. Common uses of wildcards are: locating file names that you don't fully remember, locating files that have something in common, or performing operations on multiple files rather than on individual ones. The characters used for wildcards are: ? * [ ]

Special wildcard characters
?            Match exactly one character
*            Match any string of characters
[emacgEl]    Match one character from the listed set
[e-x]        Match one character in the range
[!x-z]       Match one character not in the set
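A few sketches of these patterns in use with ls, in a hypothetical directory containing .txt and .dat files:
$ ls report?.txt        # exactly one character after "report", e.g. report1.txt
$ ls *.dat              # any file name ending in .dat
$ ls [a-c]*.txt         # .txt files whose names start with a, b or c
$ ls [!a-c]*.txt        # .txt files whose names do not start with a, b or c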

Numbering Lines of a File with nl or cat The nl command numbers lines. There are options to finely control the formatting. By default, blank lines aren't numbered. Options: -ba numbers every line cat -n also numbers lines

Example 10: $ nl text3.txt
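A sketch comparing the numbering commands, assuming text3.txt contains some blank lines:
$ nl text3.txt          # default: blank lines are left unnumbered
$ nl -ba text3.txt      # number every line, including blank ones
$ cat -n text3.txt      # cat -n also numbers every line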

Sorting Lines of Text with sort The sort filter reads lines of text and prints them sorted into order. The sort command can sort by numeric values or by character values. You can specify this choice for the whole record or for each field. Unless you specify a different field separator, fields are delimited by blanks or tabs. Options: -f makes the sorting case-insensitive -n sorts numerically rather than lexicographically -k sorts the record according to the key position -r sorts in reverse
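A sketch of these options together, assuming a hypothetical colon-separated file scores.txt with fields name:score:
$ sort -f scores.txt                        # case-insensitive sort of the whole record
$ sort -t : -k 2 -n scores.txt              # numeric sort on the second field
$ sort -t : -k 2 -n -r scores.txt           # the same, highest score first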

Removing Duplicate Lines with uniq The uniq command usually operates on sorted files and removes consecutive identical lines from any file. The uniq command can also ignore some fields. Other uses of uniq: uniq -c counts how many times each line appeared uniq -u prints only unique lines uniq -d prints only duplicated lines

Example 11: Find out how many unique words are in the dictionary $ sort /usr/dict/words | uniq | wc -w
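A related sketch that uses uniq -c to count how often each word occurs, assuming a hypothetical space-separated file essay.txt:
$ tr -s ' ' '\n' < essay.txt | sort | uniq -c | sort -rn | head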

cut, paste and join These three commands work with fields in textual data. They are particularly useful for dealing with tabular data. The first is the cut command, which extracts fields from text files. The default field delimiter is the tab character. To select the range of output: Use -c for characters Use -f for fields

Note that a range is written as a start and end position, for example 3-8. The first character or field is numbered 1, not 0. The field separator is specified with -d (defaults to tab). Example 12: Select usernames of logged-in users $ who | cut -d ' ' -f1 | sort -u The paste command pastes lines from two or more files side by side. Common option: -d char sets the delimiter between fields in the output

Giving -d more than one character sets different delimiters between each pair of columns. Example 13: Assign passwords to users, separating them with a colon ( : ) $ paste -d : usernames passwords > .syspasswd
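A sketch of -d with more than one character, assuming three hypothetical one-column files:
$ paste -d ':,' usernames passwords shells  # ':' between the first pair of columns, ',' between the second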

The join command joins files based on a matching field. The files should be sorted on the join field. Common option: -t sets the field delimiter

By default, fields are separated by any number of spaces or tabs. Example 14: Show details of suppliers and their products $ join suppliers.txt products.txt | less
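A sketch of -t with colon-delimited files, assuming hypothetical files users.txt and shells.txt that are both sorted on their first field:
$ join -t : users.txt shells.txt | less     # join on the first field, using : as the delimiter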

pr and fmt The pr command is used to format files for printing. The default header includes the filename and the file creation date and time, along with a page number and two lines of blank footer. When output is created from multiple files or from the standard input stream, the current date and time are used instead of the filename and creation date. Common options: -d double-spaces the output -h header changes the default header to header -l lines changes the default page length from 66 lines to lines -o width sets (offsets) the left margin to width

Example 15: Format the file thesis.txt for printing with the header My Thesis $ pr -h "My Thesis" thesis.txt > thesisforprinting.txt
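A sketch combining several of the options listed above, reusing thesis.txt; doublespaced.txt is a hypothetical output name:
$ pr -d -l 60 -o 5 thesis.txt > doublespaced.txt    # double-spaced, 60-line pages, left margin offset by 5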

Another useful command for formatting text is the fmt command, which formats text so it fits within margins. Common options: -u converts to uniform spacing (one space between words, two between sentences) -w width sets the maximum line width in characters

Example 16: Change the line length of notes.txt to a maximum of 70 characters and display it on the screen: $ fmt -w 70 notes.txt | less
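A sketch of the -u option, reusing notes.txt:
$ fmt -u -w 60 notes.txt | less             # uniform spacing, lines at most 60 characters wide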

Activity Exercises: CREATE A SHELL SCRIPT FOR THE FOLLOWING REQUIREMENTS.


I.   a. Display the currently logged-in users.
     b. Using the output of the previous requirement, arrange the usernames in sorted order and remove any duplicates.
     c. Finally, use nl to number the lines in the output of the previous command.
II.  a. Save the output of ls -l ~ to a file named mydircontents.txt in the directory SysAdDirectory.
     b. Use the split command to split mydircontents.txt into 50-byte chunks.
     c. Put the split mydircontents.txt back together again and save it into another file named putbackcontents.txt.
     d. Display the contents of putbackcontents.txt on the screen.
III. a. For each option of the pr command, provide an output using the file sonnet.txt (downloadable).
