Objectives: Learn to manipulate text at the command line using the GNU textutils. Send text files and output streams through text utility filters to modify the output. Introduce the student to the sed editor for scripting changes to text files.
Introduction These lessons will primarily focus on filters, which are used to build complex pipelines that manipulate text. We will discuss how to display text, sort it, count words and lines, and translate characters. The sed editor will also be introduced.
Text Filtering Text filtering is the process of taking an input stream of text and performing some conversion on the text before sending it to an output stream. In the Linux or UNIX environments, filtering is most often done by constructing a pipeline of commands where the output from one command is piped or redirected to be used as input to the next.
Piping with | Input can come from parameters you supply to commands, and output can be displayed on your terminal. Many text processing commands (filters) can take input either from the standard input stream or from a file. To use the output of one command, command1, as input to a filter, command2, you connect the commands using the pipe operator ( | ). Example 1: $ echo -e "apple\npear\nbanana" | sort apple banana pear
You can also use | to redirect the output of the second command in the pipeline to a third command, and so on. Constructing long pipelines of commands that each have a limited capability is a common Linux and UNIX way of accomplishing tasks. You will also sometimes see a hyphen ( - ) used in place of a filename as an argument to a command, meaning the input should come from stdin rather than a file.
Output Redirection With > While it is nice to be able to create a pipeline of several commands and see the output on your terminal, there are times when you want to save the output in a file. You do this with the output redirection operator ( > ). Example 2:
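A minimal sketch of Example 2, which creates the text1.dat file used by the cat examples below (the file contents are taken from those examples):

```shell
# Redirect the output of echo into a file instead of the terminal
printf '1 apple\n2 pear\n3 banana\n' > text1.dat
# The file now contains the three numbered lines
cat text1.dat
```

Nothing is printed by the first command itself; the > operator sends its output into text1.dat.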
cat and split Now that we have created a text1.dat file, the next thing to do is to check what is in it. Use the cat command (short for concatenate) to display the contents of a file on stdout. Example 3: $ cat text1.dat 1 apple 2 pear 3 banana
You can also use cat to concatenate several files together for display. Example 4: $ cat *.dat
Our example files are very small, but sometimes you will have large files that you need to split into smaller pieces. For example, you might want to break a large file into CD-sized chunks so you can write it to CD and mail it to someone who could create a DVD for you. The split command does this in such a way that the cat command can be used to recreate the file easily. By default, the files resulting from the split command have a name prefix of x followed by a suffix of aa, ab, ac, ..., ba, bb, etc. Syntax: split [ options ] [ input ] [ output-prefix ] Use the -l n option to split a file into n-line chunks: $ split -l 2 file.doc short_
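The split-then-reassemble round trip described above can be sketched as follows (big.txt and its contents are hypothetical sample data):

```shell
# Create a hypothetical 6-line file to split
printf 'a\nb\nc\nd\ne\nf\n' > big.txt
# Split it into 2-line chunks named short_aa, short_ab, short_ac
split -l 2 big.txt short_
# cat recreates the original file from the chunks
cat short_aa short_ab short_ac > rebuilt.txt
cmp big.txt rebuilt.txt && echo "files match"
```

Because split names the chunks in sorted order, `cat short_*` is enough to reassemble them.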
wc, head and tail cat displays the whole file. That is alright for small files, but suppose you have a large file. First you might want to use the wc (word count) command to see how big the file is. The wc command displays the number of lines, words and bytes in a file. Syntax: wc [ options ] [ input-file ] Use the -c option to output the character count $ wc -c file.doc
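The three counts can be sketched on a small hypothetical file:

```shell
# sample.txt is hypothetical: two lines, three words, 14 bytes
printf 'one two\nthree\n' > sample.txt
wc -l sample.txt   # line count
wc -w sample.txt   # word count
wc -c sample.txt   # byte count
```

Reading from stdin (wc -l < sample.txt) prints only the number, without the filename.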
Two commands allow you to display either the first part (head) or last part (tail) of a file. These commands are the head and tail commands. They can be used as filters, or they can take a filename as an argument. By default they display the first (or last) 10 lines of the file or stream. Example 5: Print the first two lines of a text file (two alternatives) $ head -n 2 notes.php $ head -2 notes.php
Another common use of tail is to follow a file using the -f option, usually with a line count of 1. You might use this when you have a background process that is generating output in a file and you want to check in and see how it is doing. In this mode, tail will run until you cancel it (using Ctrl+C), displaying lines as they are written to the file. Example 6: $ tail -n 5 notes.txt
expand Sometimes you may want to swap tabs for spaces and vice versa; the expand and unexpand commands do this. The -t option for both commands allows you to set the tab stops (the default tab size is 8). Example 7: Change all tabs in document.txt to three spaces and display it on the screen (two alternatives) $ expand -t 3 document.txt $ expand -3 document.txt
Translating Sets of Characters with tr tr translates one set of characters to another. Syntax: tr start-set end-set Options: -d deletes characters in start-set instead of translating them -s replaces sequences of identical characters with just one
The tr command replaces characters in start-set with the corresponding characters in end-set. It is also important to note that tr cannot accept a file as an argument; it uses the standard input and output. Example 8: Replace all uppercase characters in input-file with lowercase characters (two alternatives) $ cat input-file | tr A-Z a-z $ tr A-Z a-z < input-file
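The translation and the -d and -s options described above can be sketched as follows (the sample strings are hypothetical):

```shell
# Translate uppercase to lowercase
echo "HELLO World" | tr 'A-Z' 'a-z'   # hello world
# -d deletes the listed characters instead of translating
echo "hello world" | tr -d 'lo'       # he wrd
# -s squeezes runs of identical characters to one
echo "aaabbbccc" | tr -s 'ab'         # abccc
```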
Pattern Matching and Wildcards Wildcards are pattern matching characters commonly used to find file names or text within a file. Common uses of a wildcard are: locating file names that you don't fully remember, locating files that have something in common, or performing operations on multiple files rather than individual files. The characters used for wildcards are: ? * [ ]
Special wildcard characters:
?          Match one character
*          Match any string
[emacgEl]  Match one character in the set
[e-x]      Match one character in the range
[!x-z]     Match one character not in the set
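The wildcard patterns above can be sketched with some hypothetical files:

```shell
# Create a few hypothetical files to match against
touch report1.txt report2.txt notes.md
ls report?.txt      # ? matches one character: report1.txt report2.txt
ls *.md             # * matches any string: notes.md
ls report[1-2].txt  # [1-2] matches one character in the range
```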
Numbering Lines Of A File with nl or cat The nl command numbers lines. There are options to finely control the formatting. By default, blank lines aren't numbered. Options: -ba numbers every line cat -n also numbers lines
Example10: $ nl text3.txt
Sorting Lines of Text with sort The sort filter reads lines of text and prints them sorted in order. The sort command can sort by numeric values or by character values. You can specify this choice for the whole record or for each field. Unless you specify a different field separator, fields are delimited by blanks or tabs. Options: -f makes the sorting case-insensitive -n sorts numerically, rather than lexicographically -k sorts the record according to the key position -r sorts in reverse
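The difference between lexicographic, numeric, and keyed sorting can be sketched on a hypothetical file:

```shell
# fruits.txt is hypothetical: a count followed by a fruit name
printf '10 pear\n2 apple\n1 banana\n' > fruits.txt
sort fruits.txt      # lexicographic: "10 pear" sorts before "2 apple"
sort -n fruits.txt   # numeric: 1, 2, 10
sort -k 2 fruits.txt # sort on the second field (the fruit name)
sort -nr fruits.txt  # numeric, reversed: 10, 2, 1
```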
Removing Duplicate Lines with uniq The uniq command usually operates on sorted files and removes consecutive identical lines from any file. The uniq command can also ignore some fields. Other uses of uniq: uniq -c counts how many times each line appeared uniq -u prints only unique lines uniq -d prints only duplicated lines
Example 11: find out how many unique words are in the dictionary $ sort /usr/dict/words | uniq | wc -w
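The -c, -d and -u options listed above can be sketched on hypothetical input (note the sort first, since uniq only compares consecutive lines):

```shell
# Count occurrences of each distinct line
printf 'pear\napple\npear\npear\napple\n' | sort | uniq -c
# Show only lines that are duplicated, then only lines that are unique
printf 'a\na\nb\n' | sort | uniq -d   # a
printf 'a\na\nb\n' | sort | uniq -u   # b
```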
cut, paste and join These three commands work with fields in textual data. They are particularly useful for dealing with tabular data. The first is the cut command, which extracts fields from text files. The default field delimiter is the tab character. To select the range of output: Use -c for characters Use -f for fields
Note that the range is written as a start and end position, for example 3-8. The first character or field is numbered 1, not 0. The field separator is specified by -d (defaults to tab). Example 12: Select usernames of logged-in users $ who | cut -d " " -f1 | sort -u The paste command pastes lines from two or more files side by side. Common Option: -d char to set the delimiter between fields in the output
Giving -d more than one character sets different delimiters between each pair of columns. Example 13: Assign passwords to users, separating them with a colon ( : ) $ paste -d : usernames passwords > .syspasswd
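Example 13 can be sketched end to end; the usernames and passwords here are hypothetical sample data:

```shell
# Two hypothetical column files, one entry per line
printf 'alice\nbob\n' > usernames
printf 's3cret\nhunter2\n' > passwords
# Join them side by side with : between the fields
paste -d : usernames passwords
# alice:s3cret
# bob:hunter2
```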
The join command joins files based on a matching field. The files should be sorted on the join field. Common Option: the -t option sets the field delimiter
By default, fields are separated by any number of spaces or tabs Example 14: Show details of suppliers and their products $ join suppliers.txt products.txt | less
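Example 14 can be sketched with hypothetical contents for suppliers.txt and products.txt; join matches the two files on their first field by default:

```shell
# Hypothetical data: supplier id + name, and supplier id + product
printf '1 Acme\n2 Globex\n' > suppliers.txt
printf '1 anvils\n2 widgets\n' > products.txt
# Both files are already sorted on the join field (field 1)
join suppliers.txt products.txt
# 1 Acme anvils
# 2 Globex widgets
```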
pr and fmt The pr command is used to format files for printing. The default header includes the filename and the file creation date and time, along with a page number and two lines of blank footer. When output is created from multiple files or the standard input stream, the current date and time are used instead of the filename and creation date. Common Options: -d double-spaces the output -h header changes from the default header to header -l lines changes the default lines on a page from 66 to lines -o width sets (offsets) the left margin to width
Example 15: Format the file thesis.txt for printing with the header My Thesis $ pr -h "My Thesis" thesis.txt > thesisforprinting.txt
Another useful command for formatting text is the fmt command, which formats text so it fits within margins. Common options: -u converts to uniform spacing: one space between words, two between sentences -w width sets the maximum line width in characters
Example 16: Change the line length of notes.txt to a maximum of 70 characters and display it on the screen: $ fmt -w 70 notes.txt | less