User's Guide
QUT000223DU
All rights reserved as an unpublished work, and the existence of this notice shall not be construed as an admission or presumption that publication has occurred. No part of the materials may be used, reproduced, disclosed or transmitted to others in any form or by any means except under license by SPSS Ltd. or its authorized distributors.
SPSS Ltd
England

Please address any comments or queries about this manual to the Support Department at the above address, or via e-mail to: support-uk@spssmr.spss.com

All trademarks acknowledged.
Contents
About this Guide
    Typographical conventions

2 Converting Quantum data to foreign formats
    2.1 Which program to use
    2.2 Using wcolbin
    2.3 Using mtw
        Data format to write
        Input and output files
        Record length and block size
        Writing only a given number of records
    2.4 Using mtwrite
    2.5 Restrictions

3 Checking for corrupt Quantum data files

4 Editing Quantum data
    4.1 Using ded
    4.2 Record-editing commands
    4.3 Card-editing commands
    4.4 Restrictions
    4.5 Diagnostics

5 Replacing text with sequential numeric values
    5.1 Preparing the text file
    5.2 Using mc

6 Printing selected fields from a file
    6.1 Which columns and fields to print
    6.2 Text and column separators in the output
    6.3 Dealing with blank or short records

7 Sorting files

8 ANSI carriage control sequences in files
    8.1 Adding ANSI control sequences
    8.2 Removing ANSI control sequences
Typographical conventions
The following typographical conventions have been used in this manual:

Bold text is used in syntax statements to show words that you must type exactly as they are shown.

Italic text is used in syntax statements to show words where you must substitute information of your own. For example, the word filename indicates that the program requires a filename and that you should enter the name of a file in place of the filename parameter. Italic text is also used in the main body of the text to refer to variable parameters from the command line and also to show MS-DOS or Unix commands.
Fixed width type is used to show examples.
text
EBCDIC
360/370 column binary
1130 column binary
Quantum internal (binary) format. (This uses the 12 lower-order bits (0FFF in hexadecimal) of a 16-bit word to represent the codes &-0123456789 in that order; that is, &=0800 and 9=0001.)
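The bit layout of the Quantum internal format can be illustrated with a short Python sketch. It assumes the twelve codes are &, -, 0 and 1 to 9 in that order (the hyphen is an assumption made so that twelve codes fill the twelve bits); the function names are hypothetical and are not part of any SPSS MR program.

```python
# Row order of one column, highest bit first.
# Assumption: the twelve codes are & - 0 1 ... 9, so that & = 0x0800 and 9 = 0x0001.
PUNCH_ORDER = "&-0123456789"


def punch_mask(codes):
    """Return the 12-bit word for the punch codes in one column."""
    mask = 0
    for code in codes:
        position = PUNCH_ORDER.index(code)  # 0 for '&', 11 for '9'
        mask |= 0x0800 >> position
    return mask


def punches(mask):
    """Return the punch codes set in a 12-bit word, in row order."""
    return "".join(
        code
        for position, code in enumerate(PUNCH_ORDER)
        if mask & (0x0800 >> position)
    )


if __name__ == "__main__":
    print(format(punch_mask("&"), "04X"))   # 0800
    print(format(punch_mask("9"), "04X"))   # 0001
    print(format(punch_mask("&1"), "04X"))  # 0900
    print(punches(0x0FFF))                  # &-0123456789
```

A column with all twelve punches set therefore has the value 0FFF, which is the mask quoted above.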
Normally, you will be reading foreign data files directly from a tape, but the instructions in this document apply to any input device as long as the records are of a fixed length.
SPSS MR has no utilities for reading variable length records. If you receive a tape or file in this format, you should ask the person who created the tape or file to create another version in fixed-record format.
The next example reads a 360/370 column binary file called cbdata from the current directory and creates from it a Quantum data file called qtdata:
rcolbin cbdata qtdata
If you want to convert a 360/370 column binary file that has a different record length and/or block size, or you want to convert a certain number of records only, or you want to convert data from a different format, use mtr or mtread.
1130 qin
If you omit the format option from the command line, mtr prompts for each data type in turn, asking whether the tape is of the particular format in question. For example:
Is this an EBCDIC tape?
Is this an ASCII tape?
Type y and press ENTER at the appropriate prompt. If you answer n to all prompts, mtr displays the message Don't know how to read this tape and stops.
The next command assumes that the data is being read from a magnetic tape (/dev/rmt1):
mtr -360 -i/dev/rmt1
The next command names an input file rather than an input device, so mtr will search for a file called efile in the current directory:
mtr -1130 -iefile
If you forget to enter the name of the input file or device, mtr does not prompt you for it. Instead, it waits for you to type in the data; to cancel mtr and re-enter the command with a filename, press CTRL+D.

If a tape contains more than one data file and you want to read any file after the first file, you need to tell mtr to skip over the preceding files on the tape. To do this, add the option -fnumber to the command line, where number is the number of files to skip. For example, if you want to read the third file on the tape, your command could be:
mtr -c -f2 -i/dev/rst0
You enter the name of the Quantum data file you wish to create in a similar way using the option -ofilename. For example:
mtr -360 -i/dev/rst0 -oqtdata
This example uses a simple filename as the name of the Quantum file so mtr will create the file in your current directory. If you want to create the file in a different directory, you may enter a full pathname here instead. In both cases, mtr overwrites the file if it already exists. If you omit the output filename from the command, mtr displays the data on the screen as it reads it.
To convert an EBCDIC data file that has a record length of 160 characters and a block size of 3,200 characters, you would type:
mtr -1130 -r160 -b3200 -iefile -oqtdata
The command

mtr -360 -r160 -b160 -iefile -oqtdata

reads a 360/370 column binary data file in which the record length and block size are both 160 characters.
If you omit the record length from your command but have defined the block size, mtr asks whether the incoming data has variable length records. If records are written one per block, type y and the record length is set to the same as the block size. The same is true, in reverse, if you define a record length without a block size.
Using the -v option or answering y to the question about variable length records does not mean that the incoming data file contains records of different lengths. mtr expects all records in a file to be the same length.
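The relationship between record length and block size can be sketched in Python. This is only an illustration, assuming that a block holds a whole number of fixed-length records (as in the -r160 -b3200 example, which gives 20 records per block); the function names are hypothetical and do not describe how mtr itself is implemented.

```python
def records_per_block(record_length, block_size):
    """Return how many fixed-length records fit in one block."""
    if record_length <= 0 or block_size % record_length != 0:
        raise ValueError("block size must be a whole multiple of the record length")
    return block_size // record_length


def split_block(block, record_length):
    """Split one block of bytes into its complete fixed-length records.

    Any bytes left over after the last complete record are dropped.
    """
    whole = len(block) // record_length
    return [
        block[index * record_length:(index + 1) * record_length]
        for index in range(whole)
    ]


if __name__ == "__main__":
    print(records_per_block(160, 3200))          # 20
    print(records_per_block(160, 160))           # 1, one record per block
    print(len(split_block(bytes(3200), 160)))    # 20
```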
to read and convert 50 records from SCSI tape drive 1. The command assumes that records have a record length of 160 characters, a block size of 1,600 characters and are in 360/370 column binary format. The Quantum data will be written to a file called qtdata in the current directory.
Byte swapping
Some computers hold the two bytes (characters) that make up a column in the opposite order to that used by the majority of computers. This does not affect text formats such as ASCII and EBCDIC, but with binary data the Quantum data will be incorrect if the bytes are not swapped as they are read in. To have mtr swap bytes before it converts to Quantum format, add the option -s to the command line.
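What byte swapping means for two-byte columns can be shown with a minimal Python sketch. It is an illustration of the idea only, not a description of mtr's own code, and the function name is hypothetical.

```python
def swap_byte_pairs(data):
    """Exchange the two bytes of every 16-bit column in the data."""
    if len(data) % 2 != 0:
        raise ValueError("binary column data must contain an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # second byte of each pair moves to the front
    swapped[1::2] = data[0::2]  # first byte of each pair moves to the back
    return bytes(swapped)


if __name__ == "__main__":
    columns = bytes([0x08, 0x00, 0x00, 0x01])
    print(swap_byte_pairs(columns).hex())  # 00080100
```

Swapping twice returns the original data, so applying the swap to a file that did not need it corrupts every column in the same reversible way.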
Do not use mtread if you want to convert a few records only or if the data needs to be byte swapped before conversion.
1.5 Restrictions
Invalid binary data is corrected and converted without warning. If mtr reaches the end of the file midway through a record, the partial record is ignored and cannot be converted. A message to this effect is issued:
Bad block size: Block 58 expected 160 got 100 assuming 80
It would also be useful to know how many respondents should be in the data file, and how many cards each respondent should have. Even if this information is approximate, it can give you a hint as to whether the data file is close to the correct size or not. When looking in hex at a file which may be column binary, you may find that every 160 characters or so you will see a repeated pattern, or a similar pattern. When you are dealing with an 80-column record, the serial number will be repeated in approximately the same place on each record, every 160 bytes. When looking at a file as a regular text file, you will see that most of the file consists of blanks, interspersed with blocks, triangles and other graphics-style characters.
Keep in mind that while much of column binary is record length 160 (two bytes per column), this is not a requirement and any even number may be used.
EBCDIC is different. EBCDIC in regular text mode looks mostly like line-drawing characters. There are few spaces. However, the best way to tell is to look at the file in hex. The hex codes mostly are numbers of the form Fn, where n is the data code. For example, a code 1 would be a hex F1, a code 2 would be a hex F2, and so on.
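The Fn pattern can be checked programmatically. The Python sketch below uses Python's cp037 codec as a stand-in for EBCDIC (an assumption: the guide does not say which EBCDIC code page applies) and a hypothetical helper that measures how much of a sample consists of bytes F0 to F9.

```python
def ebcdic_digit_share(data):
    """Return the fraction of bytes that fall in the range F0 to F9."""
    if not data:
        return 0.0
    digits = sum(1 for byte in data if 0xF0 <= byte <= 0xF9)
    return digits / len(data)


if __name__ == "__main__":
    # Digits encoded with the cp037 code page come out as F0..F9.
    print("0001".encode("cp037").hex())               # f0f0f0f1
    print(ebcdic_digit_share("0001".encode("cp037")))  # 1.0
    # The same digits in ASCII are 30..39, so the share is zero.
    print(ebcdic_digit_share(b"0001"))                 # 0.0
```

A high share on the first few hundred bytes of a file is a hint, not proof, that the file is EBCDIC.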
Most market research data files you will be converting are in the 80-column format, no matter which type of record you are converting from. So, think 80, 160 and the like when trying to discover the record length.

When you are converting, and are not sure of the record length, do not use the -m option to read only a few records. With this option, mtr does not tell you if there is anything wrong with the conversion factors you used. It just reports how many records it found and returns you to the system prompt. If you let mtr convert the whole file, it will issue an error message if it finds extra characters left over at the end of the operation. For example:
Bad block size: Block 69 expected 1600 got 1437 assuming 1280
This means that the last record was not completely filled before the end of the file was reached. This is the pointer that suggests that your mtr command was not correct for the data file you are converting.
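The same check can be made before running a conversion by comparing the file size with candidate record lengths. The Python sketch below is a hedged illustration: the function names and the candidate lengths are assumptions chosen for the example, and the guide does not document how mtr arrives at its "assuming" figure.

```python
def leftover_bytes(total_bytes, record_length):
    """Return the number of bytes left after the last complete record."""
    return total_bytes % record_length


def plausible_record_lengths(total_bytes, candidates=(80, 160, 164)):
    """Return the candidate record lengths that divide the size exactly."""
    return [length for length in candidates if total_bytes % length == 0]


if __name__ == "__main__":
    # A final block of 1437 bytes with 160-byte records leaves 157 bytes over.
    print(leftover_bytes(1437, 160))          # 157
    print(1437 - leftover_bytes(1437, 160))   # 1280
    print(plausible_record_lengths(3200))     # [80, 160]
```

In the message above, 1437 minus the 157 leftover bytes gives 1280, which matches the figure mtr reports, although this reading of the message is an inference rather than documented behaviour.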
Once you have tried to convert the file and have received a message similar to the one shown above, look at the output file using, say, type in MS-DOS or more in Unix. If you have the type of file correct, but the block size, record length, or both, is incorrect, you will see a file where you have numbers but they will seem to march diagonally across the screen, displaced out of their correct spots. For example:
00019595695969696797979879695949392692393459697695954493932932939339393933933
00 02409824098724t0989873245097309874098247124098712340987123409871240983408
00 03 244098723409874t3450987345098723098723409871234098712340987123409871234
0004 309873087959569596969679392692393459697695954493932932939339393933933
000594 0486467638362526474746253939393939 ....
Notice how the serial numbers 0001, 0002, 0003, 0004 and 0005 seem to move diagonally across the page. The record length used to convert this file was 160 when it should have been 164.

When you have converted a multicoded file into Quantum format, you may see a five-sided box symbol or, if you look at the file with vi, the characters ^?, followed by a list of letters, numbers and symbols. This special symbol is the decimal code 127, or hex 7F, that separates the end of the data from the multicodes in each record. In the body of the record, each multicode is shown as an asterisk. If a record contains no multicodes, you will not see this special symbol in that record.

For a complete description of Quantum data format, see Appendix C of the Quantum User's Guide Volume 4.
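The split between the data and the multicode list can be examined with a small Python sketch. It assumes only what is stated above: the separator is character code 127 and multicodes appear as asterisks in the body. The sample record and the function name are hypothetical, and the encoding of the characters after the separator is not interpreted.

```python
MULTICODE_MARK = "\x7f"  # decimal 127, shown as ^? in vi


def split_record(line):
    """Return (data, multicode_tail) for one line of a Quantum data file."""
    data, _, tail = line.rstrip("\n").partition(MULTICODE_MARK)
    return data, tail


if __name__ == "__main__":
    # Hypothetical record: two multicoded columns followed by the separator.
    record = "0015616267*7575*162" + MULTICODE_MARK + "ABCD\n"
    data, tail = split_record(record)
    print(data.count("*"))  # 2 multicoded columns in the body
    print(len(tail))        # 4 characters after the separator
```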
text
EBCDIC
360/370 column binary
1130 column binary
Quantum internal (binary) format. (This uses the 12 lower-order bits (0FFF in hexadecimal) of a 16-bit word to represent the codes &-0123456789 in that order; that is, &=0800 and 9=0001.)
The next example creates a 360/370 column binary file called cbdata in the current directory:
wcolbin qtdata cbdata
If you want to create a 360/370 column binary file with a different record length and/or block size, or you want to create a certain number of records only, or you want to create data in a different format, use mtw or mtwrite.
1130 qin
In conversions to ASCII format, any multicodes in the Quantum data that do not correspond to standard ASCII characters are written out as asterisks. For example, a multicode of &1 corresponds to the letter A and will be written out as such, whereas a multicode of 123 has no ASCII equivalent, and will therefore be written out as an asterisk.

In conversions to EBCDIC, the Quantum data is converted first to ASCII and then from ASCII into EBCDIC. The notes for conversions to ASCII therefore apply.
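The multicode rule for ASCII conversions can be sketched in Python. The sketch assumes the conventional punched-card letter codes (& with 1 to 9 for A to I, - with 1 to 9 for J to R, 0 with 2 to 9 for S to Z); the guide itself confirms only that &1 is A and that 123 has no equivalent. Punctuation combinations are left out, and the function name is hypothetical.

```python
# Letters reached by a zone punch plus a digit punch (index = digit - 1).
# Assumption: conventional punched-card letter codes; only "&1" = "A" is
# confirmed by the guide. A blank marks a combination this sketch does not map.
ZONES = {
    "&": "ABCDEFGHI",
    "-": "JKLMNOPQR",
    "0": " STUVWXYZ",
}


def multicode_to_ascii(codes):
    """Return the ASCII character for the punches in one column, or '*'."""
    if len(codes) == 1:
        return codes  # a single punch is written as itself
    if len(codes) == 2:
        zone, digit = codes[0], codes[1]
        if zone in ZONES and digit in "123456789":
            letter = ZONES[zone][int(digit) - 1]
            if letter != " ":
                return letter
    return "*"  # no ASCII equivalent in this sketch


if __name__ == "__main__":
    print(multicode_to_ascii("&1"))   # A
    print(multicode_to_ascii("123"))  # *
```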
If you omit the format option from the command line, mtw prompts for each data type in turn, asking whether the conversion is of the particular format in question. For example:
Is this an EBCDIC conversion?
Is this an ASCII conversion?
Type y and press ENTER at the appropriate prompt. If you answer n to all prompts, mtw displays the message Don't know how to do this conversion and stops.
This example uses a simple filename as the name of the Quantum data file so mtw will look for the file in your current directory. If you want to convert a file in a different directory, you may enter a pathname here instead.

If you forget to enter the name of the input file, mtw does not prompt you for it. Instead, it waits for you to type in the data; to cancel mtw and re-enter the command with a filename, press CTRL+D.

You enter the name of the file you wish to create in a similar way using the option -ofilename. Most times, you will be writing data to a tape so the output filename will be the name of the tape device you are using. For example, if you are using a SCSI tape drive called /dev/rst0 and are converting the file called qtdata, you would enter this on the command line as:

mtw -360 -iqtdata -o/dev/rst0

The next command assumes that you are writing to a magnetic tape (/dev/rmt1):
mtw -360 -iqtdata -o/dev/rmt1
The next command names an output file rather than an output device, so mtw will create a file called efile in the current directory:
mtw -1130 -iqtdata -oefile
If you omit the output filename or device name from the command, mtw displays the data on the screen as it converts it. You may find this useful if you want to use the converted data as the input to another program, since it means that you can pipe the converted data directly from mtw into the second program. There is no need to store an intermediate data file unless you wish to do so.
To create an EBCDIC data file that has a record length of 160 characters and a block size of 3,200 characters, you would type:
mtw -1130 -r160 -b3200 -iqtdata -oefile
If you omit either the record length or the block size from your command, mtw prompts you for them.
to convert and write out 50 records to SCSI tape drive 1. Records will have a record length of 160 characters, a block size of 1,600 characters and will be in 360 column binary format.
mtw does not skip to the end of the input file before stopping. If you are reading data from a tape, the tape stops in the middle of the Quantum data file and should be repositioned or rewound manually.
2.5 Restrictions
wcolbin, mtw and mtwrite have no facilities for positioning the tape before writing to it. If you wish to write more than one file to a tape, you must either use a non-rewinding tape drive or reposition the tape at the end of the last file before writing the second file to the tape.
For further information on Quantum data formats, see Appendix C of the Quantum User's Guide Volume 4.
Tab characters have no meaning in Quantum data files and will cause a run to fail. To check for tabs and corrupt multicodes, use the badata program. To run badata, type:

badata [-v] [-x] [-o output_file] [input_files]

input_files is a list of one or more filenames separated by spaces. A hyphen instead of a filename tells badata to read from the standard input (that is, data you type on your keyboard) rather than from a file.

The -v option displays the program version number, and -x displays a summary of usage.

Errors are normally displayed on the screen, but you can use the -o option to redirect the output to a file. For example:
badata -o errors data1 data2
badata issues two types of error messages. If it finds a tab character, it reports:
Line number: Tab character in data
If it finds too few or too many characters after the multicode symbol, it reports:
Line number: corrupt record (x multi-punched columns, y character codes)
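The tab check is simple enough to reproduce when badata is not to hand. The Python sketch below is a stand-in written for illustration, not the badata program: the function name is hypothetical and the message text merely imitates the first message shown above. It does not attempt the multicode check, because the guide does not state how many characters should follow the multicode symbol.

```python
def find_tab_lines(lines):
    """Return the 1-based numbers of the lines that contain a tab character."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if "\t" in line
    ]


if __name__ == "__main__":
    sample = ["0001123456", "0002\t3456", "0003123456"]
    for number in find_tab_lines(sample):
        print("Line %d: Tab character in data" % number)
    # Line 2: Tab character in data
```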
This example separates the start and end columns of each field with commas, but ded accepts any character except a space, a digit or a tab character as a separator. Editing commands are divided into record-editing commands and card-editing commands. The prompt for record-editing commands is a colon, and for card-editing commands, it is c:.
This is the default printing mode if you name the serial number field on the command line.
pc
Switches character-printing mode on. The contents of each record are displayed with multicodes shown as asterisks. For example:
0015616267*7575*16231204521321**
001562 9438 21232* &- *23
This is the default if you do not name the serial number field on the command line.
pp
Switches punch (code) printing mode on. The punches in a multicoded column are displayed vertically in that column. Ranges of consecutive codes are shown using the notation start/end (for example, 1/5 for codes 1 to 5 inclusive). For example:
00156162671757541623120452132113 3 / 25 7 9&
Prints the record in the current printing mode. In a multicard record, all cards are printed at once.

Displays an 80-column ruler above each record printed in pp or pc mode. Entering this command when a ruler is already displayed removes the ruler.

eol
Prints cards in a multicard record double-spaced. To switch off double-spacing, re-enter the eol (end of line) command.

n
Prints the nth record in the file (for example, 156).

m,n
Prints records m to n in the file (for example, 156,160).

+ or -
Prints the next (+) or previous (-) record in the file. If you enter a number after the + or - sign, ded skips forward or back that number of records and prints the record at that position. For example, typing -5 at record 156 prints record 151. The printed record becomes the current record to which any changes will apply.

$
Goes to the last record in the file and prints it.

(sernum)
Locates the record with the given serial number. If the serial number is shorter than the serial number field width, ded pads it on the left with zeros. If you type (156), for example, and the serial number field is five columns wide, ded searches for a record whose serial number is 00156. This command is only valid if you defined the serial number field on the command line.

g(sernum)
Locates and displays all records with the given serial number. At the end of the search, the pointer is left at the end of the file. This is useful for looking at records with duplicate serial numbers with a view to deciding which one, if any, should be deleted.

l,$
Lists the whole file.

=
Reports the number of records in the file.
/data/
Searches for a record containing the given data and displays that record.

s col=data
Overwrites the contents of column/field col with the given data. As in Quantum, the column specification for multicard records must define both the card type and column numbers. Punches must be enclosed in single quotes and strings must be enclosed in dollar signs. Here are some examples:

s 15='&'
s 45,50=$123456$
s 252='156&'
A space is an alternative to the = sign for separating the column and code specifications.

a
Appends a new record after the current record. Type the data on a new line. Each character you type goes into a new column unless you enclose a string of codes in single quotes. In this case, those codes are treated as a multicode in the current column. At the end of the data, press ENTER and then type a dot on a line by itself to terminate the record. For example:

0015718462'137&'123

generates a record with 14 columns of data; column 11 is multicoded.

i
Inserts a new record before the current record. Rules for data entry are as described for the append command.

d
Deletes the current record. In a multicard record, all cards are deleted.

sort
Sorts the data by serial number (you must have defined the serial number field on the command line).

gsort
Sorts the data by card type within serial number (you must have defined the serial number and card type fields on the command line).

merge
Merges adjacent records with identical serial numbers. Cards in the resultant records are not sorted, neither are duplicate card types merged into a single card. When used with the insert or append commands, this is a useful facility for dealing with missing cards. For example, if card 5 is missing from record 156, you could enter the data for this card by appending it after record 156. If you then run the merge command, the new card 5 will be merged with the rest of the data for respondent 156.

cs sernum
Changes the serial number to the given number. If the number you give is shorter than the serial number field, it will be padded on the left with zeros. To force blank padding, precede the number with a colon followed by the required number of blanks. For example:
cs : 156
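Two of the data-entry rules described for the record-editing commands lend themselves to a quick illustration in Python: serial numbers are padded on the left with zeros to the field width, and a string of codes in single quotes occupies one column in append or insert data. The function names are hypothetical and the sketch is not ded's own implementation.

```python
def pad_serial(sernum, width):
    """Pad a serial number on the left with zeros to the field width."""
    return sernum.zfill(width)


def parse_columns(entry):
    """Split typed data into columns; a quoted string is one multicoded column."""
    columns = []
    position = 0
    while position < len(entry):
        if entry[position] == "'":
            # str.index raises ValueError if the closing quote is missing.
            closing = entry.index("'", position + 1)
            columns.append(entry[position + 1:closing])
            position = closing + 1
        else:
            columns.append(entry[position])
            position += 1
    return columns


if __name__ == "__main__":
    print(pad_serial("156", 5))                  # 00156
    columns = parse_columns("0015718462'137&'123")
    print(len(columns))                          # 14
    print(columns[10])                           # 137& (column 11)
```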
e
Switches into card-editing mode for the current record. Type q to revert to record-editing mode.

w
Writes out the data file, saving any changes. To write out a range of records, type the first and last record numbers in the range at the start of the command. To write data out to a different file, type the new filename at the end of the command. Here are some variations of the w command:

1,100 w
w newdata
1,100 w newdata

q
Leaves the data editor.

!command
Executes the given MS-DOS/Unix command without terminating the editing session. When the command finishes, you are returned to the editor.

shell
Starts a subshell in which you can run MS-DOS/Unix commands. When you close the subshell, you are returned to the editor.
For example, suppose you are working on card 3 of record 156 and you type cs 200. If there is already a record with serial number 200, the data on the current card is copied into record 200 as a card 3 and the current card is deleted from record 156. However, if there is not already a record 200 in the file, ded copies the data from the current card into a new record with serial number 200, and places the new record immediately after record 156. Card 3 is then deleted from record 156.

q
Returns to record-editing mode.
4.4 Restrictions
ded does not recognize the Quantum notation /& meaning all 12 punches. The dot notation used in the Unix ed, ex and vi editors for referring to the current line has not been implemented in ded. It is not possible to delete specific punches from a column, nor to emit new punches into a column. Use set commands to name the exact punches required in each column.
4.5 Diagnostics
Various error messages are displayed, mainly to do with buffer errors while reading or writing data. If the file is too large to handle, ded advises you to use the Unix split command to make it smaller.
5.2 Using mc
To use mc, type:

mc start_value increment field_width[format] [text] [input_file] [output_file]

In the mc command, start_value is the numeric value for the first replacement and increment is the incremental value for each subsequent replacement. The default for both values is 1.

field_width is the width of the replacement field. This may be between 1 and 11 columns; the default is 5 columns.

format is a single character defining what to do when the replacement value is shorter than the field width. The default is to right-justify the replacement value in the field and pad the field on the left with zeroes. You may choose to use blanks instead of zeroes or to suppress them altogether. To do this, enter a format character immediately after the field width. Valid format characters are:

B    Pad the field on the left with blanks.
S    Suppress leading zeroes. The width of the replacement field then depends on the number of digits in the replacement value.
Z    Pad the field on the left with zeroes (the default).

text is the text to be replaced. The default is the @ symbol.

You can also use mc from within the ex or vi editors. To do this, edit the text file containing the special replacement symbols using one or other of these editors. Then, type:

:lines!mc start_value increment field_width[format] [text]

lines is any ex or vi syntax that is valid for referring to lines in the file. Examples are 1,$ and % for all lines in the file, and 10,50 for lines 10 to 50 only.

Here is an example. Suppose you have a large file containing a list of magazines. From time to time you want to extract various titles from the list and number them sequentially from 1. Here is part of the master magazine list:
Gardeners' World        @
Amateur Gardening       @
Garden News             @
Practical Gardening     @
Amateur Photographer    @
Photography             @
Practical Photography   @
Practical Woodworking   @
Practical Householder   @
Do It Yourself          @
Suppose you want to extract a list of all photography magazines and number them sequentially from 1. Here are the steps you would take:

1. Edit the file with ex or vi and delete all magazines that are not to do with photography. It is helpful if all titles to do with a particular topic are grouped, as in the above example, but this is not necessary.

2. Type:

   :1,$!mc 1 1 3B

   This tells mc to replace all @ symbols (the default text is used because no other text is defined with mc) with numbers starting at 1 and incremented by 1. The replacement field is 3 characters wide and is padded on the left with blanks.

3. Save your work in a new file (:w filename) and quit.

Here is the result of running these commands on the example file:

Photography               1
Practical Photography     2
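The replacement rules of mc (start value, increment, field width and the B, S and Z formats) can be expressed in a few lines of Python. This is a sketch of the behaviour as described in this chapter, not the mc source; the function name and its argument names are hypothetical.

```python
def number_placeholders(lines, start=1, increment=1, width=5, fmt="Z", text="@"):
    """Replace each occurrence of text with the next number in the sequence."""
    def field(value):
        digits = str(value)
        if fmt == "B":
            return digits.rjust(width)  # pad on the left with blanks
        if fmt == "S":
            return digits               # suppress leading zeroes
        return digits.zfill(width)      # Z: pad on the left with zeroes

    value = start
    result = []
    for line in lines:
        parts = line.split(text)
        rebuilt = parts[0]
        for part in parts[1:]:
            rebuilt += field(value) + part
            value += increment
        result.append(rebuilt)
    return result


if __name__ == "__main__":
    titles = ["Photography @", "Practical Photography @"]
    for line in number_placeholders(titles, 1, 1, 3, "B"):
        print(line)
```

Called with a width of 3 and the B format, as in step 2 above, each @ becomes a three-character field holding 1, 2 and so on, right-justified and padded with blanks.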
The cut, paste and join utilities available with Unix provide some of this functionality, but are generally used in shell scripts rather than on the command line. An alternative is to use awk which is, again, standard on Unix systems. However, although it is extremely flexible, awk is more a programming language than a simple one-line utility.

Another option is to use bycol. This easy-to-use program reads a file and prints selected columns or fields from each line. You may also define additional texts to be printed as part of the output. To use bycol, type:

bycol [-anpx] [-sseparator] [what_to_print] [filename]

To see a reminder of the command syntax, type:

bycol -x
prints columns 1 to 2, 15 to 10, 4, 9, 1 and 29 of myfile in that order. The contents of these columns are printed as a single string with no spaces in between them. The $ symbol represents the end of the line, so the notation 1,$ prints the whole line. However, $ often has a special meaning to the shell so it is advisable to enclose references of this kind in single quotes (Unix) or double quotes (MS-DOS). For example, under Unix:
bycol '1,$' myfile
prints the whole of myfile and is the same as typing cat myfile under Unix. Under MS-DOS, the command is:
bycol "1,$" myfile
bycol displays its output on the screen. To write the output to a file, end the line with >filename. For example:
bycol 1,2 15,10 4 9 1 29 myfile > opfile
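The column and field notation can be mimicked in Python for readers who want to see exactly which characters a reference selects. The sketch below handles single columns, start,end fields and the $ end-of-line symbol; it is an illustration with a hypothetical function name, not the bycol program, and it makes no attempt to interpret fields whose start is greater than their end.

```python
def pick_columns(line, specs, separator=""):
    """Return the selected columns or fields of a line, joined by separator.

    Each spec is a column number ("4"), a field ("1,2") or a field that
    runs to the end of the line ("1,$"). Columns are numbered from 1.
    """
    pieces = []
    for spec in specs:
        start, _, end = spec.partition(",")
        first = int(start)
        if end == "":
            last = first          # single column
        elif end == "$":
            last = len(line)      # up to the end of the line
        else:
            last = int(end)
        pieces.append(line[first - 1:last])
    return separator.join(pieces)


if __name__ == "__main__":
    line = "ABCDEFGHIJ"
    print(pick_columns(line, ["1,2", "4", "9"]))        # ABDI
    print(pick_columns(line, ["1,2", "4", "9"], " "))   # AB D I
    print(pick_columns(line, ["1,$"]))                  # ABCDEFGHIJ
```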
prints the word Before, then the contents of column 25, then the word After, and finally the contents of column 26.

If you wish, you can define column separators as texts. Here is the very first example again, this time with spaces used as separators:

Unix      bycol 1,2 ' ' 15,10 ' ' 4 ' ' 9 ' ' 1 ' ' 29 myfile
MS-DOS    bycol 1,2 " " 15,10 " " 4 " " 9 " " 1 " " 29 myfile

If you want to use the same separator across the whole line, it is quicker to define it once using the -s option. You could rewrite the previous example to produce the same output by typing:

Unix      bycol -s' ' 1,2 15,10 4 9 1 29 myfile
MS-DOS    bycol -s" " 1,2 15,10 4 9 1 29 myfile
The characters you use in text strings or as separators are not restricted to letters and numbers. You can use other characters from the list below, but be sure to enclose them in single quotes:

To print a         Type    Or the octal value
New line           \n      12
Carriage return    \r      15
Backspace          \b      10
Tab                \t      11
Formfeed           \f      14
Backslash          \\      134
Here is an example that uses the tab character as the column separator:

Unix      bycol -s'\t' 1,4 10,12 15 56,62 data
MS-DOS    bycol -s"\t" 1,4 10,12 15 56,62 data
6.5 Restrictions
bycol cannot output lines longer than 1,024 characters. The maximum number of column references, field references, and texts in a command is 1,024.
7 Sorting files
The standard ASCII file sorting programs provided with MS-DOS and Unix have shortcomings when used to sort files based on the contents of more than one field in the line. Under MS-DOS, lines are sorted based on the contents of column 1 or a single column that you choose. Sorting based on fields of columns is not possible. The Unix sorting program is much more sophisticated and allows sorting based on the contents of one or more fields, but the syntax for specifying the fields is not straightforward.

asort is a utility that overcomes these limitations and makes it easy to specify sorts using any number of columns and fields. To use asort under Unix, type:

asort [options] input_file output_file start1 end1 [... startn endn]

To use asort under MS-DOS, type:

asort input_file output_file start1 end1

where input_file is the name of the unsorted input file, output_file is the name of the sorted output file, and start1 and end1 are the start and end positions of the first field you want to sort on. To sort on a single column, enter the same value for the start and end columns.

Under Unix, you can sort on more than one field by entering the pairs of start and end positions in order of importance, most important first. This is not possible under MS-DOS and asort will issue an error message if you specify more than one column field.

The options under Unix are:

Option Explanation

Call sort using the old method of specifying the sort key. This has been provided for backwards compatibility. You may find this option useful in the unlikely event that asort now gives different results from previous versions. By using this option, you should get the same results as you did using the previous version of asort. The default is that this option is off.

Call sort in verbose mode. The default is that this option is off.
This command produces a sorted version of unsort.txt in the file sort.txt. Lines are sorted first on the contents of columns 1 to 5 (the highest sort level) and within that on columns 10 to 12. The lowest level of sorting is on column 80 within columns 10 to 12.
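Sorting on several column fields in order of importance amounts to sorting on a compound key. The Python sketch below illustrates that idea with (start, end) pairs given most important first; it is not the asort program, and the function name and sample data are hypothetical.

```python
def sort_by_fields(lines, fields):
    """Sort lines on one or more column fields, most important first.

    fields is a list of (start, end) column pairs, numbered from 1 and
    inclusive, in order of importance.
    """
    def key(line):
        return tuple(line[start - 1:end] for start, end in fields)

    return sorted(lines, key=key)


if __name__ == "__main__":
    unsorted = ["02 B", "01 Z", "01 A"]
    # Sort on columns 1 to 2 first and, within that, on column 4.
    for line in sort_by_fields(unsorted, [(1, 2), (4, 4)]):
        print(line)
    # 01 A
    # 01 Z
    # 02 B
```

A three-level sort such as the one described above would pass the pairs (1, 5), (10, 12) and (80, 80) in that order.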
Anything else means print this text at the beginning of the next line, which equates to the ASCII CTRL+J character. The accepted character to use in ANSI files is the space character. If you have a file with printer controls marked in ANSI format, you can convert it into ASCII format by running the program deftn. Similarly, if you have a file in ASCII format that needs to be converted into ANSI format, you can convert the file using ftnise.
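The effect of removing carriage control characters can be sketched in Python. The sketch assumes the conventional Fortran-style meanings for the first column (1 for a new page, 0 for double spacing, + for overprinting the previous line), which are assumptions not confirmed by this section; only the "anything else" rule, a new line (CTRL+J), comes from the text above. The function name is hypothetical and the code does not describe how deftn is implemented.

```python
# Assumed meanings of the first-column control character. Only the default
# (start a new line, CTRL+J) is stated in the guide; the rest are assumptions.
CONTROL_PREFIX = {
    "1": "\f",    # new page (form feed)
    "0": "\n\n",  # double spacing
    "+": "\r",    # overprint: return to the start of the same line
}


def strip_carriage_control(lines):
    """Convert lines with a leading control character into plain ASCII text."""
    output = ""
    for line in lines:
        control, text = line[:1], line[1:]
        output += CONTROL_PREFIX.get(control, "\n") + text
    return output


if __name__ == "__main__":
    listing = [" first line", "0after a blank line", "1top of a new page"]
    print(repr(strip_carriage_control(listing)))
```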
to create a file called list1.ans by adding ANSI carriage control characters to the lines in the file called list1.
to remove the ANSI carriage control sequences from the file list1.ans and to save the results in a file called list1.unx.