20.2 - Issues with Reading Fixed-Field Data
20.2 - Issues with Reading Fixed-Field DataFor the most part, using column and formatted input to read in data values that are arranged in neatly defined columns is straightforward. There are though — as always — a few sticking points that we'll address in this section. First, we'll learn not to be surprised when the lengths of our numeric variables do not coincide with the widths of the numeric informats we use in our INPUT statement. Then, we'll investigate what to do when trying to read in fixed-field data from a raw data file that contains records of different lengths.
Lengths of Variables
You might recall that the descriptor portion of a data set contains information about the attributes of each variable in your data set. The length of each variable in your data sets is one such attribute. When you use informats to read in some of your data values, it pays to know how SAS defines the lengths of your variables when it processes your DATA step. Let's make our discussion of this concrete by taking a look at an example.
Example 20.4
The following program is identical to the program in Example 20.2, except the PRINT procedure has been replaced with the CONTENTS procedure so that we can explore how SAS defines the lengths of variables:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
+3 height 2.
+5 wt_date mmddyy8.
+1 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC CONTENTS data = temp;
title 'Contents of TEMP Data Set';
RUN;
Contents of TEMP Data Set
Data Set Name | WORK.TEMP | Observations | 5 |
---|---|---|---|
Member Type | DATA | Variables | 6 |
Engine | V9 | Indexes | 0 |
Created | 2023/09/12 16:10:42 | Observation Length | 56 |
Last Modified | 2023/09/12 16:10:42 | Deleted Observations | 0 |
Protection | Compressed | NO | |
Data Set Type | Sorted | NO | |
Label | |||
Data Representation | WINDOWS_64 | ||
Encoding | shift-jis Japanese (SJIS) |
Engine/Host Dependent Information | |
---|---|
Data Set Page Size | 65536 |
Number of Data Set Pages | 1 |
First Data Page | 1 |
Max Obs per Page | 1167 |
Obs in First Data Page | 5 |
Number of Data Set Repairs | 0 |
ExtendObsCounter | YES |
Filename | C:\Users\klc137\AppData\Local\Temp\SAS Temporary Files\_TD1668_E1-055551_\temp.sas7bdat |
Release Created | 9.0401M5 |
Host Created | X64_10PRO |
Owner Name | PSU\klc137 |
File Size | 128KB |
File Size (bytes) | 131072 |
Alphabetic List of Variables and Attributes | ||||
---|---|---|---|---|
# | Variable | Type | Len | Format |
6 | calorie | Num | 8 | COMMA5. |
2 | f_name | Char | 11 | |
4 | height | Num | 8 | |
3 | l_name | Char | 6 | |
1 | subj | Num | 8 | |
5 | wt_date | Num | 8 | MMDDYY8. |
Launch and run the SAS program, and when reviewing the output of the CONTENTS procedure, direct your attention to this portion:
# | Variable | Type | Len | Format |
---|---|---|---|---|
6 | calorie | num | 8 | COMMA5. |
2 | f_name | Char | 11 | |
4 | height | Num | 8 | |
3 | l_name | Char | 6 | |
1 | subj | Num | 8 | |
5 | wt_date | Nu | 8 | MMDDYY8. |
Let's see what we can learn from this ... SAS defined the length of the character variables — f_name and l_name — as 11 and 6, respectively. Hmm ... is it just a coincidence that the informats we used in the INPUT statement for these two variables were $11.? and $6., respectively? Of course not! In general, during the compile phase, SAS defines the lengths of the character variables in the program data vector to be the same as the field width w specified by the $w. character informat.
Can the same generalization be made for the numeric variables, such as subj, height, wt_date, and calorie? Apparently not! For example, we used the numeric informat of 4. to read in the subj variable, and yet SAS assigned the subj variable a length of 8 bytes. Similarly, we used the comma5. informat to read in the calorie variable, and yet SAS assigned the calorie variable a length of, not 5, but 8 bytes. In general, by default, SAS stores a numeric value— no matter how many digits the value contains — as a floating-point number in 8 bytes of storage. That is, the length of a stored numeric variable is not affected by an informat's width nor by any other column specifications in an INPUT statement. Unfortunately, that doesn't mean you can be lazy and use a default width of 8 for all of your numeric informats. If you do, you'll quickly see that your data values are not read properly.
Example 20.5
The following program is similar to the program in the previous example, except all of the input pointer controls have been changed to @n absolute pointer controls and the numeric informat for the height variable has been changed from 2. to 8.:
DATA temp;
input @1 subj 4.
@6 f_name $11.
@18 l_name $6.
@30 height 8.
@34 wt_date mmddyy8.
@43 calorie comma5.;
format wt_date mmddyy8. calorie comma5.;
DATALINES;
1024 Alice Smith 1 65 125 12/1/95 2,036
1167 Maryann White 1 68 140 12/01/95 1,800
1168 Thomas Jones 2 190 12/2/95 2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia Ho 1 63 115 1/1/96 1,972
;
RUN;
PROC PRINT data = temp;
title 'The TEMP Data Set';
RUN;
Obs | subj | f_name | l_name | height | wt_date | calorie |
---|---|---|---|---|---|---|
1 | 1024 | Alice | Smith | . | 12/01/95 | 2,036 |
2 | 1167 | Maryann | White | . | 12/01/95 | 1,800 |
3 | 1168 | Thomas | Jones | . | 12/02/95 | 2,302 |
4 | 1201 | Benedictine | Arnold | . | 11/30/95 | 2,432 |
5 | 1302 | Felicia | Ho | . | 01/01/96 | 1,972 |
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS had trouble reading the data values for the height variable. If you take a look at the input data, you can see why. If you start in column 30, where the height values begin, and count eight columns to the right, you should see that SAS encounters two blank spaces — the one between the height and weight values and the one between the weight and date values. Because a blank space is a character value, and the height variable is defined as numeric, SAS gets confused, and displays an Invalid data message in the log window:
NOTE: Invalid data for height in line 299 30-37.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
1024 Alice Smith 1 65 125 12/1/95 2,036
subj=1024 f_name=Alice l_name=Smith height=. wt_date=12/01/95 calorie=2,036 _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 300 30-37.
1167 Maryann White 1 68 140 12/01/95 1,800
subj=1167 f_name=Maryann l_name=White height=. wt_date=12/01/95 calorie=1,800 _ERROR_=1 _N_=2
NOTE: Invalid data for height in line 301 30-37.
1168 Thomas Jones 2 190 12/2/95 2,302
subj=1168 f_name=Thomas l_name=Jones height=. wt_date=12/02/95 calorie=2,302 _ERROR_=1 _N_=3
NOTE: Invalid data for height in line 302 30-37.
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
subj=1201 f_name=Benedictine l_name=Arnold height=. wt_date=11/30/95 calorie=2,432 _ERROR_=1 _N_=4
NOTE: Invalid data for height in line 303 30-37.
1302 Felicia Ho 1 63 115 1/1/96 1,972
subj=1302 f_name=Felicia l_name=Ho height=. wt_date=01/01/96 calorie=1,972 _ERROR_=1 _N_=5
NOTE: The data set WORK.TEMP has 5 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
and assigns missing values to the height variable. Because we used @n absolute pointer controls to tell SAS where to begin reading the wt_date and calorie values, SAS has no trouble reading in the data for those variables.
In a nutshell, the moral of the story of the two previous examples is that you need to make sure that you define the widths of your character and numeric informats to be the number of columns the data values occupy in the input data, not the length of their resulting variables.
Reading Variable-Length Records
When you are using column or formatted input, you might run into trouble when the records in your input data file are of different lengths. Data files that have an end-of-record marker after a fixed, predetermined number of columns are called fixed-length record data files. This data file:
is a fixed-length record data file, because each record ends in the same column. Reading this data file using column input is a straightforward exercise.
Data files that have an end-of-record marker after the last field in each record are called variable-length record data files. This data file:
is a variable-length record data file, because the length of the records varies. Reading this data file using column input requires using the INFILE statement's PAD option or MISSOVER option to tell SAS how to behave when it reaches an end-of-record marker before all of the variables in a record have been read. Let's investigate this more thoroughly.
First, an important administrative comment! So far, we've been able to get away with illustrating all of the above SAS features by reading the data instream using a DATALINES statement. Since the examples that follow concern options that are available for the INFILE statement, we can't be lazy anymore. For each example, you'll have to first right-click on the link to the raw ASCII data file to save it to a convenient location on your computer, and then edit the program's INFILE statement to reflect that location. When all is said and done, you'll appreciate even more the lazy learner's way!
Example 20.6
The following program attempts to read the addresses.dat raw data file:
using formatted input:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat';
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
First, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. Then, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Finally, run the SAS program, and review the output. You should see that something has clearly gone awry:
Obs | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | 1167 Maryann White |
2 | 1168 | 1201 Benedict A | 1302 Felicia Ho |
If you take a look at the first record in the data file, you can see how things go downhill from the start. In trying to read the street address of subj 1024, SAS encounters an end-of-record marker before getting to the end of the field at column 45. Therefore, SAS proceeds to do what it has been told to do by default, namely to advance to the next line to look for the data field. That's why it reads "1167 Maryann White" as the street address for subj 1024. Having reached the end of the INPUT statement, SAS then proceeds to the next line and begins to read the data for subj 1168. This time, SAS can't complete reading the columns for name or street before encountering the end-of-record marker. Again, SAS advances to the next line and reads "1201 Benedictine A" as the name for subj 1168. SAS tries to read columns 27-45 as the street address for subj 1168, but again encounters the end-of-record marker before getting to column 45. So, SAS advances to the next line to look for the data field, and in so doing reads "1302 Felicia Ho" as the street address. Having reached the end of the INPUT statement, SAS proceeds to the next line, only to discover that there are no more records to read. The DATA step stops, and we are left with a messed up SAS data set! The following two examples help clean up our mess.
Example 20.7
As you now know, by default, SAS goes to the next data line to read more data if SAS has reached the end of a data line and there are still more variables in the INPUT statement that have not been assigned values. The following program uses the INFILE statement's MISSOVER option to tell SAS not to advance to the next data line to read more values to complete observation, but rather to assign missing values to the remaining variables:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' MISSOVER;
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
If you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that our resulting data set is much improved:
Obj | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | |
2 | 1167 | Maryann White | 756 Washington Ave. |
3 | 1168 | Thomas Jones | |
4 | 1201 | Benedictine Arnold | |
5 | 1302 | Felicia Ho | 110 Jefferson Blvd. |
At least this time, SAS associates the data values that it does read in with the correct variables. It's still not a perfect solution though, is it? In some cases, you might instead want SAS to read in the portions of the data fields that are available, such as 123 Elm St. for the street address for subj 1024, rather than automatically assigning a missing value when an end-of-record marker is encountered. That's where the INFILE statement's PAD option proves useful.
Example 20.8
The following program uses the INFILE statement's PAD option to tell SAS to pad each record in the data file with blanks so that all of the data lines have the same length:
DATA temp;
infile 'C:\yourdrivename\Stat481WC\01importingI\sasndata\addresses.dat' PAD;
input @1 subj 4. name $ 6-23 street $ 27-45;
RUN;
PROC PRINT;
title 'Temp data set';
RUN;
Again, if you haven't already done so, click on the link to the addresses.dat data file in order to save the file to a convenient location on your computer. And, upon launching the SAS program, edit the INFILE statement so it reflects the location in which you saved the file. Then, run the SAS program, and review the output. You should see that we have achieved the most complete and correct data set yet:
Obs | subj | name | street |
---|---|---|---|
1 | 1024 | Alice Smith | 123 Elm St. |
2 | 1167 | Maryann White | 756 Washington Ave. |
3 | 1168 | Thomas Jones | |
4 | 1201 | Benedictine Arnold | 245 Ferguson Ave. |
5 | 1302 | Felicia Ho | 110 Jefferson Blvd. |