Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:
Monthly Weight of Subjects | |||||
---|---|---|---|---|---|
1001 | 179 | 172 | 169 | ||
1002 | 250 | 249 | |||
1003 | 190 | 196 | 195 | 164 | 158 |
1004 | 232 | 224 | 219 | 212 | 208 |
1005 | 211 | 208 | 204 | 202 |
As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.
In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:
id | wt1 | wt2 | wt3 | wt4 | wt5 |
---|---|---|---|---|---|
1001 | 179 | 172 | 169 | . | . |
1002 | 250 | 249 | . | . | . |
1003 | 190 | 196 | 195 | 164 | 158 |
1004 | 232 | 224 | 219 | 212 | 208 |
1005 | 211 | 208 | 204 | 202 | . |
In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:
id | weigh in | weight |
---|---|---|
1001 | 1 | 179 |
1001 | 2 | 172 |
1001 | 3 | 169 |
1002 | 1 | 250 |
1002 | 2 | 249 |
1003 | 1 | 190 |
1003 | 2 | 196 |
1003 | 3 | 195 |
1003 | 4 | 164 |
1003 | 5 | 158 |
1003 | 3 | 195 |
1004 | 1 | 232 |
1004 | 2 | 224 |
1004 | 3 | 219 |
1004 | 4 | 212 |
1004 | 5 | 208 |
1005 | 1 | 211 |
1005 | 2 | 208 |
1005 | 3 | 204 |
1005 | 4 | 202 |
in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.
Because the input data file contains a varying number of repeated fields in each record, our methods for the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as replace the iterative DO loop with a DO WHILE loop. Let's take a look!
Example 21.9 Section
The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA dietdata;
infile DATALINES MISSOVER;
input id weight @;
weighin = 0;
DO WHILE (weight ne .);
weighin+1;
output;
input weight @;
END;
DATALINES;
1001 179 172 169
1002 250 249
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202
;
RUN;
PROC PRINT data = dietdata NOOBS;
title 'The dietdata data set';
var id weighin weight;
RUN;
id | weighin | weight |
---|---|---|
1001 | 1 | 179 |
1001 | 2 | 172 |
1001 | 3 | 169 |
1002 | 1 | 250 |
1002 | 2 | 249 |
1003 | 1 | 190 |
1003 | 2 | 196 |
1003 | 3 | 195 |
1003 | 4 | 164 |
1003 | 5 | 158 |
1004 | 1 | 232 |
1004 | 2 | 224 |
1004 | 3 | 219 |
1004 | 4 | 212 |
1004 | 5 | 208 |
1005 | 1 | 211 |
1005 | 2 | 208 |
1005 | 3 | 204 |
1005 | 4 | 202 |
Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.
Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.
During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:
↓---+----10---+----20---+ |
---|
1001 179 172 169 |
and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 0 | 179 |
Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1 so that the program data vector looks like this:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 1 | 179 |
The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 1 | 172 |
The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 2 | 172 |
The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 2 | 169 |
The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 3 | 169 |
The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
1 | 0 | 1001 | 3 | . |
Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.
During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:
↓---+----10---+----20---+ |
---|
1002 250 249 |
The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:
_N_ | _ERROR_ | id | weighin | weight |
---|---|---|---|---|
2 | 0 | . | . | . |
Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the dietdata data set as described.