21.5 - Reading a Varying Number of Repeating Fields

21.5 - Reading a Varying Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:

1001 179 172 169    
1002 250 249      
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202  

As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.

In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:

id wt1 wt2 wt3 wt4 wt5
1001 179 172 169 . .
1002 250 249 . . .
1003 190 196 195 164 158
1004 232 224 219 212 208
1005 211 208 204 202 .

In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:

id weighin weight
1001 1 179
1001 2 172
1001 3 169
1002 1 250
1002 2 249
1003 1 190
1003 2 196
1003 3 195
1003 4 164
1003 5 158
1003 3 195
1004 1 232
1004 2 224
1004 3 219
1004 4 212
1004 5 208
1005 1 211
1005 2 208
1005 3 204
1005 4 202

in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.

Because the input data file contains a varying number of repeated fields in each record, our methods of the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as have to replace the iterative DO loop with a DO WHILE loop. Let's take a look!

Example 21.9

The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA dietdata;
    infile DATALINES MISSOVER;
    input id weight @;
    weighin = 0;
    DO WHILE (weight ne .);
        weighin+1;
        output;
        input weight @;
    END;
    DATALINES;
    1001 179 172 169
    1002 250 249
    1003 190 196 195 164 158
    1004 232 224 219 212 208
    1005 211 208 204 202
    ;
RUN;

PROC PRINT data = dietdata NOOBS;
    title 'The dietdata data set';
    var id weighin weight;
RUN;

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.

Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

↓---+----10---+----20---+
1001 179 172 169

and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:

_N_ _ERROR_ id weighin weight
1 0 1001 0 179

Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 1 179

The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 1 172

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 2 172

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 2 169

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_ _ERROR_ id weighin weight
1 0 1001 3 169

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:

_N_ _ERROR_ id weighin weight
1 0 1001 3 .

Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.

During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20---+
1002 250 249

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_ _ERROR_ id weighin weight
2 0 . . .

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the dietdata data set as described.


Legend
[1]Link
Has Tooltip/Popover
 Toggleable Visibility