21.5 - Reading a Varying Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:

Monthly Weight of Subjects

1001179172169  
1002250249   
1003190196195164158
1004232224219212208
1005211208204202 

As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.

In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:

idwt1wt2wt3wt4wt5
1001179172169..
1002250249...
1003190196195164158
1004232224219212208
1005211208204202.

In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:

idweigh inweight
10011179
10012172
10013169
10021250
10022249
10031190
10032196
10033195
10034164
10035158
10033195
10041232
10042224
10043219
10044212
10045208
10051211
10052208
10053204
10054202

in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.

Because the input data file contains a varying number of repeated fields in each record, our methods for the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as replace the iterative DO loop with a DO WHILE loop. Let's take a look!

Example 21.9 Section

The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA dietdata;
    infile DATALINES MISSOVER;
    input id weight @;
    weighin = 0;
    DO WHILE (weight ne .);
        weighin+1;
        output;
        input weight @;
    END;
    DATALINES;
    1001 179 172 169
    1002 250 249
    1003 190 196 195 164 158
    1004 232 224 219 212 208
    1005 211 208 204 202
    ;
RUN;
 
PROC PRINT data = dietdata NOOBS;
    title 'The dietdata data set';
    var id weighin weight;
RUN;

The dietdata data set
idweighinweight
10011179
10012172
10013169
10021250
10022249
10031190
10032196
10033195
10034164
10035158
10041232
10042224
10043219
10044212
10045208
10051211
10052208
10053204
10054202

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.

Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

↓---+----10---+----20---+
1001 179 172 169

and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:

_N__ERROR_idweighinweight
1010010179

Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1 so that the program data vector looks like this:

_N__ERROR_idweighinweight
1010011179

The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:

_N__ERROR_idweighinweight
1010011172

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N__ERROR_idweighinweight
1010012172

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:

_N__ERROR_idweighinweight
1010012169

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N__ERROR_idweighinweight
1010013169

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:

_N__ERROR_idweighinweight
1010013.

Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.

During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20---+
1002 250 249

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N__ERROR_idweighinweight
20...

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run  the SAS program, and review the output to convince yourself that SAS did indeed create the dietdata data set as described.