21.5 - Reading a Varying Number of Repeating Fields

Let's suppose we want to read a raw data file that, in each record, contains a four-digit subject ID number followed by monthly weights (in pounds) of the subjects:

Monthly Weight of Subjects
1001	179	172	169
1002	250	249
1003	190	196	195	164	158
1004	232	224	219	212	208
1005	211	208	204	202

As you can see, because some of the subjects dropped out of the diet program in which they were participating, the data file does not contain an equal number of weights in each record.

In some situations, it might work just fine to read in this data file using five different variables for the weights — wt1, wt2, wt3, wt4, and wt5, say — so that the resulting SAS data set looks like this:

id	wt1	wt2	wt3	wt4	wt5
1001	179	172	169	.	.
1002	250	249	.	.	.
1003	190	196	195	164	158
1004	232	224	219	212	208
1005	211	208	204	202	.

In other cases, though, it makes more sense to read all of the weights into one weight variable — if you wanted to calculate an overall average across all subjects and all weigh-ins, say. In that case, we'd want our data set to have this structure instead:

id	weigh in	weight
1001	1	179
1001	2	172
1001	3	169
1002	1	250
1002	2	249
1003	1	190
1003	2	196
1003	3	195
1003	4	164
1003	5	158
1003	3	195
1004	1	232
1004	2	224
1004	3	219
1004	4	212
1004	5	208
1005	1	211
1005	2	208
1005	3	204
1005	4	202

in which each record in the input data file produces as many observations as it has weights, with each observation containing three variables — id, weighin, and weight.

Because the input data file contains a varying number of repeated fields in each record, our methods for the last section are going to have to be modified. We can still take advantage of the trailing @ to tell SAS to hold the records in the input buffer. We'll also have to take advantage, however, of the INFILE statement's MISSOVER option, as well as replace the iterative DO loop with a DO WHILE loop. Let's take a look!

Example 21.9 Section

The following program uses trailing at signs (@) in conjunction with the MISSOVER option and a conditional DO WHILE loop to read in a raw data file containing a varying number of weights in each record:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA dietdata;
    infile DATALINES MISSOVER;
    input id weight @;
    weighin = 0;
    DO WHILE (weight ne .);
        weighin+1;
        output;
        input weight @;
    END;
    DATALINES;
    1001 179 172 169
    1002 250 249
    1003 190 196 195 164 158
    1004 232 224 219 212 208
    1005 211 208 204 202
    ;
RUN;
 
PROC PRINT data = dietdata NOOBS;
    title 'The dietdata data set';
    var id weighin weight;
RUN;

The dietdata data set
id	weighin	weight
1001	1	179
1001	2	172
1001	3	169
1002	1	250
1002	2	249
1003	1	190
1003	2	196
1003	3	195
1003	4	164
1003	5	158
1004	1	232
1004	2	224
1004	3	219
1004	4	212
1004	5	208
1005	1	211
1005	2	208
1005	3	204
1005	4	202

Let's see how this program works! During the compile phase, SAS creates a program data vector containing the automatic variables — _N_ and _ERROR_ — and the three variables mentioned in the DATA step — id, weighin, and weight. Then, SAS moves on to the execution phase.

Even though we are reading the data instream, we've included an INFILE statement this time just so we can take advantage of the statement's MISSOVER option. Like the previous example with the same number of repeating fields, we must use trailing at signs (@) to tell SAS to read the same record more than once. But, here, we also have to tell SAS not to move the input pointer to the next record when there are missing weights. That's why the MISSOVER option.

During the first iteration, the first INPUT statement tells SAS to place the first record in the input data file into the input buffer:

↓---+----10---+----20---+
1001 179 172 169

and to read the values for the ID and weight variables into the program data vector. The trailing @ tells SAS to hold the record in the input buffer. The weighin variable is initialized to 0, so that the program data vector looks like this now:

_N_	_ERROR_	id	weighin	weight
1	0	1001	0	179

Then, SAS begins execution of the DO WHILE loop. The current value for weight is not missing — it's 179 — so SAS steps into the DO WHILE loop. The value of weighin is increased to 1 so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	1	179

The OUTPUT statement tells SAS to write the contents of the program data vector to create the first observation of the dietdata data set. Then, the INPUT statement reads the repeating field (172) and assigns the value to the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	1	172

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 172. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	2	172

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement reads the repeating field (169) and assigns the value to the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	2	169

The trailing @ tells SAS to continue to hold the record in the input buffer. Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is again not missing — it's 169. So, SAS steps into the DO WHILE loop again. The value of weighin is increased to 1, so that the program data vector looks like this:

_N_	_ERROR_	id	weighin	weight
1	0	1001	3	169

The OUTPUT statement tells SAS to write the contents of the program data vector to create the second observation of the dietdata data set. Then, the INPUT statement attempts to read another weight value and determines there to be none. The MISSOVER option prevents the input pointer from moving to the next record in search of another value for weight. SAS assigns a missing value to the weight variable in the program data vector:

_N_	_ERROR_	id	weighin	weight
1	0	1001	3	.

Having reached the DO WHILE loop's END statement, SAS returns to the top of the DO loop and detemines that weight is now missing, and so the statements in the loop are not executed. Having reached the bottom of the DATA step, SAS returns to the top of the DATA step to begin processing the next iteration. In so doing, the first record that was being held in the input buffer is released.

During the second iteration of the DATA step, the first INPUT statement tells SAS to place the second record in the input data file into the input buffer:

↓---+----10---+----20---+
1002 250 249

The automatic variable _N_ is increased to 2, _ERROR_ is set to 0, and the other variables in the program data vector are reset to missing:

_N_	_ERROR_	id	weighin	weight
2	0	.	.	.

Okay, let's get off of this merry-go-round! SAS proceeds through the program and the input data file just as described above until SAS reaches the last field in the last record. Let's have you now launch and run the SAS program, and review the output to convince yourself that SAS did indeed create the dietdata data set as described.