When elements of an array are constants needed only for the duration of the DATA step, you can omit the variables associated with an array group and instead use temporary array elements. Although they behave like variables, temporary array elements:
- do not appear in the resulting data set;
- do not have names and can be referenced only by their array names and dimensions; and
- are automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step.
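The retained behavior in the last point can be seen in a small sketch. This is an illustration only, not part of the Quality of Life examples: the array name running and the variables x and total are hypothetical.

```sas
/* Sketch: a temporary array element keeps its value across iterations. */
/* The names running, x, and total are hypothetical.                    */
DATA _NULL_;
  input x;
  array running (1) _TEMPORARY_ (0);  /* initialized once, to 0 */
  running(1) = running(1) + x;        /* accumulates because the element is retained */
  total = running(1);                 /* copy to an ordinary variable for printing */
  put x= total=;
DATALINES;
5
10
20
;
RUN;
```

Because running(1) is not reset to missing between iterations, the log should show total equal to 5, then 15, then 35.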
In this section, we'll look at three examples that involve checking a subset of Quality of Life data for errors. In the first example, we'll check whether the data recorded in ten variables (qul3a, qul3b, ..., and qul3j) are within an expected range without using an array. In the second example, we'll perform the same check using an array whose elements are three new variables: error1, error2, and error3. In the third and final example, we'll perform the same check using an array containing only temporary elements.
Example 19.9
The following program first reads a subset of Quality of Life data (variables qul3a, qul3b, ..., and qul3j) into a SAS data set called qul. Then, the program checks that the value of each variable has been recorded as 1, 2, or 3, as would be expected from the data form. If a value for one of the variables does not equal 1, 2, or 3, then that observation is output to a data set called errors. Otherwise, the observation is output to the qul data set. Because the error checking takes place without using arrays, the program contains a series of ten if/then statements, one for each of the ten variables:
DATA qul errors;
input subj qul3a qul3b qul3c qul3d qul3e
qul3f qul3g qul3h qul3i qul3j;
flag = 0;
if qul3a not in (1, 2, 3) then flag = 1;
if qul3b not in (1, 2, 3) then flag = 1;
if qul3c not in (1, 2, 3) then flag = 1;
if qul3d not in (1, 2, 3) then flag = 1;
if qul3e not in (1, 2, 3) then flag = 1;
if qul3f not in (1, 2, 3) then flag = 1;
if qul3g not in (1, 2, 3) then flag = 1;
if qul3h not in (1, 2, 3) then flag = 1;
if qul3i not in (1, 2, 3) then flag = 1;
if qul3j not in (1, 2, 3) then flag = 1;
if flag = 1 then output errors;
else output qul;
drop flag;
DATALINES;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
211011 1 2 3 2 1 2 3 2 1 3
310017 1 2 3 3 3 3 3 2 2 1
411020 4 3 3 3 3 2 2 2 2 2
510001 1 1 1 1 1 1 2 1 2 2
;
RUN;
PROC PRINT data = qul;
TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
TITLE 'Observations in Qul data set with errors';
RUN;
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 |
2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 |
3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 |
4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 |
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 |
2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
The INPUT statement first reads an observation of data containing one subject's quality of life data. An observation is assumed to be error-free (flag is initially set to 0) until it is found to be in error (flag is set to 1 if any of the ten values are out of range). If an observation is deemed to contain an error (flag = 1) after looking at each of the ten values, it is output to the errors data set. Otherwise (flag = 0), it is output to the qul data set.
First, note that two of the observations in the input data set contain data recording errors. The qul3c value for subject 210012 was recorded as 4, as was the qul3a value for subject 411020. Then, launch and run the SAS program. Review the output to convince yourself that qul contains the four observations with clean data, and errors contains the two observations with bad data.
You should also appreciate that this is a classic situation that cries out for using arrays. If you aren't yet convinced, imagine how long the above program would be if you had to write similar if/then statements to check for errors in, say, a hundred such variables.
Example 19.10
The following program performs the same error checking as the previous program, except here the error checking is accomplished using two arrays, bounds and quldata:
DATA qul errors;
input subj qul3a qul3b qul3c qul3d qul3e
qul3f qul3g qul3h qul3i qul3j;
array bounds (3) error1 - error3 (1 2 3);
array quldata (10) qul3a -- qul3j;
flag = 0;
do i = 1 to 10;
if quldata(i) ne bounds(1) and
quldata(i) ne bounds(2) and
quldata(i) ne bounds(3)
then flag = 1;
end;
if flag = 1 then output errors;
else output qul;
drop i flag;
DATALINES;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
211011 1 2 3 2 1 2 3 2 1 3
310017 1 2 3 3 3 3 3 2 2 1
411020 4 3 3 3 3 2 2 2 2 2
510001 1 1 1 1 1 1 2 1 2 2
;
RUN;
PROC PRINT data = qul;
TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
TITLE 'Observations in Qul data set with errors';
RUN;
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j | error1 | error2 | error3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 | 1 | 2 | 3 |
2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 | 1 | 2 | 3 |
3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 | 1 | 2 | 3 |
4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 3 |
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j | error1 | error2 | error3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 | 1 | 2 | 3 |
2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 3 |
If you compare this program to the previous program, you'll see that the only differences here are the presence of two ARRAY definition statements and the IF/THEN statement within the iterative DO loop that does the error checking.
The first ARRAY statement uses a numbered range list to define an array called bounds that contains three new variables: error1, error2, and error3. The "(1 2 3)" that appears after the variable list error1-error3 tells SAS to set, or initialize, the elements of the array to equal 1, 2, and 3. In general, you initialize an array in this manner: list as many values as there are elements in the array, separating the values with spaces. If you intend for your array to contain character constants, you must enclose each value in quotation marks. For example, the following ARRAY statement tells SAS to define a character array (hence the dollar sign $) called weekdays:
ARRAY weekdays(5) $ ('M' 'T' 'W' 'R' 'F');
and to initialize the elements of the array as M, T, W, R, and F.
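To see the initialization at work, the weekdays array can be placed in a short DATA step that writes each element to the log. This is a minimal sketch; the index variable i and the variable day are hypothetical names introduced only for the illustration.

```sas
/* Sketch: define and initialize a character array, then list its elements. */
/* The variables i and day are hypothetical helper names.                   */
DATA _NULL_;
  ARRAY weekdays(5) $ ('M' 'T' 'W' 'R' 'F');
  do i = 1 to 5;
    day = weekdays(i);   /* copy the element into an ordinary variable */
    put i= day=;         /* write the position and value to the log */
  end;
RUN;
```

Note that because this ARRAY statement names no variables and does not use _TEMPORARY_, SAS creates the variables weekdays1 through weekdays5 to hold the five values.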
The second ARRAY statement uses a name range list to define an array called quldata that contains the ten quality of life variables. The IF/THEN statement uses slightly different logic than the previous program to tell SAS to compare the elements of the quldata array to the elements of the bounds array to determine whether any of the values are out of range.
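The comparison described above names each of the three bounds explicitly. One possible variation, shown here only as a sketch, loops over the bounds array as well, using the DIM function to obtain each array's size. The variables j and valid are hypothetical additions, and only the first two data lines are used to keep the listing short.

```sas
/* Sketch: the same range check written as a nested loop.      */
/* The variables j and valid are hypothetical helper names.    */
DATA qul errors;
  input subj qul3a qul3b qul3c qul3d qul3e
        qul3f qul3g qul3h qul3i qul3j;
  array bounds (3) error1 - error3 (1 2 3);
  array quldata (10) qul3a -- qul3j;
  flag = 0;
  do i = 1 to dim(quldata);
    valid = 0;                          /* assume out of range until matched */
    do j = 1 to dim(bounds);
      if quldata(i) = bounds(j) then valid = 1;
    end;
    if valid = 0 then flag = 1;         /* no bound matched this value */
  end;
  if flag = 1 then output errors;
  else output qul;
  drop i j valid flag;
DATALINES;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
;
RUN;
```

With these two data lines, subject 110011 should be written to qul and subject 210012 (whose qul3c value is 4) to errors. The advantage of this form is that adding a fourth valid value would require changing only the bounds array definition.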
Now, launch and run the SAS program. Review the output to convince yourself that, just as before, qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the three new variables error1, error2, and error3 appear in both output data sets.
Example 19.11
The valid values 1, 2, and 3 are needed only temporarily in the previous program. Therefore, we alternatively could have used temporary array elements in defining the bounds array. The following program does just that. It is identical to the previous program except here the bounds array is defined using temporary array elements rather than using three new variables error1, error2, and error3:
DATA qul errors;
input subj qul3a qul3b qul3c qul3d qul3e
qul3f qul3g qul3h qul3i qul3j;
array bounds (3) _TEMPORARY_ (1 2 3);
array quldata (10) qul3a -- qul3j;
flag = 0;
do i = 1 to 10;
if quldata(i) ne bounds(1) and
quldata(i) ne bounds(2) and
quldata(i) ne bounds(3)
then flag = 1;
end;
if flag = 1 then output errors;
else output qul;
drop i flag;
DATALINES;
110011 1 2 3 3 3 3 2 1 1 3
210012 2 3 4 1 2 2 3 3 1 1
211011 1 2 3 2 1 2 3 2 1 3
310017 1 2 3 3 3 3 3 2 2 1
411020 4 3 3 3 3 2 2 2 2 2
510001 1 1 1 1 1 1 2 1 2 2
;
RUN;
PROC PRINT data = qul;
TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
TITLE 'Observations in Qul data set with errors';
RUN;
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 110011 | 1 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | 1 | 3 |
2 | 211011 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 3 |
3 | 310017 | 1 | 2 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 1 |
4 | 510001 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 |
Obs | subj | qul3a | qul3b | qul3c | qul3d | qul3e | qul3f | qul3g | qul3h | qul3i | qul3j |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 210012 | 2 | 3 | 4 | 1 | 2 | 2 | 3 | 3 | 1 | 1 |
2 | 411020 | 4 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 |
If you compare this program to the previous program, you'll see that the only difference here is the presence of the _TEMPORARY_ argument in the definition of the bounds array. The bounds array is again initialized to the three valid values "(1 2 3)".
Launch and run the SAS program. Review the output to convince yourself that, just as before, qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the temporary array elements do not appear in either data set.