17.3 - Automatic Retention

As mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:

variables read with a SET, MERGE, or UPDATE statement
a variable whose value is assigned in a SUM statement
variables created by the IN = option

It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.

Example 17.7 Section

The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (total) and a variable created by the IN = option (lowest):

PROC SORT data=grades;
    by idno gtype;
RUN;
DATA final;
    merge grades lowest (in=lowest);
    by idno gtype;
    if lowest then delete;  
    if first.idno then total = 0;
    total + grade;
    if last.idno then fnl = (total/600)*100;
    format fnl 5.1;
    drop lowgrade gtype;
RUN;
PROC PRINT data=final;
    title 'Output Dataset: FINAL GRADES';
RUN;

Output Dataset: FINAL GRADES
Obs	idno	l_name	grade	total	fnl
1	10	Smith	78	78	.
2	10	Smith	82	160	.
3	10	Smith	86	246	.
4	10	Smith	160	406	.
5	10	Smith	97	503	83.8
6	11	Simon	88	88	.
7	11	Simon	86	174	.
8	11	Simon	99	273	.
9	11	Simon	170	443	.
10	11	Simon	100	543	90.5
11	12	Jones	98	98	.
12	12	Jones	92	190	.
13	12	Jones	99	289	.
14	12	Jones	185	474	.
15	12	Jones	99	573	95.5

Because the instructor wants to process the grades and lowest data sets by idno and gtype, the SORT procedure prepares the grades data set for doing so. The lowest data set is alread sorted by idno and gtype, and so no SORT procedure is necessary for it.

It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the grades and lowest data sets back together and calculates each student's final grade. Recall that the grades data set contains multiple observations for each student, while the lowest data set contains only one observation per student. Also, recall that the lowest data set indicates which exam grade (gtype) is lowest for each student. When SAS merges the two data sets by idno and gtype, the resulting merged data set final contains multiple observations for each student. Because the "in = lowest" option was attached to the lowest data set, the variable lowest will take the value 1 for the one observation for each student that contains their lowest exam score. The "if lowest then delete" statement deletes the observation pertaining to each student's lowest exam score from the final data set. That is, the final data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable lowest, and yet its value is automatically retained from one iteration of the data step to the next.

The variable total is used to keep track of each student's total point accumulation. Therefore, for each student (first.idno), the variable total is initially set to 0. As you know, the expression:

total + grade;

is called a SUM statement. It merely adds the grade appearing in the current observation to the student's total accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:

total = total + grade;

Note that there is no RETAIN statement for the variable total, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (fnl) is calculated.

Now, let's dive in a bit deeper by investigating how SAS would process the grades and lowest data sets. As you read through what follows, you'll want to refer to both the DATA step code and the grades data set (you'll have to envision the data set as it is sorted by idno and gtype). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), a variable in the lowest data set (lowgrade), a variable called lowest that is created as a result of the IN= option attached to the lowest data set, two variables defined within the DATA step (total and fnl), and as a result of the BY statement, a first.idno, a last.idno, a first.gtype and a last.gtype variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_	_ERR_	idno	l_name	gtype	grade	lowgrade	lowest	first.idno	last.idno	total	fnl
1	0	.			.	.	.	.	.	.	.

For the sake of room, note that the _ERROR_ variable has been abbreviated to _Err_. Also, the first.gtype and last.gtype variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E1. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1, and last.idno is assigned a value of 0. Because first.idno equals 1, the total variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:

_N_	_ERR_	idno	l_name	gtype	grade	lowgrade	lowest	first.idno	last.idno	total	fnl
1	0	10	Smith	E1	78	.	0	1	0	78	.

Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set. In so doing, the lowgrade and gtype variables are dropped, as are all of the internal variables (_N_, _ERROR_, lowest, first.idno, last.idno, first.gtype, and last.gtype). Here's what the final data set looks like after the first iteration of the DATA step:

idno	l_name	grade	total	fnl
10	Smith	78	78	.