As mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:
- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option
It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.
Example 17.7 Section
The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (total) and a variable created by the IN = option (lowest):
PROC SORT data=grades;
by idno gtype;
RUN;
DATA final;
merge grades lowest (in=lowest);
by idno gtype;
if lowest then delete;
if first.idno then total = 0;
total + grade;
if last.idno then fnl = (total/600)*100;
format fnl 5.1;
drop lowgrade gtype;
RUN;
PROC PRINT data=final;
title 'Output Dataset: FINAL GRADES';
RUN;
Obs | idno | l_name | grade | total | fnl |
---|---|---|---|---|---|
1 | 10 | Smith | 78 | 78 | . |
2 | 10 | Smith | 82 | 160 | . |
3 | 10 | Smith | 86 | 246 | . |
4 | 10 | Smith | 160 | 406 | . |
5 | 10 | Smith | 97 | 503 | 83.8 |
6 | 11 | Simon | 88 | 88 | . |
7 | 11 | Simon | 86 | 174 | . |
8 | 11 | Simon | 99 | 273 | . |
9 | 11 | Simon | 170 | 443 | . |
10 | 11 | Simon | 100 | 543 | 90.5 |
11 | 12 | Jones | 98 | 98 | . |
12 | 12 | Jones | 92 | 190 | . |
13 | 12 | Jones | 99 | 289 | . |
14 | 12 | Jones | 185 | 474 | . |
15 | 12 | Jones | 99 | 573 | 95.5 |
Because the instructor wants to process the grades and lowest data sets by idno and gtype, the SORT procedure prepares the grades data set for doing so. The lowest data set is alread sorted by idno and gtype, and so no SORT procedure is necessary for it.
It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the grades and lowest data sets back together and calculates each student's final grade. Recall that the grades data set contains multiple observations for each student, while the lowest data set contains only one observation per student. Also, recall that the lowest data set indicates which exam grade (gtype) is lowest for each student. When SAS merges the two data sets by idno and gtype, the resulting merged data set final contains multiple observations for each student. Because the "in = lowest" option was attached to the lowest data set, the variable lowest will take the value 1 for the one observation for each student that contains their lowest exam score. The "if lowest then delete" statement deletes the observation pertaining to each student's lowest exam score from the final data set. That is, the final data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable lowest, and yet its value is automatically retained from one iteration of the data step to the next.
The variable total is used to keep track of each student's total point accumulation. Therefore, for each student (first.idno), the variable total is initially set to 0. As you know, the expression:
total + grade;
is called a SUM statement. It merely adds the grade appearing in the current observation to the student's total accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:
total = total + grade;
Note that there is no RETAIN statement for the variable total, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (fnl) is calculated.
Now, let's dive in a bit deeper by investigating how SAS would process the grades and lowest data sets. As you read through what follows, you'll want to refer to both the DATA step code and the grades data set (you'll have to envision the data set as it is sorted by idno and gtype). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), a variable in the lowest data set (lowgrade), a variable called lowest that is created as a result of the IN= option attached to the lowest data set, two variables defined within the DATA step (total and fnl), and as a result of the BY statement, a first.idno, a last.idno, a first.gtype and a last.gtype variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | . | . | . | . | . | . | . | . |
For the sake of room, note that the _ERROR_ variable has been abbreviated to _Err_. Also, the first.gtype and last.gtype variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E1. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1, and last.idno is assigned a value of 0. Because first.idno equals 1, the total variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | . | 0 | 1 | 0 | 78 | . |
Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set. In so doing, the lowgrade and gtype variables are dropped, as are all of the internal variables (_N_, _ERROR_, lowest, first.idno, last.idno, first.gtype, and last.gtype). Here's what the final data set looks like after the first iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | . | . | . | 0 | . | . | 78 | . |
Note that the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the second observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E2. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 78 and 82, or 160. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | . | 0 | 0 | 0 | 160 | . |
Because we've reached the end of the second iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the second iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | . | . | . | 0 | . | . | 160 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the third observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = E3. Therefore, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 160 and 86, or 246. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | Smith | E3 | 86 | . | 0 | 0 | 0 | 246 | . |
Because we've reached the end of the third iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the third iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | . | . | . | 0 | . | . | 246 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the fourth observation from the sorted grades data set. Now, there is an observation in the lowest data set for which idno = 10 and gtype = E4. Therefore, the lowest data set does contribute to the observation currently being built. The lowgrade variable is read from the lowest data set, getting a value of 69. The lowest variable is assigned a value of 1. Therefore, SAS does not write the observation to the final data set. The program data vector looks like this:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 69 | 1 | 0 | 0 | 246 | . |
and the final data set still looks like this after the fourth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0 | . | . | . | 1 | . | . | 246 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the fifth observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = F1. (Don't forget that the grades data set was sorted by idno and gtype, so the observation for which gtype = F1 is processed before the observation for which gtype = P1.) Again, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 246 and 160, or 406. Because last.idno equals 0, the fnl variable is not calculated. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0 | 10 | Smith | F1 | 160 | . | 0 | 0 | 0 | 406 | . |
Because we've reached the end of the fifth iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the fifth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
10 | Smith | 160 | 406 | . |
Because SAS encounters a new idno and gtype, SAS resets the values in the program data vector to missing:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | . | . | . | 0 | . | . | 406 | . |
Again, the values of the lowest and total variables are automatically retained from the previous iteration. Now, SAS reads the sixth observation from the sorted grades data set. There is no observation in the lowest data set for which idno = 10 and gtype = P1. Again, the lowest data set does not contribute anything to the observation currently being built. So, the lowgrade variable retains its missing value, and the lowest variable is assigned a value of 0. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0. Because first.idno equals 0, the total variable is not re-initialized to 0 but is updated to the sum of 406 and 97, or 503. Because last.idno equals 1, the fnl variable is calculated to equal 503 divided by 600, and multiplied by 100, to get 83.8. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | 10 | Smith | P1 | 97 | . | 0 | 0 | 1 | 503 | 83.8 |
Because we've reached the end of the sixth iteration of the DATA step, SAS writes the contents of the program data vector to the output final data set, dropping appropriate variables along the way. Here's what the final data set looks like after the sixth iteration of the DATA step:
idno | l_name | grade | total | fnl |
---|---|---|---|---|
10 | Smith | 78 | 78 | . |
10 | Smith | 82 | 160 | . |
10 | Smith | 86 | 246 | . |
10 | Smith | 160 | 406 | . |
10 | Smith | 97 | 503 | 83.8 |
Whew!! This is a lot of work! You can take it from here... the process continues as described.
There's just one more thing you might want to do to help you understand the DATA step. That is, click on the Inspect! button below the code to see an explanation of each line in the DATA step. Then, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the final grades are appropriately calculated.