# 17.3 - Automatic Retention

17.3 - Automatic RetentionAs mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:

- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option

It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.

## Example 17.7

The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (*total*) and a variable created by the IN = option (*lowest*):

```
PROC SORT data=grades;
by idno gtype;
RUN;
DATA final;
merge grades lowest (in=lowest);
by idno gtype;
if lowest then delete;
if first.idno then total = 0;
total + grade;
if last.idno then fnl = (total/600)*100;
format fnl 5.1;
drop lowgrade gtype;
RUN;
PROC PRINT data=final;
title 'Output Dataset: FINAL GRADES';
RUN;
```

Obs | idno | l_name | grade | total | fnl |
---|---|---|---|---|---|

1 | 10 | Smith | 78 | 78 | . |

2 | 10 | Smith | 82 | 160 | . |

3 | 10 | Smith | 86 | 246 | . |

4 | 10 | Smith | 160 | 406 | . |

5 | 10 | Smith | 97 | 503 | 83.8 |

6 | 11 | Simon | 88 | 88 | . |

7 | 11 | Simon | 86 | 174 | . |

8 | 11 | Simon | 99 | 273 | . |

9 | 11 | Simon | 170 | 443 | . |

10 | 11 | Simon | 100 | 543 | 90.5 |

11 | 12 | Jones | 98 | 98 | . |

12 | 12 | Jones | 92 | 190 | . |

13 | 12 | Jones | 99 | 289 | . |

14 | 12 | Jones | 185 | 474 | . |

15 | 12 | Jones | 99 | 573 | 95.5 |

Because the instructor wants to process the *grades* and *lowest *data sets by *idno* and *gtype*, the SORT procedure prepares the *grades* data set for doing so. The *lowest* data set is alread sorted by *idno* and *gtype*, and so no SORT procedure is necessary for it.

It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the *grades* and *lowest* data sets back together and calculates each student's final grade. Recall that the *grades *data set contains multiple observations for each student, while the *lowest* data set contains only one observation per student. Also, recall that the *lowest *data set indicates which exam grade (*gtype*) is lowest for each student. When SAS merges the two data sets by *idno* and *gtype*, the resulting merged data set *final* contains multiple observations for each student. Because the "*in* = *lowest*" option was attached to the *lowest* data set, the variable *lowest *will take the value 1 for the one observation for each student that contains their *lowest *exam score. The "*if lowest then delete*" statement deletes the observation pertaining to each student's lowest exam score from the *final* data set. That is, the *final* data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable *lowest*, and yet its value is automatically retained from one iteration of the data step to the next.

The variable *total* is used to keep track of each student's total point accumulation. Therefore, for each student (*first.idno*), the variable *total* is initially set to 0. As you know, the expression:

`total + grade;`

is called a SUM statement. It merely adds the *grade* appearing in the current observation to the student's *total* accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:

`total = total + grade;`

Note that there is no RETAIN statement for the variable *total*, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (*fnl*) is calculated.

Now, let's dive in a bit deeper by investigating how SAS would process the *grades and lowest *data sets. As you read through what follows, you'll want to refer to both the DATA step code and the *grades* data set (you'll have to envision the data set as it is sorted by *idno* and *gtype*). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *grades* data set (*idno*, *l_name*, *gtype*, and *grade*), a variable in the *lowest* data set (*lowgrade*), a variable called *lowest* that is created as a result of the IN= option attached to the *lowest* data set, two variables defined within the DATA step (*total* and *fnl*), and as a result of the BY statement, a *first.idno*, a *last.idno*, a *first.gtype* and a *last.gtype* variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | . | . | . | . | . | . | . | . |

For the sake of room, note that the *_ERROR_* variable has been abbreviated to *_Err_*. Also, the *first.gtype* and *last.gtype* variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E1. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1, and *last.idno* is assigned a value of 0. Because *first.idno* equals 1, the *total* variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | . | 0 | 1 | 0 | 78 | . |

Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set. In so doing, the *lowgrade* and *gtype* variables are dropped, as are all of the internal variables (*_N_*, *_ERROR_*, *lowest*, *first.idno*, *last.idno*, *first.gtype*, and *last.gtype*). Here's what the *final *data set looks like after the first iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | . | . | . | 0 | . | . | 78 | . |

Note that the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the second observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E2. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 78 and 82, or 160. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | . | 0 | 0 | 0 | 160 | . |

Because we've reached the end of the second iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the second iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | . | . | . | 0 | . | . | 160 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the third observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E3. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 160 and 86, or 246. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | Smith | E3 | 86 | . | 0 | 0 | 0 | 246 | . |

Because we've reached the end of the third iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the third iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | . | . | . | 0 | . | . | 246 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the fourth observation from the sorted *grades* data set. Now, there is an observation in the *lowest* data set for which *idno* = 10 and *gtype* = E4. Therefore, the *lowest* data set does contribute to the observation currently being built. The *lowgrade* variable is read from the *lowest* data set, getting a value of 69. The *lowest* variable is assigned a value of 1. Therefore, SAS does not write the observation to the *final *data set. The program data vector looks like this:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 69 | 1 | 0 | 0 | 246 | . |

and the *final* data set still looks like this after the fourth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

*idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

5 | 0 | . | . | . | 1 | . | . | 246 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the fifth observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = F1. (Don't forget that the *grades* data set was sorted by *idno* and *gtype*, so the observation for which *gtype* = F1 is processed before the observation for which *gtype* = P1.) Again, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 246 and 160, or 406. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

5 | 0 | 10 | Smith | F1 | 160 | . | 0 | 0 | 0 | 406 | . |

Because we've reached the end of the fifth iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the fifth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

10 | Smith | 160 | 406 | . |

*idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | . | . | . | 0 | . | . | 406 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the sixth observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = P1. Again, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 406 and 97, or 503. Because *last.idno* equals 1, the *fnl* variable is calculated to equal 503 divided by 600, and multiplied by 100, to get 83.8. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | 10 | Smith | P1 | 97 | . | 0 | 0 | 1 | 503 | 83.8 |

Because we've reached the end of the sixth iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the sixth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

10 | Smith | 160 | 406 | . |

10 | Smith | 97 | 503 | 83.8 |

Whew!! This is a lot of work! You can take it from here... the process continues as described.

There's just one more thing you might want to do to help you understand the DATA step. That is, click on the **Inspect!** button below the code to see an explanation of each line in the DATA step. Then, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the final grades are appropriately calculated.