# 17.2 - The RETAIN Statement

17.2 - The RETAIN StatementWhen SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:

`RETAIN `*variable1* *variable2* ... *variablen*;

You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:

`RETAIN var1 0 var2 3 a b c 'XYZ'`

the variable *var1* is assigned the value 0; the variable *var2* is assigned the value 3, and the variables *a*, *b*, and *c* are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.

Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:

- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option

Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.

## Example 17.5

Throughout the remainder of the lesson, we will work with the *grades* data set that is created in the following DATA step:

` ````
DATA grades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185
;
RUN;
PROC PRINT data = grades NOOBS;
title 'The grades data set';
RUN;
```

The *grades* data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (*idno*) and last name (*l_name*). The data set contains six different types of grades: exam 1 (*E1*), exam 2 (*E2*), exam 3 (*E3*), exam 4 (*E4*), each worth 100 points; one project (*P1*) worth 100 points; and a final exam (*F1*) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (*E1, E2*, *E3*, *E4*) not including the final exam. Launch and run * * the SAS program so that we can work with the *grades* data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the *grades* data set.

## Example 17.6

One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade of the four semester exams:

` ````
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno;
retain lowgrade lowtype;
if first.idno then lowgrade = grade;
lowgrade = min(lowgrade, grade);
if grade = lowgrade then lowtype = gtype;
if last.idno then output;
drop gtype;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
```

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called *exams* by selecting only the exam grades (E1, E2, E3, and E4) from the data set *grades*.

It's the second DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll try to help you understand the code in three different ways. First, a short written summary ... the DATA step searches through the *exams* data set for each subject ("*by idno*") and looks for the lowest grade ("*min*(*lowgrade*, *grade*)"). Because SAS would otherwise set the variables *lowgrade* and *lowtype* to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("*last.idno*") it outputs the data corresponding to the lowest exam type (*lowtype*) and grade (*lowgrade*) to the *lowest *data set. (Note that the statement "*if last.idno then output*;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the *lowest *data set* * back into the *grades* data set, by *idno* and *gtype*, the variable *lowtype* is renamed back to *gtype*.

Now, let's dive in a bit deeper by investigating how SAS would process the *exams* data set. As you read through what follows, you'll want to refer to both the DATA step code and the *exams* data set (which is the same as the *grades* data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *exams* data set (*idno*, *l_name*, *gtype*, and *grade*), two variables defined within the DATA step (*lowgrade* and *lowtype*), and as a result of the BY statement, a *first.idno* and a *last.idno* variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | . | . | . | . | . |

SAS reads the first observation from the *exams* data set. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1 and *last.idno* is assigned a value of 0. Because *first.idno* equals 1, the *lowgrade* variable is assigned the same value as that of the *grade* variable, that is, 78. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since both values are 78, the value of the *lowgrade* variable remains unchanged. Because *grade* equals *lowgrade* (they are both 78), SAS assigns the *lowtype* variable the same value as that of the *gtype* variable, that is, E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | 78 | E1 | 1 | 0 |

Since *last.idno* does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, *lowgrade* and *lowtype*, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the *exams* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 82 and *lowgrade* equals 78 from the previous iteration, the value of the *lowgrade* variable remains 78. Because *grade* does not equal *lowgrade* now, the value of the *lowtype* variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | 78 | E1 | 0 | 0 |

Again, since *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of *lowgrade* and *lowtype* are retained from the previous iteration. SAS reads the third observation from the *exams* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 86 and *lowgrade* still equals 78 from the previous iteration, the value of the *lowgrade* variable still remains 78. Because *grade* still does not equal *lowgrade* now, the value of the *lowtype* variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | smith | E3 | 86 | 78 | E1 | 0 | 0 |

Again, since *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of *lowgrade* and *lowtype* are retained from the previous iteration. SAS reads the fourth observation from the *exams* data set. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 69 and *lowgrade* still equals 78 from the previous iteration, the value of the *lowgrade* variable is updated to 69. Because *grade* equals *lowgrade* (they are both 69), SAS assigns the *lowtype* variable the same value as that of the *gtype* variable, that is, E4. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 69 | E4 | 0 | 1 |

Now, since *last.idno* equals 1, SAS writes the contents of the program data vector to the *lowest* data set. In doing so, SAS does not write the automatic variables *_N_* and *_ERROR*, nor the *first.idno* and *last.idno* variables, to the data set. As instructed by the code, SAS drops the *gtype* variable and renames the *lowtype* variable to *gtype*. So, here's what the *lowest* data set looks like after processing the first four observations:

idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|

10 | Smith | 69 | 69 | E4 |

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken four observations and collapsed them into one observation.

There's just one more thing you might want to do to help you understand that second DATA step. That is, click on the **Inspect!** button below the code to see an explanation of each line in the DATA step. Then, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the *lowest* data set contains the lowest exam grade for each student. Also, note that the *lowest* data set contains one observation per student rather than six observations per student as in the original *grades* data set.