# Lesson 17: Using the OUTPUT and RETAIN statements

Lesson 17: Using the OUTPUT and RETAIN statements## Overview

When processing any DATA step, SAS follows two default procedures:

- When SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. (SAS does not reset variables to missing if they were created by a SUM statement, or if the values came from a SAS data set via a SET or MERGE statement.)
- At the end of the DATA step after completing an iteration of the DATA step, SAS outputs the values of the variables in the program data vector to the SAS data set being created.

In this lesson, we'll learn how to modify these default processes by using the OUTPUT and RETAIN statements:

- The
**OUTPUT**statement allows you to control when and to which data set you want an observation written. - The
**RETAIN**statement causes a variable created in the DATA step to retain its value from the current observation into the next observation rather than being set to missing at the beginning of each iteration of the DATA step.

## Objectives

Upon completing this lesson, you should be able to do the following:

- write a RETAIN statement to tell SAS to retain the value of a variable from one iteration of the data step to the next
- state which kind of variables SAS automatically retains
- write a RETAIN statement to compare values across observations
- program successfully with the RETAIN statement
- write the "FIRST." and "LAST." variables in conjunction with an OUTPUT statement in order to collapse multiple observations in a data set into a single observation
- write a SUM statement to accumulate totals across a set of observations
- write a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement in order to transpose a data set
- write an OUTPUT statement to tell SAS to output the current observation when the output statement is processed
- write an OUTPUT statement to write observations to multiple data sets
- write an OUTPUT statement to control the output of observations to data sets based on certain conditions
- recall that if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step
- recall that assignment statements must precede OUTPUT statements
- write the
*today*( ) function to determine today's date

# 17.1 - The OUTPUT Statement

17.1 - The OUTPUT StatementAn OUTPUT statement overrides the default process by telling SAS to output the current observation when the OUTPUT statement is processed — not at the end of the DATA step. The OUTPUT statement takes the form:

` OUTPUT dataset1 dataset2 ... datasetn;`

where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.

The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:

- to write observations to multiple data sets
- to control the output of observations to data sets based on certain conditions
- to transpose datasets using the OUTPUT statement in conjunction with the RETAIN statement, BY group processing, and the LAST.variable statement.

Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. We'll work with the following subset of the ICDB Study's log data set:

## The icblog data set

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

210006 | 12 | 05/06/94 | cmed |

210006 | 12 | 05/06/94 | diet |

210006 | 12 | 05/06/94 | med |

210006 | 12 | 05/06/94 | phytrt |

210006 | 12 | 05/06/94 | purg |

210006 | 12 | 05/06/94 | qul |

210006 | 12 | 05/06/94 | sympts |

210006 | 12 | 05/06/94 | urn |

210006 | 12 | 05/06/94 | void |

310032 | 24 | 09/19/95 | backf |

310032 | 24 | 09/19/95 | cmed |

310032 | 24 | 09/19/95 | diet |

310032 | 24 | 09/19/95 | med |

310032 | 24 | 09/19/95 | medhxf |

310032 | 24 | 09/19/95 | phs |

310032 | 24 | 09/19/95 | phytrt |

310032 | 24 | 09/19/95 | preg |

310032 | 24 | 09/19/95 | purg |

310032 | 24 | 09/19/95 | qul |

310032 | 24 | 09/19/95 | sympts |

310032 | 24 | 09/19/95 | urn |

310032 | 24 | 09/19/95 | void |

410010 | 6 | 05/12/94 | cmed |

410010 | 6 | 05/12/94 | diet |

410010 | 6 | 05/12/94 | med |

410010 | 6 | 05/12/94 | phytrt |

410010 | 6 | 05/12/94 | purg |

410010 | 6 | 05/12/94 | qul |

410010 | 6 | 05/12/94 | sympts |

410010 | 6 | 05/12/94 | urn |

410010 | 6 | 05/12/94 | void |

As you can see, this log data set contains four variables:

`subj`: the subject's identification number`v_type`: the type of clinic visit, which means the number of months since the subject was first seen in the clinic`v_date`: the date of the clinic visit`form`: codes that indicate the data forms that were completed during the subject's clinic visit

The log data set is a rather typical data set that arises from large national clinical studies in which there are a number of sites around the country where data are collected. Typically, the clinical sites collect the data on data forms and then "ship" the data forms either electronically or by mail to a centralized location called a Data Coordinating Center (DCC). As you can well imagine, keeping track of the data forms at the DCC is a monumental task. For the ICDB Study, for example, the DCC received more than 68,000 data forms over the course of the study.

In order to keep track of the data forms that arrive at the DCC, they are "logged" into a database and subsequently tracked as they are processed at the DCC. In reality, a log database will contain many more variables than we have in our subset, such as the dates the data on the forms were entered into the database, who entered the data, the dates the entered data was verified, who verified the data, and so on. To keep our lives simple, we'll just use the four variables described above.

## Example 17.1

This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — *s210006*, *s310032*, and *s410010* — based on whether the subject identification numbers in the *icdblog* data set meet a certain condition:

```
OPTIONS PS=58 LS=80 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\05retain\sasndata';
DATA s210006 s310032 s410010;
set stat481.icdblog;
if (subj = 210006) then output s210006;
else if (subj = 310032) then output s310032;
else if (subj = 410010) then output s410010;
RUN;
PROC PRINT data = s210006 NOOBS;
title 'The s210006 data set';
RUN;
PROC PRINT data = s310032 NOOBS;
title 'The s310032 data set';
RUN;
PROC PRINT NOOBS;
title 'The s410010 data set';
RUN;
```

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

210006 | 12 | 05/06/94 | cmed |

210006 | 12 | 05/06/94 | diet |

210006 | 12 | 05/06/94 | med |

210006 | 12 | 05/06/94 | phytrt |

210006 | 12 | 05/06/94 | purg |

210006 | 12 | 05/06/94 | qul |

210006 | 12 | 05/06/94 | sympts |

210006 | 12 | 05/06/94 | urn |

210006 | 12 | 05/06/94 | void |

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

310032 | 24 | 09/19/95 | backf |

310032 | 24 | 09/19/95 | cmed |

310032 | 24 | 09/19/95 | diet |

310032 | 24 | 09/19/95 | med |

310032 | 24 | 09/19/95 | medhxf |

310032 | 24 | 09/19/95 | phs |

310032 | 24 | 09/19/95 | phytrt |

310032 | 24 | 09/19/95 | preg |

310032 | 24 | 09/19/95 | purg |

310032 | 24 | 09/19/95 | qul |

310032 | 24 | 09/19/95 | sympts |

310032 | 24 | 09/19/95 | urn |

310032 | 24 | 09/19/95 | void |

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

410010 | 6 | 05/12/94 | cmed |

410010 | 6 | 05/12/94 | diet |

410010 | 6 | 05/12/94 | med |

410010 | 6 | 05/12/94 | phytrt |

410010 | 6 | 05/12/94 | purg |

410010 | 6 | 05/12/94 | qul |

410010 | 6 | 05/12/94 | sympts |

410010 | 6 | 05/12/94 | urn |

410010 | 6 | 05/12/94 | void |

As you can see, the DATA statement contains three data set names — *s210006*, *s310032*, and *s410010*. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called *stat481.icdblog*. Then comes the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN tells SAS to output any observations pertaining to subject 210006 to the *s210006* data set; the second IF-THEN tells SAS to output any observations pertaining to subject 310032 to the *s310032* data set; and, the third IF-THEN statement tells SAS to output any observations pertaining to subject 410010 to the *s410010* data set. SAS will hiccup if you have a data set name that appears in an OUTPUT statement without it also appearing in the DATA statement.

The PRINT procedures, of course, tell SAS to print the three newly created data sets. Note that the last PRINT procedure does not have a DATA= option. That's because when you name more than one data set in a single DATA statement, the last name on the DATA statement is the most recently created data set, and the one that subsequent procedures use by default. Therefore, the last PRINT procedure will print the *s410010* data set by default.

Now, before launching and running the SAS program, right-click to save the *icdblog* data set to a convenient location on your computer. Then, launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run * * the program and review the output from the PRINT procedures. You should see that, as expected, the data set

*s210006*contains data on subject 210006; the data set

*s310032*contains data on subject 310032; and

*s410010*contains data on subject 410010.

Incidentally, note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.

## Example 17.2

Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:

```
DATA subj210006 subj310032;
set stat481.icdblog;
if (subj = 210006) then output subj210006;
RUN;
PROC PRINT data = subj210006 NOOBS;
title 'The subj210006 data set';
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
```

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

210006 | 12 | 05/06/94 | cmed |

210006 | 12 | 05/06/94 | diet |

210006 | 12 | 05/06/94 | med |

210006 | 12 | 05/06/94 | phytrt |

210006 | 12 | 05/06/94 | purg |

210006 | 12 | 05/06/94 | qul |

210006 | 12 | 05/06/94 | sympts |

210006 | 12 | 05/06/94 | urn |

210006 | 12 | 05/06/94 | void |

The DATA statement contains two data set names, *subj210006* and *subj310032*, telling SAS that we intend to create two data sets. However, as you can see, the IF statement contains an OUTPUT statement that directs output to the *subj210006* data set, but no OUTPUT statement directs output to the *subj310032* data set. Launch and run * * the SAS program to convince yourself that the

*subj210006*data set contains data for subject 210006, while the

*subj310032*data set contains 0 observations. You should see a message like this in the log window:

```
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
NOTE: No observations in data set WORK.SUBJ310032.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
```

as well as see that no output for the *subj310032* data set appears in the output window.

## Example 17.3

If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement **before** the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, *current* and *days_vis*, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:

```
DATA subj210006 subj310032 subj410010;
set stat481.icdblog;
if (subj = 210006) then output subj210006;
else if (subj = 310032) then output subj310032;
else if (subj = 410010) then output subj410010;
current = today();
days_vis = current - v_date;
format current mmddyy8.;
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
```

SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
---|---|---|---|---|---|

310032 | 24 | 09/19/95 | backf | . | . |

310032 | 24 | 09/19/95 | cmed | . | . |

310032 | 24 | 09/19/95 | diet | . | . |

310032 | 24 | 09/19/95 | med | . | . |

310032 | 24 | 09/19/95 | medhxf | . | . |

310032 | 24 | 09/19/95 | phs | . | . |

310032 | 24 | 09/19/95 | phytrt | . | . |

310032 | 24 | 09/19/95 | preg | . | . |

310032 | 24 | 09/19/95 | purg | . | . |

310032 | 24 | 09/19/95 | qul | . | . |

310032 | 24 | 09/19/95 | sympts | . | . |

310032 | 24 | 09/19/95 | urn | . | . |

310032 | 24 | 09/19/95 | void | . | . |

The main thing to note in this program is that the *current* and *days_vis* assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the *current* and *days_vis* values are even calculated. Because SAS sets variables created in the DATA step as missing at the beginning of each iteration of the DATA step, the values of *current* and *days_vis* will remain missing for each observation.

By the way, the *today*( ) function, which is assigned to the variable *current*, creates a date variable containing today's date. Therefore, the variable *days_vis* is meant to contain the number of days since the subject's recorded visit *v_date*. However, as described above, the values of *current* and *days_vis* get set to missing. Launch and run * * the SAS program to convince yourself that the

*current*and

*days_vis*variables in the

*subj310032*data set contain only missing values. If we were to print the

*subj210006*and

*subj410020*data sets, we would see the same thing.

The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:

```
DATA subj210006 subj310032 subj410010;
set stat481.icdblog;
current = today();
days_vis = current - v_date;
format current mmddyy8.;
if (subj = 210006) then output subj210006;
else if (subj = 310032) then output subj310032;
else if (subj = 410010) then output subj410010;
RUN;
PROC PRINT data = subj310032 NOOBS;
title 'The subj310032 data set';
RUN;
```

SUBJ | V_TYPE | V_DATE | FORM | current | days_vis |
---|---|---|---|---|---|

310032 | 24 | 09/19/95 | backf | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | cmed | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | diet | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | med | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | medhxf | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | phs | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | phytrt | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | preg | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | purg | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | qul | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | sympts | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | urn | 09/06/23 | 10214 |

310032 | 24 | 09/19/95 | void | 09/06/23 | 10214 |

Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, now the variable *current* contains the date in which the program was run and the variable *days_vis* contains the number of days since that date and the date of the subject's visit. Launch and run * * the SAS program to convince yourself that the

*current*and

*days_vis*variables are properly written to the

*subj310032*data set. If we were to print the

*subj210006*and

*subj410020*data sets, we would see similar results.

## Example 17.4

After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:

```
DATA symptoms visitsix;
set stat481.icdblog;
if form = 'sympts' then output symptoms;
if v_type = 6 then output visitsix;
RUN;
PROC PRINT data = symptoms NOOBS;
title 'The symptoms data set';
RUN;
PROC PRINT data = visitsix NOOBS;
title 'The visitsix data set';
RUN;
```

The symptoms data set

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

210006 | 12 | 05/06/94 | sympts |

310032 | 24 | 09/19/95 | sympts |

410010 | 6 | 05/12/94 | sympts |

SUBJ | V_TYPE | V_DATE | FORM |
---|---|---|---|

410010 | 6 | 05/12/94 | cmed |

410010 | 6 | 05/12/94 | diet |

410010 | 6 | 05/12/94 | med |

410010 | 6 | 05/12/94 | phytrt |

410010 | 6 | 05/12/94 | purg |

410010 | 6 | 05/12/94 | qul |

410010 | 6 | 05/12/94 | sympts |

410010 | 6 | 05/12/94 | urn |

410010 | 6 | 05/12/94 | void |

The DATA step creates two temporary data sets, *symptoms *and *visitsix*. The *symptoms *data set contains only those observations containing a form code of* sympts*. The *visitsix* data set, on the other hand, contains observations for which *v_type* equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run * * the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which

*form*=

*sympts*is contained in both the

*symptoms*and

*visitsix*data sets.

# 17.2 - The RETAIN Statement

17.2 - The RETAIN StatementWhen SAS reads the DATA statement at the beginning of each iteration of the DATA step, SAS places missing values in the program data vector for variables that were assigned by either an INPUT statement or an assignment statement within the DATA step. A RETAIN statement effectively overrides this default. That is, a RETAIN statement tells SAS not to set variables whose values are assigned by an INPUT or assignment statement to missing when going from the current iteration of the DATA step to the next. Instead, SAS retains the values. The RETAIN statement takes the generic form:

`RETAIN `*variable1* *variable2* ... *variablen*;

You can specify as few or as many variables as you want. If you specify no variable names, then SAS retains the values of all of the variables created in an INPUT or assignment statement. You may initialize the values of variables within a RETAIN statement. For example, in the statement:

`RETAIN var1 0 var2 3 a b c 'XYZ'`

the variable *var1* is assigned the value 0; the variable *var2* is assigned the value 3, and the variables *a*, *b*, and *c* are all assigned the character value 'XYZ'. If you do not specify an initial value, SAS sets the initial value of a variable to be retained to missing.

Note that it is redundant to name any of the following items in a RETAIN statement, since their values are automatically retained from one iteration of the DATA step to the next:

- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option

Finally, since the RETAIN statement is not an executable statement, it can appear anywhere in the DATA step.

## Example 17.5

Throughout the remainder of the lesson, we will work with the *grades* data set that is created in the following DATA step:

```
DATA grades;
input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
cards;
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185
;
RUN;
PROC PRINT data = grades NOOBS;
title 'The grades data set';
RUN;
```

idno | l_name | gtype | grade |
---|---|---|---|

10 | Smith | E1 | 78 |

10 | Smith | E2 | 82 |

10 | Smith | E3 | 86 |

10 | Smith | E4 | 69 |

10 | Smith | P1 | 97 |

10 | Smith | F1 | 160 |

11 | Simon | E1 | 88 |

11 | Simon | E2 | 72 |

11 | Simon | E3 | 86 |

11 | Simon | E4 | 99 |

11 | Simon | P1 | 100 |

11 | Simon | F1 | 170 |

12 | Jones | E1 | 98 |

12 | Jones | E2 | 92 |

12 | Jones | E3 | 92 |

12 | Jones | E4 | 99 |

12 | Jones | P1 | 99 |

12 | Jones | F1 | 185 |

The *grades* data set is what we call a "subject- and grade-specific" data set. That is, there is one observation for each grade for each student. Students are identified by their id number (*idno*) and last name (*l_name*). The data set contains six different types of grades: exam 1 (*E1*), exam 2 (*E2*), exam 3 (*E3*), exam 4 (*E4*), each worth 100 points; one project (*P1*) worth 100 points; and a final exam (*F1*) worth 200 points. We'll suppose that the instructor agreed to drop the students' lowest exam grades (*E1, E2*, *E3*, *E4*) not including the final exam. Launch and run * * the SAS program so that we can work with the

*grades*data set in the following examples. Review the output from the PRINT procedure to convince yourself that the data were properly read into the

*grades*data set.

## Example 17.6

One of the most powerful uses of a RETAIN statement is to compare values across observations. The following program uses the RETAIN statement to compare values across observations, and in doing so determines each student's lowest grade for the four semester exams:

```
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4')));
RUN;
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno;
retain lowgrade lowtype;
if first.idno then lowgrade = grade;
lowgrade = min(lowgrade, grade);
if grade = lowgrade then lowtype = gtype;
if last.idno then output;
drop gtype;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
```

**Note**: In the upper right-hand corner of the code block you will have the option of copying (* *) the code to your clipboard or downloading (

*) the file to your computer.*

```
DATA exams;
set grades (where = (gtype in ('E1', 'E2', 'E3', 'E4'))); *Because we’re not interested in grade types P1 and F1, don’t read them in;
RUN;
*The data step will find the lowest grade for each student and store those observations in the dataset 'grades';
DATA lowest (rename = (lowtype = gtype));
set exams;
by idno; *Process the exams dataset by idno;
retain lowgrade lowtype; *Retain lowgrade and lowtype as each student is processed;
if first.idno then lowgrade = grade; *The first grade for each student is the starts as the lowest;
lowgrade = min(lowgrade, grade); *Assign lowgrade the lower of the current grade and the lowest grade thus far;
if grade = lowgrade then lowtype = gtype; *If the current grade is the lowest, update lowtype;
if last.idno then output; *For the last observation for a student, output one observation containing the lowest grade;
drop gtype; *Drop gtype here and rename lowtype as gtype in the data statement;
RUN;
PROC PRINT data=lowest;
title 'Output Dataset: LOWEST';
RUN;
```

Obs | idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|---|

1 | 10 | Smith | 69 | 69 | E4 |

2 | 11 | Simon | 99 | 72 | E2 |

3 | 12 | Jones | 99 | 92 | E3 |

Because the instructor only wants to drop the lowest exam grade, the first DATA step tells SAS to create a data set called *exams* by selecting only the exam grades (E1, E2, E3, and E4) from the data set *grades*.

It's the second DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll try to help you understand the code in three different ways. First, let's summarize the procedure. The DATA step searches through the *exams* data set for each subject ("*by idno*") and looks for the lowest grade ("*min*(*lowgrade*, *grade*)"). Because SAS would otherwise set the variables *lowgrade* and *lowtype* to missing for each new iteration, the RETAIN statement is used to keep track of the observation that contains the lowest grade. When SAS reads the last observation of the student ("*last.idno*") it outputs the data corresponding to the lowest exam type (*lowtype*) and grade (*lowgrade*) to the *lowest *data set. (Note that the statement "*if last.idno then output*;" effectively collapses multiple observations per student into one observation per student.) So that we can merge the *lowest *data set* *back into the *grades* data set, by *idno* and *gtype*, the variable *lowtype* is renamed back to *gtype*.

Now, let's dive in a bit deeper by investigating how SAS would process the *exams* data set. As you read through what follows, you'll want to refer to both the DATA step code and the *exams* data set (which is the same as the *grades* data set minus the P1 and F1 observations). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *exams* data set (*idno*, *l_name*, *gtype*, and *grade*), two variables defined within the DATA step (*lowgrade* and *lowtype*), and as a result of the BY statement, a *first.idno* and a *last.idno* variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | . | . | . | . | . |

SAS reads the first observation from the *exams* data set. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1 and *last.idno* is assigned a value of 0. Because *first.idno* equals 1, the *lowgrade* variable is assigned the same value as that of the *grade* variable, that is, 78. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since both values are 78, the value of the *lowgrade* variable remains unchanged. Because *grade* equals *lowgrade* (they are both 78), SAS assigns the *lowtype* variable the same value as that of the *gtype* variable, that is, E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | 78 | E1 | 1 | 0 |

Since *last.idno* does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, *lowgrade* and *lowtype*, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the *exams* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 82 and *lowgrade* equals 78 from the previous iteration, the value of the *lowgrade* variable remains 78. Because *grade* does not equal *lowgrade* now, the value of the *lowtype* variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | 78 | E1 | 0 | 0 |

Again, since *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of *lowgrade* and *lowtype* are retained from the previous iteration. SAS reads the third observation from the *exams* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 86 and *lowgrade* still equals 78 from the previous iteration, the value of the *lowgrade* variable still remains 78. Because *grade* still does not equal *lowgrade* now, the value of the *lowtype* variable is not changed. Instead, it remains E1. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | smith | E3 | 86 | 78 | E1 | 0 | 0 |

Again, since *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *lowest* data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of *lowgrade* and *lowtype* are retained from the previous iteration. SAS reads the fourth observation from the *exams* data set. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0. The *lowgrade* variable is then assigned the smallest value of the *lowgrade* and *grade* variables. Since *grade* now equals 69 and *lowgrade* still equals 78 from the previous iteration, the value of the *lowgrade* variable is updated to 69. Because *grade* equals *lowgrade* (they are both 69), SAS assigns the *lowtype* variable the same value as that of the *gtype* variable, that is, E4. Here's what the program data vector looks like now:

_N_ | _ERROR_ | idno | l_name | gtype | grade | lowgrade | lowtype | first.idno | last.idno |
---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 69 | E4 | 0 | 1 |

Now, since *last.idno* equals 1, SAS writes the contents of the program data vector to the *lowest* data set. In doing so, SAS does not write the automatic variables *_N_* and *_ERROR*, nor the *first.idno* and *last.idno* variables, to the data set. As instructed by the code, SAS drops the *gtype* variable and renames the *lowtype* variable to *gtype*. So, here's what the *lowest* data set looks like after processing the first four observations:

idno | l_name | grade | lowgrade | gtype |
---|---|---|---|---|

10 | Smith | 69 | 69 | E4 |

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken four observations and collapsed them into one observation.

There's just one more thing you might want to do to help you understand that second DATA step. That is, click on the **Inspect!** button below the code to see an explanation of each line in the DATA step. Then, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the

*lowest*data set contains the lowest exam grade for each student. Also, note that the

*lowest*data set contains one observation per student rather than six observations per student as in the original

*grades*data set.

# 17.3 - Automatic Retention

17.3 - Automatic RetentionAs mentioned earlier, values of the following variables are automatically retained from one iteration of the DATA step to the next:

- variables read with a SET, MERGE, or UPDATE statement
- a variable whose value is assigned in a SUM statement
- variables created by the IN = option

It is not necessary, therefore, to specify their variable names in a RETAIN statement. In this section, we'll take a look at an example that illustrates this point.

## Example 17.7

The following SAS program calculates the students' final grades, and in so doing illustrates the automatic retention of two variables — a variable whose value is assigned in a SUM statement (*total*) and a variable created by the IN = option (*lowest*):

```
PROC SORT data=grades;
by idno gtype;
RUN;
DATA final;
merge grades lowest (in=lowest);
by idno gtype;
if lowest then delete;
if first.idno then total = 0;
total + grade;
if last.idno then fnl = (total/600)*100;
format fnl 5.1;
drop lowgrade gtype;
RUN;
PROC PRINT data=final;
title 'Output Dataset: FINAL GRADES';
RUN;
```

Obs | idno | l_name | grade | total | fnl |
---|---|---|---|---|---|

1 | 10 | Smith | 78 | 78 | . |

2 | 10 | Smith | 82 | 160 | . |

3 | 10 | Smith | 86 | 246 | . |

4 | 10 | Smith | 160 | 406 | . |

5 | 10 | Smith | 97 | 503 | 83.8 |

6 | 11 | Simon | 88 | 88 | . |

7 | 11 | Simon | 86 | 174 | . |

8 | 11 | Simon | 99 | 273 | . |

9 | 11 | Simon | 170 | 443 | . |

10 | 11 | Simon | 100 | 543 | 90.5 |

11 | 12 | Jones | 98 | 98 | . |

12 | 12 | Jones | 92 | 190 | . |

13 | 12 | Jones | 99 | 289 | . |

14 | 12 | Jones | 185 | 474 | . |

15 | 12 | Jones | 99 | 573 | 95.5 |

Because the instructor wants to process the *grades* and *lowest *data sets by *idno* and *gtype*, the SORT procedure prepares the *grades* data set for doing so. The *lowest* data set is alread sorted by *idno* and *gtype*, and so no SORT procedure is necessary for it.

It's again the DATA step that is the meat of the program and the challenging one to understand. Because of that, we'll again try to help you understand the code in three different ways. First, a written summary ... the DATA step merges the *grades* and *lowest* data sets back together and calculates each student's final grade. Recall that the *grades *data set contains multiple observations for each student, while the *lowest* data set contains only one observation per student. Also, recall that the *lowest *data set indicates which exam grade (*gtype*) is lowest for each student. When SAS merges the two data sets by *idno* and *gtype*, the resulting merged data set *final* contains multiple observations for each student. Because the "*in* = *lowest*" option was attached to the *lowest* data set, the variable *lowest *will take the value 1 for the one observation for each student that contains their *lowest *exam score. The "*if lowest then delete*" statement deletes the observation pertaining to each student's lowest exam score from the *final* data set. That is, the *final* data set contains only those observations needed to calculate each students' final grade. Note that there is no RETAIN statement for the variable *lowest*, and yet its value is automatically retained from one iteration of the data step to the next.

The variable *total* is used to keep track of each student's total point accumulation. Therefore, for each student (*first.idno*), the variable *total* is initially set to 0. As you know, the expression:

`total + grade;`

is called a SUM statement. It merely adds the *grade* appearing in the current observation to the student's *total* accumulator. You could think of the SUM statement as this kind of statement that you'd typically see instead in other programming languages:

`total = total + grade;`

Note that there is no RETAIN statement for the variable *total*, and yet its value is automatically retained from one iteration of the DATA step to the next. When SAS reaches each student's last observation, the final grade (*fnl*) is calculated.

Now, let's dive in a bit deeper by investigating how SAS would process the *grades and lowest *data sets. As you read through what follows, you'll want to refer to both the DATA step code and the *grades* data set (you'll have to envision the data set as it is sorted by *idno* and *gtype*). As always, at the conclusion of the compile phase, SAS makes the program data vector. In this case, it contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *grades* data set (*idno*, *l_name*, *gtype*, and *grade*), a variable in the *lowest* data set (*lowgrade*), a variable called *lowest* that is created as a result of the IN= option attached to the *lowest* data set, two variables defined within the DATA step (*total* and *fnl*), and as a result of the BY statement, a *first.idno*, a *last.idno*, a *first.gtype* and a *last.gtype* variable. Here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | . | . | . | . | . | . | . | . |

For the sake of room, note that the *_ERROR_* variable has been abbreviated to *_Err_*. Also, the *first.gtype* and *last.gtype* variables are present, but not depicted, in the program data vector, because we never use them in the code. Now, SAS reads the first observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E1. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1, and *last.idno* is assigned a value of 0. Because *first.idno* equals 1, the *total* variable is initialized to 0, and then immediately updated to the sum of 0 and 78, or 78. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | . | 0 | 1 | 0 | 78 | . |

Because we've reached the end of the first iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set. In so doing, the *lowgrade* and *gtype* variables are dropped, as are all of the internal variables (*_N_*, *_ERROR_*, *lowest*, *first.idno*, *last.idno*, *first.gtype*, and *last.gtype*). Here's what the *final *data set looks like after the first iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | . | . | . | 0 | . | . | 78 | . |

Note that the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the second observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E2. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 78 and 82, or 160. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | . | 0 | 0 | 0 | 160 | . |

Because we've reached the end of the second iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the second iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | . | . | . | 0 | . | . | 160 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the third observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = E3. Therefore, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 160 and 86, or 246. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | Smith | E3 | 86 | . | 0 | 0 | 0 | 246 | . |

Because we've reached the end of the third iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the third iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

Because SAS encounters a new *idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | . | . | . | 0 | . | . | 246 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the fourth observation from the sorted *grades* data set. Now, there is an observation in the *lowest* data set for which *idno* = 10 and *gtype* = E4. Therefore, the *lowest* data set does contribute to the observation currently being built. The *lowgrade* variable is read from the *lowest* data set, getting a value of 69. The *lowest* variable is assigned a value of 1. Therefore, SAS does not write the observation to the *final *data set. The program data vector looks like this:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 69 | 1 | 0 | 0 | 246 | . |

and the *final* data set still looks like this after the fourth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

*idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

5 | 0 | . | . | . | 1 | . | . | 246 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the fifth observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = F1. (Don't forget that the *grades* data set was sorted by *idno* and *gtype*, so the observation for which *gtype* = F1 is processed before the observation for which *gtype* = P1.) Again, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 246 and 160, or 406. Because *last.idno* equals 0, the *fnl* variable is not calculated. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

5 | 0 | 10 | Smith | F1 | 160 | . | 0 | 0 | 0 | 406 | . |

Because we've reached the end of the fifth iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the fifth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

10 | Smith | 160 | 406 | . |

*idno *and* gtype*, SAS resets the values in the program data vector to missing:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | . | . | . | 0 | . | . | 406 | . |

Again, the values of the *lowest* and *total* variables are automatically retained from the previous iteration. Now, SAS reads the sixth observation from the sorted *grades* data set. There is no observation in the *lowest* data set for which *idno* = 10 and *gtype* = P1. Again, the *lowest* data set does not contribute anything to the observation currently being built. So, the *lowgrade* variable retains its missing value, and the *lowest* variable is assigned a value of 0. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0. Because *first.idno* equals 0, the *total* variable is not re-initialized to 0 but is updated to the sum of 406 and 97, or 503. Because *last.idno* equals 1, the *fnl* variable is calculated to equal 503 divided by 600, and multiplied by 100, to get 83.8. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | lowgrade | lowest | first.idno | last.idno | total | fnl |
---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | 10 | Smith | P1 | 97 | . | 0 | 0 | 1 | 503 | 83.8 |

Because we've reached the end of the sixth iteration of the DATA step, SAS writes the contents of the program data vector to the output *final* data set, dropping appropriate variables along the way. Here's what the *final* data set looks like after the sixth iteration of the DATA step:

idno | l_name | grade | total | fnl |
---|---|---|---|---|

10 | Smith | 78 | 78 | . |

10 | Smith | 82 | 160 | . |

10 | Smith | 86 | 246 | . |

10 | Smith | 160 | 406 | . |

10 | Smith | 97 | 503 | 83.8 |

Whew!! This is a lot of work! You can take it from here... the process continues as described.

There's just one more thing you might want to do to help you understand the DATA step. That is, click on the **Inspect!** button below the code to see an explanation of each line in the DATA step. Then, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the final grades are appropriately calculated.

# 17.4 - Transposing a Data Set

17.4 - Transposing a Data SetIt is very likely that some day you will encounter the need to turn what I call a "tall" data set, such as *grades*:

idno | l_name | gtype | grade |
---|---|---|---|

10 | Smith | E1 | 78 |

10 | Smith | E2 | 82 |

10 | Smith | E3 | 86 |

10 | Smith | E4 | 69 |

10 | Smith | P1 | 97 |

10 | Smith | F1 | 160 |

11 | Simon | E1 | 88 |

11 | Simon | E2 | 72 |

11 | Simon | E3 | 86 |

11 | Simon | E4 | 99 |

11 | Simon | P1 | 100 |

11 | Simon | F1 | 170 |

12 | Jones | E1 | 98 |

12 | Jones | E2 | 92 |

12 | Jones | E3 | 92 |

12 | Jones | E4 | 99 |

12 | Jones | P1 | 99 |

12 | Jones | F1 | 185 |

into a fat data set, say *gradestoo*, containing the same information but in a different structure:

idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|

10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |

11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |

12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |

We can use, as you'll soon see, a "LAST*.*" variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement to transpose the *grades* data set.

## Example 17.8

The following program illustrates a failed attempt at transposing the original "tall" *grades* data set (one observation per grade) to a "fat" data set (one observation per student):

```
DATA grades2;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
drop gtype grade;
RUN;
PROC PRINT data=grades2;
title 'Output Dataset: FAULTY TRANSPOSED GRADES';
RUN;
```

Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|

1 | 10 | Smith | . | . | . | . | 97 | . |

2 | 11 | Simon | . | . | . | . | 100 | . |

3 | 12 | Jones | . | . | . | . | 99 | . |

In summary, the program is trying to take the grades of each student ("*by idno*") that appear in the variable *grade* and depending on what type of grade they are ("*if gtype =*") assign them to the new variables *E1*, *E2*, ..., *F1*. Only when the last observation is encountered for each student ("*if last.idno*") are the data output to the *grades2* data set. The only problem is that because there is no RETAIN statement to tell SAS to retain the values of *E1*, *E2*, ..., *F1* from one iteration of the DATA step to the next. Therefore, by the time SAS goes to output the program data vector, the values of *E1*, *E2*, ..., *F1* have been set to missing.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the grades

*E1*,

*E2*, ...,

*F1*are not retained from iteration to iteration and therefore are all missing. The reason the grade for

*P1*is not missing is that its value is encountered last in the sorted

*grades*data set. Therefore, there are no more iterations of the DATA step after its value has been assigned, and SAS has no opportunity to set it to missing.

The following program correctly makes the transposition using the RETAIN statement:

```
DATA grades3;
DATA grades3;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
retain E1 E2 E3 E4 P1 F1;
drop gtype grade;
RUN;
PROC PRINT data=grades3;
title 'Output Dataset: TRANSPOSED GRADES';
RUN;
```

Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|

1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |

2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |

3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |

Note that the DATA step in this program is identical to the DATA step in the previous program except this one has a RETAIN statement while the previous one did not. First, inspect the program so you understand what each line of the program is doing. Then, let's work our way through how SAS processes the DATA step.

The program data vector contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *grades* data set (*idno*, *l_name*, *gtype*, and *grade*), six variables created within the DATA step (*E1*, *E2*, *E3*, *E4*, *P1*, and *F1*), and as a result of the BY statement, a *first.idno* and a *last.idno* variable. Noting that the *_ERROR_* variable has been abbreviated to *_Err_*, here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 0 | . | . | . | . | . | . | . | . | . |

Now, SAS reads the first observation from the *grades* data set. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1 and *last.idno* is assigned a value of 0. Because *gtype* equals E1, the *E1* variable is assigned the value contained in the *grade* variable, that is, 78. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | 1 | 0 | 78 | . | . | . | . | . |

Because *last.idno* does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, *E1*,* E2*, ..., and* F1*, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E2, the *E2* variable is assigned the value contained in the *grade* variable, that is, 82. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | 0 | 0 | 78 | 82 | . | . | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of *E1*,* E2*, ..., and* F1 *are retained from the previous iteration. SAS reads the third observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E3, the *E3* variable is assigned the value contained in the *grade* variable, that is, 86. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | Smith | E3 | 86 | 0 | 0 | 78 | 82 | 86 | . | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of *E1*,* E2*, ..., and* F1 *are retained from the previous iteration. SAS reads the fourth observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E4, the *E4* variable is assigned the value contained in the *grade* variable, that is, 69. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 0 | 0 | 78 | 82 | 86 | 69 | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the fifth observation. Again, the values of *E1*,* E2*, ..., and* F1 *are retained from the previous iteration. SAS reads the fifth observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals F1 (don't forget that we sorted the *grades* data set by *idno* and *gtype*), the *F1* variable is assigned the value contained in the *grade* variable, that is, 160. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | F1 | 160 | 0 | 0 | 78 | 82 | 86 | 69 | . | 160 |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the sixth observation. Again, the values of *E1*,* E2*, ..., and* F1 *are retained from the previous iteration. SAS reads the sixth observation from the *grades* data set. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0.. Because *gtype* equals P1 (don't forget that we sorted the *grades* data set by *idno* and *gtype*), the *P1* variable is assigned the value contained in the *grade* variable, that is, 97. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | 10 | Smith | P1 | 97 | 0 | 1 | 78 | 82 | 86 | 69 | 97 | 160 |

Ahhhh... finally.... now, because *last.idno* equals 1, SAS writes the contents of the program data vector to the *grades3* data set. In doing so, SAS does not write the automatic variables *_N_* and *_ERROR*, nor the *first.idno* and *last.idno* variables, to the data set. As instructed by the code, SAS drops the *gtype* and *grade* variables. So, here's what the *grades3* data set looks like after processing the first six observations:

idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|

0 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken six observations and collapsed them into one fat observation. Now, SAS moves back to the top of the DATA step and begins processing the seventh observation. Because every combination of student and grade type is represented in the data set, it is not necessary to reset the retained values to missing, however, you may wish to do so, just in case. You can take it from here.

Now, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that this time the grades

*E1*,

*E2*, ...,

*F1*are appropriately assigned and retained. Also, note that we have successfully transposed the

*grades*data set from a "tall" data set to a "fat" data set.

Just to close the loop, the following program calculates the final grades of the students using the newly transposed data set:

```
DATA grades4;
set grades3;
if E1 = min(E1, E2, E3, E4) then E1 = .;
else if E2 = min(E1, E2, E3, E4) then E2 = .;
else if E3 = min(E1, E2, E3, E4) then E3 = .;
else if E4 = min(E1, E2, E3, E4) then E4 = .;
fnl = sum(E1, E2, E3, E4, P1, F1)/6;
format fnl 5.1;
RUN;
PROC PRINT data=grades4;
title 'Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES';
RUN;
```

Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 | fnl |
---|---|---|---|---|---|---|---|---|---|

1 | 10 | Smith | 78 | 82 | 86 | . | 97 | 160 | 83.8 |

2 | 11 | Simon | 88 | . | 86 | 99 | 100 | 170 | 90.5 |

3 | 12 | Jones | 98 | . | 92 | 99 | 99 | 185 | 95.5 |

The MIN function:

`min(var1, var2, var3, .., varn)`

returns the smallest number of the given arguments. Therefore, the IF-THEN-ELSE statement in the DATA step replaces the lowest exam grade encountered for each student with a missing value. For example, if SAS deems that a student's *E1* value is the smallest of her four exam grades, then SAS will set *E1* to missing. Then, SAS calculates the student's final (*fnl*) grade by merely summing the student's available grades and dividing by 6.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the student's final grades have been again appropriately calculated.

**NOTE!**Illustrating the calculation of the final grades twice — once with the "tall" data set and once with the "fat" data set — is intended only to show you that the grades can be calculated either way. Although the code may be simpler to understand with the "fat" data set, it is not to suggest that you should necessarily lean towards "fat" data sets. At least some statistical analysis procedures require that your data sets be "tall."

# 17.5 - Summary

17.5 - SummaryIn this lesson, we learned how the OUTPUT and RETAIN statements allow us to control the output of data sets.

The homework for this lesson will give you practice with these techniques.