It is very likely that some day you will encounter the need to turn what I call a "tall" data set, such as *grades*:

idno | l_name | gtype | grade |
---|---|---|---|

10 | Smith | E1 | 78 |

10 | Smith | E2 | 82 |

10 | Smith | E3 | 86 |

10 | Smith | E4 | 69 |

10 | Smith | P1 | 97 |

10 | Smith | F1 | 160 |

11 | Simon | E1 | 88 |

11 | Simon | E2 | 72 |

11 | Simon | E3 | 86 |

11 | Simon | E4 | 99 |

11 | Simon | P1 | 100 |

11 | Simon | F1 | 170 |

12 | Jones | E1 | 98 |

12 | Jones | E2 | 92 |

12 | Jones | E3 | 92 |

12 | Jones | E4 | 99 |

12 | Jones | P1 | 99 |

12 | Jones | F1 | 185 |

into a fat data set, say *gradestoo*, containing the same information but in a different structure:

idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|

10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |

11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |

12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |

We can use, as you'll soon see, a "LAST*.*" variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement to transpose the *grades* data set.

##
Example 17.8
Section* *

The following program illustrates a failed attempt at transposing the original "tall" *grades* data set (one observation per grade) to a "fat" data set (one observation per student):

` ````
DATA grades2;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
drop gtype grade;
RUN;
PROC PRINT data=grades2;
title 'Output Dataset: FAULTY TRANSPOSED GRADES';
RUN;
```

In summary, the program is trying to take the grades of each student ("*by idno*") that appear in the variable *grade* and depending on what type of grade they are ("*if gtype =*") assign them to the new variables *E1*, *E2*, ..., *F1*. Only when the last observation is encountered for each student ("*if last.idno*") are the data output to the *grades2* data set. The only problem is that because there is no RETAIN statement to tell SAS to retain the values of *E1*, *E2*, ..., *F1* from one iteration of the DATA step to the next. Therefore, by the time SAS goes to output the program data vector, the values of *E1*, *E2*, ..., *F1* have been set to missing.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the grades *E1*, *E2*, ..., *F1* are not retained from iteration to iteration and therefore are all missing. The reason the grade for *P1* is not missing is that its value is encountered last in the sorted *grades* data set. Therefore, there are no more iterations of the DATA step after its value has been assigned, and SAS has no opportunity to set it to missing.

The following program correctly makes the transposition using the RETAIN statement:

```
DATA grades3;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
retain E1 E2 E3 E4 P1 F1;
drop gtype grade;
RUN;
PROC PRINT data=grades3;
title 'Output Dataset: TRANSPOSED GRADES';
RUN;
```

Note that the DATA step in this program is identical to the DATA step in the previous program except this one has a RETAIN statement while the previous one did not. First, inspect the program so you understand what each line of the program is doing. Then, let's work our way through how SAS processes the DATA step.

The program data vector contains the automatic variables (*_N_* and *_ERROR_*), the four variables in the *grades* data set (*idno*, *l_name*, *gtype*, and *grade*), six variables created within the DATA step (*E1*, *E2*, *E3*, *E4*, *P1*, and *F1*), and as a result of the BY statement, a *first.idno* and a *last.idno* variable. Noting that the *_ERROR_* variable has been abbreviated to *_Err_*, here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 0 | . | . | . | . | . | . | . | . | . |

Now, SAS reads the first observation from the *grades* data set. The observation is the first in the group of id numbers that equal 10, therefore *first.idno* is assigned the value of 1 and *last.idno* is assigned a value of 0. Because *gtype* equals E1, the *E1* variable is assigned the value contained in the *grade* variable, that is, 78. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | E1 | 78 | 1 | 0 | 78 | . | . | . | . | . |

Because *last.idno* does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, *E1*,* E2*, ..., and* F1*, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E2, the *E2* variable is assigned the value contained in the *grade* variable, that is, 82. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

2 | 0 | 10 | Smith | E2 | 82 | 0 | 0 | 78 | 82 | . | . | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of *E1*,* E2*, ..., and* F1 * are retained from the previous iteration. SAS reads the third observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E3, the *E3* variable is assigned the value contained in the *grade* variable, that is, 86. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

3 | 0 | 10 | Smith | E3 | 86 | 0 | 0 | 78 | 82 | 86 | . | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of *E1*,* E2*, ..., and* F1 * are retained from the previous iteration. SAS reads the fourth observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals E4, the *E4* variable is assigned the value contained in the *grade* variable, that is, 69. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

4 | 0 | 10 | Smith | E4 | 69 | 0 | 0 | 78 | 82 | 86 | 69 | . | . |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the fifth observation. Again, the values of *E1*,* E2*, ..., and* F1 * are retained from the previous iteration. SAS reads the fifth observation from the *grades* data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore *first.idno* and *last.idno* are both assigned a value of 0. Because *gtype* equals F1 (don't forget that we sorted the *grades* data set by *idno* and *gtype*), the *F1* variable is assigned the value contained in the *grade* variable, that is, 160. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | 10 | Smith | F1 | 160 | 0 | 0 | 78 | 82 | 86 | 69 | . | 160 |

Again, because *last.idno* still does not equal 1, SAS does not write the contents of the program data vector to the *grades3* data set. Instead, SAS returns to the top of the DATA step to begin processing the sixth observation. Again, the values of *E1*,* E2*, ..., and* F1 * are retained from the previous iteration. SAS reads the sixth observation from the *grades* data set. The observation is the last in the group of id numbers that equal 10, therefore *last.idno* is assigned the value of 1 and *first.idno* is assigned a value of 0.. Because *gtype* equals P1 (don't forget that we sorted the *grades* data set by *idno* and *gtype*), the *P1* variable is assigned the value contained in the *grade* variable, that is, 97. Here's what the program data vector looks like now:

_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

6 | 0 | 10 | Smith | P1 | 97 | 0 | 1 | 78 | 82 | 86 | 69 | 97 | 160 |

Ahhhh... finally.... now, because *last.idno* equals 1, SAS writes the contents of the program data vector to the *grades3* data set. In doing so, SAS does not write the automatic variables *_N_* and *_ERROR*, nor the *first.idno* and *last.idno* variables, to the data set. As instructed by the code, SAS drops the *gtype* and *grade* variables. So, here's what the *grades3* data set looks like after processing the first six observations:

idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|

0 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken six observations and collapsed them into one fat observation. Now, SAS moves back to the top of the DATA step and begins processing the seventh observation. Because every combination of student and grade type is represented in the data set, it is not necessary to reset the retained values to missing, however, you may wish to do so, just in case. You can take it from here.

Now, launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that this time the grades *E1*, *E2*, ..., *F1* are appropriately assigned and retained. Also, note that we have successfully transposed the *grades* data set from a "tall" data set to a "fat" data set.

Just to close the loop, the following program calculates the final grades of the students using the newly transposed data set:

` ````
DATA grades4;
set grades3;
if E1 = min(E1, E2, E3, E4) then E1 = .;
else if E2 = min(E1, E2, E3, E4) then E2 = .;
else if E3 = min(E1, E2, E3, E4) then E3 = .;
else if E4 = min(E1, E2, E3, E4) then E4 = .;
fnl = sum(E1, E2, E3, E4, P1, F1)/6;
format fnl 5.1;
RUN;
PROC PRINT data=grades4;
title 'Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES';
RUN;
```

The MIN function:

`min(`*var1*, *var2*, *var3*, .., *varn*)

returns the smallest number of the given arguments. Therefore, the IF-THEN-ELSE statement in the DATA step replaces the lowest exam grade encountered for each student with a missing value. For example, if SAS deems that a student's *E1* value is the smallest of her four exam grades, then SAS will set *E1* to missing. Then, SAS calculates the student's final (*fnl*) grade by merely summing the student's available grades and dividing by 6.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the student's final grades have been again appropriately calculated.

**NOTE!**Illustrating the calculation of the final grades twice — once with the "tall" data set and once with the "fat" data set — is intended only to show you that the grades can be calculated either way. Although the code may be simpler to understand with the "fat" data set, it is not to suggest that you should necessarily lean towards "fat" data sets. At least some statistical analysis procedures require that your data sets be "tall."