It is very likely that some day you will encounter the need to turn what I call a "tall" data set, such as grades:
idno | l_name | gtype | grade |
---|---|---|---|
10 | Smith | E1 | 78 |
10 | Smith | E2 | 82 |
10 | Smith | E3 | 86 |
10 | Smith | E4 | 69 |
10 | Smith | P1 | 97 |
10 | Smith | F1 | 160 |
11 | Simon | E1 | 88 |
11 | Simon | E2 | 72 |
11 | Simon | E3 | 86 |
11 | Simon | E4 | 99 |
11 | Simon | P1 | 100 |
11 | Simon | F1 | 170 |
12 | Jones | E1 | 98 |
12 | Jones | E2 | 92 |
12 | Jones | E3 | 92 |
12 | Jones | E4 | 99 |
12 | Jones | P1 | 99 |
12 | Jones | F1 | 185 |
into a fat data set, say gradestoo, containing the same information but in a different structure:
idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|
10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
We can use, as you'll soon see, a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement to transpose the grades data set.
Example 17.8 Section
The following program illustrates a failed attempt at transposing the original "tall" grades data set (one observation per grade) to a "fat" data set (one observation per student):
DATA grades2;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
drop gtype grade;
RUN;
PROC PRINT data=grades2;
title 'Output Dataset: FAULTY TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | . | . | . | . | 97 | . |
2 | 11 | Simon | . | . | . | . | 100 | . |
3 | 12 | Jones | . | . | . | . | 99 | . |
In summary, the program is trying to take the grades of each student ("by idno") that appear in the variable grade and depending on what type of grade they are ("if gtype =") assign them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") are the data output to the grades2 data set. The only problem is that because there is no RETAIN statement to tell SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. Therefore, by the time SAS goes to output the program data vector, the values of E1, E2, ..., F1 have been set to missing.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are not retained from iteration to iteration and therefore are all missing. The reason the grade for P1 is not missing is that its value is encountered last in the sorted grades data set. Therefore, there are no more iterations of the DATA step after its value has been assigned, and SAS has no opportunity to set it to missing.
The following program correctly makes the transposition using the RETAIN statement:
DATA grades3;
DATA grades3;
set grades;
by idno;
if gtype = 'E1' then E1 = grade;
else if gtype = 'E2' then E2 = grade;
else if gtype = 'E3' then E3 = grade;
else if gtype = 'E4' then E4 = grade;
else if gtype = 'P1' then P1 = grade;
else if gtype = 'F1' then F1 = grade;
if last.idno then output;
retain E1 E2 E3 E4 P1 F1;
drop gtype grade;
RUN;
PROC PRINT data=grades3;
title 'Output Dataset: TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
2 | 11 | Simon | 88 | 72 | 86 | 99 | 100 | 170 |
3 | 12 | Jones | 98 | 92 | 92 | 99 | 99 | 185 |
Note that the DATA step in this program is identical to the DATA step in the previous program except this one has a RETAIN statement while the previous one did not. First, inspect the program so you understand what each line of the program is doing. Then, let's work our way through how SAS processes the DATA step.
The program data vector contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), six variables created within the DATA step (E1, E2, E3, E4, P1, and F1), and as a result of the BY statement, a first.idno and a last.idno variable. Noting that the _ERROR_ variable has been abbreviated to _Err_, here's what the program data vector looks like at the beginning of the first iteration of the DATA step:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | . | . | . | . | . | . | . | . | . |
Now, SAS reads the first observation from the grades data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because gtype equals E1, the E1 variable is assigned the value contained in the grade variable, that is, 78. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | E1 | 78 | 1 | 0 | 78 | . | . | . | . | . |
Because last.idno does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, E1, E2, ..., and F1, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E2, the E2 variable is assigned the value contained in the grade variable, that is, 82. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 0 | 10 | Smith | E2 | 82 | 0 | 0 | 78 | 82 | . | . | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the third observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E3, the E3 variable is assigned the value contained in the grade variable, that is, 86. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 0 | 10 | Smith | E3 | 86 | 0 | 0 | 78 | 82 | 86 | . | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fourth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E4, the E4 variable is assigned the value contained in the grade variable, that is, 69. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 0 | 10 | Smith | E4 | 69 | 0 | 0 | 78 | 82 | 86 | 69 | . | . |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fifth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fifth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals F1 (don't forget that we sorted the grades data set by idno and gtype), the F1 variable is assigned the value contained in the grade variable, that is, 160. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 10 | Smith | F1 | 160 | 0 | 0 | 78 | 82 | 86 | 69 | . | 160 |
Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the sixth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the sixth observation from the grades data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0.. Because gtype equals P1 (don't forget that we sorted the grades data set by idno and gtype), the P1 variable is assigned the value contained in the grade variable, that is, 97. Here's what the program data vector looks like now:
_N_ | _ERR_ | idno | l_name | gtype | grade | first.idno | last.idno | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 0 | 10 | Smith | P1 | 97 | 0 | 1 | 78 | 82 | 86 | 69 | 97 | 160 |
Ahhhh... finally.... now, because last.idno equals 1, SAS writes the contents of the program data vector to the grades3 data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype and grade variables. So, here's what the grades3 data set looks like after processing the first six observations:
idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 |
---|---|---|---|---|---|---|---|
0 | Smith | 78 | 82 | 86 | 69 | 97 | 160 |
Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken six observations and collapsed them into one fat observation. Now, SAS moves back to the top of the DATA step and begins processing the seventh observation. Because every combination of student and grade type is represented in the data set, it is not necessary to reset the retained values to missing, however, you may wish to do so, just in case. You can take it from here.
Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that this time the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the grades data set from a "tall" data set to a "fat" data set.
Just to close the loop, the following program calculates the final grades of the students using the newly transposed data set:
DATA grades4;
set grades3;
if E1 = min(E1, E2, E3, E4) then E1 = .;
else if E2 = min(E1, E2, E3, E4) then E2 = .;
else if E3 = min(E1, E2, E3, E4) then E3 = .;
else if E4 = min(E1, E2, E3, E4) then E4 = .;
fnl = sum(E1, E2, E3, E4, P1, F1)/6;
format fnl 5.1;
RUN;
PROC PRINT data=grades4;
title 'Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES';
RUN;
Obs | idno | l_name | E1 | E2 | E3 | E4 | P1 | F1 | fnl |
---|---|---|---|---|---|---|---|---|---|
1 | 10 | Smith | 78 | 82 | 86 | . | 97 | 160 | 83.8 |
2 | 11 | Simon | 88 | . | 86 | 99 | 100 | 170 | 90.5 |
3 | 12 | Jones | 98 | . | 92 | 99 | 99 | 185 | 95.5 |
The MIN function:
min(var1, var2, var3, .., varn)
returns the smallest number of the given arguments. Therefore, the IF-THEN-ELSE statement in the DATA step replaces the lowest exam grade encountered for each student with a missing value. For example, if SAS deems that a student's E1 value is the smallest of her four exam grades, then SAS will set E1 to missing. Then, SAS calculates the student's final (fnl) grade by merely summing the student's available grades and dividing by 6.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the student's final grades have been again appropriately calculated.