17.4 - Transposing a Data Set

It is very likely that some day you will encounter the need to turn what I call a "tall" data set, such as grades:

idno l_name gtype grade
10 Smith E1 78
10 Smith E2 82
10 Smith E3 86
10 Smith E4 69
10 Smith P1 97
10 Smith F1 160
11 Simon E1 88
11 Simon E2 72
11 Simon E3 86
11 Simon E4 99
11 Simon P1 100
11 Simon F1 170
12 Jones E1 98
12 Jones E2 92
12 Jones E3 92
12 Jones E4 99
12 Jones P1 99
12 Jones F1 185

into a fat data set, say gradestoo, containing the same information but in a different structure:

idno l_name E1 E2 E3 E4 P1 F1
10 Smith 78 82 86 69 97 160
11 Simon 88 72 86 99 100 170
12 Jones 98 92 92 99 99 185

We can use, as you'll soon see, a "LAST." variable in conjunction with BY-group processing, a RETAIN statement, and an OUTPUT statement to transpose the grades data set.

Example 17.8 Section

The following program illustrates a failed attempt at transposing the original "tall" grades data set (one observation per grade) to a "fat" data set (one observation per student):

DATA grades2;
    set grades;
    by idno;
            if gtype = 'E1' then E1 = grade;
       else if gtype = 'E2' then E2 = grade;
       else if gtype = 'E3' then E3 = grade;
       else if gtype = 'E4' then E4 = grade;
       else if gtype = 'P1' then P1 = grade;
       else if gtype = 'F1' then F1 = grade;
    if last.idno then output;
    drop gtype grade;
RUN;
PROC PRINT data=grades2;
    title 'Output Dataset: FAULTY TRANSPOSED GRADES';
RUN;

Output Dataset: FAULTY TRANSPOSED GRADES

Obs

idno

l_name

E1

E2

E3

E4

P1

F1

1

10

Smith

.

.

.

.

97

.

2

11

Simon

.

.

.

.

100

.

3

12

Jones

.

.

.

.

99

.

In summary, the program is trying to take the grades of each student ("by idno") that appear in the variable grade and depending on what type of grade they are ("if gtype =") assign them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") are the data output to the grades2 data set. The only problem is that because there is no RETAIN statement to tell SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. Therefore, by the time SAS goes to output the program data vector, the values of E1, E2, ..., F1 have been set to missing.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are not retained from iteration to iteration and therefore are all missing. The reason the grade for P1 is not missing is that its value is encountered last in the sorted grades data set. Therefore, there are no more iterations of the DATA step after its value has been assigned, and SAS has no opportunity to set it to missing.

The following program correctly makes the transposition using the RETAIN statement:

DATA grades3;
DATA grades3;
set grades;
by idno;
            if gtype = 'E1' then E1 = grade;
       else if gtype = 'E2' then E2 = grade;
       else if gtype = 'E3' then E3 = grade;
       else if gtype = 'E4' then E4 = grade;
       else if gtype = 'P1' then P1 = grade;
       else if gtype = 'F1' then F1 = grade;
    if last.idno then output;
    retain E1 E2 E3 E4 P1 F1;
    drop gtype grade;
RUN;
PROC PRINT data=grades3;
    title 'Output Dataset: TRANSPOSED GRADES';
RUN;

Output Dataset: TRANSPOSED GRADES

Obs

idno

l_name

E1

E2

E3

E4

P1

F1

1

10

Smith

78

82

86

69

97

160

2

11

Simon

88

72

86

99

100

170

3

12

Jones

98

92

92

99

99

185

Note that the DATA step in this program is identical to the DATA step in the previous program except this one has a RETAIN statement while the previous one did not. First, inspect the program so you understand what each line of the program is doing. Then, let's work our way through how SAS processes the DATA step.

The program data vector contains the automatic variables (_N_ and _ERROR_), the four variables in the grades data set (idno, l_name, gtype, and grade), six variables created within the DATA step (E1, E2, E3, E4, P1, and F1), and as a result of the BY statement, a first.idno and a last.idno variable. Noting that the _ERROR_ variable has been abbreviated to _Err_, here's what the program data vector looks like at the beginning of the first iteration of the DATA step:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

1

0

0

  

.

.

.

.

.

.

.

.

.

Now, SAS reads the first observation from the grades data set. The observation is the first in the group of id numbers that equal 10, therefore first.idno is assigned the value of 1 and last.idno is assigned a value of 0. Because gtype equals E1, the E1 variable is assigned the value contained in the grade variable, that is, 78. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

1

0

10

Smith

E1

78

1

0

78

.

.

.

.

.

Because last.idno does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the second observation. Typically, SAS would reset the values of the variables created within the DATA step, that is, E1, E2, ..., and F1, to missing. The RETAIN statement overrides that default and the values of those variables from the previous iteration are retained. SAS reads the second observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E2, the E2 variable is assigned the value contained in the grade variable, that is, 82. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

2

0

10

Smith

E2

82

0

0

78

82

.

.

.

.

Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the third observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the third observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E3, the E3 variable is assigned the value contained in the grade variable, that is, 86. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

3

0

10

Smith

E3

86

0

0

78

82

86

.

.

.

Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fourth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fourth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals E4, the E4 variable is assigned the value contained in the grade variable, that is, 69. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

4

0

10

Smith

E4

69

0

0

78

82

86

69

.

.

Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the fifth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the fifth observation from the grades data set. The observation is neither the first nor the last in the group of id numbers that equal 10, therefore first.idno and last.idno are both assigned a value of 0. Because gtype equals F1 (don't forget that we sorted the grades data set by idno and gtype), the F1 variable is assigned the value contained in the grade variable, that is, 160. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

1

0

10

Smith

F1

160

0

0

78

82

86

69

.

160

Again, because last.idno still does not equal 1, SAS does not write the contents of the program data vector to the grades3 data set. Instead, SAS returns to the top of the DATA step to begin processing the sixth observation. Again, the values of E1, E2, ..., and F1 are retained from the previous iteration. SAS reads the sixth observation from the grades data set. The observation is the last in the group of id numbers that equal 10, therefore last.idno is assigned the value of 1 and first.idno is assigned a value of 0.. Because gtype equals P1 (don't forget that we sorted the grades data set by idno and gtype), the P1 variable is assigned the value contained in the grade variable, that is, 97. Here's what the program data vector looks like now:

_N_

_ERR_

idno

l_name

gtype

grade

first.idno

last.idno

E1

E2

E3

E4

P1

F1

6

0

10

Smith

P1

97

0

1

78

82

86

69

97

160

Ahhhh... finally.... now, because last.idno equals 1, SAS writes the contents of the program data vector to the grades3 data set. In doing so, SAS does not write the automatic variables _N_ and _ERROR, nor the first.idno and last.idno variables, to the data set. As instructed by the code, SAS drops the gtype and grade variables. So, here's what the grades3 data set looks like after processing the first six observations:

idno

l_name

E1

E2

E3

E4

P1

F1

0

Smith

78

82

86

69

97

160

        
        

Here's where it should be clear that SAS has, by virtue of the code we've written, effectively taken six observations and collapsed them into one fat observation. Now, SAS moves back to the top of the DATA step and begins processing the seventh observation. Because every combination of student and grade type is represented in the data set, it is not necessary to reset the retained values to missing, however, you may wish to do so, just in case. You can take it from here.

Now, launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that this time the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the grades data set from a "tall" data set to a "fat" data set.

Just to close the loop, the following program calculates the final grades of the students using the newly transposed data set:

DATA grades4;
    set grades3;
         if E1 = min(E1, E2, E3, E4) then E1 = .;
    else if E2 = min(E1, E2, E3, E4) then E2 = .;
    else if E3 = min(E1, E2, E3, E4) then E3 = .;
    else if E4 = min(E1, E2, E3, E4) then E4 = .;
    fnl = sum(E1, E2, E3, E4, P1, F1)/6;
    format fnl 5.1;
RUN;
 
PROC PRINT data=grades4;
    title 'Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES';
RUN;

Output Dataset: FINAL GRADES calculated from TRANSPOSED GRADES

Obs

idno

l_name

E1

E2

E3

E4

P1

F1

fnl

1

10

Smith

78

82

86

.

97

160

83.8

2

11

Simon

88

.

86

99

100

170

90.5

3

12

Jones

98

.

92

99

99

185

95.5

The MIN function:

min(var1, var2, var3, .., varn)

returns the smallest number of the given arguments. Therefore, the IF-THEN-ELSE statement in the DATA step replaces the lowest exam grade encountered for each student with a missing value. For example, if SAS deems that a student's E1 value is the smallest of her four exam grades, then SAS will set E1 to missing. Then, SAS calculates the student's final (fnl) grade by merely summing the student's available grades and dividing by 6.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the student's final grades have been again appropriately calculated.

NOTE! Illustrating the calculation of the final grades twice — once with the "tall" data set and once with the "fat" data set — is intended only to show you that the grades can be calculated either way. Although the code may be simpler to understand with the "fat" data set, it is not to suggest that you should necessarily lean towards "fat" data sets. At least some statistical analysis procedures require that your data sets be "tall."