19.6 - Using Arrays to Transpose a Data Set

A few lessons back, we learned how to transpose a data set by taking advantage of the last. variable and RETAIN and OUTPUT statements. In this section, we'll learn how to use an array to transpose a "tall" data set into a "fat" data set.

Example 19.15 Section

Throughout this section, we will work with the tallgrades data set that is created in the following DATA step:

DATA tallgrades;
    input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
    cards;
10  Smith  E1  78
10  Smith  E2  82
10  Smith  E3  86
10  Smith  E4  69
10  Smith  P1  97
10  Smith  F1 160
11  Simon  E1  88
11  Simon  E2  72
11  Simon  E3  86
11  Simon  E4  99
11  Simon  P1 100
11  Simon  F1 170
12  Jones  E1  98
12  Jones  E2  92
12  Jones  E3  92
12  Jones  E4  99
12  Jones  P1  99
12  Jones  F1 185
;
RUN;

PROC PRINT data = tallgrades NOOBS;
   TITLE 'The tall grades data set';
RUN;

The tall grades data set

idno

l_name

gtype

grade

10

Smith

E1

78

10

Smith

E2

82

10

Smith

E3

86

10

Smith

E4

69

10

Smith

P1

97

10

Smith

F1

160

11

Simon

E1

88

11

Simon

E2

72

11

Simon

E3

86

11

Simon

E4

99

11

Simon

P1

100

11

Simon

F1

170

12

Jones

E1

98

12

Jones

E2

92

12

Jones

E3

92

12

Jones

E4

99

12

Jones

P1

99

12

Jones

F1

185

The tallgrades data set contains one observation for each grade for each student. Students are identified by their ID number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. Launch and run  the SAS program so that we can work with the tallgrades data set in the next two examples.

Example 19.16 Section

(You might recall seeing this program a few lessons ago.) Using RETAIN and OUTPUT statements, the following program takes advantage of BY-group processing, as well as RETAIN and OUTPUT statements, to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
	by idno;
			 if gtype = 'E1' then E1 = grade;
		else if gtype = 'E2' then E2 = grade;
		else if gtype = 'E3' then E3 = grade;
		else if gtype = 'E4' then E4 = grade;
		else if gtype = 'P1' then P1 = grade;
		else if gtype = 'F1' then F1 = grade;
	if last.idno then output;
	retain E1 E2 E3 E4 P1 F1;
	drop gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
	title 'The fat grades data set';
RUN;

The fat grades data set

Obs

idno

l_name

E1

E2

E3

E4

P1

F1

1

10

Smith

78

82

86

69

97

160

2

11

Simon

88

72

86

99

100

170

3

12

Jones

98

92

92

99

99

185

First, note that the program takes the grades of each student ("by idno") that appear in the variable grade and, depending on what type of grade they are ("if gtype ="), assigns them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") is the data output to the fatgrades data set. The RETAIN statement tells SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. If the RETAIN statement was not present, by the time SAS went to output the program data vector, the values of E1, E2, ..., F1 would be set to missing.

Now, launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.

Example 19.17 Section

The following program uses an array to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
   by idno;
   array allgrades (6) G1 - G6;
   if first.idno then i = 1;
   allgrades(i) = grade;
   if last.idno then output;
   i + 1;
   retain G1 - G6;
   drop i gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
  title 'The fat grades data set';
RUN;

The fat grades data set

Obs

idno

l_name

G1

G2

G3

G4

G5

G6

1

10

Smith

78

82

86

69

97

160

2

11

Simon

88

72

86

99

100

170

3

12

Jones

98

92

92

99

99

185

Yikes! This code looks even scarier! Let's dissect it a bit. First, just as in the previous program, the tallgrades data set is processed BY idno. Doing so makes the first.idno and last.idno variables available for us to use. The ARRAY statement defines an array called allgrades and, using a numbered range list, associates the array with six (uninitialized) variables G1, G2, ..., G6. The allgrades array is used to hold the six grades for each student before they are output in their transposed direction to the fatgrades data set. Because the elements of any array, and therefore allgrades, must be assigned using an index variable, this is how the transposition takes place:

  • ("if first.idno then i = 1;") If the input observation contains a student idno that hasn't yet been encountered in the data set, then the index variable i is initialized to 1. If the input observation doesn't contain a new student idno, then do nothing other than advance to the next step.
  • ("allgrades(i) = grade;") The grade from the current observation is assigned to the array allgrades. (For example, if the input observation contains Smith's first grade, then allgrades(1) is assigned the value 78. If the input observation contains Smith's second grade, then allgrades(2) are assigned the value 82. And so on.)
  • ("if last.idno then output;") If the input observation is the last observation in the data set that contains the student idno, then dump the program data vector (which contains allgrades) to the output data set. (For example, if the input observation is Smith's final exam grade, then output the now fat observation containing his six grades). If the input observation is not the last observation in the data set that contains the student idno, do nothing other than advance to the next step.
  • ("i + 1;") Then, increase the index variable i by 1. (For example, if i is 1, change i to 2.)
  • ("retain G1-G6;") Rather than setting G1, G2, ..., G6 to missing at the beginning of the next iteration of the data step, retain their current values. (So, for example, for Smith, allgrades(1) would retain its value of 78, allgrades (2) would retain its value of 82, and so on.)

The program would keep cycling through the above five steps until it encountered the last observation in the data set. Then, the variables i, gtype, and grade would be dropped from the output fatgrades data set.

Now, launch and run  the SAS program. Review the output from the PRINT procedure to convince yourself that we have now successfully used arrays to transpose the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.