Lesson 19: Processing Variables with Arrays

Overview

In this lesson, we'll learn about basic array processing in SAS. In DATA step programming, you often need to perform the same action on more than one variable at a time. Although you can process the variables individually, it is typically easier to handle the variables as a group. Arrays offer you that option. For example, until now, if you wanted to take the square root of the 50 numeric variables in your SAS data set, you'd have to write 50 SAS assignment statements to accomplish the task. Instead, you can use an array to simplify your task.

Arrays can be used to simplify your code when you need to:

  • perform repetitive calculations
  • create many variables that have the same attributes
  • read data
  • transpose "fat" data sets to "tall" data sets, that is, change the variables in a data set to observations
  • transpose "tall" data sets to "fat" data sets, that is, change the observations in a data set to variables
  • compare variables

In this lesson, we'll learn how to accomplish such tasks using arrays. Using arrays in appropriate situations can seriously simplify and shorten your SAS programs!

Objectives

Upon completing this lesson, you should be able to do the following:

  • use an ARRAY statement to define a one-dimensional array
  • use an iterative DO loop to process through a one-dimensional array
  • determine the dimension of a one-dimensional array
  • recall that an array exists only for the duration of the DATA step
  • use a numbered range list as shorthand for a list of variables ending in consecutive numbers
  • use a named range list as shorthand for a list of variables that appear in consecutive order in your data set
  • use the special name lists _ALL_, _NUMERIC_, and _CHARACTER_ as shorthand for a list of variables
  • identify the inner workings of the compile and execution phases of a DATA step that involves an array
  • write an ARRAY statement so that SAS creates new variables rather than use already existing variables
  • use the _TEMPORARY_ array option to tell SAS to create an array with only temporary elements
  • identify the advantages and limitations of using temporary array elements
  • initialize a one-dimensional array
  • use array and BY-group processing to transpose a tall data set into a fat data set, and vice versa
  • use an ARRAY statement to define a two-dimensional array
  • understand how SAS assigns the elements to a two-dimensional array
  • use and reference a two-dimensional array
  • use the DIM function to determine the number of elements in a one-dimensional array dynamically
  • modify the lower and upper bounds of an array dimension
  • use the LBOUND and HBOUND functions to determine the lower and upper bounds of an array dimension dynamically
  • use an * in an ARRAY statement to tell SAS to determine the dimension of a one-dimensional array dynamically

19.1 - One-Dimensional Arrays

A SAS array is a temporary grouping of SAS variables under a single name. For example, suppose you have four variables named winter, spring, summer, and fall. Rather than referring to the variables by their four different names, you could associate the variables with an array name, say seasons, and refer to the variables as seasons(1), seasons(2), seasons(3), and seasons(4). When you pair an array with an iterative DO loop, you create a powerful and efficient way of writing your programs. Let's take a look at an example!

Example 19.1

The following program simply reads the average monthly temperatures (in Celsius) for ten different cities in the United States into a temporary SAS data set called avgcelsius:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA avgcelsius;
    input City $ 1-18 jan feb mar apr may jun
                        jul aug sep oct nov dec;
    DATALINES;
State College, PA  -2 -2  2  8 14 19 21 20 16 10  4 -1
Miami, FL          20 20 22 23 26 27 28 28 27 26 23 20
St. Louis, MO      -1  1  6 13 18 23 26 25 21 15  7  1
New Orleans, LA    11 13 16 20 23 27 27 27 26 21 16 12
Madison, WI        -8 -5  0  7 14 19 22 20 16 10  2 -5
Houston, TX        10 12 16 20 23 27 28 28 26 21 16 12
Phoenix, AZ        12 14 16 21 26 31 33 32 30 23 16 12
Seattle, WA         5  6  7 10 13 16 18 18 16 12  8  6
San Francisco, CA  10 12 12 13 14 15 15 16 17 16 14 11
San Diego, CA      13 14 15 16 17 19 21 22 21 19 16 14
;
RUN;
PROC PRINT data = avgcelsius;
    title 'Average Monthly Temperatures in Celsius';
    id City;
    var jan feb mar apr may jun 
        jul aug sep oct nov dec;
RUN;

Average Monthly Temperatures in Celsius

City                 jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
State College, PA     -2    -2     2     8    14    19    21    20    16    10     4    -1
Miami, FL             20    20    22    23    26    27    28    28    27    26    23    20
St. Louis, MO         -1     1     6    13    18    23    26    25    21    15     7     1
New Orleans, LA       11    13    16    20    23    27    27    27    26    21    16    12
Madison, WI           -8    -5     0     7    14    19    22    20    16    10     2    -5
Houston, TX           10    12    16    20    23    27    28    28    26    21    16    12
Phoenix, AZ           12    14    16    21    26    31    33    32    30    23    16    12
Seattle, WA            5     6     7    10    13    16    18    18    16    12     8     6
San Francisco, CA     10    12    12    13    14    15    15    16    17    16    14    11
San Diego, CA         13    14    15    16    17    19    21    22    21    19    16    14

Launch and run the SAS program so that the data set becomes available to you. Also, review the output from the PRINT procedure to convince yourself that the data were read in properly.

Now, suppose that we don't feel particularly comfortable with understanding Celsius temperatures, and therefore, we want to convert the Celsius temperatures into Fahrenheit temperatures for which we have a better feel. The following SAS program uses the standard conversion formula:

Fahrenheit temperature = 1.8*Celsius temperature + 32

to convert the Celsius temperatures in the avgcelsius data set to Fahrenheit temperatures stored in a new data set called avgfahrenheit:

DATA avgfahrenheit;
    set avgcelsius;
    janf = 1.8*jan + 32;
    febf = 1.8*feb + 32;
    marf = 1.8*mar + 32;
    aprf = 1.8*apr + 32;
    mayf = 1.8*may + 32;
    junf = 1.8*jun + 32;
    julf = 1.8*jul + 32;
    augf = 1.8*aug + 32;
    sepf = 1.8*sep + 32;
    octf = 1.8*oct + 32;
    novf = 1.8*nov + 32;
    decf = 1.8*dec + 32;
    drop jan feb mar apr may jun
            jul aug sep oct nov dec;
RUN;
PROC PRINT data = avgfahrenheit;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var janf febf marf aprf mayf junf 
        julf augf sepf octf novf decf;
RUN;

Average Monthly Temperatures in Fahrenheit

City                janf  febf  marf  aprf  mayf  junf  julf  augf  sepf  octf  novf  decf
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

As you can see by the number of assignment statements necessary to make the conversions, the exercise becomes one of patience. Because there are twelve average monthly temperatures, we must write twelve assignment statements. Each assignment statement performs the same calculation. Only the name of the variable changes in each statement. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were properly converted to Fahrenheit temperatures.

The above program is crying out for the use of an array. One of the primary arguments for using an array is to reduce the number of statements that are required for processing variables. Let's take a look at an example.

Example 19.2

The following program uses a one-dimensional array called fahr to convert the average Celsius temperatures in the avgcelsius data set to average Fahrenheit temperatures stored in a new data set called avgfahrenheit:

DATA avgfahrenheit;
    set avgcelsius;
    array fahr(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    do i = 1 to 12;
            fahr(i) = 1.8*fahr(i) + 32;
    end;
RUN;
PROC PRINT data = avgfahrenheit;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var jan feb mar apr may jun 
        jul aug sep oct nov dec;
RUN;

Average Monthly Temperatures in Fahrenheit

City                 jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

If you compare this program with the previous program, you can see the statements that replaced the twelve assignment statements. The ARRAY statement defines an array called fahr. It tells SAS that you want to group the twelve month variables, jan, feb, ..., dec, into an array called fahr. The (12) that appears in parentheses is a required part of the array declaration. Called the dimension of the array, it tells SAS how many elements, that is, variables, you want to group together. When specifying the variable names to be grouped in the array, we simply list the elements, separating each element with a space. As with all SAS statements, the ARRAY statement is closed with a semicolon (;).

Once we've defined the array fahr, we can use it in our code instead of the individual variable names. We refer to the individual elements of the array using its name and an index, such as, fahr(i). The order in which the variables appear in the ARRAY statement determines the variable's position in the array. For example, fahr(1) corresponds to the jan variable, fahr(2) corresponds to the feb variable, and fahr(12) corresponds to the dec variable. It's when you use an array like fahr, in conjunction with an iterative DO loop, that you can really simplify your code, as we did in this program.
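One way to check which variable each array element refers to is to write the mapping to the log. The following is only a sketch: it assumes the avgcelsius data set from Example 19.1 exists, and the DATA _NULL_ step, the obs = 1 data set option, the VNAME function, and the variable name varname are additions for illustration that do not appear in the original program:

```sas
DATA _NULL_;                          /* no output data set is needed */
    set avgcelsius (obs = 1);         /* one observation is enough    */
    array fahr(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    length varname $ 32;              /* hypothetical helper variable */
    do i = 1 to 12;
        varname = vname(fahr(i));     /* name of the i-th element     */
        put i= varname=;              /* written to the SAS log       */
    end;
RUN;
```

The log should pair each index with a month variable, in the order in which the variables were listed in the ARRAY statement.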

The DO loop tells SAS to process through the elements of the fahr array, each time converting the Celsius temperature to a Fahrenheit temperature. For example, when the index variable i is 1, the assignment statement becomes:

fahr(1) = 1.8*fahr(1) + 32;

which you could think of as saying:

jan = 1.8*jan + 32;

The value of jan on the right side of the equal sign is the Celsius temperature. After the assignment statement is executed, the value of jan on the left side of the equal sign is updated to reflect the Fahrenheit temperature.

Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures. Oh, one more thing to point out! Note that the variables listed in the PRINT procedure's VAR statement are the original variable names jan, feb, ..., dec, not the variables as they were grouped into an array, fahr(1), fahr(2), ..., fahr(12). That's because an array exists only for the duration of the DATA step. If, in the PRINT procedure, you instead tell SAS to print fahr(1), fahr(2), ..., you'll see that SAS will hiccup and report an error. Let's summarize!

Defining an Array

You must use an ARRAY statement having the following general form in order to group previously defined data set variables into an array:

ARRAY array-name(dimension) <elements>;

where:

  • array-name must be a valid SAS name that specifies the name of the array
  • dimension describes the number and arrangement of array elements. The default dimension is one.
  • elements lists the variables to be grouped together to form the array. The array elements must be either all numeric or all character. Using standard SAS Help notation, the term elements appears in angle brackets (<>) to indicate that it is optional. That is, you do not have to specify elements in the ARRAY statement. If no elements are listed, new variables are created with default names.

A few more points must be made about the array-name. Unless you are interested in confusing SAS, you should not give an array the same name as a variable that appears in the same DATA step. You should also avoid giving an array the same name as a valid SAS function. SAS allows you to do so, but then you won't be able to use the function in the same DATA step. For example, if you named an array mean in a DATA step, you would not be able to use the mean function in the DATA step. SAS will print a warning message in your log window to let you know such. Finally, array names cannot be used in LABEL, FORMAT, DROP, KEEP, or LENGTH statements.
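Because an array name cannot appear in a DROP or KEEP statement, you refer to the underlying variables by their own names there. The sketch below assumes the avgcelsius data set from Example 19.1; the output data set name avgfahrenheitclean is hypothetical, and the DROP statement is an addition that the original programs do not use:

```sas
DATA avgfahrenheitclean;              /* hypothetical data set name */
    set avgcelsius;
    array fahr(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    do i = 1 to 12;
        fahr(i) = 1.8*fahr(i) + 32;
    end;
    /* Name the variable itself; "drop fahr;" would not refer to the array */
    drop i;
RUN;
```

Dropping the index variable i this way also keeps the loop counter out of the output data set.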

The three examples that remain in this section pertain to alternative ways of defining the array. The first pertains to an alternative way of defining the dimension of the array. The second and third pertain to alternative ways of defining the variables to be grouped in the array.

Example 19.3

The following program is identical to the program in the previous example, except the 12 in the ARRAY statement has been changed to an asterisk (*):

DATA avgfahrenheittwo;
    set avgcelsius;
    array fahr(*) jan feb mar apr may jun
                  jul aug sep oct nov dec;
    do i = 1 to 12;
            fahr(i) = 1.8*fahr(i) + 32;
    end;
RUN;

PROC PRINT data = avgfahrenheittwo;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var jan feb mar apr may jun 
        jul aug sep oct nov dec;
RUN;

Average Monthly Temperatures in Fahrenheit

City                 jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

Simple enough! Rather than having to tell SAS how many variables you are grouping in an array, you can let SAS do the dirty work of counting the number of elements you include in your variable list. To do so, you simply define the dimension using an asterisk (*). You might find this strategy particularly helpful if you are grouping so many variables together into an array that you don't want to spend the time counting them. Incidentally, throughout this lesson, we enclose the array's dimension (or index variable) in parentheses ( ). We could alternatively use braces { } or brackets [ ].

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures.
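When the dimension is given as an asterisk, the stop value of the DO loop does not have to be typed by hand either. As a sketch, assuming the avgcelsius data set from Example 19.1, the DIM function returns the number of elements in the array; the data set name avgfahrenheitdim is hypothetical, and braces are used here only to show the alternative to parentheses:

```sas
DATA avgfahrenheitdim;                /* hypothetical data set name */
    set avgcelsius;
    array fahr{*} jan feb mar apr may jun
                  jul aug sep oct nov dec;
    /* dim(fahr) is the number of elements SAS counted for fahr */
    do i = 1 to dim(fahr);
        fahr{i} = 1.8*fahr{i} + 32;
    end;
RUN;
```

With this form, adding or removing a variable in the ARRAY statement does not require a matching change to the DO statement.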

Example 19.4

The following program re-reads the average monthly temperatures of the ten cities into numbered variables m1, m2, ..., m12, and then uses a numbered range list m1-m12 as a shortcut in specifying the elements of the fahr array in the ARRAY statement:

DATA avgtempsF;
    input City $ 1-18 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12;
    array fahr(*) m1-m12;
    do i = 1 to 12;
            fahr(i) = 1.8*fahr(i) + 32;
    end;
    DATALINES;
State College, PA  -2 -2  2  8 14 19 21 20 16 10  4 -1
Miami, FL          20 20 22 23 26 27 28 28 27 26 23 20
St. Louis, MO      -1  1  6 13 18 23 26 25 21 15  7  1
New Orleans, LA    11 13 16 20 23 27 27 27 26 21 16 12
Madison, WI        -8 -5  0  7 14 19 22 20 16 10  2 -5
Houston, TX        10 12 16 20 23 27 28 28 26 21 16 12
Phoenix, AZ        12 14 16 21 26 31 33 32 30 23 16 12
Seattle, WA         5  6  7 10 13 16 18 18 16 12  8  6
San Francisco, CA  10 12 12 13 14 15 15 16 17 16 14 11
San Diego, CA      13 14 15 16 17 19 21 22 21 19 16 14
;
RUN;
PROC PRINT data = avgtempsF;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var m1-m12;
RUN;

Average Monthly Temperatures in Fahrenheit

City                  m1    m2    m3    m4    m5    m6    m7    m8    m9   m10   m11   m12
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

When specifying a numbered range of variables:

  • the variables must have the same name except for the last character or characters
  • the last character of each variable must be numeric
  • the variables must be numbered consecutively

As you can see, the variables m1, m2, ..., m12 in our program meet each of these conditions. That's why we can use the shortcut m1-m12 when we define our array fahr in the ARRAY statement.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures.

The above program used a numbered range list to shorten the list of variable names grouped into the fahr array. In some cases, you could also consider using the special name lists _ALL_, _CHARACTER_ and _NUMERIC_:

  • Use _ALL_ when you want SAS to use all of the variables in your SAS data set. Because the elements of an array must be of the same type, this works only when the variables are all numeric or all character.
  • Use _CHARACTER_ when you want SAS to use all of the character variables in your data set.
  • Use _NUMERIC_ when you want SAS to use all of the numeric variables in your data set.
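As a sketch of the _CHARACTER_ list, the step below groups every character variable in avgcelsius (only City, in this data set) into an array and converts the values to uppercase. It assumes the avgcelsius data set from Example 19.1; the data set name avgcelsiusupper, the array name chars, and the use of the UPCASE and DIM functions are choices made for this illustration:

```sas
DATA avgcelsiusupper;                 /* hypothetical data set name */
    set avgcelsius;
    /* All character variables defined so far in the DATA step */
    array chars(*) _CHARACTER_;
    do i = 1 to dim(chars);           /* dim() counts the elements */
        chars(i) = upcase(chars(i));
    end;
    drop i;
RUN;
```

The same pattern applies to _NUMERIC_, with the array then holding the numeric variables instead.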

The following program illustrates the use of the _NUMERIC_ special list.

Example 19.5

The following program re-reads the average monthly temperatures of the ten cities into month variables jan, feb, ..., dec, and then uses the special _NUMERIC_ list as a shortcut in specifying the elements of the fahr array in the ARRAY statement:

DATA avgtempsFtwo;
    input City $ 1-18 jan feb mar apr may jun 
                      jul aug sep oct nov dec;
    array fahr(*) _NUMERIC_;
    do i = 1 to 12;
            fahr(i) = 1.8*fahr(i) + 32;
    end;
    DATALINES;
State College, PA  -2 -2  2  8 14 19 21 20 16 10  4 -1
Miami, FL          20 20 22 23 26 27 28 28 27 26 23 20
St. Louis, MO      -1  1  6 13 18 23 26 25 21 15  7  1
New Orleans, LA    11 13 16 20 23 27 27 27 26 21 16 12
Madison, WI        -8 -5  0  7 14 19 22 20 16 10  2 -5
Houston, TX        10 12 16 20 23 27 28 28 26 21 16 12
Phoenix, AZ        12 14 16 21 26 31 33 32 30 23 16 12
Seattle, WA         5  6  7 10 13 16 18 18 16 12  8  6
San Francisco, CA  10 12 12 13 14 15 15 16 17 16 14 11
San Diego, CA      13 14 15 16 17 19 21 22 21 19 16 14
;
RUN;
PROC PRINT data = avgtempsFtwo;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var jan--dec;
RUN;

Average Monthly Temperatures in Fahrenheit

City                 jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

First, note that the only numeric variables in the data set when the ARRAY statement is compiled are the twelve average monthly temperatures. (The index variable i is also numeric, but it is not defined until after the ARRAY statement, so it is not part of the array.) For that reason, we can define the array fahr using the special list _NUMERIC_. The remainder of the program is identical in functionality to the previous programs in this section.

Oh, you might want to note one more shortcut that was taken in the PRINT procedure, that is, the name range list, jan--dec, used in the VAR statement. This tells SAS to print all of the variables that appear in the avgtempsFtwo data set between the jan variable and the dec variable, as determined by their position in the data set. This shortcut can also be used when defining an array. In order to specify a name range list, though, you have to know the internal order, or position, of the variables in the SAS data set. If you are not sure of the internal order of your data set, you can find out using the CONTENTS procedure with the POSITION option.

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures.
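The name range list can be sketched in an ARRAY statement as follows. The example assumes the avgcelsius data set from Example 19.1, in which jan through dec were read in consecutive order; the data set name avgfahrenheitrange is hypothetical, and the CONTENTS step uses the POSITION option mentioned above to confirm the internal order first:

```sas
/* List the variables of avgcelsius in their internal order */
PROC CONTENTS data = avgcelsius position;
RUN;

DATA avgfahrenheitrange;              /* hypothetical data set name */
    set avgcelsius;
    /* jan--dec: every variable positioned from jan through dec */
    array fahr(*) jan--dec;
    do i = 1 to 12;
        fahr(i) = 1.8*fahr(i) + 32;
    end;
RUN;
```

If a character variable happened to sit between jan and dec, the ARRAY statement would fail, because all elements of an array must be of the same type.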


19.2 - Processing of Arrays

It is helpful, as is often the case when learning a new SAS tool, to take a look at the inner workings of the compile and execution phases of a DATA step that involves an array. That's what we'll do in this section. Specifically, we'll revisit Example 19.2, but this time our focus will be on how SAS processes the DATA step.

Example 19.6

The following program is identical to the program in Example 19.2. That is, the program uses a one-dimensional array called fahr to convert the average Celsius temperatures in the avgcelsius data set to average Fahrenheit temperatures stored in a new data set called avgfahrenheit:

DATA avgfahrenheit;
    set avgcelsius;
    array fahr(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    do i = 1 to 12;
            fahr(i) = 1.8*fahr(i) + 32;
    end;
RUN;
PROC PRINT data = avgfahrenheit;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var jan feb mar apr may jun 
        jul aug sep oct nov dec;
RUN;

Average Monthly Temperatures in Fahrenheit

City                 jan   feb   mar   apr   may   jun   jul   aug   sep   oct   nov   dec
State College, PA   28.4  28.4  35.6  46.4  57.2  66.2  69.8  68.0  60.8  50.0  39.2  30.2
Miami, FL           68.0  68.0  71.6  73.4  78.8  80.6  82.4  82.4  80.6  78.8  73.4  68.0
St. Louis, MO       30.2  33.8  42.8  55.4  64.4  73.4  78.8  77.0  69.8  59.0  44.6  33.8
New Orleans, LA     51.8  55.4  60.8  68.0  73.4  80.6  80.6  80.6  78.8  69.8  60.8  53.6
Madison, WI         17.6  23.0  32.0  44.6  57.2  66.2  71.6  68.0  60.8  50.0  35.6  23.0
Houston, TX         50.0  53.6  60.8  68.0  73.4  80.6  82.4  82.4  78.8  69.8  60.8  53.6
Phoenix, AZ         53.6  57.2  60.8  69.8  78.8  87.8  91.4  89.6  86.0  73.4  60.8  53.6
Seattle, WA         41.0  42.8  44.6  50.0  55.4  60.8  64.4  64.4  60.8  53.6  46.4  42.8
San Francisco, CA   50.0  53.6  53.6  55.4  57.2  59.0  59.0  60.8  62.6  60.8  57.2  51.8
San Diego, CA       55.4  57.2  59.0  60.8  62.6  66.2  69.8  71.6  69.8  66.2  60.8  57.2

As always, at the end of the compile phase, SAS will have created a program data vector containing the automatic variables (_N_ and _ERROR_), the variables from the input data set avgcelsius (that is, City, jan, feb, ..., dec), and any newly created variables in the DATA step (the DO loop's index variable i). At the end of the compile phase, this is what (an abbreviated version of) the program data vector looks like:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0                           .     .     .     ...  .     .     .

Note that the array name and array references are not included in the program data vector, as they exist only for the duration of the DATA step. During the first iteration of the DATA step, the first observation in the avgcelsius data set is read into the program data vector:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  -2    -2    2     ...  4     -1    .

Because the ARRAY statement is a compile-time-only statement, it is ignored during execution. The DO loop is executed next. During the first iteration of the DO loop, the index variable i is set to 1. As a result, the array reference fahr(i) becomes fahr(1). Because fahr(1) refers to the first array element, jan, the value of jan in the program data vector is converted from Celsius to Fahrenheit:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  -2    2     ...  4     -1    1

During the second iteration of the DO loop, the index variable i is set to 2. As a result, the array reference fahr(i) becomes fahr(2). Because fahr(2) refers to the second array element, feb, the value of feb in the program data vector is converted from Celsius to Fahrenheit:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  28.4  2     ...  4     -1    2

During the third iteration of the DO loop, the index variable i is set to 3. As a result, the array reference fahr(i) becomes fahr(3). Because fahr(3) refers to the third array element, mar, the value of mar in the program data vector is converted from Celsius to Fahrenheit:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  28.4  35.6  ...  4     -1    3

SAS continues to process through the DO loop. During the eleventh iteration of the DO loop, the index variable i is set to 11. As a result, the array reference fahr(i) becomes fahr(11). Because fahr(11) refers to the eleventh array element, nov, the value of nov in the program data vector is converted from Celsius to Fahrenheit:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  28.4  35.6  ...  39.2  -1    11

And, during the twelfth iteration of the DO loop, the index variable i is set to 12. As a result, the array reference fahr(i) becomes fahr(12). Because fahr(12) refers to the twelfth array element, dec, the value of dec in the program data vector is converted from Celsius to Fahrenheit:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  28.4  35.6  ...  39.2  30.2  12

SAS then increases the value of the index variable i to 13:

_N_  _ERROR_  City               jan   feb   mar   ...  nov   dec   i
1    0        State College, PA  28.4  28.4  35.6  ...  39.2  30.2  13

and steps out of the DO loop because its stop value is 12. Having arrived at the end of the DATA step, SAS writes the contents of the program data vector as the first observation in the output data set avgfahrenheit. SAS returns to the top of the DATA step and begins the process all over again for the second observation in the avgcelsius data set. The process proceeds as described until SAS runs out of observations to process in the avgcelsius data set.
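The walk-through above can be traced in the SAS log. The following sketch assumes the avgcelsius data set from Example 19.1; the data set name avgfahrenheittrace and the PUT statements are additions for illustration only, and the output is limited to the first observation to keep the log short:

```sas
DATA avgfahrenheittrace;              /* hypothetical data set name */
    set avgcelsius;
    array fahr(12) jan feb mar apr may jun
                   jul aug sep oct nov dec;
    do i = 1 to 12;
        fahr(i) = 1.8*fahr(i) + 32;
        /* For the first observation, show part of the program data
           vector after each pass through the loop */
        if _N_ = 1 then put _N_= i= jan= feb= mar= nov= dec=;
    end;
    /* After the loop ends, i has been incremented past the stop value */
    if _N_ = 1 then put 'After the loop: ' i=;
RUN;
```

Each line written inside the loop corresponds to one of the program data vector snapshots described above, and the final line shows the value of i once SAS has stepped out of the loop.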


19.3 - Creating Variables in an Array Statement

So far, we have learned several ways to group existing variables into an array. We can also create new variables in an ARRAY statement by omitting the array elements from the statement. When our ARRAY statement fails to reference existing variables, SAS automatically creates new variables for us and assigns default names to them.

Example 19.7

The following program again converts the average monthly Celsius temperatures in ten cities to average monthly Fahrenheit temperatures. To do so, the already existing Celsius temperatures, jan, feb, ..., and dec, are grouped into an array called celsius, and the resulting Fahrenheit temperatures are stored in new variables janf, febf, ..., decf, which are grouped into an array called fahr:

DATA avgtemps;
    set avgcelsius;
    array celsius(12) jan feb mar apr may jun 
                        jul aug sep oct nov dec;
    array fahr(12) janf febf marf aprf mayf junf
                    julf augf sepf octf novf decf;
    do i = 1 to 12;
            fahr(i) = 1.8*celsius(i) + 32;
    end;
RUN;
PROC PRINT data = avgtemps;
    title 'Average Monthly Temperatures';
    id City;
    var jan janf feb febf mar marf;
    var apr aprf may mayf jun junf;
    var jul julf aug augf sep sepf;
    var oct octf nov novf dec decf;
RUN;

Average Monthly Temperatures

City                 jan  janf   feb  febf   mar  marf   apr  aprf   may  mayf   jun  junf   jul  julf   aug  augf   sep  sepf   oct  octf   nov  novf   dec  decf
State College, PA     -2  28.4    -2  28.4     2  35.6     8  46.4    14  57.2    19  66.2    21  69.8    20  68.0    16  60.8    10  50.0     4  39.2    -1  30.2
Miami, FL             20  68.0    20  68.0    22  71.6    23  73.4    26  78.8    27  80.6    28  82.4    28  82.4    27  80.6    26  78.8    23  73.4    20  68.0
St. Louis, MO         -1  30.2     1  33.8     6  42.8    13  55.4    18  64.4    23  73.4    26  78.8    25  77.0    21  69.8    15  59.0     7  44.6     1  33.8
New Orleans, LA       11  51.8    13  55.4    16  60.8    20  68.0    23  73.4    27  80.6    27  80.6    27  80.6    26  78.8    21  69.8    16  60.8    12  53.6
Madison, WI           -8  17.6    -5  23.0     0  32.0     7  44.6    14  57.2    19  66.2    22  71.6    20  68.0    16  60.8    10  50.0     2  35.6    -5  23.0
Houston, TX           10  50.0    12  53.6    16  60.8    20  68.0    23  73.4    27  80.6    28  82.4    28  82.4    26  78.8    21  69.8    16  60.8    12  53.6
Phoenix, AZ           12  53.6    14  57.2    16  60.8    21  69.8    26  78.8    31  87.8    33  91.4    32  89.6    30  86.0    23  73.4    16  60.8    12  53.6
Seattle, WA            5  41.0     6  42.8     7  44.6    10  50.0    13  55.4    16  60.8    18  64.4    18  64.4    16  60.8    12  53.6     8  46.4     6  42.8
San Francisco, CA     10  50.0    12  53.6    12  53.6    13  55.4    14  57.2    15  59.0    15  59.0    16  60.8    17  62.6    16  60.8    14  57.2    11  51.8
San Diego, CA         13  55.4    14  57.2    15  59.0    16  60.8    17  62.6    19  66.2    21  69.8    22  71.6    21  69.8    19  66.2    16  60.8    14  57.2

The DATA step should look eerily similar to that of Example 19.6. The only difference is that, rather than overwriting the Celsius temperatures, the program preserves them by storing the calculated Fahrenheit temperatures in new variables called janf, febf, ..., and decf. The first ARRAY statement tells SAS to group the jan, feb, ..., dec variables in the avgcelsius data set into a one-dimensional array called celsius. The second ARRAY statement tells SAS to create twelve new variables called janf, febf, ..., and decf and to group them into an array called fahr. The DO loop processes through the twelve elements of the celsius array, converts the Celsius temperatures to Fahrenheit temperatures, and stores the results in the fahr array. The PRINT procedure then tells SAS to print the twelve Celsius temperatures and twelve Fahrenheit temperatures side-by-side. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were properly converted to Fahrenheit temperatures.

Example 19.8

The following program is identical to the previous program, except this time rather than naming the new variables grouped into the fahr array, we let SAS do the naming for us:

DATA avgtempsinF;
    set avgcelsius;
    array celsius(12) jan feb mar apr may jun 
                      jul aug sep oct nov dec;
    array fahr(12);
    do i = 1 to 12;
        fahr(i) = 1.8*celsius(i) + 32;
    end;
RUN;
PROC PRINT data = avgtempsinF;
    title 'Average Monthly Temperatures in Fahrenheit';
    id City;
    var fahr1-fahr12;
RUN;

Average Monthly Temperatures in Fahrenheit

City               fahr1  fahr2  fahr3  fahr4  fahr5  fahr6  fahr7  fahr8  fahr9  fahr10  fahr11  fahr12
State College, PA   28.4   28.4   35.6   46.4   57.2   66.2   69.8   68.0   60.8    50.0    39.2    30.2
Miami, FL           68.0   68.0   71.6   73.4   78.8   80.6   82.4   82.4   80.6    78.8    73.4    68.0
St. Louis, MO       30.2   33.8   42.8   55.4   64.4   73.4   78.8   77.0   69.8    59.0    44.6    33.8
New Orleans, LA     51.8   55.4   60.8   68.0   73.4   80.6   80.6   80.6   78.8    69.8    60.8    53.6
Madison, WI         17.6   23.0   32.0   44.6   57.2   66.2   71.6   68.0   60.8    50.0    35.6    23.0
Houston, TX         50.0   53.6   60.8   68.0   73.4   80.6   82.4   82.4   78.8    69.8    60.8    53.6
Phoenix, AZ         53.6   57.2   60.8   69.8   78.8   87.8   91.4   89.6   86.0    73.4    60.8    53.6
Seattle, WA         41.0   42.8   44.6   50.0   55.4   60.8   64.4   64.4   60.8    53.6    46.4    42.8
San Francisco, CA   50.0   53.6   53.6   55.4   57.2   59.0   59.0   60.8   62.6    60.8    57.2    51.8
San Diego, CA       55.4   57.2   59.0   60.8   62.6   66.2   69.8   71.6   69.8    66.2    60.8    57.2

Note that when we define the fahr array in the second ARRAY statement, we specify how many elements the fahr array should contain (12), but we do not specify any variables to group into the array. That tells SAS two things: i) we want to create twelve new variables, and ii) we want to leave the naming of the variables to SAS. In this situation, SAS creates default names by concatenating the array name and the numbers 1, 2, 3, and so on, up to the array dimension. Here, for example, SAS creates the names fahr1, fahr2, fahr3, ..., up to fahr12. That's why we refer to the Fahrenheit temperatures as fahr1 to fahr12 in the PRINT procedure's VAR statement. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the Celsius temperatures were again properly converted to Fahrenheit temperatures.
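If you would like to see the default names for yourself, a minimal sketch along the following lines might help. It is not one of the lesson's examples: the three-element size and the variable called name are assumptions made purely for illustration. The sketch uses the VNAME function to write the name of the variable behind each array element to the log:

```sas
DATA _null_;
    array fahr(3);                  /* no variable list, so SAS supplies the names */
    length name $ 32;               /* hypothetical variable that holds each name  */
    do i = 1 to dim(fahr);
        name = vname(fahr(i));      /* name of the variable behind element i */
        put i= name=;
    end;
RUN;
```

If the sketch behaves as expected, the log should list the names fahr1, fahr2, and fahr3, that is, the array name followed by the element number.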


19.4 - Temporary Array Elements

When elements of an array are constants needed only for the duration of the DATA step, you can omit the variables associated with an array group and instead use temporary array elements. Although they behave like variables, temporary array elements:

  • do not appear in the resulting data set;
  • do not have names and can be referenced only by the array name and a subscript; and
  • are automatically retained, rather than being reset to missing at the beginning of the next iteration of the DATA step.
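To make the last two points concrete, here is a small sketch (not part of the lesson's examples) that keeps a running total in a one-element temporary array. The data set name running, the array name total, and the variable runsum are all hypothetical:

```sas
DATA running;
    input x @@;
    array total(1) _TEMPORARY_ (0);   /* set to 0 once, before the first iteration  */
    total(1) = total(1) + x;          /* still holds the sum from earlier iterations */
    runsum = total(1);                /* copy to a real variable so it can be printed */
    DATALINES;
1 2 3 4
;
RUN;
PROC PRINT data = running;
RUN;
```

Because the temporary element is retained, runsum should take the values 1, 3, 6, and 10. And because the element has no variable name, the running data set should contain only x and runsum.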

In this section, we'll look at three examples that involve checking a subset of Quality of Life data for errors. In the first example, we'll check whether the data recorded in ten variables (qul3a, qul3b, ..., and qul3j) are within an expected range without using an array. In the second example, we'll perform the same check using an array that corresponds to three new variables error1, error2, and error3. In the third and final example, we'll perform the same check using an array containing only temporary elements.

Example 19.9

The following program first reads a subset of Quality of Life data (variables qul3a, qul3b, ..., and qul3j) into a SAS data set called qul. Then, the program checks to make sure that the values for each variable have been recorded as either a 1, 2, or 3 as would be expected from the data form. If a value for one of the variables does not equal 1, 2, or 3, then that observation is output to a data set called errors. Otherwise, the observation is output to the qul data set. Because the error checking takes place without using arrays, the program contains a series of ten if/then statements, corresponding to each of the ten concerned variables:

DATA qul errors;
    input subj qul3a qul3b qul3c qul3d qul3e 
               qul3f qul3g qul3h qul3i qul3j;
    flag = 0;
    if qul3a not in (1, 2, 3) then flag = 1;
    if qul3b not in (1, 2, 3) then flag = 1;
    if qul3c not in (1, 2, 3) then flag = 1;
    if qul3d not in (1, 2, 3) then flag = 1;
    if qul3e not in (1, 2, 3) then flag = 1;
    if qul3f not in (1, 2, 3) then flag = 1;
    if qul3g not in (1, 2, 3) then flag = 1;
    if qul3h not in (1, 2, 3) then flag = 1;
    if qul3i not in (1, 2, 3) then flag = 1;
    if qul3j not in (1, 2, 3) then flag = 1;
    if flag = 1 then output errors;
                else output qul;
    drop flag;
    DATALINES;
    110011 1 2 3 3 3 3 2 1 1 3
    210012 2 3 4 1 2 2 3 3 1 1
    211011 1 2 3 2 1 2 3 2 1 3
    310017 1 2 3 3 3 3 3 2 2 1
    411020 4 3 3 3 3 2 2 2 2 2
    510001 1 1 1 1 1 1 2 1 2 2
    ;
RUN;
PROC PRINT data = qul;
    TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
    TITLE 'Observations in Qul data set with errors';
RUN;

Observations in Qul data set with no errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j
  1  110011      1      2      3      3      3      3      2      1      1      3
  2  211011      1      2      3      2      1      2      3      2      1      3
  3  310017      1      2      3      3      3      3      3      2      2      1
  4  510001      1      1      1      1      1      1      2      1      2      2

Observations in Qul data set with errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j
  1  210012      2      3      4      1      2      2      3      3      1      1
  2  411020      4      3      3      3      3      2      2      2      2      2

The INPUT statement first reads an observation of data containing one subject's quality of life data. An observation is assumed to be error-free (flag is initially set to 0) until it is found to be in error (flag is set to 1 if any of the ten values are out of range). If an observation is deemed to contain an error (flag = 1) after looking at each of the ten values, it is output to the errors data set. Otherwise (flag = 0), it is output to the qul data set.

First, note that two of the observations in the input data set contain data recording errors. The qul3c value for subject 210012 was recorded as 4, as was the qul3a value for subject 411020. Then, launch and run the SAS program. Review the output to convince yourself that qul contains the four observations with clean data, and errors contains the two observations with bad data.

You should also appreciate that this is a classic situation that cries out for using arrays. If you aren't yet convinced, imagine how long the above program would be if you had to write similar if/then statements to check for errors in, say, a hundred such variables.

Example 19.10

The following program performs the same error checking as the previous program, except here the error checking is accomplished using two arrays, bounds and quldata:

DATA qul errors;
    input subj qul3a qul3b qul3c qul3d qul3e 
            qul3f qul3g qul3h qul3i qul3j;
    array bounds (3) error1 - error3 (1 2 3);
    array quldata (10) qul3a -- qul3j;
    flag = 0;
    do i = 1 to 10;
        if quldata(i) ne bounds(1) and
            quldata(i) ne bounds(2) and
            quldata(i) ne bounds(3)
        then flag = 1;
    end;
    if flag = 1 then output errors;
                else output qul;
    drop i flag;
    DATALINES;
    110011 1 2 3 3 3 3 2 1 1 3
    210012 2 3 4 1 2 2 3 3 1 1
    211011 1 2 3 2 1 2 3 2 1 3
    310017 1 2 3 3 3 3 3 2 2 1
    411020 4 3 3 3 3 2 2 2 2 2
    510001 1 1 1 1 1 1 2 1 2 2
    ;
RUN;
PROC PRINT data = qul;
    TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
    TITLE 'Observations in Qul data set with errors';
RUN;

Observations in Qul data set with no errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j  error1  error2  error3
  1  110011      1      2      3      3      3      3      2      1      1      3       1       2       3
  2  211011      1      2      3      2      1      2      3      2      1      3       1       2       3
  3  310017      1      2      3      3      3      3      3      2      2      1       1       2       3
  4  510001      1      1      1      1      1      1      2      1      2      2       1       2       3

Observations in Qul data set with errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j  error1  error2  error3
  1  210012      2      3      4      1      2      2      3      3      1      1       1       2       3
  2  411020      4      3      3      3      3      2      2      2      2      2       1       2       3

If you compare this program to the previous program, you'll see that the only differences here are the presence of two ARRAY definition statements and the IF/THEN statement within the iterative DO loop that does the error checking.

The first ARRAY statement uses a numbered range list to define an array called bounds that contains three new variables: error1, error2, and error3. The "(1 2 3)" that appears after the variable list error1-error3 tells SAS to set, or initialize, the elements of the array to equal 1, 2, and 3. In general, you initialize an array in this manner, namely by listing as many values as there are elements in the array and separating each pair of values with a space. If you intend for your array to contain character constants, you must put the values in quotation marks. For example, the following ARRAY statement tells SAS to define a character array (hence the dollar sign $) called weekdays:

ARRAY weekdays(5) $ ('M' 'T' 'W' 'R' 'F');

and to initialize the elements of the array as M, T, W, R, and F.
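As a quick check of this initialization, the following sketch (an illustration only, with the hypothetical variable day) loops over the weekdays array and writes each initial value to the log:

```sas
DATA _null_;
    array weekdays(5) $ ('M' 'T' 'W' 'R' 'F');
    length day $ 8;                  /* hypothetical variable for the current element */
    do i = 1 to dim(weekdays);
        day = weekdays(i);           /* copy element i into an ordinary variable */
        put i= day=;
    end;
RUN;
```

Because no variable names are listed, SAS should create the character variables weekdays1 through weekdays5 behind the scenes, initialized to M, T, W, R, and F, respectively.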

The second ARRAY statement uses a named range list to define an array called quldata that contains the ten quality of life variables. Rather than using the IN operator as the previous program did, the IF/THEN statement compares each element of the quldata array to each of the three elements of the bounds array. A value that matches none of the three bounds is out of range, and so flag is set to 1.
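As an aside, the three comparisons joined by AND could be written more compactly. The sketch below is one possible alternative, not the lesson's method: it uses the WHICHN function, which returns the position of the first value in a list that equals its first argument, or 0 if there is no match. The variables q1 through q4 and their initial values are hypothetical test data:

```sas
DATA _null_;
    array bounds(3) _TEMPORARY_ (1 2 3);
    array quldata(4) q1-q4 (1 4 3 2);     /* hypothetical test values; 4 is out of range */
    flag = 0;
    do i = 1 to dim(quldata);
        /* WHICHN returns 0 when quldata(i) equals none of the bounds */
        if whichn(quldata(i), of bounds(*)) = 0 then flag = 1;
    end;
    put flag=;
RUN;
```

Here flag should end up equal to 1, because the second test value (4) is not among the valid values. A side benefit of this form is that the comparison does not need to change if more valid values are added to the bounds array.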

Now, launch and run the SAS program. Review the output to convince yourself that, just as before, qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the three new variables error1, error2, and error3 are present in both data sets.

Example 19.11

The valid values 1, 2, and 3 are needed only temporarily in the previous program. Therefore, we alternatively could have used temporary array elements in defining the bounds array. The following program does just that. It is identical to the previous program except here the bounds array is defined using temporary array elements rather than using three new variables error1, error2, and error3:

DATA qul errors;
    input subj qul3a qul3b qul3c qul3d qul3e 
            qul3f qul3g qul3h qul3i qul3j;
    array bounds (3) _TEMPORARY_ (1 2 3);
    array quldata (10) qul3a -- qul3j;
    flag = 0;
    do i = 1 to 10;
        if quldata(i) ne bounds(1) and
            quldata(i) ne bounds(2) and
            quldata(i) ne bounds(3)
        then flag = 1;
    end;
    if flag = 1 then output errors;
                else output qul;
    drop i flag;
    DATALINES;
    110011 1 2 3 3 3 3 2 1 1 3
    210012 2 3 4 1 2 2 3 3 1 1
    211011 1 2 3 2 1 2 3 2 1 3
    310017 1 2 3 3 3 3 3 2 2 1
    411020 4 3 3 3 3 2 2 2 2 2
    510001 1 1 1 1 1 1 2 1 2 2
    ;
RUN; 
PROC PRINT data = qul;
    TITLE 'Observations in Qul data set with no errors';
RUN;
PROC PRINT data = errors;
    TITLE 'Observations in Qul data set with errors';
RUN;

Observations in Qul data set with no errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j
  1  110011      1      2      3      3      3      3      2      1      1      3
  2  211011      1      2      3      2      1      2      3      2      1      3
  3  310017      1      2      3      3      3      3      3      2      2      1
  4  510001      1      1      1      1      1      1      2      1      2      2

Observations in Qul data set with errors

Obs    subj  qul3a  qul3b  qul3c  qul3d  qul3e  qul3f  qul3g  qul3h  qul3i  qul3j
  1  210012      2      3      4      1      2      2      3      3      1      1
  2  411020      4      3      3      3      3      2      2      2      2      2

If you compare this program to the previous program, you'll see that the only difference here is the presence of the _TEMPORARY_ argument in the definition of the bounds array. The bounds array is again initialized to the three valid values "(1 2 3)".

Launch and run the SAS program. Review the output to convince yourself that, just as before, qul contains the four observations with clean data, and errors contains the two observations with bad data. Also, note that the temporary array elements do not appear in either data set.


19.5 - Array Bounds

Each of the arrays we've considered thus far has been defined, by default, to have a lower bound of 1 and an upper bound that equals the number of elements in the array. For example, the array pennstate:

ARRAY pennstate(4) nittany lions happy valley;

has a lower bound of 1 and an upper bound of 4. In this section, we'll look at three examples that concern the bounds of an array. In the first example, we'll use the DIM function to change the upper bound of a DO loop's index variable dynamically (rather than stating it in advance). In the second example, we'll define the lower and upper bounds of a one-dimensional array to create a bounded array. In the third example, we'll use the LBOUND and HBOUND functions to change the lower and upper bounds of a DO loop's index variable dynamically.

Example 19.12

The following program reads the yes/no responses of five subjects to six survey questions (q1, q2, ..., q6) into a temporary SAS data set called survey. A yes response is coded and entered as a 2, while a no response is coded and entered as a 1. Just four of the variables (q3, q4, q5, and q6) are grouped into a one-dimensional array called qxs. Then, a DO loop, in conjunction with the DIM function, is used to recode the responses to the four variables so that a 2 is changed to a 1, and a 1 is changed to a 0:

DATA survey (DROP = i);
	INPUT subj q1 q2 q3 q4 q5 q6;
	ARRAY qxs(4) q3-q6;
	DO i = 1 to dim(qxs);
		qxs(i) = qxs(i) - 1;
	END;
	DATALINES;
	1001 1 2 1 2 1 1
	1002 2 1 2 2 2 1
	1003 2 2 2 1 . 2
	1004 1 . 1 1 1 2
	1005 2 1 2 2 2 1
	;
RUN;
 
PROC PRINT data = survey;
	TITLE 'The survey data using dim function';
RUN;

The survey data using dim function

Obs  subj  q1  q2  q3  q4  q5  q6
  1  1001   1   2   0   1   0   0
  2  1002   2   1   1   1   1   0
  3  1003   2   2   1   0   .   1
  4  1004   1   .   0   0   0   1
  5  1005   2   1   1   1   1   0

First, note that although all of the survey variables (q1, ..., q6) are read into the survey data set, the ARRAY statement groups only four of them (q3, q4, q5, and q6) into the one-dimensional array qxs. As a result, qxs(1) corresponds to the q3 variable, qxs(2) corresponds to the q4 variable, and so on. Then, rather than telling SAS to process the array from element 1 to element 4, the DO loop tells SAS to process the array from element 1 to the more general DIM(qxs). In general, the DIM function returns the number of elements in the array, which in this case is 4. Inside the loop, SAS recodes each value by simply subtracting 1 from it. Finally, because the index variable i would otherwise be written to the survey data set, the DROP= data set option on the DATA statement removes it.

Now, launch and run the SAS program. Then, review the output from the PRINT procedure to convince yourself that the program does indeed recode the four variables q3, q4, q5, and q6 as described.

Example 19.13

As previously discussed and illustrated, if you do not specifically tell SAS the lower bound of an array, SAS assumes that the lower bound is 1. For most arrays, 1 is a convenient lower bound and the number of elements is a convenient upper bound, so you usually don't need to specify both the lower and upper bounds. However, in cases where it is more convenient, you can modify both bounds for any array dimension.

In the previous example, perhaps you find it a little awkward that the array element qxs(1) corresponds to the q3 variable, the array element qxs(2) corresponds to the q4 variable, and so on. Perhaps you would find it clearer for the array element qxs(3) to correspond to the q3 variable, the array element qxs(4) to correspond to the q4 variable, ..., and the array element qxs(6) to correspond to the q6 variable. The following program is similar in function to the previous program, except here the task of recoding is accomplished by defining the lower bound of the qxs array to be 3 and the upper bound to be 6:

DATA survey2 (DROP = i);
	INPUT subj q1 q2 q3 q4 q5 q6;
	ARRAY qxs(3:6) q3-q6;
	DO i = 3 to 6;
		qxs(i) = qxs(i) - 1;
	END;
	DATALINES;
	1001 1 2 1 2 1 1
	1002 2 1 2 2 2 1
	1003 2 2 2 1 . 2
	1004 1 . 1 1 1 2
	1005 2 1 2 2 2 1
	;
RUN;
 
PROC PRINT data = survey2;
	TITLE 'The survey data using bounded arrays';
RUN;

The survey data using bounded arrays

Obs  subj  q1  q2  q3  q4  q5  q6
  1  1001   1   2   0   1   0   0
  2  1002   2   1   1   1   1   0
  3  1003   2   2   1   0   .   1
  4  1004   1   .   0   0   0   1
  5  1005   2   1   1   1   1   0

If you compare this program with the previous program, you'll see that only two things differ. The first difference is that the ARRAY statement here defines the lower bound of the qxs array to be 3 and the upper bound to be 6. In general, you can always define the lower and upper bounds of any array dimension in this way, namely by specifying the lower bound, then a colon (:), and then the upper bound. The second difference is that, for the DO loop, the bounds on the index variable i are specifically defined here to be between 3 and 6 rather than 1 to DIM(qxs) (which in this case is 4).
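To see how the bounds and the dimension relate for a bounded array, a short sketch such as the following may be useful. It is only an illustration: the variables n, lo, and hi are hypothetical, and the LBOUND and HBOUND functions it uses return the lower and upper bounds of an array:

```sas
DATA _null_;
    array qxs(3:6) q3-q6;
    n  = dim(qxs);       /* number of elements: 6 - 3 + 1 = 4 */
    lo = lbound(qxs);    /* lower bound: 3 */
    hi = hbound(qxs);    /* upper bound: 6 */
    put n= lo= hi=;
RUN;
```

The log should show n=4 lo=3 hi=6. In other words, the array still has four elements, which is why a loop written as "do i = 1 to dim(qxs)" would no longer match the subscripts 3 through 6.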

Now, launch and run the SAS program. Then, review the output from the PRINT procedure to convince yourself that the program does indeed recode the four variables q3, q4, q5, and q6 just as in the previous program.

Example 19.14

Now, there's still a little bit more that we can do to automate the handling of the bounds of an array dimension. The following program again uses a one-dimensional array qxs to recode four survey variables as did the previous two programs. Here, though, an asterisk (*) is used to tell SAS to determine the dimension of the qxs array, and the LBOUND and HBOUND functions are used to tell SAS to determine, respectively, the lower and upper bounds of the DO loop's index variable dynamically:

DATA survey3 (DROP = i);
	INPUT subj q1 q2 q3 q4 q5 q6;
	ARRAY qxs(*) q3-q6;
	DO i = lbound(qxs) to hbound(qxs);
		qxs(i) = qxs(i) - 1;
	END;
	DATALINES;
	1001 1 2 1 2 1 1
	1002 2 1 2 2 2 1
	1003 2 2 2 1 . 2
	1004 1 . 1 1 1 2
	1005 2 1 2 2 2 1
	;
RUN;
 
PROC PRINT data = survey3;
	TITLE 'The survey data by changing upper and lower bounds automatically';
RUN;

The survey data by changing upper and lower bounds automatically

Obs  subj  q1  q2  q3  q4  q5  q6
  1  1001   1   2   0   1   0   0
  2  1002   2   1   1   1   1   0
  3  1003   2   2   1   0   .   1
  4  1004   1   .   0   0   0   1
  5  1005   2   1   1   1   1   0

If you compare this program with the previous program, you'll see that only two things differ. The first difference is that the asterisk (*) that appears in the ARRAY statement tells SAS to determine the size of the array itself when qxs is declared. SAS counts the variables in the list and determines that qxs has 4 elements, with the default lower bound of 1 and an upper bound of 4. The second difference is that, for the DO loop, the bounds on the index variable i are determined dynamically to be LBOUND(qxs) and HBOUND(qxs), that is, 1 and 4.
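The payoff of this approach is that only the variable list needs to change when the set of variables changes. As a sketch of that idea (the data set name survey4 and the shortened data are assumptions for illustration), the following program recodes all six questions simply by listing q1-q6 in the ARRAY statement:

```sas
DATA survey4 (DROP = i);
    INPUT subj q1 q2 q3 q4 q5 q6;
    ARRAY qxs(*) q1-q6;                   /* only the variable list has changed */
    DO i = lbound(qxs) to hbound(qxs);    /* now runs from 1 to 6 */
        qxs(i) = qxs(i) - 1;
    END;
    DATALINES;
1001 1 2 1 2 1 1
1002 2 1 2 2 2 1
;
RUN;

PROC PRINT data = survey4;
RUN;
```

Neither the DO statement nor the assignment inside the loop had to be edited, since SAS works out the bounds from the ARRAY statement.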

Now, launch and run the SAS program. Then, review the output from the PRINT procedure to convince yourself that the program does indeed recode the four variables q3, q4, q5, and q6 just as in the previous two programs.


19.6 - Using Arrays to Transpose a Data Set

A few lessons back, we learned how to transpose a data set by taking advantage of the LAST.variable, together with RETAIN and OUTPUT statements. In this section, we'll learn how to use an array to transpose a "tall" data set into a "fat" data set.

Example 19.15

Throughout this section, we will work with the tallgrades data set that is created in the following DATA step:

DATA tallgrades;
    input idno 1-2 l_name $ 5-9 gtype $ 12-13 grade 15-17;
    cards;
10  Smith  E1  78
10  Smith  E2  82
10  Smith  E3  86
10  Smith  E4  69
10  Smith  P1  97
10  Smith  F1 160
11  Simon  E1  88
11  Simon  E2  72
11  Simon  E3  86
11  Simon  E4  99
11  Simon  P1 100
11  Simon  F1 170
12  Jones  E1  98
12  Jones  E2  92
12  Jones  E3  92
12  Jones  E4  99
12  Jones  P1  99
12  Jones  F1 185
;
RUN;

PROC PRINT data = tallgrades NOOBS;
   TITLE 'The tall grades data set';
RUN;

The tall grades data set

idno  l_name  gtype  grade
  10  Smith   E1        78
  10  Smith   E2        82
  10  Smith   E3        86
  10  Smith   E4        69
  10  Smith   P1        97
  10  Smith   F1       160
  11  Simon   E1        88
  11  Simon   E2        72
  11  Simon   E3        86
  11  Simon   E4        99
  11  Simon   P1       100
  11  Simon   F1       170
  12  Jones   E1        98
  12  Jones   E2        92
  12  Jones   E3        92
  12  Jones   E4        99
  12  Jones   P1        99
  12  Jones   F1       185

The tallgrades data set contains one observation for each grade for each student. Students are identified by their ID number (idno) and last name (l_name). The data set contains six different types of grades: exam 1 (E1), exam 2 (E2), exam 3 (E3), exam 4 (E4), each worth 100 points; one project (P1) worth 100 points; and a final exam (F1) worth 200 points. Launch and run the SAS program so that we can work with the tallgrades data set in the next two examples.

Example 19.16

(You might recall seeing this program a few lessons ago.) The following program takes advantage of BY-group processing, as well as RETAIN and OUTPUT statements, to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
	by idno;
			 if gtype = 'E1' then E1 = grade;
		else if gtype = 'E2' then E2 = grade;
		else if gtype = 'E3' then E3 = grade;
		else if gtype = 'E4' then E4 = grade;
		else if gtype = 'P1' then P1 = grade;
		else if gtype = 'F1' then F1 = grade;
	if last.idno then output;
	retain E1 E2 E3 E4 P1 F1;
	drop gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
	title 'The fat grades data set';
RUN;

The fat grades data set

Obs  idno  l_name   E1   E2   E3   E4   P1   F1
  1    10  Smith    78   82   86   69   97  160
  2    11  Simon    88   72   86   99  100  170
  3    12  Jones    98   92   92   99   99  185

First, note that the program takes the grades of each student ("by idno") that appear in the variable grade and, depending on what type of grade they are ("if gtype ="), assigns them to the new variables E1, E2, ..., F1. Only when the last observation is encountered for each student ("if last.idno") is the data output to the fatgrades data set. The RETAIN statement tells SAS to retain the values of E1, E2, ..., F1 from one iteration of the DATA step to the next. If the RETAIN statement were not present, SAS would reset E1, E2, ..., F1 to missing at the beginning of each iteration of the DATA step. By the time SAS output the program data vector, only the grade assigned in that final iteration would remain, and the other grades would be missing.
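If you want to see the effect of leaving out the RETAIN statement, you could try a pared-down sketch like the one below. It is only an illustration: the data set name noretain is hypothetical, only two of the six grade types are handled, and the sketch assumes that the tallgrades data set from Example 19.15 already exists:

```sas
DATA noretain;
    set tallgrades;
    by idno;
    if gtype = 'E1' then E1 = grade;
    else if gtype = 'F1' then F1 = grade;
    if last.idno then output;     /* the last observation for each student is F1 */
    drop gtype grade;             /* no RETAIN statement this time */
RUN;

PROC PRINT data = noretain;
RUN;
```

Because E1 is reset to missing at the start of every iteration, it should be missing in all three output observations, while F1 (assigned in the very iteration that is output) should hold 160, 170, and 185.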

Now, launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the grades E1, E2, ..., F1 are appropriately assigned and retained. Also, note that we have successfully transposed the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.

Example 19.17

The following program uses an array to transpose the tallgrades data set (one observation per grade) into the fatgrades data set (one observation per student):

DATA fatgrades;
	set tallgrades;
   by idno;
   array allgrades (6) G1 - G6;
   if first.idno then i = 1;
   allgrades(i) = grade;
   if last.idno then output;
   i + 1;
   retain G1 - G6;
   drop i gtype grade;
RUN;
 
PROC PRINT data=fatgrades;
  title 'The fat grades data set';
RUN;

The fat grades data set

Obs  idno  l_name   G1   G2   G3   G4   G5   G6
  1    10  Smith    78   82   86   69   97  160
  2    11  Simon    88   72   86   99  100  170
  3    12  Jones    98   92   92   99   99  185

Yikes! This code looks even scarier! Let's dissect it a bit. First, just as in the previous program, the tallgrades data set is processed BY idno. Doing so makes the first.idno and last.idno variables available for us to use. The ARRAY statement defines an array called allgrades and, using a numbered range list, associates the array with six (uninitialized) variables G1, G2, ..., G6. The allgrades array is used to hold the six grades for each student before they are output in their transposed direction to the fatgrades data set. Because the elements of any array, and therefore allgrades, must be assigned using an index variable, this is how the transposition takes place:

  • ("if first.idno then i = 1;") If the input observation contains a student idno that hasn't yet been encountered in the data set, then the index variable i is initialized to 1. If the input observation doesn't contain a new student idno, then do nothing other than advance to the next step.
  • ("allgrades(i) = grade;") The grade from the current observation is assigned to element i of the array allgrades. (For example, if the input observation contains Smith's first grade, then allgrades(1) is assigned the value 78. If the input observation contains Smith's second grade, then allgrades(2) is assigned the value 82. And so on.)
  • ("if last.idno then output;") If the input observation is the last observation in the data set that contains the student idno, then dump the program data vector (which contains allgrades) to the output data set. (For example, if the input observation is Smith's final exam grade, then output the now fat observation containing his six grades). If the input observation is not the last observation in the data set that contains the student idno, do nothing other than advance to the next step.
  • ("i + 1;") Then, increase the index variable i by 1. (For example, if i is 1, change i to 2.)
  • ("retain G1-G6;") Rather than setting G1, G2, ..., G6 to missing at the beginning of the next iteration of the DATA step, retain their current values. (So, for example, for Smith, allgrades(1) would retain its value of 78, allgrades(2) would retain its value of 82, and so on.) Because RETAIN is a declarative statement, it applies to every iteration no matter where it appears in the DATA step.

The program keeps cycling through these steps until it has processed the last observation in the data set. The variables i, gtype, and grade are dropped from the output fatgrades data set.

Now, launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that we have now successfully used arrays to transpose the tallgrades data set from a "tall" data set to a "fat" fatgrades data set.
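An array can also run the transposition in the other direction, from "fat" back to "tall". The following sketch is one possible way to do that, not a program from the lesson. It assumes that the fatgrades data set from Example 19.17 (with variables G1 through G6) exists, and the data set name tallagain is hypothetical:

```sas
DATA tallagain;
    set fatgrades;
    array allgrades(6) G1 - G6;
    do i = 1 to dim(allgrades);
        grade = allgrades(i);     /* pull out the i-th grade for this student */
        output;                   /* write one observation per grade */
    end;
    keep idno l_name i grade;     /* i records the position of the grade (1 to 6) */
RUN;

PROC PRINT data = tallagain NOOBS;
RUN;
```

Each of the three student observations should produce six output observations, for eighteen in all. Note that the original gtype labels (E1, ..., F1) are not recovered here, since the fatgrades data set no longer contains them.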


19.7 - Two-Dimensional Arrays

Two-dimensional arrays are straightforward extensions of one-dimensional arrays. You can think of one-dimensional arrays such as the array barkers:

ARRAY barkers(4) dog1-dog4;

as a single row of variables:

dog1 dog2 dog3 dog4

And two-dimensional arrays such as the array pets:

ARRAY pets(2,4) dog1-dog4 cat1-cat4;

as multiple rows of variables:

dog1  dog2  dog3  dog4
cat1  cat2  cat3  cat4

As the previous ARRAY statement suggests, to define a two-dimensional array, you specify the number of elements in each dimension, separated by a comma. In general, the first dimension number tells SAS how many rows your array needs, while the second dimension number tells SAS how many columns your array needs.

When you define a two-dimensional array, the array elements are grouped in the order in which they appear in the ARRAY statement. For example, SAS assigns the elements of the array horse:

ARRAY horse(3,5) x1-x15;

as follows:

x1   x2   x3   x4   x5
x6   x7   x8   x9   x10
x11  x12  x13  x14  x15
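One way to convince yourself of this row-by-row order is to store a value in each element that encodes its row and column, and then look at a few of the underlying variables. The following sketch does just that (the index variables r and c, and the choice of 10*r + c as the stored value, are assumptions made for illustration):

```sas
DATA _null_;
    array horse(3,5) x1-x15;
    do r = 1 to 3;                   /* rows */
        do c = 1 to 5;               /* columns */
            horse(r,c) = 10*r + c;   /* e.g., row 2, column 3 stores 23 */
        end;
    end;
    put x1= x5= x6= x8= x15=;
RUN;
```

The log should show x1=11 x5=15 x6=21 x8=23 x15=35, confirming that x1 through x5 fill the first row, x6 through x10 fill the second row, and x11 through x15 fill the third row.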

In this section, we'll look at two examples that involve checking a subset of Family History data for missing values. In the first example, we'll use two one-dimensional arrays: the first array stores the actual data, and the second array stores binary status variables that indicate whether a particular data value is missing or not. In the second example, we'll use one two-dimensional array: the first row stores the actual data, and the second row stores the binary status variables.

Example 19.18

The following program reads 14 family history variables (fhx1, ..., fhx14) arising from five subjects and stores the data in a one-dimensional array called edit. Then, SAS searches the edit array to determine whether or not any data values are missing. Fourteen (14) status variables (stat1, ..., stat14) are created to correspond to each of the 14 data variables. The status (1 if missing and 0 if nonmissing) is stored in another one-dimensional array called status:

DATA fhx;
    /* The colon modifier reads v_date with list input, so the informat
       starts at the date itself and fhx1-fhx14 line up with the data. */
    input subj v_date : mmddyy8. fhx1-fhx14;
    array edit(14) fhx1-fhx14;
    array status(14) stat1-stat14;
    do i = 1 to 14;
        status(i) = 0;
        if edit(i) = . then status(i) = 1;
    end;
    DATALINES;
220004  07/27/93  0  0  0  .  8  0  0  1  1  1  .  1  0  1
410020  11/11/93  0  0  0  .  0  0  0  0  0  0  .  0  0  0
520013  10/29/93  0  0  0  .  0  0  0  0  0  0  .  0  0  1
520068  08/10/95  0  0  0  0  0  1  1  0  0  1  1  0  1  0
520076  08/25/95  0  0  0  0  1  8  0  0  0  1  1  0  0  1
;
RUN;

PROC PRINT data = fhx;
    var fhx1-fhx14;
    TITLE 'The FHX data itself';
RUN;

PROC PRINT data = fhx;
    var stat1-stat14;
    TITLE 'The presence of missing values in FHX data';
RUN;

The FHX data itself

Obs  fhx1  fhx2  fhx3  fhx4  fhx5  fhx6  fhx7  fhx8  fhx9  fhx10  fhx11  fhx12  fhx13  fhx14
  1     0     0     0     .     8     0     0     1     1      1      .      1      0      1
  2     0     0     0     .     0     0     0     0     0      0      .      0      0      0
  3     0     0     0     .     0     0     0     0     0      0      .      0      0      1
  4     0     0     0     0     0     1     1     0     0      1      1      0      1      0
  5     0     0     0     0     1     8     0     0     0      1      1      0      0      1

The presence of missing values in FHX data

Obs  stat1  stat2  stat3  stat4  stat5  stat6  stat7  stat8  stat9  stat10  stat11  stat12  stat13  stat14
  1      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  2      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  3      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  4      0      0      0      0      0      0      0      0      0       0       0       0       0       0
  5      0      0      0      0      0      0      0      0      0       0       0       0       0       0

The first ARRAY statement tells SAS to group the fourteen family history variables, fhx1, ..., fhx14, into a one-dimensional array called edit. The second ARRAY statement tells SAS to group the fourteen status variables, stat1, ..., stat14, into a one-dimensional array called status. The DO loop tells SAS to review the contents of the 14 variables and to assign each element of the status array a value of 0 ("status(i) = 0"). If the element of the edit array is missing, however, then SAS is told to change the element of the status array from a 0 to a 1 ("if edit(i) = . then status(i) = 1").

Launch and run the SAS program. Then, review the output from the PRINT procedures to convince yourself that the program does indeed do as described.

Example 19.19

The previous program was written merely as a prelude to this example, in which the code is modified to illustrate the use of two-dimensional arrays. This program performs exactly the same task as the previous program, namely searching a subset of family history data for missing values. Here, though, we use one two-dimensional array called edit instead of two one-dimensional arrays:

DATA fhx2;
    /* The colon modifier reads v_date with list input, so the informat
       starts at the date itself and fhx1-fhx14 line up with the data. */
    input subj v_date : mmddyy8. fhx1-fhx14;
    array edit(2,14) fhx1-fhx14 stat1-stat14;
    do i = 1 to 14;
        edit(2,i) = 0;
        if edit(1,i) = . then edit(2,i) = 1;
    end;
    DATALINES;
220004  07/27/93  0  0  0  .  8  0  0  1  1  1  .  1  0  1
410020  11/11/93  0  0  0  .  0  0  0  0  0  0  .  0  0  0
520013  10/29/93  0  0  0  .  0  0  0  0  0  0  .  0  0  1
520068  08/10/95  0  0  0  0  0  1  1  0  0  1  1  0  1  0
520076  08/25/95  0  0  0  0  1  8  0  0  0  1  1  0  0  1
;
RUN;

PROC PRINT data = fhx2;
    var fhx1-fhx14;
    TITLE 'The FHX2 data itself';
RUN;

PROC PRINT data = fhx2;
    var stat1-stat14;
    TITLE 'The presence of missing values in FHX2 data';
RUN;

The FHX2 data itself

Obs  fhx1  fhx2  fhx3  fhx4  fhx5  fhx6  fhx7  fhx8  fhx9  fhx10  fhx11  fhx12  fhx13  fhx14
  1     0     0     0     .     8     0     0     1     1      1      .      1      0      1
  2     0     0     0     .     0     0     0     0     0      0      .      0      0      0
  3     0     0     0     .     0     0     0     0     0      0      .      0      0      1
  4     0     0     0     0     0     1     1     0     0      1      1      0      1      0
  5     0     0     0     0     1     8     0     0     0      1      1      0      0      1

The presence of missing values in FHX2 data

Obs  stat1  stat2  stat3  stat4  stat5  stat6  stat7  stat8  stat9  stat10  stat11  stat12  stat13  stat14
  1      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  2      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  3      0      0      0      1      0      0      0      0      0       0       1       0       0       0
  4      0      0      0      0      0      0      0      0      0       0       0       0       0       0
  5      0      0      0      0      0      0      0      0      0       0       0       0       0       0

First, if you compare this program with the previous program, you should notice that the two programs have more similarities than differences. Here, we have just one ARRAY statement that defines the two-dimensional array edit containing 2 rows and 14 columns. The ARRAY statement tells SAS to group the family history variables (fhx1, ..., fhx14) into the first row and the status variables (stat1, ..., stat14) into the second row. Then, the DO loop tells SAS to review the contents of the 14 family history variables and to assign each element of the status row a value of 0 ("edit(2,i) = 0;"). If the corresponding element of the data row is missing, however, then SAS is told to change the element of the status row from a 0 to a 1 ("if edit(1,i) = . then edit(2,i) = 1").

Launch and run the SAS program. Then, review the output from the PRINT procedures to convince yourself that this program accomplishes with one two-dimensional array exactly what the previous program accomplished with two one-dimensional arrays.
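Just as with one-dimensional arrays, the upper limit of the DO loop need not be typed in by hand. As a small sketch of that idea (the variables a1-a3 and s1-s3 and the single line of data are hypothetical), the DIM function can be given a second argument that names the dimension whose size is wanted:

```sas
DATA _null_;
    input a1-a3;
    array edit(2,3) a1-a3 s1-s3;
    do i = 1 to dim(edit, 2);           /* size of the second dimension: 3 columns */
        edit(2,i) = (edit(1,i) = .);    /* comparison yields 1 if missing, 0 otherwise */
    end;
    put s1= s2= s3=;
    DATALINES;
1 . 3
;
RUN;
```

The log should show s1=0 s2=1 s3=0, since only the second data value is missing. The sketch also folds the two assignment statements into one by storing the result of the comparison directly, which is simply a more compact way to set the same 0/1 status.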


19.8 - Summary

In this lesson, we've learned how to process arrays in SAS.

The homework for this lesson will give you more practice with array processing so that you become even more familiar with how arrays work and can use them in your own SAS programming.

