13.5 - Understanding How Data Sets are Read

In Stat 480, we spent a lesson investigating how SAS processes a raw data file that is read into a SAS data set. Do you remember? The compile phase creates the program data vector, and hence the descriptor portion of the SAS data set. The execution phase does the actual data processing, in which the values in the program data vector are set to missing, the current record is placed in an input buffer, the values are read from the input buffer into the appropriate position in the program data vector, and then an observation is written from the program data vector to the output data set. Hopefully, this sounds at least a little bit familiar, because as we'll soon see, the process by which SAS reads an existing SAS data set into another SAS data set is very similar. The main difference is that while reading an existing SAS data set with the SET statement, SAS retains the values of the variables from one observation to the next. Let's work through an example.

Example 13.17

The following program uses a SET statement to read the permanent SAS data set stat481.sales, and then creates a variable called SalesTax, as well as a new temporary SAS data set called tax:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';

DATA tax;
	set stat481.sales;          /* read each observation from the permanent data set */
	SalesTax = Sales * 0.06;    /* compute SalesTax as 6% of Sales */
RUN;

PROC PRINT data = tax;
	title 'The tax data set';
RUN;

The tax data set

Obs  Store  Dept  Quarter      Sales  SalesTax
  1    101    10        1  110001.50   6600.09
  2    101    10        2  113101.20   6786.07
  3    101    10        3  111932.15   6715.93
  4    101    10        4   99901.10   5994.07
  5    101    20        1  110002.36   6600.14
  6    101    20        2   99922.39   5995.34
  7    101    20        3   98832.98   5929.98
  8    101    20        4  110101.70   6606.10
  9    121    20        1  121947.10   7316.83
 10    121    20        2  119964.69   7197.88
 11    121    20        3  122136.28   7328.18
 12    121    20        4  120111.11   7206.67
 13    121    10        1  127192.92   7631.58
 14    121    10        2  125280.13   7516.81
 15    121    10        3  128203.56   7692.21
 16    121    10        4  123632.29   7417.94
 17    109    10        1  120422.77   7225.37
 18    109    10        2  123984.32   7439.06
 19    109    10        3  121801.29   7308.08
 20    109    10        4  122125.66   7327.54
 21    109    30        1   98310.13   5898.61
 22    109    30        2   97331.25   5839.88
 23    109    30        3   96386.28   5783.18
 24    109    30        4   98511.90   5910.71
 25    109    20        1  115239.09   6914.35
 26    109    20        2  113001.98   6780.12
 27    109    20        3  114234.32   6854.06
 28    109    20        4  114122.65   6847.36

A pretty straightforward program! If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement so it reflects the location in which you saved the data set. Then, run the program, and review the output to familiarize yourself with the content of the output data set tax. Now, let's walk our way through how SAS processes the DATA step in this program.

Compile phase

During the compile phase, SAS takes the following steps:

  1. SAS creates a program data vector containing the automatic variables _N_ and _ERROR_:

    _N_  _ERROR_
    1    0
  2. SAS scans each statement in the DATA step looking for syntax errors, such as missing semicolons and invalid statements.
  3. When SAS compiles the SET statement, SAS adds a position to the program data vector for each variable in the input data set. SAS gets the variable names and attributes, such as type and length, from the input data set. Our input data set here (stat481.sales) tells SAS to add four positions to the program data vector — one for Store, one for Dept, one for Quarter, and one for Sales:

    _N_  _ERROR_  Store  Dept  Quarter  Sales
    1    0
  4. SAS also adds a position to the program data vector for any variables that are created in the DATA step. The attributes of each of these variables are determined by the expression in the statement. The one assignment statement in our DATA step tells SAS to add one position to the program data vector — for the new variable SalesTax:

    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    1    0
  5. SAS completes the compile phase at the bottom of the DATA step, and at that point SAS creates the descriptor portion of the output data set.

The output data set does not yet contain any observations, because SAS has not yet begun executing the program. Once the compile phase is complete, SAS starts the execution phase.
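One way to look at the descriptor portion described above is with PROC CONTENTS. The following is only a sketch: it assumes the DATA step from Example 13.17 has already been run, so that the temporary data set tax exists, and the title text is just an illustrative choice.

```sas
/* Display the descriptor portion of the tax data set:          */
/* variable names, types, lengths, and other attributes.        */
/* The VARNUM option lists the variables in creation order,     */
/* which matches their order in the program data vector.        */
PROC CONTENTS data = tax varnum;
	title 'Descriptor portion of the tax data set';
RUN;
```

The variable list in the output should include Store, Dept, Quarter, Sales, and SalesTax, but not _N_ or _ERROR_, because the automatic variables are not written to the output data set.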

Execution phase

During the execution phase, SAS takes the following steps:

  1. The DATA step executes once for each observation in the input data set. In our case, the DATA step will execute 28 times because there are 28 observations in the input data set.
  2. At the beginning of the execution phase, SAS sets all of the data set variables in the program data vector to missing:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    1    0        .      .     .        .          .

    Because it is the first iteration of the DATA step, the automatic variable _N_ is set to 1. And, because SAS has not yet encountered any errors, the automatic variable _ERROR_ is set to 0.

  3. The SET statement reads the first observation from the input data set and writes the values to the program data vector:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    1    0        101    10    1        110001.50  .
  4. The assignment statement executes to compute the first value of SalesTax:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    1    0        101    10    1        110001.50  6600.09
  5. At the end of the first iteration of the DATA step, the values in the program data vector are written to the output data set tax as the first observation.
  6. The value of the automatic variable _N_ is increased to 2, and control returns to the top of the DATA step. The automatic variable _ERROR_ retains its value of 0, since SAS has still not encountered an error. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a Sum statement. All other variable values, such as the values of the variable SalesTax, are reset to missing. Taking all of this into account, our program data vector looks like this at the beginning of the second iteration of the DATA step:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    2    0        101    10    1        110001.50  .

    As stated earlier, this is the step that differs most from the way SAS reads a raw data file. Recall that when SAS reads from a raw data file, SAS sets the value of each variable to missing (with a few special exceptions) at the beginning of each iteration.

  7. As the SET statement executes, the values from the second observation are written to the program data vector:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    2    0        101    10    2        113101.20  .
  8. The assignment statement executes again to compute the value for SalesTax for the second observation:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    2    0        101    10    2        113101.20  6786.07
  9. At the bottom of the DATA step, the values in the program data vector are written to the output data set tax as the second observation.
  10. The value of the automatic variable _N_ is increased to 3, and control returns to the top of the DATA step. The automatic variable _ERROR_ retains its value of 0, since SAS has still not encountered an error. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a Sum statement. All other variable values, such as the values of the variable SalesTax, are reset to missing. Taking all of this into account, our program data vector looks like this at the beginning of the third iteration of the DATA step:
    _N_  _ERROR_  Store  Dept  Quarter  Sales      SalesTax
    3    0        101    10    2        113101.20  .

    Is this starting to sound like an endless loop? You should be getting the idea now ... the process continues as described until all of the observations are read.
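If you'd like to watch the program data vector change for yourself, one option is to write it to the log with PUT _ALL_ at the top and bottom of each iteration. The sketch below is not part of the original example. The data set names tax3 and rawdemo are hypothetical, the OBS= data set option is added only to keep the log short, and the raw data lines simply repeat the first three observations of the sales data. It assumes the LIBNAME statement from Example 13.17 has already been submitted.

```sas
/* Trace the program data vector when reading a SAS data set.   */
/* OBS=3 limits the trace to the first three observations.      */
DATA tax3;
	put 'Top:    ' _all_;       /* PDV at the top of the iteration */
	set stat481.sales (obs = 3);
	SalesTax = Sales * 0.06;
	put 'Bottom: ' _all_;       /* PDV just before the observation is written */
RUN;

/* For contrast, trace a DATA step that reads raw data instead. */
DATA rawdemo;
	put 'Top:    ' _all_;
	input Store Dept Quarter Sales;
	put 'Bottom: ' _all_;
	DATALINES;
101 10 1 110001.50
101 10 2 113101.20
101 10 3 111932.15
;
RUN;
```

In the log for the first DATA step, the "Top" lines from the second iteration onward should show Store, Dept, Quarter, and Sales still holding the previous observation's values, with SalesTax reset to missing. In the log for the second DATA step, the variables read by the INPUT statement should be missing at the top of every iteration. In both logs, expect the "Top" line to appear one extra time, because SAS detects the end of the input only when the SET or INPUT statement tries to read past the last record.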