7.2 - Compilation Phase

Thus far, we have focused on how SAS processes entire programs step by step. In this section and the next, we'll focus only on how SAS processes a DATA step when the DATA step involves reading in raw data. That qualifier is an important clarification. A DATA step reads data either from another SAS data set or from raw data (instream using DATALINES or from an external file using INFILE). The discussions on this page, and the next two, do not pertain to DATA steps that read data from another SAS data set; they pertain only to DATA steps that read raw data. Now that we've got that clarified, let's move on!

As is the case for many programming languages, SAS processes a DATA step in two phases: a compilation phase and an execution phase. During the compilation phase, SAS checks that your program meets all of the required syntax rules; that is, SAS checks the gory details, such as making sure that proper naming conventions are followed and that every statement ends in a semicolon. When the compilation phase is complete, the descriptor portion of the SAS data set, containing the variable names and their attributes, is created. If the DATA step compiles successfully, the execution phase begins. During the execution phase, the DATA step reads and processes the input data line by line. When the execution phase is complete, the data portion of the SAS data set is created.

That's the overview in a nutshell. Let's now direct our attention solely to the compilation phase, so that we can learn exactly how the descriptor portion of a new data set is created.

The Input Buffer

After SAS checks for syntax errors, SAS creates what is called an input buffer. (Again, such a buffer is created only when the DATA step reads raw data, not when SAS reads data from another SAS data set.) You can think of an input buffer as a temporary holding place for a record of raw data as it is being read by SAS via the INPUT statement. It is, however, just a logical concept; it is not an actual physical storage area. You might visualize it looking something like this:

New Record of Data

1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  ...

Then, you can imagine that as a record of raw data is read, it is placed in the blank spaces of the input buffer — column by column — until there is no more data in the record.
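To make the column-by-column picture concrete, here is a minimal sketch that reads two instream records with column input and echoes the contents of the input buffer to the log. The data set name work.buffer_demo, the column positions, and the data values are made up for illustration; they are not from a real file.

```sas
/* Illustrative sketch: data set name, columns, and values are assumptions */
data work.buffer_demo;
   /* Column input: each variable is read from fixed columns of the buffer */
   input store 1-3 month 5-6 sales 8-12;

   /* _INFILE_ refers to the current contents of the input buffer */
   put _infile_;

   datalines;
101 01  2500
102 01  1875
;
run;
```

After the step runs, the log should show each raw record as it sat in the input buffer when the INPUT statement read it, which makes it easy to check that the column ranges line up with the data.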

The Program Data Vector

After the input buffer is created, SAS creates what is called a program data vector. You can think of the program data vector as the temporary area of memory where SAS builds a data set, one observation at a time. The program data vector contains a holding place for every variable mentioned in the DATA step, plus two automatic variables:

  • _N_ counts the number of times that the DATA step begins to execute
  • _ERROR_ is a binary variable that takes on the value of either 0 or 1 to denote whether the data caused an error during execution. The default value of _ERROR_ is 0, which means that no error occurred. When one or more errors occur, the value is set to 1.
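As a small sketch of how these two automatic variables behave, the step below writes their values to the log on every iteration. The data set name work.auto_demo and the data values are hypothetical; the second record deliberately contains a non-numeric value for sales so that _ERROR_ is set to 1 on that iteration.

```sas
/* Illustrative sketch: names and values are assumptions */
data work.auto_demo;
   input store month sales;

   /* Write the automatic variables to the log for each iteration */
   put _N_= _ERROR_=;

   datalines;
101 1 2500
102 1 abc
103 2 3100
;
run;
```

In the log, _N_ should increase by 1 for each record read, and _ERROR_ should be 1 only for the record whose sales value cannot be read as a number. SAS also writes an invalid-data note for that record and sets sales to missing.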

SAS uses the two automatic variables for processing, but they are not written to the data set as part of an observation. If your DATA step has

  • an INPUT statement mentioning three variables, such as store, month, and sales, and
  • an assignment statement that creates one variable, say taxdue

then you can imagine the program data vector looking something like this:

New Record of Data

_N_   _ERROR_   store   month   sales   taxdue
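A DATA step matching that description can be sketched as follows. The data set name work.pdv_demo, the data values, and the 6% tax rate are assumptions made only for illustration. The PUT _ALL_ statement writes every variable in the program data vector, including the two automatic variables, to the log.

```sas
/* Illustrative sketch: names, values, and the tax rate are assumptions */
data work.pdv_demo;
   input store month sales;      /* three variables from the INPUT statement */
   taxdue = sales * 0.06;        /* one variable from an assignment statement */

   /* Dump the whole program data vector, automatic variables included */
   put _all_;

   datalines;
101 1 2500
102 1 1875
;
run;

/* The printed data set holds store, month, sales, and taxdue only */
proc print data=work.pdv_demo;
run;
```

Comparing the log with the PROC PRINT output shows the distinction drawn above: _N_ and _ERROR_ appear in the program data vector but not in the data set itself.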

Descriptor Portion of the SAS Data Set

At the bottom of the DATA step — in most cases, when a RUN statement is encountered — the compilation phase is complete, and the descriptor portion of the new SAS data set is created. As you may recall, in addition to variable names, the descriptor portion of a data set includes the name of the data set, the date and time that the data set was created, the number of observations and variables it contains, and the attributes — such as type, length, format, and label — of the variables.
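One way to inspect the descriptor portion is with PROC CONTENTS. The sketch below is self-contained and uses a hypothetical data set, work.descriptor_demo, with made-up values, a made-up 6% tax rate, and an illustrative label and format. LABEL and FORMAT are declarative statements, so the attributes they set are recorded in the descriptor portion.

```sas
/* Illustrative sketch: names, values, label, and format are assumptions */
data work.descriptor_demo;
   input store month sales;
   taxdue = sales * 0.06;

   /* Attributes stored in the descriptor portion */
   label  taxdue = 'Tax due on sales';
   format sales taxdue dollar10.2;

   datalines;
101 1 2500
102 1 1875
;
run;

/* Display the descriptor portion: data set name, creation date and time,
   number of observations and variables, and each variable's attributes */
proc contents data=work.descriptor_demo;
run;
```

The PROC CONTENTS output should list the type, length, format, and label of each of the four variables, alongside the data set level information described above.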