7.3 - Execution Phase
Once your SAS program has successfully passed the compile phase, SAS then moves into the execution phase. The primary goal of the execution phase is to create the portion of the data set that contains the data values.
During the execution phase, each record (that is, each row) in the input raw data file:
- is read into the input buffer,
- is stored in the program data vector, along with any newly created variables,
- and is then written to the new data set as one observation.
The DATA step executes once for each record in the input file unless otherwise directed by additional statements that we'll learn about in our future study of SAS.
Example 7.3
The following SAS program reads in various measurements of six different trees into a data set called trees, and while doing so calculates the volume of each tree:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA trees;
input type $ 1-16 circ_in hght_ft crown_ft;
volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138 52
ash, white        258 80 70
cherry, black     187 91 75
maple, red        210 99 74
elm, american     229 127 104
;
RUN;
PROC PRINT data = trees;
RUN;
Typically, we'd be interested in launching and running the program to learn something about what it does. In this case, though, we are going to just take a walk behind the scenes to see how SAS builds the trees data set using the input buffer and program data vector. Here goes!
Since we are focusing on the execution phase, we assume that SAS has successfully completed the compilation phase. In that case, SAS has created an input buffer:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ... |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
and a program data vector:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
that contains a position for the two automatic variables _N_ and _ERROR_ and a position for each of the five variables mentioned in the DATA step.
Now, at the beginning of the execution phase, when SAS reads the DATA statement:
DATA trees;
SAS:
- sets the value of the automatic variable _N_ to 1
- sets the value of the automatic variable _ERROR_ to 0 (because there are as of yet no execution errors detected)
- sets the remaining user-defined variables to missing
At this point, then, our program data vector looks like this:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
1 | 0 |  | . | . | . | . |
As is always the case, missing numeric values are represented by periods, and missing character values are represented by blanks.
Reading down the program, SAS then reads the INFILE statement or, as in this case, the DATALINES statement to find the location of the raw data. Upon finding the raw data, SAS places the first record in the input buffer:
1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 16 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
o | a | k | , |  | b | l | ... |  | ... | 2 | 2 | 2 |  | 1 | 0 | 5 |  | 1 | 1 | 2 |
When the INPUT statement begins to read data values from a record that is held in the input buffer, it uses an input pointer to keep track of its position. Unless told otherwise, the input pointer always starts at column 1. As the INPUT statement executes, the raw data values in columns 1-16 are read and assigned to type in the program data vector:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
1 | 0 | oak, black | . | . | . | . |
Then, the INPUT statement reads the next three fields in the input buffer and stores them, respectively, in the circ_in, hght_ft, and crown_ft positions in the program data vector:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
1 | 0 | oak, black | 222 | 105 | 112 | . |
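If you'd like to see these program data vector values for yourself, one option is to add PUT _ALL_ statements to a copy of the DATA step. The following is just a minimal sketch, not part of the original example. It assumes that the data values start in column 19, uses DATA _NULL_ so that no data set is created, reads only the first record, and writes the contents of the program data vector to the log before and after the INPUT statement executes:

```sas
DATA _NULL_;
   /* Program data vector before the record is read */
   PUT 'Before INPUT:';
   PUT _ALL_;
   input type $ 1-16 circ_in hght_ft crown_ft;
   /* Program data vector after the INPUT statement has filled it */
   PUT 'After INPUT:';
   PUT _ALL_;
   DATALINES;
oak, black        222 105 112
;
RUN;
```

Don't be surprised if the "Before INPUT:" lines show up in the log a second time with _N_ equal to 2. SAS begins a second iteration of the DATA step, and stops only when the INPUT statement finds that there are no more records to read.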
Continuing to read down the program, SAS then executes the assignment statement to calculate the volume of the tree:
volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
In case you are curious, that formula comes from the volume of a cone, the assumed (general) shape of a tree. (Those odd-looking factors come from converting our feet and inches measurements into meters.) When SAS completes the calculation, it again stores the value in the correct position in the program data vector:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
1 | 0 | oak, black | 222 | 105 | 112 | 26.9075 |
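To check the arithmetic for this first tree by hand, you could break the assignment statement into its two factors. The following sketch is only an illustration: the variable names height_part and circ_part are made up for this purpose and do not appear in the trees program.

```sas
DATA _NULL_;
   hght_ft = 105;
   circ_in = 222;
   /* First factor: 0.319 * 105 = 33.495 */
   height_part = 0.319*hght_ft;
   /* Second factor: 0.0000163 * 222**2 = 0.0000163 * 49284 = 0.8033292 */
   circ_part = 0.0000163*circ_in**2;
   /* Product of the two factors, about 26.9075 */
   volume = height_part*circ_part;
   PUT height_part= circ_part= volume=;
RUN;
```

The product of the two factors, rounded to four decimal places, is the 26.9075 that appears in the program data vector.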
Now, when SAS encounters the RUN statement at the end of the DATA step:
RUN;
SAS takes several actions. First, the values in the program data vector are written to the output data set as the first observation:
type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|
oak, black | 222 | 105 | 112 | 26.9075 |
Note that only the user-defined variables are written to the output data set. The automatic variables _N_ and _ERROR_ are not written to the output data set. After writing the first observation to the output data set, SAS returns to the top of the DATA step to begin processing the second observation. At that point, SAS increases the value of _N_ to 2. The automatic variable _ERROR_ retains its value of 0, because SAS did not encounter an error while reading in the first record. Finally, the values of the user-defined variables in the program data vector are reset to missing, so that our program data vector now looks like this:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
2 | 0 |  | . | . | . | . |
This is an important step to note. When reading in raw data, SAS sets the value of each variable to missing at the beginning of each iteration of the DATA step. Well, okay, of course, it can't be that easy! There are a few exceptions. SAS does not set the following values to missing:
- variables that are named in a RETAIN statement
- variables that are created in a SUM statement
- data elements in a _TEMPORARY_ array
- any variables that are created with options in the FILE or INFILE statements
- the automatic variables _N_ and _ERROR_
We'll learn what SAS does instead in these cases when we learn about them in future lessons and future courses.
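In the meantime, a quick sketch may help make the first two exceptions concrete. The data set name tree_totals and the variable names total_volume and max_hght are hypothetical, chosen only for this illustration, and the data values are assumed to start in column 19:

```sas
DATA tree_totals;
   input type $ 1-16 circ_in hght_ft crown_ft;
   volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
   /* Sum statement: total_volume starts at 0 and keeps its value
      from one iteration to the next, giving a running total */
   total_volume + volume;
   /* RETAIN statement: max_hght is not reset to missing, so it can
      carry the tallest height seen so far across iterations */
   RETAIN max_hght;
   max_hght = MAX(max_hght, hght_ft);
   DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138 52
ash, white        258 80 70
;
RUN;
PROC PRINT data = tree_totals;
RUN;
```

In the printed output, type, circ_in, hght_ft, crown_ft, and volume change with every record as usual, while total_volume and max_hght build on the values they held at the end of the previous iteration.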
Are you getting the impression that this execution stuff is hard work? Look at all the work that SAS has done so far, and yet it has read in and constructed just one observation of our trees data set. Now SAS just has to do the same thing over and over again, namely reading a record into the input buffer, cycling through the statements in the DATA step, and writing the newly created observation to the output data set.
During the second iteration of the DATA step — that is, when _N_ = 2 — here's what the input buffer:
1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 16 | ... | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
h | e | m | l | o | c | k | ... | n | ... | 1 | 4 | 9 |  | 1 | 3 | 8 |  | 5 | 2 |  |
and program data vector:
_N_ | _ERROR_ | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|---|
2 | 0 | hemlock, eastern | 149 | 138 | 52 | 15.9305 |
look like. And, at the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation:
type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|
oak, black | 222 | 105 | 112 | 26.9075 |
hemlock, eastern | 149 | 138 | 52 | 15.9305 |
Then, SAS heads back up to the top of the DATA step again, sets _N_ to 3, retains the _ERROR_ value of 0, resets the user-defined variables in the program data vector to missing, reads in the next record into the input buffer, and creates the third observation. And on and on and on ... the execution phase continues in this manner until the end-of-file marker is reached in the raw data file. When there are no more records to read in the raw data file, the data portion of the new data set is complete.
Let's go back to our original program now:
OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA trees;
input type $ 1-16 circ_in hght_ft crown_ft;
volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138 52
ash, white        258 80 70
cherry, black     187 91 75
maple, red        210 99 74
elm, american     229 127 104
;
RUN;
PROC PRINT data = trees;
RUN;
and have you launch and run the program. Upon doing so, you should see that at the end of the execution phase, the SAS log confirms that the temporary data set trees was successfully created with 6 observations and 5 variables.
The message in the log window, as well as the output from the PRINT procedure:
Obs | type | circ_in | hght_ft | crown_ft | volume |
---|---|---|---|---|---|
1 | oak, black | 222 | 105 | 112 | 26.9075 |
2 | hemlock, eastern | 149 | 138 | 52 | 15.9305 |
3 | ash, white | 258 | 80 | 70 | 27.6890 |
4 | cherry, black | 187 | 91 | 75 | 16.5464 |
5 | maple, red | 210 | 99 | 74 | 22.7014 |
6 | elm, american | 229 | 127 | 104 | 34.6300 |
confirms that the automatic variables _N_ and _ERROR_ are used only to help SAS process the DATA step, and are therefore not written to the output data set.
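If you ever do want those values in your data set, one common approach is to copy them into ordinary variables with assignment statements. Here's a minimal sketch, in which the data set name trees_numbered and the variable names record_num and read_error are hypothetical, and the data values are assumed to start in column 19:

```sas
DATA trees_numbered;
   input type $ 1-16 circ_in hght_ft crown_ft;
   /* _N_ and _ERROR_ themselves are never written out, but
      ordinary variables holding copies of their values are */
   record_num = _N_;
   read_error = _ERROR_;
   DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138 52
;
RUN;
PROC PRINT data = trees_numbered;
RUN;
```

Because record_num and read_error are user-defined variables, they get their own positions in the program data vector and are written to the output data set along with the other variables.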
So, in a nutshell, that's the execution phase of a DATA step that involves reading in raw data! Let's summarize, so we have the steps organized all in one spot:
1. At the beginning of the DATA step, SAS sets the variable values in the program data vector to missing.
2. Then, SAS reads the next record in the input raw data file into the input buffer.
3. SAS then proceeds through the DATA step, sequentially line by line, creating any new variable values along the way and storing them in the correct place in the program data vector.
4. At the end of the DATA step, SAS writes the (one) observation contained in the program data vector into the next available line of the SAS data set.
5. SAS returns to the beginning of the DATA step and repeats steps #1-4 until there are no more records to read from the input raw data file.