Lesson 7: Writing Programs That Work - Part I

Overview

Now that we know the basics of SAS, in this lesson, and the next, we'll spend some time shoring up our foundations with the intention of becoming more efficient SAS programmers. In order to write programs that work, we need to understand how SAS processes programs. So, we'll start by learning about its two processing phases — the compile phase and the execution phase. We'll look specifically at how SAS processes a data set when the DATA step reads in raw data. After discussing the process, we'll use the DATA step debugger to get a dynamic view of how SAS reads in raw data. In the process of doing that, you'll be introduced to the use of the DATA step debugger as a debugging tool. We'll also learn about the three different types of messages — errors, warnings, and notes — that SAS displays in the log window. Finally, we'll discuss four programming practices that will help us all become more efficient SAS programmers.

Objectives

Upon completion of this lesson, you should be able to:

describe how SAS processes programs from top-to-bottom and left-to-right
describe how during the compilation phase, SAS checks your program for syntax errors
describe how the descriptor portion of a SAS data set is created at the end of the compilation phase
describe how during the execution phase, the DATA step reads and processes the input data line by line
describe how the data portion of a SAS data set is created when the execution phase is complete
explain the use of an input buffer in the execution phase
explain the use of a program data vector in the execution phase
describe the purpose of the two automatic variables _N_ and _ERROR_
describe how the DATA step works like a loop
describe how (and when) SAS sets the value of each variable to missing at the beginning of each iteration of the DATA step
explain how the order in which the variables are defined in the DATA step determines the order in which the variables appear in the output data set
describe how to use the SAS DATA step debugger to find execution-time errors
recognize the difference between error, warning, and note messages displayed in the log window
review the log window every time you execute a SAS program to see what errors, warnings, and notes your program caused SAS to display
write SAS programs that are easy to read
run the PRINT procedure to test each part of your SAS program to make sure your program is doing what you intended
describe how to use a PUT statement as a way to debug a program
test your programs with small data sets
test your programs with representative data

7.1 - Processing SAS Programs

In order to write SAS programs that work well, we need to fully understand what SAS does, that is, how SAS processes a program after you've submitted it by clicking on the running man icon. By now, you know that SAS first reads the program, from top to bottom and left to right, looking for syntax errors, that is, for things such as missing semi-colons, misspelled keywords, and invalid variable names. In reality, SAS processes the program chunk by chunk, or, perhaps more accurately, step by step.

DATA and PROC statements signal the beginning of a new step. When SAS encounters:

a subsequent DATA or PROC statement,
a RUN statement (for DATA steps and most procedures),
or a QUIT statement (for some procedures)

SAS stops reading statements looking for errors and executes the previous step. Each time a step is executed, SAS:

generates messages about the processing activities in the log window, and
displays the results of the processing, if any, in the output window.

You can get the idea that SAS functions as described by reviewing the log window after submitting a program.

Example 7.1

The following SAS program merely creates a simple data set called trees, sorts the data set by tree type, and then prints the data set:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
	DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
PROC SORT data = trees;
   by type;
RUN;
PROC PRINT data = trees;
  title 'Tree data';
  id type;
RUN;

Tree data
type	circ_in	hght_ft	crown_ft
ash, white	258	80	70
cherry, black	187	91	75
elm, american	229	127	104
hemlock, eastern	149	138	52
maple, red	210	99	74
oak, black	222	105	112

You can go ahead and launch and run the SAS program and review the output, but what we ultimately care about in this discussion is what shows up in the log window:

    OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
    DATA trees;
        input type $ 1-16 circ_in hght_ft crown_ft;
        DATALINES;

NOTE: The data set WORK.TREES has 6 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.03 seconds
      cpu time            0.00 seconds


   ;
   PROC SORT data = trees;
      by type;
   RUN;

NOTE: There were 6 observations read from the data set WORK.TREES.
NOTE: The data set WORK.TREES has 6 observations and 4 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


   PROC PRINT data = trees;
NOTE: Writing HTML Body file: sashtml.htm
     title 'Tree data';
     id type;
   RUN;

The most important thing to note about the contents of the log window is that SAS first displayed messages about the DATA step, and then messages about the SORT procedure step, and finally messages about the PRINT procedure step. That's because SAS first read the DATA step looking for errors, and then when SAS deemed the step to be error-free, SAS executed the DATA step when it reached the end of the step (this DATA step has no RUN statement, so the step ends with the in-stream data that follow the DATALINES statement). Then SAS read the SORT procedure looking for errors, and then when SAS deemed the step to be error-free, SAS executed the SORT procedure when it encountered the RUN statement. Finally, SAS read the PRINT procedure looking for errors, and then when SAS deemed the step to be error-free, SAS executed the PRINT procedure when it encountered the RUN statement. It is in this manner that SAS processes programs step by step.

Oh yeah, there is something to notice about the output that our program generated. You'll see that in spite of us submitting three different steps — the DATA step, the SORT procedure, and the PRINT procedure — SAS generated only one piece of output. That's because some steps don't generate output, but rather just produce messages in the log window. In this case, the DATA step produced messages in the log window, but it did not create a report or other output. Likewise, the SORT procedure produced messages in the log window, but it did not create a report or other output. The PRINT procedure produced messages in the log window and created a report in the output window.

It is the messages contained in the log window that will be the focus of our attention for this lesson and the next. It is those messages that will help us learn how to write programs that work. Let's take a look at one more example.

Example 7.2

The following SAS program merely creates a simple data set called trees, attempts to sort the data set by tree height, and then prints the data set:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
	DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
PROC SORT data = trees;
   by height;
RUN;
PROC PRINT data = trees;
  title 'Tree data again';
  id type;
RUN;

Tree data again
type	circ_in	hght_ft	crown_ft
oak, black	222	105	112
hemlock, eastern	149	138	52
ash, white	258	80	70
cherry, black	187	91	75
maple, red	210	99	74
elm, american	229	127	104

If you review the program carefully, you'll see that it shouldn't work as we intended, because we attempted to sort the trees data set by the incorrect variable name height rather than by the correct variable name hght_ft. Let's go ahead and launch and run the SAS program and then take a look at the messages displayed in the log window:

   OPTIONS PS = 58 LS = 72 NODATE NONUMBER;
   DATA trees;
       input type $ 1-16 circ_in hght_ft crown_ft;
       DATALINES;

NOTE: The data set WORK.TREES has 6 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


   ;
   PROC SORT data = trees;
      by height;
ERROR: Variable HEIGHT not found.
   RUN;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds



   PROC PRINT data = trees;
     title 'Tree data again';
     id type;
   RUN;

Once more, this example illustrates how SAS processes one step at a time. As you can see by the messages displayed in the log, the DATA step successfully creates a temporary data set named trees containing four variables and six observations. Upon completing the DATA step, SAS then moves on to read the SORT procedure looking for errors. When SAS discovers that we are attempting to sort the trees data set by a variable height that doesn't exist in the data set, SAS displays an ERROR message in the log window. Because SAS can't even begin to attempt to execute the SORT procedure without knowledge of the correct variable, SAS stops processing the SORT step and says so in the log window. SAS then moves on to read the PRINT procedure looking for errors. When SAS deems that the PRINT procedure is error-free and that the previous error doesn't prevent the PRINT procedure from succeeding, SAS executes the PRINT procedure, which is why the report shows the trees in their original, unsorted order.
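A minimal sketch of the fix, assuming the intent was to sort by tree height: the BY statement names hght_ft, the variable that the INPUT statement actually creates, instead of the nonexistent height. Everything else is unchanged from Example 7.2.

```sas
DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

PROC SORT data = trees;
   by hght_ft;    /* corrected: the variable defined in the INPUT statement */
RUN;

PROC PRINT data = trees;
  title 'Tree data again';
  id type;
RUN;
```

With this change, the log for the SORT step should contain NOTE messages rather than an ERROR message, and the report should list the trees from shortest to tallest.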

7.2 - Compilation Phase

Thus far, the focus of our attention has been on how SAS processes entire programs step by step. In this section and the next, we'll focus our attention only on how SAS processes a DATA step when the DATA step involves reading in raw data. That qualifier makes a very important clarification. DATA steps involve either reading data from another SAS data set or reading in raw data (instream using DATALINES or from an external file using INFILE). The discussions in this section, and the next two, do not pertain to DATA steps that read data from another SAS data set. Instead, the upcoming discussion pertains only to DATA steps that involve reading in raw data. Now that we've got that clarified, let's move on!

As is the case for many programming languages, SAS processes a DATA step in two phases — a compilation phase and an execution phase. During the compilation phase, SAS checks to make sure that your program meets all of the required syntax rules — that is, SAS checks the gory details, such as making sure that proper naming conventions are followed and that every statement ends in a semi-colon. When the compilation phase is complete, the descriptor portion of the SAS data set, containing the variable names and their attributes, is created. If the DATA step successfully compiles, then the execution phase begins. During the execution phase, the DATA step reads and processes the input data line by line. When the execution phase is complete, the data portion of the SAS data set is created.

That's the overview in a nutshell. Let's now direct our attention solely to the compilation phase, so that we can learn exactly how the descriptor portion of a new data set is created.

The Input Buffer

After SAS checks for syntax errors, SAS creates what is called an input buffer. (Again, such a buffer is created only when the DATA step reads raw data, not when SAS reads data from another SAS data set.) You can think of an input buffer as a temporary holding place for a record of raw data as it is being read by SAS via the INPUT statement. It is, however, just a logical concept; it is not an actual physical storage area. You might visualize it looking something like this:

New Record of Data
1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	...

Then, you can imagine that as a record of raw data is read, it is placed in the blank spaces of the input buffer — column by column — until there is no more data in the record.

The Program Data Vector

After the input buffer is created, SAS creates what is called a program data vector. You can think of the program data vector as the temporary area of memory where SAS builds a data set, one observation at a time. In addition to the program data vector containing a holding place for every variable mentioned in the DATA step, it also contains two automatic variables:

_N_ counts the number of times that the DATA step begins to execute
_ERROR_ is a binary variable that takes on the value of either 0 or 1 to denote whether the data caused an error during execution. The default value of _ERROR_ is 0, which means that no error occurred. When one or more errors occur, the value is set to 1.

SAS uses the two automatic variables for processing, but they are not written to the data set as part of an observation. If your DATA step has

an INPUT statement mentioning three variables, such as store, month, and sales, and
an assignment statement that creates one variable, say taxdue

then you can imagine the program data vector looking something like this:

New Record of Data
_N_	_ERROR_	store	month	sales	taxdue

Descriptor Portion of the SAS Data Set

At the bottom of the DATA step — in most cases, when a RUN statement is encountered — the compilation phase is complete, and the descriptor portion of the new SAS data set is created. As you may recall, in addition to variable names, the descriptor portion of a data set includes the name of the data set, the date and time that the data set was created, the number of observations and variables it contains, and the attributes — such as type, length, format, and label — of the variables.
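One way to look at the descriptor portion yourself is the CONTENTS procedure. The following is a minimal sketch, reusing the trees data from Example 7.1; the exact layout of the report depends on your SAS version and settings.

```sas
DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

/* Reports the descriptor portion: data set name, creation date, */
/* number of observations and variables, and variable attributes */
PROC CONTENTS data = trees;
RUN;
```

In the report, type should appear as a character variable of length 16, and the other three variables as numeric.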

7.3 - Execution Phase

Once your SAS program has successfully passed the compile phase, SAS then moves into the execution phase. The primary goal of the execution phase is to create the portion of the data set that contains the data values.

During the execution phase, each record, i.e., row, in the input raw data file:

is read into the input buffer,
stored in the program data vector, along with other newly created variables,
and then written to the new data set as one observation.

The DATA step executes once for each record in the input file unless otherwise directed by additional statements that we'll learn about in our future study of SAS.

Example 7.3

The following SAS program reads in various measurements of six different trees into a data set called trees, and while doing so calculates the volume of each tree:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

PROC PRINT data = trees;
RUN;

Typically, we'd be interested in launching and running the program to learn something about what it does. In this case, though, we are going to just take a walk behind the scenes to see how SAS builds the trees data set using the input buffer and program data vector. Here goes!

Since we are focusing on the execution phase, we assume that SAS has successfully completed the compilation phase. In that case, SAS has created an input buffer:

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	...

and a program data vector:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume

that contains a position for the two automatic variables _N_ and _ERROR_ and a position for each of the five variables mentioned in the DATA step.

Now, at the beginning of the execution phase, when SAS reads the DATA statement:

DATA trees;

SAS:

sets the value of the automatic variable _N_ to 1
sets the value of the automatic variable _ERROR_ to 0 (because there are as of yet no execution errors detected)
sets the remaining user-defined variables to missing

At this point, then, our program data vector looks like this:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
1	0		.	.	.	.

As is always the case, missing numeric values are represented by periods, and missing character values are represented by blanks.

Reading down the program, SAS then reads the INFILE statement or, as in this case, the DATALINES statement to find the location of the raw data. Upon finding the raw data, SAS places the first record in the input buffer:

1	2	3	4	5	6	7	...	16	...	19	20	21	22	23	24	25	26	27	28	29
o	a	k	,		b	l	...			2	2	2		1	0	5		1	1	2

When the INPUT statement begins to read data values from a record that is held in the input buffer, it uses an input pointer to keep track of its position. Unless told otherwise, the input pointer always starts at column 1. As the INPUT statement executes, the raw data values in columns 1-16 are read and assigned to type in the program data vector:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
1	0	oak, black	.	.	.	.

Then, the INPUT statement reads the next three fields in the input buffer and stores them, respectively, in the circ_in, hght_ft, and crown_ft positions in the program data vector:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
1	0	oak, black	222	105	112	.

Continuing to read down the program, SAS then executes the assignment statement to calculate the volume of the tree:

volume = (0.319*hght_ft)*(0.0000163*circ_in**2);

In case you are curious, that formula comes from the volume of a cone, the assumed (general) shape of a tree. (Those odd-looking factors come from converting our feet and inches measurements into meters.) When SAS completes the calculation, it again stores the value in the correct position in the program data vector:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
1	0	oak, black	222	105	112	26.9075
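As a quick arithmetic check of the value above, here is a minimal sketch (not part of the original program) that evaluates the same formula for the first record. It uses a DATA _NULL_ step, which executes the statements without creating a data set, and a PUT statement to write the result to the log.

```sas
DATA _NULL_;
    hght_ft = 105;     /* height of the black oak, from the first record */
    circ_in = 222;     /* circumference of the black oak, from the first record */
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    put volume=;       /* writes the computed value to the log */
RUN;
```

By hand, 0.319 x 105 = 33.495 and 0.0000163 x 222**2 = 0.8033292, and their product is about 26.9075, the value stored in the program data vector.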

Now, when SAS encounters the RUN statement at the end of the DATA step:

RUN;

SAS takes several actions. First, the values in the program data vector are written to the output data set as the first observation:

type	circ_in	hght_ft	crown_ft	volume
oak, black	222	105	112	26.9075

Note that only the user-defined variables are written to the output data set. The automatic variables _N_ and _ERROR_ are not written to the output data set. After writing the first observation to the output data set, SAS returns to the top of the DATA step to begin processing the second observation. At that point, SAS increases the value of _N_ to 2. The automatic variable _ERROR_ retains its value of 0, because SAS did not encounter an error while reading in the first record. Finally, the values of the variables in the program data vector are reset to missing, so that our program data vector now looks like this:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
2	0		.	.	.	.

This is an important step to note. When reading in raw data, SAS sets the value of each variable to missing at the beginning of each iteration of the DATA step. Well, okay, of course, it can't be that easy! There are a few exceptions. SAS does not set the following values to missing:

variables that are named in a RETAIN statement
variables that are created in a SUM statement
data elements in a _TEMPORARY_ array
any variables that are created with options in the FILE or INFILE statements
the automatic variables _N_ and _ERROR_

We'll learn what SAS does instead in these cases when we learn about them in future lessons and future courses.
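As a preview of two of those exceptions, here is a minimal sketch using the trees data. The variable names max_hght and total_hght are hypothetical, chosen only for illustration. Because max_hght is named in a RETAIN statement and total_hght is created by a sum statement, neither should be reset to missing at the top of each iteration, so both carry their values forward from one record to the next.

```sas
DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
    retain max_hght;                      /* retained: not reset to missing */
    max_hght = max(max_hght, hght_ft);    /* tallest tree read so far */
    total_hght + hght_ft;                 /* sum statement: running total of heights */
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

PROC PRINT data = trees;
RUN;
```

If the values are carried forward as described, the last observation should show a max_hght of 138 and a total_hght of 640, the sum of the six heights.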

Are you getting the impression that this execution stuff is hard work? Look at all the work that SAS has done so far, and yet it has read in and constructed just one observation of our trees data set. Now SAS just has to do the same thing over and over again, namely reading a record into the input buffer, cycling through the statements in the DATA step, and writing the newly created observation to the output data set.

During the second iteration of the DATA step — that is, when _N_ = 2 — here's what the input buffer:

1	2	3	4	5	6	7	...	16	...	19	20	21	22	23	24	25	26	27	28	29
h	e	m	l	o	c	k	...	n		1	4	9		1	3	8			5	2

and program data vector:

_N_	_ERROR_	type	circ_in	hght_ft	crown_ft	volume
2	0	hemlock, eastern	149	138	52	15.9305

look like. And, at the bottom of the DATA step, the values in the program data vector are written to the data set as the second observation:

type	circ_in	hght_ft	crown_ft	volume
oak, black	222	105	112	26.9075
hemlock, eastern	149	138	52	15.9305

Then, SAS heads back up to the top of the DATA step again, sets _N_ to 3, retains the _ERROR_ value of 0, resets the user-defined variables in the program data vector to missing, reads in the next record into the input buffer, and creates the third observation. And on and on and on ... the execution phase continues in this manner until the end-of-file marker is reached in the raw data file. When there are no more records to read in the raw data file, the data portion of the new data set is complete.

Let's go back to our original program now:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
	volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
	DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

PROC PRINT data = trees;
RUN;

and have you launch and run the program. Upon doing so, you should see that at the end of the execution phase, the SAS log confirms that the temporary data set trees was successfully created with 6 observations and 5 variables:

NOTE: The data set WORK.TREES has 6 observations and 5 variables.

The message in the log window, as well as the output from the PRINT procedure:

Obs	type	circ_in	hght_ft	crown_ft	volume
1	oak, black	222	105	112	26.9075
2	hemlock, eastern	149	138	52	15.9305
3	ash, white	258	80	70	27.6890
4	cherry, black	187	91	75	16.5464
5	maple, red	210	99	74	22.7014
6	elm, american	229	127	104	34.6300

confirms that the automatic variables _N_ and _ERROR_ are used only to help SAS process the DATA step, and are therefore not written to the output data set.

So, in a nutshell, that's the execution phase of a DATA step that involves reading in raw data! Let's summarize, so we have the steps organized all in one spot:

1. At the beginning of the DATA step, SAS sets the variable values in the program data vector to missing.
2. Then, SAS reads the next record in the input raw data file into the input buffer.
3. SAS then proceeds through the DATA step, sequentially line by line, creating any new variable values along the way and storing them in the correct place in the program data vector.
4. At the end of the DATA step, SAS writes the (one) observation contained in the program data vector into the next available line of the SAS data set.
5. SAS returns to the beginning of the DATA step and repeats steps 1-4 until there are no more records to read from the input raw data file.
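The loop summarized above can be watched in the log without any special tools. The following is a minimal sketch, not part of the original lesson, that adds two PUT statements to Example 7.3: one before the INPUT statement and one after the assignment statement. PUT _ALL_ writes every variable in the program data vector, including _N_ and _ERROR_, to the log.

```sas
DATA trees;
    put 'Top of step:    ' _ALL_;    /* values right after the reset to missing */
    input type $ 1-16 circ_in hght_ft crown_ft;
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    put 'Bottom of step: ' _ALL_;    /* values just before the observation is written */
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;
```

Each "Top of step" line should show the user-defined variables as missing, and each "Bottom of step" line should show the values for one record. You should also see one extra "Top of step" line with _N_=7, because SAS begins a seventh iteration before the INPUT statement discovers that there are no more records to read.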

7.4 - DATA Step Debugger

In this section, we'll introduce you to a tool called the DATA step debugger that some SAS programmers like to use as a way to find errors that occur in the execution phase of their programs. In the end, you too may want to use it as a way of debugging your programs. It is important to remember, though, that the DATA step debugger works only at execution time. That means you can't use the DATA step debugger to help find compile-time errors such as missing semi-colons. Our main purpose of investigating the debugger now is to get a real-time, behind-the-scenes illustration of the execution phase when SAS encounters a DATA step that involves reading in raw data.

Example 7.5

The following DATA step is identical to the one that appears in Example 7.3, except here the DATA step debugger has been invoked by adding the DEBUG option to the end of the DATA statement:

DATA trees / DEBUG;
    input type $ 1-16 circ_in hght_ft crown_ft;
	volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
	DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

Launch and run the SAS program. Upon doing so, you should see two windows, the DEBUGGER SOURCE window and the DEBUGGER LOG window, open up.

If the DEBUGGER LOG window is not stacked on top of the DEBUGGER SOURCE window, you might try selecting Window > Resize in the SAS menu in order to get the windows stacked. You might also want to notice that once you invoke the DATA step debugger, as we just have, some new menu options appear at the top of your screen. The View, Run, and Breakpoint menus contain debugger commands.

Now that we have the debugger up and running, we are going to step through the program line by line, along the way asking SAS to display the values of all of the variables in the program data vector. To begin, activate the DEBUGGER LOG window by clicking on it once. You'll know that it is activated when its border becomes a bright blue as opposed to a faded blue. Let's start by asking SAS to show us the values of the variables in the program data vector. To do so, type EXAMINE _ALL_ on the command line at the bottom of the DEBUGGER LOG window, and then press your Enter key. (The command line is the area under the dashed line immediately following the greater than (>) symbol.) Upon doing so, the values of the five user-defined variables and the two automatic variables appear in the DEBUGGER LOG window:

DATA STEP Source Level Debugger
Stopped at line 2 column 5
> EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 1

It's exactly what we should expect to see at the beginning of the execution phase of the DATA step. The automatic variable _N_ is set to 1, _ERROR_ is set to 0, and the user-defined variables are each set to missing. Now, in order to tell SAS to execute the INPUT statement, either type Step on the command line and press the Enter key once, or simply press the Enter key once. You should see the command in the DEBUGGER LOG window:

DATA STEP Source Level Debugger
Stopped at line 2 column 5
> EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 1
>Step
Stepped to line 3 column 5

and you should see that SAS has advanced its processing by one line in the DEBUGGER SOURCE window:

DATA trees / DEBUG;
    input type $ 1-16 circ_in hght_ft crown_ft;
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    DATALINES;

Now that SAS has processed the INPUT statement for the first data record, let's see the values of the variables in the program data vector by again typing EXAMINE _ALL_ on the command line and pressing Enter once.

No surprises there, eh? SAS has read in the four data values for type, circ_in, hght_ft, and crown_ft. The value for volume remains missing because SAS hasn't yet executed the assignment statement. Let's tell SAS to do that by advancing its processor one line by pressing the Enter key once. Then type EXAMINE _ALL_ on the command line and press the Enter key again:

DATA STEP Source Level Debugger
Stopped at line 2 column 5
> EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 1
>Step
Stepped to line 3 column 5
>
Stepped to line 4 column 1
>EXAMINE _ALL_
type = oak, black
circ_in = 222
hght_ft = 105
crown_ft = 112
volume = 26.907511554
_ERROR_ = 0
_N_ = 1
>
Stepped to line 2 column 5

SAS advanced the processor one line as we requested, and there's that 26.9075 value that we were expecting volume to be assigned for the first observation. Now, here's the part that really illustrates how the DATA step works like a loop, repetitively executing statements to read data values and create observations one by one. Advance the processor another step by pressing the Enter key once. There you have it ... SAS moves the processor back up to the INPUT statement at the beginning of the DATA step:

DATA trees / DEBUG;
    input type $ 1-16 circ_in hght_ft crown_ft; 
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    DATALINES;

so that it is ready to create the next observation. Take a look at the contents of the program data vector by typing EXAMINE _ALL_ on the command line, and pressing the Enter key:

>
Stepped to line 2 column 5
>EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 2

You should not be surprised to see that SAS increased the value of _N_ to 2, retained the _ERROR_ value of 0, and reset the user-defined variables to missing.

You should be getting the idea of this. We are merely using the DATA step debugger to take a behind-the-scenes look at the execution phase as we described it in the last section. If you are finding that you are still learning something from this exercise, you can continue to alternate between advancing the processor and examining the variable values. Alternatively, you can move the process along by pressing the Enter key 16 times (I think) until SAS displays a message indicating that the DATA STEP program has completed execution:

Stepped to line 2 column 5
>
The DATA STEP program has completed execution
>EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 7

Now, if you EXAMINE _ALL_ the variables, you can see that the automatic variable _N_ has been increased to 7. Because there are only 6 records in the input raw data, there are no more records to read. SAS has thus completed the execution phase of the DATA step. Let's have you quit the debugger then by typing Quit on the command line and pressing the Enter key, or by selecting the Run menu and then selecting Quit debugger.

Incidentally, in the future, rather than typing EXAMINE _ALL_ on the command line, you could select the View menu, and then Examine values..., and then type _ALL_ in the first box, and select OK. And, rather than typing Step on the command line, and pressing the Enter key, you could select the Run menu, and then Step.

Example 7.6

At first glance, you might look at the following program and think that it is identical to the one in the previous example. If you look at the zero that appears in the value 105 in the "oak, black" record, however, you might notice that it is a little rounder than other zeroes that you've seen. That's because it is the letter O and not the number 0. When we ask SAS to execute the program, we should therefore expect an error when SAS tries to read the character value "1O5" in the first record for the numeric variable hght_ft. Let's use the SAS debugger again to see the behind-the-scenes execution of this program:

DATA trees / DEBUG;
    input type $ 1-16 circ_in hght_ft crown_ft;
	volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
	DATALINES;
oak, black        222 1O5 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

Launch and run the SAS program. Upon doing so, you should see the DEBUGGER SOURCE and DEBUGGER LOG windows open again. Again, if the DEBUGGER LOG window is not stacked on top of the DEBUGGER SOURCE window, select Window > Resize in the SAS menu in order to get the windows stacked.

Typing EXAMINE _ALL_ on the command line and pressing the Enter key, we see:

DATA STEP Source Level Debugger

Stopped at line 55 column 5
> EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 1

the initialization of the program data vector that we'd expect. Pressing the Enter key once to tell SAS to execute the INPUT statement, and then asking SAS to display the values in the program data vector by typing EXAMINE _ALL_ and pressing the Enter key, we see:

Stopped at line 56 column 5
> EXAMINE _ALL_
type = oak, black
circ_in = 222
hght_ft = .
crown_ft = 112
volume = .
_ERROR_ = 1
_N_ = 1

As expected, SAS is unable to read in the (mistaken) character value of "1O5" for the numeric hght_ft variable, and so SAS sets its value to missing. Upon encountering the error, SAS changes the value of the automatic variable _ERROR_ to 1.

Upon advancing the processor one line and examining the values of the variables, we see that the value of volume remains missing because SAS can't calculate it without the hght_ft value:

> EXAMINE _ALL_
type = oak, black
circ_in = 222
hght_ft = .
crown_ft = 112
volume = .
_ERROR_ = 1
_N_ = 1

Okay, one more time ... upon advancing the processor another step and examining the values of the variables, we see that SAS pops back up to the top of the data step to begin processing the second record:

> EXAMINE _ALL_
type =
circ_in = .
hght_ft = .
crown_ft = .
volume = .
_ERROR_ = 0
_N_ = 2

As you can see, SAS increased the automatic variable _N_ to 2, reset _ERROR_ to 0 (as there are as of yet no errors detected while processing the observation), and reset the user-defined variables to missing. That's all that I really wanted to illustrate here. If you find it educational, you can continue to alternate between advancing the processor and examining the variable values. Or, you can quit like I'm going to by typing Quit on the command line and pressing the Enter key.
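
If you'd rather not step through the debugger, here is a minimal sketch of another way to watch the same thing happen. It assumes that you are content to read the values in the log window. The PUT statement placed after the assignment statement is my addition and is not part of the example above:

```sas
DATA trees;
    INPUT type $ 1-16 circ_in hght_ft crown_ft;
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    /* Added for illustration: write the automatic variables and two  */
    /* data values to the log once per iteration of the DATA step     */
    PUT _N_= _ERROR_= hght_ft= volume=;
    DATALINES;
oak, black        222 1O5 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;
```

Upon running it, you should see one line in the log window for each iteration of the DATA step, with _N_ increasing from 1 to 6 and _ERROR_ equal to 1 only for the first record, the one containing the mistyped "1O5".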

7.5 - Types of Log Messages

When you are trying to write programs that work, you'll no doubt encounter some messages in the log window that you'll need to interpret to figure out what SAS is trying to tell you. In this section, we investigate three different kinds of messages (errors, warnings, and notes) that SAS displays in the log window.

Example 7.7: ERROR Messages

In general, when SAS displays ERROR messages in your log window — in red as illustrated — your program will not run because it contains some kind of syntax error or spelling mistake.

The following code causes SAS to print an ERROR message in the log window:

DATA one;
    input A B C;
    DATALINES;
1 2 3
4 5 6
7 8 9
;
RUN;

PROC PRINT / DATA = one;
RUN;

First, launch and run the program, and then look in the log window to see the ERROR message that SAS displays:

  PROC PRINT / DATA = one;
                -
                22
                200
NOTE: Writing HTML Body file: sashtml1.htm
ERROR 22-322: Syntax error, expecting one of the following: ;,
              BLANKLINE, CONTENTS, DATA, DOUBLE, GRANDTOTAL_LABEL,
              GRANDTOT_LABEL, GRAND_LABEL, GTOTAL_LABEL, GTOT_LABEL,
              HEADING, LABEL, N, NOOBS, NOSUMLABEL, OBS, ROUND, ROWS,
              SPLIT, STYLE, SUMLABEL, UNIFORM, WIDTH.
ERROR 200-322: The symbol is not recognized and will be ignored.
  RUN;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.50 seconds
      cpu time            0.31 seconds

as a result of the inappropriately placed forward slash (/) in the PROC PRINT statement.
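
For comparison, a sketch of the corrected program simply drops the slash, so that DATA= is read as an option of the PROC PRINT statement:

```sas
DATA one;
    INPUT A B C;
    DATALINES;
1 2 3
4 5 6
7 8 9
;
RUN;

/* No slash: DATA= is an ordinary option of the PROC PRINT statement */
PROC PRINT DATA = one;
RUN;
```

With the slash removed, the two ERROR messages should disappear from the log window and SAS should print the three observations.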

Example 7.8

The location of an error is typically easy to find because it is usually underlined, but it is often tricky trying to figure out the source of the error. Sometimes what is wrong in the program is not what is underlined in the log window but something else earlier in the program. The following program illustrates such an event:

DATA one;
    INPUT A B C
	DATALINES;
	1 2 3
	4 5 6
	7 8 9
	;
RUN;

First, review the program and note that the problem with the code is that the INPUT statement is missing its required semi-colon (;). Then, launch and run the program, and then look in the log window to see the ERROR message that the code produces:

  DATA one;
      INPUT A B C
      DATALINES;
      1 2 3
        -
        180
ERROR 180-322: Statement is not valid or it is used out of proper order.

      4 5 6
      7 8 9
      ;
  RUN;

ERROR: No DATALINES or INFILE statement.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.ONE may be incomplete.  When this step was
         stopped there were 0 observations and 4 variables.
WARNING: Data set WORK.ONE was not replaced because this step was
         stopped.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

You should see that SAS underlines the 1 in the first data line rather than the end of the INPUT statement. The moral of the story here is to not only look at what SAS underlines but also at the few lines of code immediately preceding the underlined statement.
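
As a sketch of the fix, restoring the semi-colon at the end of the INPUT statement is all that is needed, even though SAS pointed at a line further down:

```sas
DATA one;
    INPUT A B C;    /* the semi-colon that was missing */
    DATALINES;
1 2 3
4 5 6
7 8 9
;
RUN;
```

Once the INPUT statement is properly ended, SAS no longer treats DATALINES as a fourth variable name, and the log should report that WORK.ONE has 3 observations and 3 variables.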

Example 7.9: WARNING Messages

When SAS displays WARNING messages in your log window — in green as illustrated — your program will typically run. A warning may mean, however, that SAS has done something that you didn't intend. It is for this reason that you'll always want to check the log window after submitting a program to make sure that it doesn't contain WARNING messages about the execution of your program.

The following code results in SAS printing a WARNING message in the log window:

DATA example2;
   IMPUT a b c;
   DATALINES;
112 234 345
115 367   .
190 110 111
;
RUN;

As you can see by the red-colored font displayed in the (Enhanced) Program Editor, the keyword INPUT has been incorrectly typed as IMPUT. If you don't catch the misspelling in the Program Editor, SAS will, whenever possible, attempt to correct your spelling of certain keywords. In these cases, SAS prints a WARNING message in the log window to alert you to how it interpreted your program in order to get it to run.

Launch and run the program, and then look in the log window to see the WARNING message:

  DATA example2;
     IMPUT a b c;
        -----
        1
WARNING 1-322: Assuming the symbol INPUT was misspelled as IMPUT.

     DATALINES;

NOTE: The data set WORK.EXAMPLE2 has 3 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

that the code produces. Note, too, that in spite of the WARNING message, SAS is still able to complete the DATA step by changing the spelling of IMPUT to INPUT.

Example 7.10: NOTE Messages

NOTE messages, which are displayed in blue as illustrated, are less straightforward than either warnings or errors. Sometimes notes just give you information, like telling you the execution time of each step in your program. Sometimes, however, a NOTE can indicate a problem with the way SAS executes your program.

The following code results in SAS printing a NOTE in the log window:

DATA example2;
   INPUT a b c;
   DATALINES;
112 234 345
115 367
190 110 111
;
RUN;

Launch and run the program, and then look in the log window to see the NOTE that the code produces:

  ;
  RUN;

  DATA example2;
     INPUT a b c;
     DATALINES;

NOTE: SAS went to a new line when INPUT statement reached past the end
      of a line.
NOTE: The data set WORK.EXAMPLE2 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


  ;
  RUN;

You should see that SAS appropriately warns you that it went to a new line when the INPUT statement didn't find the third data value in the second data line. Incidentally, you can correct this problem either by adding the following line of code just before the INPUT statement:

INFILE DATALINES MISSOVER;

or by adding a missing value (.) to the end of the second data line.
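
Putting the first fix in context, a sketch of the whole DATA step might look like the following. The PRINT procedure at the end is my addition so that you can check the result:

```sas
DATA example2;
    /* MISSOVER: when a line runs out of values, set the remaining  */
    /* variables to missing rather than going to the next line      */
    INFILE DATALINES MISSOVER;
    INPUT a b c;
    DATALINES;
112 234 345
115 367
190 110 111
;
RUN;

PROC PRINT DATA = example2;
RUN;
```

Run this way, the data set should contain 3 observations, with c missing in the second, and the NOTE about SAS going to a new line should no longer appear.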

Example 7.11

Beware that not every NOTE that appears in the log window is a problem. The following code is an example in which SAS going to the new line is exactly what is wanted:

DATA example2;
   INPUT a b c;
   DATALINES;
101
111
118
215
620
910
;
RUN;

PROC PRINT data = example2;
RUN;

Obs	a	b	c
1	101	111	118
2	215	620	910

In this case, the programmer purposefully entered one data value in each record. Each time the INPUT statement runs out of values on a line, SAS goes to a new line to find the next one, and thereby reads in 2 observations with 3 variables, just as the programmer intended. Launch and run the program. Then, review the output to understand how SAS read in the data, and then review the log window:

  DATA example2;
     INPUT a b c;
     DATALINES;

NOTE: SAS went to a new line when INPUT statement reached past the end
      of a line.
NOTE: The data set WORK.EXAMPLE2 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


  ;
  RUN;
  PROC PRINT data = example2;
  RUN;

NOTE: There were 2 observations read from the data set WORK.EXAMPLE2.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

to see that the NOTE about SAS going to a new line when the INPUT statement reached past the end of a line is just what the doctor ordered.
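
If you'd like the program itself to say that each value sits on its own line, one option is the / line-pointer control in the INPUT statement, which tells SAS to move to the next record before reading the next variable. This lesson doesn't otherwise use it, so treat the following as a sketch:

```sas
DATA example2;
    /* Read a from one record, then advance one record for b, */
    /* and advance one more record for c                      */
    INPUT a / b / c;
    DATALINES;
101
111
118
215
620
910
;
RUN;

PROC PRINT DATA = example2;
RUN;
```

Written this way, SAS should again read in 2 observations with 3 variables. Because the line changes are now requested explicitly, the NOTE about going to a new line should not appear.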

7.6 - Good Programming Practices

Half the battle of writing programs that work is to adhere to good programming practices. In this section, we highlight four programming practices that can make your job of writing programs that work that much easier. The four cardinal practices are:

Write programs that are easy to read.
Test each part of your program before proceeding to the next part.
Test your programs with small data sets.
Test your programs with representative data.

Trust me ... these practices really can make you a more efficient SAS programmer.
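
As a quick illustration of the third practice, here is a minimal sketch of one way to carve a small test data set out of a larger one. The data set name small and the choice of 3 observations are just assumptions for the sake of the example. The OBS= data set option tells SAS to stop reading after the given observation:

```sas
DATA trees;
    INPUT type $ 1-16 circ_in hght_ft crown_ft;
    DATALINES;
oak, black        222 105 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;

/* Hypothetical test data set: keep only the first 3 observations */
DATA small;
    SET trees (OBS = 3);
RUN;

PROC PRINT DATA = small;
RUN;
```

You can then develop and test the rest of your program on the small data set before turning it loose on the full one.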

Example 7.12: Write Programs That are Easy to Read

You may well recall me harping about writing programs that are easy to read and well commented. A primary argument for writing programs that are easy to read is that doing so makes it easier to prevent and/or find errors.

The following example program is needlessly challenging to read:

data example1; input a b c; d=a
+ b-c;e=a-c;f=c+b;g=c-b;datalines;
1 2 3
4 5 6
7 8 9
;
run; proc means; var a; run; proc print; run;

After reviewing the program to appreciate how needlessly awkward it is to read, you can go ahead and launch and run the SAS program.

Although you can write SAS statements in almost any format, a neat and consistent layout enhances readability and helps you understand the purpose of the program. In general, it's a good idea to:

Put only one SAS statement on a line.
Begin DATA and PROC statements in the first column of the program editor.
Indent statements within your DATA steps and PROC steps.
Include a RUN statement after every DATA step or PROC step.
Begin RUN statements in the first column of the program editor.
Comment your programs judiciously — that is, don't make too few comments so that it is difficult for you or others to know what your programs are doing, and don't make too many comments so that it is difficult to read your programs.

You might want to entertain yourself by following these guidelines while editing the above program just to make it more readable.
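
For reference, here is one way the cleaned-up program might look. It is only a sketch, as there is more than one reasonable layout. The comments and the DATA= options on the two procedures are my additions. The original relied on SAS using the most recently created data set:

```sas
/* Read three values per record and derive four new variables */
DATA example1;
    INPUT a b c;
    d = a + b - c;
    e = a - c;
    f = c + b;
    g = c - b;
    DATALINES;
1 2 3
4 5 6
7 8 9
;
RUN;

/* Summarize the variable a */
PROC MEANS DATA = example1;
    VAR a;
RUN;

/* Print the data set to check the derived values */
PROC PRINT DATA = example1;
RUN;
```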

Example 7.13: Test Each Part of Your Program

You can increase your programming efficiency tremendously by making sure each part of your program is working before moving on to write the next part. The simplest way to illustrate how wrong you can go is to imagine that you've just spent the last two weeks performing a complex statistical analysis on a data set only to discover later that the data set contains errors. If you had only used a simple PRINT procedure first to print and review the data set, you would have saved yourself lots of useless work. The reason why I cite this particular example is that it has probably happened at least once to every statistician working out there. This particular statistician is trying to save you from going down the same path!

The following program may appear to work just fine as SAS does indeed produce an answer. If you look carefully at both the input and output, though, you'll see that the answer is not what we should expect:

DATA example2;
   INPUT a b c;
   DATALINES;
112 234 345
115 367
190 110 111
;
RUN;
PROC MEANS;
  var c;
RUN;

The MEANS Procedure
Analysis variable: c
N	Mean	Std Dev	Minimum	Maximum
2	267.5000000	109.6015511	190.0000000	345.0000000

We'll learn later in this course that the MEANS procedure allows us to calculate the mean, as well as other summary statistics, of a numeric variable. In this program, the MEANS procedure tells SAS to calculate the mean of the numeric variable c.

Go ahead and launch and run the SAS program. Then, in reviewing the output, compare the answer we obtained for the mean of the variable c (267.5) with the answer that we should have obtained (228, from 345 plus 111 all divided by 2). Then, if you look at the log window:

  DATA example2;
     INPUT a b c;
     DATALINES;
NOTE: SAS went to a new line when INPUT statement reached past the end
      of a line.
NOTE: The data set WORK.EXAMPLE2 has 2 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
  ;
  RUN;
  PROC MEANS;
    var c;
  RUN;
NOTE: There were 2 observations read from the data set WORK.EXAMPLE2.
NOTE: PROCEDURE MEANS used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

you might be able to figure out what went awry. In the next lesson, we'll further investigate the errors, warnings, and notes, such as this one about SAS going to a new line, that SAS prints in the log window to tell you that your program might not be doing what you expect.

In this example, SAS couldn't find the value of c in the second line of data, so as reported in the log window, SAS went to the next line of data to find it. If you print the example2 data set by adding a PRINT procedure after the DATA step, you'll see what I mean:

Obs	a	b	c
1	112	234	345
2	115	367	190

Then, you might want to add a missing value (.) placeholder to the second line of data (after the 367) and re-run the program to see that it now works as it should.
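
A sketch of the repaired program, with the missing value placeholder in place and a PRINT procedure added ahead of the MEANS procedure so that the data get checked before they get analyzed:

```sas
DATA example2;
    INPUT a b c;
    DATALINES;
112 234 345
115 367   .
190 110 111
;
RUN;

/* Check the data before analyzing it */
PROC PRINT DATA = example2;
RUN;

PROC MEANS DATA = example2;
    VAR c;
RUN;
```

This time the data set should contain 3 observations, and the MEANS procedure should report a mean of 228 for c based on the 2 nonmissing values.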

Example 7.14

While testing your programs, you might find the PUT statement to be particularly useful. The following program reads the tree data into the trees data set and calculates the volume of each tree as in Example 7.3. Here, though, a few PUT statements have been added to help the programmer verify that the program is doing what she expects:

DATA trees;
    input type $ 1-16 circ_in hght_ft crown_ft;
    volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
    if volume = . then do;
        PUT ' ';
        PUT 'DATA ERROR!!! ';
        PUT ' ';
        PUT ' ';
    end;
    else if volume lt 20 then PUT 'Small tree ' _N_= volume=;
    else if volume ge 20 then PUT 'Large tree ' _N_= volume=;
    DATALINES;
oak, black        222 1O5 112
hemlock, eastern  149 138  52
ash, white        258  80  70
cherry, black     187  91  75
maple, red        210  99  74
elm, american     229 127 104
;
RUN;
PROC PRINT data = trees;
RUN;

Obs	type	circ_in	hght_ft	crown_ft	volume
1	oak, black	222	.	112	.
2	hemlock, eastern	149	138	52	15.9305
3	ash, white	258	80	70	27.6890
4	cherry, black	187	91	75	16.5464
5	maple, red	210	99	74	22.7014
6	elm, american	229	127	104	34.6300

The PUT statement writes messages in the log window. If you launch and run the SAS program and take a look at the log window, you should see that a portion of the log window contains messages created as a result of executing the PUT statements:

  DATA trees;
      input type $ 1-16 circ_in hght_ft crown_ft;
      volume = (0.319*hght_ft)*(0.0000163*circ_in**2);
      if volume = . then do;
           PUT ' ';
           PUT 'DATA ERROR!!! ';
           PUT ' ';
           PUT ' ';
      end;
      else if volume lt 20 then PUT 'Small tree ' _N_= volume=;
      else if volume ge 20 then PUT 'Large tree ' _N_= volume=;
      DATALINES;
NOTE: Invalid data for hght_ft in line 187 23-25.
DATA ERROR!!!
RULE:      ----+----1----+----2----+----3----+----4----+----5----+----6-
           oak, black        222 1O5 112
type=oak, black circ_in=222 hght_ft=. crown_ft=112 volume=. _ERROR_=1
_N_=1
Small tree _N_=2 volume=15.930518479
Large tree _N_=3 volume=27.689026464
Small tree _N_=4 volume=16.546376146
Large tree _N_=5 volume=22.70137023
Large tree _N_=6 volume=34.630038398
NOTE: Missing values were generated as a result of performing an
      operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      1 at 177:20
NOTE: The data set WORK.TREES has 6 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
  ;
  RUN;
  PROC PRINT data = trees;
  RUN;
NOTE: There were 6 observations read from the data set WORK.TREES.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

7.7 - Summary

We've spent most of our time in this lesson learning how SAS processes programs and, in particular, DATA steps. We did this with the intention of improving our programming efficiency. In short, the morals of the story — and the ones that I am therefore allowed to harp on throughout this semester — are:

write programs that are easy to read
write programs that are well commented
and use the PRINT procedure freely, freely, freely to test each part of your program before proceeding
read the log window every time you run a SAS program
read the log window every time you run a SAS program
and did I tell you yet to read the log window every time you run your SAS program?
