Lesson 13: Reading SAS Data Sets

Overview

We'll start this course by investigating something that you already know how to do well, namely reading SAS data sets. Well, that's not quite true — this lesson actually concerns a bit more than that. We will first review how to read data already contained in one SAS data set into another SAS data set. But then we'll explore the many ways in which a SAS data set can be modified. Some of the ways we will have seen before, such as changing a variable's attributes, while other ways will be new, such as using an accumulator variable to sum the contents of another variable. We'll then learn how to identify the first and last observations in a group, as well as the last observation in a data set. Then, we'll close out the lesson by discussing the process SAS follows when reading in a SAS data set.

Objectives

Upon completion of this lesson, you should be able to:

use a SET statement to read data into a SAS data set from another SAS data set
use an IF-THEN-DELETE statement to exclude observations from a SAS data set
use a subsetting IF statement to include observations in a SAS data set
use a DROP statement to reduce the number of variables in a SAS data set
use a LABEL, FORMAT, or LENGTH statement to change the attributes of a variable
create and use an accumulator variable
use a SELECT statement to assign values to a variable conditionally
use a SET and BY statement together in a DATA step in order to identify the first and last observations of a group and/or subgroup
use FIRST.variables and LAST.variables in order to create new variables, modify the structure of a data set, etc.
use the SET statement's END= option in order to identify the last observation in a SAS data set
describe how SAS processes a DATA step when it involves using a SET statement to read from one SAS data set into another SAS data set

13.1 - Reading a Single SAS Data Set

In this section, we'll review how to read data already contained in one SAS data set into another SAS data set.

Example 13.1

The following program uses the DATA step's SET statement to create a temporary SAS data set called work.penngolf, which is identical to the permanent SAS data set called stat481.penngolf:

Copy the code below:

OPTIONS PS = 58 LS = 72 NODATE NONUMBER;

LIBNAME stat481 'C:\yourdrivenIfame\Stat481WC\01sasdata\sasndata';

DATA penngolf;
   set stat481.penngolf;
RUN;

PROC PRINT data = penngolf NOOBS;
   title 'The penngolf data set';
RUN;

The penngolf data set
ID	Name	Architect	Year	Type	Par	Yards	Slope	USGA
101	Toftrees	Ed Ault	1968	Resort	72	7018	134	74.3
102	Penn State Blue	Willie Park, Jr.	1921	Public	72	6525	128	72.0
103	Centre Hills	Alex Findlay	1921	Private	71	6392	128	71.2
104	Lewistown CC		.	Private	72	6779	125	72.3
105	State College Elks	Lowell Erdman	1973	SemiPri	71	6369	123	70.9
106	Park Hill CC	James Harrison	1966	SemiPri	70	6004	126	69.3
107	Sinking Valley CC	Ed Ault	1967	SemiPri	72	6755	132	73.4
108	Williamsport CC	A.W Tillinghast	1909	Private	71	6489	131	71.9
109	Standing Stone GC	Geoffrey Cornish	1973	SemiPri	70	6593	120	71.4
110	Bucknell GC		1960	SemiPri	70	6253	132	70.0

Of course, the LIBNAME statement just tells SAS that the nickname for the directory in which the permanent SAS data set resides is stat481. Now for the DATA step:

Because a one-level name is used, the DATA statement tells SAS to create a temporary data set called penngolf. Recall that SAS stores all temporary data sets in a temporary library called work. Therefore, we can refer to the created data set either by its one-level name penngolf, or by its two-level name work.penngolf.
The SET statement tells SAS to assign the data in the existing permanent SAS data set stat481.penngolf — observation by observation — to the temporary SAS data set called penngolf. Because the variables in the existing data set have already been named, no INPUT statement is necessary. In fact, if you were to include an INPUT statement, SAS would croak back at you.

That's all there is to it. Before we have you run this thing, let me confess something to you — this has got to be the silliest program around! There is really no good reason for taking a permanent SAS data set and simply turning it into a temporary SAS data set without including other statements in the DATA step to modify the data set in some way. Doing so just reduces the efficiency of your programs. Instead, you should just go ahead and use the permanent SAS data set in any procedure you want. What I am begging for here is not to let me (or anyone else!) see you mimic this kind of DATA step in your future programs. I certainly wouldn't want you to pick up any bad habits from this course.

That confession made, let's run the program! Before the program will work for you, you will, of course, need to download and save the penngolf data set (click to save!) in a convenient location on your computer. Then, launch the SAS program, and edit the LIBNAME statement so it reflects the location in which you saved the data set on your computer. Then, run the program, and review the output to familiarize yourself with the data set that we'll work with throughout the remainder of the lesson.

13.2 - Manipulating Data in a SAS Data Set

In this section, we'll review the many ways in which a SAS data set can be modified, including:

reducing the number of observations in a data set
reducing the number of variables in a data set
creating or modifying a variable
changing a variable's attributes
creating an accumulator variable
assigning values conditionally

While many of the methods will already be familiar to you, some you will see just for the first time.

Reducing the Number of Observations in a Data Set

There may be situations in which you'd like to exclude some of the observations in a SAS data set from an analysis that you are conducting or a report that you are generating. I like to think of doing such a thing as "making a data set shorter." Although there are other ways, we'll investigate two common ways of reducing the number of observations in a data set here. We can use an IF-THEN-DELETE statement to exclude observations from the output data set. Or we can use a subsetting IF statement to include observations in our output data set. We will investigate yet another way of reducing the number of observations in our data sets in the next lesson.

Example 13.2

The following program uses an IF-THEN-DELETE statement to exclude golf courses whose par is 70 from the penngolf data set:

DATA penngolf;
     set stat481.penngolf;
     if par = 70 then DELETE;
RUN;
PROC PRINT data = penngolf NOOBS;
     title 'The penngolf data set';
RUN;

The penngolf data set
ID	Name	Architect	Year	Type	Par	Yards	Slope	USGA
101	Toftrees	Ed Ault	1968	Resort	72	7,018	134	74.3
102	Penn State Blue	Willie Park, Jr.	1921	Public	72	6,525	128	72.0
103	Centre Hills	Alex Findlay	1921	Private	71	6,392	128	71.2
104	Lewistown CC		.	Private	72	6,779	125	72.3
105	State College Elks	Lowell Erdman	1973	SemiPri	71	6,369	123	70.9
107	Sinking Valley CC	Ed Ault	1967	SemiPri	72	6,755	132	73.4
108	Williamsport CC	A.W Tillinghast	1909	Private	71	6,489	131	71.9

The IF-THEN-DELETE statement is pretty straightforward, and therefore probably doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that golf courses whose par is 70 were excluded from the temporary penngolf data set.

Example 13.3

The following program uses a subsetting IF statement to include only those golf courses whose par is greater than 70 in the penngolf data set:

DATA penngolf;
     set stat481.penngolf;
     if par GT 70;
RUN;
PROC PRINT data = penngolf NOOBS;
     title 'The penngolf data set';
RUN;

The penngolf data set
ID	Name	Architect	Year	Type	Par	Yards	Slope	USGA
101	Toftrees	Ed Ault	1968	Resort	72	7,018	134	74.3
102	Penn State Blue	Willie Park, Jr.	1921	Public	72	6,525	128	72.0
103	Centre Hills	Alex Findlay	1921	Private	71	6,392	128	71.2
104	Lewistown CC		.	Private	72	6,779	125	72.3
105	State College Elks	Lowell Erdman	1973	SemiPri	71	6,369	123	70.9
107	Sinking Valley CC	Ed Ault	1967	SemiPri	72	6,755	132	73.4
108	Williamsport CC	A.W Tillinghast	1909	Private	71	6,489	131	71.9

Again, the subsetting IF statement is pretty straightforward, and therefore probably also doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that only golf courses whose par is greater than 70 were included in the temporary penngolf data set. You might also want to take note that the resulting data set is the same as the resulting data set from the previous example — just accomplished in a different way!

Reducing the Number of Variables in a Data Set

There may be situations in which you'd like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as "making a data set thinner." Although we'll investigate how to do this much more extensively in the next lesson, we'll at least introduce the concept by illustrating the use of a DROP statement.

Example 13.4

The following program uses a DROP statement to tell SAS to drop the Architect variable from the temporary output data set penngolf:

DATA penngolf;
     set stat481.penngolf;
     drop Architect;
RUN;
PROC PRINT data = penngolf NOOBS;
     title 'The penngolf data set';
RUN;

The penngolf data set
ID	Name	Year	Type	Par	Yards	Slope	USGA
101	Toftrees	1968	Resort	72	7,018	134	74.3
102	Penn State Blue	1921	Public	72	6,525	128	72.0
103	Centre Hills	1921	Private	71	6,392	128	71.2
104	Lewistown CC	.	Private	72	6,779	125	72.3
105	State College Elks	1973	SemiPri	71	6,369	123	70.9
106	Park Hill CC	1966	SemiPri	70	6,004	126	69.3
107	Sinking Valley CC	1967	SemiPri	72	6,755	132	73.4
108	Williamsport CC	1909	Private	71	6,489	131	71.9
109	Standing Stone GC	1973	SemiPri	70	6,593	120	71.4
110	Bucknell GC	1960	SemiPri	70	6,253	132	70.0

Self-explanatory again, eh? In the next lesson, we'll learn that rather than using a DROP statement to exclude variables, we could use a KEEP statement to include variables in the output data set. We'll also discuss the difference between the DROP and KEEP statements and the alternative DROP= and KEEP= options of the DATA and SET statements. For now, let's have you launch and run the program, and review the output to convince yourself that the Architect variable was not included in the output temporary penngolf data set.

Creating or Modifying a Variable

There is definitely nothing new here! Surely, you've created a new variable or two in the myriad SAS programs you have written by now. We include it here only for the sake of completeness, so it can be acknowledged as one way in which a SAS data set can be modified.

Example 13.5

The following program uses the Slope and USGA variables in conjunction with an assignment statement to create a new variable called Bogey:

DATA penngolf;
     set stat481.penngolf;
     Bogey = 0.186*Slope + USGA;
RUN;
PROC PRINT data = penngolf NOOBS;
     title 'The penngolf data set';
       var Name Yards Par Slope USGA Bogey;
RUN;

The penngolf data set
Name	Yards	Par	Slope	USGA	Bogey
Toftrees	7,018	72	134	74.3	99.224
Penn State Blue	6,525	72	128	72.0	95.808
Centre Hills	6,392	71	128	71.2	95.008
Lewistown CC	6,779	72	125	72.3	95.550
State College Elks	6,369	71	123	70.9	93.778
Park Hill CC	6,004	70	126	69.3	92.736
Sinking Valley CC	6,755	72	132	73.4	97.952
Williamsport CC	6,489	71	131	71.9	96.266
Standing Stone GC	6,593	70	120	71.4	93.720
Bucknell GC	6,253	70	132	70.0	94.554

Launch and run the program, and review the output to convince yourself that the Bogey variable was calculated as suggested by the assignment statement.

Changing a Variable's Attributes

Recall that one of the products of the compile phase is the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable's name, type, length, format, informat, and label. As you know, we can use a LENGTH statement to modify the length of a variable, a FORMAT statement to modify the format of a variable, and a LABEL statement to associate a descriptive label to a variable.

Example 13.6

The following program uses 1) a LABEL statement to associate descriptive labels to some of the variables in the penngolf data set and 2) a FORMAT statement to tell SAS how to display two of the variables (Bogey and Yards):

DATA penngolf;
     set stat481.penngolf;
     Bogey = 0.186*Slope + USGA;
     label Bogey = 'Bogey Rating'
          USGA = 'USGA Rating'
          Slope = 'Course Slope'
          Par = 'Course Par'
          Yards = 'Total Yardage';
     format Bogey 4.1 Yards comma5.;
RUN;
PROC PRINT data = penngolf NOOBS LABEL;
     title 'The penngolf data set';
   var Name Yards Par Slope USGA Bogey;
RUN;

The penngolf data set
Name	Total Yardage	Course Par	Course Slope	USGA Rating	Bogey Rating
Toftrees	7,018	72	134	74.3	99.2
Penn State Blue	6,525	72	128	72.0	95.8
Centre Hills	6,392	71	128	71.2	95.0
Lewistown CC	6,779	72	125	72.3	95.6
State College Elks	6,369	71	123	70.9	93.8
Park Hill CC	6,004	70	126	69.3	92.7
Sinking Valley CC	6,755	72	132	73.4	98.0
Williamsport CC	6,489	71	131	71.9	96.3
Standing Stone GC	6,593	70	120	71.4	93.7
Bucknell GC	6,253	70	132	70.0	94.6

I personally have never seen the need for it, but a label can be as long as 256 characters. Seems a bit like overkill to me. Anyway, launch and run the program, and review the output to convince yourself that the variables were labeled and formatted as suggested.

Creating an Accumulator Variable

Now, this topic is definitely new to us! You might often find yourself in a situation in which you want to create a variable that accumulates the values of another variable. For example, if you have a data set that contains monthly sales, you might want to create a variable that contains year-to-date sales. To create such a variable, you'd most likely want to take advantage of what SAS calls an accumulator variable. To add the contents of a variable, or the result of an expression, to an accumulator variable, we need to use what SAS calls a sum statement.

The general form of a sum statement is:

variable + expression;

where:

variable tells SAS the desired name of your accumulator variable (yeartodate, perhaps). The variable, which must be numeric, is automatically set to 0 before the first observation in the data set is read.
expression is any valid SAS expression. It might be something as simple as the name of the variable that you want to add up. One thing that is important to keep in mind is that if the expression produces a missing value, or a variable's value is missing, the sum statement treats it like a zero. This is in stark contrast to an assignment statement, in which SAS assigns the variable to the left of the equal sign (=) a missing value if any variable on the right side of the equal sign (=) contains a missing value.

In short, the sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric accumulator variable that is on the left side of the plus sign. Then, as you can imagine would have to be the case in order to work properly, the accumulator variable's value is retained from one iteration of the DATA step to the next. That is, the accumulator variable is not set to missing as variables usually are when reading data. Let's take a look at an example!

Example 13.7

The following program uses a sum statement and an accumulator variable called TotalYards to tell SAS to tally up the number of Yards that the golf courses in the penngolf data set have:

DATA penngolf;
    set stat481.penngolf;
    TotalYards + Yards;
 RUN;
 
 PROC PRINT data = penngolf NOOBS;
    title 'The penngolf data set';
    var Name Yards TotalYards;
 RUN;

The penngolf data set
Name	Yards	Total Yards
Toftrees	7,018	7018
Penn State Blue	6,525	13543
Centre Hills	6,392	19935
Lewistown CC	6,779	24714
State College Elks	6,369	33083
Park Hill CC	6,004	39087
Sinking Valley CC	6,755	45842
Williamsport CC	6,489	52331
Standing Stone GC	6,593	58924
Bucknell GC	6,253	65177

Launch and run the program, and review the output. So, let's see here, the value of TotalYards is set to 0 before the first observation is read. Then, as you can see, the value of TotalYards in the first observation becomes 0 plus 7018, or 7018. Then, the value of TotalYards in the second observation becomes 7018 plus 6525, or 13543. Then, the value of TotalYards in the third observation becomes 13543 plus 6392, or 19935. And so on.

Assigning Values Conditionally

You learned how to use IF-THEN-ELSE statements in Stat 480 as a way of assigning values to variables conditionally. Here, we'll explore the SELECT group as an alternative method.

Example 13.8

The following program uses a SELECT group to create a numeric variable called AssnFee that depends on the values of the character variable Type:

DATA penngolf;
    set stat481.penngolf;
    select (Type);
        when ("Resort")  AssnFee = 5000;
        when ("Private") AssnFee = 4000;
        when ("SemiPri") AssnFee = 2000;
        when ("Public")  AssnFee = 1000;
        otherwise AssnFee = .;
    end;
    format Yards AssnFee comma5.;
 RUN;

 PROC PRINT data = penngolf NOOBS;
    title 'The penngolf data set';
    var Name Yards Type AssnFee;
 RUN;

The penngolf data set
Name	Yards	Type	Assn Fee
Toftrees	7,018	Resort	5,00
Penn State Blue	6,525	Public	1,000
Centre Hills	6,392	Private	4,000
Lewistown CC	6,779	Private	4,000
State College Elks	6,369	SemiPri	2,000
Park Hill CC	6,004	SemiPri	2,000
Sinking Valley CC	6,755	SemiPri	2,000
Williamsport CC	6,489	Private	4,000
Standing Stone GC	6,593	SemiPri	2,000
Bucknell GC	6,253	SemiPri	2,000

As you can see, a SELECT group is comprised of a SELECT statement, a series of required WHEN statements, an optional OTHERWISE statement, and a required END statement. The SELECT statement begins, and the END statement ends a SELECT group. The easiest way to explain this SELECT group is probably just to compare it to the comparable IF-THEN-ELSE statement:

if (Type = "Resort") then AssnFee = 5000;
else if (Type = "Private") then AssnFee = 4000;
else if (Type = "SemiPri") then AssnFee = 2000;
else if (Type = "Public") then AssnFee = 1000;
else AssnFee = .;

That is, in this example, we've told SAS that we want to compare the variable Type to all of the when-conditions appearing in the WHEN statements. If the value of Type for a particular observation matches a when-condition, then the statement that follows that when-condition is executed. If the value of Type for a particular observation does not match any of the when-conditions, SAS executes the OTHERWISE statement.

Launch and run the program, and review the output to convince yourself that the SELECT group functions just like above the IF-THEN-ELSE statement. You might also want to note in this example that we've optionally taken a shortcut of including the variable Type in parentheses in the SELECT statement. This shortcut tells SAS to compare the variable of Type to the values in the parentheses in the WHEN statements. If we did not take this optional shortcut, our SELECT group would have to be written as follows:

select; when (Type = "Resort") AssnFee = 5000;
	when (Type = "Private") AssnFee = 4000;
	when (Type = "SemiPri") AssnFee = 2000;
	when (Type = "Public") AssnFee = 1000;
	otherwise AssnFee = .;
end;

You might want to edit and re-run the program to convince yourself that the two SELECT groups are equivalent.

Now you might be wondering when you should use a SELECT group, and when you should use IF-THEN-ELSE statements. When you have a long series of mutually exclusive conditions and the comparison is numeric, using a SELECT group is a bit more efficient than using a series of IF-THEN-ELSE statements. When you need SAS to evaluate just a few conditions (like one or two), you might find it more straightforward to just use an IF-THEN-ELSE statement. You wouldn't lose any efficiency in doing so. That said, if efficiency seems like a minor issue to you, you might just find it easier to read and debug programs containing SELECT groups.

Example 13.9

The following program uses a SELECT group to create a new variable called Course whose value depends on the value of Yards:

DATA penngolf;
    set stat481.penngolf;
    LENGTH Course $ 7;
    select;
        when (Yards GE 6700)          Course = 'Long';
        when (6400 LE Yards LT 6700)  Course = 'Medium';
        when (Yards LT 6400)          Course = 'Short';
        otherwise                     Course = 'Unknown';
    end;
    format Yards comma5.;
 RUN;

 PROC PRINT data = penngolf NOOBS;
    title 'The penngolf data set';
    var Name Yards Course;

The penngolf data set
Name	Yards	Course
Toftrees	7,018	Long
Penn State Blue	6,525	Medium
Centre Hills	6,392	Short
Lewistown CC	6,779	Long
State College Elks	6,369	Short
Park Hill CC	6,004	Short
Sinking Valley CC	6,755	Long
Williamsport CC	6,489	Medium
Standing Stone GC	6,593	Medium
Bucknell GC	6,253	Short

In this example, we did not include a variable (or expression) in parentheses in the SELECT statement. As SAS processes an observation, it starts then with the first WHEN statement. If the when-condition is true, SAS executes the statement that follows the when-condition. If it is false, SAS proceeds to the next WHEN statement. If its when-condition is true SAS executes the statement that follows the when-condition. And so on ... SAS proceeds in this way until it reaches the OTHERWISE statement.

Let's have you launch and run the program, and review the output to convince yourself that SAS executes the SELECT group just as it would the comparable IF-THEN-ELSE statement.

Oh, you might also want to remind yourself why it is necessary to use the LENGTH statement to tell SAS explicitly that the length of the Course variable is 7 characters. Recall that, by default, SAS defines the length of a new character variable to be the same as the length of the first value assigned to the variable in the DATA step. Here, then, in the absence of the LENGTH statement, SAS would define the length of the Course variable to be 4, because Long is the first value to be associated with Course. That then means that the values assigned to the Course variable would be Long, Medi, Shor, and Unkn because of the truncation that would ensue. To see this firsthand, comment out (or delete) the LENGTH statement, re-run the program, and review the resulting output. Point made?! Let's take a look at one more example!

Example 13.10

The following SAS program is identical to the previous program, except the order of the conditions in the SELECT group has been reversed and the LENGTH statement has been removed:

DATA penngolf;
    set stat481.penngolf;
    select;
        when (Yards = .)       Course = 'Unknown';
        when (Yards LT 6400 )  Course = 'Short';
        when (Yards LT 6700)   Course = 'Medium';
        otherwise              Course = 'Long';
    end;
    format Yards comma5.;
 RUN;
 PROC PRINT data = penngolf NOOBS;
    title 'The penngolf data set';
    var Name Yards Course;
 RUN;

The penngolf data set
Name	Yards	Course
Toftrees	7,018	Long
Penn State Blue	6,525	Medium
Centre Hills	6,392	Short
Lewistown CC	6,779	Long
State College Elks	6,369	Short
Park Hill CC	6,004	Short
Sinking Valley CC	6,755	Long
Williamsport CC	6,489	Medium
Standing Stone GC	6,593	Medium
Bucknell GC	6,253	Short

Launch and run the program, and review the output to convince yourself that this program produces output identical to the output produced by the previous program. Then, there are two points that I'd like to make concerning this program.

First, the LENGTH statement is no longer necessary. Here, SAS would define the length of the Course variable to be 7, because Unknown is the first value to be associated with Course. Since Unknown is the longest value that we assign to Course, the default length of 7 works just fine for us. No need to override the default!

Second, consider the yardage of 6392 for the Centre Hills golf course. If you look at the SELECT group, you can see that a yardage of 6392 meets the condition of both the second and third WHEN statements. That is, 6392 is less than 6400, and it is less than 6700. How does SAS behave in this case? The answer lies in knowing that if more than one WHEN statement is true, SAS uses only the first WHEN statement. Once SAS finds a when-condition that is true, no other when-conditions are evaluated for a particular observation. So, here SAS deems that the second WHEN statement is true when Yards are 6392, and therefore assigns Short to the Course variable. SAS then steps out of the SELECT group, and at the end of the DATA step, moves on to process the next observation in the data set.

Incidentally, the first WHEN statement in our SELECT group:

when (Yards = .) Course = 'Unknown';

should remind us of another important point. Do you remember that good old programming habit about always programming for missing values? If we removed the above WHEN statement from our SELECT group, what value would SAS assign to Course if Yards were missing for a particular golf course? Remembering that a missing value (.) is considered smaller than any other numerical value, SAS would consider the missing value to be less than 6400, and Course would be errantly assigned to the value of Short. So, the good old programming habit holds for SELECT groups as well: always program for missing values.

13.3 - Finding First and Last Observations

In this section, we learn how to use a BY statement, in conjunction with a SET statement, to identify the first and last observations among a group of observations. As we'll see in the examples that follow, once we've identified the first and last observations in a group, we can use the information to modify existing variables, create new variables, change the structure of our data set, and so on.

Throughout this section, we'll work with a permanent SAS data set called sales:

Store	Dept	Quarter	Sales
101	10	1	110001.50
101	10	2	113101.20
101	10	3	111932.15
101	10	4	99901.10
101	20	1	110002.36
101	20	2	99922.39
101	20	3	98832.98
101	20	4	110101.70
121	20	1	121947.10
121	20	2	119964.69
121	20	3	122136.28
121	20	4	120111.11
121	10	1	127192.92
121	10	2	125280.13
121	10	3	128203.56
121	10	4	123632.29
109	10	1	120422.77
109	10	2	123984.32
109	10	3	121801.29
109	10	4	122125.66
109	30	1	98310.13
109	30	2	97331.25
109	30	3	96386.28
109	30	4	98511.90
109	20	1	115239.09
109	20	2	113001.98
109	20	3	114234.32
109	20	4	114122.65

containing four variables:

Store: a three-digit identification number of a store
Dept: a two-digit number identifying a department within each store (10 = clothing, 20 = housewares, 30 = sporting goods, say)
Quarter: quarter of the year (1 = first three months, 2 = second three months, ..., and so on)
Sales: the total dollar amount of the sales for the quarter in the given store's department

Let's be concrete here ... suppose we use a BY statement, in conjunction with a SET statement, to tell SAS to group the observations in the sales data set by Store:

DATA salesgrouped;
   set sales;
   by Store;
RUN;

Then, SAS knows that the first eight observations when Store = 101 comprise the first group, the next eight observations when Store = 121 comprise the second group, and the last twelve observations when Store = 109 comprise the last group. Well, okay, that's not technically quite correct! There's a little rule that we can't ignore ... when you use the BY statement with a SET statement, the data set(s) that are listed in the SET statement must be sorted by the values of the BY variable(s). As you can see, the sales data set is not yet sorted by the Store value. We'll worry about such details in the programs that follow. For now, know that our interest is in identifying the first and last observations within the Store = 101 group, the first and last observations within the Store = 121 group, and so on. Let's work our way through an example!

Finding the First and Last Observations in a Group

When you specify just one variable name in a BY statement, you can identify the first and last observations in a group.

Example 13.11

The following program tells SAS to process the sales data set by Store, just so we can get a behind-the-scenes look at how SAS groups observations and how we can subsequently find the first and last observations of each group:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';

PROC SORT data = stat481.sales out = srtdsales;
       by Store;
RUN;

DATA storesales;
          set srtdsales;
          by Store;
          firstStore = first.Store;
          lastStore = last.Store;
RUN;

PROC PRINT data= storesales;
        title 'Behind the scene view of the storesales data set';
           id Store;
RUN;

Behind the scene view of the storesales data set
Store	Dept	Quarter	Sales	firstStore	lastStore
101	10	1	110001.50	1	0
101	10	2	113101.20	0	0
101	10	3	111932.15	0	0
101	10	4	99901.10	0	0
101	20	1	110002.36	0	0
101	20	2	99922.39	0	0
101	20	3	98832.98	0	0
101	20	4	110101.70	0	1
109	10	1	120422.77	1	0
109	10	2	123984.32	0	0
109	10	3	121801.29	0	0
109	10	4	122125.66	0	0
109	30	1	98310.13	0	0
109	30	2	97331.25	0	0
109	30	3	96386.28	0	0
109	30	4	98511.90	0	0
109	20	1	115239.09	0	0
109	20	2	113001.98	0	0
109	20	3	114234.32	0	0
109	20	4	114122.65	0	1
121	20	1	121947.10	1	0
121	20	2	119964.69	0	0
121	20	3	122136.28	0	0
121	20	4	120111.11	0	0
121	10	1	127192.92	0	0
121	10	2	125280.13	0	0
121	10	3	128203.56	0	0
121	10	4	123632.29	0	1

The first procedure just takes care of the details that we ignored earlier. Because we want SAS to process the sales data set BY Store, we first need to sort the data. The SORT procedure tells SAS to sort the sales data set by the value of Store and to call the resulting sorted data set srtdsales.

Now for the DATA step. The SET and BY statements tell SAS to process the data by grouping observations with a similar Store value together. How does SAS accomplish this? Whenever you use a BY statement with a SET statement, SAS automatically creates two temporary variables for each variable name that appears in the BY statement. One of the temporary variables is called FIRST.variable, where variable is the variable name appearing in the BY statement. The other temporary variable is called, what else but, LAST.variable. The two variables always equal either 1 or 0:

FIRST.variable = 1 when an observation is the first observation in a BY group
FIRST.variable = 0 when an observation is not the first observation in a BY group
LAST.variable = 1 when an observation is the last observation in a BY group
LAST.variable = 0 when an observation is not the last observation in a BY group

SAS uses the values of the FIRST.variable and LAST.variable temporary variables to identify the first and last observations in a group, and therefore the group itself. Oh, a comment about that adjective temporary ... SAS places FIRST.variable and LAST.variable in the program data vector and they are therefore available for DATA step programming, but SAS does not add them to the SAS data set being created. It is in that sense that they are temporary.

Because SAS does not write FIRST.variables and LAST.variables to output data sets, we have to do some finagling to see their contents. The two assignment statements:

firstStore = first.Store;
        lastStore  = last.Store;

simply tell SAS to assign the values of the temporary variables, FIRST.Store and LAST.Store, to permanent variables, firstStore and lastStore, respectively. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.

You should see that firstStore (and hence the internal FIRST.Store variable) equals 1 for the first observation of Store 101, for the first observation of Store 109, and for the first observation of Store 121. And, firstStore equals 0 for all of the other observations in the data set. Likewise, lastStore (and hence the internal LAST.Store variable) equals 1 for the last observation of Store 101, for the last observation of Store 109, and for the last observation of Store 121. And, lastStore equals 0 for all of the other observations in the data set.

Now that we know the mechanism by which SAS can identify the first and last observations in a group, let's take a look at an example in which we use the information to create a new variable, as well as change the structure of a data set.

Example 13.12

The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each Store that appears in the sales data set, and to subsequently use that information to determine and display the total sales (StoreTotal) — that is, across all of the departments and quarters — for each Store:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
PROC SORT data = stat481.sales out = srtdsales;
        by Store;
RUN;
                        
DATA storesales;
          set srtdsales;
          by Store;
             if first.Store then StoreTotal = 0;
             StoreTotal + Sales;
             if last.Store;
             drop Dept Quarter Sales;
             format StoreTotal Dollar13.2;
RUN;

PROC PRINT data= storesales;
        title 'Sales by Store';
        id Store;
        sum StoreTotal;
RUN;

Sales by Store
Store	StoreTotal
101	$853,795.38
109	$1,335,471.64
121	$988,468.08
$3,177,735.10

We better start by having you run the program just to make sure you see first what the program does ... If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output. In short, the program takes the sales data set displayed earlier and collapses it to create a new data set called storesales that contains just three observations — one for each Store — and two variables — the store (Store) and the total sales (StoreTotal).

Now, let's work our way through the code. Again, the SORT procedure tells SAS to sort the sales data set by the value of Store and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to create the temporary variables FIRST.Store and LAST.Store. The next two statements in the DATA step:

if first.Store then StoreTotal = 0;
   StoreTotal + Sales;

can almost be read as if they were in English. Set the store's total sales (StoreTotal) to 0 when you encounter the first observation for the store. Add the Sales for that observation to the StoreTotal. For every subsequent observation of the Store, add the Sales amount for the Store to the StoreTotal. When you encounter the last observation for the Store:

if last.Store;
   drop Dept Quarter Sales;
   format StoreTotal Dollar13.2;

include the observation in the storesales data set, and while so doing exclude the Dept, Quarter, and Sales variables and format the StoreTotal variable so its values are displayed as dollar amounts.

It is the subsetting IF statement:

if last.Store;

that deserves a bit more of a mention. First, it is equivalent to saying:

if last.Store = 1;

Recall that a subsetting IF statement tells SAS which variables to include in the output data set. Here, the subsetting IF statement tells SAS to include only those observations that correspond to the last observation for each Store. It is therefore the statement that takes the sales data set containing 28 observations and collapses it into the storesales data set containing just 3 observations. After all, the LAST.Store variable = 1 only for three observations — the three observations that correspond to the last observation for each Store.

Finding the First and Last Observations in Subgroups

In the previous examples, we specified just one variable name, Store, in the BY statement. Doing so allowed us to identify the first and last observations for each Store group. When you specify multiple variable names in a BY statement, you can identify the first and last observations in, not just one group, but multiple subgroups.

Example 13.13

The following program tells SAS to process the sales data set by Store and Dept, so we can get a behind-the-scenes look at how we can find the first and last observations of two subgroups:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';

PROC SORT data = stat481.sales out = srtdsales;
       by Store Dept;
RUN;

DATA storesales;
       set srtdsales;
       by Store Dept;
       firstStore = first.Store;
       lastStore  = last.Store;
       firstDept  = first.Dept;
       lastDept   = last.Dept;
RUN;

PROC PRINT data = storesales;
        title 'Behind the scene view of the storesales data set';
          id Store;
RUN;

Behind the scene view of the storesales data set
Store	Dept	Quarter	Sales	firstStore	lastStore	firstDept	lastDept
101	10	1	110001.50	1	0	1	0
101	10	2	113101.20	0	0	0	0
101	10	3	111932.15	0	0	0	0
101	10	4	99901.10	0	0	0	1
101	20	1	110002.36	0	0	1	0
101	20	2	99922.39	0	0	0	0
101	20	3	98832.98	0	0	0	0
101	20	4	110101.70	0	1	0	1
109	10	1	120422.77	1	0	1	0
109	10	2	123984.32	0	0	0	0
109	10	3	121801.29	0	0	0	0
109	10	4	122125.66	0	0	0	1
109	20	1	115239.09	0	0	1	0
109	20	2	113001.98	0	0	0	0
109	20	3	114234.32	0	0	0	0
109	20	4	114122.65	0	0	0	1
109	30	1	98310.13	0	0	1	0
109	30	2	97331.25	0	0	0	0
109	30	3	96386.28	0	0	0	0
109	30	4	98511.90	0	1	0	1
121	10	1	127192.92	1	0	1	0
121	10	2	125280.13	0	0	0	0
121	10	3	128203.56	0	0	0	0
121	10	4	123632.29	0	0	0	1
121	20	1	121947.10	0	0	1	0
121	20	2	119964.69	0	0	0	0
121	20	3	122136.28	0	0	0	0
121	20	4	120111.11	0	1	0	1

Again, because we want SAS to process the sales data set BY Store and Dept, we first need to sort the data. The SORT procedure tells SAS to sort the sales data set by the value of Store and Dept and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to process the srtdsales data set by Store and Dept. Because two variables appear in the BY statement, SAS creates two temporary variables for each one:

FIRST.Store = 1 when SAS encounters a Store's first observation, and 0 otherwise
LAST.Store = 1,
when SAS encounters a Store's last observation, and 0 otherwise
FIRST.Dept = 1, when SAS encounters a Dept's first observation, and 0 otherwise
LAST.Dept = 1, when SAS encounters a Dept's last observation, and 0 otherwise

Because SAS does not write FIRST.variables and LAST.variables to output data sets, we again do some finagling to see their contents. The four assignment statements:

firstStore = first.Store;
lastStore  = last.Store;
firstDept  = first.Dept;
lastDept   = last.Dept;

tell SAS to assign the values of the temporary variables to permanent variables so we can view their contents. The PRINT procedure tells SAS to print the resulting data set so that we can take an inside peek at the values of the FIRST.variables and LAST.variables.

If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output to convince yourself that SAS uses the two FIRST.variables and the two LAST.variables to identify the first and last observations of each Store and of each Dept within each Store.

Example 13.14

The following program uses the SET and BY statements to tell SAS to identify the first and last observations for each Store and for each Dept within each Store, and to subsequently use that information to determine and display the total sales (DeptTotal) for each Dept within each Store:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';

PROC SORT data = stat481.sales out = srtdsales;
  by Store Dept;
RUN;

DATA storesales;
     set srtdsales;
     by Store Dept;
     if first.Dept then DeptTotal = 0;
     DeptTotal + Sales;
     if last.Dept;
     drop Quarter Sales;
     format DeptTotal Dollar13.2;
RUN;
                                    
PROC PRINT data = storesales NOOBS;
     title 'Sales by Store and Department';
     sum DeptTotal;
RUN;

Sales by Store and Department
Store	Dept	DeptTotal
101	10	$434,935.95
101	20	$418,859.43
109	10	$488,334.04
109	20	$456,598.04
109	30	$390,539.56
121	10	$504,308.90
121	20	$484,159.18
$3,177,735.10

In reviewing the program, you should see that the program is just a slightly modified version of that of Example 1.12. Let's start again by having you launch and run the SAS program, and reviewing the output just to make sure you see first what the program does. (Before running, don't forget to edit the LIBNAME statement to reflect the location of your stored sales data set.) In short, the program takes the sales data set displayed earlier and collapses it to create a new data set called storesales that contains seven observations — one for each Dept within each Store — and three variables — the store (Store), the department (Dept) and the total sales (DeptTotal).

Again, the SORT procedure tells SAS to sort the sales data set by the value of Store and Dept and to call the resulting sorted data set srtdsales. The SET and BY statements tell SAS to create the temporary variables FIRST.Store, LAST.Store, FIRST.Dept, and LAST.Dept. The next two statements in the DATA step:

if first.Dept then DeptTotal = 0;
DeptTotal + Sales;

can again be read as if they were in English. Set the department's total sales (DeptTotal) to 0 when you encounter the first observation for the department. Add the Sales for that observation to the DeptTotal. For every subsequent observation for the Dept, add the Sales amount for the Dept to the DeptTotal. When you encounter the last observation for the Dept (within each Store):

if last.Dept;
drop Quarter Sales;
format DeptTotal Dollar13.2;

include the observation in the storesales data set, and while so doing exclude the Quarter and Sales variables, and format the DeptTotal variable so its values are displayed as dollar amounts.

It is again the subsetting IF statement:

if last.Dept;

that takes the sales data set containing 28 observations and collapses it into the storesales data set containing just 7 observations. After all, the LAST.Dept variable = 1 for just seven observations — the seven observations that correspond to the last observation for each Dept within each Store.

13.4 - Detecting the End of a SAS Data Set

Now that we've spent some time identifying the first and last observations among a group of observations, now we'll turn our attention to identifying the last observation in a data set. You might find yourself in situations in which you need to determine when the last observation in an input data set has been read, so that you can then tell SAS to take some action.

In this section, we'll investigate how to use the SET statement's END= option to tell SAS to create a temporary numeric value whose value is used to detect the last observation. We'll start first by taking a look at a program in which we would benefit from knowing the last observation.

Example 13.15

The following program uses an accumulator variable called TotalSales to determine the overall total Sales amount in the sales data set:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
DATA storesales;
	set stat481.sales;
	TotalSales + Sales;
	format Sales TotalSales Dollar13.2;
RUN;
 
PROC PRINT data = storesales NOOBS;
	title;
RUN;

Store	Dept	Quarter	Sales	TotalSales
101	10	1	$110,001.50	$110,001.50
101	10	2	$113,101.20	$223,102.70
101	10	3	$111,932.15	$335,034.85
101	10	4	$99,901.10	$434,935.95
101	20	1	$110,002.36	$544,938.31
101	20	2	$99,922.39	$644,860.70
101	20	3	$98,832.98	$743,693.68
101	20	4	$110,101.70	$853,795.38
121	20	1	$121,947.10	$975,742.48
121	20	2	$119,964.69	$1,095,707.17
121	20	3	$122,136.28	$1,217,843.45
121	20	4	$120,111.11	$1,337,954.56
121	10	1	$127,192.92	$1,465,147.48
121	10	2	$125,280.13	$1,590,427.61
121	10	3	$128,203.56	$1,718,631.17
121	10	4	$123,632.29	$1,842,263.46
109	10	1	$120,422.77	$1,962,686.23
109	10	2	$123,984.32	$2,086,670.55
109	10	3	$121,801.29	$2,208,471.84
109	10	4	$122,125.66	$2,330,597.50
109	30	1	$98,310.13	$2,428,907.63
109	30	2	$97,331.25	$2,526,238.88
109	30	3	$96,386.28	$2,622,625.16
109	30	4	$98,511.90	$2,721,137.06
109	20	1	$115,239.09	$2,836,376.15
109	20	2	$113,001.98	$2,949,378.13
109	20	3	$114,234.32	$3,063,612.45
109	20	4	$114,122.65	$3,177,735.10

If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output to convince yourself that the value of the TotalSales variable for each observation is the sum of the Sales values for the observation and all of the observations that precede it.

In this case, we ended up with a data set called storesales whose TotalSales variable contains a running total of the Sales values. Now, suppose instead we wanted a data set called storesales which contains only the last observation with the overall total of all Sales. The SET statement's END= option can help us select the last observation.

Example 13.16

The following program uses the SET statement's END= option and a subsetting IF statement to tell SAS to write only the last observation in the input data set (stat481.sales) to the output data set (storesales):

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';
DATA storesales;
	set stat481.sales end=last;
	TotalSales + Sales;
	format Sales TotalSales Dollar13.2;
	drop Store Dept Sales Quarter;
	if last;
RUN;

PROC PRINT data = storesales NOOBS;
	title;
RUN;

TotalSales
$3,177,735.10

The END=last option tells SAS to create a temporary numeric variable called last, which is initialized to 0 and set to 1 only when the SET statement reads the last observation in the input data set. Although we used the variable name last here, we could have used any valid SAS variable name. The variable is temporary in that it is placed in the program data vector but not written to the output data set.

The subsetting IF statement:

if last;

is the statement that tells SAS to include in the output data set only the last observation in the input data set. The DROP statement tells SAS to drop the then meaningless Store, Dept, Quarter, and Sales variables from the output data set.

If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement to reflect the location in which you saved the data set. Then, run the program, and review the output to convince yourself that the storesales data set contains just one observation and one variable containing the overall total sales (TotalSales) of all of the stores.

13.5 - Understanding How Data Sets are Read

In Stat 480, we spent a lesson investigating how SAS processes a raw data file that is read into a SAS data set. Do you remember? The compile phase creates the program data vector, and hence the descriptor portion of the SAS data set. The execution phase does the actual data processing, in which the values in the program data vector are set to missing, the current record is placed in an input buffer, the values are read from the input buffer into the appropriate position in the program data vector, and then an observation is written from the program data vector to the output data set. Hopefully, this sounds at least a little bit familiar, because as we'll soon see, the process by which SAS reads an existing SAS data set into another SAS data set is very similar. The main difference is that while reading an existing SAS data set with the SET statement, SAS retains the values of the variables from one observation to the next. Let's work through an example.

Example 13.17

The following program uses a SET statement to read the permanent SAS data set stat481.sales, and then creates a variable called SalesTax, as well as a new temporary SAS data set called tax:

LIBNAME stat481 'C:\yourdrivename\Stat481WC\01sasdata\sasndata';

DATA tax;
	set stat481.sales;
	SalesTax = Sales * 0.06; 
RUN;

PROC PRINT data = tax;
	title 'The tax data set';
RUN;

The tax data set
Obs	Store	Dept	Quarter	Sales	SalesTax
1	101	10	1	110001.50	6600.09
2	101	10	2	113101.20	6786.07
3	101	10	3	111932.15	6715.93
4	101	10	4	99901.10	5994.07
5	101	20	1	110002.36	6600.14
6	101	20	2	99922.39	5995.34
7	101	20	3	98832.98	5929.98
8	101	20	4	110101.70	6606.10
9	121	20	1	121947.10	7316.83
10	121	20	2	119964.69	7197.88
11	121	20	3	122136.28	7328.18
12	121	20	4	120111.11	7206.67
13	121	10	1	127192.92	7631.58
14	121	10	2	125280.13	7516.81
15	121	10	3	128203.56	7692.21
16	121	10	4	123632.29	7417.94
17	109	10	1	120422.77	7225.37
18	109	10	2	123984.32	7439.06
19	109	10	3	121801.29	7308.08
20	109	10	4	122125.66	7327.54
21	109	30	1	98310.13	5898.61
22	109	30	2	97331.25	5839.88
23	109	30	3	96386.28	5783.18
24	109	30	4	98511.90	5910.71
25	109	20	1	115239.09	6914.35
26	109	20	2	113001.98	6780.12
27	109	20	3	114234.32	6854.06
28	109	20	4	114122.65	6847.36

A pretty straightforward program! If you haven't already done so, download the sales data set, and save it to a convenient location on your computer. Launch the SAS program and edit the LIBNAME statement so it reflects the location in which you saved the data set. Then, run the program, and review the output to familiarize yourself with the content of the output data set tax. Now, let's walk our way through how SAS processes the DATA step in this program.

Compile phase

During the compile phase, SAS takes the following steps:

SAS creates a program data vector containing the automatic variables _N_ and _ERROR_:
_N_
_ERROR_

1
0
SAS scans each statement in the DATA step looking for syntax errors, such as missing semicolons and invalid statements
When SAS compiles the SET statement, SAS adds a position to the program data vector for each variable in the input data set. SAS gets the variable names and attributes, such as type and length, from the input data set. Our input data set here (stat481.sales) tells SAS to add four positions to the program data vector — one for Store, one for Dept, one for Quarter, and one for Sales:
_N_
_ERROR_
Store
Dept
Quarter
Sales

1
0
SAS also adds a position to the program data vector for any variables that are created in the DATA step. The attributes of each of these variables are determined by the expression in the statement. The one assignment statement in our DATA step tells SAS to add one position to the program data vector — for the new variable SalesTax:
_N_
_ERROR_
Store
Dept
Quarter
Sales
Sales Tax
1
0
SAS completes the compile phase at the bottom of the DATA step, and it is then that SAS makes the descriptor portion of the SAS data set

_N_	_ERROR_
1	0

The output data set does not yet contain any observations, because SAS has not yet begun executing the program. When the compile phase is complete, that's when SAS starts the execution phase

Execution phase

During the execution phase, SAS takes the following steps:

Is this starting to sound like an endless loop? You should be getting the idea now ... the process continues as described until all of the observations are read

The DATA step executes once for each observation in the input data set. In our case, SAS will execute 28 times because there are 28 observations in the input data set
At the beginning of the execution phase, SAS sets all of the data set variables in the program data vector to missing:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

1 0 . . . . .

Because it is the first iteration of the DATA step, the automatic variable _N_ is set to 1. And, because SAS has not yet encountered any errors, the automatic variable _ERROR_ is set to 0
The SET statement reads the first observation from the input data set and writes the values to the program data vector:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

1 0 101 10 1 110001.50 .
The assignment statement executes to compute the first value of SalesTax:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

1 0 101 10 1 110001.50 6600.09
At the end of the first iteration of the DATA step, the values in the program data vector are written to the output data set tax as the first observation
The value of the automatic variable _N_ is increased to 2, and control returns to the top of the DATA step. The automatic variable _ERROR_ retains its value of 0, since SAS has still not encountered an error. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a Sum statement. All other variable values, such as the values of the variable SalesTax, are reset to missing. Taking all of this into account, our program data vector looks like this at the beginning of the second iteration of the DATA step:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

2 0 101 10 1 110001.50 .

As stated earlier, this is the step that differs the greatest from when SAS instead reads from a raw data file. Recall that when SAS reads from a raw data file, SAS sets the value of each variable to missing (with a few special exceptions) at the beginning of each iteration
As the SET statement executes, the values from the second observation are written to the program data vector:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

2 0 101 10 2 113101.20 .
The assignment statement executes again to compute the value for SalesTax for the second observation:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

2 0 101 10 2 113101.20 6786.07
At the bottom of the DATA step, the values in the program data vector are written to the output data set tax as the second observation
The value of the automatic variable _N_ is increased to 3, and control returns to the top of the DATA step. The automatic variable _ERROR_ retains its value of 0, since SAS has still not encountered an error. SAS retains the values of variables that were read from a SAS data set with the SET statement, or that were created by a Sum statement. All other variable values, such as the values of the variable SalesTax, are reset to missing. Taking all of this into account, our program data vector looks like this at the beginning of the third iteration of the DATA step:

_N_ _ERROR_ Store Dept Quarter Sales Sales Tax

2 0 101 10 2 113101.20 .

Is this starting to sound like an endless loop? You should be getting the idea now ... the process continues as described until all of the observations are read.

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
1	0	.	.	.	.	.

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
1	0	101	10	1	110001.50	.

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
1	0	101	10	1	110001.50	6600.09

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
2	0	101	10	1	110001.50	.

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
2	0	101	10	2	113101.20	.

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
2	0	101	10	2	113101.20	6786.07

_N_	_ERROR_	Store	Dept	Quarter	Sales	Sales Tax
2	0	101	10	2	113101.20	.

13.6 - Summary

In this lesson, we spent time investigating various issues that concern reading one SAS data set into another SAS data set. We explored the many ways in which a SAS data set can be modified, such as using an accumulator variable to sum the contents of another variable and using a SELECT group to assign values conditionally. We also learned how to use a SET and BY statement together in order to be able to identify the first and last observations in a group. We also learned how to detect the end of a data set. Finally, we closed out the lesson by discussing the steps SAS takes when reading in a SAS data set. The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility