13.2 - Manipulating Data in a SAS Data Set
13.2 - Manipulating Data in a SAS Data SetIn this section, we'll review the many ways in which a SAS data set can be modified, including:
- reducing the number of observations in a data set
- reducing the number of variables in a data set
- creating or modifying a variable
- changing a variable's attributes
- creating an accumulator variable
- assigning values conditionally
While many of the methods will already be familiar to you, some you will see just for the first time.
Reducing the Number of Observations in a Data Set
There may be situations in which you'd like to exclude some of the observations in a SAS data set from an analysis that you are conducting or a report that you are generating. I like to think of doing such a thing as "making a data set shorter." Although there are other ways, we'll investigate two common ways of reducing the number of observations in a data set here. We can use an IF-THEN-DELETE statement to exclude observations from the output data set. Or we can use a subsetting IF statement to include observations in our output data set. We will investigate yet another way of reducing the number of observations in our data sets in the next lesson.
Example 13.2
The following program uses an IF-THEN-DELETE statement to exclude golf courses whose par is 70 from the penngolf data set:
DATA penngolf;
set stat481.penngolf;
if par = 70 then DELETE;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
RUN;
ID | Name | Architect | Year | Type | Par | Yards | Slope | USGA |
---|---|---|---|---|---|---|---|---|
101 | Toftrees | Ed Ault | 1968 | Resort | 72 | 7,018 | 134 | 74.3 |
102 | Penn State Blue | Willie Park, Jr. | 1921 | Public | 72 | 6,525 | 128 | 72.0 |
103 | Centre Hills | Alex Findlay | 1921 | Private | 71 | 6,392 | 128 | 71.2 |
104 | Lewistown CC | . | Private | 72 | 6,779 | 125 | 72.3 | |
105 | State College Elks | Lowell Erdman | 1973 | SemiPri | 71 | 6,369 | 123 | 70.9 |
107 | Sinking Valley CC | Ed Ault | 1967 | SemiPri | 72 | 6,755 | 132 | 73.4 |
108 | Williamsport CC | A.W Tillinghast | 1909 | Private | 71 | 6,489 | 131 | 71.9 |
The IF-THEN-DELETE statement is pretty straightforward, and therefore probably doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that golf courses whose par is 70 were excluded from the temporary penngolf data set.
Example 13.3
The following program uses a subsetting IF statement to include only those golf courses whose par is greater than 70 in the penngolf data set:
DATA penngolf;
set stat481.penngolf;
if par GT 70;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
RUN;
ID | Name | Architect | Year | Type | Par | Yards | Slope | USGA |
---|---|---|---|---|---|---|---|---|
101 | Toftrees | Ed Ault | 1968 | Resort | 72 | 7,018 | 134 | 74.3 |
102 | Penn State Blue | Willie Park, Jr. | 1921 | Public | 72 | 6,525 | 128 | 72.0 |
103 | Centre Hills | Alex Findlay | 1921 | Private | 71 | 6,392 | 128 | 71.2 |
104 | Lewistown CC | . | Private | 72 | 6,779 | 125 | 72.3 | |
105 | State College Elks | Lowell Erdman | 1973 | SemiPri | 71 | 6,369 | 123 | 70.9 |
107 | Sinking Valley CC | Ed Ault | 1967 | SemiPri | 72 | 6,755 | 132 | 73.4 |
108 | Williamsport CC | A.W Tillinghast | 1909 | Private | 71 | 6,489 | 131 | 71.9 |
Again, the subsetting IF statement is pretty straightforward, and therefore probably also doesn't deserve any explanation. Launch and run the program, and review the output to convince yourself that only golf courses whose par is greater than 70 were included in the temporary penngolf data set. You might also want to take note that the resulting data set is the same as the resulting data set from the previous example — just accomplished in a different way!
Reducing the Number of Variables in a Data Set
There may be situations in which you'd like to reduce the number of variables in a data set in order to work with a data set of a much more manageable size. I like to think of doing such a thing as "making a data set thinner." Although we'll investigate how to do this much more extensively in the next lesson, we'll at least introduce the concept by illustrating the use of a DROP statement.
Example 13.4
The following program uses a DROP statement to tell SAS to drop the Architect variable from the temporary output data set penngolf:
DATA penngolf;
set stat481.penngolf;
drop Architect;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
RUN;
ID | Name | Year | Type | Par | Yards | Slope | USGA |
---|---|---|---|---|---|---|---|
101 | Toftrees | 1968 | Resort | 72 | 7,018 | 134 | 74.3 |
102 | Penn State Blue | 1921 | Public | 72 | 6,525 | 128 | 72.0 |
103 | Centre Hills | 1921 | Private | 71 | 6,392 | 128 | 71.2 |
104 | Lewistown CC | . | Private | 72 | 6,779 | 125 | 72.3 |
105 | State College Elks | 1973 | SemiPri | 71 | 6,369 | 123 | 70.9 |
106 | Park Hill CC | 1966 | SemiPri | 70 | 6,004 | 126 | 69.3 |
107 | Sinking Valley CC | 1967 | SemiPri | 72 | 6,755 | 132 | 73.4 |
108 | Williamsport CC | 1909 | Private | 71 | 6,489 | 131 | 71.9 |
109 | Standing Stone GC | 1973 | SemiPri | 70 | 6,593 | 120 | 71.4 |
110 | Bucknell GC | 1960 | SemiPri | 70 | 6,253 | 132 | 70.0 |
Self-explanatory again, eh? In the next lesson, we'll learn that rather than using a DROP statement to exclude variables, we could use a KEEP statement to include variables in the output data set. We'll also discuss the difference between the DROP and KEEP statements and the alternative DROP= and KEEP= options of the DATA and SET statements. For now, let's have you launch and run the program, and review the output to convince yourself that the Architect variable was not included in the output temporary penngolf data set.
Creating or Modifying a Variable
There is definitely nothing new here! Surely, you've created a new variable or two in the myriad SAS programs you have written by now. We include it here only for the sake of completeness, so it can be acknowledged as one way in which a SAS data set can be modified.
Example 13.5
The following program uses the Slope and USGA variables in conjunction with an assignment statement to create a new variable called Bogey:
Name | Yards | Par | Slope | USGA | Bogey |
---|---|---|---|---|---|
Toftrees | 7,018 | 72 | 134 | 74.3 | 99.224 |
Penn State Blue | 6,525 | 72 | 128 | 72.0 | 95.808 |
Centre Hills | 6,392 | 71 | 128 | 71.2 | 95.008 |
Lewistown CC | 6,779 | 72 | 125 | 72.3 | 95.550 |
State College Elks | 6,369 | 71 | 123 | 70.9 | 93.778 |
Park Hill CC | 6,004 | 70 | 126 | 69.3 | 92.736 |
Sinking Valley CC | 6,755 | 72 | 132 | 73.4 | 97.952 |
Williamsport CC | 6,489 | 71 | 131 | 71.9 | 96.266 |
Standing Stone GC | 6,593 | 70 | 120 | 71.4 | 93.720 |
Bucknell GC | 6,253 | 70 | 132 | 70.0 | 94.554 |
Launch and run the program, and review the output to convince yourself that the Bogey variable was calculated as suggested by the assignment statement.
Changing a Variable's Attributes
Recall that one of the products of the compile phase is the descriptor portion of a data set containing information about the attributes of each variable in the data set. The attribute information includes the variable's name, type, length, format, informat, and label. As you know, we can use a LENGTH statement to modify the length of a variable, a FORMAT statement to modify the format of a variable, and a LABEL statement to associate a descriptive label to a variable.
Example 13.6
The following program uses 1) a LABEL statement to associate descriptive labels to some of the variables in the penngolf data set and 2) a FORMAT statement to tell SAS how to display two of the variables (Bogey and Yards):
DATA penngolf;
set stat481.penngolf;
Bogey = 0.186*Slope + USGA;
label Bogey = 'Bogey Rating'
USGA = 'USGA Rating'
Slope = 'Course Slope'
Par = 'Course Par'
Yards = 'Total Yardage';
format Bogey 4.1 Yards comma5.;
RUN;
PROC PRINT data = penngolf NOOBS LABEL;
title 'The penngolf data set';
var Name Yards Par Slope USGA Bogey;
RUN;
Name | Total Yardage | Course Par | Course Slope | USGA Rating | Bogey Rating |
---|---|---|---|---|---|
Toftrees | 7,018 | 72 | 134 | 74.3 | 99.2 |
Penn State Blue | 6,525 | 72 | 128 | 72.0 | 95.8 |
Centre Hills | 6,392 | 71 | 128 | 71.2 | 95.0 |
Lewistown CC | 6,779 | 72 | 125 | 72.3 | 95.6 |
State College Elks | 6,369 | 71 | 123 | 70.9 | 93.8 |
Park Hill CC | 6,004 | 70 | 126 | 69.3 | 92.7 |
Sinking Valley CC | 6,755 | 72 | 132 | 73.4 | 98.0 |
Williamsport CC | 6,489 | 71 | 131 | 71.9 | 96.3 |
Standing Stone GC | 6,593 | 70 | 120 | 71.4 | 93.7 |
Bucknell GC | 6,253 | 70 | 132 | 70.0 | 94.6 |
I personally have never seen the need for it, but a label can be as long as 256 characters. Seems a bit like overkill to me. Anyway, launch and run the program, and review the output to convince yourself that the variables were labeled and formatted as suggested.
Creating an Accumulator Variable
Now, this topic is definitely new to us! You might often find yourself in a situation in which you want to create a variable that accumulates the values of another variable. For example, if you have a data set that contains monthly sales, you might want to create a variable that contains year-to-date sales. To create such a variable, you'd most likely want to take advantage of what SAS calls an accumulator variable. To add the contents of a variable, or the result of an expression, to an accumulator variable, we need to use what SAS calls a sum statement.
The general form of a sum statement is:
variable + expression;
where:
- variable tells SAS the desired name of your accumulator variable (yeartodate, perhaps). The variable, which must be numeric, is automatically set to 0 before the first observation in the data set is read.
- expression is any valid SAS expression. It might be something as simple as the name of the variable that you want to add up. One thing that is important to keep in mind is that if the expression produces a missing value, or a variable's value is missing, the sum statement treats it like a zero. This is in stark contrast to an assignment statement, in which SAS assigns the variable to the left of the equal sign (=) a missing value if any variable on the right side of the equal sign (=) contains a missing value.
In short, the sum statement adds the result of the expression that is on the right side of the plus sign (+) to the numeric accumulator variable that is on the left side of the plus sign. Then, as you can imagine would have to be the case in order to work properly, the accumulator variable's value is retained from one iteration of the DATA step to the next. That is, the accumulator variable is not set to missing as variables usually are when reading data. Let's take a look at an example!
Example 13.7
The following program uses a sum statement and an accumulator variable called TotalYards to tell SAS to tally up the number of Yards that the golf courses in the penngolf data set have:
DATA penngolf;
set stat481.penngolf;
TotalYards + Yards;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
var Name Yards TotalYards;
RUN;
Name | Yards | Total Yards |
---|---|---|
Toftrees | 7,018 | 7018 |
Penn State Blue | 6,525 | 13543 |
Centre Hills | 6,392 | 19935 |
Lewistown CC | 6,779 | 24714 |
State College Elks | 6,369 | 33083 |
Park Hill CC | 6,004 | 39087 |
Sinking Valley CC | 6,755 | 45842 |
Williamsport CC | 6,489 | 52331 |
Standing Stone GC | 6,593 | 58924 |
Bucknell GC | 6,253 | 65177 |
Launch and run the program, and review the output. So, let's see here, the value of TotalYards is set to 0 before the first observation is read. Then, as you can see, the value of TotalYards in the first observation becomes 0 plus 7018, or 7018. Then, the value of TotalYards in the second observation becomes 7018 plus 6525, or 13543. Then, the value of TotalYards in the third observation becomes 13543 plus 6392, or 19935. And so on.
Assigning Values Conditionally
You learned how to use IF-THEN-ELSE statements in Stat 480 as a way of assigning values to variables conditionally. Here, we'll explore the SELECT group as an alternative method.
Example 13.8
The following program uses a SELECT group to create a numeric variable called AssnFee that depends on the values of the character variable Type:
DATA penngolf;
set stat481.penngolf;
select (Type);
when ("Resort") AssnFee = 5000;
when ("Private") AssnFee = 4000;
when ("SemiPri") AssnFee = 2000;
when ("Public") AssnFee = 1000;
otherwise AssnFee = .;
end;
format Yards AssnFee comma5.;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
var Name Yards Type AssnFee;
RUN;
Name | Yards | Type | Assn Fee |
---|---|---|---|
Toftrees | 7,018 | Resort | 5,00 |
Penn State Blue | 6,525 | Public | 1,000 |
Centre Hills | 6,392 | Private | 4,000 |
Lewistown CC | 6,779 | Private | 4,000 |
State College Elks | 6,369 | SemiPri | 2,000 |
Park Hill CC | 6,004 | SemiPri | 2,000 |
Sinking Valley CC | 6,755 | SemiPri | 2,000 |
Williamsport CC | 6,489 | Private | 4,000 |
Standing Stone GC | 6,593 | SemiPri | 2,000 |
Bucknell GC | 6,253 | SemiPri | 2,000 |
As you can see, a SELECT group is comprised of a SELECT statement, a series of required WHEN statements, an optional OTHERWISE statement, and a required END statement. The SELECT statement begins, and the END statement ends a SELECT group. The easiest way to explain this SELECT group is probably just to compare it to the comparable IF-THEN-ELSE statement:
if (Type = "Resort") then AssnFee = 5000;
else if (Type = "Private") then AssnFee = 4000;
else if (Type = "SemiPri") then AssnFee = 2000;
else if (Type = "Public") then AssnFee = 1000;
else AssnFee = .;
That is, in this example, we've told SAS that we want to compare the variable Type to all of the when-conditions appearing in the WHEN statements. If the value of Type for a particular observation matches a when-condition, then the statement that follows that when-condition is executed. If the value of Type for a particular observation does not match any of the when-conditions, SAS executes the OTHERWISE statement.
Launch and run the program, and review the output to convince yourself that the SELECT group functions just like above the IF-THEN-ELSE statement. You might also want to note in this example that we've optionally taken a shortcut of including the variable Type in parentheses in the SELECT statement. This shortcut tells SAS to compare the variable of Type to the values in the parentheses in the WHEN statements. If we did not take this optional shortcut, our SELECT group would have to be written as follows:
select; when (Type = "Resort") AssnFee = 5000;
when (Type = "Private") AssnFee = 4000;
when (Type = "SemiPri") AssnFee = 2000;
when (Type = "Public") AssnFee = 1000;
otherwise AssnFee = .;
end;
You might want to edit and re-run the program to convince yourself that the two SELECT groups are equivalent.
Now you might be wondering when you should use a SELECT group, and when you should use IF-THEN-ELSE statements. When you have a long series of mutually exclusive conditions and the comparison is numeric, using a SELECT group is a bit more efficient than using a series of IF-THEN-ELSE statements. When you need SAS to evaluate just a few conditions (like one or two), you might find it more straightforward to just use an IF-THEN-ELSE statement. You wouldn't lose any efficiency in doing so. That said, if efficiency seems like a minor issue to you, you might just find it easier to read and debug programs containing SELECT groups.
Example 13.9
The following program uses a SELECT group to create a new variable called Course whose value depends on the value of Yards:
DATA penngolf;
set stat481.penngolf;
LENGTH Course $ 7;
select;
when (Yards GE 6700) Course = 'Long';
when (6400 LE Yards LT 6700) Course = 'Medium';
when (Yards LT 6400) Course = 'Short';
otherwise Course = 'Unknown';
end;
format Yards comma5.;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
var Name Yards Course;
Name | Yards | Course |
---|---|---|
Toftrees | 7,018 | Long |
Penn State Blue | 6,525 | Medium |
Centre Hills | 6,392 | Short |
Lewistown CC | 6,779 | Long |
State College Elks | 6,369 | Short |
Park Hill CC | 6,004 | Short |
Sinking Valley CC | 6,755 | Long |
Williamsport CC | 6,489 | Medium |
Standing Stone GC | 6,593 | Medium |
Bucknell GC | 6,253 | Short |
In this example, we did not include a variable (or expression) in parentheses in the SELECT statement. As SAS processes an observation, it starts then with the first WHEN statement. If the when-condition is true, SAS executes the statement that follows the when-condition. If it is false, SAS proceeds to the next WHEN statement. If its when-condition is true SAS executes the statement that follows the when-condition. And so on ... SAS proceeds in this way until it reaches the OTHERWISE statement.
Let's have you launch and run the program, and review the output to convince yourself that SAS executes the SELECT group just as it would the comparable IF-THEN-ELSE statement.
Oh, you might also want to remind yourself why it is necessary to use the LENGTH statement to tell SAS explicitly that the length of the Course variable is 7 characters. Recall that, by default, SAS defines the length of a new character variable to be the same as the length of the first value assigned to the variable in the DATA step. Here, then, in the absence of the LENGTH statement, SAS would define the length of the Course variable to be 4, because Long is the first value to be associated with Course. That then means that the values assigned to the Course variable would be Long, Medi, Shor, and Unkn because of the truncation that would ensue. To see this firsthand, comment out (or delete) the LENGTH statement, re-run the program, and review the resulting output. Point made?! Let's take a look at one more example!
Example 13.10
The following SAS program is identical to the previous program, except the order of the conditions in the SELECT group has been reversed and the LENGTH statement has been removed:
DATA penngolf;
set stat481.penngolf;
select;
when (Yards = .) Course = 'Unknown';
when (Yards LT 6400 ) Course = 'Short';
when (Yards LT 6700) Course = 'Medium';
otherwise Course = 'Long';
end;
format Yards comma5.;
RUN;
PROC PRINT data = penngolf NOOBS;
title 'The penngolf data set';
var Name Yards Course;
RUN;
Name | Yards | Course |
---|---|---|
Toftrees | 7,018 | Long |
Penn State Blue | 6,525 | Medium |
Centre Hills | 6,392 | Short |
Lewistown CC | 6,779 | Long |
State College Elks | 6,369 | Short |
Park Hill CC | 6,004 | Short |
Sinking Valley CC | 6,755 | Long |
Williamsport CC | 6,489 | Medium |
Standing Stone GC | 6,593 | Medium |
Bucknell GC | 6,253 | Short |
Launch and run the program, and review the output to convince yourself that this program produces output identical to the output produced by the previous program. Then, there are two points that I'd like to make concerning this program.
First, the LENGTH statement is no longer necessary. Here, SAS would define the length of the Course variable to be 7, because Unknown is the first value to be associated with Course. Since Unknown is the longest value that we assign to Course, the default length of 7 works just fine for us. No need to override the default!
Second, consider the yardage of 6392 for the Centre Hills golf course. If you look at the SELECT group, you can see that a yardage of 6392 meets the condition of both the second and third WHEN statements. That is, 6392 is less than 6400, and it is less than 6700. How does SAS behave in this case? The answer lies in knowing that if more than one WHEN statement is true, SAS uses only the first WHEN statement. Once SAS finds a when-condition that is true, no other when-conditions are evaluated for a particular observation. So, here SAS deems that the second WHEN statement is true when Yards are 6392, and therefore assigns Short to the Course variable. SAS then steps out of the SELECT group, and at the end of the DATA step, moves on to process the next observation in the data set.
Incidentally, the first WHEN statement in our SELECT group:
when (Yards = .) Course = 'Unknown';
should remind us of another important point. Do you remember that good old programming habit about always programming for missing values? If we removed the above WHEN statement from our SELECT group, what value would SAS assign to Course if Yards were missing for a particular golf course? Remembering that a missing value (.) is considered smaller than any other numerical value, SAS would consider the missing value to be less than 6400, and Course would be errantly assigned to the value of Short. So, the good old programming habit holds for SELECT groups as well: always program for missing values.