Lesson 15: Combining SAS Data Sets -- Part I
Lesson 15: Combining SAS Data Sets -- Part IOverview
In this lesson, we will learn how to combine data sets in four different ways — one-to-one reading, one-to-one merging, concatenating, and interleaving. Although one-to-one reading and one-to-one merging both involve placing one data set "to the right" of other data sets to create a single "fat" data set, the results can differ slightly. Concatenating involves stacking one data set "below" other data sets to create a single "tall" data set. Interleaving also involves stacking data sets, but it only deals with sorted data sets. That is, interleaving combines individual sorted data sets into one big sorted data set. Once we've learned the four mentioned ways of combining SAS data sets, we'll also learn how to use DATA step options at the same time that we one-to-one read, one-to-one merge, concatenate or interleave two or more SAS data sets.
Objectives
Upon completing this lesson, you should be able to do the following:
- write SAS code to one-to-one read two or more SAS data sets
- predict the size of the output data set when one-to-one reading two or more SAS data sets
- write SAS code to one-to-one merge two or more SAS data sets
- predict the size of the output data set when one-to-one merging two or more SAS data sets
- write SAS code to concatenate two or more SAS data sets
- predict the size of the output data set when concatenating two or more SAS data sets
- differentiate how SAS handles concatenating data sets when variable types differ across the input data sets
- describe how SAS handles concatenating data sets when formats, informats, or labels differ across the input data sets
- describe how SAS handles concatenating data sets when variables have different lengths across the input data sets
- write SAS code to interleave two or more SAS data sets
- predict the size of the output data set when interleaving two or more SAS data sets
- use the PUT function to perform an explicit numeric-to-character conversion
- use any of the DATA step options at the same time that you one-to-one read, one-to-one merge, concatenate or interleave two or more SAS data sets
15.1 - One-to-One Reading
15.1 - One-to-One ReadingOne-to-one reading combines two or more SAS data sets, one "to the right" of the other into a single "fat" data set. That is, one-to-one reading combines observations from two or more data sets into a single observation in a new data set. For example, suppose the data set patients contains three variables: patient ID number (ID), gender (Sex), and age of the patient (Age):
ID | Sex | Age |
---|---|---|
1157 | F | 33 |
2395 | F | 48 |
1098 | M | 39 |
4829 | F | 24 |
3456 | M | 30 |
5920 | M | 41 |
1493 | F | 42 |
and the data set scale contains three variables: ID (number), Height (in inches) and Weight (in pounds):
ID | Height | Weight |
---|---|---|
1157 | 65 | 122 |
2395 | 64 | 130 |
1098 | 70 | 178 |
4829 | 67 | 142 |
3456 | 72 | 190 |
5920 | 71 | 188 |
Then, when we one-to-one read the two data sets, we get what I like to call a "fat" data set, called say one2oneread:
ID | Sex | Age | Height | Weight |
---|---|---|---|---|
157 | F | 33 | 65 | 122 |
2395 | F | 48 | 64 | 130 |
1098 | M | 39 | 70 | 178 |
4829 | F | 24 | 67 | 142 |
3456 | M | 30 | 72 | 190 |
5920 | M | 41 | 71 | 188 |
in which the second data set gets placed to the "right" of the first data set. Note that the observations are combined based on their relative position in the data set. The first observation of patient is combined with the first observation of scale to create the first observation in one2oneread; the second observation of patient is combined with the second observation of scale to create the second observation in one2oneread; and so on. The DATA step stops after it reads the last observation from the smallest data set. Therefore, the number of observations in the new data set always equals the numbers of observations in the smallest data set you name for one-to-one reading.
Example 15.1
The following program uses one-to-one reading to combine the patients data set with the scale data set:
DATA patients;
DATA patients;
input ID Sex $ Age;
DATALINES;
1157 F 33
2395 F 48
1098 M 39
4829 F 24
3456 M 30
5920 M 41
1493 F 42
;
RUN;
DATA scale;
input ID Height Weight;
DATALINES;
1157 65 122
2395 64 130
1098 70 178
4829 67 142
3456 72 190
5920 71 188
;
RUN;
DATA one2oneread;
set patients;
set scale;
RUN;
PROC PRINT NOOBS;
title 'The one2oneread data set';
RUN;
ID | Sex | Age | Height | Weight |
---|---|---|---|---|
1157 | F | 33 | 65 | 122 |
2395 | F | 48 | 64 | 130 |
1098 | M | 39 | 70 | 178 |
4829 | F | 24 | 67 | 142 |
3456 | M | 30 | 72 | 190 |
5920 | M | 41 | 71 | 188 |
Of course, the first two DATA steps just read in the respective patients and scale data sets. The meat of the one-to-one read takes place in the third (and last) DATA step, in which we see two SET statements. The first SET statement tells SAS first to read the contents of the patients data set into the program data vector, and then the second SET statement tells SAS to read the contents of the scale data set into the program data vector.
Launch and run the SAS program, and review the output to convince yourself that the data sets are combined as described. You should note, in particular, that SAS does indeed stop reading after reaching the last observation in the scale data set. Hence, the combined data set, one2oneread, contains six observations, the number of observations in the smallest of the two data sets. Note, too, that the position of the variables in the one2oneread data set directly corresponds to the order in which the SET statements appear in the DATA step. Because the scale data set appears to the right of the patients data set in the SET statement, the variables from the scale data set appear to the right of the variables from the patients data set in the combined one2oneread data set.
Example 15.2
The following program uses one-to-one reading to combine the patients data set with the scale data set in the reverse order from that of the previous program:
DATA one2oneread2;
set scale;
set patients;
RUN;
PROC PRINT NOOBS;
title 'The one2oneread2 data set';
RUN;
ID | Height | Weight | Sex | Age |
---|---|---|---|---|
1157 | 65 | 122 | F | 33 |
2395 | 64 | 130 | F | 48 |
1098 | 70 | 178 | M | 39 |
4829 | 67 | 142 | F | 24 |
3456 | 72 | 190 | M | 30 |
5920 | 71 | 188 | M | 41 |
Note, here, that the SET statement for the scale data set appears first in the DATA step, followed by the SET statement for the patients data set. Launch and run the SAS program, and review the output to convince yourself that the variables from the patients data set appear to the right of the variables from the scale data set in the combined one2oneread2 data set.
Hmmm... but what about the fact that the ID variable appears in both the patients and scale data sets, but it appears only once in the combined one2oneread2 data set? How does SAS handle the situation? Well... in general, if the data sets contain variables that have the same names, the values that are read in from the last data set overwrite the values that were read in from earlier data sets. Let's take a look at a contrived example to illustrate this point.
Example 15.3
The following program uses one-to-one reading to combine the one data set with the two data set to create a new data set called onetwo:
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetwo;
set one;
set two;
RUN;
PROC PRINT data = onetwo NOOBS;
title 'The onetwo data set';
RUN;
ID | VarA | VarB | VarC |
---|---|---|---|
40 | A1 | B4 | C1 |
50 | A2 | B5 | C2 |
As you review the first two DATA steps, in which SAS reads in the respective one and two data sets, note that the two data sets share two variables, namely ID and VarB. The third (and last) DATA step tells SAS to combine the two data sets using the one-to-one reading method. Let's walk our way through how SAS processes the DATA step. At the end of the compile phase, SAS will have created a program data vector containing the variables from the one and two data sets in the order in which they appear in the DATA step:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | . |
During the first iteration of the DATA step, the first SET statement reads one observation from data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | 10 | A1 | B1 |
Then, the second SET statement reads one observation from data set two. The values for ID and VarB in data set two overwrite the values for ID and VarB in data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | 40 | A1 | B4 | C1 |
Being at the end of the first iteration of the DATA step, SAS writes the contents of the program data vector as the first observation in the onetwo SAS data set. Upon returning to the top of the DATA step, the program data vector looks like this at the beginning of the second iteration of the DATA step:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 40 | A1 | B4 | C1 |
Recall that SAS retains the values of variables that were read from a SAS data set with the SET statement. Now, the first SET statement reads the second observation from the one data set:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 20 | A2 | B2 | C1 |
And, the second SET statement reads the second observation from data set two. Again, the values for ID and VarB in data set two overwrite the values for ID and VarB in data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 50 | A2 | B5 | C2 |
Being at the end of the second iteration of the DATA step, SAS writes the contents of the program data vector as the second observation in the onetwo SAS data set. Because there are no more observations in the two data set, processing stops. That is, the DATA step does not read the third observation from the one data set.
Now, launch and run the SAS program, and review the output to convince yourself that the one and two data sets are combined as described.
One more comment. Note that although each of our one-to-one reading examples involved combining just two data sets, you can specify any number of SET statements when one-to-one reading ... and therefore the sky's the limit.
15.2 - One-to-One Merging
15.2 - One-to-One MergingAt first glance, one-to-one merging appears to be the same as one-to-one reading, since it too combines two or more SAS data sets, one "to the right" of the other into a single "fat" data set. That is, just like one-to-one reading, one-to-one merging combines observations from two or more data sets into a single observation in a new data set. There is just one primary difference though — SAS continues to merge observations until it has read all of the observations from all of the data sets. For example, suppose again that our patients data set contains three variables: patient ID number (ID), gender (Sex), and age of the patient (Age):
ID | Sex | Age |
---|---|---|
1157 | F | 33 |
2395 | F | 48 |
1098 | M | 39 |
4829 | F | 24 |
3456 | M | 30 |
5920 | M | 41 |
1493 | F | 42 |
and our scale data set contains three variables: ID (number), Height (in inches) and Weight (in pounds):
ID | Height | Weight |
---|---|---|
1157 | 65 | 122 |
2395 | 64 | 130 |
1098 | 70 | 178 |
4829 | 67 | 142 |
3456 | 72 | 190 |
5920 | 71 | 188 |
Then, when we one-to-one merge the two data sets, we get a data set, called say one2onemerge, that looks like this:
ID | Sex | Age | Height | Weight |
---|---|---|---|---|
157 | F | 33 | 65 | 122 |
2395 | F | 48 | 64 | 130 |
1098 | M | 39 | 70 | 178 |
4829 | F | 24 | 67 | 142 |
3456 | M | 30 | 72 | 190 |
5920 | M | 41 | 71 | 188 |
1493 | F | 42 | . | . |
in which the second data set gets placed to the "right" of the first data set. Again, the observations are combined based on their relative position in the data set. The first observation of patient is combined with the first observation of scale to create the first observation in one2onemerge; the second observation of patient is combined with the second observation of scale to create the second observation in one2onemerge; and so on. When SAS performs a one-to-one merge, the DATA step continues to read observations until the last observation is read from the largest data set. That's why the one2onemerge data set has one more observation than the one2oneread data set. In general, the number of observations in a data set created by a one-to-one merge always equals the numbers of observations in the largest data set named for one-to-one merging.
Example 15.4
The following program uses one-to-one merging to combine the patients data set with the scale data set:
DATA one2onemerge;
merge patients scale;
RUN;
PROC PRINT NOOBS;
title 'The one2onemerge data set';
RUN;
ID | VarA | VarB | VarC |
---|---|---|---|
40 | A1 | B4 | C1 |
50 | A2 | B5 | C2 |
The MERGE statement tells SAS to create a new data set by combining patients and scale. Launch and run the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. You should see that the first observation in one2onemerge contains the first observation of patients and scale, the second observation in one2onemerge contains the second observation of patients and scale, and so on. Since there are seven observations in patients and six observations in scale, the new one2onemerge data set contains seven observations, with missing values for the Height and Weight variables in the seventh observation. Note that although this example only combined two data sets, the MERGE statement can contain any number of input data sets.
Just as is true for one-to-one reading, if data sets that are being one-to-one merged contain variables that have the same names, the values that are read in from the last data set overwrite the values that were read in from earlier data sets. Let's go back to our contrived example to illustrate this point.
Example 15.5
The following program uses one-to-one merging to combine the one data set with the two data set to create a new data set called onetwomerged:
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetwomerged;
merge one two;
RUN;
PROC PRINT data = onetwomerged NOOBS;
title 'The onetwomerged data set';
RUN;
ID | VarA | VarB | VarC |
---|---|---|---|
40 | A1 | B4 | C1 |
50 | A2 | B5 | C2 |
30 | A3 | B3 |
Note again that the one and two data sets share two variables, namely ID and VarB. The third (and last) DATA step tells SAS to combine the two data sets using the one-to-one merging method. Let's walk our way through how SAS processes the DATA step. At the end of the compile phase, SAS will have created a program data vector containing the variables from the one and two data sets in the order in which they appear in the DATA step:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | . | . |
During the first iteration of the DATA step, the MERGE statement reads first one observation from data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | 10 | A1 | B1 | . |
and then one observation from data set two. The values for ID and VarB in data set two overwrites the values for ID and VarB in data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
1 | 0 | 40 | A1 | B4 | C1 |
Being at the end of the first iteration of the DATA step, SAS writes the contents of the program data vector as the first observation in the onetwomerged data set. Upon returning to the top of the DATA step, the program data vector looks like this at the beginning of the second iteration of the DATA step:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 40 | A1 | B4 | C1 |
Just as is true for a SET statement, SAS retains the values of variables that were read from a SAS data set with the MERGE statement. Now, the MERGE statement reads the second observation from the one data set:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 20 | A2 | B2 | C1 |
and the second observation from the two data set. Again, the values for ID and VarB in data set two overwrites the values for ID and VarB in data set one:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
2 | 0 | 50 | A2 | B5 | C2 |
Being at the end of the second iteration of the DATA step, SAS writes the contents of the program data vector as the second observation in the onetwomerged data set. Now this is where things get different! Because an observation still exists in the one data set, SAS returns to the top of the DATA step. The program data vector looks like this at the beginning of the third iteration of the DATA step:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
3 | 0 | 50 | A2 | B5 | C2 |
And, the MERGE statement reads the third observation from the one data set:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
3 | 0 | 30 | A3 | B3 | C2 |
SAS attempts to read a third observation from the two data set but instead encounters an end-of-data set marker. Therefore, as is always the case in this kind of situation, SAS sets the values of all of that data set's variables in the program data vector to missing:
_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|
3 | 0 | 30 | A3 | B |
Being at the end of the third iteration of the DATA step, SAS writes the contents of the program data vector as the third observation in the onetwomerged SAS data set. Because there are no more observations in either the one or the two data set, processing stops. Thank goodness!
Now, launch and run the SAS program, and review the output to convince yourself that the one and two data sets are combined as described.
One more closing comment. One-to-one reading and one-to-one merging require users to exercise extreme caution when combining two (or more) data sets based on relative position only. It would just take one of the data sets to be "shifted" ever so slightly to get really messed up results. It's for this reason that I personally don't find the one-to-one read or the one-to-one merge all that practical. The more useful and therefore much more common merge performed in SAS is what is called match-merging. We'll learn about it in the next lesson.
15.3 - Concatenating Two or More Data Sets
15.3 - Concatenating Two or More Data SetsTo concatenate two or more SAS data sets means to stack one "on top" of the other into a single SAS data set. For example, suppose the data set store1 contains three variables, store (number), day (of the week), and sales (in dollars):
Store | Day | Sale |
---|---|---|
1 | M | 1200 |
1 | T | 1435 |
1 | W | 1712 |
1 | R | 1529 |
1 | F | 1920 |
1 | S | 2325 |
and the data set store2 contains the same three variables:
Store | Day | Sales |
---|---|---|
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1789 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
Then, when we concatenate the two data sets, we get what I like to call a "tall" data set:
Store | Day | Sales |
---|---|---|
1 | M | 1200 |
1 | T | 1435 |
1 | W | 1712 |
1 | R | 1529 |
1 | F | 1920 |
1 | S | 2325 |
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1789 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
in which the data sets are stacked on top of each other. Note that the number of observations in the new data set is the sum of the numbers of observations in the original data sets. To concatenate SAS data sets, you simplify specify a list of data set names in one SET statement.
Example 15.6
The following program concatenates the store1 and store2 data sets to create a new "tall" data set called bothstores:
DATA store1;
input Store Day $ Sales;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store2;
input Store Day $ Sales;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores;
set store1 store2;
RUN;
PROC PRINT data = bothstores NOOBS;
title 'The bothstores data set';
RUN;
Store | Day | Sales |
---|---|---|
1 | M | 1200 |
1 | T | 1435 |
1 | W | 1712 |
1 | R | 1529 |
1 | F | 1920 |
1 | S | 2325 |
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1798 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
Note that the input data sets — store1 and store2 — contain the same variables — Store, Day, and Sales — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called bothstores, and the SET statement tells SAS that the data set should contain first the observations from store1 and then the observations from store2. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the store1 and store2 data sets to make one "tall" data set called bothstores. You might then want to edit the SET statement so that store1 follows store2, and re-run the SAS program to see that then the contents of store1 follow the contents of store2 in the bothstores data set.
In general, a data set that is created by concatenating data sets contains all of the variables and all of the observations from all of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the number of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.
Example 15.7
The following program concatenates the one and two data sets to create a new "tall" data set called onetopstwo:
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetopstwo;
set one two;
RUN;
PROC PRINT data = onetopstwo NOOBS;
title 'The onetopstwo data set';
RUN;
ID | VarA | VarB | VarC |
---|---|---|---|
10 | A1 | B1 | |
20 | A2 | B2 | |
30 | A3 | B3 | |
40 | B4 | C1 | |
50 | B5 | C2 |
As you review the first two DATA steps, in which SAS reads in the respective one and two data sets, note that the total number of unique variables is four — ID, VarA, VarB, and VarC. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set onetopstwo to contain four variables and five observations. Launch and run the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the one data set and then all of the variables and all of the observations from the two data set. As you can see, to make it all work out okay, observations arising from the one data set have missing values for VarC, and observations from the two data set have missing values for VarA.
When Variable Attributes Differ
As you know, variable attributes include the type of variable (character vs. numeric), the informat (how the variable is read in) and format (how its values are printed) of a variable, the length of the variable, and the label (how its variable name is printed) of a variable. Concatenating data sets when variable attributes differ across the input data sets may pose problems for SAS (and therefore you):
- If the data sets you name in the SET statement contain variables with the same names and types, you can concatenate the data sets without modification.
- If the variable types differ, you must modify one or more of the data sets before concatenating them. SAS will not concatenate the data sets until you do.
- If the lengths, formats, informats, or labels differ, you may want to modify one or more of the data sets before concatenating them. SAS will concatenate the data sets; you may just not like the results.
Example 15.8
If a variable is defined as numeric in one data set named in the SET statement and as a character in another data set, SAS issues an error message and will not concatenate the data sets. The following program attempts to concatenate the store3 and store4 data sets when the Store variable is defined as a character in the store3 data set, but as numeric in the store4 data set:
DATA store3;
input Store $ 1 Day $ 3 Sales 5-8;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store4;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
Launch and run the SAS program, and review the log window. You can see that SAS balks at the idea of concatenating the two data sets and gives us a "dead in the water" message:
DATA bothstores2;
set store3 store4;
ERROR: Variable Store has been defined as both character and numeric.
RUN;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.BOTHSTORES2 may be incomplete. When this step was stopped there were 0
observations and 3 variables.
WARNING: Data set WORK.BOTHSTORES2 was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
In order to concatenate the store3 and store4 data sets successfully, we need to change either the Store character variable in the store3 data set to numeric or the Store numeric variable in the store4 data set to character. As you know from our work in Stat 480, to perform an explicit character-to-numeric conversion of the Store variable in the store3 data set, we'd have to use the INPUT function. Alternatively, we could use the PUT function to perform an explicit numeric-to-character conversion of the Store variable in the store4 data set. That's what the following DATA step does:
DATA store4 (rename = (Store2 = Store));
set store4;
Store2 = put(Store,1.);
drop Store;
RUN;
PROC CONTENTS data = store3;
title 'Contents of store3';
RUN;
PROC CONTENTS data = store4;
title 'Contents of store4';
RUN;
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
2 | Day | Char | 1 |
3 | Sales | Num | 8 |
1 | Store | Char | 1 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
1 | Day | Char | 1 |
2 | Sales | Num | 8 |
3 | Store | Char | 1 |
If you launch and run the SAS program, and review the output from the CONTENTS procedures, you can see that now the Store variable in the store3 data set is defined as a character, as is the Store variable in the store4 data set.
How does the conversion work? Well, of course, it's all because of the PUT function. The general form of the PUT function is:
PUT(source,format)
where source is a numeric variable, constant or expression to be converted to a character value, and format is a valid format that matches the data type of the source variable. In our example, Store is the numeric variable that we want to convert to a character variable, and since the values for Store are one-digit numbers, the numeric format 1. is an appropriate format to specify.
The assignment statement in our program:
Store2 = put(Store, 1.);
thus tells SAS to convert the numeric values in Store to character values and to store the results in a character variable called Store2. A little detail issue here ... if we do not use a different variable name than Store on the left side of the equation, SAS will be forced to do an automatic numeric-to-character conversion rather than the explicit numeric-to-character conversion that we desire. The one other statement in the DATA step, as well as the RENAME= data set option, is necessary just because of this naming issue. The DROP statement tells SAS to drop the Store variable, since we now have the desired character variable Store2. And, the RENAME= option tells SAS to rename the Store2 variable back to Store, just so it matches the original name of the variable.
Now that we've seen both the PUT function and INPUT function in action, it pays to summarize:
- To perform an explicit numeric-to-character conversion, use the (source, format) with a numeric source and a numeric format.
- To perform an explicit character-to-numeric conversion, use the INPUT(source, informat) function with a character source and a numeric informat.
To remember which function requires a format versus an informat, note that the INPUT function requires an informat.
Ahhh... but back to the task on hand. Finally, the following program allows us to accomplish our original goal of concatenating the store3 and store4 data sets:
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
Store | Day | Sales |
---|---|---|
1 | M | 1200 |
1 | T | 1435 |
1 | W | 1712 |
1 | R | 1529 |
1 | F | 1920 |
1 | S | 2325 |
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1798 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
Simple enough! Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that SAS successfully concatenated store3 and store4 into a data set called bothstores2.
Now that we've taken a look at what we need to do when we want to concatenate data sets containing the same variables but with different variable types, let's address a few more attributes. When data sets are concatenated with the SET statement, the formats, informats, and labels associated with the variables in the new data set are determined by the following rule:
If two or more data sets define different formats, informats, or labels for the same variable, the variable in the new data set takes the attribute from the first data set in the SET statement that contains the variable with the attribute.
Example 15.9
The following program creates two data sets — store5 and store6 — that intentionally contain different labels and different formats for the Sales variable:
DATA store5;
input Store 1 Day $ 3 Sales 5-8;
format Sales comma5.;
label Sales = 'Total Sales';
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store6;
input Store 1 Day $ 3 Sales 5-8;
format Sales dollar6.;
label Sales = 'Sales for Day';
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store5;
title 'Contents of the store5 data set';
RUN;
PROC CONTENTS data = store6;
title 'Contents of the store6 data set';
RUN;
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Label |
2 | Day | Char | 1 | ||
3 | Sales | Num | 8 | COMMA5. | Total Sales |
1 | Store | Num | 8 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Label |
2 | Day | Char | 1 | ||
3 | Sales | Num | 8 | DOLLAR6. | Sales for Day |
1 | Store | Num | 8 |
As you can see, the first DATA step tells SAS to set the format and label for the Sales variable as comma5. and Total Sales, respectively. And, the second DATA step tells SAS to set the format and label for the Sales variable as dollar6. and Sales for Day, respectively. Launch and run the SAS program, and review the output from the CONTENTS procedures to convince yourself that the formats and labels for the two versions of the Sales variable have been assigned as described.
Now, the following program tells SAS to store first the observations from the store5 data set, followed by the observations from the store6 data set, into a new data set called bothstores3:
DATA bothstores3;
set store5 store6;
RUN;
PROC PRINT data = bothstores3 NOOBS LABEL;
title 'The bothstores3 data set';
RUN;
Store | Day | Total Sales |
---|---|---|
1 | M | 1,200 |
1 | T | 1,435 |
1 | W | 1,712 |
1 | R | 1,529 |
1 | F | 1,920 |
1 | S | 2,325 |
2 | M | 2,215 |
2 | T | 2,458 |
2 | W | 1,798 |
2 | R | 1,692 |
2 | F | 2,105 |
2 | S | 2,847 |
Recall the rule that if two data sets explicitly define different formats and labels for the same variable, then the variable in the new data set takes the attribute from the first data set in the SET statement that explicitly defines the attribute. That said, we should expect the Sales variable in the bothstores3 data set to take on the attributes of Sales in the store5 data set. Specifically, Sales should be formatted as comma5. and labeled as Total Sales. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.
The following program reverses the order of the concatenation of the store5 and store6 data sets. That is, the program tells SAS to store first the observations from the store6 data set, followed by the observations from the store5 data set, into a new data set called bothstores4:
DATA bothstores4;
set store6 store5;
RUN;
PROC PRINT data = bothstores4 NOOBS LABEL;
title 'The bothstores4 data set';
RUN;
Store | Day | Sales for Day |
---|---|---|
2 | M | $2,215 |
2 | T | $2,458 |
2 | W | $1,798 |
2 | R | $1,692 |
2 | F | $2,105 |
2 | S | $2,847 |
1 | M | $1,200 |
1 | T | $1,435 |
1 | W | $1,712 |
1 | R | $1,529 |
1 | F | $1,920 |
1 | S | $2,325 |
According to the rule, we should expect the Sales variable in the bothstores4 data set to take on the attributes of Sales in the store6 data set. Specifically, Sales should be formatted as dollar6. and labeled as Sales for Day. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.
And, now onto one last attribute, namely that of variable lengths. When the SET statement is used to concatenate data sets in which the same variable is assigned different lengths, the lengths of the variables in the new data set are determined by the following rule:
If two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement.
Example 15.10
The following program creates two data sets — store7 and store8 — that intentionally contain different lengths for the numeric variable Store and the character variable Day:
DATA store7;
length Store 4;
input Store 1 Day $ 3-5 Sales 7-10;
DATALINES;
1 Mon 1200
1 Tue 1435
1 Wed 1712
1 Thu 1529
1 Fri 1920
1 Sat 2325
;
RUN;
DATA store8;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store7;
title 'Contents of the store7 data set';
RUN;
PROC CONTENTS data = store8;
title 'Contents of the store8 data set';
RUN;
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
2 | Day | Char | 3 |
3 | Sales | Num | 8 |
1 | Store | Num | 4 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
2 | Day | Char | 1 |
3 | Sales | Num | 8 |
1 | Store | Num | 8 |
As you can see, the LENGTH statement in the first DATA step tells SAS to set the length for the numeric variable Store to 4 in the store7 data set. Since the character variable Day is not explicitly mentioned in a LENGTH statement, SAS sets the length of Day in the store7 data set to 3, the number of columns specified in the INPUT statement. And, since neither the numeric variable Store nor the character variable Day are explicitly mentioned in a LENGTH statement in the second DATA step, SAS sets the length of Store in store8 to 8 by default, and the length of Day to 1, the number of columns specified in the INPUT statement. Launch and run the SAS program, and review the output from the CONTENTS procedures to convince yourself that the lengths for the two versions of the Store and Day variables have been assigned as described.
Now, the following program tells SAS to store first the observations from the store7 data set, followed by the observations from the store8 data set, into a new data set called bothstores5:
DATA bothstores5;
set store7 store8;
RUN;
PROC PRINT data = bothstores5 NOOBS LABEL;
title 'The bothstores5 data set';
RUN;
PROC CONTENTS data = bothstores5;
title '
Contents of the bothstores5 data set';
RUN;
Store | Day | Sales |
---|---|---|
1 | Mon | 1200 |
1 | Tue | 1435 |
1 | Wed | 1712 |
1 | Thu | 1529 |
1 | Fri | 1920 |
1 | Sat | 2325 |
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1798 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
2 | Day | Char | 3 |
3 | Sales | Num | 8 |
1 | Store | Num | 4 |
Recall the rule that if two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement. That said, we should expect the Store variable in the bothstores5 data set to take on the attributes of Store in the store7 data set. Specifically, Store should be assigned a length of 4. Likewise, we should expect the Day variable in the bothstores5 data set to take on the attributes of Day in the store7 data set. Specifically, Day should be assigned a length of 3. Launch and run the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected.
The following program reverses the order of the concatenation of the store7 and store8 data sets. That is, the program tells SAS to store first the observations from the store8 data set, followed by the observations from the store7 data set, into a new data set called bothstores6:
DATA bothstores6;
set store8 store7;
RUN;
PROC PRINT data = bothstores6 NOOBS LABEL;
title 'The bothstores6 data set';
RUN;
PROC CONTENTS data = bothstores6;
title '
Contents of the bothstores6 data set';
RUN;
Store | Day | Sales |
---|---|---|
2 | M | 2215 |
2 | T | 2458 |
2 | W | 1798 |
2 | R | 1692 |
2 | F | 2105 |
2 | S | 2847 |
1 | M | 1200 |
1 | T | 1435 |
1 | W | 1712 |
1 | T | 1529 |
1 | F | 1920 |
1 | S | 2325 |
Alphabetic List of Variables and Attributes | |||
---|---|---|---|
# | Variable | Type | Len |
2 | Day | Char | 1 |
3 | Sales | Num | 8 |
1 | Store | Num | 8 |
According to the rule, we should expect the Store variable in the bothstores6 data set to take on the attributes of Store in the store8 data set. Specifically, Store should be assigned a length of 8. Likewise, we should expect the Day variable in the bothstores6 data set to take on the attributes of Day in the store8 data set. Specifically, Day should be assigned a length of 1. Launch and run the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected. You should note in particular how SAS is forced to truncate the values of the Day variable in order to get them to fit within their assigned length of 1.
15.4 - Interleaving SAS data sets
15.4 - Interleaving SAS data setsProvided your data sets are properly sorted, you can interleave two or more SAS data sets. Interleaving combines individual sorted data sets into one big sorted data set based on the variables listed in a BY statement. For example, when data set one:
Year | X |
---|---|
2000 | 1 |
2001 | 2 |
2002 | 3 |
2003 | 4 |
and data set two:
Year | X |
---|---|
2001 | 5 |
2002 | 6 |
2003 | 7 |
2004 | 8 |
are interleaved by variable year, we get data set three:
Year | X |
---|---|
2000 | 1 |
2001 | 2 |
2001 | 5 |
2002 | 3 |
2002 | 6 |
2003 | 4 |
2003 | 7 |
2004 | 8 |
To interleave, you specify the data sets you want interleaved in the SET statement, and indicate on which variable you want the final data set sorted in the BY statement. You may interleave as many data sets as you'd like. The resulting data set contains all of the variables and all of the observations from all of the input data sets.
Example 15.11
The following program interleaves the one and two data sets by year:
DATA one;
input year x;
DATALINES;
2000 1
2001 2
2002 3
2003 4
;
RUN;
DATA two;
input year x;
DATALINES;
2001 5
2002 6
2003 7
2004 8
;
RUN;
DATA three;
set one two;
by year;
RUN;
PROC PRINT data = three NOOBS;
title 'The interleaved three data set';
RUN;
year | x |
---|---|
2000 | 1 |
2001 | 2 |
2001 | 5 |
2002 | 3 |
2002 | 6 |
2003 | 4 |
2003 | 7 |
2004 | 8 |
The first two DATA steps, of course, just tell SAS to read the data values into the one and two data sets. The third, and most relevant DATA step to this topic, contains both a SET statement and a BY statement. That tells SAS that we want to interleave the data sets appearing in the SET statement (one and two) by the variable appearing in the BY statement (year) and to store the result in a data set called three. Launch and run the SAS program. Review the output from the PRINT procedure to convince yourself that the three data set contains the contents of the one and two data sets sorted by year.
Example 15.12
As you may have noticed, interleaving is equivalent to the two-step process of concatenating two or more data sets and then sorting them. The following program illustrates this point by using the two-step process to create a data set four that is identical to the data set three:
DATA unsortedfour;
set one two;
RUN;
PROC PRINT data = unsortedfour NOOBS;
title 'The unsortedfour data set';
RUN;
PROC SORT data = unsortedfour out = four;
by year;
RUN;
PROC PRINT data = four NOOBS;
title '
The four data set';
RUN;
year | x |
---|---|
2000 | 1 |
2001 | 2 |
2001 | 5 |
2002 | 3 |
2002 | 6 |
2003 | 4 |
2003 | 7 |
2004 | 8 |
Launch and run the SAS program, and review the output from the last PRINT procedure to convince yourself that the four data set is identical to the interleaved three data set.
15.5 - Data Step Options
15.5 - Data Step OptionsAll of the DATA step options — FIRSTOBS=, OBS=, RENAME=, WHERE=, DROP=, KEEP=, and IN= — that we learned about in the previous lesson can be used at the same time that you are one-to-one reading, one-to-one merging, concatenating, or interleaving data sets. If you attach a DATA step option to a SET or MERGE statement, SAS takes action when observations are being read from the input data sets. And, if you attach a DATA step option to a DATA statement, SAS takes action when observations are being written to the output data set. We'll take a look at just one example that involves merging two data sets that contain identical variable names.
Example 15.13
The following program attempts to one-to-one merge two data sets — firstnames and lastnames. The firstnames data set contains a variable called name that contains the first names of five individuals, and the lastnames data set also contains a variable called name that contains the last names of four individuals:
DATA firstnames;
input subj 5-9 name $ 10-16 gender 19
height 21-22 weight 24-26;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benny 2 68 190
1302 Felicia 1 63 115
;
RUN;
DATA lastnames;
input name $ 4-9 sysbp 11-13 diasbp 14-15;
CARDS;
Smith 120 80
White 130 90
Jones 125 72
Arnold 135 95
;
RUN;
DATA alldata;
merge firstnames lastnames;
RUN;
PROC PRINT data=alldata NOOBS;
title 'The alldata data set';
RUN;
subj | name | gender | height | weight | sysbp | diasbp |
---|---|---|---|---|---|---|
1024 | Smith | 1 | 65 | 125 | 120 | 8 |
1167 | White | 1 | 68 | 140 | 130 | 9 |
1168 | Jones | 2 | 68 | 190 | 125 | 7 |
1201 | Arnold | 2 | 68 | 190 | 135 | 9 |
1302 | Felicia | 1 | 63 | 115 | . | . |
The first two DATA steps, of course, just tell SAS to read the data values into the firstnames and lastnames data sets. The third contains a MERGE statement that tells SAS to one-to-one merge the observations in the firstnames and lastnames data sets. Launch and run the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. As you can see, because the variable name exists in both data sets, the value from the data set appearing last in the MERGE statement, i.e., lastnames, is the value that goes into the new data set. That is, wherever an observation exists in lastnames, it effectively writes over the value of name from firstnames.
In order to merge the firstnames and lastnames data sets correctly, we must first change the name of the variable name to something different in one of the two data sets. We'll change the name in both data sets. The following program illustrates changing the variable name to f_name in the firstnames data set and to l_name in the lastnames data set, while simultaneously merging firstnames and lastnames in a one-to-one manner:
DATA alldata2;
merge firstnames (rename = (name=f_name))
lastnames (rename = (name=l_name));
RUN;
PROC PRINT data=alldata2 NOOBS;
title 'The alldata2 data set';
RUN;
subj | f_name | gender | height | weight | l_name | sysbp | diasbp |
---|---|---|---|---|---|---|---|
1024 | Alice | 1 | 65 | 125 | Smith | 120 | 8 |
1167 | Maryann | 1 | 68 | 140 | White | 130 | 9 |
1168 | Thomas | 2 | 68 | 190 | Jones | 125 | 7 |
1201 | Benny | 2 | 68 | 190 | Arnold | 135 | 9 |
1302 | Felicia | 1 | 63 | 115 | . | . |
Because the first RENAME= option is attached to the firstnames data set on the MERGE statement, it tells SAS to change the name variable to f_name when SAS reads in observations from the firstnames data set. And, because the second RENAME= option is attached to the lastnames data set on the MERGE statement, it tells SAS to change the name variable to l_name when SAS reads in observations from the lastnames data set. At the same time that SAS is executing the name changes, it performs the one-to-one merge of the firstnames and lastnames data sets.
Launch and run the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. As you can see, we have now successfully combined the two data sets so that all of their original information is preserved.
15.6 - Summary
15.6 - SummaryIn this lesson, we learned how to combine data sets in four different ways — one-to-one reading, one-to-one merging, concatenating, and interleaving.
The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.