# Lesson 15: Combining SAS Data Sets -- Part I

Lesson 15: Combining SAS Data Sets -- Part I## Overview

In this lesson, we will learn how to combine data sets in four different ways — one-to-one reading, one-to-one merging, concatenating and interleaving. Although one-to-one reading and one-to-one merging both involve placing one data set to "to the right" of other data sets to create a single "fat" data set, the results can differ slightly. Concatenating involves stacking one data set "below" other data sets to create a single "tall" data set. Interleaving also involves stacking data sets, but it only deals with sorted data sets. That is, interleaving combines individual sorted data sets into one big sorted data set. Once we've learned the four mentioned ways of combining SAS data sets, we'll also learn how to use DATA step options at the same time that we one-to-one read, one-to-one merge, concatenate or interleave two or more SAS data sets.

## Objectives

Upon completing this lesson, you should be able to do the following:

- write SAS code to one-to-one read two or more SAS data sets
- predict the size of the output data set when one-to-one reading two or more SAS data sets
- write SAS code to one-to-one merge two or more SAS data sets
- predict the size of the output data set when one-to-one merging two or more SAS data sets
- write SAS code to concatenate two or more SAS data sets
- predict the size of the output data set when concatenating two or more SAS data sets
- understand how SAS handles concatenating data sets when variable types differ across the input data sets
- understand how SAS handles concatenating data sets when formats, informats, or labels differ across the input data sets
- understand how SAS handles concatenating data sets when variables have different lengths across the input data sets
- write SAS code to interleave two or more SAS data sets
- predict the size of the output data set when interleaving two or more SAS data sets
- use the PUT function to perform an explicit numeric-to-character conversion
- use any of the DATA step options at the same time that you one-to-one read, one-to-one merge, concatenate or interleave two or more SAS data sets

# 15.1 - One-to-One Reading

15.1 - One-to-One Reading**One-to-one reading** combines two or more SAS data sets, one "to the right" of the other into a single "fat" data set. That is, one-to-one reading combines observations from two or more data sets into a single observation in a new data set. For example, suppose the data set `patients` contains three variables: patient ID number (`ID`), gender (`Sex`), and age of the patient (`Age`):

ID | Sex | Age |
---|---|---|

1157 | F | 33 |

2395 | F | 48 |

1098 | M | 39 |

4829 | F | 24 |

3456 | M | 30 |

5920 | M | 41 |

1493 | F | 42 |

and the data set *scale* contains three variables: `ID` (number),` Height` (in inches) and `Weight` (in pounds):

ID | Height | Weight |
---|---|---|

1157 | 65 | 122 |

2395 | 64 | 130 |

1098 | 70 | 178 |

4829 | 67 | 142 |

3456 | 72 | 190 |

5920 | 71 | 188 |

Then, when we one-to-one read the two data sets, we get what I like to call a "fat" data set, called say *one2oneread*:

ID | Sex | Age | Height | Weight |
---|---|---|---|---|

157 | F | 33 | 65 | 122 |

2395 | F | 48 | 64 | 130 |

1098 | M | 39 | 70 | 178 |

4829 | F | 24 | 67 | 142 |

3456 | M | 30 | 72 | 190 |

5920 | M | 41 | 71 | 188 |

in which the second data set gets placed to the "right" of the first data set. Note that the observations are **combined based on their relative position** in the data set. The first observation of *patient* is combined with the first observation of *scale* to create the first observation in *one2oneread*; the second observation of *patient* is combined with the second observation of *scale* to create the second observation in *one2oneread*; and so on. The DATA step stops after it reads the last observation from the smallest data set. Therefore, the number of observations in the new data set always equals the numbers of observations in the smallest data set you name for one-to-one reading.

## Example 15.1

The following program uses one-to-one reading to combine the *patients* data set with the *scale* data set:

```
DATA patients;
input ID Sex $ Age;
DATALINES;
1157 F 33
2395 F 48
1098 M 39
4829 F 24
3456 M 30
5920 M 41
1493 F 42
;
RUN;
DATA scale;
input ID Height Weight;
DATALINES;
1157 65 122
2395 64 130
1098 70 178
4829 67 142
3456 72 190
5920 71 188
;
RUN;
DATA one2oneread;
set patients;
set scale;
RUN;
PROC PRINT NOOBS;
title 'The one2oneread data set';
RUN;
```

Of course, the first two DATA steps just read in the respective *patients* and *scale* data sets. The meat of the one-to-one read takes place in the third (and last) DATA step, in which we see two SET statements. The first SET statement tells SAS first to read the contents of the *patients* data set into the program data vector, and then the second SET statement tells SAS to read the contents of the *scale* data set into the program data vector.

Launch and run * * the SAS program, and review the output to convince yourself that the data sets are combined as described. You should note, in particular, that SAS does indeed stop reading after reaching the last observation in the *scale* data set. Hence, the combined data set, *one2oneread*, contains six observations, the number of observations in the smallest of the two data sets. Note, too, that the position of the variables in the *one2oneread* data set directly corresponds to the order in which the SET statements appear in the DATA step. Because the *scale* data set appears to the right of the *patients* data set in the SET statement, the variables from the *scale* data set appear to the right of the variables from the *patients* data set in the combined *one2oneread* data set.

## Example 15.2

The following program uses one-to-one reading to combine the *patients* data set with the *scale* data set in the reverse order from that of the previous program:

```
DATA one2oneread2;
set scale;
set patients;
RUN;
PROC PRINT NOOBS;
title 'The one2oneread2 data set';
RUN;
```

Note, here, that the SET statement for the *scale* data set appears first in the DATA step, followed by the SET statement for the *patients* data set. Launch and run * * the SAS program, and review the output to convince yourself that the variables from the *patients* data set appear to the right of the variables from the *scale* data set in the combined *one2oneread2* data set.

Hmmm... but what about the fact that the `ID` variable appears in both the *patients* and *scale* data sets, but it appears only once in the combined *one2oneread2* data set? How does SAS handle the situation? Well... in general, if the data sets contain variables that have the same names, the values that are read in from the last data set overwrite the values that were read in from earlier data sets. Let's take a look at a contrived example to illustrate this point.

## Example 15.3

The following program uses one-to-one reading to combine the `one` data set with the `two` data set to create a new data set called `onetwo`:

```
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetwo;
set one;
set two;
RUN;
PROC PRINT data = onetwo NOOBS;
title 'The onetwo data set';
RUN;
```

As you review the first two DATA steps, in which SAS reads in the respective `one` and `two `data sets, note that the two data sets share two variables, namely `ID` and `VarB. `The third (and last) DATA step tells SAS to combine the two data sets using the one-to-one reading method. Let's walk our way through how SAS processes the DATA step. At the end of the compile phase, SAS will have created a program data vector containing the variables from the one and two data sets in the order in which they appear in the DATA step:` `

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | . |

During the first iteration of the DATA step, the first SET statement reads one observation from data set `one`:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | 10 | A1 | B1 |

Then, the second SET statement reads one observation from data set `two`. The values for `ID` and `VarB` in data set `two` overwrites the values for `ID and VarB in data set one: `

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | 40 | A1 | B4 | C1 |

Being at the end of the first iteration of the DATA step, SAS writes the contents of the program data vector as the first observation in the `onetwo` SAS data set. Upon returning to the top of the DATA step, the program data vector looks like this at the beginning of the second iteration of the DATA step:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 40 | A1 | B4 | C1 |

Recall that SAS retains the values of variables that were read from a SAS data set with the SET statement. Now, the first SET statement reads the second observation from the `one` data set:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 20 | A2 | B2 | C1 |

And, the second SET statement reads the second observation from data set `two`. Again, the values for `ID` and `VarB` in data set `two` overwrites the values for `ID` and `VarB` in data set `one`:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 50 | A2 | B5 | C2 |

Being at the end of the second iteration of the DATA step, SAS writes the contents of the program data vector as the second observation in the `onetwo` SAS data set. Because there are no more observations in the `two` data set, processing stops. That is, the DATA step does not read the third observation from the `one` data set.

Now, launch and run * * the SAS program, and review the output to convince yourself that the `one` and *two* data sets are combined as described.

**One more comment**. Note that although each of our one-to-one reading examples involved combining just two data sets, you can specify any number of SET statements when one-to-one reading ... and therefore the sky's the limit.

# 15.2 - One-to-One Merging

15.2 - One-to-One MergingAt first glance, **one-to-one merging** appears to be the same as one-to-one reading, since it too combines two or more SAS data sets, one "to the right" of the other into a single "fat" data set. That is, just like one-to-one reading, one-to-one merging combines observations from two or more data sets into a single observation in a new data set. There is just one primary difference though — SAS continues to merge observations until it has read *all* of the observations from *all* of the data sets. For example, suppose again that our *patients* data set contains three variables: patient ID number (`ID`), gender (`Sex`), and age of the patient (`Age`):

ID | Sex | Age |
---|---|---|

1157 | F | 33 |

2395 | F | 48 |

1098 | M | 39 |

4829 | F | 24 |

3456 | M | 30 |

5920 | M | 41 |

1493 | F | 42 |

and our *scale* data set contains three variables: `ID` (number),` Height` (in inches) and `Weight` (in pounds):

ID | Height | Weight |
---|---|---|

1157 | 65 | 122 |

2395 | 64 | 130 |

1098 | 70 | 178 |

4829 | 67 | 142 |

3456 | 72 | 190 |

5920 | 71 | 188 |

Then, when we one-to-one merge the two data sets, we get a data set, called say *one2onemerge*, that looks like this:

ID | Sex | Age | Height | Weight |
---|---|---|---|---|

157 | F | 33 | 65 | 122 |

2395 | F | 48 | 64 | 130 |

1098 | M | 39 | 70 | 178 |

4829 | F | 24 | 67 | 142 |

3456 | M | 30 | 72 | 190 |

5920 | M | 41 | 71 | 188 |

1493 | F | 42 | . | . |

in which the second data set gets placed to the "right" of the first data set. Again, the observations are **combined based on their relative position** in the data set. The first observation of *patient* is combined with the first observation of *scale* to create the first observation in *one2onemerge*; the second observation of *patient* is combined with the second observation of *scale* to create the second observation in *one2onemerge*; and so on. When SAS performs a one-to-one merge, the DATA step continues to read observations until the last observation is read from the largest data set. That's why the *one2onemerge* data set has one more observation than the *one2oneread* data set. In general, the number of observations in a data set created by a one-to-one merge always equals the numbers of observations in the largest data set named for one-to-one merging.

## Example 15.4

The following program uses one-to-one merging to combine the *patients* data set with the *scale* data set:

```
DATA one2onemerge;
merge patients scale;
RUN;
PROC PRINT NOOBS;
title 'The one2onemerge data set';
RUN;
```

The MERGE statement tells SAS to create a new data set by combining *patients* and *scale*. Launch and run * * the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. You should see that the first observation in *one2onemerge* contains the first observation of *patients* and *scale*, the second observation in *one2onemerge* contains the second observation of *patients* and *scale*, and so on. Since there are seven observations in *patients* and six observations in *scale*, the new *one2onemerge *data set contains seven observations, with missing values for the `Height` and `Weight `variables in the seventh observation. Note that although this example only combined two data sets, the MERGE statement can contain any number of input data sets.

Just as is true for one-to-one reading, if data sets that are being one-to-one merged contain variables that have the same names, the values that are read in from the last data set overwrite the values that were read in from earlier data sets. Let's go back to our contrived example to illustrate this point.

## Example 15.5

The following program uses one-to-one merging to combine the *one* data set with the *two* data set to create a new data set called *onetwomerged*:

```
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetwomerged;
merge one two;
RUN;
PROC PRINT data = onetwomerged NOOBS;
title 'The onetwomerged data set';
RUN;
```

Note again that the *one* and *two* data sets share two variables, namely `ID` and `VarB`. The third (and last) DATA step tells SAS to combine the two data sets using the one-to-one merging method. Let's walk our way through how SAS processes the DATA step. At the end of the compile phase, SAS will have created a program data vector containing the variables from the *one* and *two* data sets in the order in which they appear in the DATA step:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | . | . |

During the first iteration of the DATA step, the MERGE statement reads first one observation from data set *one*:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | 10 | A1 | B1 | . |

and then one observation from data set *two*. The values for `ID` and `VarB` in data set *two* overwrites the values for `ID` and `VarB` in data set *one*:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

1 | 0 | 40 | A1 | B4 | C1 |

Being at the end of the first iteration of the DATA step, SAS writes the contents of the program data vector as the first observation in the *onetwomerged* data set. Upon returning to the top of the DATA step, the program data vector looks like this at the beginning of the second iteration of the DATA step:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 40 | A1 | B4 | C1 |

Just as is true for a SET statement, SAS retains the values of variables that were read from a SAS data set with the MERGE statement. Now, the MERGE statement reads the second observation from the *one* data set:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 20 | A2 | B2 | C1 |

and the second observation from the* two* data set. Again, the values for `ID` and `VarB` in data set *two* overwrites the values for `ID` and `VarB` in data set *one*:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

2 | 0 | 50 | A2 | B5 | C2 |

Being at the end of the second iteration of the DATA step, SAS writes the contents of the program data vector as the second observation in the *onetwomerged* data set. Now this is where things get different! Because an observation still exists in the *one* data set, SAS returns to the top of the DATA step. The program data vector looks like this at the beginning of the third iteration of the DATA step:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

3 | 0 | 50 | A2 | B5 | C2 |

And, the MERGE statement reads the third observation from the *one* data set:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

3 | 0 | 30 | A3 | B3 | C2 |

SAS attempts to read a third observation from the *two* data set, but instead encounters an end-of-data set marker. Therefore, *as is always the case in this kind of situation*, SAS sets the values of all of that data set's variables in the program data vector to missing:

_N_ | _ERROR_ | Num | VarA | VarB | VarC |
---|---|---|---|---|---|

3 | 0 | 30 | A3 | B |

Being at the end of the third iteration of the DATA step, SAS writes the contents of the program data vector as the third observation in the *onetwomerged* SAS data set. Because there are no more observations in either the *one* or the *two* data set, processing stops. Thank goodness!

Now, launch and run * * the SAS program, and review the output to convince yourself that the *one* and *two* data sets are combined as described.

**One more closing comment.** One-to-one reading and one-to-one merging require users to exercise extreme caution when combining two (or more) data sets based on relative position only. It would just take one of the data sets to be "shifted" ever so slightly to get really messed up results. It's for this reason that I personally don't find the one-to-one read or the one-to-one merge all that practical. The more useful and therefore much more common merge performed in SAS is what is called **match-merging**. We'll learn about it in the next lesson.

# 15.3 - Concatenating Two or More Data Sets

15.3 - Concatenating Two or More Data SetsTo concatenate two or more SAS data sets means to stack one "on top" of the other into a single SAS data set. For example, suppose the data set *store1* contains three variables, `store` (number), `day `(of the week), and `sales` (in dollars):

Store | Day | Sale |
---|---|---|

1 | M | 1200 |

1 | T | 1435 |

1 | W | 1712 |

1 | R | 1529 |

1 | F | 1920 |

1 | S | 2325 |

and the data set *store2* contains the same three variables:

Store | Day | Sales |
---|---|---|

2 | M | 2215 |

2 | T | 2458 |

2 | W | 1789 |

2 | R | 1692 |

2 | F | 2105 |

2 | S | 2847 |

Then, when we concatenate the two data sets, we get what I like to call a "tall" data set:

Store | Day | Sales |
---|---|---|

1 | M | 1200 |

1 | T | 1435 |

1 | W | 1712 |

1 | R | 1529 |

1 | F | 1920 |

1 | S | 2325 |

2 | M | 2215 |

2 | T | 2458 |

2 | W | 1789 |

2 | R | 1692 |

2 | F | 2105 |

2 | S | 2847 |

in which the data sets are stacked on top of each other. Note that the number of observations in the new data set is the sum of the numbers of observations in the original data sets. To concatenate SAS data sets, you simplify specify a list of data set names in *one* SET statement.

## Example 15.6

The following program concatenates the *store1* and *store2* data sets to create a new "tall" data set called *bothstores*:

` ````
DATA store1;
input Store Day $ Sales;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store2;
input Store Day $ Sales;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores;
set store1 store2;
RUN;
PROC PRINT data = bothstores NOOBS;
title 'The bothstores data set';
RUN;
```

Note that the input data sets — *store1* and *store2* — contain the same variables — `Store`, `Day`, and `Sales` — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called *bothstores*, and the SET statement tells SAS that the data set should contain first the observations from *store1* and then the observations from *store2*. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the *store1* and *store2* data sets to make one "tall" data set called *bothstores*. You might then want to edit the SET statement so that *store1* follows *store2*, and re-run * * the SAS program to see that then the contents of *store1* follow the contents of *store2* in the *bothstores* data set.

In general, a data set that is created by concatenating data sets contains *all* of the variables and *all* of the observations from *all* of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the numbers of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.

## Example 15.7

The following program concatenates the *one* and *two *data sets to create a new "tall" data set called *onetopstwo*:

```
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetopstwo;
set one two;
RUN;
PROC PRINT data = onetopstwo NOOBS;
title 'The onetopstwo data set';
RUN;
```

As you review the first two DATA steps, in which SAS reads in the respective *one* and *two* data sets, note that the total number of unique variables is four — `ID`, `VarA`, `VarB`, and `VarC`. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set *onetopstwo* to contain four variables and five observations. Launch and run * * the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the *one* data set and then all of the variables and all of the observations from the *two* data set. As you can see, to make it all work out okay, observations arising from the *one* data set have missing values for `VarC`, and observations from the *two* data set have missing values for `VarA`.

### When Variable Attributes Differ

As you know, variable attributes include the type of variable (character vs. numeric), the informat (how the variable is read in) and format (how its values are printed) of a variable, the length of the variable, and the label (how its variable name is printed) of a variable. Concatenating data sets when variable attributes differ across the input data sets may pose problems for SAS (and therefore you):

- If the data sets you name in the SET statement contain variables with the same names and types, you can concatenate the data sets without modification.
- If the variable types differ, you
*must*modify one or more of the data sets before concatenating them. SAS will not concatenate the data sets until you do. - If the lengths, formats, informats or labels differ, you
*may want*to modify one or more of the data sets before concatenating them. SAS will concatenate the data sets; you may just not like the results.

## Example 15.8

If a variable is defined as numeric in one data set named in the SET statement and as character in another data set, SAS issues an error message and will not concatenate the data sets. The following program attempts to concatenate the *store3* and *store4* data sets, when the `Store` variable is defined as character in the *store3* data set, but as numeric in the *store4* data set:

```
DATA store3;
input Store $ 1 Day $ 3 Sales 5-8;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store4;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
```

Launch and run * * the SAS program, and review the log window. You can see that SAS balks at the idea of concatenating the two data sets and gives us a "dead in the water" error message:

In order to concatenate the *store3* and *store4* data sets successfully, we need to change either the *Store* character variable in the *store3* data set to numeric or the *Store* numeric variable in the *store4* data set to character. As you know from our work in Stat 480, to perform an explicit character-to-numeric conversion of the `Store` variable in the *store3* data set, we'd have to use the INPUT function. Alternatively, we could use the PUT function to perform an explicit numeric-to-character conversion of the `Store` variable in the *store4* data set. That's what the following DATA step does:

```
DATA store4 (rename = (Store2 = Store));
set store4;
Store2 = put(Store,1.);
drop Store;
RUN;
PROC CONTENTS data = store3;
title 'Contents of store3';
RUN;
PROC CONTENTS data = store4;
title 'Contents of store4';
RUN;
```

If you launch and run * * the SAS program, and review the output from the CONTENTS procedures, you can see that now the `Store` variable in the *store3* data set is defined as character, as is the `Store` variable in the *store4* data set.

How does the conversion work? Well, of course, it's all because of the PUT function. The general form of the PUT function is:

`PUT(`*source*,*format*)

where `source` is a numeric variable, constant or expression to be converted to a character value, and `format` is a valid format that matches the data type of the source variable. In our example, `Store` is the numeric variable that we want to convert to a character variable, and since the values for `Store` are one-digit numbers, the numeric format `1.` is an appropriate format to specify.

The assignment statement in our program:

`Store2 = put(Store, 1.);`

thus tells SAS to convert the numeric values in `Store` to character values and to store the results in a character variable called `Store2`. A little detail issue here ... if we do not use a different variable name than `Store` on the left side of the equation, SAS will be forced to do an automatic numeric-to-character conversion rather than the explicit numeric-to-character conversion that we desire. The one other statement in the DATA step, as well as the RENAME= data set option, is necessary just because of this naming issue. The DROP statement tells SAS to drop the `Store` variable, since we now have the desired character variable `Store2`. And, the RENAME= option tells SAS to rename the `Store2` variable back to `Store`, just so it matches the original name of the variable.

Now that we've seen both the PUT function and INPUT function in action, it pays to summarize:

- To perform an explicit numeric-to-character conversion, use the
`(`with a numeric source and a numeric format.*source*,*format*) - To perform an explicit character-to-numeric conversion, use the
`INPUT(`function with a character source and a numeric informat.*source*,*informat*)

To remember which function requires a format versus an informat, note that the **IN**PUT function requires and **in**format.

Ahhh... but back to the task on hand. Finally, the following program allows us to accomplish our original goal of concatenating the *store3* and *store4 *data sets:

```
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
```

Simple enough! Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that SAS successfully concatenated *store3* and *store4* into a data set called *bothstores2*.

Now that we've taken a look at what we need to do when we want to concatenate data sets containing the same variables but with different variable types, let's address a few more attributes. When data sets are concatenated with the SET statement, the formats, informats, and labels associated with the variables in the new data set are determined by the following rule:

*If two or more data sets define different formats, informats, or labels for the same variable, the variable in the new data set takes the attribute from the first data set in the SET statement that contains the variable with the attribute.*

## Example 15.9

The following program creates two data sets — *store5* and *store6* — that intentionally contain different labels and different formats for the `Sales` variable:

```
DATA store5;
input Store 1 Day $ 3 Sales 5-8;
format Sales comma5.;
label Sales = 'Total Sales';
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store6;
input Store 1 Day $ 3 Sales 5-8;
format Sales dollar6.;
label Sales = 'Sales for Day';
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store5;
title 'Contents of the store5 data set';
RUN;
PROC CONTENTS data = store6;
title 'Contents of the store6 data set';
RUN;
```

As you can see, the first DATA step tells SAS to set the format and label for the `Sales` variable as *comma5.* and Total Sales, respectively. And, the second DATA step tells SAS to set the format and label for the `Sales` variable as *dollar6.* and Sales for Day, respectively. Launch and run * * the SAS program, and review the output from the CONTENTS procedures to convince yourself that the formats and labels for the two versions of the `Sales` variable have been assigned as described.

Now, the following program tells SAS to store first the observations from the *store5* data set, followed by the observations from the *store6* data set, into a new data set called *bothstores3*:

` ````
DATA bothstores3;
set store5 store6;
RUN;
PROC PRINT data = bothstores3 NOOBS LABEL;
title 'The bothstores3 data set';
RUN;
```

Recall the rule that if two data sets explicitly define different formats and labels for the same variable, then the variable in the new data set takes the attribute from the first data set in the SET statement that explicitly defines the attribute. That said, we should expect the `Sales` variable in the *bothstores3* data set to take on the attributes of `Sales` in the *store5* data set. Specifically, `Sales` should be formatted as *comma5.* and labeled as Total Sales. Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the *store5* and *store6* data sets. That is, the program tells SAS to store first the observations from the *store6* data set, followed by the observations from the *store5* data set, into a new data set called *bothstores4*:

` ````
DATA bothstores4;
set store6 store5;
RUN;
PROC PRINT data = bothstores4 NOOBS LABEL;
title 'The bothstores4 data set';
RUN;
```

According to the rule, we should expect the `Sales` variable in the *bothstores4* data set to take on the attributes of `Sales` in the *store6* data set. Specifically, `Sales` should be formatted as *dollar6.* and labeled as Sales for Day. Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

And, now onto one last attribute, namely that of variable lengths.** **When the SET statement is used to concatenate data sets in which the same variable is assigned different lengths, the lengths of the variables in the new data set are determined by the following rule:

*If two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement.*

## Example 15.10

The following program creates two data sets — *store7* and *store8* — that intentionally contain different lengths for the numeric variable `Store` and the character variable `Day`:

```
DATA store7;
length Store 4;
input Store 1 Day $ 3-5 Sales 7-10;
DATALINES;
1 Mon 1200
1 Tue 1435
1 Wed 1712
1 Thu 1529
1 Fri 1920
1 Sat 2325
;
RUN;
DATA store8;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store7;
title 'Contents of the store7 data set';
RUN;
PROC CONTENTS data = store8;
title 'Contents of the store8 data set';
RUN;
```

As you can see, the LENGTH statement in the first DATA step tells SAS to set the length for the numeric variable `Store` to 4 in the *store7* data set. Since the character variable `Day` is not explicitly mentioned in a LENGTH statement, SAS sets the length of `Day` in the *store7* data set to 3, the number of columns specified in the INPUT statement. And, since neither the numeric variable `Store` nor the character variable `Day` are explicitly mentioned in a LENGTH statement in the second DATA step, SAS sets the length of `Store` in *store8* to 8 by default, and the length of `Day` to 1, the number of columns specified in the INPUT statement. Launch and run * * the SAS program, and review the output from the CONTENTS procedures to convince yourself that the lengths for the two versions of the `Store` and `Day` variables have been assigned as described.

Now, the following program tells SAS to store first the observations from the *store7* data set, followed by the observations from the *store8* data set, into a new data set called *bothstores5*:

```
DATA bothstores5;
set store7 store8;
RUN;
PROC PRINT data = bothstores5 NOOBS LABEL;
title 'The bothstores5 data set';
RUN;
PROC CONTENTS data = bothstores5;
title 'Contents of the bothstores5 data set';
RUN;
```

Recall the rule that if two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement. That said, we should expect the `Store` variable in the *bothstores5* data set to take on the attributes of `Store` in the *store7* data set. Specifically, `Store` should be assigned a length of 4. Likewise, we should expect the `Day` variable in the *bothstores5* data set to take on the attributes of `Day` in the *store7* data set. Specifically, `Day` should be assigned a length of 3. Launch and run * * the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the *store7* and *store8* data sets. That is, the program tells SAS to store first the observations from the *store8* data set, followed by the observations from the *store7* data set, into a new data set called *bothstores6*:

` ````
DATA bothstores6;
set store8 store7;
RUN;
PROC PRINT data = bothstores6 NOOBS LABEL;
title 'The bothstores6 data set';
RUN;
PROC CONTENTS data = bothstores6;
title 'Contents of the bothstores6 data set';
RUN;
```

According to the rule, we should expect the `Store` variable in the *bothstores6* data set to take on the attributes of `Store` in the *store8* data set. Specifically, *Store* should be assigned a length of 8. Likewise, we should expect the *Day* variable in the *bothstores6* data set to take on the attributes of `Day` in the *store8* data set. Specifically, `Day` should be assigned a length of 1. Launch and run * * the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected. You should note in particular how SAS is forced to truncate the values of the `Day` variable in order to get them to fit within their assigned length of 1.

# 15.4 - Interleaving SAS data sets

15.4 - Interleaving SAS data setsProvided your data sets are properly sorted, you can **interleave** two or more SAS data sets. Interleaving combines individual sorted data sets into one big sorted data set based on the variables listed in a BY statement. For example, when data set *one*:

Year | X |
---|---|

2000 | 1 |

2001 | 2 |

2002 | 3 |

2003 | 4 |

and data set *two*:

Year | X |
---|---|

2001 | 5 |

2002 | 6 |

2003 | 7 |

2004 | 8 |

are interleaved by variable *year*, we get data set *three*:

Year | X |
---|---|

2000 | 1 |

2001 | 2 |

2001 | 5 |

2002 | 3 |

2002 | 6 |

2003 | 4 |

2003 | 7 |

2004 | 8 |

To interleave, you specify the data sets you want interleaved in the SET statement, and indicate on which variable you want the final data set sorted in the BY statement. You may interleave as many data sets as you'd like. The resulting data set contains all of the variables and all of the observations from all of the input data sets.

## Example 15.11

The following program interleaves the* one* and *two *data sets by *year*:

```
DATA one;
input year x;
DATALINES;
2000 1
2001 2
2002 3
2003 4
;
RUN;
DATA two;
input year x;
DATALINES;
2001 5
2002 6
2003 7
2004 8
;
RUN;
DATA three;
set one two;
by year;
RUN;
PROC PRINT data = three NOOBS;
title 'The interleaved three data set';
RUN;
```

The first two DATA steps, of course, just tell SAS to read the data values into the *one* and *two* data sets. The third, and most relevant DATA step to this topic, contains both a SET statement and a BY statement. That tells SAS that we want to interleave the data sets appearing in the SET statement (*one* and *two*) by the variable appearing in the BY statement (*year*) and to store the result in a data set called *three*. Launch and run * * the SAS program. Review the output from the PRINT procedure to convince yourself that the *three *data set contains the contents of the *one* and *two* data sets sorted by *year*.

## Example 15.12

As you may have noticed, interleaving is equivalent to the two-step process of concatenating two or more data sets and then sorting them. The following program illustrates this point by using the two-step process to create a data set *four *that is identical to the data set *three*:

```
DATA unsortedfour;
set one two;
RUN;
PROC PRINT data = unsortedfour NOOBS;
title 'The unsortedfour data set';
RUN;
PROC SORT data = unsortedfour out = four;
by year;
RUN;
PROC PRINT data = four NOOBS;
title 'The four data set';
RUN;
```

Launch and run * * the SAS program, and review the output from the last PRINT procedure to convince yourself that the *four* data set is identical to the interleaved *three *data set.

# 15.5 - Data Step Options

15.5 - Data Step OptionsAll of the DATA step options — FIRSTOBS=, OBS=, RENAME=, WHERE=, DROP=, KEEP=, and IN= — that we learned about in the previous lesson can be used at the same time that you are one-to-one reading, one-to-one merging, concatenating, or interleaving data sets. If you attach a DATA step option to a SET or MERGE statement, SAS takes action when observations are being read from the input data sets. And, if you attach a DATA step option to a DATA statement, SAS takes action when observations are being written to the output data set. We'll take a look at just one example that involves merging two data sets that contain identical variable names.

## Example 15.13

The following program attempts to one-to-one merge two data sets — *firstnames* and* lastnames*. The *firstnames* data set contains a variable called `name` that contains the first names of five individuals, and the *lastnames* data set also contains a variable called `name` that contains the last names of four individuals:

```
DATA firstnames;
input subj 5-9 name $ 10-16 gender 19
height 21-22 weight 24-26;
CARDS;
1024 Alice 1 65 125
1167 Maryann 1 68 140
1168 Thomas 2 68 190
1201 Benny 2 68 190
1302 Felicia 1 63 115
;
RUN;
DATA lastnames;
input name $ 4-9 sysbp 11-13 diasbp 14-15;
CARDS;
Smith 120 80
White 130 90
Jones 125 72
Arnold 135 95
;
RUN;
DATA alldata;
merge firstnames lastnames;
RUN;
PROC PRINT data=alldata NOOBS;
title 'The alldata data set';
RUN;
```

The first two DATA steps, of course, just tell SAS to read the data values into the *firstnames* and *lastnames * data sets. The third contains a MERGE statement that tells SAS to one-to-one merge the observations in the *firstnames* and *lastnames* data sets. Launch and run * * the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. As you can see, because the variable *name *exists in both data sets, the value from the data set appearing last in the MERGE statement, *i.e.*, *lastnames*, is the value that goes into the new data set. That is, whereever an observation exists in *lastnames*, it effectively writes over the value of *name* from *firstnames*.

In order to merge the *firstnames* and *lastnames* data sets correctly, we must first change the name of the variable `name `to something different in one of the two data sets. We'll change the name in both data sets. The following program illustrates changing the variable `name` to `f_name` in the *firstnames *data set and to `l_name` in the *lastnames* data set, while simultaneously merging *firstnames* and *lastnames *in a one-to-one manner:

```
DATA alldata2;
merge firstnames (rename = (name=f_name))
lastnames (rename = (name=l_name));
RUN;
PROC PRINT data=alldata2 NOOBS;
title 'The alldata2 data set';
RUN;
```

Because the first RENAME= option is attached to the *firstnames* data set on the MERGE statement, it tells SAS to change the `name` variable to `f_name` when SAS reads in observations from the *firstnames* data set. And, because the second RENAME= option is attached to the *lastnames* data set on the MERGE statement, it tells SAS to change the `name` variable to `l_name` when SAS reads in observations from the *lastnames* data set. At the same time that SAS is executing the name changes, it performs the one-to-one merge of the *firstnames* and *lastnames* data sets.

Launch and run * * the SAS program and review the output from the PRINT procedure to see the results of the one-to-one merge. As you can see, we have now successfully combined the two data sets so that all of their original information was preserved.

# 15.6 - Summary

15.6 - SummaryIn this lesson, we learned how to combine data sets in four different ways — one-to-one reading, one-to-one merging, concatenating and interleaving.

The homework for this lesson will give you more practice with these techniques so that you become even more familiar with how they work and can use them in your own SAS programming.