# 15.3 - Concatenating Two or More Data Sets

15.3 - Concatenating Two or More Data SetsTo concatenate two or more SAS data sets means to stack one "on top" of the other into a single SAS data set. For example, suppose the data set *store1* contains three variables, `store` (number), `day `(of the week), and `sales` (in dollars):

Store | Day | Sale |
---|---|---|

1 | M | 1200 |

1 | T | 1435 |

1 | W | 1712 |

1 | R | 1529 |

1 | F | 1920 |

1 | S | 2325 |

and the data set *store2* contains the same three variables:

Store | Day | Sales |
---|---|---|

2 | M | 2215 |

2 | T | 2458 |

2 | W | 1789 |

2 | R | 1692 |

2 | F | 2105 |

2 | S | 2847 |

Then, when we concatenate the two data sets, we get what I like to call a "tall" data set:

Store | Day | Sales |
---|---|---|

1 | M | 1200 |

1 | T | 1435 |

1 | W | 1712 |

1 | R | 1529 |

1 | F | 1920 |

1 | S | 2325 |

2 | M | 2215 |

2 | T | 2458 |

2 | W | 1789 |

2 | R | 1692 |

2 | F | 2105 |

2 | S | 2847 |

in which the data sets are stacked on top of each other. Note that the number of observations in the new data set is the sum of the numbers of observations in the original data sets. To concatenate SAS data sets, you simplify specify a list of data set names in *one* SET statement.

## Example 15.6

The following program concatenates the *store1* and *store2* data sets to create a new "tall" data set called *bothstores*:

` ````
DATA store1;
input Store Day $ Sales;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store2;
input Store Day $ Sales;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores;
set store1 store2;
RUN;
PROC PRINT data = bothstores NOOBS;
title 'The bothstores data set';
RUN;
```

Note that the input data sets — *store1* and *store2* — contain the same variables — `Store`, `Day`, and `Sales` — with identical attributes. In the third DATA step, the DATA statement tells SAS to create a new data set called *bothstores*, and the SET statement tells SAS that the data set should contain first the observations from *store1* and then the observations from *store2*. Note that although we have specified only two input data sets here, the SET statement can contain any number of input data sets.

Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that SAS did indeed concatenate the *store1* and *store2* data sets to make one "tall" data set called *bothstores*. You might then want to edit the SET statement so that *store1* follows *store2*, and re-run * * the SAS program to see that then the contents of *store1* follow the contents of *store2* in the *bothstores* data set.

In general, a data set that is created by concatenating data sets contains *all* of the variables and *all* of the observations from *all* of the input data sets. Therefore, the number of variables the new data set contains always equals the total number of unique variables among all of the input data sets. And, the number of observations in the new data set is the sum of the numbers of observations in the input data sets. Let's return to the contrived example we've used throughout this lesson.

## Example 15.7

The following program concatenates the *one* and *two *data sets to create a new "tall" data set called *onetopstwo*:

```
DATA one;
input ID VarA $ VarB $;
DATALINES;
10 A1 B1
20 A2 B2
30 A3 B3
;
RUN;
DATA two;
input ID VarB $ VarC $;
DATALINES;
40 B4 C1
50 B5 C2
;
RUN;
DATA onetopstwo;
set one two;
RUN;
PROC PRINT data = onetopstwo NOOBS;
title 'The onetopstwo data set';
RUN;
```

As you review the first two DATA steps, in which SAS reads in the respective *one* and *two* data sets, note that the total number of unique variables is four — `ID`, `VarA`, `VarB`, and `VarC`. The total number of observations among the two input data sets is 3 + 2 = 5. Therefore, we can expect the concatenated data set *onetopstwo* to contain four variables and five observations. Launch and run * * the SAS program, and review the output to convince yourself that SAS did grab first all of the variables and all of the observations from the *one* data set and then all of the variables and all of the observations from the *two* data set. As you can see, to make it all work out okay, observations arising from the *one* data set have missing values for `VarC`, and observations from the *two* data set have missing values for `VarA`.

### When Variable Attributes Differ

As you know, variable attributes include the type of variable (character vs. numeric), the informat (how the variable is read in) and format (how its values are printed) of a variable, the length of the variable, and the label (how its variable name is printed) of a variable. Concatenating data sets when variable attributes differ across the input data sets may pose problems for SAS (and therefore you):

- If the data sets you name in the SET statement contain variables with the same names and types, you can concatenate the data sets without modification.
- If the variable types differ, you
*must*modify one or more of the data sets before concatenating them. SAS will not concatenate the data sets until you do. - If the lengths, formats, informats or labels differ, you
*may want*to modify one or more of the data sets before concatenating them. SAS will concatenate the data sets; you may just not like the results.

## Example 15.8

If a variable is defined as numeric in one data set named in the SET statement and as character in another data set, SAS issues an error message and will not concatenate the data sets. The following program attempts to concatenate the *store3* and *store4* data sets, when the `Store` variable is defined as character in the *store3* data set, but as numeric in the *store4* data set:

```
DATA store3;
input Store $ 1 Day $ 3 Sales 5-8;
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store4;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
```

Launch and run * * the SAS program, and review the log window. You can see that SAS balks at the idea of concatenating the two data sets and gives us a "dead in the water" error message:

In order to concatenate the *store3* and *store4* data sets successfully, we need to change either the *Store* character variable in the *store3* data set to numeric or the *Store* numeric variable in the *store4* data set to character. As you know from our work in Stat 480, to perform an explicit character-to-numeric conversion of the `Store` variable in the *store3* data set, we'd have to use the INPUT function. Alternatively, we could use the PUT function to perform an explicit numeric-to-character conversion of the `Store` variable in the *store4* data set. That's what the following DATA step does:

```
DATA store4 (rename = (Store2 = Store));
set store4;
Store2 = put(Store,1.);
drop Store;
RUN;
PROC CONTENTS data = store3;
title 'Contents of store3';
RUN;
PROC CONTENTS data = store4;
title 'Contents of store4';
RUN;
```

If you launch and run * * the SAS program, and review the output from the CONTENTS procedures, you can see that now the `Store` variable in the *store3* data set is defined as character, as is the `Store` variable in the *store4* data set.

How does the conversion work? Well, of course, it's all because of the PUT function. The general form of the PUT function is:

`PUT(`*source*,*format*)

where `source` is a numeric variable, constant or expression to be converted to a character value, and `format` is a valid format that matches the data type of the source variable. In our example, `Store` is the numeric variable that we want to convert to a character variable, and since the values for `Store` are one-digit numbers, the numeric format `1.` is an appropriate format to specify.

The assignment statement in our program:

`Store2 = put(Store, 1.);`

thus tells SAS to convert the numeric values in `Store` to character values and to store the results in a character variable called `Store2`. A little detail issue here ... if we do not use a different variable name than `Store` on the left side of the equation, SAS will be forced to do an automatic numeric-to-character conversion rather than the explicit numeric-to-character conversion that we desire. The one other statement in the DATA step, as well as the RENAME= data set option, is necessary just because of this naming issue. The DROP statement tells SAS to drop the `Store` variable, since we now have the desired character variable `Store2`. And, the RENAME= option tells SAS to rename the `Store2` variable back to `Store`, just so it matches the original name of the variable.

Now that we've seen both the PUT function and INPUT function in action, it pays to summarize:

- To perform an explicit numeric-to-character conversion, use the
`(`with a numeric source and a numeric format.*source*,*format*) - To perform an explicit character-to-numeric conversion, use the
`INPUT(`function with a character source and a numeric informat.*source*,*informat*)

To remember which function requires a format versus an informat, note that the **IN**PUT function requires and **in**format.

Ahhh... but back to the task on hand. Finally, the following program allows us to accomplish our original goal of concatenating the *store3* and *store4 *data sets:

```
DATA bothstores2;
set store3 store4;
RUN;
PROC PRINT data = bothstores2 NOOBS;
title 'The bothstores2 data set';
RUN;
```

Simple enough! Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that SAS successfully concatenated *store3* and *store4* into a data set called *bothstores2*.

Now that we've taken a look at what we need to do when we want to concatenate data sets containing the same variables but with different variable types, let's address a few more attributes. When data sets are concatenated with the SET statement, the formats, informats, and labels associated with the variables in the new data set are determined by the following rule:

*If two or more data sets define different formats, informats, or labels for the same variable, the variable in the new data set takes the attribute from the first data set in the SET statement that contains the variable with the attribute.*

## Example 15.9

The following program creates two data sets — *store5* and *store6* — that intentionally contain different labels and different formats for the `Sales` variable:

```
DATA store5;
input Store 1 Day $ 3 Sales 5-8;
format Sales comma5.;
label Sales = 'Total Sales';
DATALINES;
1 M 1200
1 T 1435
1 W 1712
1 R 1529
1 F 1920
1 S 2325
;
RUN;
DATA store6;
input Store 1 Day $ 3 Sales 5-8;
format Sales dollar6.;
label Sales = 'Sales for Day';
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store5;
title 'Contents of the store5 data set';
RUN;
PROC CONTENTS data = store6;
title 'Contents of the store6 data set';
RUN;
```

As you can see, the first DATA step tells SAS to set the format and label for the `Sales` variable as *comma5.* and Total Sales, respectively. And, the second DATA step tells SAS to set the format and label for the `Sales` variable as *dollar6.* and Sales for Day, respectively. Launch and run * * the SAS program, and review the output from the CONTENTS procedures to convince yourself that the formats and labels for the two versions of the `Sales` variable have been assigned as described.

Now, the following program tells SAS to store first the observations from the *store5* data set, followed by the observations from the *store6* data set, into a new data set called *bothstores3*:

` ````
DATA bothstores3;
set store5 store6;
RUN;
PROC PRINT data = bothstores3 NOOBS LABEL;
title 'The bothstores3 data set';
RUN;
```

Recall the rule that if two data sets explicitly define different formats and labels for the same variable, then the variable in the new data set takes the attribute from the first data set in the SET statement that explicitly defines the attribute. That said, we should expect the `Sales` variable in the *bothstores3* data set to take on the attributes of `Sales` in the *store5* data set. Specifically, `Sales` should be formatted as *comma5.* and labeled as Total Sales. Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the *store5* and *store6* data sets. That is, the program tells SAS to store first the observations from the *store6* data set, followed by the observations from the *store5* data set, into a new data set called *bothstores4*:

` ````
DATA bothstores4;
set store6 store5;
RUN;
PROC PRINT data = bothstores4 NOOBS LABEL;
title 'The bothstores4 data set';
RUN;
```

According to the rule, we should expect the `Sales` variable in the *bothstores4* data set to take on the attributes of `Sales` in the *store6* data set. Specifically, `Sales` should be formatted as *dollar6.* and labeled as Sales for Day. Launch and run * * the SAS program, and review the output from the PRINT procedure to convince yourself that the concatenation behaved as expected.

And, now onto one last attribute, namely that of variable lengths.** **When the SET statement is used to concatenate data sets in which the same variable is assigned different lengths, the lengths of the variables in the new data set are determined by the following rule:

*If two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement.*

## Example 15.10

The following program creates two data sets — *store7* and *store8* — that intentionally contain different lengths for the numeric variable `Store` and the character variable `Day`:

```
DATA store7;
length Store 4;
input Store 1 Day $ 3-5 Sales 7-10;
DATALINES;
1 Mon 1200
1 Tue 1435
1 Wed 1712
1 Thu 1529
1 Fri 1920
1 Sat 2325
;
RUN;
DATA store8;
input Store 1 Day $ 3 Sales 5-8;
DATALINES;
2 M 2215
2 T 2458
2 W 1798
2 R 1692
2 F 2105
2 S 2847
;
RUN;
PROC CONTENTS data = store7;
title 'Contents of the store7 data set';
RUN;
PROC CONTENTS data = store8;
title 'Contents of the store8 data set';
RUN;
```

As you can see, the LENGTH statement in the first DATA step tells SAS to set the length for the numeric variable `Store` to 4 in the *store7* data set. Since the character variable `Day` is not explicitly mentioned in a LENGTH statement, SAS sets the length of `Day` in the *store7* data set to 3, the number of columns specified in the INPUT statement. And, since neither the numeric variable `Store` nor the character variable `Day` are explicitly mentioned in a LENGTH statement in the second DATA step, SAS sets the length of `Store` in *store8* to 8 by default, and the length of `Day` to 1, the number of columns specified in the INPUT statement. Launch and run * * the SAS program, and review the output from the CONTENTS procedures to convince yourself that the lengths for the two versions of the `Store` and `Day` variables have been assigned as described.

Now, the following program tells SAS to store first the observations from the *store7* data set, followed by the observations from the *store8* data set, into a new data set called *bothstores5*:

```
DATA bothstores5;
set store7 store8;
RUN;
PROC PRINT data = bothstores5 NOOBS LABEL;
title 'The bothstores5 data set';
RUN;
PROC CONTENTS data = bothstores5;
title 'Contents of the bothstores5 data set';
RUN;
```

Recall the rule that if two or more data sets have different lengths for the same variable, the variable in the new data set takes the length of the variable from the data set that appears first in the SET statement. That said, we should expect the `Store` variable in the *bothstores5* data set to take on the attributes of `Store` in the *store7* data set. Specifically, `Store` should be assigned a length of 4. Likewise, we should expect the `Day` variable in the *bothstores5* data set to take on the attributes of `Day` in the *store7* data set. Specifically, `Day` should be assigned a length of 3. Launch and run * * the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected.

The following program reverses the order of the concatenation of the *store7* and *store8* data sets. That is, the program tells SAS to store first the observations from the *store8* data set, followed by the observations from the *store7* data set, into a new data set called *bothstores6*:

` ````
DATA bothstores6;
set store8 store7;
RUN;
PROC PRINT data = bothstores6 NOOBS LABEL;
title 'The bothstores6 data set';
RUN;
PROC CONTENTS data = bothstores6;
title 'Contents of the bothstores6 data set';
RUN;
```

According to the rule, we should expect the `Store` variable in the *bothstores6* data set to take on the attributes of `Store` in the *store8* data set. Specifically, *Store* should be assigned a length of 8. Likewise, we should expect the *Day* variable in the *bothstores6* data set to take on the attributes of `Day` in the *store8* data set. Specifically, `Day` should be assigned a length of 1. Launch and run * * the SAS program, and review the output from the PRINT and CONTENTS procedures to convince yourself that the concatenation behaved as expected. You should note in particular how SAS is forced to truncate the values of the `Day` variable in order to get them to fit within their assigned length of 1.