32.7 - Querying Multiple Tables

So far, all the examples in this lesson are querying a single table. However, as matter of fact, you can specify multiple tables in the FROM clause. Querying more than more table at a time makes PROC SQL even more powerful in data manipulation.

The following examples use two tables:

Survey Data (survey.sas7bdat) contains:

ID, Gender, GPA, SmokeCigarrets, SATM, SATV

id Gender GPA SmokeCig SATM SATV
1001 Male 2.67 No 700 700
1002 Female 2.98 No 700 500
1003 Female 2.67 No 470 470
1004 Female 3.6 No 710 560
1005 Female 3.76 No 600 520
1006 Male 3.86 No 610 720
1007 Male 3.94 No 710 670
1008 Male 2.8 Yes 610 580
1009 Male 3.48 No 690 620

Survey2 Data (survey2.sas7bdat) contains:

ID, Seating, DiveInfluence, Height, Weight

id Seating DriverInfluen Height Weight
1001 Middle No 68 190
1002 Middle No 54 110
1003 Middle No 65 225
1004 Middle No 52 135
1005 Back No 72 128
1006 Middle No 70 188
1007 Back No 70 155
1008 Middle Yes 68 160
1009 Front No 72 160

Download these two tables if you have not done so. Revise the libnameto reflect the directory that you save the files.

Example 32.26 Section

The following program attempts to get demographic information about students from two separate tables, survey and survey2:

PROC SQL;
	create table demo_info as
	select ID,
		Gender,
		Height,
		Weight
	from stat482.survey, stat482.survey2;
QUIT;

Let’s review the code. In this SQL procedure, we used the CREATE TABLE clause to save and name the new table as demo_info. The subsequent SELECT clause chooses ID, gender, height and weight columns from two tables. In FROM clause, two tables’ names are listed.

Launch and run the SAS program. You should expect no result in the output window because the CREATE TABLE clause suppresses output. On the other hand, check the log window and you will find the error message: “Ambiguous reference, column ID is in more than one table”.

ERROR: Ambiguous reference, column ID is in more than one table.

As you observed two tables, ID is in both tables and contains the same information. If a column in the SELECT statement appears in multiple tables, the table it is chosen from has to be specified by adding the table’s name in front as this:

Table.Column

So to make it right, we revise the previous program a little bit: change ID to survey.ID, which means that we use ID from survey data. The other change is the tables’ names. You can give a table an alias with or without the keyword AS after its original name. In the following program, we use S1 for survey data and S2 to survey2 data. And as you can see, it’s okay to use one level alias even for a permanent file. This makes life easier! In this way, ID can be specified as S1.ID.

PROC SQL;
		create table demo_info as
		select s1.ID,
			Gender,
			Height,
			Weight
	from stat482.survey as s1, stat482.survey2 as s2;
QUIT;

Everything seems good. Now launch and run the SAS program. As before, there is no output because of the CREATE TABLE statement. Check the log file in which there are two notes that need your attention.

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

NOTE: Table WORK.DEMO_INFO created, with 51076 rows and 4 columns.

The first is “ The execution of this query involves performing one or more Cartesian product joins that can not be optimized”. What is a Cartesian product? It refers to a query result in which each row in the first table is combined with every row in the second table. If you specify multiple tables in FROM clause but do not use a WHERE clause to choose needed rows, a Cartesian product is generated. For example,

If we submit the following program:

PROC SQL;
Select *
from table1, table2;

Table1 has 3 rows; Table2 has 3 rows as well. Their Cartesian product contains (3*3)9 rows.

Table1

name value1
x 1
y 2
z 3
\(\times \)

Table2

name value2
A 4
B 5
C 6
\(=\)

Result:

name value1 name value2
x 1 A 4
x 1 B 5
x 1 C 6
y 2 A 4
y 2 B 5
y 2 C 6
z 3 A 4
z 3 B 5
z 3 C 6

In the program for this example, there is no WHERE clause. So SAS generated a Cartesian product and gave you the note. Both Survey and Survey2 have 226 rows in the table. The query should have (226*226) = 51076 rows as the result. That’s why you got the other note, “Table Work.demo_info created, with 51076 rows and 4 columns.” Clearly, this can’t be correct. How do we get the desired result? Let’s make a final push.

Example 32.27 Section

The following program selects the demographic information of students (ID, gender, height and weight) from two tables, survey and survey2:

PROC SQL;
	create table demo_info as
	select s1.ID,
			Gender,
			Height,
			Weight
	from stat482.survey as s1, stat482.survey2 as s2
	where s1.ID = s2.ID;
	select *
	from demo_info;
QUIT;

Let’s check through the code. Only one more clause has been added to the query, WHERE. We use the WHERE clause to subset the whole Cartesian product by only selecting the rows with matched ID numbers. Note that the column names in the WHERE clause do not have to be the same. At last, to be able to check the table in person, another query is added to display the data in the output window.

Launch and run the SAS program, and review the log file and the output.

NOTE: Table WORK.DEMO_INFO creates, with 226 rows and 4 columns.

Finally, we got what we want. As you can see from the query result, it’s like combining two columns from each table horizontally. SAS also call it join. In this particular case, since we only chose the matched rows, it’s also called the inner join. Such type of join is very similar to Merge By in the DATA step but requiring less computing resources and less coding. There are other types of join and data union (a vertical combination of rows) in PROC SQL which are beyond this lesson’s scope. If you are interested, you can explore them yourself with the foundation of this lesson!