32.7 - Querying Multiple Tables32.7 - Querying Multiple Tables
So far, all the examples in this lesson are querying a single table. However, as matter of fact, you can specify multiple tables in the FROM clause. Querying more than more table at a time makes PROC SQL even more powerful in data manipulation.
The following examples use two tables:
Survey Data (survey.sas7bdat) contains:
ID, Gender, GPA, SmokeCigarrets, SATM, SATV
Survey2 Data (survey2.sas7bdat) contains:
ID, Seating, DiveInfluence, Height, Weight
Download these two tables if you have not done so. Revise the
libnameto reflect the directory that you save the files.
The following program attempts to get demographic information about students from two separate tables, survey and survey2:
PROC SQL; create table demo_info as select ID, Gender, Height, Weight from stat482.survey, stat482.survey2; QUIT;
Let’s review the code. In this SQL procedure, we used the CREATE TABLE clause to save and name the new table as
demo_info. The subsequent SELECT clause chooses ID, gender, height and weight columns from two tables. In FROM clause, two tables’ names are listed.
Launch and run the SAS program. You should expect no result in the output window because the CREATE TABLE clause suppresses output. On the other hand, check the log window and you will find the error message: “Ambiguous reference, column ID is in more than one table”.
ERROR: Ambiguous reference, column ID is in more than one table.
As you observed two tables, ID is in both tables and contains the same information. If a column in the SELECT statement appears in multiple tables, the table it is chosen from has to be specified by adding the table’s name in front as this:
So to make it right, we revise the previous program a little bit: change ID to survey.ID, which means that we use ID from survey data. The other change is the tables’ names. You can give a table an alias with or without the keyword AS after its original name. In the following program, we use S1 for survey data and S2 to survey2 data. And as you can see, it’s okay to use one level alias even for a permanent file. This makes life easier! In this way, ID can be specified as S1.ID.
PROC SQL; create table demo_info as select s1.ID, Gender, Height, Weight from stat482.survey as s1, stat482.survey2 as s2; QUIT;
Everything seems good. Now launch and run the SAS program. As before, there is no output because of the CREATE TABLE statement. Check the log file in which there are two notes that need your attention.
NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.
NOTE: Table WORK.DEMO_INFO created, with 51076 rows and 4 columns.
The first is “ The execution of this query involves performing one or more Cartesian product joins that can not be optimized”. What is a Cartesian product? It refers to a query result in which each row in the first table is combined with every row in the second table. If you specify multiple tables in FROM clause but do not use a WHERE clause to choose needed rows, a Cartesian product is generated. For example,
If we submit the following program:
from table1, table2;
Table1 has 3 rows; Table2 has 3 rows as well. Their Cartesian product contains (3*3)9 rows.
In the program for this example, there is no WHERE clause. So SAS generated a Cartesian product and gave you the note. Both Survey and Survey2 have 226 rows in the table. The query should have (226*226) = 51076 rows as the result. That’s why you got the other note, “Table Work.demo_info created, with 51076 rows and 4 columns.” Clearly, this can’t be correct. How do we get the desired result? Let’s make a final push.
The following program selects the demographic information of students (ID, gender, height and weight) from two tables, survey and survey2:
PROC SQL; create table demo_info as select s1.ID, Gender, Height, Weight from stat482.survey as s1, stat482.survey2 as s2 where s1.ID = s2.ID; select * from demo_info; QUIT;
Let’s check through the code. Only one more clause has been added to the query, WHERE. We use the WHERE clause to subset the whole Cartesian product by only selecting the rows with matched ID numbers. Note that the column names in the WHERE clause do not have to be the same. At last, to be able to check the table in person, another query is added to display the data in the output window.
Launch and run the SAS program, and review the log file and the output.
NOTE: Table WORK.DEMO_INFO creates, with 226 rows and 4 columns.
Finally, we got what we want. As you can see from the query result, it’s like combining two columns from each table horizontally. SAS also call it join. In this particular case, since we only chose the matched rows, it’s also called the inner join. Such type of join is very similar to Merge By in the DATA step but requiring less computing resources and less coding. There are other types of join and data union (a vertical combination of rows) in PROC SQL which are beyond this lesson’s scope. If you are interested, you can explore them yourself with the foundation of this lesson!