11.2.1  Modeling Ordinal Data with Loglinear Models
Earlier in the course we described ways to perform significance tests for independence and conditional independence, and to measure (linear) associations with ordinal categorical variables.
For example, we focused on the CMH statistic and correlation measures for testing independence and linear association in the heart disease and cholesterol level example. We concluded that the variables are not independent and that there is a significant linear association, M^{2} = 26.15, df = 1 (see Lesson 4, "Ordinal data" section).
                 Serum cholesterol (mg/100 cc)
          0–199   200–219   220–259   260+    total
CHD          12         8        31     41       92
no CHD      307       246       439    245     1237
total       319       254       470    286     1329
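Before fitting any model, it helps to look at the observed local odds ratios for adjacent cholesterol categories. A minimal Python sketch (the course materials use SAS and R; the counts below are taken from the table above):

```python
# Observed counts from the CHD x serum cholesterol table above.
# Rows: CHD, no CHD; columns: 0-199, 200-219, 220-259, 260+.
counts = [
    [12, 8, 31, 41],
    [307, 246, 439, 245],
]

def local_odds_ratio(table, i, j):
    """Odds ratio for the 2x2 subtable formed by rows i, i+1 and columns j, j+1."""
    return (table[i][j] * table[i + 1][j + 1]) / (table[i][j + 1] * table[i + 1][j])

local_ors = [round(local_odds_ratio(counts, 0, j), 2) for j in range(3)]
print(local_ors)  # -> [1.2, 0.46, 0.42]
```

The local odds ratios are not constant across adjacent columns, which is exactly the kind of pattern the models below try to summarize with a single association parameter.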

Can we answer the same questions (and more) via loglinear models?
Modeling Ordinal Data in 2-Way Tables
Loglinear models for contingency tables, by default, treat all variables as nominal variables.
If there is an ordering of the categories of the variables, it is not taken into account. That means we could rearrange the rows and/or columns of a table and get the same fitted odds ratios for the data as we would with the original ordering of the rows and/or columns.
To model ordinal data with loglinear models, we can apply some of the general ideas we saw with incomplete tables and the analysis of ordinal data from earlier in the semester.
That is, we typically
- assign scores to the levels of our categorical variables, and
- include additional parameters (involving these scores) in a loglinear model to model the dependency between the two variables.
Linear by Linear Association Model
This is the most common loglinear model when you have ordinal data.
Objective:
Modeling the log counts by accounting for the ordering of the categories of discrete variables.
Suppose we assign scores to the categories of the row variable, u_{1} ≤ u_{2} ≤ ... ≤ u_{I}, and to the categories of the column variable, v_{1} ≤ v_{2} ≤ ... ≤ v_{J}. These are numeric values used to describe the difference in magnitude between the categories. Then we can model the dependency between two variables, e.g., C = CHD and S = serum cholesterol.
Model Structure:
\(\text{log}(\mu_{ij})=\lambda+\lambda_i^C+\lambda_j^S+\beta u_i v_j\)
For each row i, the log fitted values are a linear function of the columns. For each column j, the log fitted values are a linear function of the rows.
Parameter estimates and interpretation:
This model has only one more parameter than the independence model (the association parameter β), so its complexity lies between the independence and saturated models. It captures a 'linear by linear association' between the two variables, based on the scores that you have assigned.
- If β > 0, then C and S are positively associated (i.e., C tends to go up as S goes up).
- If β < 0, then C and S are negatively associated.
- The odds ratio for any 2 × 2 subtable is a direct function of the row and column scores and β:
\begin{align}
\text{log }\theta(ij,i'j') &= \text{log}(\mu_{ij})+\text{log}(\mu_{i'j'})-\text{log}(\mu_{i'j})-\text{log}(\mu_{ij'})\\
&= \beta(u_i-u_{i'})(v_j-v_{j'})\\
\end{align}
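This formula is straightforward to compute directly. A short Python sketch (β and the scores here are illustrative values, not estimates from the data):

```python
import math

def subtable_odds_ratio(beta, u_i, u_i2, v_j, v_j2):
    """Odds ratio for the 2x2 subtable defined by rows with scores u_i, u_i2
    and columns with scores v_j, v_j2, under the linear-by-linear model."""
    return math.exp(beta * (u_i - u_i2) * (v_j - v_j2))

beta = 0.3  # hypothetical association parameter
# With integer scores, adjacent rows and columns give exp(beta):
print(subtable_odds_ratio(beta, 1, 2, 1, 2))  # exp(0.3 * (-1) * (-1)) = exp(0.3)
# Wider-spaced categories amplify the association:
print(subtable_odds_ratio(beta, 1, 2, 1, 4))  # exp(0.3 * (-1) * (-3)) = exp(0.9)
```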
Model Fit:
We use the G^{2} and ΔG^{2} as with any other loglinear model.
For our example, see HeartDiseaseLoglin.sas and the resulting output, HeartDiseaseLoglin.lst. In this SAS program we create two new numerical variables, as seen in the datalines statement.
Here is the R code for this example: HeartDiseaseLoglin.R.
We observe G^{2} = 4.09, df = 2, p-value = 0.13, which indicates that the linear by linear association model fits well, and significantly better than the independence model: ΔG^{2} = 27.832, df = 1, p-value < 0.001. Notice how close the values of ΔG^{2}, M^{2}, and the likelihood-ratio statistic for the "xCHD*yserum" parameter are under the significance testing for the individual parameters (e.g., the 'Type 3 Analysis' output of GENMOD).
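These fit statistics can be reproduced by hand from the observed counts and the fitted values ('Pred') reported in the GENMOD output further below. A Python sketch (the linear-by-linear fitted values are copied from that output; under independence the fitted value is row total × column total / n):

```python
import math

observed = [12, 8, 31, 41, 307, 246, 439, 245]
# Fitted values under the linear-by-linear model (SAS 'Pred' column):
fitted_lxl = [8.158, 11.308, 35.909, 36.624, 310.842, 242.692, 434.091, 249.376]

def G2(obs, fit):
    """Likelihood-ratio statistic G^2 = 2 * sum(obs * log(obs / fit))."""
    return 2 * sum(o * math.log(o / f) for o, f in zip(obs, fit))

# Fitted values under independence: (row total * column total) / n
row_totals, col_totals, n = [92, 1237], [319, 254, 470, 286], 1329
fitted_ind = [r * c / n for r in row_totals for c in col_totals]

g2_lxl = G2(observed, fitted_lxl)  # about 4.08 (4.09 with unrounded fits)
g2_ind = G2(observed, fitted_ind)  # about 31.92
print(round(g2_lxl, 2), round(g2_ind - g2_lxl, 2))  # difference is deltaG^2, about 27.8
```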
\(\hat{\beta}=-0.574\), and exp(−0.574) = 0.56 means that the estimated odds ratio for a unit change in the row and column scores, e.g., 'chd' vs. 'no chd' and '0–199' vs. '200–219', equals 0.56.
Linear by Linear Association Model          01:27 Tuesday, April 4, 2006   6

The GENMOD Procedure

Analysis Of Parameter Estimates

                                Standard   Likelihood Ratio 95%    Chi-
Parameter        DF  Estimate   Error      Confidence Limits       Square
serum 220-259     1   -0.5937   0.2303     -1.0563   -0.1513        6.65
serum 260+        0    0.0000   0.0000      0.0000    0.0000         .
xCHD*yserum       1   -0.5740   0.1161     -0.8089   -0.3527       24.45
Scale             0    1.0000   0.0000      1.0000    1.0000
Look at the model fitted values ('Pred' from the SAS "Observation Statistics" table, or from the "fitted" function in R):
\(\hat{\theta}=(8.16\times 242.69)/(11.31\times 310.84)=0.56\)
The values used in this calculation come from the "Observation Statistics" table (highlighted in red in the original output):

The GENMOD Procedure

Observation Statistics

Obs   count   CHD     serum         Pred
  1      12   chd     0-199       8.1581
  2       8   chd     200-219    11.3082
  3      31   chd     220-259    35.9094
  4      41   chd     260+       36.6244
  5     307   nochd   0-199     310.8419
  6     246   nochd   200-219   242.6918
  7     439   nochd   220-259   434.0907
  8     245   nochd   260+      249.3756
The estimated odds ratio for 'chd' comparing the lowest ('0–199') and the highest ('260+') cholesterol levels under this model is
\(\hat{\theta}=\text{exp}(-0.574(2-1)(4-1))=0.18=\dfrac{8.16\times 249.37}{36.62\times 310.84}\)
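As a quick numeric check, both sides of this calculation can be reproduced from the estimate β̂ = −0.574 and the fitted counts reported in the output above (Python sketch):

```python
import math

beta_hat = -0.574  # estimated association parameter from the GENMOD output
# Corner odds ratio: rows scored 1, 2 and columns scored 1..4
theta_scores = math.exp(beta_hat * (2 - 1) * (4 - 1))

# Same quantity computed directly from the fitted counts ('Pred' values):
theta_fitted = (8.158 * 249.376) / (36.624 * 310.842)

print(round(theta_scores, 2), round(theta_fitted, 2))  # -> 0.18 0.18
```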
We can use this to conclude that the estimated odds of a heart condition are about 1/0.18 ≈ 5.5 times higher at such a high cholesterol level than at the lowest level.
Choice of Scores
There are many options for assigning the score values, and these can be of equal or unequal spacing.
The most common choice of scores is consecutive integers; that is, u_{1} = 1, u_{2} = 2, ..., u_{I} = I and v_{1} = 1, v_{2} = 2, ..., v_{J} = J (which is what we used in the above example).
The model with such scores is a special case of the linear by linear association model and is known as the Uniform Association Model. It is called the uniform association model because the odds ratios for any two adjacent rows and any two adjacent columns are the same and equal to
\(\theta=\text{exp}(\beta(u_i-u_{i-1})(v_j-v_{j-1}))=\text{exp}(\beta)\)
In other words, the Local Odds Ratio equals exp(β) and is the same for adjacent rows and columns.
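We can verify this property by constructing fitted counts from an arbitrary set of parameters with integer scores and checking that every local odds ratio equals exp(β). A Python sketch (all parameter values below are made up for illustration):

```python
import math

beta = 0.25                        # hypothetical association parameter
row_main = [0.0, 1.1]              # hypothetical lambda_i^C values
col_main = [0.0, 0.4, -0.3, 0.8]   # hypothetical lambda_j^S values

# mu_ij = exp(lambda + lambda_i^C + lambda_j^S + beta * u_i * v_j),
# with integer scores u_i = i + 1, v_j = j + 1
mu = [[math.exp(2.0 + row_main[i] + col_main[j] + beta * (i + 1) * (j + 1))
       for j in range(4)] for i in range(2)]

# Local odds ratios for adjacent rows and adjacent columns:
local_ors = [(mu[0][j] * mu[1][j + 1]) / (mu[0][j + 1] * mu[1][j])
             for j in range(3)]
print([round(t, 6) for t in local_ors])  # each equals exp(0.25)
```

Whatever main-effect values are chosen, the local odds ratios all collapse to exp(β), which is exactly the uniform-association property.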
Also, sets of scores with the same spacing between them will lead to the same goodness-of-fit statistics, fitted counts, odds ratios, and \(\hat{\beta}\).
For example, v_{1} = 1, v_{2} = 2, v_{3} = 3, v_{4} = 4 and v_{1} = 8, v_{2} = 9, v_{3} = 10, v_{4} = 11 will yield the same results.
However, please note: two sets of scores with the same relative spacing will lead to the same goodness-of-fit statistics, fitted counts, and odds ratios, BUT different estimates of β.
For example, v_{1} = 1, v_{2} = 2, v_{3} = 4, v_{4} = 8 and v_{1} = 2, v_{2} = 4, v_{3} = 8, v_{4} = 16.
The choice of scores may depend heavily on your data and the context of your problem. There are other ways of using and modeling ordinality, e.g., cumulative logit models (see Agresti (2002), Sec. 7.2 and 7.3; Agresti (2007), Sec. 6.2; and Agresti (1996), Sec. 8.2 and 8.3), which have already been discussed.
Generalization to Higher-Dimensional Tables
For higher-dimensional tables we already know how to test for associations and conditional independence with ordinal data, and with combinations of ordinal and nominal variables, via the CMH statistic (recall Lesson 4).
The modeling approach described today generalizes to higherdimensional tables as well. We can always create new variables representing the scores.
Association models are generalizations of the linear by linear association model for multi-way tables.
We can also combine ordinal and nominal variables, where we assign scores only to the ordinal variables and estimate scores for the nominal ones from the data. Some of these models are known as the row effects, column effects, and row-and-column effects models. These are more advanced topics; for more information see Agresti (2013), Sec. 9.5–9.6 and Sec. 7.2–7.3, and Agresti (2007), Sec. 6.2–6.3.