Had a dose of stats at work this week and I am still struggling to find a way out of the model that I am working on. To take a break from that mode of thinking, and thanks to this three-day weekend, I am writing a rather elaborate summary of what I consider to be one of the well-written books in statistics on linear models. The point of this summary is to motivate beginner to intermediate level stats-oriented students to have a look at this book. I am certain that there will be something to take away from a book that is so wonderfully written. Ok, now coming to the summary.
This book gives a nice recap of the basic linear models with just the right amount of math.
Chapter 1 starts off with one of the most important aspects of linear models: confounding. Ideally one would run a randomized controlled experiment, since randomization minimizes confounding. However, most data, at least in the financial world, is observational. Stocks, options, vol data: pretty much everything is observational. So one might get a handle on the association between variables, but one needs to be very careful in making a causal inference. This chapter has three case studies. The question behind the first is, "Does mammography speed up detection by enough to matter?". The second relates to the outbreak of cholera and the causal inferences drawn about it. The third is the famous study by Yule on poverty and policy choices. This chapter sets the tone for the topics to follow.
Chapter 2 is about simple one-variable regression. An often neglected point in most stats books is the difference between parameters and estimates. People confuse the two and sometimes use them interchangeably. In fact this is the first book on stats that I have come across where the distinction is emphasized very, very clearly. Estimates aren't parameters, and residuals aren't random errors. When I hypothesize a linear model between, let's say, Y and X, I am assuming Y = aX + b + Epsilon as the model. Here a and b are parameters and Epsilon is an unobservable random error. Whatever statistical technique is used, the values it computes for a and b are estimates, and what is left over after the fit are residuals, not Epsilon. Karl Pearson thought that all data comes from a distribution and that by collecting enough data you can have knowledge of that distribution. It was Fisher who believed that all the data you see are realizations of an abstract distribution, and that all one can do is estimate the parameters of that abstract distribution from the given data. Estimates are functions of the data; parameters are not. When you assume a data generating process like a simple regression, you are assuming an abstract distribution indexed by a and b. So, as per Fisher, one can only get an estimate of a and b. This point is made repeatedly so that the reader never again confuses estimates with parameters.
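To make the parameter vs estimate distinction concrete, here is a minimal simulation sketch. It is my own illustration, not something from the book: the values of A_TRUE, B_TRUE and SIGMA are made up, and the only reason we can compare estimates with parameters at all is that we simulated the data ourselves.

```python
import random

random.seed(0)

# Parameters of the assumed model Y = a*X + b + eps (made-up values for illustration).
A_TRUE, B_TRUE, SIGMA = 2.0, 1.0, 0.5

n = 200
x = [random.uniform(0, 10) for _ in range(n)]
eps = [random.gauss(0, SIGMA) for _ in range(n)]  # random errors: never observed in practice
y = [A_TRUE * xi + B_TRUE + ei for xi, ei in zip(x, eps)]

# Ordinary least squares estimates for a single predictor.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
a_hat = s_xy / s_xx
b_hat = y_bar - a_hat * x_bar

# Residuals are built from the estimates, so they are not the random errors.
residuals = [yi - (a_hat * xi + b_hat) for xi, yi in zip(x, y)]
max_gap = max(abs(r - e) for r, e in zip(residuals, eps))

print(f"a: parameter = {A_TRUE}, estimate = {a_hat:.4f}")
print(f"b: parameter = {B_TRUE}, estimate = {b_hat:.4f}")
print(f"largest |residual - error| = {max_gap:.4f}")
```

Rerun it with a different seed and the estimates move while the parameters stay put, which is the whole point.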
Chapter 3 can be skipped if you are aware of basic matrix algebra. When I look back at the lectures that I have attended at various places, somehow the profs never emphasized the importance of matrix algebra. A sound knowledge of matrix algebra is a prerequisite for understanding stats. So, maybe I was at the wrong places :) or maybe I was not concentrating well... Later, when I actually started crunching numbers, I realized that without matrix algebra you just can't do anything in stats. Pretty much everything in stats requires a basic to intermediate working knowledge of matrices. Some of the concepts that I think are very useful from a practical point of view are the four fundamental subspaces, projection onto subspaces, positive definite matrices, eigenvalue decomposition, row and column rank of a matrix, row and column spaces, idempotent matrices, projection matrices, the various ways to decompose matrices (Cholesky, QR, SVD), matrix differentiation, orthogonal bases and linear transformations. Even a simple regression of Y on X1 and X2 needs a matrix decomposition algo. Why? Because the estimates are usually written as the inverse of some combination of matrices, and in practice that inverse is computed through some kind of decomposition for numerical stability. Anyway, a reader who is well versed with these concepts can safely ignore this chapter.
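As a rough sketch of what "regression through a decomposition" looks like, the snippet below solves the normal equations (X^T X) beta = X^T y with a hand-written Cholesky factorization instead of an explicit inverse. The toy data and the helper names cholesky and solve_spd are my own, not the book's. Cholesky on X^T X is simply the easiest one to write by hand; QR or SVD applied to X directly is usually preferred when X is ill-conditioned.

```python
import math

def cholesky(a):
    """Return lower-triangular L with L L^T = a, for symmetric positive definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_spd(a, b):
    """Solve a beta = b via Cholesky: forward substitution, then back substitution."""
    n = len(a)
    low = cholesky(a)
    z = [0.0] * n
    for i in range(n):  # solve L z = b
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    beta = [0.0] * n
    for i in reversed(range(n)):  # solve L^T beta = z
        beta[i] = (z[i] - sum(low[k][i] * beta[k] for k in range(i + 1, n))) / low[i][i]
    return beta

# Toy data built so that y = 1 + 2*x1 - 3*x2 exactly (illustrative numbers).
x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0]
X = [[1.0, u, v] for u, v in zip(x1, x2)]
y = [1.0 + 2.0 * u - 3.0 * v for u, v in zip(x1, x2)]

# Form X^T X and X^T y, then solve without ever computing an inverse.
p = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

beta_hat = solve_spd(xtx, xty)
print([round(b, 6) for b in beta_hat])
```

Since the toy y has no noise, the solve should recover the coefficients 1, 2 and -3 up to floating point rounding.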
Chapter 4 introduces multiple regression. The content in this chapter is pretty straightforward: the assumptions of the linear model and, based on those assumptions, the computation of the parameter estimates. However, the beauty of this chapter is the set of well laid out questions at the end of it, which make you think about a lot of aspects of a model as simple as multiple regression. Until a few years ago, I thought that this was all there is to modeling. How naive of me! Once I got introduced to PDEs and stochastic models, I came to realize that I knew absolutely nothing about modeling. Whatever modeling I had done was fairly basic. I still remember the days at an analytics firm in Bangalore where I was building logit models for mortgage prepayments. We used to build a large number of logit models and were pretty content with it. In hindsight, what I was doing was like a drop in the ocean of statistical modeling. In that sense, this book too gives you only the basics of statistical modeling. But the basics are extremely well written. I wish this book had been published when I was starting off on stats! Ok, coming back to the chapter, the questions that you will think about after reading it are

Why is the sum of residuals not equal to 0 if your initial linear model has no intercept term?

What exactly is the problem with collinear independent variables in a multiple regression? Does it affect the estimate or the standard error of the estimate?

Why is the hat matrix important?

What's the problem with a model that omits a variable?

If a model is represented as Y ~ X1 + X2, what is the effect of the error terms being correlated with one of the independent variables? What is the effect of the residuals being correlated?

What happens when you exclude a variable which is orthogonal to all the other variables in the model?

If a model is represented as Y ~ X1 + X2 + X3, how do we test the hypothesis that beta_2 + beta_3 = some constant, beta_2 and beta_3 being the coefficients of X2 and X3?
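The first question in the list above is easy to check numerically. Least squares makes the residuals orthogonal to every column of the design matrix, so they sum to zero only when a column of ones (the intercept) is among those columns. A small sketch with made-up data:

```python
# Toy data (illustrative): roughly y = 5 + 2x, so the intercept clearly matters.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [7.1, 8.9, 11.2, 12.8, 15.1]
n = len(x)

# Fit without an intercept: y ~ a*x. Least squares gives a = sum(x*y) / sum(x^2).
a_no_int = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
res_no_int = [yi - a_no_int * xi for xi, yi in zip(x, y)]

# Fit with an intercept: y ~ b + a*x.
x_bar, y_bar = sum(x) / n, sum(y) / n
a_int = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
b_int = y_bar - a_int * x_bar
res_int = [yi - (b_int + a_int * xi) for xi, yi in zip(x, y)]

# Without the intercept only sum(x * residual) is forced to zero.
print("no intercept: sum(res) =", round(sum(res_no_int), 4))
print("no intercept: sum(x * res) =", round(sum(xi * r for xi, r in zip(x, res_no_int)), 10))
print("with intercept: sum(res) =", round(sum(res_int), 10))
```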
Actually, by merely reflecting on the model, there are tons and tons of questions which you can think about and probably answer. For me the biggest takeaway from this chapter is the importance of crossprod(X,X), meaning X^T times X. I had never realized before reading this chapter that X^T X governs a lot of things about the estimates.
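For reference, these are the places where X^T X shows up, written under the standard textbook setup, which I am assuming here rather than quoting from the book: Y = X\beta + \epsilon, with X of full column rank and errors that are IID with mean 0 and variance \sigma^2.

```latex
\hat{\beta} = (X^{T}X)^{-1}X^{T}Y,
\qquad
\operatorname{cov}(\hat{\beta} \mid X) = \sigma^{2}\,(X^{T}X)^{-1},
\qquad
H = X(X^{T}X)^{-1}X^{T},
\quad \hat{Y} = HY .
```

So the estimates, their standard errors and the hat matrix H all run through the same matrix. When columns of X are nearly collinear, X^T X is close to singular and the diagonal of its inverse blows up, which is why collinearity hurts the standard errors.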
Chapter 5 introduces GLM, a topic which I am extremely interested in because I still don't know how to deal with it in practice. In theory I do know, but I have never implemented a GLM in finance till date. Well, as far as fin applications are concerned, you will never have residuals which are IID. This was a learning that was drilled into my head by my guide during my masters. How do you model the estimates if you know that the errors are correlated, the errors form a stationary process, the errors are Poisson, or the errors come from a prior distribution? You will get a basic flavour of it from this chapter. For doing anything in the real world, you would probably have to refer to some other book on GLM; this chapter alone won't suffice.
Chapter 6 talks about path models. This chapter is suited for bedtime reading. For first timers, a path model is a graphical way to represent a set of structural equations. The chapter starts off with standardized regression, then gives a superb discussion of the way in which a physics model differs from a statistical model. I loved this part of the chapter, as the comparison between Hooke's law and a possible regression equation was too good. If you cannot verbalize the difference between a statistical model and a physics-based model to a sophomore, then in all likelihood you might want to read this part of the book. Some of the terms which will be very clear to you after reading this chapter are

Causal Mechanism

Selection Vs Intervention

Response Schedule

Dummy Variables – An interesting thing I feel like writing here, relating to dummy variables, is about my experience with a person who had done a masters in statistics from a reputed institution. I remember asking that person a question on dummy variables many years ago: if a variable takes p categorical values, why should there be only p-1 dummy variables in the regression equation? She mumbled something like “Well you can obviously get the other effect from the p-1 effects..” Intuitively I get it, there is a redundancy. I did get that point, but my question was more from a math point of view. What happens in a regression equation if I have a dummy variable for each of the levels of the categorical variable? However, all I got was some English and I wanted an answer in math. Anyway, later I figured out that by placing all p categories as variables alongside the intercept, you get a design matrix X which does not have full column rank. If the design matrix does not have full column rank, X^T X is not invertible. If it is not invertible, forget regression! Sometimes people think intuition is the killer. I don't deny the importance of intuition at all. But sometimes arguments made in math are easier to understand. Everything falls into place and you just get it. There is no touchy-feely thing here. It's plain simple math... I am digressing from the intent of the post. Coming back to the things you learn from this chapter on path models,

Association

Linkage
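The rank argument from the dummy variables digression above can be checked in a few lines. This is a sketch with hypothetical category labels; matrix_rank and design are small helpers written for this post, and the rank is computed by plain Gaussian elimination in exact fractions so that no rounding is involved.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank

levels = ["A", "B", "C"]                    # p = 3 categories (hypothetical labels)
data = ["A", "B", "C", "A", "B", "C", "A"]  # observed category for each row

def design(dummy_levels):
    """Intercept column plus one 0/1 column per level in dummy_levels."""
    return [[1] + [1 if obs == lev else 0 for lev in dummy_levels] for obs in data]

X_all = design(levels)       # intercept + p dummies
X_drop = design(levels[1:])  # intercept + (p - 1) dummies

print(matrix_rank(X_all), "of", len(X_all[0]), "columns")
print(matrix_rank(X_drop), "of", len(X_drop[0]), "columns")
```

With all p dummies the intercept column is the sum of the dummy columns, so the rank is 3 with 4 columns. Dropping one level gives rank 3 with 3 columns, and X^T X becomes invertible.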
I love diagrams and this one sums up a basic structural model in a nice way. Boxes at the top represent distributions and arrows represent the realizations from those distributions.
Chapter 7 talks about Maximum Likelihood Estimation (MLE). The basic idea behind MLE is that you specify the distribution of the variable under study, and then find the parameter values under which the available data has the maximum likelihood. With any estimate, there needs to be inference. That's where Fisher information comes in, which helps one get an idea of the variance of the estimates. Once you step out of the simple regression world and enter the world of logits, probits, bivariate logits etc., MLE is inevitable. Also, there are very many methods to find the MLE using optimization. Starting from basic grid search to sophisticated simulated annealing algorithms, one can use a lot of methods to find the estimates of the parameters. Somehow this chapter fails to mention the classic trinity of tests: the Likelihood Ratio (LR) test, the Wald test and the Lagrange Multiplier (LM) test. One inevitably uses one of these in MLE. This chapter does give a basic intro to MLE, just enough to go on and read far more math-oriented books. Personally I feel MLE is fascinating. For all the years that I have been thinking that stats is either Bayesian or frequentist, I have now realized that there is a third direction called MLE which makes statistics beautiful.
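Here is a minimal sketch of the grid search idea for the simplest possible case, a Bernoulli probability. The data is made up and the example is mine, not the book's. The closed-form answer here is just the sample proportion, so the grid search is purely for illustration. The standard error comes from the Fisher information of n Bernoulli draws, n / (p(1 - p)), evaluated at the estimate.

```python
import math

# Hypothetical data: 1 = event, 0 = no event (say, a mortgage prepaid or not).
data = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
n, k = len(data), sum(data)

def log_likelihood(p):
    """Bernoulli log-likelihood of the data at success probability p."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Basic grid search over the open interval (0, 1) in steps of 0.001.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

# Fisher information at p_hat gives the usual large-sample standard error.
fisher_info = n / (p_hat * (1 - p_hat))
std_err = 1 / math.sqrt(fisher_info)

print(f"grid-search MLE = {p_hat:.3f}, sample proportion = {k / n:.3f}")
print(f"approximate standard error = {std_err:.4f}")
```

For models like logits and probits there is no closed form, which is where a real optimizer replaces the grid.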
Chapter 8 is about bootstrapping, the love of my life for the past 1.5 years. As the name suggests, the data pulls itself up by its bootstraps to give you the estimate you are looking for.
The basic idea is resampling from the empirical distribution function. Typically in a Monte Carlo, you simulate from the distribution that you have assumed. In bootstrapping you trust the data: you resample from the data, with replacement, and compute the statistic that you are interested in. This chapter takes you through a few examples of bootstrapping techniques. If you do this N times, the average of the resampled statistics settles near the statistic of the original sample, and their spread tells you how variable that statistic is. Will it always work? No. If the sample is not representative of the original population, no amount of bootstrapping will help you. Let's say you draw 100 standard normal numbers and each of them happens to be greater than 3. No amount of bootstrapping is going to help you compute the correct population mean. The crucial thing to remember is that you are sampling from the empirical distribution of the sample. Bootstrapping can be extended to figure out parameter standard errors in a regression equation. It can also be extended to autoregressive equations, GLM models etc.
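A small sketch of both points, using only simulated data. The sample sizes, the number of resamples and the helper name bootstrap_means are my own choices. The first part bootstraps the standard error of a mean and compares it with the textbook s / sqrt(n). The second part resamples from an unrepresentative sample (only the values above 3) to show that bootstrapping cannot repair it.

```python
import random
import statistics

random.seed(1)

def bootstrap_means(data, n_boot):
    """Resample with replacement from data and return the mean of each resample."""
    size = len(data)
    return [statistics.fmean(random.choices(data, k=size)) for _ in range(n_boot)]

# Part 1: a representative sample of 100 standard normal draws.
sample = [random.gauss(0.0, 1.0) for _ in range(100)]
boot = bootstrap_means(sample, 2000)
boot_se = statistics.stdev(boot)                            # bootstrap standard error
formula_se = statistics.stdev(sample) / len(sample) ** 0.5  # textbook s / sqrt(n)
print(f"bootstrap SE = {boot_se:.4f}, formula SE = {formula_se:.4f}")

# Part 2: an unrepresentative sample, keeping only draws greater than 3.
population = [random.gauss(0.0, 1.0) for _ in range(100_000)]
tail = [v for v in population if v > 3]
tail_boot = bootstrap_means(tail, 500)
print(f"population mean = {statistics.fmean(population):.4f}")
print(f"smallest bootstrap mean from the tail sample = {min(tail_boot):.4f}")
```

Every resample of the tail is made of numbers above 3, so every bootstrap mean stays above 3 no matter how many times you resample, while the population mean sits near 0.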
Chapter 9 is about simultaneous equations. As the name suggests, this relates to modeling a set of equations jointly. Path diagrams are heavily used to convey the connections. This chapter can be skipped during the first read of the book; one can revisit it when one is comfortable with single-equation modeling.
Chapter 10 concludes with some issues in statistical modeling. A few pages from this chapter alone would make you realize that most of the models described in chapters 1-9 are good for publishing research papers, but in reality these models break down. Residuals are not IID in the real world, they are DDD – Dependent and differently distributed. Models are rarely LCC – Linear with constant coefficients. They are NLNC – Non linear with Non Constant coefficients. The author sums up saying
“The goal of empirical research is – or should be – to increase our understanding of the phenomena, rather than displaying our mastery of a technique”
The principles and concepts in Chap 1 - Chap 10 take up about 200-odd pages, but the next 200 pages of the book comprise some amazing case studies which will make you think, make you question, and make your mind stretch on a ton of aspects relating to linear models.
Takeaway:
If you use linear models in your work, you cannot miss this book.
It's priceless!