This book by J.R.Norris is a classic reference for Markov Chains. It is unlikely that any paper on Markov chains in the recent times does not mention this book somewhere in the paper. I will try to attempt to summarize the main chapters of the book in this post.
The book starts off by saying that Markov process are random processes that don’t retain memory. When the statespace is restricted to finite or countably many, these processes are called Markov Chains. The intro gives a great clarity for any reader in listing the kind of chains that would be dealt in the book. The following gives a basic flavor of Markov chains that are dealt in the introductory chapters:
Fig 1 is that of a finite Markov chain with three states 1,2,3. Since the movement of the random variable is restricted to discrete states, the chain is termed as discrete chain. In Fig 2, the random variable moves from 0 to 1 after waiting for a time T(exponential density). In Fig 3, the random variable moves from 0 to 1 and then 1 to 2 and so on, but the time that the variable has to wait to jump from one state to another is a poisson process with rate lambda. The latter two chains correspond to continuous Markov chain.
A very useful diagram which summarizes a lot of terminology is the following Markov chain.
The features of this Markov chain are the following :

{0},{1,2,3},{4,5,6} are communicating states

{1,2,3} and {4,5,6} are closed states. These closed states are recurrent states

{0} is transient state

{4,5,6} is periodic chain

{1,2,3} is aperiodic chain
All these terms become intuitively clear in the context of the above Markov chain. Clear definitions of these terms are presented in subsequent chapters. A few questions are posed for the above Markov chain so as to motivate the reader to think about ways to solve various kinds of problems that arise in this context of a Discrete Markov chains.
Chapter 1: DiscreteTime Markov Chains
This chapter forms one of the core chapters of the book and the important theorems on the following are stated and proved:
Firstly necessary terms are defined such as statespace, initial distribution, transition matrix, stochastic matrix. Using these objects, Markov chain is defined. The symbolic representation of the Markov chain essentially captures the memoryless property of the Markov chain. The chapter then introduces the method to calculate the probability of remaining in a particular state after n transitions. One of the obvious but dumb method that one can think of , is to keep multiplying the transition matrix n times and then use the initial position and this powered matrix to calculate the probability of remaining in the same state. There is a better method and this sections explains it by Linear algebra where powers of the matrix are easily calculated using diagonalization method. Basically one finds the Eigen values of the transition matrix, writes the power of matrix in terms of Eigen values and then calculates the probability of remaining in the same state after n transitions. This method is computationally efficient in finding the probability of remaining in the same state after n transitions. Once the Eigen vectors are found, any element in the transition matrix can be found by the following equation:
The chapter introduces the “class structure”, way to identify different states and categorize them. Two states are said to be “communicating” when there is positive probability to reach one from another. If a set of states are in such a way that each communicates with the other, then such a set of states are called “closed state”. A closed state is one from which there is no escape. “communicates” is actually an equivalence relation and set theory tells us that we can categorize the sets based on the equivalence relation. If there is only one single class in the chain then such a chain is called “irreducible”. If there is only a single state in an equivalence relation, such a state is called “absorbing state”. This means that if a chain enters this state, it remains there forever.
Hitting times and absorption probabilities are introduced subsequently. The first time I came across hitting times was in Marek Capinski’s book and subsequently in Shreve’s book. I could not appreciate those concepts at that time as much as I should have. Hitting times were defined and various distributions were computed, some in closed form, some using numerical analysis. I did not have the faintest clue on its application in real world, except in American options case. Years later when I was developing a pairs model, things became clear. Modeling the hitting times for the spread and estimating the parameters of the model made me realize the importance of the practical application of hitting times. This book has given me a lens to view the hitting times , i.e via Markov chain. The key takeaway from this section is that the vector of hitting time probabilities (mean hitting times) are found by solving a set of linear equations and this solution is the minimal nonnegative solution. The fact that the solution is a minimal nonnegative equation plays an important role in determining the solution to the system of equations involving hitting time probabilities(mean hitting times). This is showed using gambler’s ruin example and a simple birthdeath chain.
Strong Markov Property is introduced subsequently. To summarize in plain English, the intention behind using “strong” is as follows: Markov property is memoryless conditional on the chain taking a particular value, at a time interval m. To expand this property to different situations, one of the first generalizations that one can think of is ,”Is the memoryless property applicable to random times, instead of a fixed time,as given in the definition of the Markov chain?”. The answer to this question is yes, but in the context of specific time variables , called, stopping times. “Stopping time “ variables are those variables whose value is only dependent on the history of the path and not based on the future of the chain. In this context, first passage time + first hitting time are defined and are used to prove the “Strong Markov Property” theorem. The theorem states that Markov property is preserved at stopping times. Ok, so what if Markov property holds good ? Well, this property gives us the flexibility to compute the entire probability generating function for the hitting times. If the chain is in state(i), the probability of hitting 0 , let’s say, can be computed by a recurrence relation and then using the condition that the set of linear equations are minimal nonnegative solutions. However if one has to compute the entire set of probabilities at once, then the generating function approach works like a charm. Strong Markov property is used to come up with closed forms for the probability generating functions for gambler’s ruin and other examples. The reader can at once see the importance of the knowing generating function, as it is basically the signature of the random variable. You can calculate hitting probabilities starting from any state, you can compute the mean hitting time by simply finding the derivative of the generating function. So, the main takeaway from this section is that , “Strong Markov Property” helps in computing more general mathematical objects like generating functions which provide much more information about the hitting time probabilities.
A State in the Markov chain state space can be transient or recurrent based on the whether the probability of hitting the specific state as number of steps decreases to 0 or tends to 1. The section on transient and recurrence behavior is very important as it provides definitions for these terms, links these terms to class structure. Some of the main theorems mentioned in this section are

Every recurrent state is closed

Every finite closed set is recurrent

If C is a communicating class, either all states are recurrent or all states are transient
First passage time decomposition problem given as an exercise problem is very useful to get another perspective towards characterizing a recurrent state based on probability of return in the nth step. The chapter then moves on to using the following conditions for state classification:
The above conditions are applied to random walk in 1D, 2D and 3D space so that a reader can get an idea to use the same for other situations.
The chapter then shifts to explaining invariant distributions. Invariant distributions are those distributions specific to a transition matrix that have the property of reaching a stationary value. Meaning, if you start with some random distribution of initial states, after taking infinite steps, the Markov chain starts moving across states in such a way that the probability distribution of values it takes, mirrors the invariant distribution. Or in other words, the probability of moving from statei to statej becomes constant as the number of steps in the Markov chain increases. The calculation of invariant distribution becomes easier if one chooses to use linear algebra methods. If one computes the eigen vectors for the transpose of the transition matrix and pick the vector that corresponds to eigen value 1, then the vector is nothing but the invariant distribution of the Markov chain. The chapter then introduces the expected time spent between recurrent visits to a state. This expected time is very useful in classifying the state space based on the value. If the value is finite , then the state can be termed as “positive recurrent state” and if the value is infinite , then the state can be categorized as “null recurrent state”. Thus states can be classified in transient states, positive recurrent states and null recurrent states. Irreducible chain is a chain where each state is positive recurrent state. Computation of invariant measure also leads to easy calculation of expected return times.
Another related concept discussed is the periodicity. In an irreducible chain, either every state is periodic or every state is aperiodic. Basically a state is aperiodic if one can revisit the same state for all sufficiently large number of steps. Well it is easy to definite periodic state. A state is periodic if the chain revisits the same state in a set pattern. The relevance of periodicity is that , the invariant distribution exists for a irreducible and aperiodic chain. An extremely important aspect relevant to invariant distributions is discussed, i.e, time reversibility. One of the sufficient conditions for the existence of invariant measure is “detailed balance” condition. This condition is defined and applied to a few examples. The chapter ends with proving ergodic theorem, the pivotal achievement of A.A. Markov
In real world , no one hands us the transition matrix. One ought to estimate it in some manner. Given a Markov chain realization and a set of states, it is important to estimate the transition matrix. The chapter shows an approach that is typically used in most of the situations, i.e MLE method. With MLE method, one writes the likelihood function and use the Lagrangian method to compute the transition probabilities. The problem with this approach is that the transition probabilities are not consistent estimates for transient states. Hence this approach is modified so that the transition matrix estimates are consistent.
Chapter 2 : Continuous Time Markov Chains  I
Continuous Markov chain is more involved and hence the author has decided to split the concepts in to two chapters, first being written to equip the readers with the machinery to understand continuous time processes and another chapter using this machinery to discuss continuous Markov chains. This chapter discusses the machinery needed to understand Continuous time processes.
The first thing the chapter touches upon is the infinitesimal generator or the Q matrix. This generator is defined and various properties are explored. Once this is done, the crucial link between transition matrix and Q matrix is provided using the following theorem:
This connection between P and Q matrix gives the flexibility of creating a Markov process at fine grid points. A few examples are provided to illustrate the computation of transition matrix given the Q matrix. The chapter then moves on to giving the first introduction to continuoustime random processes. A very important result from measure theory says that Right continuous processes can be determined from its finitedimensional distributions. Such right continuous processes are chosen so that the probability can be summarized with the help of jump times and holding times. The most important thing to notice is the summation of a stochastic process using jump times and holding times. If you carefully look at the basic stochastic processes, what one assumes about holding times and jump times is the basis for the random process realization. For example in a poisson process ,the holding times are exponentially iids, i.e the process holding time before a jump is independent of the holding time after the jump. The chapter then introduces exponential distribution and shows the connection between exponential distribution and memoryless property. An important theorem stated in the context of this section is : Sum of independent exponential random variables is either certain to be finite or certain to be infinite based on the harmonic mean of the parameters of the exponential random variables. Another term introduced in this section is the “first explosion time”, that is defined as the supremum of all jump times. A process satisfying the requirement that if t > first explosion time, then it takes on the value infinity, is called “minimal” process. The minimal does not refer to the state of the process but to the interval of time over which the process is active.
The section moves on to defining poisson processes based on holding times and jump times. A simple way to construct a Poisson process is to simulate a sequence of exponential random variables with poisson rate lambda, and define the jump chain such that it jumps by 1 unit after every holding time. Three equivalent conditions are stated to identify any poisson process and the proof involves forward equations. Subsequently a few properties of Poisson chains are explored. A counting process is characterized by the distribution of increments. The increments might be independent OR they can be stationary increments where it is only dependent on the length of the interval. So the first question one must ask when applying a counting process to a problem is , “Does the context have independent increments or stationary increments ?”
The section then introduces Birth processes, which can be thought of a generalization of poisson process where the holding times are distributed with an exponential distribution but with different rates. Like in the case of Poisson process, the section proceeds in building the Q matrix, formulating the forward equation for computing transition probabilities from the Q matrix and finally defining Birth process in three equivalent ways.
The most interesting part of this chapter is the one on Jumping times and holding times where continuous Markov chains are formulated using the joint distribution of holding times and jump chain. It goes on define a Markov chain as follows :
It then goes on give three constructions to Markov chain. The one I liked was the one with a good analogy , i.e
Imagine a statespace I as a labyrinth of chambers and passages, each passage shut off by a single door which opens briefly from time to time to allow you go through in one direction only. Suppose the door giving access to chamber j from chamber I opens at the jump times of a Poisson process of rate q(i_j) and you take every change to move that you can, then you will perform a Markov chain with Q matrixQ
The above analogy then is used in taking a set of Poisson processes to construct a Markov chain. I found this kind of analogy very appealing as such things are always easy to understand and recall in various situations. Actually this whole construction of Markov processes using holding times and jump times itself is very interesting and makes tremendous sense in getting a general idea of the way random processes work. Sections on explosion times , forward and backward equations , Non minimal chains are to be read selectively as the difficult levels shoots up exponentially.
In author’s words, “There is an irreducible level of difficulty here.”
Chapter 3 : Continuous Time Markov Chains  II
With the necessary mathematical tools and framework covered in the previous chapter, the author introduces the continuous time Markov chain concepts in the same order as they were done for the discrete case. Basic terms are introduced like generator matrix, the initial distribution, transition matrix and transition probabilities. Class structure for the states is discussed where in the states are categorized based on their communication status. In this context, the system of equations needed to solve the hitting times and expected hitting times are introduced. Depending on the Q matrix, the system of equations help one determine the minimal solution for hitting time probabilities and expected hitting times of various states. Subsequently recurrent and transient states are defined in the context of continuous time processes and conditions for verifying the recurrence or transience of a state are given. They are basically in the integral form whereas they are in the summation form in Chapter 1 for the discrete Markov chains. Invariant distributions and ergodic law are discussed.
The basic funda that this chapter and the previous chapter for the discrete case teaches is that, one must analyze a Markov chain in the following manner, by posing questions to the problem at hand

What is the transition matrix ?

What are the states of the process ?

Is there an underlying Q matrix for the process ?

What is the transition matrix for the jump process, if it is a continuous Markov chain ?

What are the types of states ?

How can one identify or verify whether a state is recurrent or transient ?

What are the first passage time probabilities and expected hitting times for various states ?

Does the Markov chain converge to an equilibrium state ?

Is the Markov chain irreducible?

Is the Markov chain periodic or aperiodic ?

Does time reversal hold good ? This is crucial for MCMC applications where detailed balance is one of the conditions for chain convergence(though it is only a sufficient condition and not a necessary condition)
The first 3 chapters comprise ~ 60% of the content of the book. I found chapter 4 to be very dense and overwhelming , as the author tries to extend the Markov chains and cram concepts relating to Martingales, Potential theory, Electrical networks and Brownian Motion, all in just 40 pages. The last chapter gives a flavor of applications of Markov chains by giving sample models in biology,queuing theory, resource management, Markov decision processes and MCMC.
Takeaway :
The book gives an intuitive yet rigorous introduction to discrete and continuous Markov chains. The decomposition of a Markov chain in terms of holding times and jump chain to explain various concepts of a continuous Markov process is the highlight of this book.