MUTUAL INFORMATION-BASED REGISTRATION OF DIGITALLY RECONSTRUCTED RADIOGRAPHS AND ELECTRONIC PORTAL IMAGES by Katherine Anne Bachman Master of Basic Science, Mathematics, University of Colorado at Denver, 2002 Bachelor of Science, Chemistry, University of Colorado at Denver, 2000 A thesis submitted to the University of Colorado at Denver in partial fulfillment of the requirements for the degree of Master of Science Applied Mathematics 2005
Bachman, Katherine Anne (M.S., Applied Mathematics)
Mutual Information-Based Registration of Digitally Reconstructed Radiographs and Electronic Portal Images
Thesis directed by Professor Weldon A. Lodwick

ABSTRACT

This study regards discrete mutual information and demonstrates the use of the theory with an example of radiological image registration with in-plane, two-dimensional images, using various search strategies. Image registration is the process of finding an optimal geometric transformation between corresponding image data. Although it has applications in many fields, the one that is addressed in this thesis is medical image registration. Medical image registration has a wide range of potential applications, but the emphasis is on radiological imaging.

Mutual information, which is given by the difference between the sum of the entropies of individual overlapping images and the joint entropy of the combined images, is a measure of the reduction in uncertainty about one image due to knowledge of the other. It makes no assumption of the functional form or relationship between image intensities in the two images. In addition to having application in communication and computer vision, mutual information has proven robust and has resulted in fully automated registration algorithms that are currently in use.

The thesis is organized as follows. Chapter 1 gives a broad overview of medical image registration as the context in which to present the subject of mutual information-based medical image registration. Chapter 2 regards the theory and mathematics of discrete mutual information and its origin in information theory. Chapter 3 looks at the implementation of the theory applied to image registration in general. Chapter 4 looks at an application of mutual information-based image registration in radiological imaging - registration of Digitally Reconstructed Radiographs (DRRs) and Electronic Portal Images (EPIs). The Appendix includes information that is relevant, but not critical, to the understanding of the material presented in this thesis. Because probability theory is a major part of information
theory and, consequently, mutual information theory, a brief overview of discrete probability theory is included in the Appendix as a quick reference. Examples are provided in the Appendix as well as throughout the body of the thesis.
DEDICATED TO my parents, Dr. and Mrs. Miles MacGran, and to my husband, Randy.
Table of Contents

List of Figures
List of Tables
1 Medical Image Registration Overview
1.1 Taxonomy of Medical Image Registration Methodology
1.1.1 Prospective and Retrospective Methods
1.2 Types of Transformation
1.3 Image Registration as an Optimization Problem
2 Information Theory
2.1 Information
2.2 Entropy (uncertainty or complexity)
2.2.1 Joint Entropy and Conditional Entropy
2.2.2 Relative Entropy
2.2.3 Concavity of Entropy
2.2.4 Entropy of a Signaling System
2.2.5 Entropy and Kolmogorov Complexity
2.3 Mutual Information
3 Information Theory Applied to Image Registration
4 Mutual Information-Based Registration of Digitally Reconstructed Radiographs and Electronic Portal Images
4.1 Experiments and Results
4.1.1 Greedy Algorithm
4.1.2 Genetic Algorithm
4.1.3 Nelder-Mead (MATLAB's fminsearch)
4.1.4 Simulated Annealing
4.2 Other Experiments
5 Concluding Remarks
A Brief Discrete Probability Theory
B Convexity
C Log Conversion
D Sample Output
Bibliography
Index

List of Figures

1.1 Registration example with multiple solutions. The image B is to be transformed to image A. The rightmost image shows B registered to A by two translations and a rotation of 45 degrees.
2.1 Discrete noiseless binary channel.
2.2 Discrete binary symmetric channel.
2.3 Decomposition of a choice from three possibilities.
2.4 The entropy function for logarithmic bases 2, e, and 10.
2.5 The entropy function (logarithmic base 2) of two probabilities.
2.6 Bound on the loge(x) function.
2.7 Noisy communication channel, Example 2.2.
2.8 Venn diagram of the relationship between entropy and mutual information. The mutual information I(A, B) corresponds to the intersection of the information in A with the information in B.
2.9 Venn diagram of the relationship between joint entropy and mutual information for independent A and B.
3.1 Graphical representations of two randomly generated 5x5 matrices taking on values from the set [0 1 2].
3.2 Graphical representations of the joint histograms (joint probability distributions) involving the two matrices represented in Figure 3.1.
3.3 Venn diagram of the relationship between joint entropy and mutual information of identical images.
3.4 The image data sets A and B.
3.5 The joint probability distribution of Example 3.1 as a surface.
3.6 The image data sets A, rotated -90 degrees, and B.
3.7 The joint probability distribution as a surface of Example 3.1 with image A rotated -90 degrees as in Figure 3.6.
4.1 A typical radiation treatment plan. [11]
4.2 Representation of the beam portal for a particular radiation treatment and the cross-hairs of the linear accelerator. [11]
4.3 Typical aligned pelvis EPI. [11]
4.4 Typical aligned pelvis DRR. [11]
4.5 A misaligned EPI. [11]
4.6 The DRR automatically rotated to the EPI position (Figure 4.5). [11]
4.7 Greedy algorithm, run 1, 256x256 images.
4.8 Greedy algorithm, run 2, 256x256 images.
4.9 64x64 DRR and EPI.
4.10 128x128 DRR and EPI.
4.11 256x256 DRR and EPI.
4.12 Simulated Annealing. Plot of MI versus Angle of Rotation.
4.13 Simulated Annealing. Detail of Figure 4.12, plot of MI versus Angle of Rotation.
A.1 Drawers containing gold and silver coins, Example A.5.

List of Tables

4.1 Greedy algorithm. Sample run (Figure 4.7) converges to an optimal solution in 8.712 minutes.
4.2 Greedy algorithm. Sample run (Figure 4.8) converges to a suboptimal solution in 5.0823 minutes.
4.3 Greedy algorithm. Convergence data for 22 runs, 256x256 images.
4.4 Greedy algorithm. 64x64 run converges in 0.6504 minutes.
4.5 Greedy algorithm. 128x128 run converges in 1.2058 minutes.
4.6 Greedy algorithm. 256x256 run converges in 4.3367 minutes.
4.7 Greedy algorithm. Summary of Tables 4.4, 4.5, and 4.6.
4.8 Genetic algorithm. 256x256 images.
4.9 Genetic algorithm. Parameter list for run pairs.
4.10 MATLAB's fminsearch (Nelder-Mead) algorithm. 256x256 images.
4.11 Simulated Annealing. 256x256 images.
Chapter 1

Medical Image Registration Overview

Image registration is the process of finding an optimal geometric transformation between corresponding image data. In other words, given a reference or model, image A, and a test or floating image, image B, find a suitable transformation T such that the transformed test image becomes similar to the reference. The image registration problem typically occurs when two images represent essentially the same object, but there is no direct spatial correspondence between them. The images might be acquired with different sensors or the same sensor at different times or from different perspectives.

Modern three-dimensional radiation treatment planning is based on sequences of tomographic images. Computed tomography (CT) has the potential to quantitatively characterize the physical properties of heterogeneous tissue in terms of electron densities. Magnetic resonance (MR) imaging, which looks at hydrogen atom densities, is very often superior to CT, especially for the task of differentiating between healthy tissue and tumor tissue. Positron emission tomography (PET), single-photon emission computed tomography (SPECT), and magnetic resonance spectroscopy (MRS) imaging have the potential to include information on tumor metabolism. The various image modalities are non-competing. They have specific properties and deliver complementary information. The images supply important information for delineation of tumor and target volume, and for therapy monitoring.
There is usually a high degree of shared information between images of different modalities of the same structures. This is certainly the case with images of the same modality that have been acquired at different times or from different perspectives. For the example presented in Chapter 4, the modality of the digitally reconstructed radiograph (DRR) is CT, or CT-derived, and the problem is rectification of orientation, or pose, of the DRR with respect to an electronic portal image (EPI) by in-plane rotations and/or shifts.

In order to use image sequences of the same or various modalities simultaneously, a definite relation between the picture elements (pixels) or volume elements (voxels) of the various image sequences needs to be established. Methods that are able to calculate and establish these relations are called registration, matching, or image correlation techniques.

1.1 Taxonomy of Medical Image Registration Methodology

Image registration methods can be categorized by

• User interaction

– Manual
The transformation is determined directly by user interaction.

– Semi-automatic
The transformation is calculated automatically, but the user determines the image features used and the start-up parameters.

– Automatic
No user interaction is required.

• Scope and elasticity of transformation

– Scope

∗ Global
Global transformations modify the data set as a whole, that is, not just the therapy relevant area, by applying a single function to all elements.

∗ Local
In local procedures the function can vary. A single voxel, pixel, image slice, or organ could be affected.

– Elasticity or plasticity (geometrical properties) of transformation

∗ Rigid transformations
The distance between any two points in the first image is preserved when these two points are mapped onto the second image. Rigid transformations can be decomposed into translation, rotation, and reflection. Rigid transformations are special cases of affine transformations. A transformation is called affine when any straight line in the first image is mapped onto a straight line in the second image while parallelism is preserved. (A small numerical sketch of a rigid transformation follows this list.)

∗ Non-rigid affine transformations
Examples of non-rigid affine transformations are both uniform and non-uniform scaling and shearing.

∗ Curved or elastic transformations
A curved or elastic transformation can map a straight line onto a curve, for example, transformations described by polynomial functions. Rigid transformations can be described as curved transformations with zero elasticity. Elastic transformations can be approximated using local affine, that is, piecewise linear algorithms when using a small granularity, that is, a fine level of detail.

• Usage of auxiliary means
Methods are distinguished by the amount of auxiliary information that must be used before or during image acquisition to allow subsequent registration, for example, the use of external or internal markers.
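To make the rigid case concrete, here is a minimal numerical sketch of a 2D rigid transformation (rotation plus translation) applied to point coordinates. The Python/NumPy setting, the function name, and the example points are choices made for this illustration only; they are not taken from the thesis.

```python
import numpy as np

def rigid_transform_2d(points, angle_deg, tx, ty):
    """Apply a 2D rigid transformation (rotation then translation) to an Nx2 array of points."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s],
                         [s,  c]])        # pure rotation: preserves distances and angles
    translation = np.array([tx, ty])
    return points @ rotation.T + translation

# The distance between any two points is preserved by a rigid transformation.
pts = np.array([[0.0, 0.0],
                [3.0, 4.0]])
moved = rigid_transform_2d(pts, angle_deg=45.0, tx=10.0, ty=-2.0)
print(np.linalg.norm(pts[1] - pts[0]))      # 5.0
print(np.linalg.norm(moved[1] - moved[0]))  # 5.0 (unchanged)
```

The two printed distances agree, which is exactly the defining property of a rigid transformation.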
1.1 Taxonomy of Medical Image Registration Methodology 1.1.1 Prospective and Retrospective Methods The methods listed in the previous section can also be categorized under what are called prospective and retrospective image registration methods [13]. Prospective methods always register artificial as opposed to anatomical land- marks (fiducials), for example, objects attached to the patient that are detectable in all pertinent modalities. If the position of the landmarks relative to the patient remains constant, a high degree of accuracy is possible. The disadvantage is that, since the calculation of the transformation is based only on a restricted number of landmarks, the resulting transformations are always rigid. If organ movements occur between the acquisition of different image series, these shifts cannot be con- sidered correctly. Additionally, this procedure is invasive and inconvenient for the patient and staff. Retrospective methods do not need any setup steps and are not subject to the conditions of image acquisition. To derive the necessary transformation, only patient related references (anatomical landmarks, surface of structures, etc.) are used. Retrospective methods include the following: • Feature-based methods – Point matching (Procrustes method) Identification of corresponding points in the image series requires user interaction. An advantage of this method is that landmarks can be defined just in the therapy relevant image area or locally. There- fore, the registration results will not be influenced by possible distortions in other regions of the image. A disadvantage is that it is difficult to locate corresponding landmarks in complementary image modalities precisely. Thus, a large amount of anatomical knowledge is required of the user. The precision of the resulting transformation depends on the precision with which these corre- sponding landmarks can be identified and depends therefore on the resolution (pixel size) and slice distance/thickness of the images. – Surface matching 4
1.1 Taxonomy of Medical Image Registration Methodology Rather than single points, this method uses the entire surface of a structure, for example, a patient’s contour. A transformation is used that minimizes the distance between the surfaces of corre- sponding structures. An example is the head/hat analogy, where the contour of the head, segmented in the first image data set, is put over the head contour in the second data set. In this way, uncertainties in the definition of single points do not carry much weight. The disadvantage is that the structures must be segmented in an initial step. Since segmentation is a non-trivial task and normally only semi-automatic segmentation procedures deliver acceptable results, these methods are time-consuming. • Metrics of similarity This retrospective approach is a method whereby a value for the similarity of the two data sets is calculated. The image data set, an area or a volume to be registered, is transformed by an optimization procedure (Section 1.3) until the similarity with the first data set achieves a max- imum. Voxel similarity measure techniques can be fully automatic and have an accuracy comparable to bone-implanted markers, but they can also fail [5, 26]. A common cause of failure can be that the images are poorly aligned at the start. – Sum of squared intensity differences, a correlation-based distance mea- sure – Mutual information, an information theoretic technique Mutual information measurements consider the intensity distribution of both image data sets and are therefore well suited to the reg- istration of multimodal image sequences, since no presumptions about the intensities have to be made. For example, it is possible that regions with a high intensity in the reference image corre- spond to regions with low, medium, or high intensity in the test image. The mutual information reaches a maximum if, for a given intensity in image A, a distinct intensity in image B is found. 5
1.2 Types of Transformation There are no restrictions concerning the intensity combinations. An advantage of this method is that it does not depend on the localization of single landmarks. No user interaction is required. It can be fully automatic in that it makes no assumption of the functional form or relationship between intensities in the two im- age data sets. A disadvantage is that the calculation effort is much higher compared to point matching methods. Additionally, since this method considers the global intensity distribution and looks for the optimum over the image sequence on the whole, it is pos- sible that the precision of the superposition is suboptimal in the area of interest. The relevant theory and an application of mutual information-based medical image registration are presented in the chapters that follow. 1.2 Types of Transformation Since we are three-dimensional, moving beings, registration should be four-dimensional, that is, it should include the three spatial dimensions as well the temporal dimen- sion. In practice, however, assumptions and approximations are made so that the body can be represented in fewer dimensions. • 2D-2D 2D images may be registered simply by a rotation and two orthogonal trans- lations. Differences in scaling from the real object to each of the images to be registered may also be necessary. Controlling the geometry of image ac- quisition is usually very difficult, so clinically relevant examples of 2D-2D registration are rare. For this study, 2D images are used. • 3D-3D In the case of 3D-3D image registration, three translations and three ro- tations bring the images into registration. It is assumed that the imaged part of the body behaves as a rigid body, that is, the internal anatomy of the patient has not distorted or changed. Careful calibration of scanning device/s is required to determine image scaling. 6
• 2D-3D
When establishing correspondence between 3D and projection images, 2D-3D registration may be required. The main application of these methods is in image-guided interventions.

• Time
This class of registration problem concerns image sequences that follow some process that changes with time. [5]

1.3 Image Registration as an Optimization Problem

Algorithms that directly calculate the transformation to establish correspondence between two images can be used where the images to be registered have very similar intensities and the transformation required to establish correspondence is small. This is the case with the point matching, or Procrustes, method described above. In other cases, a process of optimization is required.

Optimization algorithms compute a cost or similarity function relating to how well two images are registered. The goal is to minimize or maximize the associated cost function. The cost function can be expressed as

C = Ctransformation − Csimilarity,

where the first term characterizes the cost associated with particular transformations (lateral translations, rotations, nonlinear, etc.) and the second term characterizes the similarity between the reference and test images. Mutual information and sum of squared intensity differences are examples of cost functions. For mutual information-based image registration, as will be seen in the following chapters, the cost function increases as the images to be registered come into alignment. Conversely, when the sum of squared intensity differences algorithm is used, the cost function decreases as alignment increases.

Image registration as an optimization problem is hard to solve in that the problem is ill-posed [9]. Small changes of the input images can lead to completely
different registration results. Optimization algorithms can converge to a suboptimal solution that can cause registration to fail. Moreover, the solution may not be unique. An example of this is illustrated in Figure 1.1. In order to register the test image B to the reference image A, several solutions are found: two pure translations followed by a rotation of 45 degrees, or equivalently, a rotation of 135 degrees followed by two pure translations, etc. The objective value at optimality is the same, but the solution path is different. Clearly, more information is required in order to determine which transformation should be used. A property of medical images that are registered is that they are discrete. That is, the object is sampled at a finite number of voxels or pixels.

Figure 1.1: Registration example with multiple solutions. The image B is to be transformed to image A. The rightmost image shows B registered to A by two translations and a rotation of 45 degrees.

For this study, four algorithms are used with mutual information to find a solution to a medical image registration problem: a greedy algorithm, a genetic algorithm, the fminsearch algorithm of MATLAB (MATLAB is a registered trademark of The MathWorks, Inc.), and finally, simulated annealing. Fminsearch and simulated annealing were used to refine the results obtained with the greedy and genetic algorithms.

Greedy algorithms are those that follow the problem solving meta-heuristic of making the locally optimum choice at each stage with the hope of finding the global optimum. Greedy algorithms find the overall or globally optimal solution for some optimization problems, but may find suboptimal solutions for some instances of other problems. Greedy algorithms work in phases. In each phase, a decision is made that appears to be good, without regard for future consequences. Generally, this means that some local optimum is chosen. This 'take what you can get now' strategy is the source of the name for this class of algorithms. When the algorithm terminates, it is hoped that the local optimum is the global optimum. If that is the case, then the algorithm is correct. Otherwise, the algorithm has produced a suboptimal solution. In the case of medical image registration, it is probably safe to state that the global optimum is the required solution.

Genetic algorithms comprise a particular class of evolutionary algorithms, inspired by the mechanisms of genetics, that has been applied to global optimization problems. Genetic algorithms use biologically-derived techniques such as inheritance, mutation, natural selection, and recombination or crossover. With genetic algorithms, each of a population consisting of a number of trial solutions, image transformations for this study, is evaluated to yield 'fitness.' A new generation is created from the better of them, crossover being the dominant means of creating new members. The process is continued through a number of generations with the objective that the population should evolve to contain a global optimum or at least an acceptable solution.

The MATLAB program fminsearch uses the Nelder-Mead simplex method of [7]. This is a direct search method that does not use numerical or analytic gradients. It is based on evaluating a function at the vertices of a simplex, then iteratively shrinking the simplex as better points are found until some desired bound is obtained [10].

Simulated annealing is based on concepts and techniques from physics, specifically statistical mechanics. Physically, the technique involves heating and gradual cooling, toward zero or no randomness, of a system such as magnets or atoms in an alloy to increase the size of its crystals and reduce their defects. Algorithmically, simulated annealing is based on the idea of taking a random walk through the search space at successively lower temperatures, temperature being a parameter controlling randomness, where the probability of taking a step is given by a Boltzmann distribution.
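As a rough sketch of how a similarity measure drives such a search, the code below performs a coarse exhaustive sweep over in-plane rotation angles and keeps the angle with the highest mutual information. It is only an illustration written for this text: the histogram-based MI estimate and the use of scipy.ndimage.rotate are choices made here, and the greedy, genetic, Nelder-Mead, and simulated annealing strategies actually used in this study are not reproduced.

```python
import numpy as np
from scipy.ndimage import rotate   # rotation routine chosen for this sketch

def mutual_information(a, b, bins=32):
    """Histogram-based estimate of I(A;B) = H(A) + H(B) - H(A,B), in bits."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1)
    p_b = p_ab.sum(axis=0)
    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return h(p_a) + h(p_b) - h(p_ab)

def coarse_rotation_search(reference, floating, angles):
    """Exhaustive sweep: return the in-plane rotation (degrees) that maximizes MI."""
    best_angle, best_mi = None, -np.inf
    for angle in angles:
        candidate = rotate(floating, angle, reshape=False, order=1)
        mi = mutual_information(reference, candidate)
        if mi > best_mi:
            best_angle, best_mi = angle, mi
    return best_angle, best_mi

# Toy usage: the MI score should peak near +10 degrees for a floating image
# that was created by rotating the reference by -10 degrees.
rng = np.random.default_rng(0)
reference = rng.random((64, 64))
floating = rotate(reference, -10.0, reshape=False, order=1)
print(coarse_rotation_search(reference, floating, angles=range(-20, 21)))
```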
Chapter 2

Information Theory

Entropy, relative entropy, and mutual information, the fundamental quantities of information theory, are defined as functionals of probability distributions. The derivation of mutual information from information theory is considered in this chapter.

2.1 Information

In 1928, R. V. L. Hartley (1888-1970) was the first to suggest the use of logarithms for the measure of information [6]. The definition of information in the context of information theory is an engineering definition based on probabilities, not on the meaning of symbols communicated to a human receiver. Therefore, it is somewhat counterintuitive.

Figure 2.1: Discrete noiseless binary channel.

Consider an alphabet of n symbols a1, a2, . . . , an, each with its corresponding
probability p(a1), p(a2), . . . , p(an) of occurrence, communicated via some channel. If p(a1) = 1, then a1 will be received with certainty. This is the case of the noiseless binary communication channel represented in Figure 2.1. Any transmission, 0 or 1, is received without error. There is no surprise in the occurrence of a1 given p(a1), so no information is obtained. If a symbol with a low probability occurs, there is more surprise, more information. This might be the case for the binary channel represented in Figure 2.2, where, for example, a 0 is sent and 1 is received. Thus, information is somewhat inversely related to the probability of occurrence.

Figure 2.2: Discrete binary symmetric channel.

The information from two independent symbols is the sum of the information from each separately. Since the probabilities of two independent choices are multiplied together to get the probability of the compound event, it is natural to define the amount of information as

Definition 2.1 Information
I(ai) = log (1/p(ai)). [22]

Remark 2.1 Information is inversely proportional to probability. Symbols with the least probability of occurring will provide the most information.
As a result,
I(a1) + I(a2) = log (1/(p(a1)p(a2))) = I(a1, a2).
Thus, given the two probabilities of events, the probability of both occurring, assuming independence, is a product. This product yields the amount of information as a sum.

2.2 Entropy (uncertainty or complexity)

Investigation by physicists in the 18th and 19th centuries concerning the loss of functional energy from combustion reactions led to the concept of entropy. In an 1850 paper, Rudolf Clausius (1822-1888), German physicist and mathematician, introduced the concept of a thermodynamic system and first stated the basic idea of the second law of thermodynamics, which states that the entropy of an isolated system cannot decrease. In 1865, Clausius gave the first mathematical version of the concept of entropy and coined the word 'entropy' from the Greek word 'trope' that means 'to drift' or 'to turn.' Austrian mathematician and physicist Ludwig Boltzmann (1844-1906) is credited with the invention of statistical mechanics. His theories and those of American mathematical physicist Willard Gibbs (1839-1903) provided a connection between the macroscopic property of entropy and the microscopic state of the system. [21]

The Shannon-Wiener (Claude Shannon (1916-2001), Norbert Wiener (1894-1964)) entropy measure H, originally developed as part of communication theory in the 1940s [14, 16], is the most commonly used measure of information in signal and image processing. The entropy function measures the amount of uncertainty, surprise, or information in the outcome of, for example, the reception of a message. This formula is derived from three conditions that a measure of uncertainty in a communication channel should satisfy.

1. The functional should be continuous in p.

2. If all pi equal 1/n, where n is the number of symbols, then H should be monotonically increasing in n.
Figure 2.3: Decomposition of a choice from three possibilities.

3. If a choice is broken down into a sequence of choices, then the original value of H should be the weighted sum of the constituent H. For example,
H(p1, p2, p3) = H(p1, p2 + p3) + (p2 + p3) H(p2/(p2 + p3), p3/(p2 + p3)).

The meaning of condition (3) is illustrated in Figure 2.3. On the left there are three possibilities: p1 = 1/2, p2 = 1/3, and p3 = 1/6. On the right a choice is made first between two possibilities, each with probability 1/2. If the second occurs, then a choice is made between probabilities 2/3 and 1/3. The final results have the same probabilities as before. In this special case it is required that
H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3).
The coefficient 1/2 is the weighting factor introduced, because this second choice occurs only half the time.

Claude Shannon proved that the −∑i pi log pi form was the only functional form satisfying all three conditions.
Theorem 2.1 The only H satisfying all three assumptions is of the form
H = −K ∑_{i=1}^{n} pi log pi,
where K is a positive constant.

Proof: Let H(1/n, 1/n, . . . , 1/n) = A(n). From condition (3) a choice can be decomposed from s^m equally likely possibilities into a series of m choices, each from s equally likely possibilities, from which A(s^m) = mA(s) is obtained. Similarly, A(t^n) = nA(t). An arbitrarily large n can be chosen and an m found to satisfy
s^m ≤ t^n < s^(m+1).
Taking the logarithms and dividing by n log s yields
m/n ≤ (log t)/(log s) ≤ m/n + 1/n,   or   |m/n − (log t)/(log s)| < ϵ,
where ϵ is arbitrarily small. From the monotonic property of A(n),
A(s^m) ≤ A(t^n) ≤ A(s^(m+1)),
mA(s) ≤ nA(t) ≤ (m + 1)A(s).
Dividing by nA(s),
m/n ≤ A(t)/A(s) ≤ m/n + 1/n,   or   |m/n − A(t)/A(s)| < ϵ,
so that
|A(t)/A(s) − (log t)/(log s)| ≤ 2ϵ,   and hence   A(t) = K log t,
where K must be positive to satisfy condition (2).

Now suppose there is a choice from n possibilities with commeasurable probabilities pi = ni/∑i ni, where the ni are integers. A choice can be broken down from ∑i ni possibilities into a choice from n possibilities with probabilities p1, . . . , pn and then, if the ith were chosen, a choice from ni with equal probabilities. Using condition (3) again, the total choice from ∑i ni is equated as computed by two methods:
K ∑i pi log ∑i ni = H(p1, . . . , pn) + K ∑i pi log ni.
Hence
H = K [∑i pi log ∑i ni − ∑i pi log ni] = −K ∑i pi log (ni/∑i ni) = −K ∑i pi log pi.
If the pi are incommeasurable, they may be approximated by rationals and the same expression must hold by the continuity assumption, condition (1). Thus the expression holds in general. The choice of coefficient K is a matter of convenience and amounts to the choice of a unit of measure. [15]

Definition 2.2 The entropy H(X) of a discrete random variable X (Appendix A, Definition A.1) is a measure of the uncertainty of the random variable, and is defined by
H(X) = −∑_{x∈X} p(x) log p(x) = ∑_{x∈X} p(x) log (1/p(x)).
The above quantity can also be expressed as H(p).

Remark 2.2 The entropy of X can also be interpreted as the expected value (Appendix A, Definition A.3) of log (1/p(X)), where X is drawn according to the probability distribution function p(x) (Appendix A, Definition A.2). Thus
H(X) = Ep log (1/p(X)). [2]

The following lemmas are immediate consequences of Definition 2.2.
Lemma 2.1 H(X) ≥ 0.
Proof: 0 ≤ p(x) ≤ 1 implies log (1/p(x)) ≥ 0.

Lemma 2.2 Hb(X) = (logb a) Ha(X). (see Appendix C)
Proof: logb p = (loga p)/(loga b) = (logb a)(loga p).

Using Lemma 2.2 for the function p log2 (1/p) (Figure 2.4, r = 2) and differentiating,
d/dp [p log2 (1/p)] = (d/dp [p loge (1/p)]) log2 e = log2 (1/p) − log2 e
is obtained. From the last equation, it can be seen that the slope at p = 0 is infinite and that the maximum of the entropy function occurs at p = 1/e.

Remark 2.3 lim x→0 (x loge x) = 0.
lim x→0 (x loge x) = lim x→0 (loge x)/(1/x).
Application of l'Hospital's rule yields
lim x→0 (1/x)/(−1/x²) = lim x→0 (−x) = 0.
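As a small numerical companion to Definition 2.2, the snippet below evaluates H for a few distributions in base 2 (bits); the helper name and the sample distributions are choices made for this illustration, not material from the thesis.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H = -sum p log p, with 0 log 0 taken as 0 (Remark 2.3)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                  # 1.0 bit: the maximum for two outcomes
print(entropy([1.0, 0.0]))                  # 0.0 bits: a certain outcome carries no surprise
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```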
2.2 Entropy (uncertainty or complexity) Figure 2.4: The entropy function for logarithmic bases 2, e, and 10. When using logarithmic base 2, the unit of information is called a bit (binary digit). When base 10 is used, the unit is called a Hartley, after R. V. L. Hartley, the first to propose the use of the logarithmic measure of information. A nat is the unit of information corresponding to logarithmic base e. 17
2.2 Entropy (uncertainty or complexity) Figure 2.5: The entropy function (logarithmic base 2) of two probabilities. 18
Example 2.1 Entropy as a function of two probabilities (Figure 2.5).
Let
X = 1 with probability p,   X = 0 with probability 1 − p.
Then
H(X) = −p log p − (1 − p) log(1 − p) ≡ H(p).

Figures 2.4 and 2.5 illustrate some of the basic properties of entropy. Entropy is a concave (Appendix B, Definition B.2) function of the distribution and equals 0 when p equals 0 or 1. This makes sense since there is no uncertainty when p equals 0 or 1. In other words, when the outcome is certain, entropy vanishes. For a given n, entropy is a maximum when all probabilities are equal to 1/n. In the case of two probabilities, using logarithmic base 2, as illustrated in Figure 2.5, p equal to 1/2 corresponds to the maximum value H(p) = 1 of entropy.

Theorem 2.2 Fundamental inequality
∑i xi log2 (yi/xi) ≤ 0.

Proof: Fitting the tangent line at the point (1, 0) on the loge x function (Figure 2.6), the slope is
d(loge x)/dx |x=1 = 1,
so that the tangent line is y − 0 = 1 · (x − 1), or y = x − 1. Therefore, for all x greater than 0,
loge x ≤ x − 1,   (2.1)
2.2 Entropy (uncertainty or complexity) Figure 2.6: Bound on the loge(x) function. 20
where equality holds only at x = 1.

Let {xi} and {yi} be two probability distribution functions, where, of course, ∑i xi = 1 and ∑i yi = 1, with xi, yi ≥ 0. Now consider the following expression, using Lemma 2.2, involving both distributions:
∑i xi log2 (yi/xi) = (1/loge 2) ∑i xi loge (yi/xi).
Using the relationship of Equation 2.1,
(1/loge 2) ∑i xi loge (yi/xi) ≤ (1/loge 2) ∑i xi (yi/xi − 1) = (1/loge 2) (∑i yi − ∑i xi) = 0.
Converting back to logarithmic base 2 yields the fundamental inequality,
∑i xi log2 (yi/xi) ≤ 0,
or equivalently,
∑i xi log2 (xi/yi) ≥ 0.   (2.2)
Equality holds only when the xi = yi. The fundamental inequality is also called the relative entropy or Kullback-Leibler distance (Section 2.2.2). [22]

2.2.1 Joint Entropy and Conditional Entropy

Definition 2.3 The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) (Appendix A, Definition A.4) is defined as
H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y).   (2.3)
For the special case when X and Y are statistically independent (Appendix A, Definition A.5) for all i and j, the independence condition is expressed by P(xi, yj) = p(xi) p(yj). The joint entropy becomes
H(X, Y) = ∑i ∑j p(xi) p(yj) [log (1/p(xi)) + log (1/p(yj))].
Since ∑i p(xi) = 1 and ∑j p(yj) = 1,
H(X, Y) = H(X) + H(Y).

Definition 2.4 If (X, Y) ∼ p(x, y), then the conditional entropy H(Y|X) is defined as
H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
= −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)
= −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
= −∑_{x∈X} ∑_{y∈Y} p(x, y) log (p(x, y)/p(x)).   (2.4)

Remark 2.4 If X and Y are independent, then
H(Y|X) = −∑_{x∈X} ∑_{y∈Y} p(x) p(y) log p(y) = ∑_{x∈X} ∑_{y∈Y} p(x) p(y) log (1/p(y)) = ∑_{y∈Y} p(y) log (1/p(y)) = H(Y).
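A short sketch of Definitions 2.3 and 2.4 computed from a joint probability table; the array layout (rows indexed by x) and the function names are assumptions made for this illustration.

```python
import numpy as np

def joint_entropy(p_xy):
    """H(X,Y) = -sum p(x,y) log2 p(x,y), Definition 2.3 (zero entries contribute nothing)."""
    p = p_xy[p_xy > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(p_xy):
    """H(Y|X) = -sum p(x,y) log2( p(x,y)/p(x) ), Definition 2.4, with rows indexed by x."""
    p_x = np.broadcast_to(p_xy.sum(axis=1, keepdims=True), p_xy.shape)
    mask = p_xy > 0
    return -np.sum(p_xy[mask] * np.log2(p_xy[mask] / p_x[mask]))

# For independent X and Y, H(X,Y) = H(X) + H(Y) and H(Y|X) = H(Y), as in Remark 2.4.
p_x = np.array([0.5, 0.5])
p_y = np.array([0.25, 0.75])
p_xy = np.outer(p_x, p_y)
print(joint_entropy(p_xy))        # ~1.811 bits = 1.0 + 0.811
print(conditional_entropy(p_xy))  # ~0.811 bits = H(Y)
```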
Theorem 2.3 Conditioning reduces entropy.
H(X|Y) ≤ H(X),
with equality if and only if X and Y are independent (Theorem 2.14).

Theorem 2.4 Chain rule for entropy
Let X1, X2, . . . , Xn be drawn according to p(x1, x2, . . . , xn). Then
H(X1, X2, . . . , Xn) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).

Proof: By repeated application of the two-variable expansion rule for entropies,
H(X1, X2) = H(X1) + H(X2|X1),
H(X1, X2, X3) = H(X1) + H(X2, X3|X1) = H(X1) + H(X2|X1) + H(X3|X2, X1),
...
H(X1, . . . , Xn) = H(X1) + H(X2|X1) + . . . + H(Xn|Xn−1, . . . , X1) = ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1).

Alternative proof: Write p(x1, . . . , xn) = ∏_{i=1}^{n} p(xi|xi−1, . . . , x1) and evaluate
H(X1, X2, . . . , Xn) = −∑_{x1,x2,...,xn} p(x1, x2, . . . , xn) log p(x1, x2, . . . , xn)
= −∑_{x1,...,xn} p(x1, . . . , xn) log ∏_{i=1}^{n} p(xi|xi−1, . . . , x1)
= −∑_{x1,...,xn} ∑_{i=1}^{n} p(x1, . . . , xn) log p(xi|xi−1, . . . , x1)
= −∑_{i=1}^{n} ∑_{x1,...,xn} p(x1, . . . , xn) log p(xi|xi−1, . . . , x1)
= −∑_{i=1}^{n} ∑_{x1,...,xi} p(x1, . . . , xi) log p(xi|xi−1, . . . , x1)
= ∑_{i=1}^{n} H(Xi|Xi−1, . . . , X1). [2]
Theorem 2.5 The entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other:
H(X, Y) = H(X) + H(Y|X).

Proof:
H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x, y)
= −∑_{x∈X} ∑_{y∈Y} p(x, y) log [p(x) p(y|x)]
= −∑_{x∈X} ∑_{y∈Y} p(x, y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
= −∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x, y) log p(y|x)
= H(X) + H(Y|X).
Equivalently, log p(X, Y) = log p(X) + log p(Y|X) can be written. Taking the expectation (Appendix A, Definition A.3) of both sides of the equation yields the theorem. [2]

Example 2.2 Noisy communication channel.
The input source to a noisy communication channel (Figure 2.7) is a random variable X over the four symbols a, b, c, d. The output from the channel is a random variable Y over the same four symbols. The joint distribution of X and Y is given by
          y=a     y=b     y=c     y=d
  x=a     1/8     1/16    1/16    1/4
  x=b     1/16    1/8     1/16    0
  x=c     1/32    1/32    1/16    0
  x=d     1/32    1/32    1/16    0

Figure 2.7: Noisy communication channel, Example 2.2.

The marginal distribution (Appendix A, Definition A.4) for Y is (1/4, 1/4, 1/4, 1/4). The marginal entropy of Y, H(Y), is
H(Y) = −∑_{i=1}^{4} p(yi) log2 p(yi) = 4 [−(1/4)(log2 1 − log2 4)] = 2 bits.

The marginal distribution for X is (1/2, 1/4, 1/8, 1/8). The marginal entropy of X, H(X), is
H(X) = −∑_{i=1}^{4} p(xi) log2 p(xi) = −(1/2)(log2 1 − log2 2) − (1/4)(log2 1 − log2 4) + 2 [−(1/8)(log2 1 − log2 8)]
= 1/2 + 1/2 + 3/4 = 7/4 bits.

The joint entropy H(X, Y) of X and Y in bits is the sum of −p log p (Section 2.2.1, Equation 2.3) over all 16 probabilities in the joint distribution:
H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log2 p(x, y)
= −(1/4)(log2 1 − log2 4) + 2 [−(1/8)(log2 1 − log2 8)] + 6 [−(1/16)(log2 1 − log2 16)] + 4 [−(1/32)(log2 1 − log2 32)]
= 1/2 + 2(3/8) + 6(4/16) + 4(5/32) = 27/8 bits.

The conditional entropy H(X|Y) (Section 2.2.1, Equation 2.4) is
H(X|Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log2 (p(x, y)/p(y)) = ∑_{x∈X} ∑_{y∈Y} p(x, y) log2 (p(y)/p(x, y)).
Since p(y) = 1/4 for every y, collecting the nonzero terms by the value of p(x, y) gives
H(X|Y) = 2 (1/8) log2 2 + 6 (1/16) log2 4 + 4 (1/32) log2 8 + (1/4) log2 1 = 1/4 + 3/4 + 3/8 + 0 = 11/8 bits.
The conditional entropy H(Y|X) = 13/8 bits.

The joint entropy H(X, Y) can also be calculated using Theorem 2.5:
H(X, Y) = H(Y) + H(X|Y) = 2 + 11/8 = 27/8 bits.
This example is continued in Section 2.3, where the mutual information for this example is calculated.

2.2.2 Relative Entropy

Definition 2.5 The relative entropy or Kullback-Leibler distance between two probability mass functions (Appendix A, Definition A.2) p(x) and q(x) is defined as
D(p∥q) = ∑_{x∈X} p(x) log (p(x)/q(x)) = Ep log (p(X)/q(X)),   (2.5)
where E denotes expectation.
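The figures in Example 2.2, together with a small relative-entropy evaluation in the sense of Definition 2.5, can be checked numerically; the script below is only a sanity check written for this presentation, not part of the thesis.

```python
import numpy as np

# Joint distribution of Example 2.2 (rows x = a, b, c, d; columns y = a, b, c, d).
p_xy = np.array([[1/8,  1/16, 1/16, 1/4],
                 [1/16, 1/8,  1/16, 0.0],
                 [1/32, 1/32, 1/16, 0.0],
                 [1/32, 1/32, 1/16, 0.0]])

def h(p):
    """Entropy in bits of a (possibly multidimensional) probability array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

h_x = h(p_xy.sum(axis=1))   # 7/4  = 1.75
h_y = h(p_xy.sum(axis=0))   # 2
h_xy = h(p_xy)              # 27/8 = 3.375
print(h_x, h_y, h_xy, h_xy - h_y, h_xy - h_x)   # last two: H(X|Y) = 11/8, H(Y|X) = 13/8

# Relative entropy (Definition 2.5) between the two marginals: D(p||q) >= 0.
p, q = p_xy.sum(axis=1), p_xy.sum(axis=0)
print(np.sum(p * np.log2(p / q)))   # 0.25 bits (all entries of p and q are positive here)
```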
Definition 2.6 The conditional relative entropy D(p(y|x)∥q(y|x)) is the average of the relative entropies between the conditional probability mass functions p(y|x) and q(y|x), averaged over the probability mass function p(x). More precisely,
D(p(y|x)∥q(y|x)) = ∑_{x} p(x) ∑_{y} p(y|x) log (p(y|x)/q(y|x)) = Ep(x,y) log (p(Y|X)/q(Y|X)),
where E denotes expectation. [2]

Theorem 2.6 Chain rule for relative entropy
D(p(x, y)∥q(x, y)) = D(p(x)∥q(x)) + D(p(y|x)∥q(y|x)).

Proof:
D(p(x, y)∥q(x, y)) = ∑_{x} ∑_{y} p(x, y) log (p(x, y)/q(x, y))
= ∑_{x} ∑_{y} p(x, y) log [p(x) p(y|x) / (q(x) q(y|x))]
= ∑_{x} ∑_{y} p(x, y) log (p(x)/q(x)) + ∑_{x} ∑_{y} p(x, y) log (p(y|x)/q(y|x))
= D(p(x)∥q(x)) + D(p(y|x)∥q(y|x)). [2]

Theorem 2.7 Jensen's inequality
If f is a convex function (Appendix B, Definition B.1) and X is a random variable, then
Ef(X) ≥ f(EX),   (2.6)
where E denotes expectation. If f is strictly convex, then equality implies that X = EX with probability 1, that is, X is a constant.
Proof: This is proven for discrete distributions by induction on the number of mass points. For a two mass point distribution, the inequality (Equation 2.6) becomes
p1 f(x1) + p2 f(x2) ≥ f(p1 x1 + p2 x2),
which follows directly from the definition of convex functions. Suppose that the theorem is true for distributions with k − 1 mass points. Then writing p′i = pi/(1 − pk) for i = 1, 2, . . . , k − 1,
∑_{i=1}^{k} pi f(xi) = pk f(xk) + (1 − pk) ∑_{i=1}^{k−1} p′i f(xi)
≥ pk f(xk) + (1 − pk) f(∑_{i=1}^{k−1} p′i xi)
≥ f(pk xk + (1 − pk) ∑_{i=1}^{k−1} p′i xi)
= f(∑_{i=1}^{k} pi xi),
where the first inequality follows from the induction hypothesis and the second follows from the definition of convexity. [2]

Theorem 2.8 Log sum inequality
For non-negative numbers a1, a2, . . . , an and b1, b2, . . . , bn,
∑_{i=1}^{n} ai log (ai/bi) ≥ (∑_{i=1}^{n} ai) log (∑_{i=1}^{n} ai / ∑_{i=1}^{n} bi),
with equality if and only if ai/bi = constant.

Proof: Assume without loss of generality that ai > 0 and bi > 0 (if ai = bi = 0, then the sums are 0). The function f(t) = t log t is strictly convex, since f′′(t) = (1/t) log e > 0 for all positive t (Appendix B, Theorem B.1). Hence, by Jensen's inequality (Theorem 2.7),
∑_{i} αi f(ti) ≥ f(∑_{i} αi ti)
for weights αi ≥ 0 with ∑_{i} αi = 1.
Setting αi = bi/∑_{j=1}^{n} bj and ti = ai/bi,
∑_{i} (ai/∑_{j} bj) log (ai/bi) ≥ (∑_{i} ai/∑_{j} bj) log (∑_{i} ai/∑_{j} bj)
is obtained, that is, the log sum inequality (after multiplying both sides by ∑_{j} bj). [2]

Theorem 2.9 D(p∥q) is convex in the pair (p, q); that is, if (p1, q1) and (p2, q2) are two pairs of probability mass functions, then
D(λp1 + (1 − λ)p2 ∥ λq1 + (1 − λ)q2) ≤ λD(p1∥q1) + (1 − λ)D(p2∥q2)   (2.7)
for all 0 ≤ λ ≤ 1.

Proof: The log sum inequality (Theorem 2.8) is applied to a term on the left-hand side of 2.7, that is,
(λp1(x) + (1 − λ)p2(x)) log [(λp1(x) + (1 − λ)p2(x)) / (λq1(x) + (1 − λ)q2(x))]
≤ λp1(x) log (λp1(x)/λq1(x)) + (1 − λ)p2(x) log ((1 − λ)p2(x)/((1 − λ)q2(x))).
Summing this over all x, the desired property is obtained. [2]

Theorem 2.10 Information inequality
Let p(x), q(x), x ∈ X, be two probability mass functions. Then
D(p∥q) ≥ 0,
with equality if and only if p(x) = q(x) for all x.
Proof: Let A = {x : p(x) > 0} be the support set of p(x). Then
−D(p∥q) = −∑_{x∈A} p(x) log (p(x)/q(x))
= ∑_{x∈A} p(x) log (q(x)/p(x))
≤ log ∑_{x∈A} p(x) (q(x)/p(x))   (2.8)
= log ∑_{x∈A} q(x)
≤ log ∑_{x∈X} q(x)
= log 1 = 0,
where 2.8 follows from Theorem 2.7, Jensen's inequality. Since log t is a strictly concave function of t, there is equality in 2.8 if and only if q(x)/p(x) = 1 everywhere, that is, p(x) = q(x). Hence D(p∥q) = 0 if and only if p(x) = q(x) for all x. [2]

Corollary 2.1 D(p(y|x)∥q(y|x)) ≥ 0, with equality if and only if p(y|x) = q(y|x) for all y and x with p(x) > 0. [2]

Theorem 2.11 H(X) ≤ log |X|, where |X| denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution over X.

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then
D(p∥u) = ∑_{x} p(x) log (p(x)/u(x)) = log |X| − H(X).
By the non-negativity of relative entropy (Equation 2.2, Theorem 2.2, and Theorem 2.10),
0 ≤ D(p∥u) = log |X| − H(X).   (2.9)
[2]

2.2.3 Concavity of Entropy

Theorem 2.12 H(p) is a concave function of p.

Proof: From Equation 2.9,
H(p) = log |X| − D(p∥u),
where u is the uniform distribution on |X| outcomes. The concavity of H then follows directly from the convexity of D (Theorem 2.9).

Alternative proof: Let X1 be a random variable with distribution p1 taking on values in a set A. Let X2 be another random variable with distribution p2 on the same set. Let
θ = 1 with probability λ,   θ = 2 with probability 1 − λ,
and let Z = Xθ. Then the distribution of Z is λp1 + (1 − λ)p2. Since conditioning reduces entropy (Theorem 2.3),
H(Z) ≥ H(Z|θ),
or equivalently,
H(λp1 + (1 − λ)p2) ≥ λH(p1) + (1 − λ)H(p2),
which proves the concavity of the entropy as a function of the distribution. [2]
2.2.4 Entropy of a Signaling System

Consider again the communication channel of Section 2.1. If I(ai) units of information (Definition 2.1) are obtained when the symbol ai is received, then, on the average, since p(ai) is the probability of getting the information I(ai), for each symbol ai
p(ai) I(ai) = p(ai) log (1/p(ai))
units of information are received. From this it follows that, on the average, over the entire alphabet of symbols ai,
∑_{i=1}^{n} p(ai) log (1/p(ai))
units will be received per symbol. This important quantity is the entropy of the signaling system A having symbols ai with probabilities p(ai):
H(A) = ∑_{i=1}^{n} p(ai) log (1/p(ai)).

2.2.5 Entropy and Kolmogorov Complexity

The Kolmogorov complexity is approximately equal to the Shannon entropy H. Entropy is often called information in that it is a bound on the length of the code that is required to transmit a message.

Kolmogorov complexity (also called minimum description length principle, Chaitin complexity, descriptive complexity, shortest program length, and algorithmic entropy) is related to communication. If the sender and receiver agree upon a specification method, such as an encoding or compression technique, then a message a can be transmitted as b and decoded given some fixed method L, denoted L(b) = a. The cost of the transmission of a is the length of the transmitted message b. The least cost is the minimum length of such a message. The minimum length is the entropy of a given the method of transmission. In other words, entropy supplies a lower bound on the average code length for any instantaneous decodable system (decodable without reference to future codewords, since the end of each codeword is immediately recognizable). The complexity of the minimum description length is universal, that is, computer independent.
Definition 2.7 The Kolmogorov complexity of a binary string a, K(a), is defined as the length of the shortest binary program for computing the string,
K(a) = min {length(b) : U(b) = a},
where U represents the abstract universal Turing computer that can implement any algorithm and compute any computable function. [2, 3]

Remark 2.5 Kolmogorov complexity is reminiscent of Ockham's Razor, a principle formulated by William of Ockham (about 1288-1348) stating that terms, concepts, assumptions, etc., must not be multiplied beyond necessity.

Remark 2.6 Einstein stated that "Everything should be made as simple as possible, but no simpler."

Definition 2.8 Relative entropy of a symbol source
The ratio of the entropy of a symbol source to the maximum value it could have while restricted to the same symbols is called its relative entropy. This is the maximum compression possible when encoding into the same symbol set.

Example 2.3 Coding example.
As a simple illustration, consider a source that produces a sequence of letters chosen from A, B, C, and D with respective probabilities 1/2, 1/4, 1/8, and 1/8, successive symbols being chosen independently. Then, using the logarithmic base 2,
H = −(1/2 log 1/2 + 1/4 log 1/4 + 2/8 log 1/8) = 7/4 bits per symbol.
Therefore, a coding system can be approximated to encode messages from this source into binary digits with an average of 7/4 binary digits per symbol. The limiting value can be achieved in this case by arranging letters in order of decreasing probability, and then, starting with the symbol that has the highest probability of occurrence, assigning codewords of increasing length so that the longest codeword is assigned to the symbol with the lowest probability:
A → 0,   B → 10,   C → 110,   D → 111.
The average number of binary digits used in encoding a sequence of N symbols will be
N (1/2 × 1 + 1/4 × 2 + 2/8 × 3) = (7/4) N.
The binary digits 0, 1 have probabilities 1/2, 1/2, so the H for the coded sequences is one bit per symbol. On the average, there are 7/4 binary symbols per original letter, so the entropies on a time basis are the same. The maximum possible entropy for the original symbol set is log2 4 = 2, occurring when A, B, C, and D have probabilities 1/4, 1/4, 1/4, and 1/4. Therefore, the relative entropy (Definition 2.8) is 7/8. The binary sequences can be translated into the original set of symbols on a two-to-one basis as follows:
00 → A′,   01 → B′,   10 → C′,   11 → D′.
This double process then encodes the original message into the same symbols but with an average compression ratio of 7/8. [15]

Theorem 2.13 Fundamental theorem for a noiseless channel
Let a source have entropy H (bits per symbol) and a channel have a capacity C (bits per second). Then it is possible to encode the output of the source in such a way as to transmit at the average rate C/H − ϵ symbols per second over the channel, where ϵ is arbitrarily small. It is not possible to transmit at an average rate greater than C/H.

Proof: That C/H cannot be exceeded may be proven by noting that the entropy of the channel input per second H′ (bits per second) is equal to that of the source
since the transmitter must be non-singular, and this entropy cannot exceed the channel capacity. Hence H′ ≤ C, and the number of symbols per second equals H′/H ≤ C/H.

To prove the first part of the theorem, consider the set of all sequences of N symbols produced by the source. For large N, the symbols can be divided into two groups, one containing less than 2^((H+η)N) members and the second containing less than 2^(RN) members (where R is the logarithm of the number of different symbols) and having a total probability less than µ. As N increases, η and µ approach zero. The number of signals of duration T in the channel is greater than 2^((C−θ)T), with θ small when T is large. If
T = (H/C + λ) N
is chosen, then there will be a sufficient number of sequences of channel symbols for the high probability group when N and T are sufficiently large (however small λ) and also some additional ones. The high probability group is coded in an arbitrary one-to-one way into this set. The remaining sequences are represented by larger sequences, starting and ending with one of the sequences not used for the high probability group. This special sequence acts as a start and stop signal for a different code. In between, a sufficient time is allowed to give enough different sequences for all the low probability messages. This will require
T1 = (R/C + φ) N,
where φ is small. The mean rate of transmission in message symbols per second will then be greater than
[(1 − δ) T/N + δ T1/N]^(−1) = [(1 − δ)(H/C + λ) + δ(R/C + φ)]^(−1).
As N increases, δ, λ, and φ approach zero and the rate approaches C/H. [15]

Remark 2.7 If a source can only produce one particular message, its entropy is zero and no channel is required.

Theorem 2.13 is a precursor to Shannon's proof, not presented here (see [15]), that, even in the presence of noise (errors), a suitable encoding system can be found.
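Returning to Example 2.3, the entropy of the source, the average length of the variable-length code, and the relative entropy of Definition 2.8 can be verified directly; the dictionary-based encoding below is simply one convenient way to write the check and is not code from the thesis.

```python
import math

probs = {'A': 1/2, 'B': 1/4, 'C': 1/8, 'D': 1/8}
code  = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

source_entropy = -sum(p * math.log2(p) for p in probs.values())
average_length = sum(probs[s] * len(code[s]) for s in probs)
print(source_entropy, average_length)   # both 1.75: the code achieves the entropy bound

# Relative entropy of the source (Definition 2.8): H divided by its maximum, log2(4) = 2.
print(source_entropy / math.log2(len(probs)))   # 0.875 = 7/8
```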
2.3 Mutual Information

Mutual information is a measure of the amount of information that one random variable contains about another random variable. It is the reduction in the uncertainty of one random variable due to the knowledge of the other. C. E. Shannon first presented the functional form of mutual information in 1948. He defined it as the "rate of transmission" in a noisy communication channel between source and receiver. [14]

Consider a communication channel where the input symbols are the ai of alphabet A and the output symbols are the bj of alphabet B. The channel is defined by the conditional probabilities P(ai|bj). Prior to reception, the probability of the input symbol ai was p(ai). This is the a priori probability of ai. After reception of bj, the probability that the input symbol was ai becomes P(ai|bj), the conditional probability that ai was sent given that bj was received. This is an a posteriori probability of ai. The change in the probability measures how much the receiver learned from the reception of the bj. In an ideal channel with no noise, the a posteriori probability is 1, since it is certain from the received bj exactly what was sent. In practical systems there are finite nonzero probabilities that errors will occur and that the receiver cannot be absolutely sure what was sent. The difference between the information uncertainty before (the a priori probabilities) and after (the a posteriori probabilities) reception of a bj measures the gain in information due to the reception of the bj. This information is called the mutual information and is defined as

Definition 2.9 Mutual information
I(ai, bj) = log [1/p(ai)] − log [1/P(ai|bj)] = log [P(ai|bj)/p(ai)].   (2.10)
[22]

If p(ai) equals P(ai|bj), then no information has been gained and the mutual information is zero, that is, no information has been transmitted. Only when something new has been learned about the probabilities of the ai from the received bj is the mutual information positive.

Multiplying the numerator and denominator of the rightmost log of Equation 2.10 by p(bj) yields
P(ai|bj)/p(ai) = P(ai|bj) p(bj) / (p(ai) p(bj)) = P(ai, bj) / (p(ai) p(bj)) = P(bj|ai) p(ai) / (p(ai) p(bj)) = P(bj|ai)/p(bj).
Therefore,
I(ai, bj) = log [P(ai, bj) / (p(ai) p(bj))] = I(bj, ai).
From the symmetry of the ai and bj,
I(ai, bj) = log [P(ai|bj)/p(ai)],   I(bj, ai) = log [P(bj|ai)/p(bj)],
and I(ai, bj) = I(bj, ai).
Additionally,
I(ai, bj) ≤ I(ai).
This follows from Definition 2.1,
I(ai) = log [1/p(ai)],
and
I(ai, bj) = log P(ai|bj) + I(ai) = log P(ai|bj) + log [1/p(ai)] = log P(ai|bj) − log p(ai) = log [P(ai|bj)/p(ai)].   (2.11)
Since the maximum for P(ai|bj) is 1 and the maximum for the log is thus 0, I(ai, bj) ≤ I(ai).
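For the channel of Example 2.2, the average mutual information over all symbol pairs can be computed from the joint distribution. The helper below uses the identity I(X;Y) = H(X) + H(Y) − H(X,Y) (equivalently H(X) − H(X|Y)); it is a sketch written for this text, not code from the thesis.

```python
import numpy as np

def mutual_information(p_xy):
    """Average mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table."""
    def h(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return h(p_xy.sum(axis=1)) + h(p_xy.sum(axis=0)) - h(p_xy)

# Joint distribution of the noisy channel of Example 2.2.
p_xy = np.array([[1/8,  1/16, 1/16, 1/4],
                 [1/16, 1/8,  1/16, 0.0],
                 [1/32, 1/32, 1/16, 0.0],
                 [1/32, 1/32, 1/16, 0.0]])
print(mutual_information(p_xy))   # 0.375 bits
```

The result, 3/8 bits, is consistent with H(X) − H(X|Y) = 7/4 − 11/8 from Example 2.2.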