We can solve for the Lagrange multiplier λ by substituting (2.32) into the constraint $\sum_k \mu_k = 1$ to give λ = −N. Thus we obtain the maximum likelihood solution in the form

$$\mu_k^{\mathrm{ML}} = \frac{m_k}{N}\tag{2.33}$$

which is the fraction of the N observations for which xk = 1.

We can consider the joint distribution of the quantities m1, …, mK, conditioned on the parameters µ and on the total number N of observations. From (2.29) this takes the form

$$\operatorname{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1\, m_2 \cdots m_K} \prod_{k=1}^{K} \mu_k^{m_k}\tag{2.34}$$

which is known as the multinomial distribution. The normalization coefficient is the number of ways of partitioning N objects into K groups of size m1, …, mK and is given by

$$\binom{N}{m_1\, m_2 \cdots m_K} = \frac{N!}{m_1!\, m_2! \cdots m_K!}.\tag{2.35}$$

Note that the variables mk are subject to the constraint

$$\sum_{k=1}^{K} m_k = N.\tag{2.36}$$

2.2.1 The Dirichlet distribution

We now introduce a family of prior distributions for the parameters {µk} of the multinomial distribution (2.34). By inspection of the form of the multinomial distribution, we see that the conjugate prior is given by

$$p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\tag{2.37}$$

where 0 ⩽ µk ⩽ 1 and $\sum_k \mu_k = 1$. Here α1, …, αK are the parameters of the distribution, and α denotes (α1, …, αK)ᵀ. Note that, because of the summation constraint, the distribution over the space of the {µk} is confined to a simplex of dimensionality K − 1, as illustrated for K = 3 in Figure 2.4 (Exercise 2.9).

The normalized form of this distribution is given by

$$\operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\tag{2.38}$$

which is called the Dirichlet distribution. Here Γ(x) is the gamma function defined by (1.141) while

$$\alpha_0 = \sum_{k=1}^{K} \alpha_k.\tag{2.39}$$
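As a brief aside not found in the original text, the maximum likelihood result (2.33) is easy to verify numerically. The following Python/NumPy sketch (the random seed, sample size, and the choice of true_mu are arbitrary illustrative assumptions) draws 1-of-K observations and recovers the parameters as count fractions:

```python
import numpy as np

# Minimal sketch of the ML estimate (2.33): with 1-of-K observations,
# mu_k^ML is simply the fraction of observations for which x_k = 1.
rng = np.random.default_rng(0)               # arbitrary seed
true_mu = np.array([0.2, 0.5, 0.3])          # assumed "true" parameters, K = 3
X = rng.multinomial(1, true_mu, size=1000)   # N x K matrix of 1-of-K vectors

m = X.sum(axis=0)        # counts m_k of each outcome
mu_ml = m / X.shape[0]   # equation (2.33): mu_k^ML = m_k / N
print(m, mu_ml)          # mu_ml is close to true_mu for large N
```

Note that the counts m1, …, mK are all the sketch ever uses: they are sufficient statistics for this distribution, exactly as the form of (2.34) suggests.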
Figure 2.4  The Dirichlet distribution over three variables µ1, µ2, µ3 is confined to a simplex (a bounded linear manifold) of the form shown, as a consequence of the constraints 0 ⩽ µk ⩽ 1 and $\sum_k \mu_k = 1$.

Plots of the Dirichlet distribution over the simplex, for various settings of the parameters αk, are shown in Figure 2.5.

Multiplying the prior (2.38) by the likelihood function (2.34), we obtain the posterior distribution for the parameters {µk} in the form

$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) \propto p(\mathcal{D} \mid \boldsymbol{\mu})\, p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}.\tag{2.40}$$

We see that the posterior distribution again takes the form of a Dirichlet distribution, confirming that the Dirichlet is indeed a conjugate prior for the multinomial. This allows us to determine the normalization coefficient by comparison with (2.38), so that

$$p(\boldsymbol{\mu} \mid \mathcal{D}, \boldsymbol{\alpha}) = \operatorname{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1) \cdots \Gamma(\alpha_K + m_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}\tag{2.41}$$

where we have denoted m = (m1, …, mK)ᵀ. As for the case of the binomial distribution with its beta prior, we can interpret the parameters αk of the Dirichlet prior as an effective number of observations of xk = 1.

Johann Peter Gustav Lejeune Dirichlet (1805–1859) was a modest and reserved mathematician who made contributions in number theory, mechanics, and astronomy, and who gave the first rigorous analysis of Fourier series. His family originated from Richelet in Belgium, and the name Lejeune Dirichlet comes from ‘le jeune de Richelet’ (the young person from Richelet). Dirichlet’s first paper, published in 1825, brought him instant fame. It concerned Fermat’s last theorem, which claims that there are no positive integer solutions to xⁿ + yⁿ = zⁿ for n > 2. Dirichlet gave a partial proof for the case n = 5, which was sent to Legendre for review, who in turn completed the proof. Later, Dirichlet gave a complete proof for n = 14, although a full proof of Fermat’s last theorem for arbitrary n had to wait until the work of Andrew Wiles in the closing years of the 20th century.

Note that two-state quantities can either be represented as binary variables and modelled using the binomial distribution (2.9), or as 1-of-2 variables and modelled using the multinomial distribution (2.34) with K = 2.
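The conjugate update (2.41) also lends itself to a short numerical check. The sketch below is an editorial addition, not part of the original text; the prior pseudo-counts and the observed counts are invented for illustration:

```python
import numpy as np
from scipy.stats import dirichlet

# Conjugacy in action: a Dirichlet(alpha) prior combined with multinomial
# counts m gives a Dirichlet(alpha + m) posterior, as in (2.41).
alpha = np.array([2.0, 2.0, 2.0])   # assumed prior pseudo-counts
m = np.array([60, 25, 15])          # assumed observed counts, N = 100

alpha_post = alpha + m              # posterior parameters alpha + m

# The posterior mean of mu_k is (alpha_k + m_k) / (alpha_0 + N).
print(alpha_post / alpha_post.sum())
print(dirichlet.mean(alpha_post))   # the same quantity via scipy
```

The pseudo-count interpretation is visible here: the prior behaves exactly as if a few extra observations of each state had already been seen before the data arrived.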
Figure 2.5  Plots of the Dirichlet distribution over three variables, where the two horizontal axes are coordinates in the plane of the simplex and the vertical axis corresponds to the value of the density. Here {αk} = 0.1 in the left plot, {αk} = 1 in the centre plot, and {αk} = 10 in the right plot.

2.3. The Gaussian Distribution

The Gaussian, also known as the normal distribution, is a widely used model for the distribution of continuous variables. In the case of a single variable x, the Gaussian distribution can be written in the form

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}\tag{2.42}$$

where µ is the mean and σ² is the variance. For a D-dimensional vector x, the multivariate Gaussian distribution takes the form

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right\}\tag{2.43}$$

where µ is a D-dimensional mean vector, Σ is a D × D covariance matrix, and |Σ| denotes the determinant of Σ (Section 1.6, Exercise 2.14).

The Gaussian distribution arises in many different contexts and can be motivated from a variety of different perspectives. For example, we have already seen that for a single real variable, the distribution that maximizes the entropy is the Gaussian. This property applies also to the multivariate Gaussian.

Another situation in which the Gaussian distribution arises is when we consider the sum of multiple random variables. The central limit theorem (due to Laplace) tells us that, subject to certain mild conditions, the sum of a set of random variables, which is of course itself a random variable, has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases (Walker, 1969).
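As an illustration not found in the original text, the density (2.43) can be transcribed directly into Python/NumPy; the mean, covariance, and evaluation point below are arbitrary choices:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Direct transcription of the multivariate Gaussian density (2.43)."""
    D = mu.shape[0]
    diff = x - mu
    # Solving Sigma z = diff avoids forming Sigma^{-1} explicitly.
    maha_sq = diff @ np.linalg.solve(Sigma, diff)   # quadratic form in the exponent
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * maha_sq)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf(np.array([0.5, -0.5]), mu, Sigma))
# Agrees with scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(...)
```

In practice one would work with log densities and a Cholesky factorization of Σ for numerical stability, but the direct form above stays closest to (2.43).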
Figure 2.6  Histogram plots of the mean of N uniformly distributed numbers for various values of N (N = 1, N = 2, and N = 10). We observe that as N increases, the distribution tends towards a Gaussian.

We can illustrate the central limit theorem by considering N variables x1, …, xN, each of which has a uniform distribution over the interval [0, 1], and then considering the distribution of the mean (x1 + ⋯ + xN)/N. For large N, this distribution tends to a Gaussian, as illustrated in Figure 2.6. In practice, the convergence to a Gaussian as N increases can be very rapid. One consequence of this result is that the binomial distribution (2.9), which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞ (see Figure 2.1 for the case of N = 10).
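The experiment behind Figure 2.6 is easy to reproduce. The following sketch is an editorial addition (the trial count and seed are arbitrary); it checks that the sample mean of N uniform variables has the mean 1/2 and variance 1/(12N) predicted for the limiting Gaussian, since a uniform [0, 1] variable has variance 1/12:

```python
import numpy as np

# Reproduce the experiment of Figure 2.6: the mean of N uniform [0, 1]
# variables approaches a Gaussian with mean 1/2 and variance 1/(12N).
rng = np.random.default_rng(0)   # arbitrary seed
trials = 100_000                 # arbitrary number of repetitions
for N in (1, 2, 10):
    means = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)
    print(N, means.mean(), means.var(), 1 / (12 * N))
```

Plotting histograms of `means` for each N reproduces the three panels of Figure 2.6.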
The Gaussian distribution has many important analytical properties, and we shall consider several of these in detail. As a result, this section will be rather more technically involved than some of the earlier sections and will require familiarity with various matrix identities (Appendix C). However, we strongly encourage the reader to become proficient in manipulating Gaussian distributions using the techniques presented here, as this will prove invaluable in understanding the more complex models presented in later chapters.

Carl Friedrich Gauss (1777–1855). It is said that when Gauss went to elementary school at age 7, his teacher Büttner, trying to keep the class occupied, asked the pupils to sum the integers from 1 to 100. To the teacher’s amazement, Gauss arrived at the answer in a matter of moments by noting that the sum can be represented as 50 pairs (1 + 100, 2 + 99, etc.), each of which added to 101, giving the answer 5,050. It is now believed that the problem which was actually set was of the same form but somewhat harder, in that the sequence had a larger starting value and a larger increment. Gauss was a German mathematician and scientist with a reputation for being a hard-working perfectionist. One of his many contributions was to show that least squares can be derived under the assumption of normally distributed errors. He also created an early formulation of non-Euclidean geometry (a self-consistent geometrical theory that violates the axioms of Euclid) but was reluctant to discuss it openly for fear that his reputation might suffer if it were seen that he believed in such a geometry. At one point, Gauss was asked to conduct a geodetic survey of the state of Hanover, which led to his formulation of the normal distribution, now also known as the Gaussian. After his death, a study of his diaries revealed that he had discovered several important mathematical results years or even decades before they were published by others.

We begin by considering the geometrical form of the Gaussian distribution.
The functional dependence of the Gaussian on x is through the quadratic form

$$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\tag{2.44}$$

which appears in the exponent. The quantity Δ is called the Mahalanobis distance from µ to x and reduces to the Euclidean distance when Σ is the identity matrix (Exercise 2.17). The Gaussian distribution will be constant on surfaces in x-space for which this quadratic form is constant.

First of all, we note that the matrix Σ can be taken to be symmetric, without loss of generality, because any antisymmetric component would disappear from the exponent (Exercise 2.18). Now consider the eigenvector equation for the covariance matrix

$$\boldsymbol{\Sigma} \mathbf{u}_i = \lambda_i \mathbf{u}_i\tag{2.45}$$

where i = 1, …, D. Because Σ is a real, symmetric matrix its eigenvalues will be real, and its eigenvectors can be chosen to form an orthonormal set (Exercise 2.19), so that

$$\mathbf{u}_i^{\mathrm{T}} \mathbf{u}_j = I_{ij}\tag{2.46}$$

where Iij is the i, j element of the identity matrix and satisfies

$$I_{ij} = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise.} \end{cases}\tag{2.47}$$

The covariance matrix Σ can be expressed as an expansion in terms of its eigenvectors in the form

$$\boldsymbol{\Sigma} = \sum_{i=1}^{D} \lambda_i \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}}\tag{2.48}$$

and similarly the inverse covariance matrix Σ⁻¹ can be expressed as

$$\boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{D} \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^{\mathrm{T}}.\tag{2.49}$$

Substituting (2.49) into (2.44), the quadratic form becomes

$$\Delta^2 = \sum_{i=1}^{D} \frac{y_i^2}{\lambda_i}\tag{2.50}$$

where we have defined

$$y_i = \mathbf{u}_i^{\mathrm{T}} (\mathbf{x} - \boldsymbol{\mu}).\tag{2.51}$$

We can interpret {yi} as a new coordinate system defined by the orthonormal vectors ui that are shifted and rotated with respect to the original xi coordinates. Forming the vector y = (y1, …, yD)ᵀ, we have

$$\mathbf{y} = \mathbf{U}(\mathbf{x} - \boldsymbol{\mu})\tag{2.52}$$
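As a closing illustration added for this section (not from the original text), the eigendecomposition route (2.45)–(2.52) to the quadratic form can be checked numerically. Here U is the matrix whose rows are the uiᵀ, as implied by (2.51), and the random matrices are arbitrary test data:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)   # an arbitrary symmetric positive-definite Sigma
mu = rng.standard_normal(3)
x = rng.standard_normal(3)

# Eigenvector equation (2.45): the columns of V are the orthonormal u_i,
# so V.T plays the role of the matrix U in (2.52).
lam, V = np.linalg.eigh(Sigma)

y = V.T @ (x - mu)                  # y_i = u_i^T (x - mu), equations (2.51)-(2.52)
delta_sq_eig = np.sum(y**2 / lam)   # equation (2.50)

# Direct evaluation of the Mahalanobis form (2.44) for comparison.
delta_sq_direct = (x - mu) @ np.linalg.solve(Sigma, x - mu)
print(np.isclose(delta_sq_eig, delta_sq_direct))   # True
```

The agreement of the two values confirms that, in the rotated coordinates yi, the quadratic form decouples into a sum of one-dimensional terms weighted by the inverse eigenvalues.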