


Visual Media Coding and Transmission

Ahmet Kondoz

© 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-74057-6

Visual Media Coding and Transmission

Ahmet Kondoz
Centre for Communication Systems Research, University of Surrey, UK

This edition first published 2009
© 2009 John Wiley & Sons Ltd.

Registered office
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

For details of our global editorial offices, for customer services and for information about how to apply for permission to reuse the copyright material in this book please see our website at www.wiley.com.

The right of the author to be identified as the author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Designations used by companies to distinguish their products are often claimed as trademarks. All brand names and product names used in this book are trade names, service marks, trademarks or registered trademarks of their respective owners. The publisher is not associated with any product or vendor mentioned in this book.

This publication is designed to provide accurate and authoritative information in regard to the subject matter covered. It is sold on the understanding that the publisher is not engaged in rendering professional services. If professional advice or other expert assistance is required, the services of a competent professional should be sought.

© 1998, 2001, 2002, 2003, 2004. 3GPP™ TSs and TRs are the property of ARIB, ATIS, CCSA, ETSI, TTA and TTC who jointly own the copyright in them. They are subject to further modifications and are therefore provided to you 'as is' for information purposes only. Further use is strictly prohibited.

Library of Congress Cataloging-in-Publication Data

Kondoz, A. M. (Ahmet M.)
Visual media coding and transmission / Ahmet Kondoz.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-470-74057-6 (cloth)
1. Multimedia communications. 2. Video compression. 3. Coding theory. 4. Data transmission systems. I. Title.
TK5105.15.K65 2009
621.382'1 dc22
2008047067

A catalogue record for this book is available from the British Library.

ISBN 9780470740576 (H/B)

Set in 10/12pt Times New Roman by Thomson Digital, Noida, India.
Printed in Great Britain by CPI Antony Rowe, Chippenham, England

Contents

VISNET II Researchers  xiii
Preface  xv
Glossary of Abbreviations  xvii

1 Introduction  1

2 Video Coding Principles  7
2.1 Introduction  7
2.2 Redundancy in Video Signals  7
2.3 Fundamentals of Video Compression  8
2.3.1 Video Signal Representation and Picture Structure  8
2.3.2 Removing Spatial Redundancy  9
2.3.3 Removing Temporal Redundancy  14
2.3.4 Basic Video Codec Structure  16
2.4 Advanced Video Compression Techniques  17
2.4.1 Frame Types  17
2.4.2 MC Accuracy  19
2.4.3 MB Mode Selection  20
2.4.4 Integer Transform  21
2.4.5 Intra Prediction  22
2.4.6 Deblocking Filters  22
2.4.7 Multiple Reference Frames and Hierarchical Coding  24
2.4.8 Error-Robust Video Coding  24
2.5 Video Codec Standards  28
2.5.1 Standardization Bodies  28
2.5.2 ITU Standards  29
2.5.3 MPEG Standards  29
2.5.4 H.264/MPEG-4 AVC  31
2.6 Assessment of Video Quality  31
2.6.1 Subjective Performance Evaluation  31
2.6.2 Objective Performance Evaluation  32
2.7 Conclusions  35
References  36

3 Scalable Video Coding  39
3.1 Introduction  39
3.1.1 Applications and Scenarios  40
3.2 Overview of the State of the Art  41
3.2.1 Scalable Coding Techniques  42
3.2.2 Multiple Description Coding  45
3.2.3 Stereoscopic 3D Video Coding  47
3.3 Scalable Video Coding Techniques  48
3.3.1 Scalable Coding for Shape, Texture, and Depth for 3D Video  48
3.3.2 3D Wavelet Coding  68
3.4 Error Robustness for Scalable Video and Image Coding  74
3.4.1 Correlated Frames for Error Robustness  74
3.4.2 Odd-Even Frame Multiple Description Coding for Scalable H.264/AVC  82
3.4.3 Wireless JPEG 2000: JPWL  91
3.4.4 JPWL Simulation Results  94
3.4.5 Towards a Theoretical Approach for Optimal Unequal Error Protection  96
3.5 Conclusions  98
References  99

4 Distributed Video Coding  105
4.1 Introduction  105
4.1.1 The Video Codec Complexity Balance  106
4.2 Distributed Source Coding  109
4.2.1 The Slepian-Wolf Theorem  109
4.2.2 The Wyner-Ziv Theorem  110
4.2.3 DVC Codec Architecture  111
4.2.4 Input Bitstream Preparation: Quantization and Bit Plane Extraction  112
4.2.5 Turbo Encoder  112
4.2.6 Parity Bit Puncturer  114
4.2.7 Side Information  114
4.2.8 Turbo Decoder  115
4.2.9 Reconstruction: Inverse Quantization  116
4.2.10 Key Frame Coding  117
4.3 Stopping Criteria for a Feedback Channel-based Transform Domain Wyner-Ziv Video Codec  118
4.3.1 Proposed Technical Solution  118
4.3.2 Performance Evaluation  120
4.4 Rate-distortion Analysis of Motion-compensated Interpolation at the Decoder in Distributed Video Coding  122
4.4.1 Proposed Technical Solution  122
4.4.2 Performance Evaluation  126
4.5 Nonlinear Quantization Technique for Distributed Video Coding  129
4.5.1 Proposed Technical Solution  129
4.5.2 Performance Evaluation  132

4.6 Symmetric Distributed Coding of Stereo Video Sequences  134
4.6.1 Proposed Technical Solution  134
4.6.2 Performance Evaluation  137
4.7 Studying Error-resilience Performance for a Feedback Channel-based Transform Domain Wyner-Ziv Video Codec  139
4.7.1 Proposed Technical Solution  139
4.7.2 Performance Evaluation  140
4.8 Modeling the DVC Decoder for Error-prone Wireless Channels  144
4.8.1 Proposed Technical Solution  145
4.8.2 Performance Evaluation  149
4.9 Error Concealment Using a DVC Approach for Video Streaming Applications  151
4.9.1 Proposed Technical Solution  152
4.9.2 Performance Evaluation  155
4.10 Conclusions  158
References  159

5 Non-normative Video Coding Tools  161
5.1 Introduction  161
5.2 Overview of the State of the Art  162
5.2.1 Rate Control  162
5.2.2 Error Resilience  164
5.3 Rate Control Architecture for Joint MVS Encoding and Transcoding  165
5.3.1 Problem Definition and Objectives  165
5.3.2 Proposed Technical Solution  166
5.3.3 Performance Evaluation  169
5.3.4 Conclusions  171
5.4 Bit Allocation and Buffer Control for MVS Encoding Rate Control  171
5.4.1 Problem Definition and Objectives  171
5.4.2 Proposed Technical Approach  172
5.4.3 Performance Evaluation  177
5.4.4 Conclusions  179
5.5 Optimal Rate Allocation for H.264/AVC Joint MVS Transcoding  179
5.5.1 Problem Definition and Objectives  179
5.5.2 Proposed Technical Solution  180
5.5.3 Performance Evaluation  181
5.5.4 Conclusions  182
5.6 Spatio-temporal Scene-level Error Concealment for Segmented Video  182
5.6.1 Problem Definition and Objectives  182
5.6.2 Proposed Technical Solution  183
5.6.3 Performance Evaluation  187
5.6.4 Conclusions  188
5.7 An Integrated Error-resilient Object-based Video Coding Architecture  189
5.7.1 Problem Definition and Objectives  189
5.7.2 Proposed Technical Solution  189

5.7.3 Performance Evaluation  195
5.7.4 Conclusions  195
5.8 A Robust FMO Scheme for H.264/AVC Video Transcoding  195
5.8.1 Problem Definition and Objectives  195
5.8.2 Proposed Technical Solution  195
5.8.3 Performance Evaluation  197
5.8.4 Conclusions  198
5.9 Conclusions  199
References  199

6 Transform-based Multi-view Video Coding  203
6.1 Introduction  203
6.2 MVC Encoder Complexity Reduction using a Multi-grid Pyramidal Approach  205
6.2.1 Problem Definition and Objectives  205
6.2.2 Proposed Technical Solution  205
6.2.3 Conclusions and Further Work  208
6.3 Inter-view Prediction using Reconstructed Disparity Information  208
6.3.1 Problem Definition and Objectives  208
6.3.2 Proposed Technical Solution  208
6.3.3 Performance Evaluation  210
6.3.4 Conclusions and Further Work  211
6.4 Multi-view Coding via Virtual View Generation  212
6.4.1 Problem Definition and Objectives  212
6.4.2 Proposed Technical Solution  212
6.4.3 Performance Evaluation  215
6.4.4 Conclusions and Further Work  216
6.5 Low-delay Random View Access in Multi-view Coding Using a Bit Rate-adaptive Downsampling Approach  216
6.5.1 Problem Definition and Objectives  216
6.5.2 Proposed Technical Solution  216
6.5.3 Performance Evaluation  219
6.5.4 Conclusions and Further Work  222
References  222

7 Introduction to Multimedia Communications  225
7.1 Introduction  225
7.2 State of the Art: Wireless Multimedia Communications  228
7.2.1 QoS in Wireless Networks  228
7.2.2 Constraints on Wireless Multimedia Communications  231
7.2.3 Multimedia Compression Technologies  234
7.2.4 Multimedia Transmission Issues in Wireless Networks  235
7.2.5 Resource Management Strategy in Wireless Multimedia Communications  239

7.3 Conclusions  244
References  244

8 Wireless Channel Models  247
8.1 Introduction  247
8.2 GPRS/EGPRS Channel Simulator  247
8.2.1 GSM/EDGE Radio Access Network (GERAN)  247
8.2.2 GPRS Physical Link Layer Model Description  250
8.2.3 EGPRS Physical Link Layer Model Description  252
8.2.4 GPRS Physical Link Layer Simulator  256
8.2.5 EGPRS Physical Link Layer Simulator  261
8.2.6 E/GPRS Radio Interface Data Flow Model  268
8.2.7 Real-time GERAN Emulator  270
8.2.8 Conclusion  271
8.3 UMTS Channel Simulator  272
8.3.1 UMTS Terrestrial Radio Access Network (UTRAN)  272
8.3.2 UMTS Physical Link Layer Model Description  279
8.3.3 Model Verification for Forward Link  290
8.3.4 UMTS Physical Link Layer Simulator  298
8.3.5 Performance Enhancement Techniques  307
8.3.6 UMTS Radio Interface Data Flow Model  309
8.3.7 Real-time UTRAN Emulator  312
8.3.8 Conclusion  313
8.4 WiMAX IEEE 802.16e Modeling  316
8.4.1 Introduction  316
8.4.2 WiMAX System Description  317
8.4.3 Physical Layer Simulation Results and Analysis  323
8.4.4 Error Pattern Files Generation  324
8.5 Conclusions  328
8.6 Appendix: Eb/No and DPCH Ec/Io Calculation  329
References  330

9 Enhancement Schemes for Multimedia Transmission over Wireless Networks  333
9.1 Introduction  333
9.1.1 3G Real-time Audiovisual Requirements  335
9.1.2 Video Transmission over Mobile Communication Systems  339
9.1.3 Circuit-switched Bearers  348
9.1.4 Packet-switched Bearers  350
9.1.5 Video Communications over GPRS  351
9.1.6 GPRS Traffic Capacity  354
9.1.7 Error Performance  357
9.1.8 Video Communications over EGPRS  357
9.1.9 Traffic Characteristics  358
9.1.10 Error Performance  359
9.1.11 Voice Communication over Mobile Channels  359

9.1.12 Support of Voice over UMTS Networks  360
9.1.13 Error-free Performance  361
9.1.14 Error-prone Performance  362
9.1.15 Support of Voice over GPRS Networks  362
9.1.16 Conclusion  363
9.2 Link-level Quality Adaptation Techniques  365
9.2.1 Performance Modeling  365
9.2.2 Probability Calculation  367
9.2.3 Distortion Modeling  368
9.2.4 Propagation Loss Modeling  368
9.2.5 Energy-optimized UEP Scheme  369
9.2.6 Simulation Setup  370
9.2.7 Performance Analysis  372
9.2.8 Conclusion  373
9.3 Link Adaptation for Video Services  373
9.3.1 Time-varying Channel Model Design  374
9.3.2 Link Adaptation for Real-time Video Communications  379
9.3.3 Link Adaptation for Streaming Video Communications  389
9.3.4 Link Adaptation for UMTS  396
9.3.5 Conclusion  402
9.4 User-centric Radio Resource Management in UTRAN  403
9.4.1 Enhanced Call-admission Control Scheme  403
9.4.2 Implementation of UTRAN System-level Simulator  403
9.4.3 Performance Evaluation of Enhanced CAC Scheme  410
9.5 Conclusions  411
References  413

10 Quality Optimization for Cross-network Media Communications  417
10.1 Introduction  417
10.2 Generic Inter-networked QoS-optimization Infrastructure  418
10.2.1 State of the Art  418
10.2.2 Generic of QoS for Heterogeneous Networks  420
10.3 Implementation of a QoS-optimized Inter-networked Emulator  422
10.3.1 Emulation System Physical Link Layer Simulation  426
10.3.2 Emulation System Transmitter/Receiver Unit  428
10.3.3 QoS Mapping Architecture  428
10.3.4 General User Interface  438
10.4 Performances of Video Transmission in Inter-networked Systems  442
10.4.1 Experimental Setup  442
10.4.2 Test for the EDGE System  443
10.4.3 Test for the UMTS System  445
10.4.4 Tests for the EDGE-to-UMTS System  445
10.5 Conclusions  452
References  453

11 Context-based Visual Media Content Adaptation  455
11.1 Introduction  455
11.2 Overview of the State of the Art in Context-aware Content Adaptation  457
11.2.1 Recent Developments in Context-aware Systems  457
11.2.2 Standardization Efforts on Contextual Information for Content Adaptation  467
11.3 Other Standardization Efforts by the IETF and W3C  476
11.4 Summary of Standardization Activities  479
11.4.1 Integrating Digital Rights Management (DRM) with Adaptation  480
11.4.2 Existing DRM Initiatives  480
11.4.3 The New "Adaptation Authorization" Concept  481
11.4.4 Adaptation Decision  482
11.4.5 Context-based Content Adaptation  488
11.5 Generation of Contextual Information and Profiling  492
11.5.1 Types and Representations of Contextual Information  492
11.5.2 Context Providers and Profiling  494
11.5.3 User Privacy  497
11.5.4 Generation of Contextual Information  498
11.6 The Application Scenario for Context-based Adaptation of Governed Media Contents  499
11.6.1 Virtual Classroom Application Scenario  500
11.6.2 Mechanisms using Contextual Information in a Virtual Collaboration Application  502
11.6.3 Ontologies in Context-aware Content Adaptation  503
11.6.4 System Architecture of a Scalable Platform for Context-aware and DRM-enabled Content Adaptation  504
11.6.5 Context Providers  507
11.6.6 Adaptation Decision Engine  510
11.6.7 Adaptation Authorization  514
11.6.8 Adaptation Engines Stack  517
11.6.9 Interfaces between Modules of the Content Adaptation Platform  544
11.7 Conclusions  552
References  553

Index  559



VISNET II Researchers

UniS Philipp Fechteler Ricardo Martins Omar Abdul-Hameed Ingo Feldmann Zaheer Ahmad Jens Güther UPC-TSC Hemantha Kodikara Arachchi Karsten Grüneberg Pere Joaquim Mindan Murat Badem Oliver Schreer Jose Luis Valenzuela Janko Calic Ralf Tanger Toni Rama Safak Dogan Luis Torres Erhan Ekmekcioglu EPFL Francesc Tarres Anil Fernando Touradj Ebrahimi Christine Glaser Frederic Dufaux UPC-AC Banu Gunel Thien Ha-Minh Jaime Delgado Huseyin Hacihabiboglu Michael Ansorge Eva Rodríguez Hezerul Abdul Karim Shuiming Ye Anna Carreras Ahmet Kondoz Yannick Maret Ruben Tous Yingdong Ma David Marimon Marta Mrak Ulrich Hoffmann TRT-UK Sabih Nasir Mourad Ouaret Chris Firth Gokce Nur Francesca De Simone Tim Masterton Surachai Ongkittikul Carlos Bandeirinha Adrian Waller Kan Ren Peter Vajda Darren Price Daniel Rodriguez Ashkan Yazdani Rachel Craddock Amy Tan Gelareh Mohammadi Marcello Goccia Eeriwarawe Thushara Alessandro Tortelli Ian Mockford Halil Uzuner Luca Bonardi Hamid Asgari Stephane Villette Davide Forzati Charlie Attwood Rajitha Weerakkody Peter de Waard Stewart Worrall IST Jonathan Dennis Lasith Yasakethu Fernando Pereira Doug Watson João Ascenso Val Millington HHI Catarina Brites Andy Vooght Peter Eisert Luis Ducla Soares Jürgen Rurainsky Paulo Nunes TUB Anna Hilsmann Paulo Correia Thomas Sikora Benjamin Prestele Jose Diogo Areia Zouhair Belkoura David Schneider Jose Quintas Pedro Juan Jose Burred

Giorgio Prandi Riva Davide Michael Droese Francesco Santagata Andrzej Pietrasiewicz Ronald Glasberg Marco Tagliasacchi Adam Pietrowcew Lutz Goldmann Stefano Tubaro Sławomir Rymaszewski Shan Jin Giuseppe Valenzise Radosław Sikora Mustafa Karaman Władysław Skarbek Andreas Krutz IPW Marek Sutkowski Amjad Samour Stanisław Badura Michał Tomaszewski Lilla Bagińska Karol Wnukowicz TiLab Jarosław Baszun Giovanni Cordara Filip Borowski INESC Porto Gianluca Francini Andrzej Buchowicz Giorgiana Ciobanu Skjalg Lepsoy Emil Dmoch Filipe Sousa Diego Gibellino Edyta Dąbrowska Jaime Cardoso Grzegorz Galiński Jaime Dias UPF Piotr Garbat Jorge Mamede Enric Peig Krystian Ignasiak Jose Ruela Víctor Torres Mariusz Jakubowski Luís Corte-Real Xavier Perramon Mariusz Leszczyński Luís Gustavo Martins Marcin Morgoś Luís Filipe Teixeira PoliMi Jacek Naruniec Maria Teresa Andrade Fabio Antonacci Artur Nowakowski Pedro Carvalho Calatroni Alberto Adam Ołdak Ricardo Duarte Marco Marcon Grzegorz Pastuszak Vítor Barbosa Matteo Naccari Davide Onofrio

Preface

VISNET II is a European Union Network of Excellence (NoE) in the 6th Framework Programme, which brings together 12 leading European organizations in the field of Networked Audiovisual Media Technologies. The consortium consists of organizations with a proven track record and strong national and international reputations in audiovisual information technologies. VISNET II integrates over 100 researchers who have made significant contributions to this field of technology through standardization activities, international publications, conference and workshop activities, patents, and many other prestigious achievements. The 12 integrated organizations represent 7 European states spanning a major part of Europe, thereby promising efficient dissemination and exploitation of the resulting technological developments to larger communities.

This book contains some of the research output of VISNET II in the area of advanced video coding and networking. The book contains details of video coding principles, which lead to advanced video coding developments in the form of scalable coding, distributed video coding, non-normative video coding tools, and transform-based multi-view coding. Having detailed the latest work in visual media coding, the networking aspects of video communication are presented in the second part of the book. Various wireless channel models are presented, to form the basis for the following chapters. Both link-level quality of service (QoS) and cross-network transmission of compressed visual data are considered. Finally, context-based visual media content adaptation is discussed with some examples.

It is hoped that this book will be used as a reference not only for some of the advanced video coding techniques, but also for the transmission of video across various wireless systems with well-defined channel models.

Ahmet Kondoz
University of Surrey
VISNET II Coordinator



Glossary of Abbreviations

3GPP  3rd Generation Partnership Project
AA  Adaptation Authorizer
ADE  Adaptation Decision Engine
ADMITS  Adaptation in Distributed Multimedia IT Systems
ADTE  Adaptation Decision Taking Engine
AE  Adaptation Engine
AES  Adaptation Engine Stack
AIR  Adaptive Intra Refresh
API  Application Programming Interface
AQoS  Adaptation Quality of Service
ASC  Aspect-Scale-Context
AV  Audiovisual
AVC  Advanced Video Coding
BLER  Block Error Rate
BSD  Bitstream Syntax Description
BSDL  Bitstream Syntax Description Language
CC  Convolutional Coding
CC  Creative Commons
CC/PP  Composite Capabilities/Preferences Profile
CD  Coefficient Dropping
CDN  Content Distribution Networks
CIF  Common Intermediate Format
CoBrA  Context Broker Architecture
CoDAMoS  Context-Driven Adaptation of Mobile Services
CoOL  Context Ontology Language
CoGITO  Context Gatherer, Interpreter and Transformer using Ontologies
CPU  Central Processing Unit
CROSLOCIS  Creation of Smart Local City Services
CS/H.264/AVC  Cropping and Scaling of H.264/AVC Encoded Video
CxP  Context Provider

DAML  DARPA Agent Markup Language
DANAE  Dynamic and distributed Adaptation of scalable multimedia content in a context-Aware Environment
dB  Decibel
DB  Database
DCT  Discrete Cosine Transform
DI  Digital Item
DIA  Digital Item Adaptation
DID  Digital Item Declaration
DIDL  Digital Item Declaration Language
DIP  Digital Item Processing
DistriNet  Distributed Systems and Computer Networks
DPRL  Digital Property Rights Language
DRM  Digital Rights Management
DS  Description Schemes
EC  European Community
EIMS  ENTHRONE Integrated Management Supervisor
FA  Frame Adaptor
FD  Frame Dropping
FMO  Flexible Macroblock Ordering
FP  Framework Program
gBS  Generic Bitstream Syntax
HCI  Human Computer Interface
HDTV  High-Definition Television
HP  Hewlett Packard
HTML  HyperText Markup Language
IEC  International Electrotechnical Commission
IETF  Internet Engineering Task Force
IBM  International Business Machines Corporation
iCAP  Internet Content Adaptation Protocol
IPR  Intellectual Property Rights
IROI  Interactive Region of Interest
ISO  International Organization for Standardization
IST  Information Society Technologies
ITEC  Department of Information Technology, Klagenfurt University
JPEG  Joint Photographic Experts Group
JSVM  Joint Scalable Video Model
MB  Macroblock

MDS  Multimedia Description Schemes
MIT  Massachusetts Institute of Technology
MOS  Mean Opinion Score
MP3  Moving Picture Experts Group Layer-3 Audio (audio file format/extension)
MPEG  Motion Picture Experts Group
MVP  Motion Vector Predictor
NAL  Network Abstraction Layer
NALU  Network Abstraction Layer Unit
NoE  Network of Excellence
ODRL  Open Digital Rights Language
OIL  Ontology Interchange Language
OMA  Open Mobile Alliance
OSCRA  Optimized Source and Channel Rate Allocation
OWL  Web Ontology Language
P2P  Peer-to-Peer
PDA  Personal Digital Assistant
PSNR  Peak Signal-to-Noise Ratio
QCIF  Quarter Common Intermediate Format
QoS  Quality of Service
QP  Quantization Parameter
RD  Rate Distortion
RDF  Resource Description Framework
RDB  Reference Data Base
RDD  Rights Data Dictionary
RDOPT  Rate Distortion Optimization
REL  Rights Expression Language
ROI  Region of Interest
SECAS  Simple Environment for Context-Aware Systems
SNR  Signal-to-Noise Ratio
SOAP  Simple Object Access Protocol
SOCAM  Service-Oriented Context-Aware Middleware
SVC  Scalable Video Coding
TM5  Test Model 5
UaProf  User Agent Profile
UCD  Universal Constraints Descriptor
UED  Usage Environment Descriptions
UEP  Unequal Error Protection
UF  Utility Function

UI  User Item
UMA  Universal Multimedia Access
UMTS  Universal Mobile Telecommunications System
URI  Uniform Resource Identifier
UTRAN  UMTS Terrestrial Radio Access Network
VCS  Virtual Collaboration System
VoD  Video on Demand
VOP  Video Object Plane
VQM  Video Quality Metric
W3C  World Wide Web Consortium
WAP  Wireless Application Protocol
WCDMA  Wideband Code Division Multiple Access
WDP  Wireless Datagram Protocol
WLAN  Wireless Local Area Network
WML  Website Meta Language
WiFi  Wireless Fidelity (IEEE 802.11b Wireless Networking)
XML  eXtensible Markup Language
XrML  eXtensible rights Markup Language
XSLT  eXtensible Stylesheet Language Transformations

1 Introduction

Networked Audio-Visual Technologies form the basis for the multimedia communication systems that we currently use. The communication systems that must be supported are diverse, ranging from fixed wired to mobile wireless systems. In order to enable an efficient and cost-effective Networked Audio-Visual System, two major technological areas need to be investigated: first, how to process the content for transmission purposes, which involves various media compression processes; and second, how to transport it over the diverse network technologies that are currently in use or will be deployed in the near future. In this book, therefore, visual data compression schemes are presented first, followed by a description of various media transmission aspects, including various channel models, and content and link adaptation techniques.

Raw digital video signals are very large in size, making it very difficult to transmit or store them. Video compression techniques are therefore essential enabling technologies for digital multimedia applications. Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. The MPEG-x and H.26x video coding standards adopt a hybrid coding approach, employing block-matching motion estimation/compensation, in addition to the discrete cosine transform (DCT) and quantization. The reasons are: first, a significant proportion of the motion trajectories found in natural video can be approximately described with a rigid translational motion model; second, fewer bits are required to describe simple translational motion; and finally, the implementation is relatively straightforward and amenable to hardware solutions. These hybrid video systems have provided interoperability in heterogeneous network systems. Considering that transmission bandwidth is still a valuable commodity, ongoing developments in video coding seek scalability solutions to achieve a one-coding, multiple-decoding feature. To this end, the Joint Video Team of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) has standardized a scalability extension to the existing H.264/AVC codec. The H.264-based Scalable Video Coding (SVC) allows partial transmission and decoding of the bit stream, resulting in various options in terms of picture quality and spatial-temporal resolutions.

In this book, several advanced features and techniques relating to scalable video coding are further described, mostly to do with 3D scalable video coding applications. Applications and scenarios for the scalable coding systems, advances in scalable video coding for 3D video applications, and a non-standardized scalable 2D model-based video coding scheme applied to the texture and depth coding of 3D video are all discussed. A scalable, multiple description coding (MDC) application for stereoscopic 3D video is detailed. Multi-view coding and distributed video coding concepts, representing the latest advancements in video coding, are also covered in significant depth.

The definition of video coding standards is of the utmost importance because it guarantees that video coding equipment from different manufacturers will be able to interoperate. However, the definition of a standard also represents a significant constraint for manufacturers because it limits what they can do. Therefore, in order to minimize the restrictions imposed on manufacturers, only those tools that are essential for interoperability are typically specified in the standard: the normative tools. The remaining tools, which are not standardized but are also important in video coding systems, are referred to as non-normative tools, and this is where competition and evolution of the technology have been taking place. In fact, this strategy of specifying only the bare minimum that can guarantee interoperability ensures that the latest developments in the area of non-normative tools can be easily incorporated in video codecs without compromising their standard compatibility, even after the standard has been finalized. In addition, this strategy makes it possible for manufacturers to compete against each other and to distinguish between their products in the market. A significant amount of research effort is being devoted to the development of non-normative video coding tools, with the target of improving the performance of standard video codecs. In particular, due to their importance, rate control and error resilience non-normative tools are being researched. In this book, therefore, the development of efficient tools for the modules that are non-normative in video coding standards, such as rate control and error concealment, is discussed. For example, multiple video sequence (MVS) joint rate control addresses the development of rate control solutions for encoding video scenes formed from a composition of video objects (VOs), such as in the MPEG-4 standard, and can also be applied to the joint encoding and transcoding of multiple video sequences (VSs) to be transmitted over bandwidth-limited channels using the H.264/AVC standard.

The goal of wireless communication is to allow a user to access required services at any time with no regard to location or mobility. Recent developments in wireless communications, multimedia technologies, and microelectronics technologies have created a new paradigm in mobile communications. Third/fourth-generation (3G/4G) wireless communication technologies provide significantly higher transmission rates and service flexibility over a wide coverage area, as compared with second-generation (2G) wireless communication systems. High-compression, error-robust multimedia codecs have been designed to enable the support of multimedia applications over error-prone bandwidth-limited channels. The advances of VLSI and DSP technologies are enabling lightweight, low-cost, portable devices capable of transmitting and viewing multimedia streams. The above technological developments have shifted the service requirements of mobile communication from conventional voice telephony to business- and entertainment-oriented multimedia services in wireless communication systems.
In order to successfully meet the challenges set by current and future audiovisual communication requirements, the International Telecommunication Union Radiocommunication Sector (ITU-R) has elaborated a framework for global 3G standards by recognizing a limited number of radio access technologies. These are: Universal Mobile Telecommunications System (UMTS), Enhanced Data rates for GSM Evolution (EDGE), and CDMA2000. UMTS is based on Wideband CDMA technology and is employed in Europe and Asia using the frequency band around 2 GHz. EDGE is based on TDMA technology and uses the same air interface as the successful 2G mobile system GSM. General Packet Radio Service (GPRS) and High-Speed Circuit Switched Data (HSCSD) are introduced by Phase 2+ of the GSM standardization process. They support enhanced services with data rates up to 144 kbps in the packet-switched and circuit-switched domains, respectively. EDGE, which is the evolution of GPRS and HSCSD, provides 3G services up to 500 kbps within the GSM carrier spacing of 200 kHz. CDMA2000 is based on multi-carrier CDMA technology and provides the upgraded solution for existing IS-95 operators, mainly in North America. EDGE and UMTS are the most widely accepted 3G radio access technologies. They are standardized by the 3rd Generation Partnership Project (3GPP). Even though EDGE and UMTS are based on two different multiple-access technologies, both systems share the same core network. The evolved GSM core network serves as a common GSM/UMTS core network that supports GSM/GPRS/EDGE and UMTS access.

In addition, Wireless Local Area Networks (WLANs) are becoming more and more popular for communication at homes, offices and indoor public areas such as campus environments, airports, hotels, shopping centres and so on. IEEE 802.11 has a number of physical layer specifications with a common MAC operation. IEEE 802.11 includes two physical layers, a frequency-hopping spread-spectrum (FHSS) physical layer and a direct-sequence spread-spectrum (DSSS) physical layer, and operates at 2 Mbps. The currently deployed IEEE 802.11b standard provides an additional physical layer based on high-rate direct-sequence spread spectrum (HR/DSSS). It operates in the 2.4 GHz unlicensed band and provides bit rates up to 11 Mbps. The IEEE 802.11a standard for the 5 GHz band provides high bit rates up to 54 Mbps and uses a physical layer based on orthogonal frequency division multiplexing (OFDM). Recently, the IEEE 802.11g standard has also been issued to achieve such high bit rates in the 2.4 GHz band.

The Worldwide Interoperability for Microwave Access (WiMAX) is a telecommunications technology aimed at providing wireless data over long distances in different ways, from point-to-point links to full mobile cellular access. It is based on the IEEE 802.16 standard, which is also called WirelessMAN. The name WiMAX was created by the WiMAX Forum, which was formed in June 2001 to promote conformance and interoperability of the standard. The forum describes WiMAX as "a standards-based technology enabling the delivery of last mile wireless broadband access as an alternative to cable and DSL". Mobile WiMAX IEEE 802.16e provides fixed, nomadic and mobile broadband wireless access systems with superior throughput performance. It enables non-line-of-sight reception, and can also cope with high mobility of the receiving station. IEEE 802.16e enables nomadic capabilities for laptops and other mobile devices, allowing users to benefit from metro area portability of an xDSL-like service.

Multimedia services by definition require the transmission of multiple media streams, such as video, still picture, music, voice, and text data. A combination of these media types provides a number of value-added services, including video telephony, e-commerce services, multi-party video conferencing, virtual office, and 3D video. 3D video, for example, provides more natural and immersive visual information to end users than standard 2D video.
In the near future, certain 2D video application scenarios are likely to be replaced by 3D video in order to achieve a more involving and immersive representation of visual information and to provide more natural methods of communication. 3D video transmission, however, requires more resources than conventional video communication applications. Different media types have different quality-of-service (QoS) requirements and enforce conflicting constraints on the communication networks. Still picture and text data are categorized as background services and require high data rates but have no constraints on the transmission delay. Voice services, on the other hand, are characterized by low delay. However, they can be coded using fixed low-rate algorithms operating in the 5 to 24 kbps range. In contrast to voice and data services, low-bit-rate video coding involves rates of tens to hundreds of kbps. Moreover, video applications are delay sensitive and impose tight constraints on system resources. Mobile multimedia applications, consisting of multiple signal types, play an important role in the rapid penetration of future communication services and the success of these communication systems.

Even though the high transmission rates and service flexibility have made wireless multimedia communication possible over 3G/4G wireless communication systems, many challenges remain to be addressed in order to support efficient communications in multi-user, multi-service environments. In addition to the high initial cost associated with the deployment of 3G systems, the move from telephony and low-bit-rate data services to bandwidth-consuming 3G services implies high system costs, as these consume a large portion of the available resources. However, for rapid market evolvement, these wideband services should not be substantially more expensive than the services offered today. Therefore, efficient system resource (mainly the bandwidth-limited radio resource) utilization and QoS management are critical in 3G/4G systems.

Efficient resource management and the provision of QoS for multimedia applications are in sharp conflict with one another. Of course, it is possible to provide high-quality multimedia services by using a large amount of radio resources and very strong channel protection. However, this is clearly inefficient in terms of system resource allocation. Moreover, the perceptual multimedia quality received by end users depends on many factors, such as source rate, channel protection, channel quality, error resilience techniques, transmission/processing power, system load, and user interference. Therefore, it is difficult to obtain an optimal source and network parameter combination for a given set of source and channel characteristics. The time-varying error characteristics of the radio access channel aggravate the problem. In this book, therefore, various QoS-based resource management systems are detailed. For comparison and validation purposes, a number of wireless channel models are described. The key QoS improvement techniques, including content and link-adaptation techniques, are covered.

The future media Internet will allow new applications with support for ubiquitous media-rich content service technologies to be realized. Virtual collaboration, extended home platforms, augmented, mixed and virtual realities, gaming, telemedicine, e-learning and so on, in which users with possibly diverse geographical locations, terminal types, connectivity, usage environments, and preferences access and exchange pervasive yet protected and trusted content, are just a few examples. These multiple forms of diversity require content to be transported and rendered in different forms, which necessitates the use of context-aware content adaptation. This avoids the alternative of predicting, generating and storing all the different forms required for every item of content. Therefore, there is a growing need for devising adequate concepts and functionalities of a context-aware content adaptation platform that suits the requirements of such multimedia application scenarios.
4 Visual Media Coding and Transmission the transmission delay. Voice services, on the other hand, are characterized by low delay. However, they can be coded using fixed low-rate algorithms operating in the 5 24 kbps range. In contrast to voice and data services, low-bit-rate video coding involves rates at tens to hundreds of kbps. Moreover, video applications are delay sensitive and impose tight constraints on system resources. Mobile multimedia applications, consisting of multiple signal types, play an important role in the rapid penetration of future communication services and the success of these communication systems. Even though the high transmission rates and service flexibility have made wireless multimedia communication possible over 3G/4G wireless communication systems, many challenges remain to be addressed in order to support efficient communications in multi-user, multi-service environments. In addition to the high initial cost associated with the deployment of 3G systems, the move from telephony and low-bit-rate data services to bandwidth-consuming 3G services implies high system costs, as these consume a large portion of the available resources. However, for rapid market evolvement, these wideband services should not be substantially more expensive than the services offered today. Therefore, efficient system resource (mainly the bandwidth-limited radio resource) utilization and QoS management are critical in 3G/4G systems. Efficient resource management and the provision of QoS for multimedia applications are in sharp conflict with one another. Of course, it is possible to provide high-quality multimedia services by using a large amount of radio resources and very strong channel protection. However, this is clearly inefficient in terms of system resource allocation. Moreover, the perceptual multimedia quality received by end users depends on many factors, such as source rate, channel protection, channel quality, error resilience techniques, transmission/processing power, system load, and user interference. Therefore, it is difficult to obtain an optimal source and network parameter combination for a given set of source and channel characteristics. The time-varying error characteristics of the radio access channel aggravate the problem. In this book, therefore, various QoS-based resource management systems are detailed. For compari- son and validation purposes, a number of wireless channel models are described. The key QoS improvement techniques, including content and link-adaptation techniques, are covered. Future media Internet will allow new applications with support for ubiquitous media-rich content service technologies to be realized. Virtual collaboration, extended home platforms, augmented, mixed and virtual realities, gaming, telemedicine, e-learning and so on, in which users with possibly diverse geographical locations, terminal types, connectivity, usage environments, and preferences access and exchange pervasive yet protected and trusted content, are just a few examples. These multiple forms of diversity requires content to be transported and rendered in different forms, which necessitates the use of context-aware content adaptation. This avoids the alternative of predicting, generating and storing all the different forms required for every item of content. Therefore, there is a growing need for devising adequate concepts and functionalities of a context-aware content adaptation platform that suits the requirements of such multimedia application scenarios. 
This platform needs to be able to consume low-level contextual information to infer higher-level contexts, and thus decide the need for and type of adaptation operations to be performed upon the content. In this way, usage constraints can be met while the restrictions imposed by the Digital Rights Management (DRM) governing the use of protected content are satisfied. In this book, comprehensive discussions are presented on the use of contextual information in adaptation decision operations, with a view to managing the DRM and the authorization for adaptation, consequently outlining the appropriate adaptation decision techniques and adaptation mechanisms. The main challenges lie in identifying integrated tools and systems that support adaptive, context-aware and distributed applications which react to the characteristics and conditions of the usage environment and provide transparent access and delivery of content, where digital rights are adequately managed.

The discussions focus on describing a scalable platform for context-aware and DRM-enabled adaptation of multimedia content. The platform has a modular architecture to ensure scalability, and well-defined interfaces based on open standards for interoperability as well as portability. The modules are classified into four categories, namely:

1. Adaptation Decision Engine (ADE);
2. Adaptation Authorizer (AA);
3. Context Providers (CxPs); and
4. Adaptation Engine Stacks (AESs), which comprise Adaptation Engines (AEs).

During the adaptation decision-taking stage the platform uses ontologies to enable semantic description of real-world situations. The decision-taking process is triggered by low-level contextual information and driven by rules provided by the ontologies. It supports a variety of adaptations, which can be dynamically configured. The overall objective of this platform is to enable the efficient gathering and use of context information, ultimately in order to build content adaptation applications that maximize user satisfaction.



2 Video Coding Principles

2.1 Introduction

Raw digital video signals are very large in size, making it very difficult to transmit or store them. Video compression techniques are therefore essential enabling technologies for digital multimedia applications. Since 1984, a wide range of digital video codecs have been standardized, each of which represents a step forward either in terms of compression efficiency or in functionality. This chapter describes the basic principles behind most standard block-based video codecs currently in use. It begins with a discussion of the types of redundancy present in most video signals (Section 2.2) and proceeds to describe some basic techniques for removing such redundancies (Section 2.3). Section 2.4 investigates enhancements to the basic techniques which have been used in recent video coding standards to provide improvements in video quality. This section also discusses the effects of communication channel errors on decoded video quality. Section 2.5 provides a summary of the available video coding standards and describes some of the key differences between them. Section 2.6 gives an overview of how video quality can be assessed. It includes a description of objective and subjective assessment techniques.

2.2 Redundancy in Video Signals

Compression techniques are generally based upon removal of redundancy in the original signal. In video signals, the redundancy can be classified as spatial, temporal, or source-coding. Most standard video codecs attempt to remove these types of redundancy, taking into account certain properties of the human visual system.

Spatial redundancy is present in areas of images or video frames where pixel values vary by small amounts. In the image shown in Figure 2.1, spatial redundancy is present in parts of the background, and in skin areas such as the shoulder.

Temporal redundancy is present in video signals when there is significant similarity between successive video frames. Figure 2.2 shows two successive frames from a video sequence. It is clear that the difference between the two frames is small, indicating that it would be inefficient to simply compress a video signal as a series of images.

[Figure 2.1 Spatial redundancy is present in areas of an image or video frame where the pixel values are very similar]

Source-coding redundancy is present if the symbols produced by the video codec are inefficiently mapped to a binary bitstream. Typically, entropy coding techniques are used to exploit the statistics of the output video data, where some symbols occur with greater probability than others.

2.3 Fundamentals of Video Compression

This section describes how spatial redundancy and temporal redundancy can be removed from a video signal. It also describes how a typical video codec combines the two techniques to achieve compression.

2.3.1 Video Signal Representation and Picture Structure

Video coding is usually performed with YUV 4:2:0 format video as an input. This format represents video using one luminance plane (Y) and two chrominance planes (Cb and Cr). The luminance plane represents black and white information, while the chrominance planes contain all of the color data. Because luminance data is perceptually more important than the chrominance data, the resolution of the chrominance planes is half that of the luminance in both dimensions. Thus, each chrominance plane contains a quarter of the pixels contained in the luminance plane. Downsampling the color information means that less information needs to be compressed, but it does not result in a significant degradation in quality.

[Figure 2.2 Temporal redundancy occurs when there is a large amount of similarity between video frames]

Most video coding standards split each video frame into macroblocks (MB), which are 16 × 16 pixels in size. For the YUV 4:2:0 format, each MB contains four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks, as shown in Figure 2.3. The two chrominance blocks contain information from the Cr and Cb planes respectively. Video codecs code each video frame, starting with the MB in the top left-hand corner. The codec then proceeds horizontally along each row, from left to right.

[Figure 2.3 Most video codecs break up a video frame into a number of smaller units for coding]

MBs can be grouped. Groups of MBs are known by different names in different standards. For example:

• Group of Blocks (GOB): H.263 [1–3]
• Video packet: MPEG-4 Version 1 and 2 [4–6]
• Slice: MPEG-2 [7] and H.264 [8–10]

The grouping is usually performed to make the video bitstream more robust to packet losses in communications channels. Section 2.4.8 includes a description of how video slices can be used in error-resilient video coding.

2.3.2 Removing Spatial Redundancy

Removal of spatial redundancy can be achieved by taking into account:

• The characteristics of the human vision system: human vision is more sensitive to low-frequency image data than high-frequency data. In addition, luminance information is more important than chrominance information.
• Common features of image/video signals: Figure 2.4 shows an image that has been high-pass and low-pass filtered. It is clear from the images that the low-pass-filtered version contains more energy and more useful information than the high-pass-filtered one.

[Figure 2.4 'Lena' image subjected to a high-pass and a low-pass filter]

These factors suggest that it is advantageous to consider image/video compression in the frequency domain. Therefore, a transform is needed to convert the original image/video signal into frequency coefficients. The DCT is the most widely used transform in lossy image and video compression. It permits the removal of spatial redundancy by compacting most of the energy of the block into a few coefficients. Each 8 × 8 pixel block is put through the discrete cosine transform (DCT):

S(k_1, k_2) = \frac{4\,C(k_1)\,C(k_2)}{N^2} \sum_{n_1=0}^{N-1} \sum_{n_2=0}^{N-1} s(n_1, n_2) \cos\frac{\pi(2n_1+1)k_1}{2N} \cos\frac{\pi(2n_2+1)k_2}{2N}    (2.1)

where k_1, k_2, n_1, n_2 = 0, 1, ..., N-1; C(k) = 1/\sqrt{2} for k = 0 and 1 otherwise; N is the block size (N = 8); s(n_1, n_2) is an original image 8 × 8 block; and S(k_1, k_2) is the 8 × 8 transformed block.

Video Coding Principles 11 200 350 180 160 300 140 120 250 100 200 150 80 100 60 40 50 20 0 -50 0 -100 1234 5 6 7 1234 5 6 7 8 (a) Original image block Magnitude (b) DCT- transformed block Magnitude 7 7 4 5 1 3 1 8 9 Figure 2.5 Transform based compression (a) Original image block (b) DCT transformed block An example of the DCT in action is shown below, and illustrated in Figure 2.5. An input block, s(n1,n2), is first taken from an image: 23 183 160 94 153 194 163 132 165 Sðn1; n2Þ ¼ 66646666666 183 153 116 176 187 166 130 169 57777777777 ð2:2Þ 179 168 171 182 179 170 131 167 177 177 179 177 179 165 131 167 178 178 179 176 182 164 130 171 179 180 180 179 183 169 132 173 179 179 180 182 183 170 129 173 180 179 181 179 181 170 130 169 It is then put though the DCT, and is rounded to the nearest integer: 2 56 À 27 18 78 À 60 27 3 313 À 27 66666666664 À 38 À 27 13 44 32 À 1 À 24 À 10 77777777577 À 20 À 17 À9 À 10 10 33 21 À 6 À 16 1 À8 5 jSðk1; k2Þj ¼ À6 1 9 17 9 À 10 À 13 3 ð2:3Þ 2 3 4 4 6 4 À3 À7 À5 4 0 À3 À7 À4 0 À1 À2 À9 0 2 3 1 0 À4 À2 À1 3 1 The coefficients in the transformed block represent the energy contained in the block at different frequencies. The lowest frequencies, starting with the DC coefficient, are contained in the top-left corner, while the highest frequencies are contained in the bottom-right, as shown in Figure 2.5. Note that many of the high-frequency coefficients are much smaller than the low-frequency coefficients. Most of the energy in the block is now contained in a few low-frequency coefficients. This is important, as the human eye is most sensitive to low-frequency data.

It should be noted that in most video codec implementations the 2D DCT calculation is replaced by 1D DCT calculations, which are performed on each row and column of the 8 × 8 block:

F(u) = a(u) \sum_{x=0}^{N-1} f(x) \cos\frac{\pi(2x+1)u}{2N}    (2.4)

for u = 0, 1, 2, ..., N-1. The value of a(u) is defined as:

a(u) = \sqrt{1/N} for u = 0, and a(u) = \sqrt{2/N} for u ≠ 0    (2.5)

The 1D DCT is used as it is easier to optimize in terms of computational complexity.

Next, quantization is performed by dividing the transformed DCT block by a quantization matrix. The standard quantization matrices used in the JPEG codec are shown below:

Q_Y =
[ 16  11  10  16  24  40  51  61
  12  12  14  19  26  58  60  55
  14  13  16  24  40  57  69  56
  14  17  22  29  51  87  80  62
  18  22  37  56  68 109 103  77
  24  35  55  64  81 104 113  92
  49  64  78  87 103 121 120 101
  72  92  95  98 112 100 103  99 ]    (2.6)

Q_UV =
[ 17  18  24  47  99  99  99  99
  18  21  26  66  99  99  99  99
  24  26  56  99  99  99  99  99
  47  66  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99
  99  99  99  99  99  99  99  99 ]    (2.7)

where Q_Y is the matrix for luminance (Y plane) and Q_UV is the matrix for chrominance (U and V planes). The matrix values are set using psycho-visual measurements. Different matrices are used for luminance and chrominance because of the differing perceptual importance of the planes. The quantization matrices determine the output picture quality and output file size. Scaling the matrices by a value greater than 1 increases the coarseness of the quantization, reducing quality. However, such scaling also reduces the number of nonzero coefficients and the size of the nonzero coefficients, which reduces the number of bits needed to code the video.
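The effect of scaling the quantization matrix can be illustrated with a short sketch. Only Q_Y is taken from Equation (2.6); the synthetic test block, the scale factors, and the variable names are assumptions made purely for this illustration.

```python
import numpy as np
from scipy.fft import dctn

# A smooth synthetic 8 x 8 block (gentle gradient plus a mild ripple) stands in
# for a natural-image block.
x, y = np.meshgrid(np.arange(8), np.arange(8))
block = 128 + 8 * x + 4 * y + 3 * np.sin(x)

QY = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=float)

coeffs = dctn(block - 128, norm='ortho')      # level-shifted, orthonormal 2D DCT
for scale in (1, 2, 4):
    levels = np.rint(coeffs / (QY * scale))   # coarser quantization as scale grows
    print(f"scale {scale}: {np.count_nonzero(levels)} nonzero coefficients")
```

As the scale factor grows, the printed count of nonzero quantized coefficients can only stay the same or fall, which is exactly the bit-rate/quality trade-off described in the paragraph above.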

2.3.2.1 H.263 Quantization Example

Take the DCT matrix from above:

S(k_1, k_2) =
[ 313   56  -27   18   78  -60   27  -27
  -38  -27   13   44   32   -1  -24  -10
  -20  -17   10   33   -6  -16   -9  -10
   17   21  -10  -13    1   -8    9    4
   -7   -5    5   -6    1    6   -3    9
   -4    3    2    3    0   -2   -3    0
    4    4   -1   -7    0    4   -9    2
    3    1    0   -4   -2   -1    3    1 ]    (2.8)

and divide it, element by element, by the luminance quantization matrix (313/16, 56/11, ..., -27/61 on the first row, down to 3/72, ..., 1/99 on the last), rounding each result to the nearest integer:

S(k_1, k_2) / Q_Y =
[ 20   5  -3   1   3  -2   1   0
  -3  -2   1   2   1   0   0   0
  -1  -1   1   1   1   0   0   0
  -1   0   0   1   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0 ]    (2.9)

The combination of the DCT and quantization has clearly reduced the number of nonzero coefficients. The next stage in the encoding process is to zigzag scan the DCT matrix coefficients into a new 1D coefficient matrix, as shown in Figure 2.6. Using the above example, the result is:

[20 5 -3 -1 -2 -3 1 1 -1 -1 0 0 1 2 3 -2 1 1 0 0 0 0 0 0 1 1 0 1 EOB]    (2.10)

[Figure 2.6 Zigzag scanning of DCT coefficients]
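For readers who want to trace this example, the zigzag scan of Figure 2.6 can be sketched in a few lines and applied to the quantized block of Equation (2.9); the helper names below are invented for this illustration.

```python
def zigzag_indices(n=8):
    """Standard zigzag order (Figure 2.6): walk the anti-diagonals d = row + col,
    alternating direction on each one."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def zigzag_scan(quantized):
    """Scan a quantized 8 x 8 block into a 1D list, replacing trailing zeros with 'EOB'."""
    seq = [quantized[r][c] for r, c in zigzag_indices(len(quantized))]
    while seq and seq[-1] == 0:
        seq.pop()
    return seq + ['EOB']

# Quantized block from Equation (2.9)
Q = [[20,  5, -3,  1,  3, -2,  1,  0],
     [-3, -2,  1,  2,  1,  0,  0,  0],
     [-1, -1,  1,  1,  1,  0,  0,  0],
     [-1,  0,  0,  1,  0,  0,  0,  0],
     [ 0,  0,  0,  0,  0,  0,  0,  0],
     [ 0,  0,  0,  0,  0,  0,  0,  0],
     [ 0,  0,  0,  0,  0,  0,  0,  0],
     [ 0,  0,  0,  0,  0,  0,  0,  0]]

print(zigzag_scan(Q))
# [20, 5, -3, -1, -2, -3, 1, 1, -1, -1, 0, 0, 1, 2, 3, -2, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 'EOB']
```

The printed sequence matches Equation (2.10): 28 coefficients followed by the EOB marker.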

In Equation (2.10) the EOB symbol indicates the end of the block (i.e. all following coefficients are zero). Note that the number of coefficients to be encoded has been reduced from 64 to 28 (29 including the EOB).

The data is further reorganized, with the DC component (the top-left in the DCT matrix) being treated differently from the AC coefficients. DPCM (Differential Pulse Code Modulation) is used on DC coefficients in the H.263 standard [1]. This method of coding generally creates a prediction for the current block's value first, and then transmits the error between the predicted value and the actual value. Thus, the reconstructed intensity for the DC at the decoder, s(n_1, n_2), is:

s(n_1, n_2) = \hat{s}(n_1, n_2) + e(n_1, n_2)    (2.11)

where \hat{s}(n_1, n_2) and e(n_1, n_2) are respectively the predicted intensity and the error. For JPEG, the predicted DC coefficient is the DC coefficient in the previous block. Thus, if the previous DC coefficient was 15, the coded value for the example given above will be:

20 - 15 = 5    (2.12)

AC coefficients are coded using run-level coding, where each nonzero coefficient is coded using a value for the intensity and a value giving the number of zero coefficients preceding the coefficient. With the above example, the coefficients are represented as shown in Table 2.1.

Table 2.1 Run-level coding

Run    0  0  0  0  0  0  0  0  0
Level  5  3  1  2  3  1  1  1  1

Run    2  0  0  0  0  0  6  0  1
Level  1  2  3  2  1  1  1  1  1

Variable-length coding techniques are used to encode the DC and AC coefficients. The coding scheme is arranged such that the most common values have the shortest codewords.

2.3.3 Removing Temporal Redundancy

Image coding attempts to remove spatial redundancy. Video coding features an additional redundancy type: temporal redundancy. This occurs because of strong similarities between successive frames. It would be inefficient to transmit a series of JPEG images. Therefore, video coding aims to transmit the differences between two successive frames, thus achieving even higher compression ratios than for image coding.

The simplest method of sending the difference between two frames would be to take the difference in pixel intensities. However, this is inefficient when the changes are simply a matter of objects moving around a scene (e.g. a car moving along a road). Here it would be better to describe the translational motion of the object. This is what most video codec standards attempt to do.

A number of different frame types are used in video coding. The two most important types are:

• Intra frames (called I frames in MPEG standards): these frames use similar compression methods to JPEG, and do not attempt to remove any temporal redundancy.
• Inter frames (called P frames in MPEG): these frames use the previous frame as a reference.

Intra frames are usually much larger than inter frames, due to the presence of temporal redundancy in them. However, inter frames rely on previous frames being successfully received to ensure correct reconstruction of the current frame. If a frame is dropped somewhere in the network then all subsequent inter frames will be incorrectly decoded. Intra frames can be sent periodically to correct this. Descriptions of other types of frame are given in Section 2.4.1.

Motion compensation is the technique used to remove much of the temporal redundancy in video coding. It is preceded by motion estimation.

2.3.3.1 Motion Estimation

Motion estimation (ME) attempts to estimate translational motion within a video scene. The output is a series of motion vectors (MVs). The aim is to form a prediction for the current frame based on the previous frame and the MVs. The most straightforward and accurate method of determining MVs is to use block matching. This involves comparing pixels in a certain search window with those in the current frame, as shown in Figure 2.7. Typically, the Mean Square Error is employed, such that the MV can be found from:

[\hat{d}_1, \hat{d}_2] = \min_{(d_1, d_2)} \left( \frac{1}{256} \sum_{(n_1, n_2) \in b} [s(n_1, n_2, k) - s(n_1 + d_1, n_2 + d_2, k - 1)]^2 \right)    (2.13)

where [\hat{d}_1, \hat{d}_2] is the optimum MV, s(n_1, n_2, k) is the pixel intensity at the coordinate (n_1, n_2) in the kth frame, and b is a 16 × 16 block. Thus, for each MB in the current frame, the algorithm finds the MV that gives the minimum MSE compared to an MB in the previous frame.

[Figure 2.7 ME is carried out by comparing an MB in the current frame with pixels in the reference frame within a preset search window]
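A minimal full-search implementation of this block-matching procedure, following Equation (2.13), is sketched below. The function name, block size, and search-range parameters are illustrative choices; a practical encoder would use a faster, suboptimal search, as discussed next.

```python
import numpy as np

def motion_estimate(current, reference, block_size=16, search_range=7):
    """Full-search block matching: for each MB in `current`, find the (dy, dx)
    within +/- search_range that minimizes the MSE against `reference` (Eq. 2.13)."""
    h, w = current.shape
    vectors = {}
    for y in range(0, h - block_size + 1, block_size):
        for x in range(0, w - block_size + 1, block_size):
            block = current[y:y + block_size, x:x + block_size].astype(float)
            best, best_mse = (0, 0), np.inf
            for dy in range(-search_range, search_range + 1):
                for dx in range(-search_range, search_range + 1):
                    ry, rx = y + dy, x + dx
                    if ry < 0 or rx < 0 or ry + block_size > h or rx + block_size > w:
                        continue  # candidate block falls outside the reference frame
                    cand = reference[ry:ry + block_size, rx:rx + block_size].astype(float)
                    mse = np.mean((block - cand) ** 2)
                    if mse < best_mse:
                        best_mse, best = mse, (dy, dx)
            vectors[(y, x)] = best  # motion vector for the MB at (y, x)
    return vectors
```

The nested search over every candidate displacement is what makes full-search ME so expensive: for a ±7 window it evaluates 225 candidate positions per macroblock, which is why the complexity-reduction methods mentioned in the following paragraph are important in practice.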

Although this technique identifies MVs with reasonable accuracy, the procedure requires many calculations for a whole frame. ME is often the most computationally intensive part of a codec implementation, and has prevented digital video encoders being incorporated into low-cost devices. Researchers have examined a variety of methods for reducing the computational complexity of ME. However, they usually result in a tradeoff between complexity and accuracy of MV determination. Suboptimal MV selection means that the coding efficiency is reduced, and therefore leads to quality degradation where a fixed bandwidth is specified.

2.3.3.2 Intra/Inter Mode Decision

Not all MBs should be coded as inter MBs, with motion vectors. For example, new objects may be introduced into a scene. In this situation the difference is so large that an intra MB should be encoded. Within an inter frame, MBs are coded as inter or intra MBs, often depending on the MSE value. If the MSE passes a certain threshold, the MB is coded as intra, otherwise inter coding is performed. The MSE-based threshold algorithm is simple, but is suboptimal, and can only be used when a limited number of MB modes are available. More sophisticated MB mode-selection algorithms are discussed in Section 2.4.3.

2.3.3.3 Motion Compensation

The basic intention of motion compensation (MC) is to form as accurate a prediction as possible of the current frame from the previous frame. This is achieved using the MVs produced in the ME stage. Each inter MB is coded by sending the MV value plus the prediction error. The prediction error is the difference between the motion-compensated prediction for that MB and the actual MB in the current frame. Thus, the transmitted error MB is:

e(n_1, n_2, k) = s(n_1, n_2, k) - s(n_1 + \hat{d}_1, n_2 + \hat{d}_2, k - 1)    (2.14)

The prediction error is generally smaller in magnitude than the original, meaning that fewer bits are required to code the error. Therefore, the compression efficiency is increased by using MC.

2.3.4 Basic Video Codec Structure

The video codec shown in Figure 2.8 demonstrates the basic operation of many video codecs. The major components are:

• Transform and Quantizer: perform operations similar to the transform and quantization process described in Section 2.3.2.
• Entropy Coder: takes the data for each frame and maps it to binary codewords. It outputs the final bitstream.
• Encoder Control: can change the MB mode and picture type. It can also vary the coarseness of the quantization and perform rate control. Its precise operation is not standardized.
• Feedback Loop: removes temporal redundancy by using ME and MC.
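The interaction of these components can be sketched, in heavily simplified form, for a single inter-coded block. Everything here (the function names, the single quantizer step size, the entropy-coding stub) is an illustrative assumption rather than the algorithm of any particular standard.

```python
import numpy as np
from scipy.fft import dctn, idctn

def encode_block(block, reference_block, q_step=16):
    """One inter-coded block through the basic hybrid loop of Figure 2.8:
    residual -> transform -> quantize -> (entropy coding stub) -> local reconstruction."""
    residual = block.astype(float) - reference_block.astype(float)   # motion-compensated prediction error
    coeffs = dctn(residual, norm='ortho')                            # transform
    levels = np.rint(coeffs / q_step)                                # quantization (single step size here)
    # ... entropy coding of `levels` would happen here ...
    recon_residual = idctn(levels * q_step, norm='ortho')            # inverse quantization + inverse transform
    reconstructed = reference_block + recon_residual                 # what the decoder will also compute
    return levels, reconstructed                                     # `reconstructed` feeds the reference buffer
```

The point of the feedback loop in Figure 2.8 is that prediction is formed from the locally reconstructed data, that is, from what the decoder will actually have after inverse quantization and inverse transform, so that encoder and decoder remain synchronized.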

Figure 2.8 Basic video encoder block diagram.

2.4 Advanced Video Compression Techniques

Section 2.3 discussed some of the basic video coding techniques that are common to most of the available video coding standards. This section examines some more advanced video coding techniques, which provide improved compression efficiency, additional functionality, and robustness to communication channel errors. Particular attention is paid to the H.264 video coding standard [8, 9], which is one of the most recently standardized codecs. Subsequent codecs, such as scalable H.264 and Multi-view Video Coding (MVC) [11], use the H.264 codec as a starting point. Note that scalability is discussed in Chapter 3.

2.4.1 Frame Types

Most modern video coding standards are able to code at least three different frame types:

• I frames (intra frames): these do not include any motion-compensated prediction from other frames. They are therefore coded completely independently of other frames. As they do not remove temporal redundancy they are usually much larger in size than other frame types. However, they are required to allow random access functionality, to prevent drift between the encoder and decoder picture buffers, and to limit the propagation of errors caused by packet loss (see Section 2.4.8).

• P frames (inter frames): these include motion-compensated prediction, and therefore remove much of the temporal redundancy in the video signal. As shown in Figure 2.9, P frames generally use a motion-compensated version of the previous frame to predict the current frame. Note that P frames can include intra-coded MBs.
• B frames: the 'B' is used to indicate that bi-directional prediction can be used, as shown in Figure 2.10. A motion-compensated prediction for the current frame is formed using information from a previous frame, a future frame, or both. B frames can provide better compression efficiency than P frames. However, because future frames are referenced during encoding and decoding, they inherently incur some delay. Figure 2.11 shows that the frames must be encoded and transmitted in an order that is different from playback. This means that they are not useful in low-delay applications such as videoconferencing. They also require additional memory usage, as more reference frames must be stored.

Figure 2.9 P frames use a motion-compensated version of previous frames to form a prediction of the current frame.

Figure 2.10 B frames use bi-directional prediction to obtain predictions of the current frame from past and future frames.

H.264 supports a wider range of frame and MB slice types. In fact, H.264 supports five types of slice, which include I-type, P-type, and B-type slices. I-type (Intra) slices are the simplest, in which all MBs are coded without referring to other pictures within the video sequence. If previously-coded images are used to predict the current MB it is called a P-type (predictive) slice, and if both previous- and future-coded images are used then it is called a B-type (bi-predictive) slice. Other slices supported by H.264 are the SP-type (Switching P) and the SI-type (Switching I), which are specially-coded slices that enable efficient switching between video streams and random access for video decoders [12].
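The coding-order rearrangement that Figure 2.11 illustrates can be sketched in a few lines. The function below, and its assumption that each run of B frames uses the single following reference frame, are purely illustrative; real encoders signal more general reference and reordering structures in the bitstream.

    def coding_order(display_frames):
        # Reorder a display-order list of (frame number, type) pairs for
        # coding/transmission: each B frame is held back until the future
        # reference it depends on has been sent (Figure 2.11).
        out, pending_b = [], []
        for frame in display_frames:
            if frame[1] == 'B':
                pending_b.append(frame)
            else:
                out.append(frame)
                out.extend(pending_b)
                pending_b = []
        return out + pending_b

    # Display order I B P B P becomes coding order I P B P B:
    print(coding_order([(1, 'I'), (2, 'B'), (3, 'P'), (4, 'B'), (5, 'P')]))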

Figure 2.11 Use of B frames means that frames must be coded and decoded in an order that is different from that of playback.

A video decoder may use SP and SI slices to switch between one of several available encoded streams. For example, the same video material may be encoded at multiple bit rates for transmission across the Internet. A receiving terminal will attempt to decode the highest-bit-rate stream it can receive, but it may need to switch automatically to a lower-bit-rate stream if the data throughput drops.

2.4.2 MC Accuracy

Providing more accurate MC can significantly reduce the magnitude of the prediction error, and therefore fewer bits need to be used to code the transform coefficients. More accuracy can be provided either by allowing finer motion vectors to be used, or by permitting more motion vectors to be used in an MB. The former allows the magnitude of the motion to be described more accurately, while the latter allows for complex motion or for situations where there are objects smaller than an MB. H.264 in particular supports a wider range of spatial accuracy than any of the existing coding standards, as shown in Table 2.2. Amongst earlier standards, only the latest version of MPEG-4 Part 2 (version 2) [5] can provide quarter-pixel accuracy, while others provide only half-pixel accuracy. H.264 also supports quarter-pixel accuracy. To achieve quarter-pixel accuracy, the luminance prediction values at half-sample positions are obtained by applying a 6-tap filter to the nearest integer-position samples [9]. The luminance prediction values at quarter-sample positions are then obtained by averaging samples at integer and half-sample positions.

An important point to note is that more accurate MC requires more bits to be used to specify motion vectors. However, more accurate MC should reduce the number of bits required to code the quantized transform coefficients. There is clearly a tradeoff between the number of bits added by the motion vectors and the number of bits saved by better MC. This tradeoff depends upon the source sequence characteristics and on the amount of quantization that is used. Methods of finding the best tradeoff are dealt with in Section 2.4.3.
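As an illustration of the interpolation process just described, the following one-dimensional sketch derives a half-sample value with the standard 6-tap filter (weights 1, −5, 20, 20, −5, 1, with rounding and a 5-bit right shift) and a quarter-sample value by rounded averaging. The function names and the simple list input are illustrative; in two dimensions the same filter is applied horizontally and/or vertically.

    def half_sample(s, i):
        # Half-sample value midway between s[i] and s[i+1]; s holds integer-position
        # luma samples along one row or column, with at least two further samples
        # available on each side of the pair.
        v = (s[i - 2] - 5 * s[i - 1] + 20 * s[i] + 20 * s[i + 1]
             - 5 * s[i + 2] + s[i + 3] + 16) >> 5
        return max(0, min(255, v))            # clip to the 8-bit sample range

    def quarter_sample(a, b):
        # Quarter-sample value: rounded average of adjacent integer/half samples.
        return (a + b + 1) >> 1

    row = [10, 20, 30, 60, 90, 100, 110, 120]
    h = half_sample(row, 3)                   # half-pel position between 60 and 90
    print(h, quarter_sample(row[3], h))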

Table 2.2 Comparison of the ME accuracies provided by different video codecs

Standard               MVs per MB    Accuracy of luma motion compensation
H.261                  1             Integer pixel
MPEG-1                 1             1/2 pixel
MPEG-2/H.262           2             1/2 pixel
H.263                  4             1/2 pixel
MPEG-4                 4             1/4 pixel
H.264/MPEG-4 pt. 10    16            1/4 pixel

2.4.3 MB Mode Selection

Most of the widely-used video coding standards allow MBs to be coded with a variety of modes. For example:

• MPEG-2: INTRA, SKIP, INTER-16 × 16, INTER-16 × 8.
• H.263/MPEG-4: INTRA, SKIP, INTER-16 × 16, INTER-8 × 8.
• H.264/AVC: INTRA-4 × 4, INTRA-16 × 16, SKIP, INTER-16 × 16, INTER-16 × 8, INTER-8 × 16, INTER-8 × 8; the 8 × 8 INTER blocks may then be partitioned into 4 × 4, 8 × 4, or 4 × 8.

Selection of the best mode is an important part of optimizing the compression efficiency of an encoder implementation. Mode selection has been the subject of a significant amount of research. It is a problem that may be solved using optimization techniques such as Lagrangian Optimization and dynamic programming [13]. The approach currently taken in the H.264 reference software uses Lagrangian Optimization [14]. For mode selection, Lagrangian Optimization may be carried out by minimizing the following Lagrangian cost for each coding unit:

\[ J_{\mathrm{MODE}}(M, Q, \lambda_{\mathrm{MODE}}) = D_{\mathrm{REC}}(M, Q) + \lambda_{\mathrm{MODE}} R_{\mathrm{REC}}(M, Q) \qquad (2.15) \]

where R_REC(M, Q) is the rate from compressing the current coding unit with mode M and quantizer value Q, and D_REC(M, Q) is the distortion obtained from compressing the current coding unit using mode M and quantizer Q. The distortion can be found by taking the sum of squared differences:

\[ \mathrm{SSD} = \sum_{(x, y) \in A} \left| s(x, y, t) - s'(x, y, t) \right|^2 \qquad (2.16) \]

where A is the MB, s is the original MB pixels, and s' is the reconstructed MB pixels. The remaining parameter, λ_MODE, can be determined experimentally, and has been found to be consistent for a variety of test sequences [15]. For H.263, the following curve fits the experimental results well:

\[ \lambda_{\mathrm{MODE}} = 0.85 \cdot Q_{\mathrm{H.263}}^{2} \qquad (2.17) \]

where Q_H.263 is the quantization parameter.
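A sketch of such a mode decision is given below. It derives λ from the quantization parameter using the H.263 relationship of Equation (2.17) (the corresponding H.264 expression follows in Equation (2.18)), and the distortion and rate figures are invented purely for illustration; in practice they are obtained by trial-encoding the MB in each candidate mode.

    def select_mode(candidates, q):
        # Choose the MB mode minimising J = D + lambda * R (Equation (2.15)),
        # with lambda from the H.263 relationship of Equation (2.17).
        # `candidates` maps a mode name to a (distortion_SSD, rate_bits) pair.
        lam = 0.85 * q * q
        costs = {m: d + lam * r for m, (d, r) in candidates.items()}
        return min(costs, key=costs.get)

    # Distortion and rate figures below are invented purely for illustration:
    modes = {'SKIP': (9000, 2), 'INTER-16x16': (3100, 85), 'INTRA': (2600, 240)}
    print(select_mode(modes, q=5))            # -> INTER-16x16 at this quantizer setting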

For H.264, the following curve has been obtained:

\[ \lambda_{\mathrm{MODE}} = 0.85 \cdot 2^{(Q_{\mathrm{H.264}} - 12)/3} \qquad (2.18) \]

where Q_H.264 is the quantization parameter for H.264. For both H.263 and H.264, rate control can be performed by varying the quantization. Once rate control has been performed, the quantization parameter can be used to calculate λ, so that Lagrangian Optimization can be used to find the optimum mode. Results have shown that this kind of scheme can bring considerable benefits in terms of compression efficiency [14]. However, a major disadvantage of this type of optimization is that it involves considerable computational complexity. Many researchers have attempted to find lower-complexity solutions to this problem. Choi et al. describe one such scheme [16].

2.4.4 Integer Transform

Similar to earlier standards, H.264 also applies a transform to the prediction residual. However, it does not apply the conventional floating-point 8 × 8 DCT transform. Instead, a separable integer transform is applied to 4 × 4 blocks of the picture [17]. Because of its precise integer specification, this transform eliminates any mismatch between encoder and decoder in the inverse transform. In addition, its small size helps in reducing blocking and ringing artifacts. For an MB coded in intra-16 × 16 mode, a similar 4 × 4 transform is performed on the 4 × 4 DC coefficients of the luminance signal. The cascading of block transforms is equivalent to an extension of the length of the transform functions. If X is the original image block then the 4 × 4 pixel DCT of X can be found as:

\[ A X A^{T} = \begin{bmatrix} a & a & a & a \\ b & c & -c & -b \\ a & -a & -a & a \\ c & -b & b & -c \end{bmatrix} X \begin{bmatrix} a & b & a & c \\ a & c & -a & -b \\ a & -c & -a & b \\ a & -b & a & -c \end{bmatrix} \qquad (2.19) \]

where

\[ a = \frac{1}{2}, \qquad b = \sqrt{\tfrac{1}{2}} \cos\left(\frac{\pi}{8}\right), \qquad c = \sqrt{\tfrac{1}{2}} \cos\left(\frac{3\pi}{8}\right) \]

The integer transform of H.264 is similar to the DCT, except that some of the coefficients are rounded and scaled, providing a new transform:

\[ C X C^{T} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix} X \begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix} \otimes E \qquad (2.20) \]

where E is a matrix of scaling factors applied element by element. Note that the transform can be implemented without any multiplications, as it only requires additions and bit shifting.
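The matrices of Equation (2.20) can be checked directly by applying them to a sample residual block. The sketch below computes only the integer core transform C X C^T; the element-wise scaling by E is omitted here on the assumption that it is absorbed into the quantization stage, and the matrix multiplication stands in for the addition-and-shift butterfly used in practice.

    import numpy as np

    # Core transform matrix C from Equation (2.20).
    C = np.array([[1,  1,  1,  1],
                  [2,  1, -1, -2],
                  [1, -1, -1,  1],
                  [1, -2,  2, -1]])

    def core_transform_4x4(x):
        # Forward 4 x 4 integer core transform, Y = C X C^T.
        return C @ x @ C.T

    residual = np.arange(16).reshape(4, 4)    # arbitrary example residual block
    print(core_transform_4x4(residual))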

















