
in terms of r, i.e. D(r,a), where a is the parameter defining the probability density function. Therefore, the optimization problem can be expressed directly in the r domain:

\min_{\mathbf{r}} \frac{1}{S} \sum_{s=1}^{S} D_s(r_s, a_s)    (5.27)

subject to the overall rate constraint:

\sum_{s=1}^{S} u_s (1 - r_s) \le R_{tot}    (5.28)

and solved by means of the Lagrange multipliers method. The solution can be found either in closed form or numerically, depending on the functional form of D(r,a).

It is worth pointing out that, in the transcoding scenario addressed here, the parameters needed to specify the optimization problem, i.e. [u_s, a_s], s = 1, ..., S, can be readily obtained from the histograms of the decoded DCT coefficients. Therefore, in the transcoding process, the input bitstreams are decoded to obtain the DCT coefficients relative to the current frame of each of the considered sequences (together with the motion information that will be used in the recoding process). Then, for each frame, the histogram of the DCT coefficients is evaluated, and the parameters of the generalized Gaussian model that best fits the histogram are estimated. This allows the evaluation of the parameters that will be used in the optimization procedure in Equation (5.27) for optimal bit allocation in the recoding of the current frames of all the considered sequences. An assumption of a Laplacian distribution of the DCT coefficients could be made, avoiding a frame-by-frame estimation, but simulation tests have shown that this solution leads to a significant performance loss without reducing the computational load in a significant way. In any case, the proposed algorithm is very fast and suitable for a real-time implementation. In addition, the rate allocation algorithm is carried out independently at each time instant. Therefore, each video sequence is encoded at a variable bit rate (VBR), while the overall bit rate is kept constant. At each time instant, any frame-based rate control algorithm can be applied to adaptively adjust the quantization parameter at the MB level, in such a way as to meet the target rate, R_s^*.

5.5.3 Performance Evaluation

(Portions reprinted, with permission, from Mariusz Jakubowski, Grzegorz Pastuszak, "Multi-path adaptive computation-aware search strategy for block-based motion estimation," The International Conference on Computer as a Tool, EUROCON, 9-12 September 2007, pages 175-181. ©2007 IEEE.)

The proposed rate control algorithm has so far been tested on H.263+ intra-encoded sequences, although its extension to H.264/AVC is readily obtained and is the subject of current and future work. Figure 5.6 shows the results obtained with S = 2 sequences and a target bit rate Rtot = 2 bps. The x axis shows R1, i.e. the rate allocated to the first sequence. The rate allocated to the second sequence can be obtained as R2 = Rtot - R1. Figure 5.6 shows both the RD curve of each individual sequence and the average distortion. The vertical dashed line represents the estimated optimal bit allocation obtained by the proposed algorithm.
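Before moving on, the allocation step of Equations (5.27)-(5.28) can be made concrete with a small numerical sketch. The fragment below is an illustrative Lagrangian solution under stated assumptions, not the authors' implementation: `D_funcs[s](r)` is assumed to return the distortion of sequence s when a fraction r of its DCT coefficients is quantized to zero (e.g. derived from the fitted generalized Gaussian parameters), and `u[s]` maps (1 - r) to a rate; the bisection on the multiplier and all names are hypothetical.

```python
# Hypothetical sketch of the rho-domain bit allocation of Equations (5.27)-(5.28).
def allocate_rho(D_funcs, u, R_tot, grid=200, iters=60):
    """Return per-sequence r values minimizing average distortion under a total rate budget."""
    S = len(D_funcs)
    rhos = [i / grid for i in range(grid + 1)]          # candidate r values in [0, 1]

    def best_rho(s, lam):
        # Each sequence independently minimizes its Lagrangian cost D_s(r) + lam * u_s * (1 - r).
        return min(rhos, key=lambda r: D_funcs[s](r) + lam * u[s] * (1.0 - r))

    def total_rate(lam):
        return sum(u[s] * (1.0 - best_rho(s, lam)) for s in range(S))

    lo, hi = 0.0, 1e6                                    # bisection on the Lagrange multiplier
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if total_rate(lam) > R_tot:
            lo = lam                                     # too many bits spent: increase lambda
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return [best_rho(s, lam) for s in range(S)]
```

In this sketch the per-sequence minimization plays the role of the closed-form or numerical solution mentioned above; with a closed-form D(r,a), the inner grid search could be replaced by the analytical optimum.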

Figure 5.6 Results of rate allocation for two sequences. Reproduced by permission of ©2007 IEEE

5.5.4 Conclusions

(Portions reprinted, with permission, from Mariusz Jakubowski, Grzegorz Pastuszak, "Multi-path adaptive computation-aware search strategy for block-based motion estimation," The International Conference on Computer as a Tool, EUROCON, 9-12 September 2007, pages 175-181. ©2007 IEEE.)

A rate controller module for the global rate control of multiple pre-encoded AVC sequences is used to allocate the bit budget to the output sequences. The input sequences may be VBR or CBR, but the output sequences are multiplexed together for transmission at a fixed rate into a single channel. A novel solution for the rate control module proposed in this work decodes the sequence in the pixel domain and then re-encodes it with the new target bit rate. This approach speeds up the transcoding process, and avoids drift propagation-related issues. The parameters needed for the optimization are obtained from the histogram of the decoded DCT coefficients, improving the efficiency of the rate control. The results show that the proposed algorithm achieves optimal bit allocation for all the sequences.

5.6 Spatio-temporal Scene-level Error Concealment for Segmented Video

(Portions reprinted, with permission, from G. Valenzise, M. Tagliasacchi, S. Tubaro, L. Piccarreta, "A rho-domain rate controller for multiplexed video sequences," 26th Picture Coding Symposium 2007, Lisbon, November 2007. ©2007 EURASIP.)

5.6.1 Problem Definition and Objectives

As referred to in Section 5.5, several object-based error-concealment techniques, dealing both with shape and texture data, have been proposed in the literature. These techniques, however, have a serious limitation in common, which is that each video object is independently considered, without ever taking into account how it fits in the video scene. After all, the fact that a concealed video object has a pleasing subjective impact on the user when it is considered on its own does not necessarily mean that the subjective impact of the whole scene, when the objects are put together, will be acceptable; this represents the difference between object-level and scene-level concealment. An example of this situation is given in Figure 5.7, where a hole has appeared as a result of blindly composing the scene by using two independently-concealed video objects.

Figure 5.7 Illustration of a typical scene concealment problem: (a) original video scene; (b) composition of two independently error-concealed video objects. Reproduced by permission of ©2008 IEEE

When concealing a complete video scene, the way the scene was created has to be considered, since this will imply different problems and solutions in terms of error concealment. As shown in Figure 5.8, the video objects in a scene can be defined either by segmentation of an existing video sequence (segmented scene), in which case all shapes have to fit perfectly together, or by composition of pre-existing video objects (composed scene), whose shapes do not necessarily have to fit perfectly together. Additionally, it is also possible to use a combination of both approaches. For the work presented here, segmented video scenes (or the segmented parts of hybrid scenes) are considered, since the concealment of composed scenes can typically be limited to object-level concealment. In addition, the proposed technique, which targets the concealment of both shape and texture data, relies not only on available information from the current time instant but also on information from the past: it is a spatio-temporal technique. In fact, it is the only known spatio-temporal technique that targets both shape and texture data, and works at the scene level.

5.6.2 Proposed Technical Solution

In order to better understand the proposed scene-level error-concealment solution, the types of problem that may appear in segmented video scenes when channel errors occur should be briefly considered, as well as what can be done to solve them.

Figure 5.8 Two different scene types: (a) segmented scene; (b) composed scene

5.6.2.1 Scene-level Error Concealment in Segmented Video

Segmented video scenes are obtained from rectangular video scenes by segmentation. This means that, at every time instant, the arbitrarily-shaped video object planes (VOPs) of the various video objects in the scene will fit perfectly together, like the pieces in a jigsaw puzzle. These arbitrarily-shaped VOPs are transmitted in the form of rectangular bounding boxes, using shape and texture data. The shape data corresponds to a binary alpha plane, which is used to indicate the parts of the bounding box that belong to the object and, therefore, need to have texture associated with them. For the rest of the work described here, it will be considered that some kind of block-based coding, such as (but not necessarily) that defined in the MPEG-4 Visual standard, was used, and that channel errors manifest themselves as bursts of consecutive corrupted blocks for which both shape and texture data will have to be concealed, at both object and scene levels.

Shape Error Concealment in Segmented Video

Since, in segmented scenes, the various VOPs in a time instant have to fit together like the pieces in a jigsaw puzzle, any distortion in their shape data will cause holes or object overlap to appear, leading to a negative subjective impact. However, the fact that the existing VOPs have to fit perfectly together can also be used when it comes to the concealment of shape data errors. In many cases, it will be possible to conceal at least some parts of the corrupted shape in a given corrupted VOP by considering uncorrupted complementary shape data from surrounding VOPs. For those parts of the corrupted shape for which complementary data is not available because it is corrupted, concealment will be much harder. Thus, depending on the part of the corrupted shape that is being concealed in a VOP, two distinct cases are possible:

- Correctly decoded complementary shape data: the shape data from the surrounding VOPs can be used to conceal the part of the corrupted shape under consideration, since it is uncorrupted.
- Corrupted complementary shape data: the shape data from the surrounding VOPs cannot be used to conceal the part of the corrupted shape under consideration, since it is also corrupted.

These two cases, which are illustrated in Figure 5.9, correspond to different concealment situations and, therefore, will have to be treated separately in the proposed technique.

Figure 5.9 Illustration of the two possible concealment situations for the Stefan video objects (Background and Player): (a) correctly decoded complementary shape data exists; (b) complementary shape data is corrupted in both objects. Reproduced by permission of ©2008 IEEE

Texture Error Concealment in Segmented Video

When concealing the corrupted texture of a given VOP in a video scene, the available texture from surrounding VOPs appears to be of little or no use, since different objects typically have uncorrelated textures. However, in segmented scenes, the correctly decoded shape data from surrounding VOPs can be indirectly used to conceal the corrupted texture data. This is possible because the shape data can be used to determine the motion associated with a given video object, which can then be used to conceal its corrupted texture, as was done in [46]. Therefore, by concealing parts of the corrupted shape data of a given VOP with the correctly decoded complementary shape data, it will be possible to estimate the object motion and conceal the corrupted texture.

5.6.2.2 Proposed Scene-level Error-concealment Algorithm

By considering what was said above for the concealment of shape and texture data in segmented video scenes, a complete and novel scene-level shape and texture error-concealment solution is proposed here. The proposed concealment algorithm includes two main consecutive phases, which are described in detail in the following two subsections.

Shape and Texture Concealment Based on Available Complementary Shape Data

In this phase, all the parts of the corrupted shape for which correctly decoded complementary shape data is available are concealed first. To do this for a given corrupted VOP, two steps are needed:

1. Creation of complementary alpha plane: to begin with, a complementary alpha plane, which corresponds to the union of all the video objects in the scene except for the one currently being concealed, is created.
2. Determination of shapel transparency values: afterwards, each corrupted shapel (i.e. shape element) of the VOP being concealed is set to the opposite transparency value of the corresponding shapel in the complementary alpha plane. Since the complementary alpha plane can also have corrupted parts, this is only done if the required data is uncorrupted.

This whole procedure is repeated for all video objects with corrupted shape (a sketch illustrating these two steps is given at the end of this subsection). It should be noted that, for those parts of the corrupted shape for which complementary data is available, this type of concealment recovers the corrupted shape without any distortion with respect to the original shape, which does not happen in the second phase, described in the next subsection.

In order to recover the texture data associated with the opaque parts of the shape data that has just been concealed, a combination of global and local motion (first proposed in [46]) is used. To do this for a given VOP, four steps are needed:

1. Global motion parameters computation: to begin with, the correctly decoded shape and texture data, as well as the shape data that was just concealed, are considered in order to locally compute global motion parameters for the VOP being concealed.
2. Global motion compensation: then the computed global motion parameters can be used to motion-compensate the VOP of the previous time instant.
3. Concealment of corrupted data: that way, the texture data associated with the opaque parts of the shape data that has just been concealed is obtained by copying the co-located texture in the motion-compensated previous VOP.
4. Local motion refinement: since the global motion model cannot always accurately describe the object motion, due to the existence of local motion in some areas of the object, a local motion refinement scheme is applied. In this scheme, the available data surrounding the corrupted data being concealed is used to determine if any local motion exists and, if so, to refine the concealment.
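The first phase of the shape concealment can be illustrated with a minimal sketch. It assumes binary alpha planes and corruption masks stored as NumPy arrays; all names are hypothetical, and the texture recovery by global and local motion compensation is deliberately omitted.

```python
import numpy as np

def conceal_shape_from_complement(alpha, corrupted, other_alphas, other_corrupted):
    """First-phase shape concealment for one VOP (illustrative sketch).

    alpha           : binary alpha plane of the VOP being concealed (1 = opaque)
    corrupted       : boolean mask of shapels lost in this VOP
    other_alphas    : list of binary alpha planes of the remaining objects in the scene
    other_corrupted : corresponding boolean masks of lost shapels in those objects
    """
    # Step 1: complementary alpha plane = union of all the other objects in the scene.
    complement = np.zeros_like(alpha)
    for a in other_alphas:
        complement |= a

    # A complementary shapel is usable only where none of the other objects is corrupted.
    complement_valid = ~np.logical_or.reduce(other_corrupted)

    # Step 2: corrupted shapels take the opposite transparency of the complementary plane,
    # but only where the complementary data is itself uncorrupted.
    usable = corrupted & complement_valid
    concealed = alpha.copy()
    concealed[usable] = 1 - complement[usable]

    # Shapels that remain corrupted are left for the second phase (no complementary data).
    still_corrupted = corrupted & ~complement_valid
    return concealed, still_corrupted
```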

Shape and Texture Concealment for Which No Complementary Shape Data is Available

In this phase, the remaining corrupted shape data, which could not be concealed in the previous phase because no complementary shape data was available in surrounding objects, will be concealed. The texture data associated with the opaque parts of the concealed shape will also be recovered. This phase is divided into two steps:

1. Individual concealment of video objects: since the remaining corrupted shape of the various video objects in the scene has no complementary data available to be used for concealment, the remaining corrupted shape and texture data will be concealed independently of the surrounding objects. This can be done by using any of the available techniques in the literature. Here, however, to take advantage of the high temporal redundancy of the video data, individual concealment of video objects will be carried out by using a combination of global and local motion-compensation concealment, as proposed in [46]. This technique is applied to conceal both the shape and the texture data of the corrupted video object under consideration.
2. Elimination of scene artifacts by refinement of the object concealment results: as a result of the previous step, holes or object overlaps may appear in the scene, since objects have been processed independently. The regions that correspond to holes are considered undefined, in the sense that they do not yet belong to any object (i.e. shape and texture are undefined). As for the regions where objects overlap, they will also be considered undefined and treated the same way as holes, because a better method to deal with them (i.e. one that would work consistently for most situations) has not been found. In this last step, these undefined regions are divided among the video objects around them. To do this, a morphological filter based on the dilation operation [45] is cyclically applied to the N objects in the scene, A_1, A_2, ..., A_N, until all undefined regions disappear. The morphological operation to be applied to object A_j is as follows:

\left[ A_j \oplus B \right] - \left( \left[ A_j \oplus B \right] \cap \bigcup_{i=1, i \neq j}^{N} A_i \right)    (5.29)

The 3 x 3 structuring element B that is used for the dilation operation \oplus is shown in Figure 5.10; it is the cross-shaped element with rows (0 1 0), (1 1 1), (0 1 0).

Figure 5.10 Structuring element used for the dilation operation in the refinement of individual concealment results. Reproduced by permission of ©2008 IEEE

By cyclically applying this filter, the undefined regions are progressively absorbed by the objects around them until finally they disappear, as illustrated in Figure 5.11.
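A minimal sketch of this cyclic dilation-based refinement is given below. It assumes binary object masks and an undefined-region mask stored as NumPy arrays, and uses SciPy's binary_dilation with the cross-shaped structuring element of Figure 5.10; the stopping condition, the maximum iteration count, and the object ordering are illustrative choices, not prescribed by the text above.

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Cross-shaped 3x3 structuring element B of Figure 5.10.
B = np.array([[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]], dtype=bool)

def absorb_undefined_regions(objects, undefined, max_iters=1000):
    """Cyclically apply Equation (5.29) until no undefined shapels remain.

    objects   : list of N boolean masks A_1, ..., A_N (object supports)
    undefined : boolean mask of shapels not yet assigned to any object
    """
    objects = [obj.copy() for obj in objects]
    undefined = undefined.copy()
    for _ in range(max_iters):
        if not undefined.any():
            break
        for j, Aj in enumerate(objects):
            others = np.logical_or.reduce([objects[i] for i in range(len(objects)) if i != j])
            grown = binary_dilation(Aj, structure=B)
            # (A_j dilated by B) minus its intersection with the union of the other objects,
            # restricted to shapels that are still undefined.
            gained = grown & ~others & undefined
            objects[j] = Aj | gained
            undefined &= ~gained
    return objects
```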

Figure 5.11 Elimination of an undefined region by morphological filtering: (a) initial undefined region; (b) undefined region is shrinking; (c) undefined region has been eliminated. Reproduced by permission of ©2008 IEEE

The final result, however, depends on the ordering of the objects in this cycle, but since the region to which this operation is applied is typically very small, the differences will hardly be visible. To estimate the texture values of the pixels in these new regions, an averaging procedure is used. This way, in each iteration of the above-mentioned morphological operation, the texture of the pixels which correspond to the shapels that have been absorbed is estimated by computing the mean of the adjacent 4-connected neighbors that were already included in the object. Since the regions over which texture concealment is necessary are typically very small, this procedure is adequate.

5.6.3 Performance Evaluation

In order to illustrate the performance of the proposed shape-concealment process, Figure 5.12 should be considered. In this example, the three video objects in Figure 5.12(a) have been corrupted, as shown in Figure 5.12(b). In the remainder of Figure 5.12, the various steps of the concealment process are shown, leading to the final concealed video objects in Figure 5.12(f). To compare these video objects with the original ones in Figure 5.12(a), the Dn and PSNR metrics used by MPEG [46] may be used for shape and texture, respectively. The Dn metric is defined as:

D_n = \frac{\text{shapels differing in concealed and original shapes}}{\text{opaque shapels in original shape}}    (5.30)

which can also be expressed as a percentage, D_n[%] = 100 x D_n. As for the PSNR metric, since arbitrarily-shaped video objects are used, it is only computed over the pixels that belong to both the decoded VOP being evaluated and the original VOP.

The obtained Dn values are 0.01%, 0.15%, and 0.12%, respectively, for the Background, Dancers, and Speakers video objects shown in Figure 5.12(f). The corresponding PSNR values are 37.58 dB, 26.20 dB, and 30.27 dB; the uncorrupted PSNR values are 38.25 dB, 33.51 dB, and 34.18 dB, respectively.
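As an aside, the two quality measures above are straightforward to compute. The sketch below shows one possible implementation of Dn and of the PSNR restricted to the common object support, assuming binary alpha planes and 8-bit luma arrays in NumPy; it is an illustration, not the MPEG reference implementation.

```python
import numpy as np

def dn_metric(original_alpha, concealed_alpha):
    """Dn of Equation (5.30): differing shapels over opaque shapels in the original shape."""
    differing = np.count_nonzero(original_alpha != concealed_alpha)
    opaque = np.count_nonzero(original_alpha)
    return differing / opaque

def object_psnr(original_luma, decoded_luma, original_alpha, decoded_alpha, peak=255.0):
    """PSNR computed only over pixels belonging to both the original and the decoded VOP."""
    support = (original_alpha > 0) & (decoded_alpha > 0)
    diff = original_luma[support].astype(np.float64) - decoded_luma[support].astype(np.float64)
    mse = np.mean(diff ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```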

Figure 5.12 The concealment process for the Dancers sequence: (a) original uncorrupted video objects (Background, Dancers, Speakers); (b) corrupted video objects; (c) video objects after the corrupted data for which complementary data exists has been concealed; (d) video objects after individual concealment; (e) undefined regions that appear after individual concealment (shown in grey); (f) final concealed video objects. Reproduced by permission of ©2008 IEEE

As can be seen, although the shapes and textures of these video objects have been severely corrupted, the results are quite impressive, especially when compared to what is typically achieved by independent concealment alone. The main reason for such a great improvement is the use of the complementary shape data from surrounding objects during the concealment process, which does not happen when only independent concealment is performed.

5.6.4 Conclusions

In this section, a shape and texture concealment technique for segmented object-based video scenes, such as those based on the MPEG-4 standard, was proposed. Results were presented, showing the ability of this technique to recover lost data in segmented video scenes with rather small distortion.

Therefore, with this technique, it should be possible for object-based video applications (with more than one object) to be deployed in error-prone environments with an acceptable visual quality.

5.7 An Integrated Error-resilient Object-based Video Coding Architecture

(Portions reprinted, with permission, from M. Tagliasacchi, G. Valenzise, S. Tubaro, "Minimum variance optimal rate allocation for multiplexed H.264/AVC bitstreams," IEEE Transactions on Image Processing, Vol. 17, Issue 7, July 2008, pages 1129-1143. ©2008 IEEE.)

5.7.1 Problem Definition and Objectives

As explained in Sections 5.5 and 5.6, in order to make possible new object-based video services, such as those based on the MPEG-4 object-based audiovisual coding standard, in error-prone environments, appropriate error-resilience techniques are needed. By combining several complementary error-resilience techniques at both sides of the communication chain, it is possible to further improve the error resilience of the whole system. Therefore, the purpose of the work described here is to propose an object-based video coding architecture in which complementary error-resilience tools, most of which have been previously proposed in VISNET I, are integrated for all the relevant modules.

5.7.2 Proposed Technical Solution

The proposed object-based architecture, in which the various proposed error-resilience techniques will be integrated, is shown in Figure 5.13. At the encoder side, after the video scene has been defined, the coding of video objects is supervised by a resilience configuration module, which is responsible for choosing the most adequate coding parameters in terms of resilience. This is important because the decoding performance will very much depend on the kinds of protective action the encoder has taken. The output of the various video object encoders will then be multiplexed and sent through the channel in question.

At the decoder side, the procedure is basically the opposite, but, instead of a scene-definition module, a scene-composition module is used. In order to minimize the negative subjective impact of channel errors in the composed presentation, defensive actions have to be taken by the decoder. These include error detection and error localization in each video object decoder, followed by (object-level) error concealment [47]. At this point, error concealment is applied independently to each video object, and is therefore called object-level error concealment; it can be of several types, depending on the data that is used. Afterwards, a more advanced type of concealment may also be performed in the scene-concealment module, which has access to all the video objects present in the scene: this is the so-called scene-level error concealment. The final concealed video scene is presented to the user by the composition module.

In this context, several error-resilience techniques are suggested below for the most important modules of the integrated object-based video coding architecture. Since the suggested techniques have already been individually proposed in the literature by the involved partners, only a brief description of each is provided, in order to explain their role in the context of the architecture; to get the full details on these tools, the reader should refer to the relevant references.

Figure 5.13 Error-resilient object-based video coding architecture (OSC: Object Spatial Concealment; OTC: Object Temporal Concealment). Reproduced by permission of ©2008 IEEE

5.7.2.1 Encoder-side Object-based Error Resilience

It is largely recognized that intra-coding refreshment can be used at the encoder side to improve error resilience in video coding systems that rely on predictive (inter-) coding. In these systems, the decoded quality can decay very rapidly due to long-lasting channel error propagation, which can be avoided by using an intra-coding refreshment scheme at the encoder to refresh the decoding process and stop (spatial and temporal) error propagation. This will decrease the coding efficiency, but it will significantly improve error resilience at the decoder side, increasing the overall video subjective impact [47].

Object-based Refreshment Need Metrics

In order to design an efficient intra-coding refreshment scheme for an object-based video coding system, it would be helpful to have at the encoder side a method to determine which components of the video data (shape and texture) of which objects should be refreshed, and when.

With this in mind, shape-refreshment-need and texture-refreshment-need metrics have been proposed in [48]. These refreshment-need metrics have been shown to correctly express the necessity of refreshing the corresponding video data (shape or texture) according to some error-resilience criteria; therefore, they can be used by the resilience configuration module at the encoder side to efficiently decide whether or not some parts of the video data (shape or texture) of some video objects should be refreshed at a certain time instant. By doing so, significant improvements should be possible in the video subjective impact at the decoder side, since the decoder gets a selective amount of refreshment "help" depending on the content and its (decoder) concealment difficulty.

Adaptive Object-based Video Coding Refreshment Scheme

Based on the refreshment-need metrics described in the previous section, the resilience configuration module should decide which parts of the shape and texture data of the various objects should be refreshed at each time instant. To do so, the adaptive shape and texture intra-coding refreshment scheme proposed in [50] can be used. This scheme considers a multi-object video scene, and its target is to efficiently control the shape and texture refreshment rate for the various video objects, depending on their refreshment needs, which are related to the concealment difficulty at the decoder. As shown in [50], when this technique is used, the overall video quality is improved for a certain total bit rate when compared to cases where less sophisticated refreshment schemes are used (e.g. a fixed intra-refreshment period for all the objects). This happens because objects with low refreshment needs (i.e. easy to conceal at the decoder when errors occur) can be refreshed less often without a significant reduction in their quality, thus saving refreshment resources. The saved resources are then used to improve the quality of objects with high refreshment needs (i.e. hard to conceal when errors occur) by refreshing them more often.

5.7.2.2 Decoder-side Object-based Error Resilience

With the techniques described in Section 5.7.2.1, considerable improvements can be obtained in terms of the decoded video quality. However, further improvements can still be achieved by also using sophisticated shape and texture error-concealment techniques at the decoder side. Since different approaches exist in terms of error concealment, several types of error-concealment technique may have to be used. These include object-level techniques to be used by the individual video object decoders, as well as scene-level techniques to be used by the scene-concealment module.

Spatial Error Concealment at the Object Level

The error-concealment techniques described in this section are object-level spatial (or intra-) techniques, in the sense that they do not rely on information from other time instants and only use the available information for the video object being concealed at the time instant in question. Two different techniques of this type are needed for a video object: one for shape and one for texture.

In terms of spatial shape error concealment, the technique proposed in [51] may be used. This technique is based on contour interpolation and has been shown to achieve very good results. It considers that, in object-based video, the loss of shape data corresponds to broken contours, which have to be interpolated.

By interpolating these broken contours with Bezier curves, the authors have shown that it is possible to recover the complete shape with good accuracy, since most contours in natural video objects typically have a rather slow direction variation. After the broken contours have been recovered, it is fairly easy to recover the values of the missing shapels (pixels in the shape masks) from the neighboring ones, by using an adequate continuity criterion and then filling in the shape. This shape error-concealment technique is illustrated in Figure 5.14, where the lost shape data is shown in gray throughout the example.

Figure 5.14 The spatial shape concealment process: (a) lost data surrounded by the available shape data; (b) lost data surrounded by the available contours; (c) interpolated contours inside the lost area; (d) recovered shape. Reproduced by permission of ©2008 IEEE

In terms of spatial texture concealment, the technique proposed in [54] may be used. This technique, which is the only one in the literature for object-based systems, consists of two steps. In the first step, padding is applied to the available texture in order to extend it beyond the object boundaries. This is done in order to facilitate the second step, where the lost pixel values are estimated based on the available surrounding texture data (including the extended texture). For this second step, two different approaches have been proposed, for which results have been presented showing their ability to recover lost texture data in a quite acceptable way. One approach is based on the linear interpolation of the available pixel values, and the other is based on the weighted median of the available pixel values. The results that can be achieved with both approaches are illustrated in Figure 5.15.

Figure 5.15 A spatial texture concealment example: (a) corrupted object; (b) original object; (c) concealment with linear interpolation; (d) concealment with weighted median
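The second of these two estimation approaches can be sketched very compactly. The fragment below is an illustrative NumPy implementation, not the authors' code: it conceals each lost pixel with the weighted median of the already available pixels in a small window, weighting closer neighbors more heavily; the padding step, the window size, and the exact weights are assumptions.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median: smallest value whose cumulative weight reaches half the total."""
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    return values[np.searchsorted(cum, 0.5 * cum[-1])]

def conceal_texture(luma, lost, radius=2):
    """Fill each lost pixel from the available pixels in a (2*radius+1)^2 neighborhood."""
    out = luma.astype(np.float64).copy()
    h, w = luma.shape
    ys, xs = np.nonzero(lost)
    for y, x in zip(ys, xs):
        vals, wts = [], []
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not lost[ny, nx]:
                    vals.append(out[ny, nx])
                    wts.append(1.0 / (abs(dy) + abs(dx)))   # closer neighbors weigh more
        if vals:
            out[y, x] = weighted_median(np.array(vals), np.array(wts))
    return out
```

The linear-interpolation variant mentioned above would simply replace the weighted median with a distance-weighted average of the same available neighbors.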

Temporal Error Concealment at the Object Level

This section is devoted to object-level temporal (or inter-) error-concealment techniques, in the sense that they rely on information, for the object at hand, from temporal instants other than the current one. Since in most video objects the data does not change that much in consecutive time instants, these techniques are typically able to achieve better concealment results than spatial techniques.

In terms of object-level temporal shape error concealment, the technique proposed in [49] may be integrated in the proposed architecture. This technique is based on a combination of global and local motion compensation. It starts by assuming that the shape changes occurring in consecutive time instants can be described by a global motion model, and simply tries to conceal the corrupted shape data by using the corresponding shape data in the global motion-compensated previous shape. Assuming that the global motion model can accurately describe the shape motion, this alone should be able to produce very good results, as illustrated in Figure 5.16. However, in many cases, such as the one illustrated in Figure 5.17, the shape motion cannot be accurately described by global motion alone, due to the existence of strong local motion in some areas of the (non-rigid) shape.

Figure 5.16 The temporal shape concealment process with low local motion: (a) original uncorrupted shape; (b) corrupted shape; (c) motion-compensated previous shape; (d) concealed shape without local motion refinement

Figure 5.17 The temporal shape concealment process with high local motion: (a) original uncorrupted shape; (b) corrupted shape; (c) motion-compensated previous shape; (d) concealed shape without local motion refinement; (e) concealed shape with local motion refinement

Therefore, to avoid significant differences when concealing erroneous areas with local motion, an additional local motion refinement scheme has been introduced. Since the technique presented in [49] also works for texture data, it may also be integrated in the proposed architecture to conceal corrupted texture data.

Adaptive Spatio-temporal Error Concealment at the Object Level

The main problem with the two previous error-concealment techniques is that neither can be used with acceptable results in all possible situations. On one hand, spatial error-concealment techniques are especially useful when the video data changes greatly in consecutive time instants, such as when new objects appear. On the other hand, temporal error-concealment techniques are typically able to achieve better concealment results when the video data does not change much in consecutive time instants. Therefore, by designing a scheme that adaptively selects one of the two concealment techniques, it should be possible to obtain the advantages of both solutions while compensating for their disadvantages. An adaptive spatio-temporal technique has been proposed in [55] by the involved partners, and may be integrated in the proposed architecture. By using this spatio-temporal concealment technique, the concealment results can be significantly improved, as can be seen in Figure 5.18 for shape data.

Figure 5.18 The adaptive spatio-temporal error concealment process: (a) uncorrupted original shape; (b) corrupted shape; (c) shape concealed with the spatial concealment technique; (d) shape concealed with the temporal concealment technique; (e) shape concealed with the adaptive spatio-temporal concealment technique
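One way such an adaptive selection could be realized is sketched below. The switching criterion used here (measuring how much the shape changed in the correctly received area, as a proxy for temporal activity) is purely illustrative and is not the criterion defined in [55]; the two concealment routines are passed in as callables.

```python
import numpy as np

def adaptive_conceal_shape(received_alpha, lost, prev_alpha,
                           spatial_conceal, temporal_conceal, threshold=0.05):
    """Choose between spatial and temporal shape concealment for one VOP (illustrative).

    received_alpha : current alpha plane, with corrupted shapels marked by the mask `lost`
    prev_alpha     : alpha plane of the previous time instant
    spatial_conceal, temporal_conceal : callables implementing the two techniques
    threshold      : assumed mismatch level above which temporal data is distrusted
    """
    # Proxy for temporal activity: shape change measured over the correctly received area.
    valid = ~lost
    mismatch = np.count_nonzero(received_alpha[valid] != prev_alpha[valid]) / np.count_nonzero(valid)

    if mismatch <= threshold:
        # Shape is nearly static: temporal concealment is expected to work better.
        return temporal_conceal(received_alpha, lost, prev_alpha)
    # Strong change (e.g. a new or fast-deforming object): fall back to spatial concealment.
    return spatial_conceal(received_alpha, lost)
```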

Scene-level Error Concealment

The error-concealment techniques described until now all have a serious limitation in common, which is the fact that each video object is independently considered, without ever taking into account the scene context in which the objects are inserted. After all, just because a concealed video object has a pleasing subjective impact on the user when it is considered on its own, it does not necessarily mean that the subjective impact of the whole scene will be acceptable, particularly when (segmented) objects should fit together as in a jigsaw puzzle. An example of this situation was illustrated in Section 5.6.

Two techniques have been proposed, in [52] and [53] (that in [53] corresponds to the technique described in Section 5.6), to deal with this kind of problem; they may be used in the proposed architecture by the scene-concealment module. In these techniques, which target segmented video scenes, the corrupted parts of the data for which correctly decoded complementary data is available in the neighboring objects are concealed first; this is especially effective for shape data. Then, for the remaining parts of the data, which cannot be concealed in this way, object-level concealment techniques, such as those described before, are applied to each object. While the technique in [52] considers a spatial approach for this step, that in [53] considers a temporal approach. As a result of this step, holes or object overlaps may appear in the scene, since objects have been independently processed. Therefore, to eliminate these artifacts in the scene, a final refinement step is applied, based on a morphological filter for the shape and on a pixel-averaging procedure for the texture.

5.7.3 Performance Evaluation

Since illustrative results of what is possible with the different techniques proposed for each module of the integrated error-resilient architecture have already been shown in Section 5.7.2, further results will not be given here.

5.7.4 Conclusions

An integrated error-resilient object-based video coding architecture has been proposed. This is of the utmost importance because it can make the difference between having acceptable-quality video communications in error-prone environments and not. The work presented here corresponds to a complete error-resilient object-based video coding architecture, and the various parts of the system have been thoroughly investigated.

5.8 A Robust FMO Scheme for H.264/AVC Video Transcoding

5.8.1 Problem Definition and Objectives

As explained in Section 5.2, the H.264/AVC standard [56] includes several error-resilience tools, whose proper use and optimization is left open to the codec designer. In the work described here, the focus is on studying adaptive FMO schemes to enhance the robustness of pre-encoded video material.

5.8.2 Proposed Technical Solution

According to the H.264/AVC syntax, each frame is partitioned into one or more slices, and each slice contains a variable number of MBs. FMO is a coding tool supported by the standard that enables the arbitrary assignment of each MB to the desired slice.

FMO can be efficiently combined with FEC-based channel coding to provide unequal error protection (UEP). The basic idea is that the most important slice(s) can be assigned a stronger error-correcting code. The goal is to design efficient algorithms that can be used to provide a ranking of the MBs within a frame. The ranking order is determined by the error induced by the loss of the MB at the decoder side. In other words, those MBs that, if lost, cause a large increase in distortion should be given higher protection. The total increase in distortion, measured in terms of MSE, can be factored as follows:

D_{tot}(t, i) = D_{MV}(t, i) + D_{res}(t, i) + D_{drift}(t, i)    (5.31)

where t is the frame index and i the MB index, and:

- D_MV(t,i) is the additional distortion due to the fact that the correct motion vector is not available, and needs to be replaced by the concealed one at the decoder.
- D_res(t,i) is the additional distortion due to the fact that the residuals of the current MB are lost.
- D_drift(t,i) is the drift introduced by the fact that the reference frame the MB refers to might be affected by errors.

In the first part of the work, the focus is on the D_MV(t,i) term only. Given a fraction m in [0,1] of the MBs that can be protected, the goal consists of identifying those MBs for which motion-compensated concealment at the decoder cannot reliably estimate the lost motion vector. Three approaches are considered and compared:

- Random selection of MBs: a fraction of m MBs is selected at random within the frame.
- Selection based on simulated concealment at the encoder: for each MB, the encoder simulates motion-compensated concealment, as if only the current MB were lost. The encoder computes the sum of absolute differences (SAD) between the original MB and the concealed one. The fraction of m MBs with the highest value of the SAD is selected. It is worth pointing out that this solution is computationally demanding, because it requires the execution of the concealment algorithm at the encoder for each MB in the frame.
- Selection based on motion activity: for each MB, a motion-activity metric is computed. This metric is based on the values of the motion vectors in neighboring MBs. If the neighboring MBs are split into 4 x 4 blocks, up to 16 neighboring motion vectors can be collected. If larger blocks (8 x 8, 16 x 16) exist, their motion vectors are assigned to each of the constituent 4 x 4 blocks, in such a way that a list of 16 motion vectors, i.e. mv_x = [mv_x,1, mv_x,2, ..., mv_x,16], mv_y = [mv_y,1, mv_y,2, ..., mv_y,16], is always obtained. The activity index is computed as follows (a short sketch of this computation is given after this list):
  - Sort the motion vector x (y) components.
  - Discard the first four and the last four motion vector components in each list.
  - Compute the standard deviation of the remaining eight motion vector components.
  - Average the standard deviations of the x and y components.
  Once the activity index has been computed for each MB, the algorithm selects the fraction of m MBs with the highest index. With respect to the previous case, this algorithm is fast and requires the evaluation of data related to the motion field only.
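The motion-activity index described above maps directly onto a few lines of code. The sketch below assumes that the 16 neighboring motion vectors have already been gathered into two lists of x and y components; variable names and the selection helper are illustrative.

```python
import numpy as np

def motion_activity_index(mv_x, mv_y):
    """Trimmed-variance motion activity of a macroblock from its 16 neighboring MVs."""
    def trimmed_std(components):
        comps = np.sort(np.asarray(components, dtype=np.float64))  # sort the 16 components
        comps = comps[4:-4]                                        # discard first and last four
        return comps.std()                                         # std dev of the remaining eight
    return 0.5 * (trimmed_std(mv_x) + trimmed_std(mv_y))           # average of x and y std devs

def select_mbs_to_protect(activity_per_mb, m):
    """Return indices of the fraction m of MBs with the highest activity index."""
    n_protect = int(round(m * len(activity_per_mb)))
    order = np.argsort(activity_per_mb)[::-1]                      # highest activity first
    return sorted(order[:n_protect].tolist())
```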

5.8.3 Performance Evaluation

In order to test the performance of the proposed algorithms, a simulation is performed as follows. Each sequence is encoded with a constant quantization parameter (QP = 24), and the dispersed slice grouping is selected. The number of slices per frame is set equal to 9, and 50 channel simulations are carried out for the target packet loss rate (PLR), equal to 10%. At the encoder, for each value of m, one of the three aforementioned approaches is used to select the MBs to be protected. For these MBs, at the decoder, the correct motion vector is used instead of the concealed one.

Figure 5.19 shows the results obtained when random selection is performed. The x-axis indicates the percentage of protected MBs, i.e. 100m%, while the y-axis shows the average PSNR across the sequence and over the channel simulations. The dashed line represents the average PSNR when no channel errors are introduced, while the solid line represents the average PSNR when errors occur and motion-compensated concealment is performed at the decoder (the algorithm included in the JM reference software is used for this purpose). In this test, the case in which a fraction of the bitplanes of the motion vectors are correctly recovered is simulated. Motion vector components are represented as 16-bit integers, where the last two bits are used for the fractional part to accommodate 1/4-pixel accuracy. The "16 Btpln" label indicates that the motion vector is completely recovered, the "14 Btpln" line indicates that only the integer part of the motion vector is recovered, and so on. The rationale behind this test is that a DVC scheme based on turbo codes could be used to protect the most significant bitplanes of the motion vectors.

Figure 5.19 illustrates a ramp-like behavior, which is to be expected since the MBs are selected at random. Conversely, both Figure 5.20 and Figure 5.21 show that, for a given value of m, a larger average PSNR can be attained by a careful selection of the MBs. Also, the selection based on simulated concealment at the encoder performs best, but it has a higher computational complexity.

Figure 5.19 Random selection of MBs (Foreman sequence; average PSNR (dB) versus percentage of corrected MBs)

Figure 5.20 Selection based on simulated concealment at the encoder (Foreman sequence; average PSNR (dB) versus percentage of corrected MBs)

These results suggest that it is more efficient to protect a fraction of the MBs at full precision than the full frame at a coarser accuracy, which indicates that a DVC-based protection of a fraction of the bitplanes is impractical.

5.8.4 Conclusions

An adaptive FMO technique for H.264 was proposed and analyzed in this section. FMO is an important error-resilience tool, which arbitrarily assigns MBs to slices. In the proposed scheme, an efficient algorithm has been designed to provide a ranking of the MBs within a frame.

Figure 5.21 Selection based on motion activity (Foreman sequence; average PSNR (dB) versus percentage of corrected MBs)

Three selection methods are considered: random selection of MBs; selection based on simulated concealment at the encoder; and selection based on motion activity. The proposed algorithms were simulated and the results compared, showing that the decoded objective video quality improves with a careful selection of the MBs. Finally, the selection based on simulated concealment at the encoder performs best.

5.9 Conclusions

(Portions reprinted, with permission, from M. Naccari, G. Bressan, M. Tagliasacchi, F. Pereira, S. Tubaro, "Unequal error protection based on flexible macroblock ordering for robust H.264/AVC video transcoding," Picture Coding Symposium, Lisbon, November 2007. ©2007 EURASIP.)

The techniques described in this chapter correspond to the non-normative areas of video coding standards. They deal with error resilience and rate control, and can readily be applied to existing codecs without compromising standard compatibility, since they work within the realm of the standard. While the objective of error-resilience techniques is to improve the performance of the video coding system in the presence of channel errors, rate control techniques improve performance by adequately distributing the available bit-rate resources in space and time. Research in the field of non-normative video coding tools up till now has shown how important these tools are in getting the best performance from normative video coding standards. The non-normative video coding tools are the "decision makers" that any video coding standard needs in order to find the "right path" through a flexible coding syntax, for example in the most powerful coding standards like MPEG-4 Visual and H.264/AVC.

References

[1] ITU-T Recommendation H.261 (1993), "Video codec for audiovisual services at p x 64 kbps," 1993.
[2] CCITT SGXV, "Description of reference model 8 (RM8)," Doc. 525, Jun. 1989.
[3] ISO/IEC 11172-2:1993, "Information technology: coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbps: part 2: video," 1993.
[4] E. Viscito and C. Gonzales, "A video compression algorithm with adaptive bit allocation and quantization," Proc. Visual Communications and Image Processing (VCIP'91), Boston, MA, Vol. 1605, pp. 58-72, 1991.
[5] ISO/IEC 13818-2:1996, "Information technology: generic coding of moving pictures and associated audio information: part 2: video," 1996.
[6] MPEG Test Model Editing Committee, "MPEG-2 Test Model 5," Doc. ISO/IEC JTC1/SC29/WG11 N400, Sydney MPEG meeting, Apr. 1993.
[7] ITU-T Recommendation H.263 (1996), "Video coding for low bitrate communication," 1996.
[8] ITU-T/SG15, "Video codec test model, TMN8," Doc. Q15-A-59, Portland, OR, Jun. 1997.
[9] MPEG Video, "MPEG-4 video verification model 5.0," Doc. ISO/IEC JTC1/SC29/WG11 N1469, Maceió MPEG meeting, Nov. 1996.
[10] T. Chiang and Y.Q. Zhang, "A new rate control scheme using quadratic rate distortion model," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, pp. 246-250, Feb. 1997.
[11] MPEG Video, "MPEG-4 video verification model 8.0," Doc. ISO/IEC JTC1/SC29/WG11 N1796, Stockholm MPEG meeting, Jul. 1997.
[12] A. Vetro, H. Sun, and Y. Wang, "MPEG-4 rate control for multiple video objects," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, pp. 186-199, Feb. 1999.
[13] J. Ronda, M. Eckert, F. Jaureguizar, and N. García, "Rate control and bit allocation for MPEG-4," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, pp. 1243-1258, Dec. 1999.

[14] Y. Sun and I. Ahmad, "A robust and adaptive rate control algorithm for object-based video coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, pp. 1167-1182, Oct. 2004.
[15] Y. Sun and I. Ahmad, "Asynchronous rate control for multi-object videos," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 8, pp. 1007-1018, Aug. 2005.
[16] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.
[17] Z.G. Li, F. Pan, K.P. Lim, G. Feng, X. Lin, and S. Rahardaj, "Adaptive basic unit layer rate control for JVT," Doc. JVT-G012, 7th meeting, Pattaya, Thailand, Mar. 2003.
[18] Z. He and S.K. Mitra, "A unified rate distortion analysis framework for transform coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1221-1236, Dec. 2001.
[19] ISO/IEC 14496-2:2001, "Information technology: coding of audio-visual objects: part 2: visual," 2001.
[20] L.D. Soares and F. Pereira, "Refreshment need metrics for improved shape and texture object-based resilient video coding," IEEE Transactions on Image Processing, Vol. 12, No. 3, pp. 328-340, Mar. 2003.
[21] L.D. Soares and F. Pereira, "Adaptive shape and texture intra refreshment schemes for improved error resilience in object-based video," IEEE Transactions on Image Processing, Vol. 13, No. 5, pp. 662-676, May 2004.
[22] S. Shirani, B. Erol, and F. Kossentini, "A concealment method for shape information in MPEG-4 coded video sequences," IEEE Transactions on Multimedia, Vol. 2, No. 3, pp. 185-190, Sep. 2000.
[23] L.D. Soares and F. Pereira, "Spatial shape error concealment for object-based image and video coding," IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 586-599, Apr. 2004.
[24] G.M. Schuster, X. Li, and A.K. Katsaggelos, "Shape error concealment using Hermite splines," IEEE Transactions on Image Processing, Vol. 13, No. 6, pp. 808-820, Jun. 2004.
[25] P. Salama and C. Huang, "Error concealment for shape coding," Proc. IEEE International Conference on Image Processing, Rochester, NY, Vol. 2, pp. 701-704, Sep. 2002.
[26] L.D. Soares and F. Pereira, "Motion-based shape error concealment for object-based video," Proc. IEEE International Conference on Image Processing, Singapore, Oct. 2004.
[27] L.D. Soares and F. Pereira, "Combining space and time processing for shape error concealment," Proc. Picture Coding Symposium, San Francisco, CA, Dec. 2004.
[28] L.D. Soares and F. Pereira, "Spatial texture error concealment for object-based image and video coding," Proc. EURASIP Conference on Signal and Image Processing, Multimedia Communications and Services, Smolenice, Slovakia, Jun. 2005.
[29] ISO/IEC 14496-2:2001, "Information technology: coding of audio-visual objects: part 2: visual," 2001.
[30] P. Nunes and F. Pereira, "Joint rate control algorithm for low delay MPEG-4 object-based video encoding," IEEE Transactions on Circuits and Systems for Video Technology (submitted).
[31] ISO/IEC 14496-5:2001, "Information technology: coding of audio-visual objects: part 5: reference software," 2001.
[32] MPEG Video, "MPEG-4 video verification model 5.0," Doc. N1469, Maceió MPEG meeting, Nov. 1996.
[33] A. Vetro, H. Sun, and Y. Wang, "MPEG-4 rate control for multiple video objects," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 1, pp. 186-199, Feb. 1999.
[34] J. Ronda, M. Eckert, F. Jaureguizar, and N. García, "Rate control and bit allocation for MPEG-4," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 9, No. 8, pp. 1243-1258, Dec. 1999.
[35] H.J. Lee, T. Chiang, and Y.Q. Zhang, "Scalable rate control for MPEG-4 video," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 6, pp. 878-894, Sep. 2000.
[36] P. Nunes and F. Pereira, "Scene level rate control algorithm for MPEG-4 video encoding," Proc. VCIP'01, San Jose, CA, Vol. 4310, pp. 194-205, Jan. 2001.
[37] Y. Sun and I. Ahmad, "A robust and adaptive rate control algorithm for object-based video coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 10, pp. 1167-1182, Oct. 2004.
[38] Y. Sun and I. Ahmad, "Asynchronous rate control for multi-object videos," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, No. 8, pp. 1007-1018, Aug. 2005.
[39] P. Nunes and F. Pereira, "Rate control for scenes with multiple arbitrarily shaped video objects," Proc. PCS'97, Berlin, Germany, pp. 303-308, Sep. 1997.
[40] G. Bjontegaard, "Calculation of average PSNR differences between RD curves," Doc. VCEG-M33, Austin, TX, Apr. 2001.
[41] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.

[42] T. Chiang and Y.Q. Zhang, "A new rate control scheme using quadratic rate distortion model," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, pp. 246-250, Feb. 1997.
[43] Z. He and S.K. Mitra, "A unified rate distortion analysis framework for transform coding," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1221-1236, Dec. 2001.
[44] ISO/IEC 14496-2, "Information technology: coding of audio-visual objects: part 2: visual," Dec. 1999.
[45] R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd Ed., Prentice Hall, 2002.
[46] L.D. Soares and F. Pereira, "Motion-based shape error concealment for object-based video," Proc. IEEE International Conference on Image Processing, Singapore, Oct. 2004.
[47] L.D. Soares and F. Pereira, "Error resilience and concealment performance for MPEG-4 frame-based video coding," Signal Processing: Image Communication, Vol. 14, Nos. 6-8, pp. 447-472, May 1999.
[48] L.D. Soares and F. Pereira, "Refreshment need metrics for improved shape and texture object-based resilient video coding," IEEE Transactions on Image Processing, Vol. 12, No. 3, pp. 328-340, Mar. 2003.
[49] L.D. Soares and F. Pereira, "Temporal shape error concealment by global motion compensation with local refinement," IEEE Transactions on Image Processing, Vol. 15, No. 6, pp. 1331-1348, Jun. 2006.
[50] L.D. Soares and F. Pereira, "Adaptive shape and texture intra refreshment schemes for improved error resilience in object-based video," IEEE Transactions on Image Processing, Vol. 13, No. 5, pp. 662-676, May 2004.
[51] L.D. Soares and F. Pereira, "Spatial shape error concealment for object-based image and video coding," IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 586-599, Apr. 2004.
[52] L.D. Soares and F. Pereira, "Spatial scene level shape error concealment for segmented video," Proc. Picture Coding Symposium, Beijing, China, Apr. 2006.
[53] L.D. Soares and F. Pereira, "Spatio-temporal scene level error concealment for shape and texture data in segmented video content," Proc. IEEE International Conference on Image Processing, Atlanta, GA, Oct. 2006.
[54] L.D. Soares and F. Pereira, "Spatial texture error concealment for object-based image and video coding," Proc. EURASIP Conference on Signal and Image Processing, Multimedia Communications and Services, Smolenice, Slovakia, Jun. 2005.
[55] L.D. Soares and F. Pereira, "Combining space and time processing for shape error concealment," Proc. Picture Coding Symposium, San Francisco, CA, Dec. 2004.
[56] ISO/IEC 14496-10:2003/ITU-T Recommendation H.264, "Advanced video coding (AVC) for generic audiovisual services," 2003.
[57] "Joint model reference encoding methods and decoding concealment methods," JVT-I049d0, San Diego, CA, Sep. 2003.



6 Transform-based Multi-view Video Coding

6.1 Introduction

Existing video coding standards are suitable for coding 2D videos in a rate-distortion optimized sense, where rate-distortion optimization refers to the process of jointly optimizing both the resulting image quality and the required bit rate. The basic principle of block-based video coders is to remove temporal and spatial redundancies among successive frames of video sequences. However, as the viewing experience moves from 2D viewing to more realistic 3D viewing, it will become impossible to reconstruct a realistic 3D scene by using a simple 2D video scene, or to allow a user (or a group of users) to freely and interactively navigate in a visual scene [1]. More viewpoints of a given scene would be needed in such cases, which conventional video coders are not optimized to code jointly without modifying some of the existing compression tools.

As the number of representations of a certain scene from different viewpoints increases, the size of the total payload to be transmitted will increase in the same proportion if every single representation (or viewpoint) is encoded as a separate 2D video. This method is highly impractical due to storage and bandwidth constraints. Therefore, other ways to jointly encode viewpoints have been developed, all of which exploit the strong correlations that exist among the different viewpoints of a certain scene. Exploitation of inter-view correlation reduces the number of bits required for coding. The correspondences existing among different viewpoints are called inter-view correspondences. A generic multi-view encoder should reduce such inter-view redundancies, as well as the temporal and intra-frame redundancies that are already well catered for by existing 2D video compression techniques.

The ISO/IEC JTC1/SC29/WG11 Moving Picture Experts Group (MPEG) has recognized the importance of multi-view video coding, and established an ad hoc group (AHG) on 3D audio and visual (3DAV) in December 2001 [2].

204 Visual Media Coding and Transmission audio and visual (3DAV) in December 2001 [2]. Four main exploration experiments were conducted in the 3DAV group between 2002 and 2004: 1. An exploration experiment on omni-directional video. 2. An exploration experiment on FTV. 3. An exploration experiment on coding of stereoscopic video using the multiple auxiliary component (MAC) of MPEG-4. 4. An exploration experiment on depth and disparity coding for 3D TV and intermediate view generation (view synthesis) [3]. After a Call for Comments was issued in October 2003, a number of companies claimed the need for a standard enabling the FTV and 3D TV systems. In October 2004 MPEG called interested parties to bring evidence on MVC technologies [4,5], and a Call for Proposals on MVC was issued in July 2005 [6], following acceptance of the evidence. Responses to the Call were evaluated in January 2006 [7], and an approach based on a modified AVC scheme was selected as the basis for standardization efforts. An important property of the codec under development is that it is capable of decoding AVC bitstreams as well as multi-view coded sequences. Multi-view coding (MVC) remains a part of the standardization activities within the body of the Joint Video Team (JVT) of ISO/MPEG and ITU/VCEG. Much of the work carried out so far is intended to compensate for the illumination level differences among multi-view cameras [8]. Such differences reduce the correlation between the different views, and therefore affect compression efficiency. Further research has been carried out with the aim of improving inter-view prediction quality. Reference frame-based techniques aim to adapt already existing motion estimation and compensation algorithms to remove inter-view redundancies. Basically, the disparity among different views is treated as motion in the temporal direction, and the same methods used to model motion fields are applied to model disparity fields. Utilization of the highly-efficient prediction structure based on a hierarchical decomposition of frames (in time domain) in the spatial (viewpoint) domain brings quite a lot of prediction efficiency overall [9]. The Joint Multi-view Video Model (JMVM) takes this type of prediction structure as its base prediction. Disparity-based approaches take into account the geometrical constraints. These constraints can be measured and used to try to improve the inter-view prediction efficiency by fully or partially exploiting the scene geometry. Intermediate representations of frames are generated or synthesized within these approaches, and used as potential prediction sources for coding. In [10] a disparity-based approach is set out, which relies on predicting frames from novel images synthesized from other viewpoints using their corresponding per-pixel depth information and known camera parameters. In [11] it is proposed to refer to the positional correspondence of the multi-view camera rig when interpolating intermediate frames. Accor- dingly, the scene geometry is exploited partially. The advantage of such a method is that neither depth/disparity fields nor the extrinsic and intrinsic parameters of the multi-camera arrays need to be calculated and sent. However, especially for free-viewpoint video applications [12,13], high-quality views need to be synthesized through depth image-based rendering techniques at arbitrary virtual camera locations, which necessitates the communication of scene depth information anyway. 
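To make the hierarchical decomposition mentioned above concrete, the short Python sketch below generates the coding order and reference assignments of a dyadic hierarchical B-picture structure for one GOP; the same kind of ordering, applied along the view axis, is what gives the inter-view hierarchy adopted by JMVM. The function name and the tuple layout are assumptions of this illustration, not part of any standard API.

def hierarchical_b_order(gop_size):
    """Dyadic hierarchical decomposition of a GOP, as used for hierarchical
    B pictures: the two anchor pictures are coded first, then the centre
    picture of every remaining interval, level by level.  Returns a list of
    (picture_index, level, left_reference, right_reference) tuples in coding
    order.  The GOP size is assumed to be a power of two (e.g. 8 or 16).
    """
    order = [(0, 0, None, None), (gop_size, 0, 0, None)]   # anchor pictures
    intervals, level = [(0, gop_size)], 1
    while intervals:
        next_intervals = []
        for left, right in intervals:
            if right - left < 2:
                continue
            mid = (left + right) // 2
            order.append((mid, level, left, right))          # B picture, two references
            next_intervals += [(left, mid), (mid, right)]
        intervals, level = next_intervals, level + 1
    return order

# Example: coding order 0, 8, 4, 2, 6, 1, 3, 5, 7 for a GOP of 8 pictures
for pic, lvl, l, r in hierarchical_b_order(8):
    print(f"picture {pic:2d}  level {lvl}  refs {l}, {r}")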
Another field of work concentrates on achieving efficient coding of multi-view depth information while preserving fine edge details in the depth image. The reconstruction quality of the depth information is shown to have a significant effect

Transform based Multi view Video Coding 205 on the quality of the synthesized free viewpoint images [14]. The authors of [15] have proposed a multi-view-plus-depth map coding system to handle both efficient compression and high-quality view generation at the same time. Similarly, in [16], the importance of considering multi- viewpoint videos jointly with their depth maps for efficient overall compression is stressed. A completely different data representation format for depth information is defined in [17], where the depth maps of a scene captured from a multiple-camera array are fused into the layers of a special representation for depth. This representation is called Layered Depth Image (LDI), and in [18,19] coding of such representations is investigated. Due to its different picture format, LDI is not easily or efficiently compressible with conventional video coders like AVC. A number of possible 3D video and free viewpoint video applications that include multiple viewpoint videos are subject to constraints. A large amount of the work performed on multi- view video coding has the aim of improving the compression efficiency. However, there are other application-specific constraints, in addition to the bandwidth/storage limitations, that require high compression efficiency. These additional constraints include coding delay (crucial for real-time applications), coding complexity, random access capability, and additional scalability in the view dimension. These should also be taken into account when designing application-specific multi-view encoders. In the context of scalable 3D video compression, many key achievements have been made within VISNET I. The purpose of the research activities carried out within VISNET II is to develop new video coding tools suitable for multi-view video data, and to achieve the integration of these tools into the state-of-the-art multi-view video coder JSVM/JMVM. In VISNET II a number of different tools have been developed to improve the rate-distortion performance and reduce the multi-view encoder complexity, and to cope with the other application-specific constraints stated above at the same time. The research work performed within VISNET II regarding transform-based multi-view coding is detailed in the rest of the chapter. Section 6.2 discusses the reduction of multi-view encoder complexity through the use of a multi-grid pyramidal approach. Section 6.3 discusses inter-view prediction using the reconstructed disparity information. Section 6.4 shows a work on multi-view coding via virtual view generation, and Section 6.5 will discuss low-delay random view access issues and propose a solution. 6.2 MVC Encoder Complexity Reduction using a Multi-grid Pyramidal Approach 6.2.1 Problem Definition and Objectives Multi-view Video Coding uses several references to perform predictive coding at the encoder. Furthermore, motion estimation is performed with respect to each reference frame using different block search sizes. This entails a very complex encoder. The goal of the contribution is to reduce the complexity of the motion estimation while preserving the rate-distortion performance. 6.2.2 Proposed Technical Solution The complexity reduction is achieved by using so-called Locally Adaptive Multi-grid Block Matching Motion Estimation [20]. The technique is supposed to generate more reliable motion fields. A coarse but robust enough estimation of the motion field is performed at the lowest

resolution level and is iteratively refined at the higher resolution levels. Such a process leads to a robust estimation of large-scale structures, while short-range displacements are accurately estimated on small-scale structures. This method takes into consideration the simple fact that coarser structures are sufficient in uniform regions, whereas finer structures are required in detailed areas. The method produces a precise enough estimate of the motion vectors, so that the prediction efficiency is maintained with a less complex structure than a classic full search method. On the other hand, the coding cost increases, since the amount of side information becomes larger. The multi-grid approach consists of three main steps, namely motion estimation at each level, the segmentation decision, and the down-projection operator. These are described in more detail in the following subsections.
6.2.2.1 Motion Estimation at Each Level
At each level, motion estimation is performed using block matching. In order to reduce the complexity of the algorithm, an n-step search technique is used to search for a matching block, as illustrated in Figure 6.1. Initially, the first nine locations defined by the set (0, ±2^{n-1}) are evaluated. The best estimate is the initial point for the next search step. At the ith step, the eight locations defined by the set (0, ±2^{n-i}) around the initial point are evaluated. Therefore, the resulting maximum displacement of the n-step search is 2^n − 1.
6.2.2.2 The Segmentation Decision Rule
The segmentation rule decides whether or not to split a block. This rule has a great impact on the overall performance of the algorithm. The algorithm can produce either more accurate motion vectors, which leads to the generation of a large amount of side information (i.e. many blocks are split), or poor motion vectors, which require a reduced amount of overhead information (i.e. few blocks are split). At its simplest, the segmentation rule might be defined according to the mean absolute error (MAE) and some threshold T, such that a block is split only when its MAE exceeds T:

MAE ≤ T → no split;  MAE > T → split    (6.1)

Mean square error (MSE) is another possible measure. The difficulty of this rule is to define the appropriate threshold T.
Figure 6.1 n-step search
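A minimal sketch of the n-step search of Section 6.2.2.1 is given below. It assumes 8-bit grayscale frames stored as NumPy arrays and a sum-of-absolute-differences matching criterion; the function names and the SAD cost are assumptions made for the illustration rather than details of the actual encoder.

import numpy as np

def sad(block, ref, y, x):
    """Sum of absolute differences between 'block' and the reference patch at (y, x)."""
    if y < 0 or x < 0:
        return np.inf
    h, w = block.shape
    patch = ref[y:y + h, x:x + w]
    if patch.shape != block.shape:            # candidate lies partly outside the frame
        return np.inf
    return np.abs(block.astype(np.int32) - patch.astype(np.int32)).sum()

def n_step_search(cur, ref, by, bx, bsize, n):
    """n-step search for the block of size 'bsize' at (by, bx) in 'cur'.

    Step i evaluates the offsets (0, +/-2**(n - i)) around the current best
    estimate, so the maximum reachable displacement is 2**n - 1.
    """
    block = cur[by:by + bsize, bx:bx + bsize]
    best = (0, 0)
    best_cost = sad(block, ref, by, bx)
    for i in range(1, n + 1):
        step = 2 ** (n - i)
        cy, cx = best
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                cand = (cy + dy, cx + dx)
                cost = sad(block, ref, by + cand[0], bx + cand[1])
                if cost < best_cost:
                    best, best_cost = cand, cost
    return best, best_cost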

Transform based Multi view Video Coding 207 ABC ab DE cd FGH Figure 6.2 Down projection initializes the motion vectors of blocks a, b, c and d. Block a chooses from the motion vectors of its direct parent block, block A, block B and block D. Block b chooses from the motion vectors of its direct parent block, block B, block C and block E. Block c chooses from the motion vectors of its direct parent block, block D, block F and block G. Block d chooses from the motion vectors of its direct parent block, block E, block G and block H 6.2.2.3 Down-projection Operator This operator is used to map the motion vectors between two grid levels, from the coarser level towards the finer ones (i.e. the motion vectors that parent blocks transmit to their child blocks). It should prevent block artifacts from propagating into the fine levels. At the same time, it should prevent incorrect motion vector estimations, as a result of local minima selection of the matching criterion. Furthermore, it should guarantee a smooth and robust motion field. Each block can be segmented into four child blocks. Each of these requires an initial motion vector, obtained by the down-projection operation. The simplest way is pass the motion vector of the parent block to the four children. A more efficient operator is to select for each child block the best motion vector from among four parent blocks. These blocks are the ones which are closest to the child (i.e. its direct parent block and three of its neighbors, depending on the position of the child). A down-projection example is given in Figure 6.2. An example of a multi-grid iteration is shown in Figure 6.3; the coarsest and finest block resolutions are 64 Â 64 and 4 Â 4, respectively. It is therein possible to verify the fact that coarser structures are sufficient in uniform regions, whereas finer structures are required in detailed areas. Figure 6.3 An example of the multi grid search, where the block size varies from 4 to 64 pixels
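The fragment below sketches one refinement pass of the multi-grid scheme, combining the MAE-based split decision of Section 6.2.2.2 with the down-projection of Section 6.2.2.3 (candidate parents chosen as in Figure 6.2). It assumes that the motion field is kept in a dictionary keyed by block position, that the frame dimensions are multiples of the block size, and that a 'refine' callable (for example, an n-step search started from the down-projected vector) is supplied; all of these are assumptions made for the example.

import numpy as np

def mae(cur, ref, y, x, size, mv):
    """Mean absolute error of the block at (y, x) displaced by mv in the reference."""
    dy, dx = mv
    h, w = ref.shape
    yy, xx = y + dy, x + dx
    if yy < 0 or xx < 0 or yy + size > h or xx + size > w:
        return np.inf
    a = cur[y:y + size, x:x + size].astype(np.int32)
    b = ref[yy:yy + size, xx:xx + size].astype(np.int32)
    return np.abs(a - b).mean()

def refine_level(cur, ref, parents, size, T, refine):
    """One multi-grid pass: split parents whose MAE exceeds T and down-project MVs.

    'parents' maps the (y, x) position of each parent block to its motion vector.
    Each child block picks its initial vector from its direct parent and the three
    parent neighbours closest to it (cf. Figure 6.2), and then refines it.
    """
    children = {}
    half = size // 2
    for (y, x), mv in parents.items():
        if mae(cur, ref, y, x, size, mv) <= T:
            continue                                   # block is accurate enough: no split
        for cy, cx in ((y, x), (y, x + half), (y + half, x), (y + half, x + half)):
            # neighbouring parents on the side of this child (up/down, left/right, diagonal)
            ny = y - size if cy == y else y + size
            nx = x - size if cx == x else x + size
            cands = [mv] + [parents[p] for p in ((y, nx), (ny, x), (ny, nx)) if p in parents]
            init = min(cands, key=lambda v: mae(cur, ref, cy, cx, half, v))
            children[(cy, cx)] = refine(cur, ref, cy, cx, half, init)
    return children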

208 Visual Media Coding and Transmission The novelty comes from the adaptation of the multi-grid block motion estimation approach to the framework of multi-view video coding. 6.2.3 Conclusions and Further Work The intended next step is to replace the simple splitting rule with a splitting entropy criterion. This kind of splitting operation is supposed to control the segmentation such that an optimal bit allocation between the motion parameters and the displaced frame difference (DFD) is reached. This criterion compares the extra cost of sending additional motion parameters with the gain obtained on the DFD side, to decide whether or not to split a given block. In addition, the problem of setting a threshold is expected to be overcome, as has been done for the MAE segmentation rule. As an alternative to block-based motion search, a global motion model [21] (homography) for prediction from the side views will be investigated for use in MVC. This would generate less side information to be encoded, since the model applies to the whole image. On the other hand, a motion vector is assigned to each block of the image when block-based motion search is used. 6.3 Inter-view Prediction using Reconstructed Disparity Information 6.3.1 Problem Definition and Objectives In future multi-view video coding standards, the use of disparity/depth information will be essential, and consideration of coding such data is needed. Having such information enables the development of new applications containing interactivity, i.e. the user can freely navigate in a visual scene through the existence of virtual camera scenes. In this research, the effect of lossy encoded disparity maps on the inter-view prediction will be investigated. If this information is available, it can be used to assist inter-view prediction. Hence, the overall bit rate is reduced, and furthermore the complexity of the multi-view encoder is reduced as well. 6.3.2 Proposed Technical Solution The developed framework consists of four main building blocks, which are explained in the following subsections. 6.3.2.1 Estimation of the Fundamental Matrix To reduce the correspondence search to one dimension, the fundamental matrix F is required. This step has to be carried out once for a fixed camera setup. The following steps are needed to estimate F [22]: . Feature detection using Harris corner detector. . Feature matching using Normalized Cross-Correlation (NCC). . Estimating F using the 7-pt algorithm and RANSAC.
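As an illustration of this pipeline, the sketch below uses OpenCV: Harris corners, normalized cross-correlation matching via template matching, and cv2.findFundamentalMat with the RANSAC flag standing in for the 7-point-plus-RANSAC step. The window sizes, thresholds, and the assumption that enough matches are found are choices made for the example, not values taken from the chapter.

import cv2
import numpy as np

def harris_corners(gray, max_pts=500, quality=0.01):
    """Harris corner detection via goodFeaturesToTrack with the Harris score."""
    pts = cv2.goodFeaturesToTrack(gray, max_pts, quality, minDistance=10,
                                  useHarrisDetector=True, k=0.04)
    return pts.reshape(-1, 2) if pts is not None else np.empty((0, 2), np.float32)

def ncc_match(gray1, gray2, pts, win=7, search=24, min_score=0.8):
    """Match each corner in view 1 to view 2 by normalized cross-correlation."""
    m1, m2 = [], []
    h1, w1 = gray1.shape
    h2, w2 = gray2.shape
    for x, y in pts.astype(int):
        if x - win < 0 or y - win < 0 or x + win >= w1 or y + win >= h1:
            continue                                   # template would leave view 1
        templ = gray1[y - win:y + win + 1, x - win:x + win + 1]
        sx0, sy0 = max(0, x - search), max(0, y - search)
        region = gray2[sy0:min(h2, y + search + 1), sx0:min(w2, x + search + 1)]
        if region.shape[0] <= templ.shape[0] or region.shape[1] <= templ.shape[1]:
            continue                                   # search window too small
        res = cv2.matchTemplate(region, templ, cv2.TM_CCORR_NORMED)
        _, score, _, loc = cv2.minMaxLoc(res)
        if score >= min_score:
            m1.append((x, y))
            m2.append((sx0 + loc[0] + win, sy0 + loc[1] + win))
    return np.float32(m1), np.float32(m2)

def estimate_F(img1, img2):
    """Estimate the fundamental matrix from two colour views (BGR assumed)."""
    g1, g2 = (cv2.cvtColor(i, cv2.COLOR_BGR2GRAY) for i in (img1, img2))
    p1, p2 = ncc_match(g1, g2, harris_corners(g1))
    F, inliers = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.99)
    return F, inliers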

Figure 6.4 Parameterization of disparity
6.3.2.2 Estimation of the Dense Disparity Map
The notation of the disparity is briefly explained below. Given the fundamental matrix F, relating point correspondences x and x′ of two images (i.e. views in this case):

x′^T F x = 0    (6.2)

Figure 6.4 depicts the parameterization for the disparity. The vectors T and N denote the tangential and normal vectors of the epipolar line, respectively. D denotes the unitary disparity vector, which relates x with x′ by a translation dD. l describes the variable which has to be estimated below. More detail about this notation can be found in [23]. The problem of finding the correspondences in two images is formulated as a partial differential equation (PDE), which has been integrated in an expectation-maximization (EM) framework [24]. The following equation is minimized using finite differences:

E[l(x)] = Σ_x V(x) [Q(x)^T S^{-1} Q(x)] + C ∇l(x)^T T(∇I_1^*) ∇l(x)    (6.3)

Q(x) = I_1^* − I_2(x + F(l(x), x))

F maps l to the disparity dD, V denotes the visibility of the pixel in the other image, I^* gives the true image, and S is the covariance matrix used to model normally-distributed noise with zero mean. T(·) is a tensor and is explained in detail in [23] and [24]. Figure 6.5 shows the estimated disparity image using the abovementioned algorithm for a sample image from the sequence Flamenco2.
6.3.2.3 Lossy Encoding of Disparity Map
The estimated disparity map is quantized to quarter-pel accuracy. After that, the disparity map is encoded using JSVM 3.5. During the test, the luminance (Y) component of the disparity image is set to the estimated disparity values, whereas the chrominance (U and V) components are set to a constant value and are not used. The result after encoding is shown in Figure 6.6.
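The packing of the disparity map into a frame that a standard coder can process can be sketched as follows: the map is quantized to quarter-pel accuracy, written to the Y plane, and the chroma planes are held constant. The ×4 scaling used to fit quarter-pel values into 8 bits and the raw I420 output format are assumptions of this sketch; in the chapter the resulting pictures are fed to JSVM 3.5.

import numpy as np

def pack_disparity_as_yuv(disparity, path, max_disp=63.75):
    """Quantize a floating-point disparity map to quarter-pel accuracy and store it
    as one 8-bit 4:2:0 frame: disparity in the Y plane, constant (neutral) chroma
    in U and V.  max_disp = 63.75 and even frame dimensions are assumptions.
    """
    h, w = disparity.shape
    quarter_pel = np.round(disparity * 4) / 4.0                  # quantize to 1/4 pel
    y = (np.clip(quarter_pel, 0, max_disp) * 4).astype(np.uint8)  # one luma level per 1/4 pel
    u = np.full((h // 2, w // 2), 128, np.uint8)                  # unused chroma planes
    v = np.full((h // 2, w // 2), 128, np.uint8)
    with open(path, "wb") as f:                                   # raw I420 frame
        f.write(y.tobytes())
        f.write(u.tobytes())
        f.write(v.tobytes())
    return quarter_pel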

210 Visual Media Coding and Transmission Figure 6.5 Original image (left) and dense disparity map (right) 6.3.2.4 Multi-view Video Coding Using the Reconstructed Disparity Map for View Prediction JSVM 3.5 is used to encode the picture. The encoder is set to support only unidirectional prediction (P prediction). To allow the insertion of external frames to be used as references for prediction, the GOP-string structure has been adapted, which tells the encoder how to encode a frame. As seen in Figure 6.7, the reference candidate is warped according to the reconstructed disparity map, before being used as reference. The frame that is currently predicted uses the warped prediction instead. 6.3.3 Performance Evaluation The current implementation supports P prediction only. Figure 6.8 shows the coding results for a single test frame. The disparity map is encoded using 2624 bits at a mean square error equal Figure 6.6 Lossy encoded disparity map

Transform based Multi view Video Coding 211 Figure 6.7 Multi view video coder that uses reconstructed disparity maps for view prediction to 0.35. Figure 6.8 shows some gain in rate-distortion performance. If the displacement search range is set to zero, the drop in rate-distortion performance is only minor. 6.3.4 Conclusions and Further Work Next steps include the utilization of B-prediction to achieve a fully-featured multi-view coding approach. The B-prediction can be achieved by using depth maps and camera information. Furthermore, the coding of the disparity/depth maps needs further investigation. Figure 6.8 Result of the proposed coding scheme. The blue curve shows the result if the displacement search range is set to zero
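Before moving on, the reference warping of Section 6.3.2.4 can be sketched as follows: each pixel of the decoded base view is shifted by its reconstructed disparity to form the prediction picture. For brevity the shift is applied horizontally, i.e. a rectified camera pair is assumed; in the general setting of Section 6.3.2.2 the displacement dD lies along the epipolar direction derived from F. The simple row-wise hole filling is likewise an assumption of the sketch.

import numpy as np

def warp_reference(ref, disparity):
    """Warp a decoded reference view towards the target view using a per-pixel
    disparity map, producing the picture used in place of the original reference."""
    h, w = disparity.shape
    warped = np.zeros_like(ref)
    filled = np.zeros((h, w), bool)
    xs = np.arange(w)
    for y in range(h):
        tx = np.clip(np.round(xs + disparity[y]).astype(int), 0, w - 1)
        warped[y, tx] = ref[y, xs]        # forward mapping; later samples overwrite earlier ones
        filled[y, tx] = True
    # crude hole filling: propagate the last filled sample along each row
    for y in range(h):
        last = ref[y, 0]
        for x in range(w):
            if filled[y, x]:
                last = warped[y, x]
            else:
                warped[y, x] = last
    return warped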

6.4 Multi-view Coding via Virtual View Generation
(Portions reprinted, with permission, from E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, "Multi-view video coding via virtual view generation," 26th Picture Coding Symposium, Portugal, November 2007. © 2007 EURASIP.)
6.4.1 Problem Definition and Objectives
In this research, a multi-view video coding method via generation of virtual image sequences is proposed. It is intended to improve the inter-view prediction process by exploiting the known scene geometry. Moreover, most future applications requiring MVC already necessitate the use of the scene geometry information [15,16], so it is beneficial to use this information during the compression stage. Pictures are synthesized through a 3D warping method to estimate certain views in a multi-view set, which are then used as inter-view references. Depth maps and associated color video sequences are used for virtual view generation. JMVM is used for coding color videos and depth maps. Results are compared against the reference H.264/AVC simulcast method, where every view is coded without using any kind of inter-view prediction, under some low-delay coding scenarios. The rate-distortion performance of the proposed method outperforms that of the reference method at all bit rates.
6.4.2 Proposed Technical Solution
The proposed framework is composed of two main steps, namely the virtual view generation step through a 3D depth-based warping technique, and the multi-view coding step using the generated virtual sequences as inter-view references.
6.4.2.1 Generation of Virtual Views through Depth Image-based 3D Warping
In order to be able to remove the spatial redundancy among neighboring views in a multi-view set, virtual sequences are rendered from already-encoded frames of certain views. These views are called "base views". The rendered frames are then used as alternative predictions for the corresponding frames to be predicted in certain other views. These views are called "intermediate views" or, equivalently, "b views". The virtual views are rendered through the unstructured lumigraph rendering technique explained in [25]. In this research work, this method uses an already-encoded picture of the base view, which is projected first to a 3D world with the pinhole camera model, and then back to the image coordinates of the intermediate view, taking into account the camera parameters of both the base view and the intermediate view. The pixel at base view image coordinate (x,y) is projected to 3D world coordinates using:

[u, v, w]^T = R(c) · A^{-1}(c) · [x, y, 1]^T · D[c, t, x, y] + T(c)    (6.4)

where [u, v, w] is the world coordinate. (Reproduced by permission of © 2007 EURASIP.) Here, c identifies the base view's camera. R, T, and A denote the 3 × 3 rotation matrix, the 3 × 1 translation vector, and the 3 × 3 intrinsic matrix of the base view camera, respectively, and D[c,t,x,y] is the distance of the corresponding pixel (x,y) from the base view camera at time t [25]. The world

coordinates are mapped back to the intermediate view image coordinate system using:

[x′, y′, z′]^T = A(c′) · R^{-1}(c′) · ([u, v, w]^T − T(c′))    (6.5)

where [(x′/z′), (y′/z′)] is the corresponding point in the intermediate view image coordinate system [25]. The matrices in Equations (6.4) and (6.5), i.e. R, T, and A, and the corresponding depth images of the base views, are all provided by Microsoft Research for the multi-view Breakdancer sequence [15]. The camera parameters should be supplied to the image renderer. In the experiments, the depth maps supplied by Microsoft Research are used. In the proposed method, a 3D warping procedure is carried out pixel by pixel. However, care should be taken to avoid several visual artifacts. First, some pixels in the reference picture may be mapped to the same pixel location in the target picture. In that case, a depth-sorting algorithm for the pixels falling on the same point in the target picture is applied. The pixel closest to the camera is displayed. Second, not every pixel may fall on an integer pixel location. The exact locations should be rounded to the nearby integer pixel locations in the target image. This makes many small visual holes appear on the rendered image. The estimates for empty pixels are found by extrapolating the nearby filled pixels, which is a valid estimation for holes with radius smaller than 10 pixels. For every intermediate view, the two neighboring base views are warped separately into the intermediate view image coordinate system. The resulting view yielding the best objective quality measurement is chosen for the prediction. For better prediction quality and better usage of the scene geometry, the formerly occluded regions in the final prediction view are compared with the corresponding pixels in the other warped image. Figure 6.9 shows a sample final rendered image segment, which is formed from two side camera images.
Figure 6.9 Final rendered image (in the middle), constructed from the left and right rendered images. Reproduced by permission of © 2007 EURASIP
6.4.2.2 MVC Prediction Method
One motivation for using virtual references as prediction sources is that for future video applications, particularly for FTV, transportation of depth information will be essential [26]. So, exploiting that information to improve the compression of certain views is quite reasonable. Besides the rendered references, it is important not to remove temporal references from the prediction list, since the temporal references occupy the highest percentage of the references used for prediction in hierarchical B-frame prediction [9]. In

214 Visual Media Coding and Transmission Figure 6.10 Camera arrangement and view assignment. Reproduced by permission of Ó2007 EURASIP the tests, other means of inter-view references are removed in order to be able to see the extent to which the proposed method outperforms conventional temporal predictive coding techniques. Figure 6.10 shows the camera arrangement and view assignment for an eight-camera multi-view sequence. The view assignment is flexible. There are two reasons for such an assignment, although it may not be optimal in a sense for minimizing the overhead caused by depth map coding. One reason is that, for any intermediate view, the two closest base views are used for 3D warping, making the most likely prediction frame be rendered. The other reason is to show the effects of using prediction frames rendered using just one base view. In our case, virtual prediction frames for the coding of intermediate view 7 are rendered using only base view 6. JMVM 2.1 is used for the proposed multi-view video coding scenario. Both the color videos and the depth maps of the base views are encoded in H.264 simulcast mode (no inter-view prediction). However, the original depth maps are downscaled to their half resolution prior to encoding. The fact that depth maps don’t need to include full depth information to be useful for stereoscopic video applications [27] motivates us to use downscaled versions of depth maps containing more sparse depth information. In the experiments, use of reduced-resolution depth maps affected the reconstruction quality at the decoder negligibly, even for very low bit rates. The PSNR of decoded and upsampled depth maps changed between roughly 33 dB and 34.5 dB. Table 6.1 shows the coding conditions for base views and depth maps. Following the coding of base views with their depth maps, intermediate views are coded using the rendered virtual sequences as inter-view references. In this case, the original frames at I-frame and P-frame positions are coded using the corresponding virtual frame references. At P-frame locations, temporal referencing is still enabled. A lower quantization parameter is used for coding intermediate views. The prediction structure for intermediate view coding is illustrated schematically in Figure 6.11. One reason for such a prediction structure is that it is intended to explore the coding performance of the proposed scheme for low-delay coding scenarios. Besides, as the GOP size increases, where the coding performance of temporal prediction is maximized, the effect of the proposed method on the overall coding efficiency becomes less visible. It was observed in experiments, for GOP size of 12, that the proposed technique had no gain compared to the reference technique (H.264-based simulcast method). Table 6.1 Codec configuration. Reproduced by Permission of Ó2007 EURASIP Software JMVM 2.1 Symbol mode CABAC Loop filter On (color video), Off (depth maps) Search range 96 Prediction structure I P I P . . . (low delay, open GOP) Random access 0.08 second (25 fps video)
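A per-pixel sketch of the warping of Section 6.4.2.1, i.e. the back-projection of Equation (6.4) followed by the re-projection of Equation (6.5), is shown below. Depth sorting is handled with a simple z-buffer, as described, and positions are rounded to integer pixel locations; hole filling is omitted. The plain Python loop is for clarity only, and the sketch assumes the depth map stores the metric distance D[c,t,x,y].

import numpy as np

def warp_view(img, depth, R_b, T_b, A_b, R_i, T_i, A_i):
    """Warp a base-view picture into the image plane of an intermediate view.

    Eq. (6.4): back-project each base-view pixel to world coordinates.
    Eq. (6.5): re-project the world point into the intermediate view.
    A z-buffer keeps the sample closest to the camera when several pixels
    land on the same target position (depth sorting).
    """
    h, w = depth.shape
    out = np.zeros_like(img)
    zbuf = np.full((h, w), np.inf)
    A_b_inv = np.linalg.inv(A_b)
    R_i_inv = np.linalg.inv(R_i)
    for y in range(h):
        for x in range(w):
            # Eq. (6.4): pixel -> world
            world = R_b @ (A_b_inv @ np.array([x, y, 1.0])) * depth[y, x] + T_b
            # Eq. (6.5): world -> intermediate-view image plane
            p = A_i @ (R_i_inv @ (world - T_i))
            if p[2] <= 0:
                continue
            u, v = int(round(p[0] / p[2])), int(round(p[1] / p[2]))
            if 0 <= u < w and 0 <= v < h and p[2] < zbuf[v, u]:
                zbuf[v, u] = p[2]
                out[v, u] = img[y, x]
    return out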

Figure 6.11 Prediction structure of intermediate views. Reproduced by permission of © 2007 EURASIP
6.4.3 Performance Evaluation
Figure 6.12(a) and (b) shows the performance comparison of the proposed MVC method with H.264-based simulcast coding. The coding bit rate of the depth map does not exceed 20% of the coding bit rate of the associated color video. Figure 6.12(c) and (d) shows the performance comparisons between the proposed method and the reference method, where all frames in base views are intra-coded and intermediate views are predicted only from rendered virtual sequences. Figure 6.12(e) and (f) shows the results for the Ballet test sequence.
[Figure 6.12 plots average PSNR (dB) versus average bit rate (kbps) for the reference and proposed schemes: (a) Breakdancers intermediate views 1, 3 and 5 (average), I P I P coding; (b) Breakdancers intermediate view 7, I P I P coding; (c) Breakdancers intermediate views 1, 3 and 5 (average), I I I I coding; (d) Breakdancers intermediate view 7, I I I I coding; (e) Ballet intermediate views 1, 3 and 5 (average), I I I I coding; (f) Ballet intermediate view 7, I I I I coding.]
Figure 6.12 Rate distortion performance of proposed and reference schemes. Reproduced by permission of © 2007 EURASIP

216 Visual Media Coding and Transmission 6.4.4 Conclusions and Further Work According to Figure 6.12, the coding performance is improved in comparison to combined I and P prediction. The difference in gain between Figure 6.12(a) and (c) shows us that the proposed method has a considerable gain over intra-coded pictures, but also that the temporal references should be kept as prediction candidates to achieve optimum coding performance. Similar results are observed in Figure 6.12(b) and (d), where the performance of the proposed method is analyzed for intermediate view 7. The proposed method still outperforms the reference coding method, and the gain over intra-coding is significant. The overall decrease in average coding gains when compared to those of intermediate views 1, 3, and 5 shows us that virtual sequences, rendered using two base views, can predict the original view better than the virtual sequences rendered using only one base view. Similar results are obtained for the Ballet sequence, as can be seen in Figure 6.12(e) and (f). The subjective evaluation of the proposed method was satisfactory. Accordingly, the proposed method is suitable for use in multi-view applications under low-delay constraints. 6.5 Low-delay Random View Access in Multi-view Coding Using a Bit Rate-adaptive Downsampling Approach 6.5.1 Problem Definition and Objectives (Portions reprinted, with permission, from E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, ‘‘Low delay random view access in multi-view coding using a bit-rate adaptive downsampling approach’’, IEEE International Conference on Multimedia & Expo (ICME), 23 26 June 2008, Hannover, Germany. Ó2008 IEEE. E. Ekmekcioglu, S.T. Worrall, A.M. Kondoz, ‘‘Utilisation of downsampling for arbitrary views in mutli-view video coding’’, IET Electronic Letters, 28 February 2008, Vol. 44, Issue 5, p. 339 340. (c)2008 IET.) In this research, a new multi-view coding scheme is proposed and evaluated. The scheme offers improved low-delay view random access capability and, at the same time, comparable compression performance with respect to the reference multi-view coding scheme currently used. The proposed scheme uses the concept of multiple-resolution view coding, exploiting the tradeoff between quantization distortion and downsampling distortion at changing bit rates, which in turn provides improved coding efficiency. Bi-predictive (B) coded views, used in the conventional MVC method, are replaced with predictive-coded downscaled views, reducing the view dependency in a multi-view set and hence reducing the random view access delay, but preserving the compression performance at the same time. Results show that the proposed method reduces the view random access delay in an MVC system significantly, but has a similar objective and subjective performance to the conventional MVC method. 6.5.2 Proposed Technical Solution A different inter-view prediction structure is proposed, which aims to replace B-coded views with downsampled (using bit rate-adaptive downscaling ratios) and P-coded views. The goal is to omit B-type inter-view predictions, which inherently introduces view hierarchy to the system and increases random view access delay. The disadvantage is that B coding, which improves the coding efficiency significantly, is avoided. However, the proposed scheme preserves the coding performance by using downsampled and P-coded views, reducing the

Transform based Multi view Video Coding 217 random view access delay remarkably at the same time. A mathematical model is constructed to relate the coding performances of different coding types used within the proposed scheme to one another, which enables us to estimate the relative coding efficiencies of different inter-view prediction structures. In the following subsections, the bit rate-adaptive downsampling approach is explained and the proposed inter-view prediction is shown. 6.5.2.1 Bit Rate-adaptive Downsampling The idea behind downsampling a view prior to coding and upsampling the reconstructed samples is based on the tradeoff between two types of distortion: distortion due to quantization and distortion due to downsampling. Given a fixed bit-rate budget, increasing the down- sampling ratio means that less coarse quantization needs to be used. Thus, more information is lost through downsampling, but less is lost through coarse quantization. Finding the optimum tradeoff between the two distortion sources should lead to improved compression efficiency. To observe this, views are downsampled with different downscaling ratios prior to encoding, ranging from 0.3 to 0.9 (the same ratio for each dimension of the video, leaving the aspect ratio unchanged). These ratios are tested over a broad range of bit rates. The results indicate that the optimum tradeoff between the two distortion types varies with the target bit rate. Figure 6.13 shows the performance curves of downscaled coding, with some downscaling ratios and full resolution coding, for a particular view of the Breakdancer test sequence. The best performance characteristics at medium and low bit rates, where the quantization distortion is more effective, are achieved with 0.6 scaling ratio (mid-range), whereas at much higher bit rates, where the effect of the distortion due to downsampling becomes more destructive, larger scaling ratios (0.8 0.9) are suitable, introducing less downsampling distortion. Very low ratios, such as 0.3, are only useful at very low bit rates, where the reconstruction quality is already insufficient to be considered (less than 32 dB). Results do not change over different data sets. Figure 6.13 Coding performance of a multi view coder that uses several downscaling ratios for the second view of the Breakdancer test sequence. Reproduced by Permission of Ó2008 IET
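The experiment behind Figure 6.13 can be summarized with the sketch below, which downsamples a frame with a candidate ratio, passes it through a codec, upsamples the reconstruction, and measures PSNR against the original. The 'codec' callable, the interpolation filters, and the set of ratios are assumptions of this illustration; in the chapter the encoding step is performed by the actual H.264/JMVM coder.

import cv2
import numpy as np

def psnr(a, b):
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse) if mse > 0 else float("inf")

def sweep_ratios(frame, codec, ratios=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Evaluate the downsampling/quantization trade-off for one frame.

    'codec' is an assumed callable that encodes and decodes a picture at the
    current rate point and returns (reconstruction, bits); here it simply
    stands in for the real encoder used in the experiments.
    """
    h, w = frame.shape[:2]
    results = []
    for r in ratios:
        small = cv2.resize(frame, (int(w * r), int(h * r)), interpolation=cv2.INTER_AREA)
        recon_small, bits = codec(small)
        recon = cv2.resize(recon_small, (w, h), interpolation=cv2.INTER_CUBIC)
        results.append((r, bits, psnr(frame, recon)))
    return results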

218 Visual Media Coding and Transmission In the rest, for simplicity, two predefined downscaling ratios are used À0.6 for bit rates less than 300 kbit/s, 0.8 for bit rates over 300 kbit/s targeting VGA sequences (640 Â 480) at 25 fps. Accordingly, up to 20% saving in bit rate is achieved for individual views at certain reconstruction qualities. 6.5.2.2 Inter-view Prediction Structure The random view access corresponds to accessing any frame in a GOP of any view with minimal decoding of other views [28]. In Figure 6.14(a), the reference inter-view prediction structure of the current MVC reference is shown (for 8 views and 16 views cases) at anchor frame positions. The random view access cost, defined as the maximum number of frames that must be decoded to reach the desired view, is 8 and 16 for the 8-view and the 16-view cases, respectively. The disadvantage is that as the number of cameras increases, the cost increases at the same rate. Furthermore, in some streaming applications only relevant views may be sent to the user, to save bandwidth. With such a dependency structure, more views would have to be streamed, and hence the bit rate would increase. In Figure 6.14(b) the proposed low-delay view random access model with downsampled P coding (LDVRA þ DP) is given. The group of views (GOV) concept is used, which is suitable for free viewpoint video (separated by dashed lines), and in each GOV one view, called a base view, is coded at full spatial resolution, while Figure 6.14 Anchor frame positions: (a) reference MVC inter view prediction structure; (b) low delay view random access with downsampled P coding. Reproduced by Permission of Ó 2008 IEEE

other views, called enhancement views, are downsampled using the idea described in Section 6.5.2.1 and are P coded. None of the views are B coded, so that no extra layers of view dependency are present. Every enhancement view is dependent on its associated base view, and every base view depends on the same base view, whose anchor frames are intra-coded.
6.5.3 Performance Evaluation
In this work it is assumed that the coding performances at anchor frame positions of both techniques reflect the overall multi-view coding performances of the respective techniques. The reason is that most of the coding gain in MVC, compared to simulcast, is achieved at anchor frame positions, where there are no means of temporal prediction. Therefore, only the coding efficiency at anchor frame positions is evaluated. In both prediction methods (reference MVC and LDVRA + DP) there is one intra-coded (I) view (all GOPs begin with an intra-coded frame). Other than the I view, both prediction structures contain a certain number of P views, B views (only for the reference technique), and DP views (only for LDVRA + DP coding). Another assumption is that, for each view, I coding at anchor frame positions at the same time instant would generate a similar bit rate for the same output quality. The same is valid for P coding. The efficiency metrics of P, B, and DP coding are defined as αP, αB, and αDP respectively. αP is set to 1 initially. Accordingly, αB and αDP change between 0 and 1. A lower efficiency index means higher coding efficiency. The values of αB and αDP are determined experimentally, and their values for different views are found to be consistent. Therefore, at changing bit rates, different αP, αB, and αDP values are calculated by averaging their values from the results of all views. Let the total number of cameras in a multi-view system be equal to 2n. The per-view coding efficiency index can be calculated as:

Reference MVC → (n + (n − 1)·αB) / 2n    (6.6)

LDVRA + DP (GOV = 3) → (⌊2n/3 − 1⌋ + (⌊4n/3⌋ + 1)·αDP) / 2n    (6.7)

LDVRA + DP (GOV = 5) → (⌊2n/5 − 1⌋ + (⌊8n/5⌋ + 1)·αDP) / 2n    (6.8)

Figure 6.15 shows the per-view efficiency versus PSNR graphs drawn for the 16-view Rena and 8-view Breakdancer sequences, for experimentally-determined values of αP, αB, and αDP. They are determined at different bit rates by taking the ratios of the output bit rates for B and DP views with respect to the output bit rate of P-coded views. Actual coding results with JMVM are given in Figure 6.16. Common coding configurations for each view are shown in Table 6.2. LDRA curves represent the technique in which no downsampling is utilized for P-coded views. LDRA performs worse than the reference MVC method, since it does not benefit from the coding gains of B-view coding or downsampled P-view coding. The proposed LDRA coding technique with downsampled P-view coding tends to perform better than the reference coding technique, especially at low bit rates. This is observed in both the estimation graphs and real coding results. At the same time, the real relative efficiencies of the proposed techniques with

respect to the reference coding technique are reflected correctly in the estimated relative per-view efficiency graphs.
Figure 6.15 Estimated relative performances for Rena (16 views) and Breakdancer (8 views) test sets. Reproduced by Permission of © 2008 IEEE
In order to compare the relative efficiencies of both techniques, F1 is defined as the difference between the per-view efficiency indices of the reference and the proposed techniques. Then:

F1 = [(3n − 3)·αB − (4n + 3)·αDP + (n + 3)] / 6n    (6.9)

In order to make sure that the proposed low-delay random access coding scenario performs at least as well as the reference MVC method, we need the following condition to be satisfied:

F1 ≥ 0    (6.10)

Table 6.2 Codec configuration. Reproduced by Permission of © 2008 IEEE
Basis QP: 22, 32, 37
Entropy coding: CABAC
Motion search range: 32
Temporal prediction structure: Hierarchical B prediction
Temporal GOP size: 12
RD optimization: Yes
Figure 6.16 Real experiment results with JMVM for Rena (16 views) and Breakdancer (8 views) test sets. Reproduced by Permission of © 2008 IEEE
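The per-view efficiency indices of Equations (6.6)-(6.8) and the closed form of Equation (6.9), which drops the floor operators, can be evaluated with the small helper below. The α values in the usage example are placeholders, not the experimentally-measured ones behind Figure 6.15.

from math import floor

def per_view_index(n, alpha_B, alpha_DP):
    """Per-view coding-efficiency indices for a 2n-camera set-up (alpha_P = 1):
    Eq. (6.6) for reference MVC, Eqs. (6.7)-(6.8) for LDVRA+DP with GOV sizes
    3 and 5, and the floor-free difference F1 of Eq. (6.9)."""
    ref_mvc = (n + (n - 1) * alpha_B) / (2 * n)
    gov3 = (floor(2 * n / 3 - 1) + (floor(4 * n / 3) + 1) * alpha_DP) / (2 * n)
    gov5 = (floor(2 * n / 5 - 1) + (floor(8 * n / 5) + 1) * alpha_DP) / (2 * n)
    f1 = ((3 * n - 3) * alpha_B - (4 * n + 3) * alpha_DP + (n + 3)) / (6 * n)
    return {"reference": ref_mvc, "gov3": gov3, "gov5": gov5, "F1": f1}

# Example: 16 cameras (n = 8) with illustrative alpha values
print(per_view_index(8, alpha_B=0.55, alpha_DP=0.7))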

Experimental values for αP, αB, and αDP guarantee that the condition in (6.10) is satisfied for the test videos used, at most bit rates. Similarly, in order to make sure that the proposed scheme performs better with larger GOV sizes, define F2 as the difference between the per-view efficiency indices of the two proposed techniques (one with a GOV size of 3 and the other with a GOV size of 5). Then the following is obtained:

F2 = (2n/3 − 1) / 2n + (4n/3 + 1)·αDP / 2n − (2n/5 − 1) / 2n − (8n/5 + 1)·αDP / 2n    (6.11)
⇒ F2 = (2/15)·(1 − αDP) ≥ 0

Since αDP is strictly below 1, it is certain that the condition in (6.11) is satisfied. This is observed in both the estimated and the real coding results. The perceptual quality of the proposed low-delay random access scheme with downsampled P coding is compared with that of the reference MVC method including B-coded views, using the stimulus comparison-adjectival categorical judgment method described in recommendation ITU-R BT.500-11 [29]. The Rena, Breakdancer, and Ballet test sequences are used for the evaluations. A differential mean opinion score is calculated at two different bit rates and plotted on a differential scale, where 0 corresponds to no perceptual difference between the two methods and negative values indicate that the proposed method performs better. Sixteen subjects are used in the evaluations. Figure 6.17 shows the results. Since the Rena sequence is originally a blurry sequence, the downsampling distortion is not visually noticeable. Therefore, there is no visual difference between the conventional MVC and the proposed coding method for the Rena sequence. On the other hand, for the other sequences tested it can be observed that at high bit rates the perceptual qualities of both methods do not differ, indicating that the downsampling distortion (blurriness) is not a significant issue. At lower bit rates, quantization distortion (blockiness) is more visible than the downsampling distortion, and hence the proposed method generates visually more satisfactory results.
Figure 6.17 Subjective test results comparing the proposed method and the reference MVC method. Reproduced by Permission of © 2008 IEEE

222 Visual Media Coding and Transmission 6.5.4 Conclusions and Further Work It is observed that the random view access performance of multi-view coding systems can be improved significantly with respect to the conventional MVC method without any loss of coding performance and perceptual quality. The reason is that the performance of efficient B coding, present in the conventional MVC method, can be achieved by downsampled P coding. The proposed inter-view dependency structure is more suitable for fast-switching free-view systems, due to the utilization of the concept of groups of views. Furthermore, assigning larger GOV sizes can further increase the compression performance, without affecting the overall random view access delay. The proposed approach brings a slight increase in the complexity, due to the addition of up-conversion and down-conversion blocks, but this is balanced with the reduction in the processing load for the downsampled videos. One limitation of this technique is with highly-textured video sequences, where the inherent low-pass filtering effect of downsampling might significantly degrade the subjective quality. This can be overcome by transmitting extra residuals for the blocks in the vicinity of object edges to improve the visual quality, which is a next step in research. References [1] A. Smolic et al., ‘‘3D video and free viewpoint video: technologies, applications and MPEG standards,’’ IEEE International Conference on Multimedia and Expo, Jul. 2006. [2] ISO/IEC JTC1/SC29/WG11 N371, ‘‘List of ad hoc groups established at the 58th meeting in Pattaya,’’ 2001. [3] A. Smolic and D. McCutchen, ‘‘3DAV exploration of video based rendering technology in MPEG,’’ IEEE Transactions on Circuits and Systems for Video Technology, Vol. 14, No. 3, pp. 348 356, Mar. 2004. [4] ISO/IEC JTC1/SC29/WG11 N6720, ‘‘Call for evidence on multi view video coding,’’ 2004. [5] ISO/IEC JTC1/SC29/WG11 N6999, ‘‘Report of the subjective quality evaluation for multi view coding CfE,’’ 2005. [6] ISO/IEC JTC1/SC29/WG11 N7327, ‘‘Call for proposals on multi view video coding,’’ 2005. [7] ISO/IEC JTC1/SC29/WG11 N7779, ‘‘Subjective test results for the CfP on multi view video coding,’’ 2006. [8] Y. L. Lee, J. H. Hur, D. Y. Kim, Y. K. Lee, S. H. Cho, N. H. Hur, and J.W. Kim, ‘‘H.264/MPEG 4 AVC based multiview video coding (MVC),’’ ISO/IEC JTC1/SC29/WG11 M12871, Jan. 2006. [9] H. Schwarz, T. Hinz, A. Smolic, T. Oelbaum, T. Wiegand, K. Mueller, and P. Merkle, ‘‘Multi view video coding based on h.264/mpeg4 avc using hierarchical b pictures,’’ Picture Coding Symposium, 2006. [10] J. Xin, A. Vetro, E. Martinian, and A. Behrens, ‘‘View synthesis for multi view video compression,’’ Picture Coding Symposium, 2006. [11] K. Yamamoto, M. Kitahara, H. Kimata, T. Yendo, T. Fujii, M. Tanimoto et al. ‘‘SIMVC: multi view video coding using view interpolation and color correction,’’ IEEE Transactions on Circuits and Systems for Video Technology, Vol. 10, No. 17, Oct. 2007. [12] M. Tanimoto, ‘‘Overview of free viewpoint television,’’ Signal Processing: Image Communication, Vol. 21, pp. 454 461, 2006. [13] H. Kimata et al., ‘‘Multi view video coding using reference picture selection for free viewpoint video communication,’’ Picture Coding Symposium, San Francisco, USA, 2004. [14] A. Smolic et al., ‘‘Multi view video plus depth representation and coding,’’ IEEE ICIP 2007, San Antonio, TX, Sep. 2007. [15] C.L. Zitnick et al., ‘‘High quality video view interpolation using a layered representation,’’ ACM Siggraph and ACM Trans. 
on Graphics, Aug. 2004. [16] P. Kauff et al., ‘‘Depth map creation and image based rendering for advanced 3DTV services providing interoperability and scalability,’’ Signal Processing: Image Communication, Vol. 22, pp. 217 234, Feb. 2007. [17] J. Shade, S. Gortler, L. He, and R. Szeliski, ‘‘Layered depth images,’’ Computer Graphics Proceedings, Annual Conference Series, SIGGRAPH, Orlando, FL, Jul. 1998. [18] S. U. Yoon and Y. S. Ho, ‘‘Multiple color and depth video coding using a hierarchical representation,’’IEEE Transactions on Circuits and Systems for Video Technology, Oct. 2007.

Transform based Multi view Video Coding 223 [19] J. Duan and J. Li, ‘‘Compression of the layered depth image,’’ IEEE Transactions on Image Processing, Vol. 12, No. 3, Mar. 2003. [20] F. Dufaux and F. Moscheni, ‘‘Motion estimation techniques for digital TV: a review and a new contribution,’’ Proc. IEEE, Vol. 83, No. 6, pp. 858 876, Jun. 1995. [21] F. Dufaux and J. Konrad, ‘‘Robust, efficient and fast global motion estimation for video coding,’’ IEEE Transactions on Image Processing, Vol. 9, No. 3, pp. 497 501, Mar. 2000. [22] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000. [23] L. Alvarez, R. Deriche, J. Sanchez, and J. Weickert,‘‘Dense disparity map estimation respecting image discontinuities: a PDE and scalespace based approach,’’ Tech. Rep. RR 3874, INRIA, Jan. 2000. [24] C. Strecha, R. Fransens, and L.J. Van Gool, ‘‘A probabilistic approach to large displacement optical flow and occlusion detection,’’ European Conference on Computer Vision, SMVP Workshop, Prague, Czech Republic, pp. 71 82, May 2004. [25] S. Yea et al., ‘‘Report on core experiment CE3 of multiview coding,’’ JVT T123, Klagenfurt, Austria, Jul. 2006. [26] M. Tanimoto et al., ‘‘Proposal on requirements for FTV,’’ JVT W127, San Jose, CA, Apr. 2007. [27] W.J. Tam and L. Zhang,‘‘Depth map preprocessing and minimal content for 3D TV based on DIBR,’’ JVT W095, San Jose, CA, Apr. 2007. [28] Y. Liu et al., ‘‘Low delay view random access for multi view video coding,’’ IEEE International Symposium on Circuits and Systems 2007, pp. 997 1000, May 2007. [29] ITU R, ‘‘Methodology for the subjective assessment of the quality of the television signals,’’ Recommendation BT.500 11, 2002.



7 Introduction to Multimedia Communications 7.1 Introduction The goal of wireless communication is to allow a user to access required services at any time with no regard to location or mobility. Recent developments in wireless communication, multimedia technology, and microelectronics technology have created a new paradigm in mobile communications. Third/fourth-generation wireless communication technologies pro- vide significantly higher transmission rates and service flexibility, over a wider coverage area, than is possible with second-generation wireless communication systems. High-compression, error-robust multimedia codecs have been designed to enable the support of a multimedia application over error-prone bandwidth-limited channels. The advances of VLSI and DSP technologies are preparing light-weight, low-cost, portable devices capable of transmitting and viewing multimedia streams. The above technological develop- ments have shifted the service requirements of future wireless communication systems from conventional voice telephony to business-oriented multimedia services. To successfully meet the challenges set by future audiovisual communication requirements, the International Telecommunication Union Radiocommunication Sector (ITU-R) has elabo- rated on a framework for global third-generation standards by recognizing a limited number of radio access technologies. These are Universal Mobile Telecommunications System (UMTS), Enhanced Data rates for GSM Evolution (EDGE), and CDMA2000. UMTS is based on Wideband CDMA technology and is employed in Europe and Asia using the frequency band around 2 GHz. EDGE is based on TDMA technology and uses the same air interface as the successful second-generation mobile system GSM. General Packet Radio Service (GPRS) and High-Speed Circuit-Switched Data (HSCSD) are introduced by phase 2þ of the GSM standardization process, and support enhanced services with data rates up to 144 kbps in the packet-switched and circuit-switched domains respectively. GPRS has also been accepted by the Telecommunication Industry Association (TIA) as the packet data standard for TDMA/136 systems. EDGE, which is the evolution of GPRS and HSCSD, provides third-generation services up to 500 kbps within GSM carrier spacing of Visual Media Coding and Transmission Ahmet Kondoz © 2009 John Wiley & Sons, Ltd. ISBN: 978-0-470-74057-6

200 kHz. CDMA2000 is based on multicarrier CDMA technology and provides the upgrade path for existing IS-95 operators, mainly in North America. The migration paths from existing second-generation standards to third-generation standards are shown in Figure 7.1.
[Figure 7.1 is an evolution diagram linking GSM, HSCSD, GPRS, EDGE, and TDMA/136 to UMTS WCDMA (FDD/TDD) on the 3GPP track, and IS-95 to CDMA2000 1× and CDMA2000 3× MC-CDMA on the 3GPP2 track, with indicative data rates of <10 kbps (2G), 64-144 kbps (2.5G), and 384 kbps-2 Mbps (3G).]
Figure 7.1 Evolution toward global third generation standards
EDGE and UMTS are the most widely accepted third-generation radio access technologies. They are being standardized by the 3rd Generation Partnership Project (3GPP). Even though EDGE and UMTS are based on two different multiple-access technologies, both systems share the same core network. The evolved GSM core network will serve as a common GSM/UMTS core network that supports GSM/GPRS/EDGE and UMTS access. In addition, wireless local area networks (WLAN) are becoming more and more popular for communication in homes, offices, and indoor public areas such as campus environments, airports, hotels, shopping centers, and so on. IEEE 802.11 provides the dominant WLAN standard. IEEE 802.11 specifications are focused on the two lowest layers of the ISO protocol stack, the medium access control (MAC) layer and the physical layer. IEEE 802.11 has a number of physical layer specifications with a common MAC operation. The original IEEE 802.11 includes two physical layers, a frequency-hopping spread-spectrum (FHSS) physical layer and a direct-sequence spread-spectrum (DSSS) physical layer, and operates at 2 Mbps. The current, widely-deployed IEEE 802.11b standard provides an additional physical layer based on high-rate direct-sequence spread-spectrum (HR/DSSS). It operates in the 2.4 GHz unlicensed band and provides bit rates up to 11 Mbps. The IEEE 802.11a standard for the 5 GHz band provides high bit rates up to 54 Mbps and uses a physical layer based on orthogonal frequency division multiplexing (OFDM). The IEEE 802.11g standard was issued to achieve such high bit rates in the 2.4 GHz band. Worldwide Interoperability for Microwave Access (WiMAX) is a telecommunications technology aimed at providing wireless data over long distances in different ways, from point-to-point links to full mobile cellular access. It is based on the IEEE 802.16 standard, which is also called WirelessMAN. The name WiMAX was created by the WiMAX Forum,

Introduction to Multimedia Communications 227 which was formed in June 2001 to promote conformance and interoperability of the standard. The forum describes WiMAX as “a standards-based technology enabling the delivery of last mile wireless broadband access as an alternative to cable and DSL”. Mobile WiMAX IEEE 802.16e provides fixed, nomadic, and mobile broadband wireless access systems with superior throughput performance. It enables non-line-of-sight reception, and can also cope with high mobility of the receiving station. The IEEE 802.16e enables nomadic capabilities for laptops and other mobile devices, allowing users to benefit from metro-area portability of an xDSL-like service. Multimedia services by definition require the transmission of multiple media streams such as video, still picture, music, voice, and text data. A combination of these media types provides a number of value-added services, including video telephony, E-commerce services, multiparty video conferencing, virtual office, and 3D video. 3D video, for example, provides more natural and immersive visual information to end users than standard 2D video. In the near future, certain 2D video application scenarios are likely be replaced by 3D video in order to achieve a more involving and immersive representation of visual information, and to provide more natural methods of communica- tion. 3D video transmission, on the other hand, requires more resources than conventional video communication applications. Different media types have different quality of service (QoS) requirements and enforce conflicting constraints on the communication networks. Still picture and text data are categorized as background services and require high data rates but have no constraints on transmission delay. Voice services, on the other hand, are characterized by low delay. However, they can be coded using fixed low-rate algorithms operating in the 5 24 kbps range. In contrast to voice and data services, low-bit rate video coding involves rates at tens to hundreds of kbps. Moreover, video applications are delay sensitive and impose tight constraints on system resources. Mobile multimedia applications play an important role in the rapid penetration of future communication services and the success of these communication systems. Even though the high transmission rates and service flexibility have made wireless multimedia communication possible over third/fourth-generation wireless communication systems, many challenges remain to be addressed in order to support efficient communications in multi-user, multi- service environments. In addition to the high initial cost associated with the deployment of third-generation systems, the move from telephony and low-bit rate data services to bandwidth- consuming third-generation services implies high system costs, as the latter consume a large portion of the available resources. Moreover, for rapid market evolvement, these wideband services should not be substantially more expensive than the services offered today. Therefore, efficient system resource (mainly the bandwidth-limited radio resource) utilization and QoS management are critical in third/ fourth-generation systems. Efficient resource management and the provision of QoS for multimedia applications are in sharp conflict with one another. Of course, it is possible to provide high-quality multimedia services with the use of a lot of radio resources and very strong channel protection. 
However, this is clearly inefficient in terms of system resource allocation. Moreover, the perceptual multimedia quality received by end users depends on many factors, such as source rate, channel protection, channel quality, error resilience techniques, transmission/processing power, system load, and user interference. Therefore, it is difficult to obtain an optimal source and network

228 Visual Media Coding and Transmission parameter combination for a given set of source and channel characteristics. The time-varying error characteristics of the radio access channel aggravate the problem. Section 7.2 provides a brief overview of recent developments in supporting QoS in wireless networks. Subsection 7.2.2 addresses the constraints and problems associated with wireless multimedia communications. Subsection 7.2.3 provides an overview of multimedia compression technologies for video and speech. Subsection 7.2.4 discusses multimedia transmission issues in wireless networks. Finally, Subsection 7.2.5 discusses resource management strategies in wireless multimedia communications. 7.2 State of the Art: Wireless Multimedia Communications 7.2.1 QoS in Wireless Networks In a wireless environment, providing multimedia applications that deliver text, audio, images, and video (often in real time) with a QoS guarantee is a complex problem, due to, for example, the limited bandwidth resources and time-varying transmission characteristics of wireless channels, caused by user mobility and interference. In addition, QoS provisioning has many interrelated aspects, such as resource allocation, call admission control, traffic policing, routing, and pricing. QoS in wireless networks may be supported at three levels: the connection level, the application level (or packet level), and the transaction level. Connection-level QoS is related to connection establishment and management, with parameters such as call-blocking probability, which measures service connectivity, and call-dropping probability, which measures service continuity during handoff [1]. New-call-blocking and existing-call-dropping probabilities have been considered central and critical QoS parameters in mobile networks [2,3]. Dropping an ongoing call is assumed to be more annoying than blocking a new call. Therefore, minimizing the call-dropping probability is usually a main objective in wireless system design. On the other hand, the goal of a network service provider is to maximize revenue by improving network resource utilization, which is usually associated with minimizing the call-blocking probability while keeping the call-dropping probability below a certain threshold. In order to have high capacity and an acceptable service quality, a tradeoff between channel quality, dropping (or outage), and blocking probability has to be made. An admission control algorithm is used to make this tradeoff. The purpose of the admission control is to admit or deny new users, new radio access bearers, or new radio links, for example due to a handover. The admission control should try to avoid overload situations and base its decisions on interference and resource measurements. Admitting a new call will degrade the link quality for existing calls. A handoff call is forced to be dropped when attempting a handoff to a target cell that does not have enough bandwidth to support the handoff call. Bandwidth reservation has been proposed, in order to maximize the efficient use of available bandwidth while giving higher priority to handoff calls. Some bandwidth is reserved for handoff calls and the rest is shared by both new and handoff calls. The amount of reserved bandwidth affects QoS performance as well as resource utilization. A network that provides some buffer bandwidth for expected handoff calls is likely to result in lower connection-dropping probability and hence better QoS [2]. 
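The guard-channel idea described above can be sketched as a simple admission rule together with a toy simulation of blocking and dropping probabilities. Both the rule and the traffic model (unit-bandwidth calls, Bernoulli departures, a fixed handoff share) are illustrative assumptions rather than a model proposed in this chapter.

import random

def admit(call_type, used, capacity, reserve):
    """Guard-channel admission rule: a handoff call is admitted whenever any
    bandwidth is free, while a new call is admitted only if the free bandwidth
    after admission stays at or above the amount reserved for handoffs."""
    free = capacity - used
    if call_type == "handoff":
        return free >= 1
    return free - 1 >= reserve

def simulate(capacity=20, reserve=2, arrivals=10000, p_handoff=0.3, p_depart=0.1):
    """Toy simulation estimating new-call blocking and handoff dropping probabilities."""
    used, blocked, dropped, new_calls, handoffs = 0, 0, 0, 0, 0
    for _ in range(arrivals):
        used -= sum(random.random() < p_depart for _ in range(used))   # call departures
        kind = "handoff" if random.random() < p_handoff else "new"
        if kind == "handoff":
            handoffs += 1
        else:
            new_calls += 1
        if admit(kind, used, capacity, reserve):
            used += 1
        elif kind == "handoff":
            dropped += 1
        else:
            blocked += 1
    return blocked / new_calls, dropped / handoffs

# Larger 'reserve' lowers the dropping probability at the cost of more blocking
print(simulate(reserve=0), simulate(reserve=2))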
Application-level QoS is related to the end user-perceived quality and is commonly considered in packet-switched networks. Parameters such as delay, delay jitter, packet loss rate, and throughput are used to describe application-level QoS. These metrics are most strongly influenced by network resources, buffer characteristics in the network entities, and the network protocols. Transaction-level QoS is expressed in terms of the probability of transaction completion and response time [1].

Future communication systems are intended to support a wide range of applications that require different levels of QoS. Compared to the single-service network, future multimedia applications require multiple-service operation in integrated network environments. The beyond-3G communication network concept involves the coexistence of all the different network technologies. Thus, heterogeneity will play a major role in future networks and operating system environments. Highly differentiated access technologies allow a user terminal to exploit a number of platforms to access heterogeneous services. Application data transmitted over heterogeneous networks will therefore experience systems with limited and varying capacity.

With increased user mobility, users will be able to seamlessly roam across multiple heterogeneous wireless networks while being connected to other users located in different wireless networks. In this case, the QoS received by an application will depend not only on the network in which the application user is located, but also on the network conditions at the other end of the connection. This increases the heterogeneity, making it more difficult to maintain an acceptable level of perceived quality, especially for video applications, where the received quality is highly susceptible to channel variations. Therefore, QoS management for multimedia traffic over heterogeneous networks is of paramount importance for satisfying the end user's quality requirements while enabling efficient resource utilization.

The access links can be wired or wireless, as illustrated in Figure 7.2. Several wireless access technologies are available, such as GSM, GPRS, UMTS, WLAN, and WiMAX [4].

Figure 7.2 Interworked heterogeneous networks
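The application-level parameters listed at the start of this subsection (delay, delay jitter, packet loss rate, and throughput) can all be estimated directly from a packet trace. The sketch below is illustrative only: it assumes synchronized sender and receiver clocks, a fixed payload size, and a simple mean-absolute-difference definition of jitter; the function name qos_metrics and the sample values are hypothetical.

```python
# Estimate application-level QoS metrics from per-packet send/receive timestamps.

def qos_metrics(sent, received, payload_bytes):
    """sent/received: {sequence number: timestamp in seconds}; lost packets
    are simply absent from `received`."""
    delays = [received[s] - sent[s] for s in sorted(received)]
    mean_delay = sum(delays) / len(delays)
    # Jitter as the mean absolute variation between consecutive transit times
    jitter = (sum(abs(a - b) for a, b in zip(delays[1:], delays)) / (len(delays) - 1)
              if len(delays) > 1 else 0.0)
    loss_rate = 1.0 - len(received) / len(sent)
    duration = max(received.values()) - min(sent.values())
    throughput_bps = 8 * payload_bytes * len(received) / duration
    return mean_delay, jitter, loss_rate, throughput_bps


if __name__ == "__main__":
    sent = {1: 0.00, 2: 0.02, 3: 0.04, 4: 0.06}
    received = {1: 0.05, 2: 0.08, 4: 0.10}   # packet 3 was lost in transit
    print(qos_metrics(sent, received, payload_bytes=1200))
```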

In multimedia transmission over these heterogeneous wireless access technologies, problems may arise due to:

• Differences in available data rates, for example up to 2 Mbps in UMTS, 54 Mbps in WLAN, and 70 Mbps in WiMAX.
• QoS discrepancies.
• Time-varying transmission characteristics of wireless channels, due to user mobility and interference [5].

The end-to-end QoS will be complex to compute, due to the differences in the amount and quality of radio network resources and in the QoS schemes deployed by the individual wireless networks. Since connections span several subnetworks, QoS will need to be supported across all of them. It will be necessary to set up the required resources so that the expected end-to-end QoS can be achieved, by negotiating suitable traffic and QoS parameters and optimizing the radio network resources accordingly. In addition, dynamic adaptation mechanisms that take into account the time-varying characteristics of wireless channels will also be necessary [6,7].

Figure 7.3 illustrates communication between two end systems communicating through possibly heterogeneous subnetworks. Typically, when QoS guarantees are provided, they refer to parameters at a particular protocol stack layer [8]. For example, a conversational connection may be guaranteed some maximum delay and packet loss rate. Such QoS parameters need to be translated into parameters that are meaningful to the upper layers. For example, the maximum delay and packet loss rate guarantees must eventually be interpreted at the application layer in terms of the proportion

Figure 7.3 Communication between two end nodes (taken from [8])
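As a rough illustration of the two points discussed above, composing per-subnetwork guarantees into an end-to-end figure and interpreting a packet-level guarantee at the application layer, the following sketch assumes additive per-hop delays, independent packet losses, and a video frame that is unusable whenever any of its packets is lost (i.e. no error concealment). The function names and the numerical values are hypothetical and are not taken from the text.

```python
# Compose per-subnetwork QoS guarantees into end-to-end figures and map a
# packet loss rate onto an application-level proportion of affected frames.

from math import prod


def end_to_end(delays_ms, loss_rates):
    """Compose per-hop guarantees: delays add up along the path, and a packet
    must survive every hop, so losses combine multiplicatively."""
    total_delay = sum(delays_ms)
    survival = prod(1.0 - p for p in loss_rates)
    return total_delay, 1.0 - survival


def frame_loss_proportion(packet_loss_rate, packets_per_frame):
    """Proportion of video frames hit by at least one lost packet,
    assuming independent packet losses and no concealment."""
    return 1.0 - (1.0 - packet_loss_rate) ** packets_per_frame


if __name__ == "__main__":
    # Three hypothetical subnetworks along the connection path
    delay, loss = end_to_end([20.0, 5.0, 35.0], [0.01, 0.001, 0.02])
    print(f"end-to-end delay = {delay:.1f} ms, packet loss = {loss:.3%}")
    print(f"frame loss proportion = {frame_loss_proportion(loss, 10):.3%}")
```

Even this crude mapping shows why a seemingly modest per-subnetwork packet loss guarantee can translate into a much larger proportion of damaged frames at the application layer, which is why the layer-to-layer translation of QoS parameters matters.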

