Visual Media Coding and Transmission


Alliance (OMA) DRM [58], and TV Anytime [59], but there are also many industry solutions such as Windows Media DRM 10 [60]. They all use rights expression languages (RELs) to define the rights that a user has over a digital work, and the restrictions that have to be applied to its usage. The most relevant rights expression languages are MPEG-21 REL, based on the eXtensible rights Markup Language (XrML) [61] proposed by ContentGuard Inc. [62], and the Open Digital Rights Language (ODRL) [63] proposed by Iannella from IPR Systems [64]. XrML and ODRL are syntactically based on XML, while structurally they both conform to the axiomatic principles of rights modeling, which were first laid down by, among others, Dr Mark Stefik of Xerox PARC, the designer of the Digital Property Rights Language (DPRL) [65].

11.4.3 The New "Adaptation Authorization" Concept

As the two aforementioned areas of research were developed separately, it is currently impossible to govern content adaptations, owing to the lack of descriptions of permissible conversions. Only very recently did the two groups of researchers working on adaptation and DRM start to cooperate in jointly defining approaches and methodologies to combine their outcomes into a single framework. In this framework, adaptation operations are subjected to restrictions based on the content owner's rights, i.e. content adaptation is governed by the content owner's rights, in addition to the constraints imposed by terminals, networks, natural environments, and users. This framework thus introduces a new concept, adaptation authorization, which can be seen as a new form of contextual information.

Not surprisingly, the joint effort between these two research fields has also emerged within the MPEG working community. In fact, MPEG already addressed the two issues separately within the MPEG-21 standard [30,31]. As each area evolved during standardization, it became clear that some kind of integration was crucial. In multimedia networks where digital rights are governed, providers can protect the distribution and use of their content by means of the standardized REL [66] and Rights Data Dictionary (RDD) [67] (Parts 5 and 6 of MPEG-21, respectively). However, as adaptation becomes more and more important in multimedia systems, more detailed descriptions of permissible conversions are needed in order to be able to govern content adaptations. The first amendment of MPEG-21 DIA [68] provides the description of fine-grained media conversions by means of the conversion operations' names and parameters, which can be used to define rights expressions that govern adaptation in an interoperable way. This amendment is mainly the result of the work done in the context of two European projects: DANAE [40], discussed in more detail in Section 11.2.2.2, and ADMITS (Adaptation in Distributed Multimedia IT Systems) [69].

In Section 11.6.7 we will go into further detail on how adaptation authorization can be achieved, but first we will give a brief overview of how rights can be expressed by means of licenses: the fundamental units of communication in the rights domain. Figure 11.15 shows a license, which is divided into two main parts: a set of grants and the issuer of these grants.
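To make the structure concrete, the sketch below models a license as an issuer plus a set of grants, each grant carrying a principal, a right, a resource, and a condition. The class names, field names, and the date-interval condition are simplified illustrations, not the normative MPEG-21 REL XML elements; a real REL license is an XML document validated against the MPEG-21 REL schema, and only the grant/issuer data model is mirrored here.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Grant:
    principal: str   # to whom the grant is issued (e.g. "Alice")
    right: str       # the right the grant specifies (e.g. "play")
    resource: str    # the resource the right applies to (e.g. "image.jpg")
    condition: dict  # restrictions, here a simple validity interval

@dataclass
class License:
    issuer: str      # the party issuing the grants (e.g. "Bob")
    grants: list     # one or more Grant objects

def is_authorized(lic, principal, right, resource, when):
    """Return True if some grant in the license covers the requested act on the given date."""
    for g in lic.grants:
        if (g.principal, g.right, g.resource) == (principal, right, resource):
            start = g.condition.get("valid_from")
            end = g.condition.get("valid_until")
            if (start is None or when >= start) and (end is None or when <= end):
                return True
    return False

# Bob allows Alice to play "image.jpg" during April.
lic = License("Bob", [Grant("Alice", "play", "image.jpg",
                            {"valid_from": date(2008, 4, 1), "valid_until": date(2008, 4, 30)})])
print(is_authorized(lic, "Alice", "play", "image.jpg", date(2008, 4, 15)))  # True
```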
Figure 11.15 REL data model

In the example illustrated in this figure, Bob is the issuer of the license (and probably the owner or distributor of the content, named "image.jpg"), and he expresses, by means of a grant, that he allows Alice (the principal) to play his content during the month of April. The MPEG-21 REL data model for a rights expression includes four basic entities, and the basic relationship among these entities is defined by the MPEG-21 REL assertion grant. Structurally, a grant consists of the following elements:

- The principal to whom the grant is issued.
- The right that the grant specifies.
- The resource to which the right in the grant applies.
- The condition that must be met before the right can be exercised.

A grant, by itself, is not a complete rights expression that can be transferred unambiguously from one party to another. A full rights expression is called a license. As mentioned above, a typical license consists of one or more grants and an issuer, the party who issued the license. In order to follow this structure and guarantee interoperability, MPEG-21 DIA conversion permissions have to be integrated within the "Condition" field of each grant. The details of this integration will be discussed in Section 11.6.7.

To date, there has been no real implementation of adaptation authorization, but many of the projects currently working with MPEG-21 DIA (DAIDALOS [70], aceMEDIA [71], etc.) have earmarked this point as possible future work. Other projects, such as AXMEDIS [72] and the second part of Projecte Integrat [73], named "Machine", are also beginning to conduct research in this area. The most advanced publication on the subject is [74], in which a very interesting use case can be found, illustrating a complex UMA scenario that justifies the need for conversion and permission descriptions, as well as giving some detailed examples of them.

11.4.4 Adaptation Decision

ADEs can be considered the intelligent part of content adaptation systems. Their goal is to make a decision regarding the actions to be performed when contextual information is available.

Thus, they provide an implementation of the phase "sensing higher-level context", as defined in Section 11.2.1.1. Within the scope of MPEG-21, an ADE realizes the basic contextual information as constraints imposed by the delivery and consumption environment. Also using descriptions of the service to be paid for or the content to be delivered, in terms of technical characteristics such as encoding format, encoding rate, and spatial or temporal resolution, it implements an optimization algorithm to select the set of characteristics that satisfies the constraints. As the ultimate goal of adaptation is to improve the quality of experience of the user, this optimization is usually driven by a utility value that represents the user's satisfaction. Accordingly, an ADE needs to receive not only the low-level contextual information, but also sets of media characteristics that provide the technical parameters with which the content is encoded (corresponding to different flavors of the same content), as well as a utility value that quantifies the degree of satisfaction of the user with each set of technical parameters.

The fact that the ADE receives sets of values of encoding parameters should not be seen as imposing any kind of restriction upon the type of adaptation operation that may be performed as a result of the decision-taking process. In fact, it does not require that the adaptation be performed by transcoding or trans-rating the media resource. For example, if the original DI is composed of video and audio, and among the available sets of media characteristics there is one that has only video encoding parameters and text, then this implies that the adaptation operation needs to be a voice-to-text transformation (and possibly also a rate transformation of the video). Thus, it will be up to the ADE to reason about the low-level contextual information it receives and infer higher-level context. Furthermore, in this example, if the description of the natural environment indicates a rather high surrounding noise level, the ADE might also opt to select the set of parameters that does not include audio. Current implementations of ADEs do not address this kind of behavior. They only provide the functionality to compare similar sets of parameters and select the one that maximizes the utility.

Generally speaking, the adaptation decision process can be seen as a problem of resource allocation in a generalized rate-distortion or resource-distortion framework. It is mainly an optimization problem that operates in a three-dimensional space: content, resources, and utility, as represented in Figure 11.16. The content space indicates the possible variations of the content that can be offered (or, equivalently, the possible content adaptation operations that can be performed). The resources space represents the characteristics and limitations of the current consumption environment. It describes the consumption environment in terms of the availability of resources at the terminal and in the networks, the conditions of the natural surrounding environment, and the user preferences or requirements. It can thus be seen as the space dictating the initial rules by which to choose among the available content variations, by imposing certain constraints. It allows for the elimination of variations that do not meet the described constraints.
Finally, the utility space provides values that quantify the degree of satisfaction of the user with each of the available variations of the content. It can thus be used to enable a finer-grained selection among the subset of variations that have initially satisfied the constraints imposed by the resources space. For a given flavor of the content (a possible adaptation of the content), a set of resources is selected from among those available so as to minimize the distortion introduced or, in other terms, to maximize the utility. This distortion can be a measure of the degradation of the quality of the adapted content (correspondingly, the utility is the level of quality of the content). It can also be another measure that reflects the degree of satisfaction of the user, or indeed any other metric that reflects some preference of the user, such as the cost that the user will have to pay for the adapted service.
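A minimal sketch of this two-stage selection, assuming each available content variation is described by a flat dictionary of encoding parameters plus a utility value (the parameter names are illustrative and do not follow AQoS syntax):

```python
def decide(variations, constraints):
    """Pick the content variation that satisfies all resource constraints and maximizes utility.

    variations:  list of dicts, e.g. {"bitrate_kbps": 512, "width": 640, "utility": 0.8}
    constraints: dict of upper bounds taken from the usage environment, e.g. {"bitrate_kbps": 384}
    """
    # Stage 1: the resources space eliminates variations that violate any constraint.
    feasible = [v for v in variations
                if all(v.get(key, 0) <= limit for key, limit in constraints.items())]
    if not feasible:
        return None  # no admissible adaptation; a fallback policy would be needed here
    # Stage 2: the utility space refines the choice among the surviving variations.
    return max(feasible, key=lambda v: v["utility"])

variations = [
    {"bitrate_kbps": 1024, "width": 704, "modality": "video+audio", "utility": 1.0},
    {"bitrate_kbps": 384,  "width": 320, "modality": "video+audio", "utility": 0.7},
    {"bitrate_kbps": 128,  "width": 320, "modality": "video+text",  "utility": 0.5},
]
print(decide(variations, {"bitrate_kbps": 384, "width": 320}))  # selects the 0.7-utility variation
```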

Figure 11.16 The representation of the three different spaces for adaptation decision

In [24], the problem of defining utility measures is discussed. The author argues that there is no universal solution, due to the complex nature of utility and its dependence on a number of subjective factors, such as the nature of the content itself and the characteristics of the user (for example, user 1 may consider content A and content B at bit rate r1 to be of very good and of satisfactory quality respectively, whereas user 2 considers content A to be of medium quality only). Accordingly, it is concluded that this is still an open issue, which is currently being studied within the concept of quality of experience (QoE).

An early research work addressing the context-aware adaptation of content [37] presents a framework using an info pyramid, where different variations and modalities of the content are represented at different levels of fidelity. From this pyramid, a customizer selects the best pair of variation and modality so as to meet the constraints of the usage environment. The focus of the work is on adapting Web documents or applications composed of multiple media types to suit different terminals with various capabilities. The system architecture proposed in this work is presented in Figure 11.17. Although this approach is somewhat rigid, its concepts are quite useful when the objective is to adapt a complete presentation composed of different media types. It could be complemented with the approach based on the three-dimensional space. The info pyramid provides different modalities of a given content item along the horizontal axis, where the most demanding modality, in terms of required resources, is placed in the left corner. Along the vertical axis, it provides variations of each of those modalities, starting with the highest available quality variation at the bottom. Figure 11.18 illustrates an example info pyramid.
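The customizer's selection over such a pyramid might be sketched as follows; the modalities, fidelity levels, and bit-rate figures are invented for illustration and do not reproduce the system of [37]:

```python
# Info pyramid: modalities ordered from most to least demanding (horizontal axis),
# each with variations ordered from highest to lowest fidelity (vertical axis).
# Each entry carries the resources it requires, here reduced to a single bit-rate figure.
pyramid = {
    "video": [{"level": 0, "kbps": 1500}, {"level": 1, "kbps": 500}, {"level": 2, "kbps": 150}],
    "image": [{"level": 0, "kbps": 80},   {"level": 1, "kbps": 20}],
    "text":  [{"level": 0, "kbps": 2}],
}
MODALITY_ORDER = ["video", "image", "text"]  # most demanding first

def customize(pyramid, available_kbps):
    """Return the most demanding modality, at the highest fidelity, that fits the budget."""
    for modality in MODALITY_ORDER:
        for variation in pyramid[modality]:          # highest fidelity first
            if variation["kbps"] <= available_kbps:
                return modality, variation
    return None

print(customize(pyramid, 400))  # ('video', {'level': 2, 'kbps': 150})
```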

Figure 11.17 Internet content adaptation system architecture based on the info pyramid concept [37]

In more recent work [38], a three-dimensional space approach has been presented using the designation "adaptation-resource-utility". Figure 11.19 illustrates the concept of this three-dimensional space for the content adaptation decision. For a given possible adaptation of the content, a set of resources is selected from among those available so as to minimize the distortion introduced or, in equivalent terms, to maximize the utility. In this research work, different case studies are described, for which different utilities are developed to drive the selection of the adaptation operation. The use of the resource space is essentially restricted to resources that are directly related to the technical specificities of the adaptation operation upon a given media content (for example, "available network bandwidth" as the resource and "frame dropping" as the adaptation). This work is focused on establishing relationships between the spaces from the perspective of video adaptation. Another aspect in which this work diverges from that discussed in this chapter is that the selection of the adaptation is driven by the minimization of the used resources (such as bandwidth) and the maximization of the utility.

Figure 11.18 Example of an info pyramid for a video item [37]

Figure 11.19 The "adaptation-resource-utility" space for content adaptation [38]

In this chapter, on the other hand, the adaptation decision is described as being initially driven by the constraints imposed by the context of usage, and then further refined by the maximization/minimization of the utility. Nonetheless, this model is sufficiently generic to allow for the description of a number of different constraints in the resource space and the establishment of different mapping relations between the spaces.

The MPEG-21 DIA tools, together with MPEG-7, can be used to represent the above-mentioned three spaces (refer to Section 11.2.2 for a succinct description of these standards). As indicated above, the content space provides structural metadata about the content for each possible variation or flavor. More precisely, it describes the technical parameters with which the content is encoded in order to provide the specified variation. Accordingly, this space can be implemented via the MPEG-7 MDS media characteristics (see Section 11.2.2). The resources space provides information about the characteristics, capabilities, and conditions of the whole delivery and consumption environment, which are used to determine the constraints imposed on the service. The MPEG-21 DIA UED tool is thus adequate for implementing this space, together with the UCD tool, as UCD can be used to express specific limitations or optimization constraints based on the UED-specific characteristics in order to facilitate the adaptation decision. Finally, the utility space is the vehicle through which the ADE is able to formulate a decision by reasoning about the contextual information present in the content and resources spaces. Basically, it achieves this goal by assigning a utility to each set of technical parameters with which the content can be encoded. As such, the MPEG-21 DIA AQoS tool is adequate for employment here. In fact, as the AQoS tool provides the mechanism to describe the relations between the content space and the utility space, it also incorporates the content space by making use of the MPEG-7 MDS media characteristics referred to.

Figure 11.20 illustrates the conceptual architecture of an adaptation decision framework, implementing the three-dimensional space approach through the use of the MPEG-21 description tools referred to. It also shows the high-level architecture of an ADE, which can be seen as a service by other modules of a context-aware system, as well as small examples of the metadata that it uses. This ADE is quite generic, as it can accept metadata in different formats.

Figure 11.20 ADE framework based on MPEG-21

The module that wants to use the ADE can completely specify how the provided metadata is to be used in the decision-taking process. In some implementations, this module can be the entity or process responsible for monitoring the quality of the service being offered to the user. In other cases, this functionality can be incorporated within the ADE. The important aspect to highlight here is that this ADE can potentially be used in many different application scenarios and by different external modules, regardless of whether formats and rules are externally supplied or internally generated. The use of XSLT provides the ADE with great flexibility, while decoupling it from other components. For example, it is possible to seamlessly use different Adaptation Decision Taking Engine (ADTE) components, which implement different search strategies, accept different UEDs, and use different forms of transforming the UEDs into UCDs. The output of the ADE is a set of "name-value" pairs selected from among the AQoS descriptors originally provided. This output is used to configure different resources, including the encoding parameters of AEs. In the current implementation, this configuration is done using Web Services and associated Simple Object Access Protocol (SOAP) messages [75].

This type of ADE was developed under the VISNET I NoE project [76]. It did not take into consideration the information concerning the protection of the content and the authorization of adaptation operations. In addition, it was able to provide an adaptation decision for one medium only, not taking into account aspects of adapting a composition of multiple media components. However, given its high versatility, it forms the basis of the work conducted within the VISNET II NoE project.
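The shape of the hand-over from ADE to AE described above might look as follows; the parameter names and the plain dictionary stand in for the AQoS-derived descriptors and SOAP payloads of the actual platform, which are not reproduced here:

```python
# A decision handed from the ADE to an AE as simple name-value pairs.
# The names below are illustrative; the real system carries them in SOAP messages
# built from the AQoS descriptors originally supplied to the ADE.
adaptation_decision = {
    "codec": "H.264/AVC",
    "bitrate_kbps": 384,
    "frame_rate": 15,
    "width": 320,
    "height": 240,
}

class AdaptationEngine:
    """Toy AE that configures its encoder from the ADE decision."""
    def configure(self, decision: dict) -> None:
        self.settings = dict(decision)
        print("AE configured with:", self.settings)

AdaptationEngine().configure(adaptation_decision)
```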

11.4.5 Context-based Content Adaptation

Content adaptation is the process of converting the media available from the content provider into a format that can be consumed by the user. An AE performs this operation as instructed by the ADE. Content adaptation in multimedia processing research has primarily been realized in the form of video adaptation, since, compared to the speech and audio components of multimedia content, video requires special attention for its coding, processing, and transmission over access networks.

Most of the video adaptation techniques that have been discussed in the literature address network constraints. Bit-rate adaptation of MPEG-4 visual-coded bitstreams by frame dropping (FD), AC and/or DCT coefficient dropping (CD), and their combinations (FD+CD) has been discussed in [77]. A utility function (UF) has been used to model the video entity, adaptation, resource, utility, and the relations among them. Each video clip is classified into one of several distinctive categories, and local regression is then used to accurately predict the utility value. Techniques reported in [78–81] consider only frame dropping as a means of bit-rate adaptation. A more intelligent frame-skipping technique has been presented in [82]. This technique determines the best set of frames (key frames) to represent the entire sequence. The proposed technique utilizes a neural network model that is capable of predicting the indices of the most appropriate subsequent key frames. In addition, a combined spatial and temporal technique for multidimensional scalable video adaptation has also been discussed in [83]. A framework for video adaptation based on content recomposition is presented in [84]. The objective of the proposed technique is to provide effective small-size videos that emphasize the important aspects of a scene while faithfully retaining the background context. This is achieved by explicitly separating different video objects based on a generic video attention model that extracts the objects in which a user is interested. Three types of visual attention feature, namely intensity, color, and motion, have been used in the attention model. Subsequently, these objects are integrated with the directly-resized background to optimally match the specific screen sizes under consideration.

However, the aforementioned techniques do not consider user preferences when determining the nature of adaptation. The technique presented in [85] extracts the highlights of sports videos according to user preferences. The system is able to extract highlights such as shots at goal, free kicks, and so on for soccer, and start, arrival, and turning moments for swimming scenes. Objects and events are assigned to different classes of relevance. The user can assign a degree of preference to each class, in order to obtain the best quality-cost tradeoff for the classes most relevant to their interests, at the price of a lower quality for the least relevant ones. The adaptation module performs content-based video adaptation according to the bandwidth requirements and the weights of the classes of relevance. The work in [86] considers user preferences together with network characteristics for the adaptation of sports videos. Events are detected by audio/video analysis, and annotated by the DSs provided by the MPEG-7 MDSs.
Subsequently, user preferences for events and network characteristics are considered in the adaptation of the videos through event selection and frame dropping.

An effective way of performing content adaptation is to utilize adaptation operations in the networks, owing to advantages such as transparency and better utilization of network bandwidth resources. Scalable coding technologies help to simplify the function of the network element that carries out the adaptation operation. Adaptation is particularly needed when compressed media streams traverse heterogeneous networks. In such cases, a number of content-specific properties of the coded multimedia information require adaptation to the new conditions imposed by the different networks and/or terminals in order to retain an acceptable level of service quality.
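As a toy illustration of the simplest such in-network operation, the sketch below drops frames uniformly until an estimated target bit rate is met; the FD/CD schemes of [77–82] operate on coded frame types and rate models rather than this naive average:

```python
def drop_frames(frame_sizes_bits, frame_rate, target_kbps):
    """Keep every n-th frame so the average bit rate falls below the target.

    frame_sizes_bits: per-frame coded sizes of the input sequence (in bits)
    Returns the indices of the retained frames.
    """
    total_bits = sum(frame_sizes_bits)
    duration_s = len(frame_sizes_bits) / frame_rate
    current_kbps = total_bits / duration_s / 1000
    # Uniform temporal decimation factor needed to reach the target rate
    # (assumes dropped frames free their bits proportionally).
    keep_every = max(1, round(current_kbps / target_kbps))
    return [i for i in range(len(frame_sizes_bits)) if i % keep_every == 0]

# 30 fps sequence at roughly 900 kbps, adapted towards 300 kbps.
sizes = [30000] * 300
kept = drop_frames(sizes, frame_rate=30, target_kbps=300)
print(len(kept), "of", len(sizes), "frames kept")  # 100 of 300
```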

Network-based adaptation mechanisms can be employed at the edges or other strategic locations of different networks, using a fixed-location content adaptation gateway, node, or proxy, as in conventional networking strategies [87–90]. Alternatively, content adaptation through transcoding can be performed dynamically wherever and whenever needed, using active networking technologies [91,92].

Not only do the network- and/or user terminal-based characteristics impose adaptation needs on the accessed/delivered content, but the users themselves also play a major role in choosing the way the content is distributed. For instance, a user may wish to select a specific area of the visual content that draws their main attention. Thus, they may want to access a part of the video scene based on their selection. In addition to this, or in a totally isolated situation, the terminal that the user is using may have a restricted display capability with a lower resolution than that of the originally-encoded content. Moreover, the access network that the user is connected to may not be able to support elaborate visual information transfer, due to bandwidth limitations and/or other channel-specific characteristics. All of these add up to the profiling of a use case for this particular user, and the different display capabilities, attention-area selection preferences, access network-based features, and so on provide the necessary context elements for this use case. The content adaptation strategies and mechanisms discussed within this chapter aim to implement user-centric content adaptation operations, which will ultimately provide the user with the best possible experience of the service they have requested. This goal can only be realized if the content access/distribution is effectively decoupled from the service-related limitations, which in turn makes the service delivery transparent to the user. Under such circumstances, the different factors, all of which can be referred to as context information, collectively affect the adaptation of the content, and can provide guidance on how to perform the best possible adaptation for each and every use case.

A number of content adaptation tools will be described in subsequent sections, with a view to addressing the needs of the application scenario described in Section 11.6. The main focus is placed on context-based methods for user-centric content adaptation with management of digital rights. While discussing the issues related to this objective, a region of interest (ROI) selection by the user is assumed to form a driving context element for developing an ROI-based user-centric content adaptation tool. ROI selection provides a key advantage during content adaptation through transcoding, as it identifies a visually-important area or object in the digital video. The advantage is particularly significant when high-resolution video services are distributed across a wide range of heterogeneous user terminals with diverse display capabilities [93]. Selecting an ROI in video content allows a content adaptation gateway (e.g. one operating through transcoding) to accurately reformat the resolution of the input video while focusing on the main region or object of visual attention, as requested by the user.
In this way, the AE is able to reorganize the predefined scene priorities, allowing for unequal video parameter allocation to different parts of a scene based on their perceptual qualities [94,95]. Various methods for determining an area of visual attention have been presented in the literature to date. These methods have been exploited to develop a number of algorithms for ROI selection [96–98]. However, most of these algorithms were employed to select an ROI in the pixel domain during the encoding of a video sequence [98–100]. Therefore, they are not well suited to network-based adaptation operations for heterogeneous video access scenarios requiring quick system responses.

Recent research has focused on finding the ROI in the coded domain in order to allow for a number of fast applications, such as transcoding systems, object detection, tracking and identification techniques, image and video retrieval/summarization schemes based on MPEG-7 descriptors, event detection, AV content analysis and understanding tools, and so on [51,94,101–105].

The reorganization of the content at a gateway in the network has to be context-driven, depending on either the user preferences or the network conditions. Focusing on the main region or object of attention could be implemented by separating the source stream into substreams and varying the source and channel rates for these streams, so as to provide better error protection to the stream carrying the selected attention area. Other streams can be assigned lower source and channel coding rates. This facility could also be useful in situations where the network is experiencing congestion, or where a user is in an area with weaker signal reception and adequate bandwidth is not available for higher-quality content access. Here, unequal rate allocation to different regions of the stream could provide better quality for the selected region of the video scene. The network gateway could sense network conditions, and resources could be allocated on a priority basis to different regions of the video content. Optimizations in the allocation of resources for video applications over mobile networks have been used for transmission power control and improved visual quality [106,107].

Scalable video coding (SVC) has been identified as a feasible video adaptation technique that fits within the MPEG-21 DIA architecture. A number of scalability options have been discussed in the literature, namely spatial, temporal, fidelity, and interactive ROI (IROI) [108]. If the coded video features one or more of the aforementioned scalability options, the adaptation operation is as simple as letting the set of coding units that define the adapted bitstream pass through the AE and discarding the rest. This technology has been available in video coding standards, such as MPEG-4, for many years. Nevertheless, it remains underutilized for several reasons, such as the excessive demand for computational resources at the encoder, coding efficiency, and delay. Furthermore, user-generated content cannot be expected to be always scalable, due to the use of low-cost hardware and software by some content providers during the content production cycle.

IROI adaptation is a vital ingredient in user-centric adaptation. A user may wish to view a selected ROI from the high-resolution video on their low-resolution display, rather than a low-resolution version of the entire frame. For instance, this scenario frequently occurs in security and surveillance applications [109]. Spatial cropping of each video frame in the video sequence (sequence-level cropping) is necessary to address such adaptation requests. Leaving the decoder to handle this adaptation is not an ideal choice, since it will not only misuse precious network bandwidth, but will also demand more computational resources at the user terminal. Furthermore, if the access network has bandwidth limitations in particular, the overall concept of decoder-driven scalability or adaptation becomes totally unfeasible. The SVC extension of H.264/AVC [110] makes provision for user-driven IROI adaptation.
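As a simple illustration of what a sequence-level cropping request involves, the following sketch derives a macroblock-aligned crop window around a user-selected ROI for a given target display; the 16-pixel alignment rule and the sizes used are illustrative assumptions, not the normative extraction procedure of the SVC extension:

```python
def crop_window(roi, frame_size, display_size, mb=16):
    """Compute a macroblock-aligned crop window around the ROI sized to the target display.

    roi:          (x, y, width, height) selected by the user, in pixels
    frame_size:   (width, height) of the coded frame
    display_size: (width, height) of the target display
    """
    fx, fy = frame_size
    dw, dh = display_size
    x, y, w, h = roi
    # Centre a display-sized window on the ROI centre.
    cx, cy = x + w // 2, y + h // 2
    left = min(max(cx - dw // 2, 0), fx - dw)
    top = min(max(cy - dh // 2, 0), fy - dh)
    # Snap to macroblock boundaries so whole coding units can be extracted or transcoded.
    left, top = (left // mb) * mb, (top // mb) * mb
    return left, top, dw, dh

# 1280x720 source, user ROI around (600, 300), QVGA display.
print(crop_window(roi=(600, 300, 200, 150), frame_size=(1280, 720), display_size=(320, 240)))
# (528, 240, 320, 240)
```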
In the SVC extension of H.264/AVC, this capability is formally identified as IROI scalability [111,112]. The scalability is achieved by coding non-overlapping rectangular regions (tiles) of a frame into independently decodable entities, called network abstraction layer (NAL) units (NALUs), using flexible macroblock ordering (FMO). An AE can utilize IROI scalability to extract a substream that provides enhanced visual quality over the ROI [109]. However, a sequence-level cropping operation can be performed at an AE only if cross-tile temporal prediction is restricted, which drastically affects the overall compression efficiency. A similar IROI scalability technique has also been proposed by Lambert et al. [113].

In this work, FMO in H.264/AVC has been utilized to code an ROI into NALUs, and therefore this technique also suffers from the limitations of the SVC extension of H.264/AVC IROI scalability, as discussed above. Consequently, transcoder-based adaptation is a necessity for serving such scenarios. This chapter discusses a platform for accomplishing such adaptations on both MPEG-4- and H.264/AVC-coded video streams, and presents experimental results on the effectiveness of the described adaptation tools for context-based user-centric content adaptation. The adaptation tools under view are utilized as the AE, and form an integral part of the content adaptation block shown in Figure 11.21 [114]. Here:

1. The user specifies a selected ROI in a feedback message to the service provider.
2. The service provider consults an ADE.
3. The ADE determines the type of adaptation needed after processing the available context descriptors. The context descriptors describe the user-defined ROI and other constraints, such as terminal capabilities, access network capabilities, usage environment, DRM, and so on.
4. The relevant adaptation decision is then passed on to the AE.

Figure 11.21 User-centric ROI-based content adaptation architecture
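Expressed as rough glue code between hypothetical components, the four steps above might read as follows; every class and method name here is invented for illustration and does not come from the referenced architecture [114]:

```python
# Hypothetical glue code tracing the four steps above.
class ServiceProvider:
    def collect_context(self, user_id, roi):
        # 1. The user's ROI selection arrives as feedback and is merged with the other
        #    context descriptors (terminal, network, usage environment, DRM, ...).
        return {"user": user_id, "roi": roi, "display": (320, 240), "bandwidth_kbps": 384}

class AdaptationDecisionEngine:
    def decide(self, context):
        # 2-3. The service provider consults the ADE, which reasons over the context
        #      descriptors and determines the type of adaptation needed.
        w, h = context["display"]
        return {"operation": "crop_and_scale", "crop": context["roi"],
                "out_size": (w, h), "bitrate_kbps": context["bandwidth_kbps"]}

class AdaptationEngine:
    def adapt(self, decision):
        # 4. The AE executes the decision on the media stream (omitted here).
        return f"adapted stream: {decision}"

sp, ade, ae = ServiceProvider(), AdaptationDecisionEngine(), AdaptationEngine()
decision = ade.decide(sp.collect_context("alice", roi=(100, 50, 400, 300)))
print(ae.adapt(decision))
```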

An AE that focuses on content adaptation based on a method for optimized source and channel rate allocation is presented in Section 11.6.8.1. Then an AE that carries out the sequence-level cropping-type recommendations specified in the adaptation decision message, in order to provide ROI-based content adaptation, is described in Section 11.6.8.2. Moreover, adaptation based on scalable video content is also reported, in Section 11.6.8.3.

11.5 Generation of Contextual Information and Profiling

The first part of this section presents standardized descriptions for context in a generic multimedia scenario. This is followed by a description of how to represent contextual information based on profiles. As will be explained, the intention of using profiles is to promote interoperability and facilitate the use of context in real-world applications. This section then discusses the problem of how gathering contextual information can affect a user's privacy. Finally, it provides details regarding the generation and aggregation of contextual information from diverse sources.

11.5.1 Types and Representations of Contextual Information

(Portion reprinted, with permission, from V. Barbosa, A. Carreras, H. Kodikara Arachchi, S. Dogan, M.T. Andrade, J. Delgado, A.M. Kondoz, "A scalable platform for context-aware and DRM-enabled adaptation of multimedia content", in ICT-Mobile Summit 2008 Conference Proceedings, Paul Cunningham & Miriam Cunningham (Eds), IIMC International Information Management Corporation, 2008. © 2008 IIMC Ltd and ICT Mobile Summit.)

Contextual information has a very broad definition. As seen in Section 11.2.1.1, the term context can be applied to many different aspects and characteristics of the complete delivery and consumption environment. The discussion of the context-aware content adaptation platform in this chapter is based on the assumption that a standardized representation of the contextual information is available. This aspect is considered instrumental to enabling interoperability among systems and applications, and across services. As introduced in Section 11.4.4, MPEG-21 DIA seems to be the most complete standard, and as such the ideal choice for any system that expects wider visibility. MPEG-21 DIA defines the UED, which is a full set of contextual information that can be applied to any type of multimedia system, as it assures device independence. The UED includes a description of terminal capabilities and network characteristics, as well as User and natural environment characteristics (in MPEG-21, a User with a capitalized "U" can be a person, a group of persons, or an organization). All of these elements can be seen in Table 11.1.

User and natural environment characteristics are possibly the most relevant and innovative of the UED subsets. As we have seen in Sections 1.4.1 and 1.4.4, the majority of standards that have been developed to describe contextual information for content adaptation concentrate their efforts on terminal and network capabilities. To date, the specifications drawn up for UMA have had several limitations, as they focus too much on network and terminal restrictions while ignoring the improvement of the User experience [55,115]. Nowadays, researchers are starting to concentrate on filling the gap between the content and the User (and MPEG-21 DIA is a clear example); thus the ultimate driver of the adaptation operations is no longer only the terminal or the networks, but also the User, together with their surrounding environment.

Table 11.1 MPEG-21 DIA: UEDs

User Characteristics
* UserType/UserCharacteristics
* UserInfo (MPEG-7, AgentType)
* UsagePreferences (MPEG-7, UserPreferences)
* UsageHistory (MPEG-7, UsageHistory)
* AudioPresentationPreferences (volume, frequency equalizer settings, audible frequency ranges, etc.)
* DisplayPresentationPreferences (color temperature, brightness, saturation, contrast, 2D-to-3D conversion, etc.)
* ColorPreference
* StereoscopicVideoConversion (2D-to-3D)
* GraphicPresentationPreferences (geometry, texture, etc.)
* ConversionPreference (qualitative and quantitative) (video-to-audio, video-to-text, etc.)
* PresentationPriorityPreference
* FocusOfAttention (ROI, MPEG-7)
* AuditoryImpairment (user's auditory deficiency)
* VisualImpairment
* ColorVisionDeficiency
* MobilityCharacteristics (update interval, directivity, and erraticity)
* Destination

Terminal Capabilities
* Terminals/Terminal/TerminalCapabilities
* CodecCapabilities (coding, decoding)
* CodecParameter
* Displays/Display/DisplayCapabilities (resolution, color capabilities, rendering format, etc.)
* AudioOutputs/AudioOutput/AudioOutputCapabilities
* UserInteractionInputs (mouse, microphone, one button, etc.)
* DeviceClass
* PowerCharacteristics
* Storages/Storage/StorageCharacteristics
* DataIOs/DataIO/DataIOCharacteristics (bus width, transfer speed, maximum number of devices, etc.)
* Benchmarks
* CPUBenchmark
* ThreeDBenchmark
* IPMPTools (authentication, decryption, watermarking, etc.)

Network Characteristics
* Networks/Network/NetworkCharacteristics
* NetworkCapability (bandwidth, sequence, errors, etc.)
* NetworkCondition (available, min/max delay, BER, duration, etc.)

Natural Environment Characteristics
* NaturalEnvironments/NaturalEnvironment/NaturalEnvironmentCharacteristics
* Location (MPEG-7, PlaceType)
* Time (MPEG-7, TimeType)
* AudioEnvironment (noise level, etc.)
* IlluminationCharacteristics
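As a hedged sketch of how a small subset of these characteristics might be serialized, the fragment below builds a simplified description covering display, network condition, and audio environment; the element and attribute names merely echo the table and are not the normative MPEG-21 DIA schema (namespaces, types, and most elements are omitted):

```python
import xml.etree.ElementTree as ET

# Simplified, non-normative rendering of a few UED characteristics from Table 11.1.
env = ET.Element("UsageEnvironment")
terminal = ET.SubElement(env, "TerminalCapabilities")
ET.SubElement(terminal, "Display", resolution="320x240", colorCapable="true")
network = ET.SubElement(env, "NetworkCharacteristics")
ET.SubElement(network, "NetworkCondition", availableBandwidthKbps="384", maxDelayMs="150")
natural = ET.SubElement(env, "NaturalEnvironment")
ET.SubElement(natural, "AudioEnvironment", noiseLevelDb="72")

print(ET.tostring(env, encoding="unicode"))
```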

Accordingly, the two new drivers in focus are discussed in more detail here. We can divide the DIA User characteristics into five main blocks, as follows:

- User Info: General information about the User. As can be seen in Table 11.1, this information is specified by means of MPEG-7 agent DSs.
- Usage Preference and History: Usage preference includes descriptions of a User's preferences, and usage history describes the history of actions on DIs by a User. Both import the corresponding MPEG-7 DSs.
- Presentation Preferences: This block includes new and important descriptors about how DIs (and their associated resources) are presented to the User. It is especially interesting as the Focus of Attention descriptor allows the expression of preferences that direct the focus of a User's attention with respect to AV and textual media.
- Accessibility Characteristics: Includes detailed information about auditory or visual impairments of the User, which can lead to a need for specific adaptation of the content.
- Location Characteristics: By means of mobility characteristics and destination, this block is especially useful for adaptive location-aware services.

Natural environment characteristics focus on the physical environment surrounding the User. They can be used as a complement to adaptive location-aware services, as they contain Location and Time descriptors (based on MPEG-7), which are referenced by both the mobility characteristics and destination tools seen in the User characteristics. On the other hand, they also include descriptors of the AV environment (such as noise level or illumination characteristics). These characteristics may also impact the adaptation decisions, thus contributing to a finer level of detail of the adaptation operations and consequently enhancing the User's experience.

Although the utilization of these two groups of description tools will potentially enable the delivery of innovative and more interesting results, the characterization of the user terminal as well as of the network connections can clearly be considered indispensable for performing useful adaptation operations. Likewise, a description of the transformation capabilities offered by the available AEs is also instrumental for the effective implementation of the desired adaptation operation. Such a description can be expressed as subsets of the UED descriptions, notably those belonging to the terminal capabilities group.

The MPEG-21 DIA standard specifies appropriate XML schemas to represent this contextual information. They can be exchanged among the components of a content mediation platform as independent XML files or, when applicable, referenced inside DIDs or even directly included in the DID. As will be discussed in Section 11.6.7, a part of this contextual information can also be included in a license that governs the use and consumption of a protected DI.

11.5.2 Context Providers and Profiling

(Portion reprinted, with permission, from M.T. Andrade, H. Kodikara Arachchi, S. Nasir, S. Dogan, H. Uzuner, A.M. Kondoz, J. Delgado, E. Rodriguez, A. Carreras, T. Masterton and R. Craddock, "Using context to assist the adaptation of protected multimedia content in virtual collaboration applications", Proc. 3rd IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom'2007), New York, NY, USA, 12–15 November 2007. © 2007 IEEE; and V. Barbosa, A. Carreras, H. Kodikara Arachchi, S. Dogan, M.T. Andrade, J. Delgado, A.M. Kondoz, "A scalable platform for context-aware and DRM-enabled adaptation of multimedia content", in ICT-Mobile Summit 2008 Conference Proceedings, Paul Cunningham & Miriam Cunningham (Eds), IIMC International Information Management Corporation, 2008. © 2008 IIMC Ltd and ICT Mobile Summit.)

This section describes the definition of profiles based on the contextual information, in order to facilitate the adaptation of multimedia content in generic scenarios. Contextual profiles have the potential to simplify the generation and use of context, by creating restricted groups of contextual descriptions from the full set of DIA descriptors. Each group or profile contains only the descriptions that are essential to a given application scenario. This approach also promotes interoperability, as different CxPs are able to generate/provide the same type of contextual information as their counterparts, using the same standard representation. The profiles presented in this section are based on contextual information that can help to implement different kinds of adaptation operation within multimedia content systems for delivering different services or supporting different applications.

As discussed in the previous section, the UED of MPEG-21 DIA is an optimal set of descriptors that can be used to describe virtually any context of usage. As shown in Table 11.1, the set of descriptors is divided into four main blocks, where each block is associated with an entity or concept within the multimedia content chain: User, Terminal, Network, and Natural Environment. This division is a good starting point for defining profiles. Although the combined use of descriptors from different classes can offer increased functionality, the identification of profiles inside each class can potentially simplify the use of these standardized descriptions by the different entities involved in the provision of context-aware multimedia services, and thus increase their rate of acceptance/penetration. One of the resulting advantages would be realized at the level of interoperability.

The provision of networked multimedia services usually requires that different entities, operating in distinct domains, interact with one another. The sum of their contributions allows the building of the complete end-to-end service. In order to be able to provide this service in an adaptable manner that seamlessly reacts to different usage environments' characteristics or to varying conditions of a given environment, all of the participating entities should collect and make available useful contextual information. These entities that make contextual information available can be designated context providers. If they all use the same open format to represent those descriptions, any one entity can use descriptions provided by any other entity. Moreover, considering that CxPs have a one-to-one correspondence to service providers, license servers/authorities, network providers, content providers, or electronic equipment manufacturers, each CxP will offer contextual information concerning its own sphere of action only.
For example, a network provider and a manufacturer will make available contextual information related to the network dynamics and the terminal capabilities, respectively. In this way, each one needs to know about one specific profile only. Accordingly, profiles can be defined based on the four existing classes: the User profile, Network profile, Terminal profile, and Natural Environment profile. Each of these profiles is composed of the corresponding elements of MPEG-21 DIA to assure full compliance with the standard. Figures 11.22–11.25 show the XML representations of the four profiles.
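Independently of the XML serialization shown in those figures, one way to read this split in code is sketched below: each context provider fills in only the profile matching its sphere of action, and the platform merges the contributions. All field names and values are illustrative, not DIA element names:

```python
# Each context provider (CxP) contributes only the profile it is responsible for.
def terminal_cxp():       # e.g. supplied by the equipment manufacturer
    return {"Terminal": {"display": "320x240", "codecs": ["H.264/AVC", "MPEG-4"]}}

def network_cxp():        # e.g. supplied by the network provider
    return {"Network": {"bandwidth_kbps": 384, "max_delay_ms": 150}}

def user_cxp():           # e.g. supplied by the service provider from preferences/history
    return {"User": {"conversion_preference": "video_to_text_ok", "focus_of_attention": "ROI"}}

def environment_cxp():    # e.g. supplied by terminal sensors
    return {"NaturalEnvironment": {"noise_level_db": 72, "illumination": "low"}}

def merge_profiles(*providers):
    """Aggregate the per-CxP profiles into one usage-environment description."""
    merged = {}
    for provider in providers:
        merged.update(provider())
    return merged

print(merge_profiles(terminal_cxp, network_cxp, user_cxp, environment_cxp))
```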

Figure 11.22 User profile based on MPEG-21 DIA

Figure 11.23 Terminal profile based on MPEG-21 DIA

Figure 11.24 Network profile based on MPEG-21 DIA

Figure 11.25 Natural Environment profile based on MPEG-21 DIA

11.5.3 User Privacy

The generation and use of contextual information concerning the usage environment may affect the privacy of the users and, if not properly handled, could even be intrusive and violate the users' rights. This section analyzes the possible ways in which this might happen, laying down the foundations for the eventual need for authorization when generating or gathering contextual information.

Users might be able to choose some degree of privacy. We should consider not only personal information related to the User profile, but also the possibility of protecting the information related to the Terminal, the Natural Environment, and even the Network. We can therefore think about defining new profiles based on the level of privacy, which could be associated with the previously-defined ones. This situation must be carefully considered when developing context-aware systems, which may need to exchange sensitive personal information among different subsystems. It is therefore of utmost importance to devise ways of protecting this information and thus ensuring the privacy and rights of users. As discussed in Section 11.2, this specific aspect of security has not yet been sufficiently addressed. As a result, addressing user privacy issues has become one of the key areas of research.

11.5.4 Generation of Contextual Information

(Portion reprinted, with permission, from M.T. Andrade, H. Kodikara Arachchi, S. Nasir, S. Dogan, H. Uzuner, A.M. Kondoz, J. Delgado, E. Rodriguez, A. Carreras, T. Masterton and R. Craddock, "Using context to assist the adaptation of protected multimedia content in virtual collaboration applications", Proc. 3rd IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom'2007), New York, NY, USA, 12–15 November 2007. © 2007 IEEE.)

As described in Section 11.2.1.1, the process of gathering contextual information involves three steps. The first and second, namely "sensing the context" and "sensing the context changes", relate to the generation and representation of basic or low-level contextual information. As referred to in Section 11.5.1, contextual information is represented as MPEG-21 DIA descriptions organized into four distinctive groups. According to the specific application scenario in use, a subset of descriptions from these four groups can be used. Nevertheless, when the same type of information is used in different applications, the way that information is generated may be shared across applications to address more generic application scenarios. It is very important to identify the standard representation of contextual information and accordingly indicate the process and/or mechanisms by which it will be generated and represented. The third step involved in the acquisition of context, i.e. "inferring high-level context", is discussed in the next section.

Although contextual information belonging to the Resources Context category (as defined in Section 11.2.1.1), such as that concerning the terminal capabilities and the network characteristics, is likely to be generated automatically through the use of software modules, information addressing the User's characteristics and their natural surrounding environment (the User Context and Physical Context categories) will require dedicated hardware or possibly the manual intervention of the User. Dedicated hardware may consist of visual and audio sensors, such as video cameras and microphones. User preferences can be created implicitly based on usage history. The four categories of basic contextual information can be mapped on to the different profiles introduced in Section 11.5.2, which in turn match the different groups of descriptions identified in the MPEG-21 DIA specification.
The Resources type is split into the Terminal and Network profiles, and the Physical class is mapped on to the Natural Usage Environment profile. Contextual information belonging to the fourth identified type, the Time Context, may be represented as an attribute within each of the mentioned context profiles, or it may also be addressed as part of the rules for reasoning about the low-level contextual information.

11.6 The Application Scenario for Context-based Adaptation of Governed Media Contents

(Reproduced by permission of © 2008 IIMC Ltd and ICT Mobile Summit.)

This section provides a brief description of the selected application scenario, based on the Virtual Collaboration System (VCS), in which adaptation of content to meet the constraints as well as the user preferences imposed by the usage context would be required to enhance the quality of the user experience. It includes a succinct textual description of the selected scenario, as well as of the different contexts, i.e. the different conditions and characteristics of the usage environment, that may occur during the consumption of such an application. The scenario described in this section is meant to be used as an enabler for the definition of potential use cases for the context-based adaptation issues discussed in this chapter. It thus provides the means for the identification and selection of useful contextual information, of the participating entities processing that contextual information and their associated functionalities, as well as of the interactions therein.

A VCS is a system through which remotely-located users are able to meet in a virtual environment created by the supported AV technology. Such an environment requires the provision of a sensation that all the remotely-located users are present in the same room. In a typical virtual collaboration scenario, as shown in Figure 11.26, there are a number of fixed collaboration units (e.g. communication terminals, shared desk spaces, displays, etc.) as well as a number of mobile units (e.g. laptops, PDAs, mobile phones, etc.), which are equipped with various user-interaction devices. The central unit (e.g. the headquarters of a large multi-site company) can be equipped with large-scale virtual collaboration equipment, and serves as the main command and communications base.

Figure 11.26 Virtual collaboration environment

A remotely-located secondary fixed unit serves as the local contact/collaborator, and provides the necessary local information. The mobile units, such as vehicles or patrolling personnel, are equipped with mobile devices, and are on the move to react to the requests of the headquarters and/or the secondary command unit. In such a scenario, in order to allow users to communicate with each other using devices with different capabilities, such as visual display types, over different networks with various characteristics and so on, it is necessary to perform context detection, context extraction, and content adaptation based on the detected context. Such a collaboration system will be of a heterogeneous nature, as its users access the network with whatever connectivity is available to them and with terminal devices of different levels of capability. Therefore, context-based multimedia content adaptation is needed to scale down the content, where necessary, while exchanging the media information between the large- and small-scale collaboration terminals over different access networks in this heterogeneous scenario. For instance, in this kind of scenario, field workers (i.e. mobile collaborators) may be using PDAs and/or other mobile equipment, such as laptops, to achieve their tasks, whereas the users located in the headquarters and secondary command units are able to use large-scale devices with more processing power and high-quality display capabilities.

VCSs are envisaged to be employed in many application scenarios, such as virtual offices for remote meetings/collaboration, rapid deployment of emergency services in case of a threat/emergency, virtual classrooms for distance learning, remote working for the production of consumer goods, media content, and so on. In light of the above discussion, the following subsection describes the specific application scenario adopted to describe the context-based content adaptation technologies (i.e. the virtual classroom, which is based on the VCS platform).

11.6.1 Virtual Classroom Application Scenario

Within the context of the technologies described in this chapter, a conceptual framework for a virtual classroom application is assumed, which is based on a VCS with a feature for context extraction from the media streams and adaptation of the delivered content. We also assume the use of rights management on the content to allow controlled dissemination to a heterogeneous audience. In combination, these features are envisaged to enable academic institutions to conduct a series of collaborative lectures and classes with which remotely-located students can interact more efficiently. Such an infrastructure is believed to enhance all the traditional advantages of remote learning, such that:

- Experienced teaching staff can be shared more widely.
- Both staff and students can save time and travel expenses.
- Scheduling constraints are eased.
- Debate and discussion quality is improved.

In addition, there will also be new benefits:

- More interaction options will be available to students.

- There is the possibility of having one or more document tracks: one for the presentation and another for interactivity.
- Small, securely contained subgroups can be convened on a single system.
- The quality of the presented material can be tailored to the content at a finer granularity.

The success of such an application largely depends on the users' ability to comprehend the materials delivered over the virtual collaboration platform. Ideally, the remote audience should have the same comfort as the local audience in terms of listening to the speaker (the lecturer or anyone from the audience) and viewing the speaker's expressions and gestures, the presentation materials, and the whiteboard. Even though the audience in a remote lecture theatre with full VCS functionality may be able to experience this, the challenge is to also provide it to individuals accessing the lecture through terminal devices with limited capabilities. Therefore, in order to facilitate seamless access for a heterogeneous audience with various preferences and privileges, over various network infrastructures, using a vast range of terminal devices with various levels of capability, context-aware content adaptation is a key technology for the virtual classroom application.

In order to deploy the virtual classroom application, the lecture theatres of each of the participating universities must be equipped with the necessary virtual collaboration infrastructure, which includes:

- Three sets of projectors and screens for: (a) displaying the presentation; (b) the virtual whiteboard, which can be used by the lecturer as well as the audience (local and/or remote); and (c) video feeds of the lecturer and the audience.
- A set of cameras to capture the lecturer.
- A microphone and a video camera fitted to each seat to capture the occupant. These would ultimately be replaced by a smaller number of optical pan-tilt-zoom cameras and a steerable beam microphone.
- An input device (possibly a tablet) fitted to each seat so that the occupant can write/draw on the virtual whiteboard, and to alert the rest of the audience to the occupant's willingness to add their view.

When one institute conducts a lecture in one of its lecture theatres for its enrolled students, those from other universities who have also enrolled in the same course can attend the same lecture remotely from another lecture theatre. Unlike a conventional classroom session, these lecture series can be followed by external students, by those who have been unable to attend the classroom, and by the general public, over a wired or wireless link using a home PC or a mobile terminal such as a smart phone. Enrolled remote students will be able to interact with the lecturer and audiences using the virtual collaboration platform, which facilitates interaction not only through audio and video, but also through a virtual whiteboard. Any Internet-enabled device on which the virtual collaboration platform is installed can be used as a terminal to enjoy the full functionality of the system. Devices on which the virtual collaboration platform cannot be installed but which offer video communication functionality may be able to present the document tracks as a video. The video is adapted to match individuals' preferences while also considering related constraints. However, the general public will have the right to view only a low-resolution version of the video, and will not have privileges to view any of the adapted versions.
Neither can they interact with the classroom sessions.

502 Visual Media Coding and Transmission This raises the prospect of a whole new concept of searching the Internet for scheduled live seminars on a topic of the user’s choice, just as seminar notes can be found online today. Live seminars are already possible, but will only achieve mass uptake when users become accustomed to seamless audio and video access, when all the devices they use regularly can provide it (at some level of adaptation). There are a number of technical challenges associated with this scenario, which can be summarized as follows: . Integration of audio feeds of individual students and lecture theatres. . Managing and presenting interactions (audio, visual, and whiteboard) from the distributed audience. . Automatically tracking the movements of the presenter. . Customizing the presentation. . Detecting requests for interaction from the audience and adapting their audio and video content, preferably without the expense of individual microphones and cameras at each audience seat. . Acquiring low-level context and inferring higher-level concepts. Inferring the state of the user, such as location, activity, intentions, resources, and capabilities in the past, present, and future. . Adapting content (audio, video, and document tracks) for small terminals with limited capabilities, and to specific context situations. The next section will discuss the underlying adaptation technologies that enable seamless access to the classroom while managing the participating institutions’ rights to the contents. In line with the above objective, adaptation decision techniques based on context descriptors and their integration with content adaptation tools in order to support adaptive, context-aware, distributed applications that react to the characteristics and conditions of the usage environ- ment and provide transparent access and delivery of content in such scenarios are being developed. To achieve this objective, the aforementioned VCS needs to be equipped with a context-aware content adaptation platform. This platform can be implemented within the terminal gateways. Adaptation of the multimedia content can be achieved by using contextual information, such as terminal capabilities, network conditions, user characteristics and preferences, and environmental conditions during the adaptation decision process, as is discussed in the next section. 11.6.2 Mechanisms Using Contextual Information in a Virtual Collaboration Application The survey of the state of the art presented in Section 11.2 revealed a considerably complex scenario in context-awareness research and standardization. Subsequently, Section 11.5 presented the perspective adopted in this chapter for generating and aggregating contextual information from diverse sources and profiling for generic multimedia communication and/or content access/distribution scenarios. The selection of a focused application scenario, pre- sented in detail so far in Section 11.6, has made it possible to narrow down the scope of the discussions provided in this chapter. In fact, the identification of a set of real-world situations likely to occur within this usage scenario, where performing content adaptation driven by

Context based Visual Media Content Adaptation 503 specific contextual information is an enabling factor for the identification of a subset of contextual information providing the grounds for the adoption of a pragmatic approach towards the design of a context-aware content adaptation platform. The objective of acquiring contextual information and formatting it according to standard representations is to further allow its use within the context-aware platform in order to arrive at meaningful content adaptation operations while also complying with the necessary DRM issues. There are basically two different levels at which the contextual information is used: at the ADE level and at the AE level. The former assembles the required contextual information, including that related to DRM and adaptation authorization, interprets it, and reasons about it, whereas the latter actually uses part of the contextual information to perform the adaptation operation. For example, the ADE may receive the capabilities of the terminal and the media characteristics of the resource to be consumed. It then needs to interpret them while also managing the digital rights associated with the media content in use, so as to select the best set of media characteristics for the capabilities of the terminal. While doing this, it also consults with an adaptation authorization module to authorize the necessary format changes and/or conversions on the media stream. The AE will then receive these media characteristics and will use them to set up its operating mode accordingly. This section describes the mechanisms that use contextual information in a virtual classroom application, as discussed in Section 11.6.1. Section 11.6.3 gives an overview of a system architecture of the context-aware content adaptation platform to be deployed in the selected application scenario. The ADE, relevant adaptation authorization technologies, and AEs are discussed in detail in the following subsections. 11.6.3 Ontologies in Context-aware Content Adaptation Context-awareness in content adaptation can be defined as the ability of a system to adapt content to the characteristics and constraints of the consumption environment and user preferences [1]. It thus aims to increase the system usability and enhance the quality of the user experience. The use of contextual information is instrumental in the successful imple- mentation of useful and meaningful content adaptation operations that enhance the quality of the user experience [116]. Context information is required to decide how and when to adapt content, so as to meet users’ expectations and satisfy usage environment constraints. An ontology is used to define the knowledge about a domain, which enables a formal description of specific situations in that domain [117]. Decision-taking operations driven by the conditions and characteristics of real-world situations can greatly benefit from the use of ontologies [116]. Low-level contextual information gathered from sensors can be used to trigger the adaptation decision process. Together with the concepts and rules provided by the ontologies, the sensed data is used for reasoning and inferring higher-level contexts, thus enabling a context analysis closer to real-world situations. Accordingly, the adaptation decision has a higher chance of satisfying user expectations. Different real-world situations of multimedia content consumption are likely to have a common knowledge denominator (common concepts and rules). 
However, they will also have specific knowledge only relevant to the application in use. Hence, a layered ontology model is more advantageous for a content adaptation platform, since such an approach enhances the reusability and extendibility of the system. An example of a two-layer ontology model, which has been developed using OWL, is shown in Figure 11.27 [19].
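To make the layered model more concrete, the sketch below builds a toy two-layer ontology with the rdflib library: a generalized layer with reusable concepts such as Terminal and User, and a virtual-classroom layer that refines them. All namespace URIs, class names, and property names in this sketch are hypothetical placeholders chosen for illustration; they are not the actual VISNET II or MPEG-21 vocabularies, and a real system would express the concepts and rules in OWL as described above.

# Illustrative sketch only: a two-layer ontology in the spirit of the layered
# model described above, built with rdflib. All URIs and names are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, XSD

GEN = Namespace("http://example.org/context/generic#")      # generalized layer
VC = Namespace("http://example.org/context/vclassroom#")    # domain-specific layer

g = Graph()
g.bind("gen", GEN)
g.bind("vc", VC)

# Generalized layer: concepts reusable in any virtual collaboration scenario
for cls in (GEN.User, GEN.Terminal, GEN.Network, GEN.NaturalEnvironment):
    g.add((cls, RDF.type, OWL.Class))
g.add((GEN.displayWidth, RDF.type, OWL.DatatypeProperty))
g.add((GEN.displayWidth, RDFS.domain, GEN.Terminal))
g.add((GEN.displayWidth, RDFS.range, XSD.integer))

# Domain-specific layer: virtual-classroom concepts refine the generic ones
g.add((VC.Student, RDF.type, OWL.Class))
g.add((VC.Student, RDFS.subClassOf, GEN.User))
g.add((VC.LectureTerminal, RDF.type, OWL.Class))
g.add((VC.LectureTerminal, RDFS.subClassOf, GEN.Terminal))

# Instance data sensed at run time (low-level context)
g.add((VC.pda1, RDF.type, VC.LectureTerminal))
g.add((VC.pda1, GEN.displayWidth, Literal(320)))

print(g.serialize(format="turtle"))

An OWL reasoner run over such a graph could then classify vc:pda1 as a generic terminal and apply the virtual-classroom-specific rules to it, which is the role played by the domain-specific layer.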

504 Visual Media Coding and Transmission Figure 11.27 Context ontology overview. Reproduced by Permission of Ó2008 IIMC Ltd and ICT Mobile Summit The generalized ontology layer provides descriptions of generic concepts and rules that can be used in any virtual collaboration application scenario. This layer is based on MPEG-21 DIA [68], in particular the UED tool, and it is divided into four main profiles. Figure 11.28 represents the conceptualization of these profiles. The second layer, i.e. the domain-specific layer, provides rules dedicated to a given application. Multiple domain-specific ontologies can thus co-exist in this layer. For example, the virtual classroom-specific layer provides the means of reasoning various adaptation options in order to help the user better understand the classroom session. 11.6.4 System Architecture of a Scalable Platform for Context-aware and DRM-enabled Content Adaptation The context-aware content adaptation platform developed for a virtual collaboration applica- tion, which is conceptually illustrated in Figure 11.29, consists of the following four major modules: (1) adaptation decision engine (ADE); (2) adaptation authorizer (AA); (3) context providers (CxPs); and (4) adaptation engine stacks (AESs), comprising adaptation engines (AEs) within. These modules are independent units that interact with one another through Web Services-based interfaces. The distributed modular architecture of the adaptation platform ensures scalability. Well-defined interfaces based on open standards also guarantee interoperability and the flexibility to freely add, remove, and migrate modules. The use of ontologies in the ADE, while being a vehicle for interoperability, provides the platform with context-aware analysis capabilities closer to real-world situations. The AA ensures the governed use of protected content. Flexible AEs enable the execution of a variety of adaptations that can be dynamically configured and requested on the fly. A modular system architecture can be considered for the realization of the advanced context- aware services building into a layered platform that embraces: (1) the system interoperability approach proposed by the new generation of systems, as discussed in Section 11.2.1.2; (2) the aspects of combining multiple explicit contexts to build elaborate implicit contexts in an interoperable way; and (3) the aspects of improving usability of applications, and generally speaking the quality of the experience of the user (which was the main goal of the early research), by selecting the most adequate type of adaptation. Figure 11.30 illustrates the high- level layered architecture of a generic context-aware platform. The external lower layer can be seen as a middleware layer that abstracts the higher layers from the actual generation of

Context based Visual Media Content Adaptation 505 Figure 11.28 The conceptualization of: (a) user; (b) terminal; (c) network; and (d) natural environment ontologies. Reproduced by Permission of Ó2008 IIMC Ltd and ICT Mobile Summit

506 Visual Media Coding and Transmission Figure 11.28 (Continued) low-level contextual information. This layer is instrumental to enabling interoperability at the system level, as: . Applications to be developed do not need to be aware of the details of the sensor devices in use, and thus are independent of the sensor technology. . Different applications may use the same sensors, and make different uses of the low-level sensed information. . Sensors can be distributed, and thus applications may profit from using explicit contextual information gathered at remote points. Figure 11.29 Context aware content adaptation platform in a virtual collaboration scenario (e.g. the virtual classroom). Reproduced by Permission of Ó2008 IIMC Ltd and ICT Mobile Summit
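As a rough illustration of this abstraction, the following sketch (with hypothetical class and method names, not an actual VISNET II interface) shows how applications could consume context through a single provider interface while the sensor-specific details stay hidden behind it.

# Minimal sketch of the middleware idea described above: applications consume
# context through one interface and never touch the sensor technology behind it.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ContextProvider(ABC):
    """Abstracts a low-level context source (logical or physical sensor)."""

    @abstractmethod
    def get_context(self) -> Dict[str, Any]:
        """Return low-level context in a technology-neutral key/value form."""


class IlluminationSensorProvider(ContextProvider):
    def __init__(self, read_lux):
        self._read_lux = read_lux          # device-specific callable, hidden here

    def get_context(self):
        return {"profile": "NaturalEnvironment", "illuminance_lux": self._read_lux()}


class NetworkAgentProvider(ContextProvider):
    def __init__(self, probe):
        self._probe = probe                # e.g. a network monitoring probe, hidden here

    def get_context(self):
        return {"profile": "Network", "available_bandwidth_kbps": self._probe()}


def gather(providers: List[ContextProvider]):
    """Higher layers see only uniform context records, possibly from remote CxPs."""
    return [p.get_context() for p in providers]


if __name__ == "__main__":
    cxps = [IlluminationSensorProvider(lambda: 420),
            NetworkAgentProvider(lambda: 1800)]
    for record in gather(cxps):
        print(record)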

Context based Visual Media Content Adaptation 507 Figure 11.30 High level architecture of a generic context aware platform. Reproduced by Permission of Ó 2007 IEEE Each layer of this architecture is further divided into functional modules. For example, the low-level context-sensing layer incorporates different modules acting as services, offering functionalities to collect different types of low-level contextual information. Likewise, the context-reasoning layer provides various modules that reason about different sets of low- level contextual information. The existence of particular modules is dependent on the types of adaptation offered by the application layer, which in turn is dictated by the application scenario. Accordingly, the generic context-aware platform may present different function- alities according to the application scenario in use. It is still a generic architecture in the sense that it can seamlessly incorporate different functionalities as needed by different applica- tions, while a common functionality can be re-used between those applications. The specific application scenario considered in this section involves the gathering of the following set of low-level contextual information: . Characteristics of the terminal, conditions of the network, user characteristics and interactions. . Sensed low-level visual and auditory information related to both the user and their surrounding environment. This information can be used to reason and conclude on the emotional or physical state of the user, or to identify indoor/outdoor situations. . Security and DRM information (eventually conveyed in licenses). Accordingly, the functional blocks that form the VISNET II context-aware adaptation architecture for the virtual classroom application are shown in Figure 11.31. In this architec- ture, the decision mechanisms decide on the appropriate adaptation operations by gathering and inferring the sensed context from the sensor layer, and through consulting with the DRM and protection tools. In turn, they pass their decision to the AEs, where a specific adaptation algorithm is executed on an input media stream in response to the ADE decision. The following three subsections provide further details on the functional blocks of this architecture. 11.6.5 Context Providers Contextual information can be any kind of information that characterizes or provides additional information regarding any feature or condition of the delivery and consumption environment.

508 Visual Media Coding and Transmission Figure 11.31 Functional blocks of the VISNET II context aware content adaptation architecture. Reproduced by Permission of Ó 2007 IEEE All of the participating entities collect and make available useful contextual information. These entities can be designated as ‘‘context providers’’. The described diversity of information can be grouped into four main context classes according to the feature or entity to which it refers: Resource, User, Physical, and Time. Entities, either software or hardware, that are able to generate and provide this explicit contextual information are designated as CxPs. The low-level contextual information generated by these entities, once acquired and represented according to a standard format, will be used to infer higher-level concepts, and thus assist the adaptation decision operation. The use of standards is instrumental in enabling interoperability among systems and applications, and across services. The standard considered in this chapter is the MPEG-21 DIA specification. It specifies appropriate XML schemas to represent the low-level contextual information. Based on four main types of descriptor provided in MPEG-21 UED, i.e. User, Terminal, Network, and Natural Environment, four context profiles have been created, as illustrated in Figure 11.32. With these profiles, each CxP needs only to know and implement its own sphere of action, resulting in a level of interoperability enhancement.
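For illustration, the sketch below shows how a CxP might serialize sensed terminal context into an MPEG-21 DIA UED-style XML fragment. The element and attribute names are simplified placeholders rather than the normative DIA schema, so a real CxP would need to follow the standard's XML schemas exactly.

# Sketch of a CxP emitting a terminal-capability description in the spirit of
# the MPEG-21 DIA UED tool. Element and attribute names are simplified
# placeholders, not the normative schema.
import xml.etree.ElementTree as ET

DIA_NS = "urn:mpeg:mpeg21:2003:01-DIA-NS"
ET.register_namespace("dia", DIA_NS)


def terminal_capability_fragment(width: int, height: int) -> str:
    usage_env = ET.Element(f"{{{DIA_NS}}}UsageEnvironment")
    terminal = ET.SubElement(usage_env, f"{{{DIA_NS}}}TerminalCapability")
    display = ET.SubElement(terminal, f"{{{DIA_NS}}}Display")
    resolution = ET.SubElement(display, f"{{{DIA_NS}}}Resolution")
    resolution.set("horizontal", str(width))
    resolution.set("vertical", str(height))
    return ET.tostring(usage_env, encoding="unicode")


print(terminal_capability_fragment(320, 240))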

Context based Visual Media Content Adaptation 509 Figure 11.32 Virtual collaboration context profiles. Reproduced by Permission of Ó 2007 IEEE, Ó2008 IIMC Ltd and ICT Mobile Summit CxPs are responsible for formatting the acquired context into the identified UED, where appropriate, using the context profiles shown in Figure 11.32. They work in a ‘‘pull’’ model during the service initialization, responding to requests from the ADE, and subsequently in a ‘‘push’’ model, notifying the ADE when new context is available. The CxPs can be various in a complex application scenario, and almost all of the participating entities collect and make

510 Visual Media Coding and Transmission available useful contextual information. A few examples of such CxPs are network operators (through the network equipment), content providers (through databases, media repositories, streaming servers, encoders, etc.), equipment manufacturers (through terminal devices, sensors such as cameras, microphones, etc.), and users (via the terminal device being used or via databases holding user profiles). The CxPs that are required for the virtual classroom application consist of logical and physical sensors. The former are software applications running in the terminal, at the network edge equipment, and in databases (DBs), holding descriptions relative to the content and reasoning rules. The latter are external physical sensors, namely the overall camera and microphone. These CxPs pass their information to the ADE lower layer through: . Terminal device drivers, to acquire the capabilities and conditions of the user terminal and also audio and video information captured by the built-in camera and microphone. . Network agents, to describe the characteristics and the conditions of networks. . Content service agents in the form of a DB access module, to acquire content-related metadata and reasoning rules. . Audio and video sensors. The content adaptation platform discussed in this chapter exposes an application programming interface (API) based on MPEG-21 distinguishing the different profiles, which are expected to be used by these CxPs accordingly. 11.6.6 Adaptation Decision Engine The ADE can be considered the intelligent part of the content adaptation platform in the virtual collaboration application. Its goal is to make a decision regarding the actions to perform when contextual information is available, with the goal of maximizing the quality experienced by the user. From the description of the selected application scenario in this section, namely the virtual classroom, some of the adaptation requirements and contextual descriptions that are needed for the adaptation can be identified; Table 11.2 presents the resulting list of adaptation requirements, together with the relevant context descriptions in line with the MPEG-21 standard descriptors that can be used to satisfy the requirements of the identified application scenario. The ADE designed for use in the virtual classroom application is illustrated in Figure 11.33. The approach is to have a central coordinator, named ContextServiceManager, which interacts with dedicated modules that sense low-level context generated by terminals, networks, electronic equipment, and other required metadata, notably content-related (media character- istics and MPEG-21 DIA AQoS), and rules for reasoning specific to the application under consideration. The information generated by CxPs, once acquired and represented according to a standard format, is used to infer higher-level concepts, and thus assist the adaptation decision operation. The use of the AQoS tool of the MPEG-21 DIA specification provides the vehicle for implementing the three-space approach, as described in Section 11.4.4. Whenever rules are available, the Reasoner is invoked by the ContextServiceManager and interacts with the DecisionTaking module to select the most appropriate adaptation and the

Table 11.2 Adaptation possibilities for different scenarios in a virtual classroom session

Scenario for usage context: Various priorities for displaying different objects
Nature/Origin: Preferences of the user
Group/Profile: User
Descriptor(s): dia:FocusOfAttention; dia:PresentationPriorityPreference
Adaptation: Scaling selected objects according to priority; separating background from foreground, and prioritizing the foreground content

Scenario for usage context: A user wants to watch highlights
Nature/Origin: Preferences of the user
Group/Profile: User
Descriptor(s): dia:PresentationPriorityPreference
Adaptation: Summarizing the session

Scenario for usage context: Delay-sensitive transmission (chat, discussion, debate, etc.)
Nature/Origin: Preferences of the user
Group/Profile: User
Descriptor(s): dia:PresentationPriorityPreference; dia:FocusOfAttention; dia:ConversionPreference
Adaptation: Prioritizing audio content by scaling video content

Scenario for usage context: User authorization
Nature/Origin: Characteristics of the user, who may be authorized or not to consume specific content
Group/Profile: User
Descriptor(s): dia:UserInfo
Adaptation: Presenting the user-authorized segments of the content only

Scenario for usage context: Inadequate display size
Nature/Origin: Static constraints of the physical environment. This characteristic can be either inferred from the user request or included in a terminal UED generated by some software module residing at the terminal
Group/Profile: Terminal
Descriptor(s): dia:TerminalCapability:DisplaysType:Resolution; dia:TerminalCapability:DisplaysType:Screensize
Adaptation: Downscaling to lower resolution

Scenario for usage context: Receiving device does not support documents
Nature/Origin: Static constraints of the terminal
Group/Profile: User; Terminal
Descriptor(s): dia:FocusOfAttention; dia:TerminalCapability
Adaptation: Cropping a selected region; transmoding documents to a video sequence

Scenario for usage context: Remaining terminal battery power is not enough for the full session
Nature/Origin: Static constraints of the terminal
Group/Profile: Terminal
Descriptor(s): dia:PowerCharacteristics
Adaptation: Lowering spatial/temporal resolution and/or fidelity of the video to minimize the utilization of the processor

Scenario for usage context: Bandwidth scarcity constraint
Nature/Origin: Static or dynamic conditions of the physical environment
Group/Profile: Network
Descriptor(s): dia:NetworkCapability:maxCapacity; dia:NetworkCondition:AvailableBandwidth
Adaptation: Bit-rate transcoding; discarding higher signal-to-noise ratio (SNR) scalability layers, spatial scalability layers, and temporal scalability layers

Scenario for usage context: User is at a low signal reception area
Nature/Origin: Static or dynamic conditions of the physical environment
Group/Profile: User; Network
Descriptor(s): dia:FocusOfAttention; dia:NetworkCapability; dia:NetworkCondition
Adaptation: Prioritizing bit rates for important regions of the frame; improving error-resilience and/or using stronger error-protection

Scenario for usage context: Lighting conditions
Nature/Origin: Dynamic conditions of the natural environment surrounding the user. Requires the availability of sensors
Group/Profile: Natural Environment
Descriptor(s): dia:IlluminationCharacteristics
Adaptation: Increasing or decreasing the brightness of the presented material according to the illumination

Scenario for usage context: Present background noise level
Nature/Origin: Dynamic conditions of the natural environment surrounding the user. Requires the availability of sensors
Group/Profile: Natural Environment
Descriptor(s): dia:AudioEnvironment
Adaptation: Prioritizing a selected area of visual content in a scene (e.g. focus on the lips of a news reader, etc.), audio-to-text transmoding, audio level/quality improvement, etc.

Scenario for usage context: Loss of quality at one collaboration terminal
Nature/Origin: Dynamic characteristics of the content caused by some dynamic constraint of the usage environment. This information will need to be provided by specific quality sensors
Group/Profile: Content
Descriptor(s): MPEG-7:MediaFormatType:QualityMeasure
Adaptation: Separating background from foreground, and prioritizing the foreground; spatio-temporal downscaling of the content

514 Visual Media Coding and Transmission Figure 11.33 Modular architecture of the ADE. Reproduced by Permission of Ó 2007 IEEE, Ó2008 IIMC Ltd and ICT Mobile Summit corresponding service parameters, maximizing the user QoE. The ContextServiceManager also has an interface with the DRM tools subsystem, in order to request information concerning the authorization of the adaptations. The big challenge in the design and development of the ADE relates to the Reasoner. This module uses the sensed context to infer the state of the user, including the type of user, the location or activity in which they are engaged, the degree of satisfaction being experienced, and the environment and network conditions. This can be done through the use of an ontology- based model, which comprises a two-layer ontology approach using OWL. The basic layer provides descriptions of generic concepts and rules that can be used for any generic application scenario, while the second layer provides specific rules for the virtual classroom application scenario. This ontology-virtual-classroom-specific layer provides the means for the ADE to reason on how the different possible adaptations will help the user better understand the classroom sessions. This is the big challenge: how to obtain descriptions and sets of relation- ships in the form of rules that represent as accurately as possible the real-world situations in virtual classroom applications. Accordingly, this allows the ADE to base its decision not only on the restrictions of the consumption environment, such as terminal screen size, but also on a user-satisfaction model consistent with the learning objectives of the virtual classroom application. A two-layer strategy has recently been proposed in [19]. The rule-based inference engine considered in this chapter makes use of the OWL Reasoner supplied with the Jena2 platform [118]. 11.6.7 Adaptation Authorization In the virtual classroom application scenario, intellectual property and digital rights are managed during adaptation. When dealing with protected digital content, licenses are issued to enable control of the access and usage of it. Here, two different types of license can be used. The first type is a restrictive license that limits the use of digital content, for example establishing the number of times that the content can be rendered or the interval of time in which the content can be played. The second type is an attribution and non-commercial license.
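The interplay described in this and the preceding subsection, in which the ContextServiceManager collects context, the Reasoner infers higher-level state, and the DecisionTaking module selects adaptation parameters within license-derived limits, can be illustrated with the following minimal sketch. The rules, thresholds, and dictionary keys are invented for illustration and stand in for the OWL/Jena2-based reasoning; they are not the actual VISNET II implementation.

# Highly simplified sketch of the ADE control flow
# (ContextServiceManager -> Reasoner -> DecisionTaking).
def reasoner(context):
    """Infer higher-level concepts from low-level context (illustrative rules)."""
    inferred = dict(context)
    if context.get("illuminance_lux", 1000) < 50:
        inferred["viewing_condition"] = "dark"
    if context.get("available_bandwidth_kbps", 0) < 256:
        inferred["network_state"] = "constrained"
    return inferred


def decision_taking(inferred, authorization):
    """Choose adaptation parameters within the authorized limits."""
    width, height = inferred.get("display_resolution", (352, 288))
    if inferred.get("network_state") == "constrained":
        width, height = width // 2, height // 2
    # Respect constraints derived from licenses (see the rest of Section 11.6.7)
    width = max(width, authorization.get("min_width", 0))
    height = max(height, authorization.get("min_height", 0))
    return {"operation": "scale",
            "target_resolution": (width, height),
            "boost_brightness": inferred.get("viewing_condition") == "dark"}


context = {"display_resolution": (352, 288),
           "available_bandwidth_kbps": 200,
           "illuminance_lux": 30}
authorization = {"min_width": 176, "min_height": 144}
print(decision_taking(reasoner(context), authorization))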

Context based Visual Media Content Adaptation 515 Figure 11.34 Authorization profile. Reproduced by Permission of Ó2008 IIMC Ltd and ICT Mobile Summit In general terms, the main role of an AA in a governed system is to allow (or disallow) adaptation operations based on whether they violate any condition expressed in the licenses. An innovative way of implementing the AA is to consider the AA as a new CxP which converts licenses into adaptation constraints. Complementing those presented in Section 11.6.5, an additional context profile, namely Authorization Profile, is introduced for this contextual information, which comes from the AA, as shown in Figure 11.34. The AA looks into the DRM repository (Figure 11.29) to find all the licenses associated with a certain resource and user, and passes relevant adaptation constraints to the ADE so that it can take an appropriate adaptation decision. If MPEG-21 technologies are considered to express digital objects, adaptation information, and so on, then the MPEG-21 REL for expressing licenses can be chosen. The standard elements defined in this REL can be used in the scenario under consideration to restrict the usage of multimedia content, for example for limiting the number of times that a video can be played. However, currently a new profile for the MPEG-21 REL is under development to support the different types of Creative Commons (CC) [119] licenses. This profile is based on a contribution made in the 76th MPEG meeting [120] to facilitate interoperability with CC licenses. The Open Release Content Profile [121] includes new rights and conditions, such as the governedAdapt, embed or governedAggregate rights, and the copyrightNotice or non- CommercialUse conditions. Licenses using this profile can express different types of CC license, which include attribution, non-commercial, no derivatives, share alike, and so on. Adaptation operations should only be performed if they do not violate any condition expressed in the licenses. The first amendment of MPEG-21 DIA, named ‘‘Conversions and Permissions’’ [68], can be divided into two main parts: the first one specifies the description formats for multimedia conversion capabilities, offering description tools for the conversions (adaptations) that a terminal is capable of doing; and the second part specifies description formats for permissions and conditions for multimedia conversions that are useful for determining which changes (adaptations) are permitted on particular content and under what kind of conditions. The focus here is on the second part, and thus a method for filling the gap between DIA and REL/RDD by embedding adaptation descriptions into rights expressions will be presented. Table 11.3 presents the schema of a license that allows the definition of permissible changes and associated constraints. It provides the mechanisms to specify which changes are allowed

Table 11.3 Example of a license schema including conditions of permissible changes

<r:license>
  <r:inventory>
    <!-- ... -->
  </r:inventory>
  <r:grant>
    <!-- Jordi may play the video... -->
    <!-- Principal: Jordi -->
    <r:keyHolder licensePartIdRef="Jordi"/>
    <!-- Right: play -->
    <mx:play/>
    <!-- Resource: video -->
    <mx:diReference licensePartIdRef="video"/>
    <!-- ...under these conditions -->
    <r:allConditions>
      <dia:permittedDiaChanges>
        <dia:ConversionDescription xsi:type="dia:ConversionUriType">
          <dia:ConversionActUri uri=".."/>
          <!-- kind of adaptation: change bitrate, change resolution, ... -->
        </dia:ConversionDescription>
        <!-- further ConversionDescription elements would go here -->
      </dia:permittedDiaChanges>
      <!-- these constraints apply whether or not the image is adapted -->
      <dia:changeConstraint>
        <dia:constraint>
          <dia:AdaptationUnitConstraints>
            <!-- limits -->
            <dia:LimitConstraint>
              <dia:Argument xsi:type="dia:SemanticalRefType" semantics=".."/>
              <!-- attribute to evaluate: bitrate, resolution, ... -->
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value/>
                  <!-- limit value -->
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator=".."/>
              <!-- type of limit: max, min, ... -->
            </dia:LimitConstraint>
            <!-- further LimitConstraints would go here -->
          </dia:AdaptationUnitConstraints>
        </dia:constraint>
      </dia:changeConstraint>
    </r:allConditions>
  </r:grant>
  <r:issuer>
    <!-- Anna offers the right -->
    <r:keyHolder licensePartIdRef="Anna"/>
  </r:issuer>
</r:license>

(permittedDiaChanges ConversionDescription) and the conditions under which those changes can be performed (changeConstraint). Furthermore, Table 11.4 illustrates an example of how to express in an MPEG-21-compliant license that a source ("video") can be played as long as the "resolution" stays under some specific limits and the "bit rate" stays between two other specific values.
When dealing with protected/governed content, a content provision system with capabilities to adapt the content to a user's context characteristics would need to check this license and use the conditions referred to therein as additional constraints during the adaptation decision-taking process. This kind of license can be very useful in assisting content creators, owners, and distributors to keep some control of the quality of their products. It provides them with the means to set up the conditions under which their contents are consumed. This can also contribute to augmenting user satisfaction, as the content presented to users will satisfy the quality conditions intended by its creator.
In line with the above licensing examples for generic scenarios, a more specific license for use in the selected application scenario can also be derived. In this particular scenario, a teacher may want students to download the lectures at a good resolution exceeding a given minimum, as otherwise they will miss some important details of the presentation, video feed, and so on. Table 11.5 illustrates the resulting license that the teacher should issue in association with the lectures.
11.6.8 Adaptation Engines Stack
The AESs considered in a context-aware content adaptation platform are capable of performing multiple adaptations, as illustrated in Figure 11.35. An AES encloses a number of AEs into a single entity. All the AEs in an AES reside in a single hardware platform, sharing all the resources. The advantage of such an approach is that it is possible to cascade multiple AEs optimally to minimize computational complexity. For example, if both cropping and scaling operations need to be performed on a given non-scalable video stream, those operations can be performed together.
The service initialization agent is responsible for initializing each component in the AES. After initializing the AES, the registering agent communicates with the ADE to register its services, capabilities, and required parameters. It is also responsible for renewing the registered information in case of any change in its service parameters. The adaptation decision interpreter processes the adaptation decision message from the ADE requesting the adaptation service. Based on this information, it also decides the appropriate AE to be invoked, and its configuration. The progress of the adaptation operation is monitored by the AE monitoring service, which, if necessary, reports the progress back to the ADE.
This subsection presents two AEs for use in the virtual classroom application, with a focus on resource adaptation. The two AEs under consideration are:
1. Optimized Source and Channel Rate Allocation (OSCRA)
2. Cropping and Scaling of H.264/AVC Encoded Video (CS/H.264/AVC).
As well as the above adaptation techniques, the possibilities of using the scalability extension of H.264/AVC as a means of achieving quality, temporal, and spatial adaptations at a reduced computational complexity are also highlighted in this part.
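The following sketch illustrates the AES behavior just described: engines register their capabilities with the ADE, an adaptation decision message is interpreted, and compatible operations are cascaded in a single pass. The message formats and names are hypothetical, and the crop and scale functions are trivial placeholders operating on raw pixel arrays, not the actual compressed-domain CS/H.264/AVC engine.

# Sketch (hypothetical message formats) of an AES front end: registration,
# decision interpretation, and cascading of compatible AEs in one pass.
class AdaptationEngineStack:
    def __init__(self, ade_registry):
        self.ade_registry = ade_registry
        self.engines = {}                        # name -> callable(frame, params)

    def register(self, name, engine, capabilities):
        """Service initialization / registering agent."""
        self.engines[name] = engine
        self.ade_registry[name] = capabilities   # advertised to the ADE

    def handle_decision(self, frames, decision):
        """Adaptation decision interpreter: pick and chain the requested AEs."""
        chain = [(self.engines[op["engine"]], op["params"])
                 for op in decision["operations"]]
        for frame in frames:
            for engine, params in chain:         # cascaded AEs share one pass
                frame = engine(frame, params)
            yield frame


def crop(frame, params):                         # placeholder AE implementations
    x, y, w, h = params["roi"]
    return [row[x:x + w] for row in frame[y:y + h]]


def scale(frame, params):
    step = max(1, len(frame) // params["target_height"])
    return [row[::step] for row in frame[::step]]


registry = {}
aes = AdaptationEngineStack(registry)
aes.register("crop_ae", crop, {"operation": "crop"})
aes.register("scale_ae", scale, {"operation": "scale"})
decision = {"operations": [{"engine": "crop_ae", "params": {"roi": (0, 0, 8, 8)}},
                           {"engine": "scale_ae", "params": {"target_height": 4}}]}
frames = [[[p for p in range(16)] for _ in range(16)]]
print(list(aes.handle_decision(frames, decision))[0])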

Table 11.4 License expressing conditions upon the values of video encoding parameters

<r:license xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xmlns:dia="urn:mpeg:mpeg21:2003:01-DIA-NS"
           xmlns:r="urn:mpeg:mpeg21:2003:01-REL-R-NS"
           xmlns:sx="urn:mpeg:mpeg21:2003:01-REL-SX-NS"
           xmlns:mx="urn:mpeg:mpeg21:2003:01-REL-MX-NS"
           xsi:schemaLocation="urn:mpeg:mpeg21:2003:01-DIA-NS ConversionDescription.xsd">
  <r:inventory>
    <!-- ... -->
  </r:inventory>
  <r:grant>
    <!-- the User has the right of playing the source "video" -->
    <r:keyHolder licensePartIdRef="User"/>
    <mx:play/>
    <mx:diReference licensePartIdRef="video"/>
    <!-- under these conditions -->
    <r:allConditions>
      <dia:permittedDiaChanges>
        <!-- Adaptation of the Resolution -->
        <dia:ConversionDescription xsi:type="dia:ConversionUriType">
          <dia:ConversionActUri uri="urn:mpeg:mpeg21:2003:01-RDD-NS:CropRectangularBitmapImage"/>
        </dia:ConversionDescription>
        <!-- Adaptation of the BitRate -->
        <dia:ConversionDescription xsi:type="dia:ConversionUriType">
          <sx:rightUri definition="change bitrate"/>
        </dia:ConversionDescription>
        <!-- further ConversionDescription elements would go here -->
      </dia:permittedDiaChanges>
      <!-- these constraints apply whether or not the image is adapted -->
      <dia:changeConstraint>
        <dia:constraint>
          <dia:AdaptationUnitConstraints>
            <!-- maximum limits for the resolution -->
            <!-- width must be less than 352 -->
            <dia:LimitConstraint>
              <dia:Argument xsi:type="dia:SemanticalRefType"
                            semantics="urn:mpeg:mpeg21:2003:01-DIA-MediaInformationCS-NS:17"/>
              <!-- 17 refers to the width -->
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value>352</dia:Value>
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator="urn:mpeg:mpeg21:2003:01-DIA-StackFunctionOperatorCS-NS:12"/>
              <!-- 12 refers to the operator "<" -->
            </dia:LimitConstraint>
            <!-- height must be less than 240 -->
            <dia:LimitConstraint>
              <dia:Argument xsi:type="dia:SemanticalRefType"
                            semantics="urn:mpeg:mpeg21:2003:01-DIA-MediaInformationCS-NS:18"/>
              <!-- 18 refers to the height -->
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value>240</dia:Value>
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator="urn:mpeg:mpeg21:2003:01-DIA-StackFunctionOperatorCS-NS:12"/>
              <!-- 12 refers to the operator "<" -->
            </dia:LimitConstraint>
            <!-- Bitrate limits -->
            <!-- maximum limit -->
            <dia:LimitConstraint>
              <dia:Argument xsi:type="dia:SemanticalRefType"
                            semantics="urn:mpeg:mpeg21:2003:01-DIA-MediaInformationCS-NS:11"/>
              <!-- 11 refers to nominal bitrate -->
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value>5000</dia:Value>
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator="urn:mpeg:mpeg21:2003:01-DIA-StackFunctionOperatorCS-NS:12"/>
              <!-- 12 refers to the operator "<" -->
            </dia:LimitConstraint>
            <!-- minimum limit -->
            <dia:LimitConstraint>
              <dia:Argument xsi:type="dia:SemanticalRefType"
                            semantics="urn:mpeg:mpeg21:2003:01-DIA-MediaInformationCS-NS:7"/>
              <!-- 10 refers to nominal bitrate -->
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value>1000</dia:Value>
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator="urn:mpeg:mpeg21:2003:01-DIA-StackFunctionOperatorCS-NS:13"/>
              <!-- 13 refers to the operator ">" -->
            </dia:LimitConstraint>
          </dia:AdaptationUnitConstraints>
        </dia:constraint>
      </dia:changeConstraint>
    </r:allConditions>
  </r:grant>
  <r:issuer>
    <r:keyHolder licensePartIdRef="Distributor"/>
  </r:issuer>
</r:license>

As described above, the selected application scenario, namely the virtual classroom, involves numerous users with various preferences, as well as terminal- and network-specific constraints. In such an application scenario, one of the demanding user-centric adaptation cases is the delivery of user-centric services to different collaborators, such as the cropped view of a particular attention area (i.e. ROI) selected by a user. On the other hand, an optimal source and channel rate allocation technique is required in such a scenario in order to allocate a higher level of protection to the segments of the video content that are estimated to be more prone to corruption during transmission, by accurately modeling the distortion based on the given channel conditions. These adaptation methods are designed to respond to the variations in a number of context parameters, as explained in Table 11.6.
The function of the OSCRA-based AE is to adapt the level of error-resilience of the coded video sequence based on the prevailing error rates or packet drop rates experienced during transmission of media resources. In the virtual classroom application, this can be utilized to optimize the user satisfaction even under harsh network conditions. The AE improves the rate-distortion characteristics of the decoded video using differentially-prioritized segments of a video frame based on a metric that quantifies the relative importance of those different segments. In a generic adaptation scenario, the importance can be automatically weighted towards areas such as moving objects. In an IROI adaptation scenario, this relative importance measure is calculated based on a user's feedback on their preferences for a particular attention area of the video frame.
The request for adaptation can be generated for a number of reasons in the virtual classroom application scenario. For instance, a user may wish to focus on the lecturer alone, due to the

Table 11.5 License expressing conditions upon the video resolution for virtual classroom session. Reproduced by Permission of © 2007 IEEE

<r:license>
  <r:grant>
    <r:keyHolder licensePartIdRef="student"/>
    <mx:play/>
    <!-- the student can play the lecture -->
    <mx:diReference licensePartIdRef="lecture"/>
    <r:allConditions>
      <!-- under these conditions -->
      <dia:permittedDiaChanges>
        <!-- Adaptation of the Resolution -->
        <dia:ConversionDescription xsi:type="dia:ConversionUriType">
          <dia:ConversionActUri uri="urn:mpeg:mpeg21:2003:01-RDD-NS:CropRectangularBitmapImage"/>
        </dia:ConversionDescription>
      </dia:permittedDiaChanges>
      <!-- constraints apply whether there is adaptation or not -->
      <dia:changeConstraint>
        <dia:constraint>
          <dia:AdaptationUnitConstraints>
            <dia:LimitConstraint>
              <!-- minimum limit on the horizontal resolution -->
              <dia:Argument xsi:type="dia:SemanticalRefType"
                            semantics="urn:mpeg:mpeg21:2003:01-DIA-MediaInformationCS-NS:17"/>
              <dia:Argument xsi:type="dia:ConstantDataType">
                <dia:Constant xsi:type="dia:IntegerType">
                  <dia:Value>min value</dia:Value>
                </dia:Constant>
              </dia:Argument>
              <dia:Operation operator="urn:mpeg:mpeg21:2003:01-DIA-StackFunctionOperatorCS-NS:13"/>
              <!-- 13 refers to the operator ">" -->
            </dia:LimitConstraint>
          </dia:AdaptationUnitConstraints>
        </dia:constraint>
      </dia:changeConstraint>
    </r:allConditions>
  </r:grant>
  <r:issuer>
    <r:keyHolder licensePartIdRef="teacher"/>
  </r:issuer>
</r:license>

522 Visual Media Coding and Transmission Figure 11.35 Organization of an AES. Reproduced by Permission of Ó2008 IIMC Ltd and ICT Mobile Summit small display size of their PDA and/or mobile device (e.g. a phone). The AE can then reserve the maximum amount of resources for the lecturer’s region, as this can be considered as the attention area for the user. In the worst-case scenario, the AE may also consider totally ignoring the background or other speakers/regions in the scene based on the ADE’s decision. The AE can thus reallocate source channel rates and the error-protection resources for the visually salient Table 11.6 Contextual information handled by the AEs and the expected reaction of each AE to the variations in these contexts Context Expected reaction to changes in the context IROI OSCRA CS/H.264/AVC Network capacity and condition Prioritizing ROI by allocating ROI cropping more resources Display window size One or more of the following One or more of the and resolution in pixels actions: following actions: . Reducing resources for less . ROI cropping . Resolution scaling important regions and reallocating . SNR scaling them to the ROI . Changing the priority of more One or more of the important syntax elements following actions: Prioritizing ROI by allocating more . ROI cropping resources . Resolution scaling

Context based Visual Media Content Adaptation 523 regions accordingly. The reallocation of resources is based on a method that separates a video sequence into substreams, each of which carries an individual region, encodes them at varying rates according to their priority degrees, and then transmits them based on their relative importance over multiple radio bearers with different error-protection levels. Similarly, this method could also be used to apply adaptation to secure the important syntax elements of a video stream, so as to transmit sensitive data on secure channels. The CS/H.264/AVC-based AE serves user requests to view an ROI of their choice on a small display, such as a mobile phone, PDA, and so on. Once the ROI is defined by the user, the ADE has to determine whether this operation can be handled on the terminal itself. Network capacity and network condition must also be considered in this decision. Even when the network does not pose a bottleneck, the terminal and its decoder may not be capable of decoding the original high-resolution video stream. For example, a decoder in a PDA may not have the necessary computational resources to decode a high-definition television (HDTV) quality video due to processor and memory capacity restrictions, even when it is connected to a network without bandwidth limitations, such as a wireless local area network (WLAN). Apart from cropping the ROI, the AE often needs to resize the cropped video to match the terminal display’s resolution and fit it into the viewing window. Frequently, a user selects an arbitrary ROI aspect ratio that may not be identical to the display window aspect ratio. Thus, the AE also needs to be capable of making up the gap caused by the aspect ratio mismatch under the guidance of the ADE. Each of the above AEs is discussed in detail in the following subsections. 11.6.8.1 Optimized Source and Channel Rate Allocation-based Content Adaptation The provision of multimedia over mobile networks is challenging due to the hostile and highly variable nature of the radio channels in a scenario where remote students may want to join in various virtual classroom sessions. Carefully designed error-resilience techniques and channel protection mechanisms enable the adaptation of the video data so that it is more resilient to channel degradations. However, as these (i.e. the error-resilience techniques and channel protection mechanisms) have been separately optimized, they are still insufficiently effective for application over mobile channels on their own. Joint source channel coding approaches [122] have been proven to provide optimal performances for video applications over practical systems. These techniques can be divided into two areas: channel-optimized source coding and source-optimized channel coding. In the first approach, the channel coding scheme is fixed, and the source coding is designed to optimize the codec performance. In the case of source-optimized channel coding, the optimal channel coding is derived for a fixed source coding method. Application of source-optimized channel coding for video transmission in an error-prone propagation environment is considered in [123,124]. Adaptive source-channel code optimization has been proven to provide better perceptual video quality [125]. Source coding is dependent on the specific codec characteristics. Channel coding is performed at the physical link layer to overcome the effects of propagation errors. 
Both source and channel coding contribute to the overall channel bit rate, which is a characteristic of the network. Keeping the overall channel bit rate constant, the source and channel coding rates can be traded off against each other. This can be done by prioritizing different parts of the video bitstream and sending them over multiple bearers with different characteristics, such as different channel coding, modulation, and so on. In order to implement this method, it is necessary to separate

the encoded bitstream optimally into a number of substreams during the media adaptation process.

Rate Allocation-based Adaptation
This subsection addresses issues in designing an optimal joint source-channel bit rate allocation for the adaptation of video transmission over mobile networks in the virtual classroom application scenario under consideration. The scheme combines bit-level unequal error protection (UEP) and joint source-channel coding to obtain optimal video quality over a wide range of channel conditions. The encoded data is separated into a number of substreams based on the relative importance of the data in different video packets, which is calculated from the estimated perceived importance of bits at the AE. These streams are then mapped onto different radio bearers, which use different channel coding schemes depending on their relative importance. The realization of the scheme is presented in Figure 11.36.

Rate Allocation Scheme
The transmission channel is characterized by the probability of channel bit errors and the channel bandwidth, R_{ch}, expressed in terms of bits per second. Assume that the underlying communication system allows a maximum of N communication subchannels or radio bearers for a given video service, such as a virtual classroom session, as is the case in a Universal Mobile Telecommunications System (UMTS) network. The encoded bitstream is separated into N substreams. R_n denotes the channel bit rate on the nth subchannel, and c_n is the channel coding rate on the nth subchannel. The optimum possible source rate, R, is a function of the channel bit rates, R_n, and the channel coding rates, c_n:

R = R_1 + \dots + R_n + \dots + R_N, \qquad R_{ch} \ge R_1/c_1 + \dots + R_n/c_n + \dots + R_N/c_N    (11.1)

Figure 11.36 Realization of the rate allocation-based adaptation scheme. Reproduced by Permission of © 2008 IEEE

The expected distortion due to the corruption of the nth substream is E(D_n). Thus, the total sequence distortion E(D) becomes:

E(D) = \sum_{n=1}^{N} E(D_n)    (11.2)

(Reproduced by Permission of © 2008 IEEE.)
The goal set here is to find the optimal substream separation and the mapping of substream data onto multiple radio bearers, in order to maximize the received video quality for video transmission over a bandwidth-limited, error-prone channel. The optimization problem can formally be written as:

\text{minimize} \quad E(D) = \sum_{n=1}^{N} E(D_n)    (11.3)

(Reproduced by Permission of © 2008 IEEE.)
subject to:

R_1/c_1 + \dots + R_n/c_n + \dots + R_N/c_N \le R_{ch}    (11.4)

Let the input video sequence have L video frames. Each video frame is separated into M video packets, where M is a variable. Each video packet is divided into K partitions. In the case of MPEG-4 video, which supports two partitions, K is equal to two. The expected distortion due to the corruption of the first partition of the mth video packet of the lth frame is \alpha_{m,l}, the distortion resulting from the corruption of the second partition is \beta_{m,l}, and \lambda_{m,l} is the expected distortion due to the corruption of the Kth partition. Thus, the total expected distortion E(D) is:

E(D) = \sum_{n=1}^{N} E(D_n) = \sum_{l=1}^{L} \sum_{m=1}^{M} (\alpha_{m,l} + \beta_{m,l} + \dots + \lambda_{m,l})    (11.5)

(Reproduced by Permission of © 2008 IEEE.)
Now the optimization problem set in Equation (11.3) can be visualized as the optimal allocation of the bits within each partition into N substreams, subject to the constraint set in Equation (11.4). Data separation into N substreams is conducted based on the importance level of the information in each partition for the overall quality of the received video. For N substreams, there are N importance levels, labeled I_n (n \in \{1, \dots, N\}):

I_n \propto W_{m,l} \cdot A_{m,l}, \quad A \in \{\alpha, \beta, \dots, \lambda\}, \; m \in \{1, \dots, M\}, \; l \in \{1, \dots, L\}    (11.6)

W_{m,l} specifies a weighting factor. This factor can be used to prioritize the ROIs over other regions. If all parts of the information in a video frame are equally interesting, then W_{m,l} becomes one. The source-channel rate allocation algorithm is shown in Figure 11.37. The algorithm operates at video frame level with the estimation of the source rate, R, for a given channel bandwidth with maximum available channel protection.
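A minimal sketch of the substream separation implied by Equations (11.5) and (11.6) is given below. The data layout and the equal-split binning policy are assumptions made for illustration; the actual AE derives the expected distortions from the distortion model described later in this section.

# Illustrative sketch (hypothetical data layout) of the substream separation:
# each partition gets an importance level proportional to its weighted expected
# distortion, and partitions are then binned into N substreams.
def separate_into_substreams(partitions, n_streams):
    """partitions: list of dicts with 'expected_distortion' (alpha/beta/.../lambda)
    and an ROI weighting factor 'weight' (W_{m,l}); returns n_streams lists."""
    # Importance I_n is proportional to W_{m,l} * A_{m,l}  (Equation 11.6)
    scored = [(p["weight"] * p["expected_distortion"], p) for p in partitions]
    scored.sort(key=lambda s: s[0], reverse=True)
    streams = [[] for _ in range(n_streams)]
    # Simple equal-split binning: the most important partitions go to stream 0,
    # which will later be mapped onto the best-protected radio bearer.
    for rank, (_, p) in enumerate(scored):
        streams[min(rank * n_streams // max(len(scored), 1), n_streams - 1)].append(p)
    return streams


packets = [{"expected_distortion": d, "weight": w}
           for d, w in [(12.0, 1.0), (3.5, 1.0), (9.0, 2.0), (0.7, 1.0)]]
for n, stream in enumerate(separate_into_substreams(packets, 3)):
    print(f"substream {n}: {stream}")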

526 Visual Media Coding and Transmission Figure 11.37 The source channel rate allocation algorithm
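The iteration shown in Figure 11.37 and detailed in the following paragraphs can be sketched as below. The encoder, the distortion model, and the bearer table are hypothetical callables and data standing in for the real components, and the bandwidth-recovery step is simplified to lowering the candidate source rate.

# A compact sketch of the iterative source-channel rate allocation loop.
def allocate_rates(encode, expected_distortion, bearers, channel_bw, rate_step):
    """bearers: list of (coding_rate c_n, bit_error_ratio) pairs, most protected
    first. Returns the best (source_rate, assignment, total_distortion), or None."""
    best = None
    prev_distortion = float("inf")
    # Start from the minimum source rate: everything on the most protected bearer
    source_rate = channel_bw * min(c for c, _ in bearers)
    while source_rate > 0:
        packets = sorted(encode(source_rate),                  # candidate encoding
                         key=lambda p: p["importance"], reverse=True)
        if not packets:
            break
        # Most important packets go onto the most protected bearers
        assignment = [bearers[min(i * len(bearers) // len(packets), len(bearers) - 1)]
                      for i in range(len(packets))]
        # Bandwidth check: sum(R_n / c_n) <= R_ch  (Equation 11.4)
        if sum(p["bits"] / c for p, (c, _) in zip(packets, assignment)) > channel_bw:
            source_rate -= rate_step                           # shed load and retry
            continue
        distortion = sum(expected_distortion(p, ber)
                         for p, (_, ber) in zip(packets, assignment))
        if distortion >= prev_distortion:                      # local minimum reached
            break
        best, prev_distortion = (source_rate, assignment, distortion), distortion
        source_rate += rate_step                               # try a higher source rate
    return best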

After encoding the video frame with the estimated source rate, the expected distortion is calculated at video packet level. Based on the calculated distortion values, the data partition of each video packet is assigned an importance level, which is calculated according to Equation (11.6), and the data is separated into N substreams accordingly.
The instantaneous channel quality for each selected subchannel is predicted from the channel quality measurement conducted at the network or at the terminal. The source rates for the subchannels are calculated, and the channel bandwidth requirement set out in Equation (11.4) is checked. If the bandwidth requirement is not satisfied, the data on the highly-protected subchannels is reduced, and the importance levels are recalculated. After the channel bandwidth requirement steps, the total expected frame distortion is calculated for the particular subchannel configuration. This total expected frame distortion is compared to the value obtained from the previous iteration to find the local minimum distortion value. If the minimum distortion is obtained, the process is terminated. Otherwise, the source rate is incremented by one step, and the process is repeated.
Note that the source-rate allocation algorithm starts with an estimated source rate for a maximum channel protection level. This provides the minimum source rate for a given channel bandwidth. For the second and following iterations, the effective channel bit-error ratio, \mu, is used in estimating the expected video packet distortion. \mu is computed as:

\mu = (\eta_1 \cdot R_1 + \dots + \eta_n \cdot R_n + \dots + \eta_N \cdot R_N)/R    (11.7)

where \eta_n denotes the channel bit-error ratio on the nth subchannel. After finding the optimal source-channel rate allocations, the data on each substream is reformatted to achieve stream synchronization at the receiver, and transmitted using the selected radio channels.

Modeling of Distortions
The source-channel rate allocation algorithm for adapting the video streams described in the previous subsection relies on an accurate distortion model at video packet level. This subsection describes the distortion modeling algorithm for estimating the distortions due to the corruption of data in each partition of a video packet. The video packet format is assumed to be MPEG-4 Visual simple profile with data partitioning enabled.
Video performance can be shown to be dependent on a combination of quantization distortion, E(D_{Q,pv}), and channel distortion. Channel distortion can be further subdivided into two parts: concealment distortion and distortion caused by error propagation over predictive frames. Concealment distortion depends on the concealment techniques applied at the decoder. The scheme under consideration applies a temporal concealment technique. That is to say, if an error is detected in a decoded video packet, the decoder discards that packet and replaces the discarded data with the concealed data from the corresponding macroblocks (MBs) of the previous frame. The distortion caused by such a concealment process is called temporal concealment distortion, E(D_{t_con,pv}). Frame-to-frame error propagation through motion prediction and temporal concealment is called temporal domain error propagation, f_{tp}. The distortion model adopted in this section has similarities with the method proposed in [126].
However, the distortion induced due to error propagation is calculated differently, even though the same assumption is made, namely the uniform distribution of the video reconstruction error. The model also uses adaptive intra refresh (AIR) techniques instead of intra frame refreshment,

which is used in the model presented in [126]. The modifications made enhance the accuracy of the distortion calculation.
Taking the video packet as the base unit, the expected frame quality can be written as:

E(Q_f^j) = 10 \cdot \log\left( \gamma \Big/ \sum_{i=0}^{I_j} E(D_{pv}^{i,j}) \right)    (11.8)

where E(Q_f^j) is the expected quality, E(D_{pv}^{i,j}) is the expected distortion of the video packet, and I_j is the total number of video packets. (Reproduced by Permission of © 2008 IEEE.)
The superscripts i and j represent the ith video packet of the jth video frame. \gamma is a constant defined by the dimensions of the frame. For instance, for a common intermediate format (CIF)-resolution video, \gamma = 255^2 \times 352 \times 288. E(D_{pv}^{i,j}) can be written as:

E(D_{pv}^{i,j}) = E(D_{Q,pv}^{i,j}) + \rho_d^{i,j,pv} \cdot E(D_{t_con,pv}^{i,j}) + f_{tp}^{i,j}    (11.9)

(Reproduced by Permission of © 2008 IEEE.)
In this equation, \rho_d^{i,j,pv} denotes the probability of receiving an erroneous video packet. Calculation of each term shown in Equation (11.9) depends on the formatting of the video coding, the error-resilience and concealment techniques, and the coding standard. The probability calculation for MPEG-4 (simple profile with data partitioning)-encoded video is described below, so as to exemplify the process.
Say the probability of receiving a video object plane (VOP) header with errors is \rho_{VOP}^{i,j}, and the probability of receiving the video packet header and the motion information with errors is \rho_M^{i,j}. In addition, the probability of finding an error in the discrete cosine transform (DCT) part is \chi^{i,j}. Then:

\rho_d^{i,j,pv} = (1 - \rho_{VOP}^{i,j}) \cdot (1 - \rho_M^{i,j}) \cdot \chi^{i,j}    (11.10)

(Reproduced by Permission of © 2008 IEEE.)
For a given channel bit-error probability, \rho_b, it can be shown that:

\rho_{VOP}^{i,j} = \sum_{v=1}^{V} (1 - \rho_b)^{v-1} \rho_b = 1 - (1 - \rho_b)^{V}    (11.11)

where V represents the VOP header size. (Reproduced by Permission of © 2008 IEEE.)
Similarly:

\rho_M^{i,j} = 1 - (1 - \rho_b)^{Z_M}    (11.12)

\chi^{i,j} = \sum_{z=1}^{Z_{DCT}} \binom{Z_{DCT}}{z} \rho_b^{z} (1 - \rho_b)^{Z_{DCT}-z} = 1 - (1 - \rho_b)^{Z_{DCT}}    (11.13)

where Z_{DCT} and Z_M denote the lengths of the DCT and motion vector data, respectively. (Equations (11.12) and (11.13) Reproduced by Permission of © 2008 IEEE.)
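Equations (11.10) to (11.13) can be transcribed directly into code, which also gives a quick numerical check that each summation matches its closed form. The header and partition sizes in the example call are arbitrary illustrative values.

# Direct transcription of Equations (11.10)-(11.13). All sizes are in bits;
# rho_b is the channel bit-error probability.
from math import comb


def p_vop_header_error(rho_b, V):
    # Equation (11.11): sum_{v=1..V} (1-rho_b)^(v-1) * rho_b == 1 - (1-rho_b)^V
    return sum((1 - rho_b) ** (v - 1) * rho_b for v in range(1, V + 1))


def p_motion_error(rho_b, Z_M):
    # Equation (11.12)
    return 1 - (1 - rho_b) ** Z_M


def p_dct_error(rho_b, Z_DCT):
    # Equation (11.13): the binomial sum equals 1 - (1-rho_b)^Z_DCT
    return sum(comb(Z_DCT, z) * rho_b ** z * (1 - rho_b) ** (Z_DCT - z)
               for z in range(1, Z_DCT + 1))


def p_erroneous_packet(rho_b, V, Z_M, Z_DCT):
    # Equation (11.10)
    return ((1 - p_vop_header_error(rho_b, V))
            * (1 - p_motion_error(rho_b, Z_M))
            * p_dct_error(rho_b, Z_DCT))


rho_b = 1e-3
print(round(p_vop_header_error(rho_b, 40), 6), round(1 - (1 - rho_b) ** 40, 6))
print(round(p_erroneous_packet(rho_b, V=40, Z_M=120, Z_DCT=800), 6))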

The expected distortions of MBs for MPEG-4-encoded video are calculated in the same way as specified in [126]. The quantization distortion is computed by comparing the reconstructed MBs and the original MBs at the encoder. Concealment distortions are also computed in a similar manner. The transmitted video data belonging to each MB is corrupted using a noise generator located at the encoder [124]. Corrupted data is replaced by the concealed data, and the data belonging to the original and concealed MBs are compared. It is assumed that the neighboring video packets and reference frames are correctly received during the calculation.
The temporal error propagation due to MB mismatch between adjacent video frames is quantified by the term f_{tp}^{i,j} in Equation (11.9), which is computed as:

f_{tp}^{i,j} = (1 - \rho_u^{i,j,pv}) \cdot P_{TP}^{j-1} \left[ \sum_{k \in W} \rho_d^{i,j-1,pv} \cdot E(D_{t_con,pv}^{k,j-1}) + (1 - \rho_u^{i,j-1,pv}) \cdot P_{TP}^{j-2} \right]    (11.14)

where W denotes the set of coded blocks in a frame. (Reproduced by Permission of © 2008 IEEE.)
The summation in Equation (11.14) represents the error propagation through the MBs. P_{TP}^{j-1} quantifies the fraction of the distortion of the reference video packet (in the (j-1)th frame) that should be considered in the propagation loss calculation. P_{TP}^{j-1} is computed as:

P_{TP}^{j-1} = 1 - (1 - \rho_b)^{F_{j-1}}    (11.15)

where F_{j-1} is the size of the (j-1)th frame. (Reproduced by Permission of © 2008 IEEE.)

Application of the Adaptation Scheme
The adaptation scheme uses the predicted channel quality information to calculate the expected distortion. Different channel coding schemes combined with spreading gain provide a number of different radio bearer configurations that offer flexibility in the degree of protection. For example, UMTS employs four channel coding schemes. The available channel coding methods and code rates for dedicated channels are 1/2 rate convolutional code, 1/3 rate convolutional code, 1/3 rate turbo code, and no coding.
A video frame is encoded at the selected source rate and separated into different substreams. The higher-priority data is sent over the highly-protected channels, while low protection is used to transmit the low-priority streams. This arrangement adapts the available network resources according to the perceived importance of the selected objects from the video data. The algorithm performs a number of iterations to obtain the optimal operating point, as shown in Figure 11.37. Re-encoding the frame with an adjusted source rate can, however, delay the transmission process. Therefore, for a real-time video application, it is suggested that the encoding rate adjustments be performed only for the first two frames of the sequence. The following video frames should be encoded at the source rate obtained from the first two frames. As the encoding is performed only once for the following frames, the expected distortion calculation, and therefore the whole algorithm process, is simplified.
ROI coding and UEP schemes, if used together, can significantly improve the perceived video quality at the user end [127]. The importance levels associated with the segmented objects can be applied to a region within a video frame that is of more interest to a user in the application scenario under consideration, so as to transmit it over one of the most secure radio bearers available with maximum error protection.
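A simple sketch of the substream-to-bearer mapping described above is given below, using the UMTS dedicated-channel coding options listed in the text; the bearer table, the protection ranking, and the one-substream-per-bearer policy are illustrative assumptions rather than the actual scheduler.

# Sketch (hypothetical bearer table) of the mapping step: each substream is sent
# over the bearer whose protection level matches its priority.
BEARERS = [
    {"name": "turbo 1/3",         "code_rate": 1 / 3, "protection": 3},
    {"name": "convolutional 1/3", "code_rate": 1 / 3, "protection": 2},
    {"name": "convolutional 1/2", "code_rate": 1 / 2, "protection": 1},
    {"name": "uncoded",           "code_rate": 1.0,   "protection": 0},
]


def map_substreams_to_bearers(substreams):
    """substreams: list of (importance_level, payload_bits) tuples."""
    ordered = sorted(substreams, key=lambda s: s[0], reverse=True)
    mapping = []
    for rank, (importance, bits) in enumerate(ordered):
        bearer = BEARERS[min(rank, len(BEARERS) - 1)]
        # Channel bits needed on this bearer: payload divided by the coding rate
        mapping.append({"importance": importance,
                        "bearer": bearer["name"],
                        "channel_bits": bits / bearer["code_rate"]})
    return mapping


for entry in map_substreams_to_bearers([(0.9, 12000), (0.5, 8000), (0.1, 20000)]):
    print(entry)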

530 Visual Media Coding and Transmission Experimentation Setup Video sequences are encoded according to the MPEG-4 visual simple profile [128] format. This includes the error-resilience tools, such as video packetization, data partitioning, and reversible variable-length coding. The first video frame is intra coded, while others use inter (i.e. predictive) coding. A Test Model 5 (TM5) rate control algorithm is used to achieve a smoother output bit rate, while an AIR algorithm [128,129] is used to stop temporal error propagation. The two test sequences shown in Figure 11.38, namely Singer and Kettle, are used as the source signals in the experiments. The Singer sequence is used as the background, and the Kettle sequence is segmented and used in the foreground of the output sequence. These CIF (352 Â 288 pixels) sequences are coded at 30 fps. The encoded sequences are transmitted over a simulated UMTS channel [124]. The simulator consists of the UMTS Terrestrial Radio Access Network (UTRAN) data flow model and Wideband Code Division Multiple Access (WCDMA) physical layer for the forward link. The WCDMA physical layer model is generic and enables easy configuration of the UTRAN link-level parameters, such as channel structures, channel coding/decoding, spreading/de- spreading, modulation, transmission modeling, propagation environments, and their corre- sponding data rates according to the 3rd Generation Partnership Project (3GPP) specifications. The transmitted signal is subjected to a multipath fast-fading environment. The multipath- induced inter-symbol interference is implicit in the chip-level simulator. By adjusting the variance of the noise source, the bit error and block error characteristics can be determined for a range of SNRs and for different physical layer configurations. A detailed explanation of the link-level simulator used can be found in [124]. The simulation setup for transmission considers a vehicular A propagation condition and downlink transmission. The mobile speed is set to 50 kmph. The experimentally-evaluated average channel block error rates (BLER) over the vehicular A environment are listed for different channel protection schemes using convolutional coding (CC) and bit energy-to-noise ratios (Eb/No) in Table 11.7. The experiment carried out in this section assumes a perfect Figure 11.38 Input sequences used in experiments: (a) Singer sequence; (b) Kettle sequence. Reproduced by Permission of Ó 2007 IEEE

