Dr. Amela Sadagic, Ben Teitelbaum
Internet2, Advanced Network and Services

Dr. Jason Leigh
Electronic Visualization Lab., Univ. of Illinois at Chicago

Prof. Magda El Zarki, Haining Liu
Information and Computer Science, University of California, Irvine
During the last few years, we have witnessed a tremendous proliferation of the Internet, which has penetrated all aspects of everyday life. Starting off as a purely academic research network, the Internet is now extensively used for education and entertainment, serves as a very promising and dynamic marketplace, and is envisioned to evolve into a vehicle of true collaboration and a multi-purpose working environment. Although based on a best-effort service model, the simplicity of its packet-switched nature and the flexibility of its underlying packet forwarding regime, IP, accommodate millions of users while offering acceptable performance. In the meantime, exciting new applications and networked services have emerged, posing more stringent demands on the network. In order to offer a better-than-best-effort Internet, new service models that offer certain performance guarantees to applications have been proposed. While several of these proposals are in place and many QoS-enabled networks are operating, there is still a lack of comprehension about the precise requirements new applications have in order to function with high or acceptable levels of quality. Furthermore, what is required is an understanding of how network-level QoS reflects on actual application utility and usability. This document tries to fill this gap by presenting an extensive survey of applications' QoS needs. It identifies applications that cannot be accommodated by today's modest best-effort Internet service model and reviews the nature of these applications as far as their behaviour with respect to the network is concerned. It presents guidelines and recommendations on what levels of network performance parameters are required so that applications operate with high quality, or within ranges of acceptable quality. In tandem with this, the document highlights the central role of the application and application developers in achieving the expected performance from a network service.
It argues that the network itself cannot guarantee good performance unless assisted by well-designed applications that can tailor their behaviour, by employing suitable adaptation mechanisms, to whatever network conditions or service model is present. This document also reviews the tools and experimentation procedures that have recently been proposed to quantify how different levels of resource guarantees map onto application-level quality. This will allow network engineers, application developers and other interested parties to design, deploy and parameterise networks and applications that offer increased user utility and achieve efficient utilisation of network resources.
In its present form, this document is primarily focused on audio and video applications. It presents a detailed analysis of the end-to-end performance requirements of applications such as audio-video conferencing, voice over IP and streaming of high-quality audio and video, and reviews the adaptation choices available to these applications so that they can operate within a wider range of network conditions.
AF (Assured Forwarding): A type of per-hop behaviour for DiffServ aggregates of flows.
DiffServ (Differentiated Services): A layer-3 approach to providing QoS to aggregates of flows.
EF (Expedited Forwarding): A type of per-hop behaviour for Differentiated Services flows.
IntServ (Integrated Services): An approach to IP QoS that introduces services providing fine-grained assurances to individual flows.
MBone: A multicast-enabled overlay network over IP.
ASF (Advanced Streaming Format): An open video streaming format developed jointly by Microsoft, Progressive Networks, Inc., Intel Corp., Adobe Systems Inc. and Vivo Software Inc.
MC (Motion Compensation): A technique used in video encoding to reduce temporal redundancy in video and achieve higher compression rates.
MNB (Measuring Normalizing Blocks): An objective speech quality assessment method developed at the Institute for Telecommunications Sciences.
MOS (Mean Opinion Score): A method of subjective quality evaluation of encoded multimedia content, based on the collection and statistical processing of quality ratings obtained from human subjects after viewing the corresponding material in a controlled environment.
PAMS (Perceptual Analysis Measurement System): A speech quality metric developed at British Telecommunications.
PESQ (Perceptual Evaluation of Speech Quality): A model of speech quality jointly developed by KPN Research and BT.
PSQM (Perceptual Speech Quality Measure): A method of objective quality assessment of speech signals developed at KPN Research; also an ITU-T Recommendation (P.861).
RSVP (Resource Reservation Protocol): A signalling protocol used by the Integrated Services QoS model to establish quality-assured connections.
SNR (Signal-to-Noise Ratio): A simple objective metric to measure the quality of a signal.
Until today, the Internet has been predominantly used by conventional TCP-oriented services and applications, like the web, ftp, email, etc., enriched with static media types (images, animations, etc.). For the last few years, the Internet has also been utilised to transport modest-quality audio-visual content in the form of streaming. It is also being used as a low-cost interactive communication medium, to facilitate interactive video and voice communication. However, people are starting to realise the potential of using the plethora of already existing applications from different disciplines and contexts over the Internet. We witness the emergence of a new generation of applications that can revolutionise the way people have conducted research, worked together and communicated to date. We call this new breed of applications advanced Internet applications. Advanced Internet applications can offer new opportunities and ways for communication and collaboration, leverage teaching and learning, and significantly improve the way research groups are brought together and share scientific data and ideas. The use of advanced applications will facilitate new frontier applications that explore complex research problems, enable seamless collaboration and experimentation on a large scale, access and examine distributed data sets, and bring research teams closer together in a virtual research space. Advanced applications can also be characterised by a rich set of interactive media, more 'natural' and intuitive user interfaces, new collaboration technologies using high-quality sensory data, and interactive access to and real-time manipulation of large distributed data repositories. To mention only a snapshot, application areas can be:
If they are to be transmitted over the Internet, the data and media flows of advanced applications pose a high degree of requirements on all the components and devices on the end-to-end path. These are requirements for real-time operating system support, new distributed computing strategies and resources, databases, improved display and hardware capabilities, development of efficient middleware, and, of course, certain prerequisites on the underlying network infrastructure. Large-scale scientific exploration and data mining require the exchange of large volumes of data (in the order of tera- and petabytes) between remote sites. High-quality data visualisation and videoconferencing applications demand huge amounts of bandwidth, possibly with tight timing requirements. On the other hand, there are applications that are highly sensitive to any loss of data. In order to function with acceptable quality, such applications require exceptionally high bandwidth, and also specific and/or bounded network treatment with respect to other network performance parameters (delay, jitter, loss, etc.). In other words, they require bounded worst-case performance, something generically called "Quality of Service", which a best-effort Internet in its current state cannot always guarantee, as it does not support any means of traffic differentiation.
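To put such data volumes in perspective, a back-of-the-envelope sketch (our own illustration, not from the report) shows the ideal transfer time of one terabyte at different sustained rates, ignoring protocol overhead:

```python
def transfer_time_hours(volume_bytes: float, rate_bps: float) -> float:
    """Ideal transfer time in hours for a given sustained bit-rate
    (no protocol overhead, full link utilisation assumed)."""
    return (volume_bytes * 8) / rate_bps / 3600

TB = 10**12
# 1 TB over a fully utilised 100 Mbps link takes roughly a day
print(round(transfer_time_hours(TB, 100e6), 1))   # ~22.2 hours
# the same terabyte over a 10 Gbps link
print(round(transfer_time_hours(TB, 10e9) * 60, 1))  # ~13.3 minutes
```

Even idealised numbers like these make clear why tera- and petabyte exchanges require dedicated high-capacity paths rather than shared best-effort links.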
Quality of service is a very popular yet overloaded term that is often viewed from different perspectives by the networking and application-developer communities. In the networking literature, QoS is quantified and measured in network-centric terms, such as throughput, end-to-end delay, bounds on delay and delay variation (jitter), or packet loss percentage and loss pattern. As a result, from a network engineering point of view, the design goal is to guarantee QoS by negotiating and assuring certain bounds on these metrics while at the same time trying to maximise network utilisation (which usually translates to maximising revenue). In contrast, the view of QoS that application developers and application users have is more subjective: that of maximising the utility of the application. The term utility is an umbrella term: it embraces perceived quality, that is, how pleasant or unpleasant the presentation quality is to the user of the application (e.g., the visual quality of a displayed video sequence). Additionally, it may reflect the application's ability to perform its task (for example, in IP telephony, whether a good conversation is achieved) or to generate user interest (which in turn may produce revenue - an important incentive). On certain occasions, QoS terms have been interpreted differently by different communities. In networking, the term delay expresses the amount of time it takes for a data unit to propagate through the different paths of the network. For the application developer, e.g., a video codec designer, it is the time required for data to be encoded/decoded. It is very often the case that the two communities disregard the importance of this disparity in perspective.
For example, until recently the image processing community assumed that the underlying transmission infrastructure provided a reliable transport medium, a circuit-switched equivalent, where the only delay was the propagation time and losses were rare and corrected at the physical or data-link layer. Thus, they strived to maximise the quality of the encoded material by optimally selecting appropriate encoder/decoder parameters. In a non-deterministic environment like the Internet, these assumptions do not hold. For example, packet loss may dramatically degrade the quality of the encoded stream, and the perceptual distortion it causes is usually far more significant than that introduced by encoding artifacts. It is imperative that these misconceptions be alleviated and that a mutual understanding of what quality stands for in different communities be established.
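The network-centric metrics mentioned above have standard operational definitions. As an illustrative sketch (function names are ours), an RTP-style receiver can estimate interarrival jitter with the smoothed estimator of RFC 3550 and packet loss from sequence numbers:

```python
def interarrival_jitter(send_times, recv_times):
    """RFC 3550 smoothed jitter estimate: J += (|D| - J) / 16, where D is
    the change in one-way transit time between consecutive packets."""
    j = 0.0
    prev_transit = None
    for s, r in zip(send_times, recv_times):
        transit = r - s
        if prev_transit is not None:
            j += (abs(transit - prev_transit) - j) / 16
        prev_transit = transit
    return j

def loss_ratio(sent_seqs, recv_seqs):
    """Fraction of sent sequence numbers that never arrived."""
    return len(set(sent_seqs) - set(recv_seqs)) / len(sent_seqs)
```

Note that sender and receiver need not share a synchronised clock: only differences of transit times enter the jitter estimator, so a constant clock offset cancels out.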
There is a broad belief that advanced applications cannot be entirely accommodated by today's Internet, and that it is necessary to have a service model that offers QoS guarantees to flows that need them. There is another camp that claims the QoS needs of applications can be sufficiently serviced by an over-provisioned best-effort network, combined with application intelligence to adapt to the changing availability of network resources and to tolerate loss and jitter. Both approaches have their respective merits and disadvantages. It is probably true that an efficient solution lies somewhere in between the two ends of the spectrum and favours some form of traffic differentiation.
It is apparent that without any form of traffic classification and prioritisation, network congestion will become a problem, affecting QoS-sensitive flows and deteriorating the quality of the corresponding applications. However, the selection of a suitable network model is a complicated function of several factors, like the criticality of the applications, the complexity and scalability of the solution, and the economic model or market needs. A very important factor is also the kind of applications that are designed and expected to run over a network. Since networks are ultimately used by users running applications, it is imperative that network designers and Internet service providers consider the effect of those applications operating over the network, and also the effect of the network's capabilities or service model on the usability and quality of applications. In this respect, the research, design, development, upgrade or configuration of networks has to be driven with the target applications' needs and requirements in mind. The reverse also holds: applications need to consider the capabilities and limitations of the networks used to transmit their data. Applications that are unresponsive in the way they transmit their streams can cause network congestion or even congestion collapse, reduce the network's utilisation and inescapably suffer the consequences of their own behaviour.
Understanding the performance needs of advanced applications is essential, as it can provide both the network and applications R&D communities with a better understanding of how network services can be tailored to suit the demands of advanced applications, and how advanced applications can exploit existing or new networks in a beneficial manner. Furthermore, it can allow applications themselves to deploy built-in mechanisms that let them function with acceptable quality even on a network that, at times, displays characteristics far from ideal. In order to do so, it is necessary that the whole range of operational behaviours of applications be carefully explored and translated into proper adaptation mechanisms or policies. Such attributes are particularly important for the application itself, as they allow it to operate in a wide range of networking environments, thus increasing its acceptance or marketability. Additionally, the health and well-being of the underlying network regime will also be preserved.
This report is a "working document"; in this respect it should not be considered "complete" or "exhaustive", but will be continuously updated. Its purpose is twofold: firstly, to investigate the QoS needs of Internet applications from the underlying network, and the ranges of values of network performance metrics within which advanced applications operate with high or acceptable quality. This is how the application "expects" or "requires" to be honoured by the network. Secondly, it is particularly important to acquire a good understanding of what set of behaviours an application can develop in order to (i) make the most out of the underlying network that transports its data flows, and (ii) in turn, "honour" and "protect" the network from undesired circumstances. Both these issues are central to the success of advanced Internet applications and indicative of the need for closer cooperation between the "application" and the "network", a dialogue that needs to be further promoted. Good end-to-end application performance should eventually be a task shared between the 'network' and the 'application', finding the best balance between network engineering, application design and economic incentives.
Chapter 2 presents a multidimensional taxonomy of Internet applications and investigates how this taxonomy poses or influences the performance characteristics and requirements of the respective applications. We present a high-level review of different classes of applications. Chapter 3 examines the issue of application quality and presents a detailed review of end-to-end performance requirements for two classes of applications: interactive IP audio (VoIP), and Internet video streaming and conferencing. In the last part of that chapter, we present an overview of adaptation techniques and methods that audio and video applications may utilise to 'share the burden' of QoS with the network. Chapter 4 discusses recent advances in the research and development of tools and methods to measure application quality. Again, this chapter focuses on quality assessment methods for audio and video. Finally, chapter 5 concludes this report.
In this chapter we present a multi-dimensional taxonomy of advanced applications. Advanced applications possess a set of characteristics and features that do not lie in the same contextual space, and it is therefore not feasible to define a taxonomy along a single dimension. Instead, applications can be categorised as belonging to one or more categories. This division can be based on the task they serve (task characteristics), the type of media they involve, the situation of operation (e.g., geographical dispersion of users) and the behavioural characteristics of the users (e.g., user expectations, user skills, etc.). The utility of an application, defined as its ability to successfully complete its task or as the quality perceived by the end-user, is admittedly a function of all the above factors and is, in many cases, hard to define. In order to gain a better understanding of how the characteristics of an application define or dictate its quality requirements, we attempt to classify applications, where applicable, by examining the properties of the above-mentioned planes (task characteristics, type of media, user behaviour and situations of usage). As we stressed before, our taxonomy is not uni-dimensional, but spans several planes.
In this section, we present a first taxonomy or, more precisely, a grouping of advanced applications by considering common inner characteristics of applications and usage scenarios from a number of different viewpoints. For each class of applications, we try to devise generic, high-level guidelines for the specification of quality requirements considering the fact that an application's behaviour is influenced by multiple factors.
Applications can be categorised by considering the task they try to achieve or the kind of activities that take place. At a high level, application tasks can be classified into Telepresence and Teledata, and into Foreground and Background tasks, a division derived from Buxton .
Telepresence vs. Teledata. The distinction between these two classes lies in whether applications are aimed at supporting communication, enabling awareness between users and facilitating immersiveness in virtual environments (telepresence), such as videoconferencing and virtual meetings, or at carrying useful data to the user (teledata), such as video or music streaming. In general, telepresence tasks can be identified as human-to-human tasks, and teledata as human-to-machine interaction. In certain applications, both telepresence and teledata tasks may coexist. Activities that involve interaction between users will have different requirements than human-to-machine tasks, as the nature of the interaction involves various user behaviours that affect quality differently.
More precisely, the definition of the term (tele)presence may have different interpretations depending on whether it is defined as presence in Virtual Reality or in videoconferencing:
Presence in VR is usually considered to be the sense of "being in" or "being part of" a mediated virtual environment, one that is different from the physical environment in which the observer is currently located (note: the existence of other people or their graphical representations is not relevant here). The term telepresence introduces the notion of being in a remote place; given that all environments in VR are virtual, the factor of remoteness does not apply to them, and therefore the use of the prefix "tele" is not appropriate. On the other hand, one could say that "telepresence" may signify the sense of being in environments that are shared by users who are at different geographic locations. They are then able to experience this shared virtual environment as the place where they all meet and interact with each other. If used in this way, the correct term suggested in the VR literature is "co-presence" or "social presence". In the case of videoconferencing systems, the term telepresence can and does indicate the sense of being in a remote place. There is also another component to its definition - telepresence includes and assumes the existence of other people and interactions among them, something that was not part of the definition of presence in VR. Both definitions are correct in their respective areas; nevertheless, since a substantial part of our document treats videoconferencing systems, to avoid misunderstanding we will use the term telepresence as it is used in these systems.
Foreground vs. Background tasks.
The classification of an application task as foreground or background has major implications for how users perceive its quality. According to Buxton, a foreground task gets the full attention of the user, whereas a background task does not. Background tasks take place in the 'periphery', usually introduced to promote or enable awareness. Examples of foreground applications are those involving user interaction, monitoring and/or responding to a number of ongoing activities, whereas in background tasks the role of the user is that of a passive observer. It is clear that foreground applications will have significantly higher quality requirements than background tasks, and that background tasks can be accommodated with a modest set of resources securing a low level of quality. Although background tasks or data are not tightly related to QoS requirements, it is still imperative that they exist in the context of advanced applications, since their absence would influence the way foreground tasks or data are perceived. For example, the lack of background noise in environments where it would naturally be expected (street noise while you are wandering in a virtual city, or crowd noise in a virtual museum) will influence immersiveness, as it gives the user the feeling of a sterile and unnatural environment. This apparently affects application quality, in this case the feeling of presence in a real environment.
The above two divisions are orthogonal classifications and divide applications into four main types: foreground teledata, background teledata, foreground telepresence and background telepresence. Based on this classification of Buxton, we try to identify applications in these categories and, later, to outline generic quality requirements:
Usually, as a rule of thumb, this kind of application will require high audio-visual quality because, in such cases, it is not only human factors that determine the quality, but the criticality of the application as well. Also, as discussed above, the type of application will dictate whether certain data flows (e.g., data that control remote devices) have to be treated preferentially.
In telepresence applications, the auditory channel is quite important and in many cases the most vital means of communication. The existence of an auxiliary video channel increases the users' perception of the task. However, very low video frame rates (< 5 Hz) can cause a mismatch between the auditory and visual cues, leading to complete loss of lip synchronisation, with annoying perceptual results. For the visual feed to be influential to the application task and act complementarily, the video frame rate needs to be over 15-16 Hz (frames per second) [55,15].
We observe that background data do not possess particularly stringent network QoS requirements, as they are usually low-bandwidth flows that serve auxiliary purposes to the application. If data transmission prioritisation is utilised, they can be low-priority flows, or may be dropped as they have less importance for the quality of the application (in comparison with foreground data). For these reasons, we subsequently ignore background tasks or data.
Interactive vs. non-interactive tasks. In interactive tasks, actions are followed by appropriate responses. Interactivity may arise between persons (interpersonal), between a human and a machine (e.g., remote instrument control), or between machines (machine-to-machine - e.g., data transactions). The degree to which the task is interactive is particularly important, as it may determine the levels of tolerance to delay, jitter, etc. Interactive applications will usually pose more stringent requirements than non-interactive ones, because of the promptness of response that is required.
Depending on the number of application users, interactive applications can be further divided into group-to-group and individual-to-individual interactions. Naturally, 'group-to-group' interactions are far more complicated, as they often involve large numbers of participants and teams working together (i.e., multiple sites and multiple participants per site).
The major difference between telepresence and teledata in terms of network QoS can be summarised as follows: the main aim of a telepresence application is to enable an environment for coherent remote communication and collaboration, and also to create the feeling that the participant is in a remote environment rather than in his/her actual physical place. As such, it needs to preserve the main aspects of communication, that is, good interactivity, and the sensory cues must successfully serve the application task. This translates into the need for short latencies that can keep tasks (full-duplex conversation, joint operation of instruments, exploration of data, etc.) in synchronisation. As interactivity poses low latency constraints, jitter needs to be kept under control as well. For non-critical applications, the fidelity of the flows involved can be traded off, as long as the minimum requirements to achieve the task are met. Furthermore, packet loss can have a significant quality-degradation cost, especially since error protection and retransmission techniques are often too time-expensive to use. In the case of interactive teledata applications (remote surgery, remote control of telescopes/microscopes, etc.), interactivity requirements can be similar to or even tighter than those of interactive telepresence.
For non-interactive teledata applications, the constraints on latency and jitter can be more relaxed. One-way delays can be in the order of seconds without compromising the application quality or causing user distraction/annoyance, and receiver de-jittering buffers can allow for comparatively high jitter values. On the other hand, there is an expectation of high-quality media, mainly due to the nature of the application (i.e., music, entertainment video), so the ability of these applications to adapt their transmission rates without sacrificing the expected quality is more restricted. Relaxed latency requirements also mean that sophisticated error correction and retransmission can be used to reclaim data corrupted due to loss.
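The de-jittering buffers mentioned above can be sketched with a simple fixed-playout-delay model (an illustration of the principle only; real players adapt the offset, and the function names are ours): the receiver delays playout of every packet by a constant offset relative to its send time, so packets arriving with variable delay are still played on a regular schedule, while packets arriving after their playout point are treated as lost.

```python
def playout_schedule(send_times, arrival_times, playout_delay):
    """Fixed playout point: packet i is played at send_times[i] + playout_delay.
    Returns (playout_times, late_count); 'late' packets missed their slot."""
    playout, late = [], 0
    for s, a in zip(send_times, arrival_times):
        t = s + playout_delay
        if a > t:        # arrived after its playout point -> effectively lost
            late += 1
        playout.append(t)
    return playout, late
```

The choice of playout delay is the classic trade-off: a larger offset absorbs more jitter (fewer late packets) at the cost of higher end-to-end delay, which is exactly why non-interactive applications, which tolerate seconds of delay, can absorb far more jitter than conversational ones.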
Machine-to-machine tasks. The above-mentioned tasks relate to interactions between humans, or between humans and machines. There are also applications that do not involve any human intervention or interactivity as part of their operation. Such tasks are specific to machine-to-machine applications. Typical scenarios involve the transmission of data among various computers, manipulation and processing of data, creation and transmission of transactional data, distributed computing, exchange of control data, etc. Not all of these computer-to-computer tasks will require high levels of network service. With this type of application, quality requirements are not dictated by human factors, but rather by the exact properties of the application, such as delay sensitivity, the requirement for timely delivery (e.g., time- or safety-critical applications) and the volume of data to be exchanged between remotely located nodes. End-to-end delay and delay variation are the most crucial performance parameters for applications that transfer control data. Throughput is important for applications requiring bulky data transfers.
Figure 2.1 graphically outlines task-based categorisation of applications.
Specific user characteristics influence the quality requirements of an application to a great extent. Some of these factors are:
We can therefore see that the behavioural patterns and expectations of an application's users can provide a solid basis for extracting generic requirements from an application. While user-related behaviour is highly subjective, we can usually recognise the profile of a 'typical' user and design the application service in accordance with the preferences, requirements and expectations of that 'typical' user.
In this section we examine application properties attributed to the nature and transport requirements of the participating media flows.
Elastic vs. inelastic. Elastic applications can tolerate significantly high variations in throughput and delay without their quality being considerably affected; as network performance degrades, application utility degrades gracefully. These are traditional data transfer applications, like file transfer and e-mail, or some HTTP traffic. While long delays and throughput fluctuations may degrade performance, the actual outcome of the data transfer is not affected by unfavourable network conditions. However, certain constraints may arise when these services are considered in the context of advanced applications. Elastic traffic can be further broken down according to its delay and throughput requirements:
Inelastic applications (also called real-time applications) are comparatively intolerant of delay, delay variance, throughput variance and errors, because they usually carry some kind of QoS-sensitive media, like voice or remote control commands. If certain QoS provisions are not in place, the quality may become unacceptable and the application derives no utility. However, depending on the application task and the media types it involves, an application can successfully operate within a range of QoS values. For example, audio and video streaming applications, being only loosely interactive, are not extremely sensitive to delay and jitter.
Tolerant vs. intolerant. Some inelastic applications can tolerate certain levels of QoS degradation (tolerant applications) and can operate with acceptable or satisfactory quality within a range of QoS provisions. A video application can tolerate some packet loss without the impairments becoming considerably annoying to the user. Consequently, tolerant applications can be:
On the other hand, there are applications that fail to accomplish their task sufficiently if their QoS demands are not met. These applications are called intolerant. An example of such an application would be the remote control of mission-critical equipment, such as a robot arm or surgical instruments. Some applications may be able to adapt their rate to instantaneous changes in throughput (rate-adaptive), while others may be totally non-adaptive.
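A rate-adaptive application of the kind described here typically probes for bandwidth while the path is clean and backs off sharply on congestion signals. A minimal additive-increase/multiplicative-decrease (AIMD) sketch follows; the constants and thresholds are purely illustrative, not taken from any standard algorithm:

```python
def adapt_rate(rate_kbps, loss_ratio, floor=64, ceiling=2048,
               increase_kbps=32, decrease_factor=0.5):
    """AIMD-style sending-rate control: grow gently while loss is low,
    halve the rate when loss indicates congestion, clamp to [floor, ceiling]."""
    if loss_ratio > 0.02:            # congestion indication (assumed threshold)
        rate_kbps *= decrease_factor
    else:
        rate_kbps += increase_kbps
    return max(floor, min(ceiling, rate_kbps))
```

The asymmetry (gentle increase, sharp decrease) is what lets many adaptive flows share a bottleneck without driving it into persistent congestion; a non-adaptive application would keep its rate constant regardless of the loss feedback.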
We start our study by observing the behaviour and needs of 'elementary' applications. Some of these are not applications per se, but rather constitute building blocks of applications. However, from the network point of view, they form autonomous modules and have their own transport mechanisms and, as such, pose specific demands on the underlying transport medium. Obviously, it is not feasible to draw specific conclusions on the behaviour and quality characteristics of these elementary applications, because the context of the application within which they are integrated creates new inter-dependencies that cannot be described in such a simplistic way (synchronisation with other media flows, user-specific importance, etc.). Nevertheless, it is worthwhile to look at the characteristics of these applications as a baseline for our analysis. We attempt to analyse these demands in the following paragraphs.
Conversational audio.
Voice communication is still the dominant type of remote human communication. It can be characterised as a foreground, interactive, telepresence application. IP telephony has recently become a competitive alternative to traditional circuit-switched telephony, mainly due to its simplicity, its economic viability and all the added value gained from a computer-aided telephony service. VoIP is probably the first 'advanced' application to make significant inroads into today's Internet. In industry, it is becoming commonplace for voice communication through corporate intranets, and as backbone capacities grow and service differentiation becomes available, telecommunication carriers will most likely rely on the Internet to provide telephone service to geographic locations that today are high-tariff areas. Since interactivity is the main requirement, low end-to-end delay and jitter are very important parameters in maintaining conversational quality, and low packet loss is needed to sustain the audio signal's quality. Resilience to packet loss is also dependent on the specific audio codec used (see section 3.2). The business case for VoIP can be made much more compelling by advancing it to a better experience than traditional telephony (e.g., better fidelity, better integration with messaging and presence, etc.).
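The interplay between codec choice and network load can be made concrete. Each voice frame carried over RTP/UDP/IPv4 pays a 40-byte header tax, so short packetization intervals (good for delay) inflate the on-the-wire bit-rate. A sketch of this arithmetic (header sizes assume IPv4 without RTP header compression; the function name is ours):

```python
def voip_wire_rate_kbps(codec_kbps, packet_ms, header_bytes=40):
    """On-the-wire bit-rate for a voice stream: codec payload per packet
    plus RTP(12) + UDP(8) + IPv4(20) header bytes, sent every packet_ms."""
    payload_bits = codec_kbps * 1000 * (packet_ms / 1000)  # bits per packet
    packets_per_second = 1000 / packet_ms
    return (payload_bits + header_bytes * 8) * packets_per_second / 1000

# G.711 (64 kbps codec) with 20 ms packets costs about 80 kbps on the wire;
# a low-rate 8 kbps codec with the same packetization costs about 24 kbps,
# i.e. the header overhead triples its bandwidth.
```

Halving the packetization interval halves the per-packet delay contribution but doubles the header overhead, which is one reason conversational audio cannot simply minimise delay for free.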
For multi-way voice communication over the Internet, the MBone tools have been successful enablers of one-to-many and many-to-many communication for large groups.
Parenthetically, multicast transmission can be used to transport data between multiple sites efficiently and has proved to be the only scalable solution for large-scale, multi-user or large-audience applications. Such applications include the broadcast of popular live or recorded events, broadcast-style Internet TV, group communication, distribution of data, etc. Despite its efficiency, IP multicast still suffers from insufficient wide-scale adoption by network operators. The important factors that impede the wide deployment of multicast services are that it is still questionable how multicast will interact with traditional IP unicast (specifically, issues of multicast congestion control), fairness issues between unicast and multicast traffic/flows, and support for service differentiation in IP multicast.
High quality audio orchestration. Audio orchestration poses stringent timing requirements on the participating audio flows. Such applications involve distributed, geographically dispersed sources of high quality, multi-channel sound that needs to be orchestrated with tight timings (e.g., a tele-concert) in order to maintain synchronisation of the different sources of audio, and have completely different requirements from the audio streaming case we examined above. Due to the stringent requirements, end-to-end delay and jitter are crucial factors. High quality expectations mean that multi-channel audio streams may have to be transmitted uncompressed, to preserve the original quality, which increases the demand on the network service in terms of sustainable bit-rate. Furthermore, human factors indicate that users are far less tolerant of quality degradations of entertainment sound or music. Thus, besides sustaining the required network bit-rate for the audio stream (e.g., ≥ 128 Kbps for stereo MP3, or > 1.5 Mbps for six-channel AC-3 Dolby sound), packet loss should remain very low if we want to preserve 'audible' quality.
Professional quality audio streaming. High quality means high sampling resolution (16- or 24-bit samples), multi-channel audio (up to 10 channels or more) with CD-equivalent or better quality (e.g., a 96 KHz sampling rate). To maintain the high quality of the original signal (for example in remote music recording or music distribution scenarios), the streams might need to be transmitted uncompressed or losslessly compressed. However, audio streaming may be a completely different 'animal' in terms of application requirements. Since interactivity is not a constraint (for example, a user is willing to wait for a while, even in the order of seconds, before the music starts playing), the application can build up de-jittering buffers and thus tolerate a certain amount of delay and jitter. Furthermore, less sensitivity to delay means that certain error correction algorithms can be employed to increase the robustness of the stream to packet loss. Robustness to loss can also be enhanced by means of retransmission. If such techniques are not employed, then packet loss should be kept at really low levels. This leaves one major requirement for high-quality streaming applications: a sustainable bit-rate. However, adaptivity opportunities exist within these applications. A number of adaptation choices can be taken to restrict the transmission rate of the audio stream, in order to adjust to the available (nominal) network bit-rate - layering, dropping of transmitted channels, transcoding, etc. (see section 3.5.1) - but this entails some degradation in quality.
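The raw bit-rate figures implied above can be checked with simple arithmetic. A minimal sketch, assuming uncompressed PCM and ignoring packet overhead (the channel counts and sample formats are those cited in the text):

```python
# Estimated raw bit-rate for uncompressed multi-channel audio.
# Real streams add packetisation overhead on top of these figures.

def audio_bitrate_bps(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw PCM bit-rate in bits per second, before packet overhead."""
    return channels * sample_rate_hz * bits_per_sample

# CD-quality stereo: 2 ch x 44.1 kHz x 16 bit ~= 1.4 Mbps
cd = audio_bitrate_bps(2, 44_100, 16)

# Professional 10-channel, 96 kHz, 24-bit material ~= 23 Mbps
pro = audio_bitrate_bps(10, 96_000, 24)

print(cd, pro)  # 1411200 23040000
```

This makes clear why lossless or uncompressed professional audio is a sustained-bandwidth problem rather than a latency problem.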
High quality audiovisual conferencing. Until recently, videoconferencing required expensive equipment, specialised room setup and complicated conference control. PC-based conferencing equipment is now an affordable commodity, and H.323-based IP conferencing can be easily set up and maintained. Currently, Internet videoconferencing is restricted to low/modest bit-rates (300-500 Kbps), thus prohibiting a high quality communication experience. If the appropriate network resources are in place, high quality videoconferencing can provide an exciting means of collaboration.
The term "videoconferencing" is loosely used in this section to include all foreground, interactive, tele-presence applications that involve high quality audio and video to enable communication, education, distance learning, entertainment, tele-medicine and collaboration between remotely located individuals. Depending on the exact nature of the conferencing application and what type of media and media encodings are present, the quality requirements may vary accordingly:
Videoconferencing technologies
There are several different modes of videoconferencing spanning all ranges of quality: from quarter-screen images with low-to-modest quality (less than a VCR's) to broadcast-TV quality and higher. It can be used in desktop or room-based environments and it may involve tight (scheduled and controlled join) or loose (join-at-will) conference control. It should be noted that this type of application is not immersive. The user experiences either a head-and-shoulders view of the current speaker (determined by some means of floor control) or a "Hollywood Squares" grid of all participants. Some of the most popular videoconferencing technologies include:
When using videoconferencing technologies, there are several other issues to be considered:
Transmission of stored or live video material is already very popular on today's Internet. However, it is confined to modest quality, low resolutions and restricted frame rates, due to the huge bandwidth requirements that higher quality video imposes. These services are foreground, teledata applications, and as such, will require significantly higher bandwidth, but have more relaxed requirements in terms of latency. The exact nature of the application and the geographical distribution of the users pose extra requirements, as discussed in section 2.1.2 and in the text below.
Video broadcast, streaming, video on demand.
Video streaming falls into two categories: real-time dissemination of live events (broadcasting, or webcasting), and streaming of on-demand material. The former application scenario typically involves the transmission of events, such as news, sports, remote experimental observation (e.g., an eclipse), rocket launches, etc., to significant numbers of viewers. Due to the viewer group sizes involved, such one-to-many applications are in practice best serviced in a scalable fashion by multicast networks. The latter is typified by client applications gaining access (one-to-one) to pre-recorded, stored video material on a remote server, for entertainment, training, education, etc. The common attribute of these services is that they do not, with the exception of start/play/pause actions, involve any high-level interaction or interpersonal communication. This means more relaxed demands for latency; jitter of tens of ms can be alleviated with appropriate buffering algorithms and an initial delay to build de-jittering buffers. As mentioned earlier in the task-based application taxonomy (section 2.1.1), users of teledata applications require high quality; thus, a certain level of sustained bandwidth has to be available to the applications. Furthermore, even though low loss rates are important to keep quality distortions low, application tolerance of latency means that loss protection techniques, such as forward error correction, or re-transmission, can be used to facilitate sustainable high quality even in the presence of higher network packet loss.
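The loss protection idea can be illustrated with the simplest packet-level forward error correction scheme: one XOR parity packet per group of media packets lets the receiver rebuild any single lost packet in the group, at the cost of extra bandwidth and the delay of waiting for the group. This is only a sketch; production codes (e.g. Reed-Solomon) tolerate multiple losses.

```python
# Minimal XOR-parity FEC sketch: k media packets + 1 parity packet.
# A single lost packet in the group can be reconstructed; two or more
# losses cannot. All packets in a group are assumed equal-length.

def make_parity(packets):
    """XOR all packets together into one parity packet."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """Rebuild at most one missing packet (marked None) from the parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("XOR parity can repair only a single loss")
    if missing:
        received[missing[0]] = make_parity(
            [p for p in received if p is not None] + [parity])
    return received

group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = make_parity(group)
lost = [group[0], None, group[2], group[3]]   # packet 1 lost in transit
assert recover(lost, parity)[1] == b"pkt1"
```

The trade-off is visible in the sketch: 25% bandwidth overhead here, and the receiver must buffer a whole group before repair, which streaming tolerates but conversational audio may not.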
High Definition TV. HDTV is an application that might be able to stretch high-speed networks to their limits. HDTV can provide high-resolution (16x9 aspect ratio, 1920x1080 (1080i), at 60Hz interlaced), high-quality moving images at qualities comparable to or better than any contemporary digital equivalent (like DVD). This, combined with high quality surround sound, far surpasses today's TV experience. Depending on the compression algorithm used, HDTV signals can be sent at rates well over 200 Mbps.
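A back-of-the-envelope calculation shows why even 200 Mbps streams are already compressed. This sketch assumes 4:2:2 chroma subsampling at 10 bits per component (20 bits/pixel on average) for the 1080i format cited above; serial studio interfaces add blanking on top of this figure.

```python
# Rough uncompressed bit-rate for 1080i video: 1920x1080 at 60 Hz
# interlaced is 30 full frames per second. 20 bits/pixel assumes
# 10-bit 4:2:2 sampling (an assumption for illustration).

def video_bitrate_bps(width, height, frames_per_sec, bits_per_pixel):
    return width * height * frames_per_sec * bits_per_pixel

rate = video_bitrate_bps(1920, 1080, 30, 20)
print(f"{rate / 1e9:.2f} Gbps")  # 1.24 Gbps
```

So a 200 Mbps HDTV feed already represents roughly 6:1 compression relative to the raw signal.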
HDTV comes in a variety of qualities for different target audiences:
As seen from the above, HDTV is a major potential consumer of network bandwidth. In some preliminary experiments that involved the transmission of 40 Mbps MPEG-2 and 200 Mbps HDCAM/SDTV packet HDTV over an OC-12 Internet2 backbone network (Abilene), it was reported that the 40 Mbps stream, with RAID-style FEC, had minimal latency and could withstand 5-10% loss (depending on the decoder). For the 200 Mbps stream, buffering and retransmission were used for loss resilience, which resulted in a 4-second (start-up) delay and significant resilience to loss (10-15%). Given the huge bandwidth demands of HDTV, multicast transmission seems the only scalable solution if HDTV is to be used for large-scale broadcasting of audiovisual data.
Remote collaboration using traditional forms of media, like audio and video, offers a useful means of communicating. However, advances in 3D graphics rendering techniques and increasing support from more powerful underlying hardware give rise to far more exciting and diverse ways of supporting collaboration, scientific exploration, instrument operation and data visualisation, by supporting continuous media flows, data transfers and data manipulation tools within virtual environment interfaces. Other specialised equipment, such as haptic devices or head-mounted displays, is used to enable manipulation of instruments or navigation within virtual worlds. These collaborative distributed immersive environments pose new requirements on the network in terms of data transfer bit-rates, short one-way delay for real-time operation and reliable transfer of data.
The system developed by the National Tele-immersion Initiative uses 3D real-time acquisition to capture accurate dynamic models of humans. The system combines these models with a static 3D background that has been previously acquired (a 3D model of an office, for example), and it also adds synthetic 3D graphics objects that may not exist at all. These objects may be used as a basis for a collaborative design process that users immersed in the system can work on. The system can also be described as a mix of virtual reality and 3D videoconferencing. A very important part of the system, which also consumes a great deal of processing power, is deriving 3D information from a set of 2D images, a task that the computer vision component of the system deals with. The advantage of this approach is that all humans are represented very accurately - there is no need to model and simulate them, something that we still do not know how to do accurately. There is also no need to have models of humans prepared in advance - their 3D representations are obtained in real-time fashion as they enter the space that is "covered" by the 2D cameras.
The downside of this approach is that 3D acquisition processing still takes a lot of computational power. As a result, we still do not have the real-time frame rate or the resolution that one would like to have in systems that support human-to-human interactions. As processors and cameras become much faster, this will be less of a problem. The applications where this system would be superior are those that require a very accurate representation of particular humans - their appearance and the way they move and gesture. Typical examples are tele-diagnosis and tele-medicine. As for the projection system, the NTII solution uses surfaces that are part of a regular working environment - walls in front of and around the office desk - which removes the need for specially designated rooms and expensive projection constructions. The entire system exhibits large demands in terms of the bandwidth required to send constant streams of 3D data that represent remote participants, as well as minimal delay and constant jitter. As such, this system represents a classic example of a highly demanding, QoS-hungry application in the Internet2 environment.
Tele-immersive systems that use CAVE environments, on the other hand, use avatars (3D graphical models and approximations of human bodies) to represent participants in a tele-immersive session. In order to achieve good results in mimicking humans, the system has to invest a lot of processing power to simulate the human body - its appearance, moves and gestures. This is an extremely tough task if one wants to be as accurate as possible. So far there are no algorithms that do this to a degree that is completely satisfactory for all the intricacies of human-to-human interactions. Nevertheless, this may not be necessary in some applications, and so certain approximations may be good enough. Furthermore, all models of the humans have to be prepared and available in advance, before the session starts. These models may also be very different from the actual person that will be using the system. Still, this effect may be desirable in some applications - an exact "replica" of a particular human may not be the goal of that application at all. As for the projection system, the CAVE installation uses multiple specially designed canvases onto which the imagery is projected. The entire system is still too expensive to be considered for massive deployment. It is important to note that this system exhibits far lower demands in terms of the bandwidth required between the remote users (requirements for minimal delay and constant jitter still exist).
Tele-immersive data exploration. Tele-immersive data exploration combines data queries from distributed databases, use of real-time or near real-time tools to facilitate data exploration, and visualisation of data using immersive environments. It is basically a combination of data mining techniques together with collaborative virtual reality. Such applications will allow users to explore large and complicated data sets and interact with visualised versions of the data in an immersive environment. Furthermore, they will allow collaboration between remotely located users during the data exploration and visualisation process, and real-time data manipulation and data processing.
These kinds of applications pose several requirements on the network and, as they involve a combination of real-time control and sensory data, they have stringent interactivity requirements.
Grid computing involves a plethora of applications, devices, distributed computing contexts, experimentation tools, data manipulation software, large-scale database access and data exploration, services (archiving, security, caching, etc.), VR tools, transmission of continuous media flows and use of heterogeneous network and transmission platforms. Grid traffic includes bulk data transfer, high priority access to remote databases, grid control traffic, exchange of audio and video for communication (videoconferencing) and visualisation purposes, and interactive data visualisation. For this reason, Grid systems are both foreground teledata and telepresence applications with different levels of interactivity. All these diverse traffic types have their own distinct QoS demands, and unless some traffic segregation mechanisms are used, they cannot achieve proper operation.
Grid computing projects introduce a next step in IT computation. One of the most attractive challenges, as identified by the Grid Physics Network project (GriPhyN), is the realisation of the Virtual Data concept, which advocates the creation of a virtual data space consisting of large distributed datasets. Together with the development of global resource management techniques and policies, and security constraints, these constitute computational environments referred to as Petascale Virtual Data Grids (PVDGs).
Megaconference. The Megaconference project involves a series of high-quality videoconferencing meetings based on H.323. It is a "permanent, continuous multipoint H.323 video conference" hosted at Ohio State University and featuring several MCUs scattered around the world. Megaconference events consist of multiple sites collectively engaging in a live demonstration of the capabilities of H.323. Each participating institution/organization had an opportunity to address the conference participants, speak about their deployment of H.323 and showcase H.323 applications at their site. The event in 2001 was the largest H.323 multipoint conference conducted to date, and was simultaneously broadcast on the Internet in MPEG-1, Real and QuickTime formats. The Megaconferences continue to approximately double in size every year. Megaconference I had about 50 sites, Megaconference II about 100 sites, and Megaconference III (2001) around 200 sites and 25 MCUs. They are the largest Internet videoconferencing events and continuously push the state of the art in videoconferencing technologies and networking.
ViDeNet. ViDeNet was created by ViDe to be a testbed and model network in which to develop and promote highly scalable and robust networked video technologies, and to create a seamless global environment for teleconferencing and collaboration. From a technical perspective, ViDeNet is a mesh of interconnected H.323 zones. Each zone represents a collection of users at a site that are administered by the site itself. ViDeNet enables end-users registered with each zone to transparently call each other, thus facilitating seamless use.
ViDe LSVNP. The ViDe Large Scale Video Network Prototype (LSVNP) is a distributed H.323 videoconferencing testbed, funded by the Southeastern Universities Research Association and BBN, the research arm of GTE. Its goal is to explore issues critical to the deployment of seamless networked video, and to accelerate the deployment of H.323 through resolution of large-scale deployment issues. BBN is collaborating with ViDe to utilize the LSVNP to conduct analysis of video traffic patterns. The LSVNP testbed is the first large-scale distributed videoconferencing network. A number of projects are currently being supported with gatekeeping and multipoint services. The projects include applications in marine sciences, veterinary medicine, speech pathology and audiology, training for teachers, architecture, higher education outreach, technical assistance for people with disabilities (deafness), emergency telemedicine, and earthquake research.
VRVS. One MBone implementation example is the Virtual Rooms Videoconferencing System (VRVS) from the California Institute of Technology and CERN, the European particle physics laboratory. With the objective of supporting collaborations within the global high-energy physics community, VRVS has deployed reflectors that allow participation by non multicast-enabled sites. In addition, the VRVS team has developed gateways that allow participation from non-MBone tools such as H.323, QuickTime, and MPEG-2. Most of the software used by the MBone-based environments is freely available, and can be used with low-cost conferencing equipment (desktop cameras, microphones, etc.).
More information on the above can be found in the 'Videoconferencing Cookbook' from the Video Development Initiative .
Music video recording via Internet2. This is an Internet2 initiative. The goal was a multi-location music video recording session using real-time streaming video over Internet2 networks. The participants included: NYU, USC, U Alabama-Birmingham, U Miami and U Georgia School of Music. A summary of the technologies and properties of the project:
The World's First Remote Barbershop Quartet. The goal of this Internet2 initiative (November 1, 2000) was to orchestrate a multi-location barbershop quartet over Internet2 networks. Pieces played were the "Beer Barrel Polka," "In The Good Old Summertime," and "The Internet2 Song". Some of the setup details and lessons from the experiment include:
QoS Enabled Audio Teleportation (Chris Chafe, CCRMA, November 2000). The goal of this project was to stream professional-quality audio to remote destinations using established Internet pathways. The setup involved the conference site in Dallas connected to CCRMA (Stanford) for the SuperComputing 2000 conference. A summary of the project's features is:
Tele-immersion - Office of the Future, CAVEs: collaboration within a virtual environment for the manipulation and visualisation of large amounts of data.
Remote control of large telescopes enables the real-time control of such devices from remote locations using reliable embedded execution, allows the viewing of high resolution image data, and utilises H.323-based videoconferencing to enable human supervision of the data acquisition process.
The Southern Astrophysical Research (SOAR) Telescope project is a 4.2-meter aperture telescope funded by a partnership between the US National Optical Astronomy Observatories (NOAO), the country of Brazil, Michigan State University, and the University of North Carolina at Chapel Hill. The telescope is being designed to support science in both high quality imaging and spectroscopy in the optical and near infrared wavelengths.
Remote access to powerful microscopes involves control data for adjusting the focus of the microscope, while high quality images from the microscope are also transferred over the high-speed network. The possibility of integrating the application into a tele-immersive environment is attractive, as it will give a feeling of presence to the remote user. One such project is the Microscope And Graphic Imaging (MAGIC) Center. The Center provides both local and remote access to optical and electron microscopes to students and faculty on the Hayward campus and other educational institutions, including local community colleges. The mission of the Center is to expand the use of microscope imaging and analysis in science education and research. MAGIC is developing a model for remote access to scientific instruments. This provides a way to share a variety of valuable resources with a world-wide audience. By pooling these resources and providing a common network and user interface to them, science researchers and educators will have capabilities that no one institution could afford. Model software for interactive remote-shared access to an unmodified Philips XL 40 scanning electron microscope (SEM) located within the MAGIC facilities at CSU Hayward is being developed. A wide range of network technologies is being used to control the SEM, including modem, ISDN, Ethernet, T1, and ATM, along with a wide range of image transmission technologies, including closed circuit TV, compressed video over ATM, and digital imaging.
nanoManipulator. The nanoManipulator is a virtual-reality interface to scanned-probe microscopes. Using haptic devices, scientists can virtually view a surface at nanometer scale and control ongoing experiments. What makes such an application more interesting is the use of different computers to handle different modules of the system (the graphics, the haptics, the microscope) communicating over a high-speed Internet; this application is called the tele-nanoManipulator. The distributed users can use tele-immersion and audio and video links to facilitate collaboration.
The GriPhyN project. The Grid Physics Network (GriPhyN) project is primarily focused on achieving IT advances in creating petascale virtual data grid (PVDG) software and technologies to enable distributed collaborative exploration and experimental analysis of data, packaging them in a multi-purpose, domain-independent Virtual Data Toolkit, and using this toolkit to prototype PVDGs. The aim is to provide support to four frontier physics experiments that explore the fundamentals of nature and the universe. These experiments are:
The above experiments offer great challenges for data-intensive applications in terms of timeframe, data volumes and data types, computational and transfer requirements.
The DataGrid project. The DataGrid project (a European-funded Grid project) concentrates on several computationally intensive scientific projects:
In this chapter we discuss the properties of advanced applications in terms of their end-to-end requirements so that they can operate with high (or acceptable) quality. We also provide a review of research and studies of the effects of network performance parameters, such as delay, loss, jitter, etc., on the qualitative behaviour of Internet applications. Because advanced applications are structured as an ensemble of high quality data and media flows, their quality and usability can be assessed by (i) the quality of their individual flows and (ii) the degree to which requirements posed by interactions among the individual flows are satisfied. We follow this approach in studying the QoS requirements of advanced applications, that is, by examining the QoS properties of the individual, elementary flows and the requirements that arise from their relationship within the application scenario.
The transition to network QoS allows the definition of quality metrics that are based on a variety of parameters. However, such QoS models are engineered using network-centric quality parameters (available bandwidth, delay, jitter). Application developers and users, on the other hand, require quality models that are geared towards their needs, and expressed by different performance characteristics, such as response time, predictability, and consistent perceptual quality. These are metrics that define what is called application quality. The term application quality is too vague to be deterministically defined. The reason is that the factors that influence quality are very 'fuzzy'. Such factors include the user's expectations and experience, the task of the application, or whether the application delivers the expected levels of performance. Furthermore, other factors, like charging for the use of the underlying network resources or the service, also influence application quality.
Another, usually underestimated and overlooked aspect of quality is the social behaviour of individual applications in an environment like the Internet. Applications that aggressively try to acquire as many network resources as possible, in a network that does not impose any direct penalties for doing so, may seem to increase their own quality in comparison to applications that are cautious (like those based on TCP transmission). While at first glance this does not seem to directly influence the quality or acceptability of an application, it may have undesirable implications for the health and well-being of the network. Unresponsive and excessive use of resources by applications can lead to severe congestion conditions that will inevitably affect the irresponsible applications as well. So, the ability of an application to adapt to the state of the network may be considered an indirect aspect of its quality. While network operators and designers devise and deploy mechanisms in the network to help prevent such behaviour (direct - pricing, changes in queuing principles (RED); or indirect - ECN), the cooperation of the applications is also required. Applications that can deal with changing network conditions and adapt (or adjust) their resource demands in response to indications of network performance variations are called network-aware. An application can be network-aware only if it is able to change its resource demands by appropriately changing its behaviour to work in different modes (adaptive application, see section 2.1.3).
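The network-aware behaviour described above can be sketched as a simple rate-adaptation loop: the sender increases its target rate additively while loss feedback (e.g. from RTCP receiver reports) stays low, and backs off multiplicatively when loss signals congestion. The constants and the 2% loss threshold below are arbitrary illustrations; real schemes such as TCP congestion control or TFRC are considerably more refined.

```python
# Illustrative additive-increase / multiplicative-decrease (AIMD)
# adaptation of a media sender's target rate, driven by loss feedback.

def adapt_rate(rate_kbps: float, loss_fraction: float,
               increase_kbps: float = 10.0, decrease_factor: float = 0.5,
               floor_kbps: float = 16.0, ceiling_kbps: float = 2000.0) -> float:
    if loss_fraction > 0.02:          # congestion signal: back off sharply
        rate_kbps *= decrease_factor
    else:                             # no congestion: probe gently upward
        rate_kbps += increase_kbps
    return min(max(rate_kbps, floor_kbps), ceiling_kbps)

rate = 512.0
for loss in [0.0, 0.0, 0.05, 0.0]:   # two clean reports, one lossy, one clean
    rate = adapt_rate(rate, loss)
print(rate)  # 512 -> 522 -> 532 -> 266 -> 276.0
```

The multiplicative back-off is what makes such a sender a reasonable network citizen alongside TCP flows.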
We therefore recognise some important features of application behaviour that are tightly coupled to the perception of their quality:
The most important metrics that can characterise the performance of an IP network and are the most significant factors that influence the end-to-end quality of an application are:
The network QoS metrics mentioned above do not necessarily coincide with the application's perception of QoS parameters. Application performance, from the user's perspective, is not concerned with the details of how a network service is implemented and how it performs. Application performance deterioration should be expressed in terms that focus on user-perceivable effects, rather than its origin within the application's end-to-end path. For example, the end-user of an application sees only one latency, and cannot distinguish its cause, whether it comes from the network or from the processing done on the end-system. Thus, from a user perspective, network QoS performance is hidden in end-to-end application-level performance.
Application-level performance metrics include:
It is very important to emphasize here that the above application QoS metrics are not affected only by the network-centric metrics. Several other factors in the end-to-end application path may result in undesired changes of performance parameters, like the operating system's inability to support the application, erroneous application and protocol stack implementations, the usage environment (e.g., faulty equipment), etc. In most of these cases, when people investigate how application quality or utility is affected, it is the application-level and not the network-level performance parameters that they are able to measure, understand the influence of, and subsequently make recommendations about. This should be highlighted to avoid erroneous conclusions and misconceptions by network designers and researchers. As an example, it is known through user trials and experiments that, for videoconferencing applications, interactivity becomes problematic when one-way delay is higher than 300-400 ms. This does not mean that the requirement of the application from the underlying network is for a transmission delay below 300-400 ms. Other delays introduced along the end-to-end path of application data units (hardware capture devices, encoding and buffering delays, scheduling delays from non real-time operating systems) mean that the requirement for network latency is even lower. In fact, it was demonstrated in  that in some cases, during the transmission of high quality HDTV, the limitations of the end-system (PCI cards, network interface cards) rather than the network itself were the main limiting factor (note however that the set of experiments described in  were conducted in conditions of a lightly-loaded network).
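The distinction can be made concrete as a delay budget: the end-to-end interactivity target minus the end-system components leaves what the network itself may consume. The component values below are purely illustrative assumptions, not measured figures.

```python
# End-to-end delay budget: subtract end-system delays from the
# interactivity target to find what remains for network transit.
# All component values here are illustrative only.

def network_delay_budget_ms(target_ms: float, component_delays_ms: dict) -> float:
    return target_ms - sum(component_delays_ms.values())

end_system = {
    "capture": 20, "encode": 40, "packetise": 20,
    "jitter_buffer": 60, "decode": 30, "playout": 20,
}
budget = network_delay_budget_ms(300, end_system)
print(budget)  # 110 ms left for actual network transit out of a 300 ms target
```

With these assumed components, a 300 ms application-level target translates into a network latency requirement of only about 110 ms.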
In the investigation of application QoS requirements that follows in this chapter, recommended values for certain performance parameters obtained through usability tests or other methods refer to the end-to-end application path, and this should be taken into consideration when we wish to derive appropriate network QoS values.
Voice over IP (VoIP) is characterised by stringent requirements regarding end-to-end delay, jitter and packet loss. While VoIP has a requirement for a 'modest' yet sustained bandwidth (64 Kbps and below is enough to accommodate a voice stream; use of more sophisticated encoding algorithms can reduce this down to a few Kbps), bandwidth alone is not sufficient from a user QoS point of view. In this section, we discuss work that quantifies the effects of delay, delay variation and packet loss on interactive IP voice.
Voice quality is very subjective, and can be expressed primarily with respect to the individual user. At a high level, perceived IP voice quality can be affected by mainly three factors:
While the above factors originate from different sources, their individual contributions to the disruption they cause cannot be distinguished by the human user of the application. It is difficult to categorise the effect each of the above individually has on quality; only their joint effect can ultimately be measured. In the following, we study the qualitative effects the end-to-end values of delay, jitter and information loss have on IP voice, and suggest approximate, desired values for these metrics.
|One-way delay||Effect on perceived quality|
|< 100-150 ms||Delay not detectable|
|150-250 ms||Still acceptable quality, but a slight delay or hesitation is noticeable|
|Over 250-300 ms||Unacceptable delay, normal conversation impossible|
|Delay variation (jitter)||Effect on perceived quality|
|< 40 ms||Jitter not detectable|
|40-75 ms||Good quality, but occasional delay or jumble noticeable|
|Over 75 ms||Unacceptable; too much jumble in the conversation|
Delay is introduced at all stages of a VoIP system: the sound-capture device, the encoding and packetisation modules, network transmission, receiver buffering, decoding and playout of the signal. However, a user does not care where latency has been introduced; users require low overall latency in order to experience a truly interactive feel in inter-personal communication. Delay mostly affects conversational quality rather than received voice fidelity. There is a range of opinions on suitable ranges of one-way delay. According to the ITU , users cannot notice any delay below 100-150 ms. Delay between 150 and 300 ms is perceived as a slight hesitation in the response of the conversational partner. Delay above 300 ms is obvious to the users, and conversation may become almost impossible, as each speaker backs further and further off to prevent interruptions. Talker overlap (the problem of one caller stepping over the other talker's speech) becomes significant if the one-way delay grows greater than 250 ms. ITU-T G.114  recommends 150 ms as the maximum desired one-way latency to achieve high-quality voice. Vegesna  suggests a target of 100 ms end-to-end, one-way delay in order to maintain the interactive nature of communication, while Kumar  argues that, in order to maintain full-duplex voice conversation, a one-way delay of less than 300 ms is desirable. Table 3.1 serves as a rule of thumb for the desired ranges of one-way delay.
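The delay stages listed above can be combined into a simple one-way delay budget. The component values in this sketch are illustrative assumptions rather than measurements; the point is that the latency left over for the network is much smaller than the 150 ms end-to-end target.

```python
# One-way delay budget for 'toll quality' voice (ITU-T G.114 target: 150 ms).
# Component values below are illustrative assumptions, not measurements.

budget_ms = 150
components = {
    "capture + encode": 25,   # codec algorithmic + processing delay
    "packetisation":    20,   # one 20 ms frame per packet
    "jitter buffer":    40,   # receiver de-jitter delay
    "decode + playout": 10,
}
network_budget = budget_ms - sum(components.values())
print(f"network latency budget: {network_budget} ms")  # 55 ms
```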
Jitter manifests as a variation in inter-packet arrival times. A jitter buffer can be used to alleviate its effect: incoming packets are buffered and then read out at a nominal rate. Jitter buffer sizes usually adapt to instantaneous network jitter conditions. However, packets arriving very late are either discarded and considered lost (thus causing conversational gaps) or they obstruct the proper reconstruction of voice packets (generating a confusing conversation in which the talking parties may jumble together). Table 3.2 gives an indication of the perceptual effect of jitter.
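The adaptive sizing mentioned above is often done by tracking smoothed estimates of the network delay and of its variation, and setting the playout point a few deviations above the mean so that only a small tail of packets arrives too late. A minimal sketch, with an assumed smoothing factor:

```python
# Sketch of an adaptive playout-delay estimator: EWMA of one-way delay and
# of its variation; playout point set four deviations above the mean.
# Illustrative only; ALPHA is an assumed smoothing constant.

ALPHA = 0.998

class PlayoutEstimator:
    def __init__(self):
        self.d = 0.0  # smoothed one-way delay (ms)
        self.v = 0.0  # smoothed delay variation (ms)

    def update(self, delay_ms):
        self.d = ALPHA * self.d + (1 - ALPHA) * delay_ms
        self.v = ALPHA * self.v + (1 - ALPHA) * abs(delay_ms - self.d)

    def playout_delay(self):
        return self.d + 4 * self.v  # margin of four deviations
```

Packets whose measured delay exceeds `playout_delay()` miss their deadline and are treated as lost, which is exactly the trade-off Table 3.2 quantifies.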
Studying the impact that data loss has on perceived quality is rather difficult, as it depends on a number of factors: the audio codec used, the existence of error protection or correction (e.g., FEC), and the pattern of packet loss itself. For example, loss events in which packets are lost individually are less harmful than loss events that occur in longer bursts. This is because packet repair techniques may recover from isolated lost packets, but may be unable to recover from a lengthy series of consecutive losses. Furthermore, the location of the loss within the bitstream has a strong effect on perceived quality: loss in an unvoiced segment has little impact, while this is not the case if a voiced segment is affected . The effect of packet loss also depends on the packet size. When small packets (20 ms worth of audio) are used, the impact of a lost packet can often be alleviated by simple error concealment techniques at the receiver; this is more difficult to achieve when larger packets (80 ms/160 ms) are transmitted. Table 3.3 shows the effect of packet loss on voice for some of the most commonly used voice codecs. As the table shows, in most cases acceptable (toll) voice quality can be achieved if one-way delay is kept below 150 ms and packet loss below 2%. Resilience to packet loss can be increased with the use of error correction and concealment techniques (FEC, interleaving, retransmission, etc.; see  for a survey), but as a side effect these techniques increase end-to-end delay.
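The distinction between isolated and bursty loss is commonly captured by a two-state Gilbert model: with transition probabilities p (good to bad) and q (bad to good), the average loss rate is p/(p+q) and the mean burst length is 1/q. The sketch below is illustrative; the numeric settings are our own and simply show that one loss rate can hide very different burst patterns.

```python
# Two-state Gilbert loss model: same average loss rate, different burstiness.

def gilbert_stats(p, q):
    loss_rate = p / (p + q)   # long-run fraction of packets lost
    mean_burst = 1 / q        # expected length of a loss burst
    return loss_rate, mean_burst

# Both settings give ~2% loss, but the second loses packets in bursts of 5.
print(gilbert_stats(0.0204, 1.0))
print(gilbert_stats(0.00408, 0.2))
```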
Figure 3.1 shows the distortion (R values) of different speech codecs under various one-way delay and packet loss ratio conditions. An R value is calculated by the E-model, an objective speech quality assessment model (see section 18.104.22.168).
|Codec||One-way delay (ms)||Packet loss (%)||Quality (MOS)|
|G.711 w/o PLC||150||1||3.55|
|G.711 w PLC||150||1||4.31|
|G.711 w/o PLC||150||2||3.05|
|G.711 w PLC||150||2||4.26|
|G.711 w PLC||400||0||3.60|
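The R values mentioned above come from the E-model; a widely used simplified form (after Cole and Rosenberg) maps one-way delay and a codec-specific equipment impairment factor Ie to an R value, and then to an estimated MOS. The sketch below uses that simplified form; taking Ie = 0 for G.711 is an assumption for illustration, and codec-specific Ie values would normally come from published tables.

```python
# Simplified E-model (after Cole and Rosenberg): delay impairment Id plus a
# codec impairment Ie, mapped to MOS via the standard G.107 formula.
# Ie values for specific codecs are assumptions here.

def r_value(delay_ms, ie):
    i_d = 0.024 * delay_ms
    if delay_ms > 177.3:          # second delay term above ~177 ms
        i_d += 0.11 * (delay_ms - 177.3)
    return 94.2 - i_d - ie

def mos(r):
    if r < 0:
        return 1.0
    if r > 100:
        return 4.5
    return 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)

# G.711 (assumed Ie = 0) at 150 ms one-way delay:
r = r_value(150, 0)
print(round(r, 1), round(mos(r), 2))  # 90.6 4.35
```

This loss-free, delay-only sketch reproduces the intuition of Table 3.1: quality stays high up to 150 ms and degrades faster once the second delay term engages.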
Apart from the above factors, further impairments can be caused by the specific choice of audio codec. These impairments are due to the distortion introduced by the codec itself and to the interaction between network effects and the codec. Table 3.4 summarises the most commonly used speech codecs along with three major factors that impact speech coding: bit rate, quality and complexity.
|Codec||Description||Rate (Kbps)||Quality (MOS)||Complexity|
|ITU-T G.711||Pulse Code Modulation for voice frequencies (PCM), 4KHz bandwidth||64||> 4||-|
|ITU-T G.722||SB-ADPCM: Sub-Band Adaptive Differential Pulse Code Modulation; 16 kHz sampling frequency||48-64||3.8||low|
|ITU-T G.723.1||Dual rate speech codec for multimedia applications (MP-MLQ/ACELP)||6.4/5.3||3.9||high|
|ITU-T G.726||Adaptive Differential Pulse Code Modulation (ADPCM)||32||3.8||low|
|ITU-T G.728||4 KHz bandwidth; Low Delay CELP (LD-CELP) G.727H: Variable-Rate LD-CELP||16||3.6||low|
|ITU-T G.729(A/B)||Conjugate Structure Algebraic CELP (CS-ACELP) G.729A: Reduced complexity algorithm; G.729B: Discontinuous Transmission (DTX)||8||3.9||medium|
|ETSI GSM 06.10 (GSM FR)||Full Rate (FR) speech codec (RPE-LTP: Regular Pulse Excitation - Long Term Prediction)||13||3.5||low|
|ETSI GSM 06.20 (GSM HR)||Half Rate (HR) speech codec (VSELP: Vector Sum Excited Linear Prediction)||5.6||3.5||high|
|ETSI GSM 06.60 (GSM EFR)||Enhanced Full Rate (EFR) speech codec (ACELP: Algebraic CELP)||12.2||> 4||high|
|ETSI GSM 06.70 (GSM AMR)||ETSI Adaptive Multi-Rate (AMR) speech codec||4.8-12.2||> 4||high|
|Nokia AMR-WB||Nokia proposal for a wideband Adaptive Multi-Rate (AMR) codec||12.6-23.85||> 4||very high|
As a general remark, in terms of network QoS parameters, bandwidth is not the main issue for voice traffic. The requirement of this class of applications is that the other network parameters, such as end-to-end delay, jitter and loss, be kept within well-defined boundaries.
Experience from the tremendous acceptance of mobile (cellular) telephony has proved that human users are prepared to tolerate transient 'seemingly bad' voice quality when communicating, given that the appropriate incentives are in place (e.g., it is the only means of achieving voice communication, or it is considerably more economical). Nevertheless, it needs to be stressed that the context of use (criticality of the application, quality expectations) remains the most influential factor.
The term 'audio' refers here to any kind of sound signal that is not explicitly 'voice', such as music of all sorts, sound effects, etc. We distinguish between 'voice' and 'audio' transmission because there are several important differences between them. In general, audio signals require higher bandwidth, so a higher transmission bit-rate is needed in principle. Depending on the application task (whether it is interactive or not), delay constraints can be either tight or relaxed, and delay, jitter and packet loss compensation techniques can be used; we shall see when and how this can be achieved. However, as a rule of thumb, we can say that (in contrast to 'voice' signals) humans will always expect high-fidelity sound and will not tolerate quality degradation. This stems from users' expectations: for music to be acceptable, it has to be at least of FM quality, and preferably CD quality or higher, as this is what users are accustomed to.
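The bandwidth gap between voice and audio follows directly from the sampling parameters: uncompressed bit-rate is simply sample rate times sample depth times channel count. A quick arithmetic sketch (helper name is ours):

```python
# Uncompressed PCM bit-rate: sample_rate * bits_per_sample * channels.
# Shows why 'audio' needs far more bandwidth than toll-quality voice.

def pcm_kbps(sample_rate, bits, channels):
    return sample_rate * bits * channels / 1000

print(pcm_kbps(8000, 8, 1))    # telephone voice (G.711): 64.0 kbps
print(pcm_kbps(44100, 16, 2))  # CD quality: 1411.2 kbps (~1.4 Mbps)
```

CD-quality audio is thus more than twenty times the rate of toll-quality voice before any compression is applied.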
The transmission of video offers the opportunity for far more advanced forms of communication, collaboration and entertainment, but it also poses new and considerably higher demands on network resources when compared to other media. As an indication of the diversity of bandwidth requirements, the table below lists typical requirements of popular video formats.
|Video format||Typical bandwidth requirement|
|, Interim format||360 Mbits/sec|
|, SMPTE||270 Mbits/sec|
|Compressed MPEG2 4:2:2||25 - 60 Mbits/sec|
|Broadcast-quality HDTV (MPEG-2)||19.4 Mbits/sec|
|MPEG-4||5Kbits/sec - 4 Mbits/sec|
|H.323 (H.263)||28Kbits/sec - 1 Mbits/sec|
The network QoS requirements of a video flow are primarily determined by its application context. Depending on whether the application is telepresence or teledata and the degree of interactivity it involves, different handling from the network is expected, as discussed in section 2. In this respect, we identify two major modalities of networked video applications:
Interactive video is used in telepresence (e.g., videoconferencing and virtual collaborative environments), distance learning, medical applications (remote surgery) and scientific applications (immersive data exploration, remote control of scientific instruments). For these applications, short latency is an essential requirement if acceptable interactivity is to be maintained. Depending on the usage scenario (technology and codec used) and user expectations, throughput requirements range from a few hundred Kbps (Nx128 Kbps - H.261, H.263) for videoconferencing, with 384-800 Kbps per flow currently used in typical H.323 conferences, through 2-10 Mbps for MPEG-2 based videoconferencing, and up to 19.2 Mbps-1.5 Gbps for quality telepresence. Figure 3.2 presents typical throughput requirements for the most common interactive video services (shown in green to indicate their requirement for short latencies).
Two-way interactivity (human-to-human or human-to-machine) of these applications means that latency requirements are quite stringent. Some general rules of thumb for the requirements of interactive video apply:
22.214.171.124 H.323 videoconferencing. In section 3.2.1 we discussed delay requirements for two-way speech, depicted in Table 3.1. In principle, we would expect that these requirements apply to interactive video (videoconferencing) as well. However, a major obstacle in achieving this is the encoding delay introduced by compression. It is reported in  that in current H.323 codecs the encoding/decoding delay is approximately 240ms. In addition, each in an H.323 configuration may generate additional delay of 120-200ms.
Figures for acceptable data loss rates depend on the application technology. H.323 video is very sensitive to both jitter and packet loss. While some IP audio/video applications can withstand significantly high loss rates (even up to 10-15%) by employing redundancy techniques, to the best of our knowledge H.323 implementations do not currently use these techniques.
Audio-video synchronisation poses an extra requirement for timely delivery of all involved data flows; we examine this issue in more detail in section 126.96.36.199 below.
Video streaming refers to the real-time transport of live or stored video, such as news retrieval, live video transmission, multimedia information broadcasting (web-casting), Pay-TV, video on demand, and network-based studio production. In video streaming applications interactivity is not the main feature, with the exception of VCR-like control functions (pause, stop, etc.). User expectations and application usage scenarios vary widely in terms of desired quality, and so do throughput requirements (Figure 3.2):
The main requirement of users of non-interactive streaming video is sustained high visual quality. Given that an initial delay can be used to build a receiver playout buffer that can absorb quite high jitter values, this translates into bounded data loss rates; as shown in Figure 3.3, users expect data loss to be lower than 2-3%. Today's video streaming is confined to modest-quality video with restricted resolution (image size). In such scenarios, the distortions introduced by encoding at comparatively low bit-rates and by packet loss are usually tolerated by users. However, advanced video services over IP use much higher resolutions, where encoding distortions or lost information are easily spotted. They therefore need sustained high bandwidth and very low packet loss in order to fulfil the expectations of users who have experience with similar non-IP, high-quality video services (digital broadcast TV, satellite, DVD and VCR). As the data loss tolerance is much lower in such services, packet loss needs to be kept to a minimum. Delay requirements are relaxed; there is usually minimal interactivity involved, so the application can afford a startup delay in the order of seconds (up to 10 s for video streaming, or even higher, e.g., for VoD). Such an initial delay can be used to build up receiver buffers that improve the application's resilience to variation in packet inter-arrival times and to transient congestion. Transmission delay still needs to be controlled in order to constrain jitter but, most likely, larger delays can be tolerated for streamed video than in the interactive case. Obviously, the user would expect a relatively fast reaction to control actions (start, stop, pause, etc.), but such responses can be in the order of one second rather than tens of ms.
Such relaxed requirements also apply to delay variation where higher values of jitter, even up to 500ms, are considered acceptable . Efficient transmission scheduling algorithms can be employed to maximise network utilisation and application quality. Furthermore, more advanced packet loss protection mechanisms can be used to reduce the effects of information loss that results from transmission errors and packet loss.
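The startup-delay argument above can be made concrete with a small simulation: a packet generated at time t and scheduled for playout at t plus the startup delay arrives in time as long as its one-way delay stays within the startup margin. The delay distribution and all constants below are synthetic assumptions for illustration.

```python
# How a start-up delay absorbs jitter in streaming: count the fraction of
# packets that arrive after their scheduled playout time. Synthetic data.

import random

def late_fraction(delays_ms, startup_ms):
    base = min(delays_ms)  # earliest delay anchors the playout clock
    return sum(d > base + startup_ms for d in delays_ms) / len(delays_ms)

random.seed(1)
# 50 ms base path delay plus exponential jitter with 30 ms mean (assumed):
delays = [50 + random.expovariate(1 / 30) for _ in range(10000)]
print(late_fraction(delays, 100))   # a small tail of packets is late
print(late_fraction(delays, 1000))  # a 1 s buffer makes lateness negligible
```

The same mechanism explains why interactive video cannot use this trick: a one-second playout offset would by itself exceed the conversational delay budget.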
For both of these types of video transmission, the bandwidth requirements are determined by the type of application and the capacity or nature of the network (Virtual Private Networks (VPNs) or enterprise networks, public Internet, DSL, cable, wireless networks or telephone lines). In turn, the demand for bandwidth can span the whole range from very low bit-rates (a few Kbps) to extremely high transmission rates (e.g., uncompressed HDTV) as shown above. The video quality at the receiving end is a complex function of several factors: the available channel rate (which determines the encoding quality), the frame rate, the resolution of the video image, etc. The end-to-end latency and delay variation influence the application's interactivity and the timely delivery of video data packets for decoding, while packet loss significantly deteriorates the visual quality. The two forms of video tasks mentioned above have different requirements on delay and data loss, and we address these issues in the following section.
The task of specifying the effects of network QoS parameters on the final video quality is a difficult but very interesting undertaking. Bandwidth variability, increased delay, jitter and packet loss deteriorate the perceived quality or fidelity of the received video content and, as a consequence, affect the task of the application. In streaming applications, for example, high levels of distortion and quality fluctuations are very noticeable to the human observer and may cause misunderstanding of a video clip or loss of the content's intelligibility. In collaborative telepresence applications they result in loss of inter-stream synchronisation (e.g., between audio and video) and difficulties in, or even breakdown of, communication and coordination. The task of remote control operations can be severely affected, as the user is responding to video imagery that is out of date due to high latencies or jitter. In the following section, we document the effects that the end-to-end QoS parameters may have on interactive or streaming video. Although we categorise these effects for each parameter individually, it should be noted that they do not affect quality independently; rather, they act in combination or cumulatively, and ultimately only this joint effect is detected by the application user. However, studying the effects of network parameters in isolation is a more tractable approach. The challenging issue for applications is to investigate what adaptation decisions and trade-offs can be employed to reduce this cumulative effect on quality. As an example, in a lossy environment part of the bit-rate available to the video flow may be used to transmit redundant information to recover from lost packets.
High-quality digital video streams are bandwidth-hungry. Consequently, the throughput that a video stream receives predominantly determines its visual quality. We show the relationship between quality and throughput by examining Figure 3.4, which shows the evolution of perceived video quality as a function of the encoding bit-rate. The quality scores were obtained using an implementation of the ITS objective video quality metric  (see section 4.1.2). Objective video quality assessment metrics are computational models that measure the quality of a video sequence in a manner that produces results similar to those obtained by human observers (for more details, refer to section 4.1.2). The quality scores are in the 1 to 5 range. Three 150-frame-long CIF-size (352x288) video sequences were encoded at several bit-rates, ranging from 256 Kbps to 3 Mbps. The video quality is a convex increasing function of the encoding bit-rate. There is an initial range of bandwidth values where the video quality increases substantially. After that, the quality curve saturates; any extra bit-rate offered to the encoder gives only marginal quality gains and is probably an unwise use of bandwidth. Knowledge of the point where the quality graph starts to flatten would be extremely important in determining the throughput requirements for controlled high-quality video. However, this is not straightforward, as several other factors influence the shape and slope of the curve, the complexity of the visual content and the video codec used being the most important ones. The visual complexity represents the amount of spatial detail and motion in the video sequence. This is also shown in Figure 3.4, where the quality is plotted for three video sequences: akiyo and news have relatively low motion, whereas rugby has much higher spatial activity and significant motion.
The rugby sequence lies below the other two, which indicates that more perceived distortion is present at equivalent bit-rates. This shows that the throughput requirement of a video application is highly dynamic, and thus the bandwidth offered by the network needs to be re-adjusted during transmission according to the content's variability. The effect of the particular choice of video codec on quality is also shown in Figure 3.4: the two sets of differently shaped quality plots correspond to the two video encoders used (H.263, left, and MPEG-1, right).
The effect of packet loss on compressed video streams is particularly disruptive because the distortions it causes on the image are typically more annoying to the human viewer than other types of impairment (e.g., encoding artifacts). The reason loss is so detrimental to video flows is that image compression reduces the number of bits required to transmit the stream by removing the redundancy inherent in video data. Removing most of the redundancy by means of compression implies that any loss of compressed data cannot be recovered. The effects of packet loss depend on a number of factors, the most important being: (i) the compression technique used and the selection of the encoding parameters, i.e., the compression ratio, (ii) the data loss rate, (iii) the pattern of loss, and (iv) the data packet size. In order to reduce spatio-temporal redundancies and achieve better compression ratios, most contemporary video codecs use quantisation and inter-frame compression by exploiting motion compensation and estimation. Inter-frame compression means that certain parts of the compressed bit-stream, like those that belong to reference frames (I- and P-frames in MPEG-1/2 or Intra-pictures in H.263), are more 'important' than those that belong to predicted frames (B-frames in MPEG), as they are used as predictors for the decoding of predicted frames. A packet loss in a predicted or difference frame will only affect that particular frame, but a packet loss in a reference frame propagates to all the dependent predicted frames, causing more severe distortion that persists for the duration of the dependent frames. This phenomenon is called propagation of errors. The pattern of loss may also have a different impact: the impact of a lost packet is higher the larger that packet is.
Thus smaller packets, if lost, cause less distortion; on the other hand, smaller packet sizes mean that more packets have to be transmitted, which increases the header and packetisation overhead and decreases the effective throughput.
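The propagation-of-errors effect described above can be illustrated with a toy model of a classic IBBPBB... group of pictures: a hit on a reference frame damages every frame that predicts from it until the next intra frame resynchronises the decoder. The GOP structure and helper below are our own illustrative assumptions.

```python
# Toy model of error propagation in an inter-frame coded stream.
# A lost B frame damages only itself; a lost I or P frame damages all
# frames up to the next intra (I) frame.

def frames_damaged(gop, lost_index):
    if gop[lost_index] == "B":
        return 1  # nothing predicts from a B frame
    damaged = 0
    for f in gop[lost_index:]:
        if f == "I" and damaged > 0:
            break  # next intra frame resynchronises the decoder
        damaged += 1
    return damaged

gop = list("IBBPBBPBBPBB") * 2  # two GOPs of 12 frames (assumed structure)
print(frames_damaged(gop, 1))   # loss in a B frame: 1 frame affected
print(frames_damaged(gop, 3))   # loss in the first P frame: 9 frames affected
print(frames_damaged(gop, 0))   # loss in the I frame: whole GOP (12) affected
```

This asymmetry is why loss-protection schemes often prioritise packets carrying reference-frame data.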
As far as packet loss is concerned, the nature of the video application, whether interactive or one-way streaming, does not seem to have any particular significance in terms of the quality effect on the visual content, although users are somewhat more tolerant of data loss in face-to-face video interaction (videoconferencing). Different types of video application tasks do, however, have completely different opportunities for recovering from losses. Non-interactive applications have an advantage: they can afford longer latencies and may consequently use more sophisticated error recovery techniques. As already mentioned, the use of redundancy (e.g., FEC) or retransmission introduces extra latencies that delay-sensitive flows cannot always withstand, whereas one-way streaming applications can employ these techniques as latency is not a crucial factor.
It is useful to have a measure of how IP packet loss affects video quality, but it is very hard to determine the degree of quality damage. The kind of data packets that are lost is also a crucial factor, as some carry more important information (syntax, motion vectors, blocks that carry more visual information, etc.) than others. Boyce et al.  define a frame error state measure, determined by whether or not a lost packet affects the frame. Although this measure does not give much clue about the perceived quality effect, it shows how packet loss influences frame integrity. They found that even a small packet loss rate results in a much higher frame error rate: using real transmissions of MPEG-1 video, they observed that a 3% packet loss results in a frame error rate of 30%.
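The amplification from packet loss to frame errors follows from simple probability: if a frame is carried in n packets and any lost packet corrupts the frame, the frame error rate under independent loss p is 1 - (1-p)^n. With roughly a dozen packets per frame (an assumption on our part), 3% packet loss already yields about 31% frame errors, in line with the magnitude reported by Boyce et al.

```python
# Frame error rate under independent packet loss: any lost packet in a
# frame corrupts that frame. packets_per_frame = 12 is an assumed value.

def frame_error_rate(p_loss, packets_per_frame):
    return 1 - (1 - p_loss) ** packets_per_frame

print(round(frame_error_rate(0.03, 12), 2))  # 0.31
```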
Particularly interesting is the interaction between throughput and packet loss. As shown in section 184.108.40.206, in the absence of packet loss, the quality of an encoded video stream is an increasing function of the available bit-rate. However, as reported in , when packet loss is present, quality increases up to a certain bit-rate but then starts to smoothly decrease (shown in Figure 3.5). This can be explained as follows: at lower bit-rates, the visual distortion caused by encoding artifacts is dominant, so quality increases with bit-rate. The perceived quality peaks and then starts to drop because the higher the average bit-rate, the greater the number of packets being lost, which, in turn, causes larger visual distortions. So, under loss conditions, there is an optimal average bit-rate, which depends directly on the sequence (content) type . This result shows that, in the presence of packet loss, increasing the bit-rate beyond that point actually deteriorates the quality. It clearly demonstrates the benefits of designing network-friendly applications that adapt themselves and respond to congestion signals.
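The existence of an interior optimum can be reproduced with a deliberately crude model: encoding quality grows with diminishing returns in the bit-rate, while the share of damaged frames grows with it because more packets are at risk. Every constant below is an arbitrary modelling assumption; the sketch only demonstrates the qualitative shape of Figure 3.5, not the measured curves.

```python
# Toy model of the bit-rate / packet-loss interaction: quality peaks at an
# intermediate rate. All constants are arbitrary modelling assumptions.

import math

def quality(rate_kbps, p_loss, pkt_kbits=12.0, fps=25):
    enc = 1 + 3.5 * (1 - math.exp(-rate_kbps / 800))   # saturating encoding gain
    pkts_per_frame = rate_kbps / fps / pkt_kbits       # more rate -> more packets
    frame_err = 1 - (1 - p_loss) ** pkts_per_frame
    return enc * (1 - 0.6 * frame_err)                 # loss penalty

rates = range(200, 3001, 100)
best = max(rates, key=lambda r: quality(r, 0.03))
print(best)  # the optimum sits well below the maximum rate
```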
Several techniques can be used to alleviate the effect of packet loss. These include redundancy (e.g., FEC) and retransmission at the sender, and error concealment techniques at the receiver/decoder . Other researchers have proposed the use of more robust coding techniques for error-prone environments such as the best-effort Internet . In order to avoid the persistence of error blocks throughout several successive frames, such techniques include the use of intra-frame encoding only, or frequent intra-updates of the frame blocks (macroblocks), called intra-refresh . However, nothing comes cheap; these techniques inevitably require more bandwidth for equivalent encoding quality.
Figure 3.3 graphically depicts the performance targets for audio and video applications based on user expectations of QoS performance as far as delay and information loss are concerned. The graph is based on indications and recommendations from Study Group 12 of ITU-T  and on the literature study of the perceived quality of generic applications from the TF-NGN working group within the GÉANT project. While the suggested user tolerance to data loss is below 1% for high-quality audio-video streaming and below 2-3% for two-way interactive audiovisual services, the graph shows that the tolerance of audio and video flows can be increased by the use of error-protection techniques . Such models of user performance expectations, based on the user's end-to-end quality perception, "...provide with some upper and lower boundaries for applications to be perceived acceptable to the user and show how the underlying impairments of information loss and delay can be grouped appropriately..." .
If we want to achieve a natural, 'in-sync' impression in multimedia presentations, it is essential to preserve the temporal relationship between the continuous media flows. Media flows that are 'out-of-sync' are often perceived as artificial, awkward and sometimes annoying . While many systems multiplex the various streams to avoid this phenomenon (e.g., MPEG systems), this is either not always possible, due to the nature of the participating flows, or not desirable, as different media streams are handled by different modules in the system. Avoiding application-level multiplexing of the media streams can sometimes provide more opportunities for applying different transmission, adaptation and error protection strategies to the individual flows. In such cases synchronisation is an additional requirement.
We observe two modes of synchronisation:
Let us examine inter-stream synchronisation, with emphasis on lip-synchronisation. We try to map lip-synchronisation requirements to network QoS metrics by discussing and analysing the human perception results reported by Steinmetz in  and shown in Figure 3.6. The graphs in the figure depict the level of detection of synchronisation errors for different skew values. From the graph it can be seen that most humans do not detect any lag when the absolute skew value is below 80 ms (-80 ms: audio behind video, +80 ms: audio ahead of video). In the areas where the absolute skew exceeds 160 ms, synchronisation errors were reported by almost all subjects; the subjects were distracted by the out-of-sync effect rather than attracted by the viewed content itself. In the 'transient' areas between 80 and 160 ms (absolute skew), detection of synchronisation errors depended on how close the speaker was shown: the closer the speaker, the more easily the errors were detected. In these transient areas, synchronisation errors were more easily detected when audio was ahead of video (steeper slopes of the detection curve in the [+80, +160] ms skew region, gentler slopes in the [-80, -160] ms area). This means that video being ahead of audio is tolerated better than vice versa. Smaller experiments with other languages (Spanish, Italian, French, Swedish) showed similar behaviour. Results were also obtained for different video content (a violinist in a concert as well as a choir) and showed that it did not pose any extra requirements beyond those exposed by the speaker experiment.
Based on the results of perception of synchronisation errors shown in Figure 3.6, Figure 3.7 depicts how different levels of skew were qualified in terms of synchronisation (acceptable, indifferent, or annoying) in the case of the 'shoulder view' (for a complete view of the result graphs refer to ). The upper (envelope) curve shows the portion of subjects that detected any loss of synchronisation. The authors conclude that absolute skews between -80 ms and +80 ms are acceptable to most human viewers, and thus an 80 ms lag between audio and video is an upper threshold for maintaining a synchronised audio-video presentation. Similar thresholds are also reported in .
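The thresholds above translate naturally into a simple classifier that a synchronisation monitor might apply, with skew measured as audio time minus video time in milliseconds (positive meaning audio ahead). The function and labels below are our own illustrative mapping of the cited values.

```python
# Steinmetz's lip-sync thresholds as a classifier: |skew| <= 80 ms is in
# sync, |skew| >= 160 ms is out of sync, in between is a transient region
# where detection depends on the shot. Illustrative mapping only.

def lip_sync_state(skew_ms):
    if abs(skew_ms) <= 80:
        return "in sync"
    if abs(skew_ms) < 160:
        return "transient"
    return "out of sync"

print(lip_sync_state(-60), lip_sync_state(120), lip_sync_state(200))
```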
Iai et al.  report the effects of video and audio delays on quality through a series of subjective tests on conversational video. Their results are reproduced in Figure 3.8, which shows the hatched region where quality is acceptable and the regions where the inequality in speech and video delays is detectable. When the speech delay is shorter than the video delay, the detection threshold appears to be 120 ms; when the speech delay is longer, the detection threshold is about 250 ms. The detection level for the delay difference was defined as the value at which the detection rate was 0.5; this is why they report somewhat higher values than those discussed above. The finding that inequalities in speech and video delays are more easily detected when the speech precedes the video is, however, in agreement with Steinmetz's results .
As a general conclusion, a skew of 80 ms between the audio and video flows in a multimedia presentation is the upper threshold for preserving synchronisation without noticeable effects. There are two sources of network distortion that may result in synchronisation errors: packet loss and, as long as an acceptable end-to-end delay is achieved, jitter. Lost packets result in a presentation gap, as the corresponding audio or video data units are not available in time; when a video frame is lost, a stream lag will occur. Delay variation between the two flows also results in a lag between the presentation of audio and video segments that should normally be displayed simultaneously. This problem can be partially tackled by introducing a short initial delay to allow a de-jittering buffer to be built up. Different jitter on the two flows means that corresponding data units arrive at different times even though they were generated at the same time instant. If the corresponding audio and video segments are appropriately time-stamped upon generation or transmission, then at the receiver side they can be scheduled for display later in the future, so as to allow both segments to be received, assuming that their respective jitter is below a threshold. In interactive video, there is a limit to the amount of initial buffering.
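The timestamp-based scheduling just described can be sketched as follows: both flows are played against a common clock offset by a fixed target delay, so units generated together are presented together as long as each unit's delay stays within the target. The 120 ms allowance and the data shapes are assumptions for illustration.

```python
# Timestamp-driven inter-stream sync: units are (generation_ts_ms,
# arrival_ms). Both flows share one presentation clock offset by a fixed
# target delay; skew appears only when a unit misses its deadline.

TARGET_DELAY = 120  # assumed de-jitter allowance (ms)

def presentation_time(gen_ts):
    return gen_ts + TARGET_DELAY

def skew_at_playout(audio_unit, video_unit):
    a_late = max(0, audio_unit[1] - presentation_time(audio_unit[0]))
    v_late = max(0, video_unit[1] - presentation_time(video_unit[0]))
    return a_late - v_late  # ms; positive = audio lags video

# Same generation instant, different jitter, both within the allowance:
print(skew_at_playout((1000, 1090), (1000, 1115)))  # 0 -> still in sync
```

Only when one flow's jitter exceeds the allowance does a non-zero presentation skew (and a potential lip-sync violation) appear.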
[Study of synchronisation effects with other media types (e.g., remote instrument control) should be documented as well]
This section discusses issues that promote the idea of cooperation between the application and the network in order to provide the necessary network stability and application quality. It highlights the fact that application designers should build applications that are aware of the underlying network conditions and react to them (adaptation). It also promotes the necessity of congestion adaptation in conjunction with any QoS optionally offered by the network. We examine the question: what choices can applications make in order to respond to varying network conditions? We briefly discuss TCP-friendly adaptation for continuous media flows and why it is important, what adaptation mechanisms are available for Internet audio and video, what the proper design and run-time choices for video adaptation are, and how these are influenced by the type of the video task and the specific application requirements.
Recent attempts to implement network QoS within Internet2 have demonstrated the difficulties of employing QoS models that offer hard QoS guarantees. Such techniques require considerable investment in design, testing and planning while, at the same time, it is not clear whether the right demand incentives or market needs are strong enough yet. Development of such services, like the QBone Premium Service [34,83], still lacks support from router vendors and, more importantly, requires dramatic network upgrades with considerable operational and economic costs. In the light of such scalability problems, IP-layer differentiation could focus on "non-elevated" services that can be deployed incrementally and cheaply over the existing Internet. Such services may be those that attempt to divide today's monolithic best-effort service class into multiple "different but equal" service classes that trade off delay or loss. Examples of this approach are Alternative Best Effort (ABE) and the Best Effort Differentiated Service (BEDS). While these services will obviously require changes to router support, they do not exhibit the complexity of the aforementioned QoS approaches (policing, reservation signaling, admission control, etc.) and can be used to offer bounded delay to delay-sensitive applications.
Network QoS is often mistakenly regarded as a panacea in the quest for high-quality networked applications. Much of this perception originates from the assumption that certain applications have very strict requirements and cannot run unless the network grants them the required resources. However, there is evidence that applications are capable of operating within a range of resources available to them. An example is audio and video transmission, which makes up the majority of the so-called "real-time" traffic on the Internet today. The "old view" of QoS was that an application asks the network for what it needs, and the network sets up a reservation and offers the requested guarantees. It is argued that this method of QoS is far too complex and does not scale; furthermore, the new reality of network communications makes end-to-end guarantees hard to achieve. As various technologies evolve, networking conditions become more and more heterogeneous: T1 and E1 lines, ATM circuits, 10BaseT up to Gigabit Ethernet LANs, multicast, not to mention the ever-growing wireless networks that by their nature have very variable capacities, as well as the myriad of different hardware configurations. If applications are to survive in this extremely heterogeneous environment, they have to adapt gracefully to multiple environments. Applications can follow adaptation techniques over several planes to tackle changing network conditions:
Methods of encoding and transmitting video:
This section presents a brief review of the most common adaptation techniques for audio and video streams.
Adapting the transmission rate of a media flow is a technique an application may use to react to changing bandwidth availability or to regulate the number of bits it injects into the network. There are a number of different ways to achieve rate adaptation. Some of these can only be applied when media are pre-encoded and stored for later on-demand transmission, while others apply to both stored and live transmissions.
The typical way to change the transmission rate of audio streams is to switch to another codec, since typical audio codecs each produce a stream at a fixed bit-rate. Layered codecs, especially for encoding music, have also been proposed.
There are numerous opportunities for adapting the output bit-rate of video:
There are many issues involved in layered encoding: (i) what partition method to use, (ii) what the efficiency cost is, (iii) how many layers to use, and/or (iv) how to distribute the bandwidth among the layers. Usually there is a performance penalty (in terms of PSNR) involved in scalable codecs in comparison to one-layer codecs, but the benefits can outweigh this. For example, in multicast congestion control, easy rate-adaptation and unequal error-protection can be more effective.
Variable end-to-end delays can be tackled by the use of a playout buffer. The type of application is the crucial factor here: interactive applications can afford only a limited playout buffer depth, while other applications (live or on-demand streaming) can tolerate a playout buffer of several seconds' worth of depth. This technique can be adopted for both voice and video transmission. The idea of removing jitter is, instead of playing out media data as soon as they arrive, to delay their presentation by placing them into a receiver playout buffer for a certain amount of time, called the playout delay. The playout delay can be:
We should also note that many other variations of adaptive playout algorithms have been proposed in the literature.
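As one concrete example of such an algorithm, a widely cited autoregressive estimator (in the style of Ramjee et al.) tracks a smoothed delay and its variation, and sets the playout delay to the estimate plus a safety margin. The sketch below uses illustrative constants and made-up delay samples; it is an assumption-laden illustration, not a definitive implementation:

```python
# Sketch of an autoregressive adaptive playout estimator: d tracks the
# smoothed one-way delay, v its mean deviation; each talkspurt is played
# out d + 4v after generation. ALPHA is a commonly cited smoothing factor;
# the delay samples below are invented for illustration.

ALPHA = 0.998002

class PlayoutEstimator:
    def __init__(self, initial_delay):
        self.d = initial_delay  # smoothed delay estimate (seconds)
        self.v = 0.0            # smoothed delay-variation estimate

    def update(self, network_delay):
        self.d = ALPHA * self.d + (1 - ALPHA) * network_delay
        self.v = ALPHA * self.v + (1 - ALPHA) * abs(network_delay - self.d)

    def playout_delay(self):
        # Four deviations of safety margin absorb most jitter spikes.
        return self.d + 4 * self.v

est = PlayoutEstimator(initial_delay=0.100)
for sample in (0.100, 0.110, 0.095, 0.140, 0.105):
    est.update(sample)
print(round(est.playout_delay(), 4))   # slightly above 0.1 s
```

Because the estimate adapts slowly (large ALPHA), the playout point is typically moved only at talkspurt boundaries, where a small silence adjustment is imperceptible.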
Packet loss is unavoidable in IP networks. Utilising techniques that increase the tolerance of media flows to loss and ameliorate the effect of packet loss on perceived quality is crucial if the application is to achieve sustained quality. A significant number of approaches to adapting to different packet loss rates and loss patterns at the application layer have been suggested. These methods can operate complementarily to each other at different parts of the application's end-to-end path, utilise different techniques, and fulfil different objectives. It can be concluded that adaptation to packet loss can be achieved either using proactive or indirect solutions that aim at protecting the flow from potential packet loss, or using reactive or direct techniques that do not protect the bitstream from packet loss per se but rather try to reduce the factors that cause packet loss or repair the part of the data that is damaged. Let us examine methods of achieving proactive and reactive adaptation to packet loss and analyse how these techniques are used on audio and video media flows (or both). (Note that another categorisation of packet loss resilience techniques can be made based on whether the technique is used at the sender side of the application or at the receiver.)
Most recent video compression standards provide error resilient modes that are supported by a special bitstream syntax and some coding tools. H.263+ has several error resilience modes at the source coding level, including Slice-Structured Mode, Independent Segment Decoding Mode, and Reference Picture Selection (RPS). MPEG-4 provides several error resilience tools like re-synchronisation markers, data partitioning, and reversible variable length codes (RVLC). A report of H.263+ and MPEG-4 error resilience performance can be found in [139,115,33]. All these tools, except the RPS mode in H.263+, are designed for intra-frame error control. They are most suitable for controlling errors that result from loss of small random data blocks within a video frame. On the other hand, they are not very effective in controlling errors that result from a packet loss in video streaming.
The advantages of media-independent FEC are that it does not rely on the content of the packets, is quite simple to implement, and is not very computationally expensive. The drawbacks are increased decoder complexity and increased encoding delay (which can possibly affect interactivity). Furthermore, it increases the bandwidth requirement of the output bitstream.
In the case of video, media-dependent FEC is often designed for a compressed bitstream and is performed before packetisation. H.263+ specifies an FEC mode for its bitstream.
Unequal FEC can also be used to assign unequal amounts of FEC to the various segments of video data. The technique is called Priority Encoding Transmission (PET) . PET requires the priorities of video segments to be specified.
FEC is not always an ideal solution for video streaming. Recall that FEC increases the output bit-rate, and as packet loss is highly correlated with congestion, it can be counter-productive. Of course, increasing the bit-rate can be avoided if the nominal bit-rate is partitioned between the compressed video data and FEC, but this comes at the expense of lower video quality. As packet loss conditions cannot be forecast, it is difficult to know a priori what the best (optimal) partition is.
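To make the media-independent case concrete, the simplest such scheme sends one XOR parity packet per group of k media packets, allowing any single lost packet in the group to be rebuilt at the cost of 1/k extra bandwidth and the delay of waiting for the whole group. This sketch is illustrative (equal-length packets assumed), not a description of any specific FEC standard:

```python
# Minimal sketch of media-independent FEC: an XOR parity packet computed
# over a group of media packets lets the receiver reconstruct any single
# lost packet in the group. Equal-length packets are assumed for brevity.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    parity = bytes(len(packets[0]))          # all-zero block to start
    for p in packets:
        parity = xor_bytes(parity, p)
    return parity

def recover(received, parity):
    """received: the group with exactly one None (the lost packet)."""
    missing = parity
    for p in received:
        if p is not None:
            missing = xor_bytes(missing, p)
    return missing

group = [b"pkt0", b"pkt1", b"pkt2"]
parity = make_parity(group)
lost = [b"pkt0", None, b"pkt2"]              # packet 1 lost in the network
print(recover(lost, parity))                 # b'pkt1'
```

Two losses in the same group defeat this scheme, which is exactly the trade-off the text describes: stronger protection means more parity bandwidth.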
In the case of audio, error concealment techniques can be divided into three categories:
In this chapter, we discuss recent work focused on the development of tools and methods to measure application quality. Application quality is difficult to define in a straightforward and generic manner. It is a multi-parameter property linked with the nature of the application and the context of its use. In many cases, application quality is synonymous with the human user's degree of satisfaction or pleasure; in others, with the degree to which the application is capable of allowing its user to successfully complete a task.
Application quality has several different interpretations depending on a number of parameters described in chapter 2, and thus quality measures need to be adaptive and flexible enough to reflect these parameters. In general, two different kinds of quality metrics are recognised. The first are those that can be used off-line to gain an in-depth understanding of a specific application's behaviour and quality requirements. Here quality might mean user perception, success in achieving and completing the application's tasks and goals, or clues on how an application needs to be efficiently designed and engineered. In this category, several methods, techniques and disciplines, and combinations of them, may be used:
The second kind of quality metrics are those that can be used to monitor the quality of an application during execution time. Such metrics are equally important, as they can provide invaluable feedback on the application's condition when used in real time. They can also be used to dynamically monitor the underlying transport mechanisms, to adapt, reschedule, re-assign and re-allocate resources, and to spot problematic modules on the application's end-to-end path. Ideally, such metrics, given a snapshot of the network condition and the chosen QoS configuration (QoS service class, SLA agreement or other), should be able to evaluate the application performance in a concise and coherent way.
Developed quality tools and methods that belong to the above classes are not meant to work exclusively, but rather in a complementary way. The background or off-line tools are believed to provide more accurate results because they may use multi-disciplinary techniques and exhaustive analysis of the results. The on-line quality tools, on the other hand, are precious management and quality monitoring mechanisms for real-time supervision of a service. Since time constraints prohibit the use of computationally expensive processing and analysis, they may sometimes fail to provide very accurate results. Nevertheless, this becomes the basis for another important feature of the 'background' tools: they can be used for alignment and calibration of the on-line models during their design and development.
In the following sections we review, where applicable, the models that belong to both these categories, analysing their main features and discussing their relative merits and weaknesses for a number of applications and application components (e.g., video). Where feasible, and where such models are not presently available, we try to present generic requirements and desirable features for the development of new measures.
As video becomes a fundamental part of advanced networking applications, being able to measure its quality is of significant importance at all stages: from the development of new video codecs to the monitoring of transmission quality. Indeed, in order to appreciate and compare the performance of all involved components and devices, methods to assess and quantify quality have been proposed. In general, two classes of methods are available to measure video quality: subjective tests, where human subjects are asked to assess or rank the viewed material, and objective models, which are computational models that measure quality by comparing the original and distorted video sequences. Subjective tests may produce the most accurate ratings, but they require costly and complex setup and viewing conditions and are thus inflexible to use. Objective quality metrics, on the other hand, are based purely on mathematical methods, ranging from quite simplistic yet inaccurate models, like PSNR, to sophisticated ones that exploit models of human visual perception and produce far more reliable results. While objective models of quality appear to be very promising, both methods are considered useful in the process of measuring the quality of video applications (and, in general, multimedia applications). Once successful, standardised objective models of video quality are developed, subjective tests can play a complementary role in evaluation or be used to validate new objective models.
The next sections provide a review of work on subjective and objective video quality assessment, presenting their respective advantages and weak points.
Subjective quality tests aim to capture the user's perception and understanding of quality. As we pointed out earlier, the user's perception of quality is not uni-dimensional and depends on many factors. Besides the quality of the viewed material per se, user perception is also content specific, i.e., it matters whether the video material is interesting and intriguing or not. It has also been recognised that what determines quality depends on the purpose of the interaction and the level of the user's engagement. The extent to which QoS is perceived as degraded depends upon the real-world task that the user is performing. Furthermore, depending on the application, the quality of the accompanying sound is also highly important. For example, it has been shown that subjective quality ratings of the same video sequence are usually higher when accompanied by good quality sound, as this may lower the viewers' ability to detect impairments. In some applications, such as multimedia conferencing, users typically require higher audio quality relative to video quality. Perceived quality also depends on other factors like viewing distance, display size and resolution, as well as lighting conditions [2,71,75]. It is also worthwhile mentioning the distinction between image quality measured by mathematical procedures or computational models (i.e., the degree of distortion or difference between the original and reconstructed images) and the observed quality or image fidelity. It appears that images with higher contrast, or slightly more colourful and saturated images, appeal more to human viewers, even though, according to a strict mathematical interpretation of distortion, they appear distorted in comparison to the originals [31,35].
Subjective quality assessment of video still remains the most reliable means of quantifying user perception. It is also the most efficient method to test the performance of components like video codecs, human vision models and objective quality assessment metrics (section 4.1.2). The procedure involves formal subjective tests where users are asked to rate the quality using a 5-point scale, as shown in Figure 1, with quality ratings ranging from bad to excellent.
ITU-R Recommendation BT.500-10 formalises this procedure by specifying several experimental conditions, such as viewing distance and viewing conditions (room lighting, display features, etc.), selection of subjects and test material, and assessment and data analysis methods. Depending on which contextual factors influencing the user's perception need to be derived, three testing procedures are most commonly used: Double Stimulus Continuous Quality Scale, Double Stimulus Impairment Scale and Single Stimulus Continuous Quality Evaluation.
Double Stimulus Continuous Quality Scale (DSCQS).
In this method of subjective quality assessment, viewers are shown multiple sequence pairs, each consisting of the original, or ``reference'', and the reconstructed, or ``test'', sequence. The sequences are relatively short in duration (8-10 seconds). The reference and test sequences are shown to the user twice in alternating fashion, in random order. Subjects do not know in advance which is the reference sequence and which is the test sequence. They rate the material on a scale ranging from ``bad'' to ``excellent'' (Figure 4.1), with an equivalent numerical scale from 0 to 100. The difference between the two ratings is taken for further analysis. This difference removes some rating uncertainties caused by the material content and the viewers' experience. The DSCQS method is preferred when the quality of the reference and test sequences is similar, since otherwise subjects can easily spot the differences in quality between the two sequences.
Double Stimulus Impairment Scale (DSIS).
In contrast to DSCQS, in this method the reference sequence is always presented before the test sequence, and there is no need for the pair to be shown twice. Rating of impairments is again on a 5-point scale, ranging from ``very annoying'' to ``imperceptible'' (Figure 1). This method is more useful for evaluating clearly visible impairments, such as noticeable artefacts caused by encoding or transmission.
The limitation of using rather short test sequences becomes a problem when we are interested in the evaluation of digital video systems over longer timescales. These systems generate substantial quality variations that may not be uniformly distributed over time. Neither the DSCQS nor the DSIS method was designed for the quality evaluation of video transmission over packet networks, like the Internet, with their non-deterministic behaviour and the bursty nature of encoded video. This means that, from the user's point of view, perceived quality can vary significantly over time. The first problem is that if double stimulus (showing both the reference and test video sequences) is used for longer sequences, the time between comparable moments will be too long for them to be rated accurately. Furthermore, it is known that human memory favours more recent stimuli when the duration of the stimulus is increased from 10 to 30 seconds. In other words, for longer sequences (over 10 seconds or so), the most recent parts of the sequence have a relatively greater contribution to the overall quality impression. This phenomenon, called the recency effect, has long been known in the psychology literature, and it is difficult to quantify in subjective tests. Pearson has discussed several higher-order effects that influence users' quality ratings when assessing video sequences of extended duration. What is needed here is a method able to dynamically capture the user's opinion as the underlying network conditions or visual content complexity change.
Single Stimulus Continuous Quality Evaluation (SSCQE).
In order to capture temporal variations of quality, viewers are shown a longer program (typically of 20-30 minutes' duration). The reference is not presented, and viewers assess the instantaneously perceived quality by continuously adjusting a slider on the DSCQS scale (from ``bad'' to ``excellent''). The slider can be implemented either as a hardware device or as software. Instantaneous quality scores are obtained by periodically sampling the slider value, usually every 1-2 seconds. In this way, differences between alternative transmission configurations can be analysed in a more informative manner. One drawback of this method is that the accuracy of user ratings can be compromised by the cognitive load imposed by the task of moving the slider. Moreover, as program content tends to have a significant impact on SSCQE ratings, it becomes more difficult to compare scores from different test sequences. When a model is required to link instantaneously perceived quality to an overall quality score calculated for the whole sequence, the non-linear influence of good or bad parts within the sequence can be expressed by pooling methods like Minkowski power weighting. Despite its attractive nature, the SSCQE method also exhibits several drawbacks, the most apparent of which is the impact of the recency (memory) effect on users' scale judgements. Momentary changes in quality are therefore quite difficult to track, compromising the stability and reliability of the derived results.
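Minkowski-style pooling can be illustrated as follows; with exponent p > 1, brief severe degradations pull the pooled score down more than a simple average would. The exponent, the 5-point scale and the sample values here are assumptions for illustration only:

```python
# Sketch of Minkowski pooling of instantaneous quality samples into a
# single sequence-level score. Low scores mean poor quality, so the
# *impairment* (best - q) is pooled; p > 1 makes bad moments dominate.

def minkowski_pool(samples, p=2.0, best=5.0):
    impairments = [(best - q) ** p for q in samples]
    pooled = (sum(impairments) / len(impairments)) ** (1.0 / p)
    return best - pooled

steady = [4.0, 4.0, 4.0, 4.0]
bursty = [5.0, 5.0, 5.0, 1.0]   # same mean, one severely degraded moment
print(minkowski_pool(steady), minkowski_pool(bursty))
```

With the samples above, the bursty sequence pools to a lower score (3.0) than the steady one (4.0) despite having the same arithmetic mean, mimicking the disproportionate influence of bad segments on overall opinion.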
We stressed earlier that although subjective procedures for measuring video quality still constitute the most reliable method for gaining in-depth knowledge about the performance of digital video transmission systems, the complicated and costly setup of subjective tests makes this method particularly unattractive for automating the assessment procedure. The involvement of human subjects makes this approach unusable when quality monitoring systems have to be embedded into practical processing systems. For this reason, quality metrics able to produce objectively obtained ratings present an attractive alternative.
Objective quality metrics have been the subject of research for several years. The first models were designed to work on analogue video transmission. However, the recent advent of digital manipulation and transmission of video means that video material is affected in a completely different way, leading to different types of impairments. This has required a new approach in the development of quality metrics, one that considers the impact of encoding and transmission in a digital system.
The simplest form of measuring quality is by calculating the distortion at the pixel level. The peak signal-to-noise ratio (PSNR), which is based on the mean squared error (MSE) between the reference and test sequences, has been used extensively by the image processing community, and due to its simplicity it is still in use. However, although it is a straightforward metric to calculate, it cannot describe distortions perceived by a complex and multi-dimensional system like the human visual system, and thus it fails to give good predictions in many cases. Recent research in image processing has focused on developing metrics that use models of the human visual system, while others exploit properties of the compression mechanism and assess the effect that known encoding artefacts have on perceived quality. In the following sections, we therefore first briefly discuss the various types of distortions in compressed digital video transmission and then present a review of recent work on objective video quality metrics.
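For reference, PSNR over 8-bit samples is computed as 10·log10(255²/MSE). A minimal sketch (the pixel values are invented for illustration):

```python
# How PSNR is computed for 8-bit frames: the mean squared error between
# reference and test pixels, mapped to decibels relative to the peak value.
import math

def psnr(reference, test, peak=255.0):
    assert len(reference) == len(test)
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")      # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)

ref  = [100, 120, 140, 160]
test = [101, 118, 143, 160]      # slightly distorted frame
print(round(psnr(ref, test), 2))  # ~42.69 dB
```

The computation treats every pixel error equally, which is precisely why, as noted above, PSNR cannot capture perceptual effects such as masking or the visibility of structured artefacts.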
End-to-end transmission of video material is subject to two sources of distortion. First, the original content is encoded (in real time or off-line) to reduce its otherwise prohibitive bandwidth requirement; this introduces a first level of distortion caused by the lossy nature of encoding. The compressed bitstream is then transmitted in packets over the network, where delay, delay variation and packet loss make some information unavailable to the decoder, and further impairments occur. In the following paragraphs we describe the most common types of distortions introduced during the encoding and the transmission of digital video.
Encoding Artefacts. Most of the popular encoders rely on block-based transform coding of blocks of pixels, followed by quantisation of the resulting transform coefficients. Quantisation is the main source of encoding distortions, although other encoding parameters also influence the perceived fidelity of the video (like frame dropping). The main types of artefacts in a compressed video sequence are the following:
Some of the above effects are unique to block-based coding, while others are prevalent in other compression algorithms. In wavelet codecs, for example, there are no block-related artefacts, as the transform is applied on the entire image; blurring, however, may become more noticeable.
Transmission Artefacts. An important source of impairments is the transmission of the compressed video bitstream over the packet network. The bitstream is fragmented into a series of packets, which are then sent to the destination. Two different types of impairments are attributed to network behaviour: (i) packet loss, and (ii) end-to-end delay. Packets that are lost are unavailable to the decoder, while excessively delayed packets (high jitter) are worthless to the application. Therefore, both types have the same impact: data unavailability. The impact of such losses depends on the nature of the video encoder and the level of redundancy present in the compressed bitstream (for example, intra-coded bitstreams are more resilient to loss). For MC/DCT codecs, like MPEG, interdependencies of syntax information can cause an undesired effect, where the loss of a macroblock may corrupt subsequent macroblocks until the decoder can re-synchronise. This is seen as blocks of error within the image; the damaged region bears no resemblance to the current scene and usually contrasts greatly with adjacent blocks. Obviously, this has a major impact on perceived quality, usually greater than the effects of coding artefacts. Another problem arises when blocks in subsequent frames are predicted from a corrupted macroblock: they will be damaged as well, causing a temporal propagation of the loss until the next intra-coded macroblock is available.
Quality metrics that rely on models of the human vision system are potentially the most accurate. However, another category of quality metrics, which also utilise vision models but rely on knowledge about the impact of specific compression artefacts and transmission errors, has proved to produce almost equally reliable results. Due to the absence of complex vision models, these metrics allow for more computationally efficient implementations. Their drawback, however, is that they work well only in particular application areas or for particular types of distortion. Objective quality models of this category extract perceptually relevant, quality-related attributes of the signals (features); they measure the distance of the features in the test and reference image, and relate these distances to estimates of perceived quality. The following list represents the most influential work on this class of quality metrics (part of the discussion here is taken from ):
A complete review of work on objective video quality assessment can be found in  and the references therein.
The research efforts to design objective video quality models generated an ongoing standardisation process that involves the specification of the essential features and requirements of objective quality assessment models, like the target video applications and the desired performance characteristics. Candidate models can then be assessed based on the above specifications with the purpose of producing industry standards. We describe this standardisation work in the following.
These proponents were tested over a wide range of test and reference video sequences. A large number of subjective tests were organised by independent labs, strictly adhering to the specifications of the ITU-R BT.500-8 procedure for the DSCQS method of subjective evaluation (see the section on DSCQS). The analysis of the results showed that the performance of most of the proponents was statistically equivalent. Based on this analysis, VQEG concluded that it could not at present propose any model for inclusion in ITU Recommendations, and that further validation is required. However, the effort produced significant insights into the process of designing efficient objective video quality metrics and understanding the limitations of current models. Details on the VQEG work and the evaluated proponents can be found in its 2000 final report. Some of the VQEG future objectives have been presented in .
At the current phase of VQEG work (phase II), the group defined more precise areas of interest to obtain more accurate results than those accomplished in phase I. These are:
In the previous sections we described procedures and methods to measure the quality of digital video. While these methods are very helpful for the evaluation of digital video components (video codecs, monitoring transmission quality, understanding psychophysical aspects of quality), there are still several limitations, especially if one focuses on Internet-based video services. Firstly, although subjective and objective methods can quite accurately measure the impairments of digital video, most of them work on short video sequences (typically around 10 seconds in duration). Clearly, 10-second video sequences are not long enough to exhibit all the kinds of impairments that would occur in a real Internet video application. This problem has been partially tackled by employing continuous assessment techniques ([51,32]) or by using temporal pooling methods such as Minkowski summation on objectively acquired quality scores. Since objective video models that continuously assess quality have to be tuned against the corresponding subjective methods (e.g., SSCQE), they inherit the same problems encountered with the subjective methods, as discussed in the section on SSCQE.
Secondly, the objective models described earlier cannot easily appreciate the impairments caused when digital video material is transmitted over an IP network. This is because the original requirements were for the quality assessment of other transmission systems (like broadcast/cable TV transmission) and not the Internet. Those transmission systems could provide an assured transmission channel, with bounded delay and different types of transmission errors, in contrast to what is experienced when transmitting video over the Internet. These models are designed to successfully track distortions of the visual content, not other types of quality degradation, such as those caused by large delays or extreme jitter. An objective metric of video quality also needs to account for the joint effect of visual distortions (encoding distortions, packet loss) and non-visual distortions (delay and delay variation in the presentation of the video), and be able to measure the quality effects of all the Internet's idiosyncrasies.
Thirdly, the quality judgements produced by these models are made solely for the video part. One may question this practice, since video material (whether interactive or one-way) is rarely used on its own without audio. It is obvious that, if the media are judged jointly for the purpose of quality assessment, the inherent relations between the various media types involved in the application scenario (and especially between audio and video) may substantially alter the quality scores. Certain distortions may become more or less important, or even new requirements may arise (lip-sync). This requirement for synchronisation of the audio and video components of a multimedia presentation was discussed earlier.
The quality of voice transmission over traditional networks, like the PSTN, could be measured quite accurately with simple objective metrics, as specified in Recommendation G.712 (one such over-used metric is the signal-to-noise ratio). However, voice applications over packet-switched networks involve different kinds of technologies, such as low bit-rate codecs, and different transmission characteristics. This renders traditional metrics inadequate for predicting a person's perception of speech clarity. Such metrics cannot account for a person's ability to adapt to missing time or frequency energy caused by encoding, and they may indicate poor quality of the output signal even when its distortions are not actually perceivable. For example, modern speech codecs utilise non-waveform encoding and, as a natural consequence, the input signal is distorted considerably, resulting in a very poor SNR. However, the nature of these codecs is such that the output characteristics of the voice do not cause significant discomfort to the listener (i.e., the encoded voice still sounds good). Metrics are thus required to calculate signal clarity in the same context that users do, so that they can accurately represent and measure voice quality and clarity the way human listeners perceive it. This section presents an overview of the most prominent methods of assessing speech quality. For a good introduction to voice quality metrics please refer to .
As already mentioned earlier, this method uses a large number of human listeners to produce subjective quality scores with statistical confidence (it considers the mean of the obtained quality opinions). The method is applied to both one-way and two-way (conversational) listening scenarios, under a controlled testing environment (method of scoring, properties of the voice samples used in the tests, etc.). The overall quality scores lie on a 1-5 scale34, as shown in Figure 4.1. ITU Recommendation P.800  describes the techniques for performing , while ITU Recommendation P.830  describes the methods for subjective evaluation of speech codecs. However, such subjective quality assessment has several shortcomings:
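The statistical treatment behind a MOS figure can be sketched as follows. The listener panel and its ratings are hypothetical, and the 1.96 factor assumes a normal approximation for a 95% confidence interval around the mean opinion.

```python
import statistics

def mos_with_ci(scores, z=1.96):
    """Mean opinion score and a 95% confidence interval for a listener panel.

    scores: per-listener ratings on the 1 (bad) .. 5 (excellent) scale.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    half_width = z * statistics.stdev(scores) / n ** 0.5
    return mean, (mean - half_width, mean + half_width)

# Hypothetical panel of 24 listeners rating one degraded speech sample.
ratings = [4, 4, 3, 5, 4, 3, 4, 4, 2, 4, 3, 4,
           4, 5, 3, 4, 4, 3, 4, 4, 3, 4, 5, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The width of the interval makes explicit why large panels are needed: halving the uncertainty requires roughly four times as many listeners, which is one of the costs of subjective assessment noted above.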
These drawbacks suggest that objective, computational models are required that can automatically and repeatably estimate the on-going quality of speech, quantifying the subjective clarity and quality of networked voice applications.
This section presents a brief description of several well-known models for predicting the quality of voice signals. We detail their merits and drawbacks, and their suitability in the quest to specify the network needs of IP voice. Note that these models can measure distortions caused by the process of encoding as well as distortions introduced by the transmission of the speech over a transport network.
The speech quality method called  was a result of work at KPN Research in the Netherlands and was subsequently approved by ITU-T Study Group 12 as Recommendation P.861 . The method works on a pair of input and output (distorted) voice signals. The comparison is performed on time segments (frames) in the frequency domain: the method analyses the spectral power densities of the input and output time-frequency components and applies comparisons based on factors of the human hearing process, such as sensitivity to loudness or frequency. The result is a score that measures the perceptual distance between the input and output signals; its practical values range from 0 (representing a perfect, undistorted signal) to 15-20 (high distortion). PSQM can accurately predict the speech clarity of voice signals that have been impacted by any of the following processes :
PSQM cannot, however, produce reliable results when the signal is impacted by factors (among others) that depend on the transmission of the speech over a network, such as:
The PSQM algorithm converts the actual physical signal to a mathematical representation based on the physiological properties of human perception. This is performed by three operations: (i) a Fast Fourier Transform (FFT) is applied to the input and output time-domain signals, transforming them into the frequency domain; (ii) frequency warping and filtering, where the frequency scale is warped to account for human frequency sensitivities, using specific critical bands (PSQM defines 56 critical bands); and (iii) intensity warping (compression), where the intensity scale is warped to a loudness scale to represent human loudness sensitivities35. This process creates a mathematical representation of the acoustic signal based on the physiology of human hearing. In the next step, called cognitive modelling, the preprocessed input and output signals are compared to produce the PSQM score by evaluating the audible errors in the output signal; PSQM disregards inaudible differences between input and output. As mentioned, the cognitive modelling produces a PSQM score that ranges from 0 (perfect quality) up to 15 or higher (very bad).
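The three operations above can be sketched schematically as follows. This is only an illustrative skeleton: the Bark-warping formula is a common Zwicker-style approximation, and the compression exponent and band handling are stand-ins, not the exact P.861 parameters.

```python
import numpy as np

def hz_to_bark(f):
    """Approximate Hertz -> Bark frequency warping (Zwicker-style formula)."""
    return 13 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500) ** 2)

def perceptual_transform(frame, fs=8000, n_bands=56, gamma=0.23):
    """Schematic PSQM-style front end for one time frame:
    (i) FFT to the frequency domain, (ii) warp power onto Bark-spaced
    critical bands, (iii) compress intensity toward a loudness-like scale.
    The 0.23 exponent and band mapping are illustrative, not the P.861 values.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2            # (i) power spectrum
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    edges = np.linspace(0, hz_to_bark(fs / 2), n_bands + 1)
    band_idx = np.digitize(hz_to_bark(freqs), edges) - 1  # (ii) Bark warping
    bands = np.array([spectrum[band_idx == b].sum() for b in range(n_bands)])
    return (bands + 1e-12) ** gamma                       # (iii) compression

def psqm_like_score(reference, degraded, frame_len=256):
    """Average disturbance between perceptually transformed frames:
    0 for identical signals, growing with audible distortion."""
    n_frames = len(reference) // frame_len
    score = 0.0
    for i in range(n_frames):
        sl = slice(i * frame_len, (i + 1) * frame_len)
        score += np.mean(np.abs(perceptual_transform(reference[sl])
                                - perceptual_transform(degraded[sl])))
    return score / n_frames
```

A real implementation would add the cognitive-modelling stage (asymmetric weighting of audible errors); the point of the sketch is the overall shape of the pipeline, in which comparison happens on a perceptually warped representation rather than on raw waveforms.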
While PSQM gained widespread acceptance, it was recognised that a major drawback was its inability to reliably report the impact of distortions generated by the network transport of speech, namely those caused by packet loss and time clipping. In these cases, PSQM would report better quality than human listeners would normally report. PSQM+ is a slight variation of ITU P.861 PSQM that improves its performance and reliability under conditions of packet loss and time clipping. The revised model was subsequently published by the ITU . To recall, PSQM measures the added or subtracted disturbance at each frame (time segment). As additive distortion (added energy) has a larger impact than subtractive distortion (lost energy), the added disturbance is scaled up, producing a higher PSQM score (worse quality), while subtractive disturbance is scaled down, giving a lower score (higher quality). As packet loss and time clipping within a frame cause loss of energy, PSQM produced scores that were far too low. PSQM+ remedies this by applying a further scaling factor that scales the disturbance back up. So, for added energy or low distortion due to the codec, PSQM and PSQM+ produce almost the same results; for large distortions due to lost energy (e.g., packet loss), PSQM+ produces higher scores, correlating better with subjective results.
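The asymmetric scaling described above, together with the PSQM+ correction for frames that lost most of their energy, can be sketched as follows. All weights and thresholds here are illustrative, not the standardised coefficients.

```python
def frame_disturbance(ref_bands, deg_bands,
                      add_weight=2.0, sub_weight=0.5, loss_boost=3.0):
    """Illustrative PSQM/PSQM+-style asymmetric disturbance for one frame.

    Added energy (degraded > reference) is scaled up and subtracted energy
    scaled down, as in PSQM.  The PSQM+ correction then boosts frames that
    lost most of their energy (e.g. a dropped packet), so that silence
    substituted for speech is no longer under-penalised.
    """
    added = sum(max(d - r, 0.0) for r, d in zip(ref_bands, deg_bands))
    subtracted = sum(max(r - d, 0.0) for r, d in zip(ref_bands, deg_bands))
    disturbance = add_weight * added + sub_weight * subtracted

    # PSQM+ correction: a frame that lost most of its energy is scaled up
    # instead of receiving the deceptively low plain-PSQM score.
    if sum(ref_bands) > 0 and sum(deg_bands) < 0.5 * sum(ref_bands):
        disturbance *= loss_boost
    return disturbance

ref = [1.0] * 5                                  # toy per-band frame energies
print(frame_disturbance(ref, ref))               # 0.0: identical frames
print(round(frame_disturbance(ref, [0.1] * 5), 2))  # 6.75: lost energy, boosted
```

Without the `loss_boost` branch, the mostly-silent frame would score only 2.25, i.e., plain PSQM's under-estimate of the damage caused by a lost packet.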
The technique  was developed at the Institute for Telecommunications Sciences and represents an alternative to PSQM for measuring the perceptual distance between two voice signals. The algorithm works on (perceptually modified) input and output signals, and can be summarised in the following steps:
The  [16,98] is a collaborative effort by KPN Research and British Telecommunications, and combines the best features of  and . It is intended to accurately measure the distortions of waveform or non-waveform codecs, transcoding, transmission errors, packet loss, time-clipping, etc. However, it has unknown accuracy for parameters and situations such as delay, background noise, multiple talkers, codecs at bit-rates below 4 kbps, and artificial speech or music as input. PESQ performs two kinds of processes:
The E-model [125,124,126], a recent ITU standard, is a computational model that predicts the subjective quality of networked speech based on the transmission characteristics. It combines the individual impairments that result from both the signal's properties and the characteristics of the transport medium into a single R rating, which ranges from 0 to 100. This rating is then translated into the correlated MOS value. The relationship between R-ratings, speech quality and MOS values is shown in Figure 4.2. The R-rating is a linear combination of the perceived impairments converted to appropriate psycho-acoustic scales. The R value is given by the following formula:

R = Ro - Is - Id - Ie + A
Ro represents the basic signal-to-noise ratio, and Is accounts for impairments such as a too-loud connection and quantisation. Both of these terms concern distortions of the voice signal not owed to the network transmission itself. Id and Ie are the factors that reflect the distortions caused by the network transmission: Id captures the effect of delay, while Ie captures the effect of information loss (that is, loss due to very low bit-rate encoding and packet loss). The value A represents an "advantage factor" that reflects the user's tolerance, or willingness, to accept a certain degree of quality degradation. So, from the network point of view, the Id and Ie factors are the more interesting ones.
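A minimal sketch of the R computation and of the standard ITU-T G.107 conversion from an R rating to an estimated MOS follows. The default Ro and Is values are illustrative figures roughly matching a clean narrowband connection, and the example impairment values are hypothetical.

```python
def e_model_r(ro=94.8, i_s=1.4, i_d=0.0, i_e=0.0, a=0.0):
    """R = Ro - Is - Id - Ie + A.  Default Ro/Is are illustrative values
    for a clean narrowband connection, not mandated constants."""
    return ro - i_s - i_d - i_e + a

def r_to_mos(r):
    """Standard G.107 mapping from an R rating to an estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6

# Hypothetical connection: some codec/loss impairment (Ie = 11)
# and a delay impairment (Id = 5).
r = e_model_r(i_d=5.0, i_e=11.0)
print(f"R = {r:.1f}, estimated MOS = {r_to_mos(r):.2f}")
```

Note that the mapping is deliberately non-linear: a drop of a few R points near the top of the scale barely moves the MOS, whereas the same drop in the 50-70 region is clearly perceived, which matches the quality bands of Figure 4.2.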
The Id factor models the impact of end-to-end (mouth-to-ear, 'm2e') delay on conversational quality. It is further broken down into three terms: Id = Idte(m2e, EL2) + Idle(m2e, EL1) + Idd(m2e), where Idte(m2e, EL2) and Idle(m2e, EL1) reflect the impairment due to talker and listener echo, respectively. EL1 and EL2 represent the echo losses (in dB) at the points of reflection (these values depend on echo cancellation and are infinite if perfect echo cancellation is present). The term Idd(m2e) captures the degree of interactivity loss when the end-to-end delay is large.
Ie is a measure of the distortion caused by the loss of information or user data, whether due to low bit-rate encoding or due to packet loss in the network or in playback buffers. A good overview of the relation between the Ie factor and packet loss can be found in .
This report examines the qualitative and quantitative properties of quality of service requirements exhibited by advanced IP-based applications. We presented a taxonomy of applications based on several attributes: the task they have to perform, the degree of interactivity and the kind of interactions (human-to-human, human-to-machine, machine-to-machine). Furthermore, we discussed the characteristics of the application users, and the ways these factors directly or indirectly influence the perception of application quality. Another taxonomy was presented that examined applications' properties attributed to the nature and transport requisites of the participating media flows and their inherent capability to tolerate the impact of network transmission and adapt to changing conditions of the network.
We argued that the QoS needs of advanced applications can be studied from the viewpoint of the quality characteristics and requirements of their individual elementary data and media flows. We examined the quality requirements and the impact of the network service on these media flows, focusing at this phase on audio and video media flows and the interactions that may occur in applications that use these media (i.e., synchronisation of audio and video). The importance of application adaptation was identified as crucial for the application's quality and the well-being of the underlying network, and this report provided a review of common adaptation policies and decisions that audio-visual applications may utilise to tackle varying levels of QoS.
Moreover, a comprehensive review of procedures and tools that can be used to assess and quantitatively measure the perceived quality of audio and video was presented. There are two major ways of performing quality assessment: subjective, by means of human subjects, and objective, by utilising computational models that achieve high correlations with subjective scores. The respective merits and disadvantages of these methods were portrayed, together with suggestions on what enhancements on top of them are necessary to also account for the impact of Internet transmission.
The majority of this phase of the document deals with the investigation of the quality requirements and idiosyncrasies of audio-visual applications. Whatever the type of application and the kind of interaction involved (whether human-to-human or human-to-machine), it is mainly the degree of interactivity that dictates the requirements in terms of end-to-end service. Users of highly interactive applications, like VoIP or videoconferencing, can only tolerate low, bounded end-to-end delay and jitter. For non-interactive scenarios, like music or video streaming, such requirements are relaxed and higher values of delay and delay variation can be comfortably accommodated. We can also conclude that, for these applications, loss of information above a certain threshold, usually caused by packet loss in the network, causes more severe quality deterioration than delay and jitter, as its effects are directly noticed and perceived by the human perception system. Applications that do not have strict timing requirements may benefit from jitter compensation and error correction techniques that are designed to give the transmitted stream higher immunity to delay and packet loss. We reported a plethora of proposed methods that an application can use, either in isolation or in a cooperative mode, to alleviate the effects of the network transmission.
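As an illustration of one such jitter-compensation technique, the sketch below implements a classic exponentially weighted playout-delay estimator (in the style of Ramjee et al.): the receiver tracks the mean network delay and its variation, and schedules playout a few deviations past the mean. The smoothing factor and the 4x safety margin are conventional, tunable choices, not values prescribed by this report.

```python
class AdaptivePlayout:
    """Adaptive playout-delay estimation for a receiver-side jitter buffer.

    Each arriving packet reports its network delay; the estimator keeps an
    exponentially weighted mean and mean deviation, and returns the playout
    delay to apply (mean plus a safety margin of several deviations)."""

    def __init__(self, alpha=0.998, margin=4.0):
        self.alpha = alpha          # smoothing factor (close to 1 = slow)
        self.margin = margin        # how many deviations past the mean
        self.mean_delay = None
        self.variation = 0.0

    def on_packet(self, delay):
        if self.mean_delay is None:
            self.mean_delay = delay                     # first observation
        else:
            a = self.alpha
            self.mean_delay = a * self.mean_delay + (1 - a) * delay
            self.variation = (a * self.variation
                              + (1 - a) * abs(self.mean_delay - delay))
        return self.mean_delay + self.margin * self.variation
```

A larger margin absorbs more jitter at the cost of added end-to-end delay, which is exactly the interactivity trade-off discussed above: streaming players can afford generous margins, while conversational applications cannot.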
We argue that the idea of a network that can always guarantee the desired or expected performance requested by an application is far from becoming reality in the foreseeable future, for a number of reasons. Most probably, the network will be able to provide some form of service discrimination between the different flows, or some qualitative QoS guarantees. Within an ever more heterogeneous network that can provide only minimal guarantees, part of the burden of QoS must also be placed on the application. The end application provides an ideal decision point, where the most appropriate adaptation techniques can be selected and activated so that the application attains higher utility. We believe that under unfavourable network conditions, applications should have the right incentives to perform 'adaptation'. We examined what these incentives are: non-adaptive application streams increase network congestion and are unfair to the rest of the network flows. This eventually leads to worsening quality and may even cause network congestion collapse. In an ever-evolving heterogeneous Internet, applications that do not adapt are doomed to become 'extinct'. We presented a brief overview of the most common adaptation techniques that have been proposed in the literature. The aim of this document is to strongly encourage the mentality of 'adaptation' among application designers and developers, in order to build applications and services that make the most out of what the network can offer instead of relying entirely on it.
This report is a 'working document' and should not be considered complete. It is expected to be continuously updated with new and more focused analyses of the QoS needs of existing and upcoming advanced applications. It is hoped that the outcome of this document will provide a comprehensive understanding of what 'application QoS' is and how it can be tackled using network- and application-level techniques in a concise and cooperative way.
The help and support of a number of people who made significant contributions to this document by providing feedback, comments and discussions have to be acknowledged. The members of the advisory committee (Amela Sadagic and Ben Teitelbaum (Advanced Network & Services), Jason Leigh (Univ. of Illinois), Magda El Zarki and Haining Liu (Univ. of California, Irvine)) provided continuous feedback and fruitful comments. Victor Reijs (Heanet) provided very good suggestions and comments for improvements. Terry Gray and David Richardson (Washington Univ.) helped with HDTV issues. Stephen Wolf (ITS) gave invaluable feedback on video quality. Hope Shu and Roch Guerin (UPenn) commented on the structure and content of the document. Theodore Pagtzis (UCL) reviewed an early version of the document. Many thanks to Amela Sadagic and Ben Teitelbaum, who have been a continuous source of help, comments and support.
3Some interaction may still remain, e.g., for play, pause, stop, or other audio player-control actions
4Currently, due to limited bandwidth and unpredictable behaviour of the Internet, typical videoconferencing suffers from relatively low video frame-rates, lack of lip-synchronisation (time-lag between the audio and video components), modest quality of audio (e.g., mono) and low video resolution (usually QCIF or CIF) and fidelity
5The platform chosen is Motorola iPAQ handhelds, running Linux and using the rat and vic multimedia software .
6Although MPEG-1 is mostly used for video streaming.
7There are no MPEG-2 decoders for videoconferencing, but only for streaming.
8Camera quality is very important for MPEG-2.
9However, there are still a lot of problems with native network multicast (e.g. pricing, security, scaling to large numbers of groups)
10Reported at the 2000 Internet2 QoS meeting at Houston, see http://www.internet2.edu/qos/houston2000/proceedings/Gray/20000209-QoS2000-Gray.pdf
13For example, bulk data transfer will require big ftp pipes that will compete with other delay and jitter sensitive traffic, like control traffic or continuous media flows.
17See also http://www.internet2.edu/presentations/20010308-I2MM-VidConfDevel&Deploy-Dixon.ppt
21In these experiments, non-specialised hardware was used to show the feasibility of displaying HDTV on commodity desktop PCs
22On the other hand, small packet sizes increase the packet header overhead (RTP, UDP/TCP, IP) incurred. The selection of the audio packet size is a trade-off between the specific audio codec, the impact of a lost packet, and the packet header overhead.
23The encoding quality of the voice signal is usually modest, and can be low at times, depending on location and signal strength. Furthermore, fluctuations in quality, temporary loss of service, and complete loss of service are also quite common.
24Recall, that speech signals can be restricted in a 4KHz bandwidth, with almost no loss
25We here ignore the case of applications that download the video material prior to its display (non-streaming video).
27As mentioned earlier in section 2.1.1, this is due to the higher tolerance of the human eye and ear to data loss.
28Obtained by users viewing a talking person video clip. Three views of the person were used: head view (speaker very close), shoulder view and body view (speaker relatively away) .
29A talkspurt is defined as the interval with voice activity between two silent periods.
32In full reference (FR), both the reference and test sequences are required for the model to operate. In the reduced reference method (RR), a restricted amount of data from the reference is required to obtain the objective quality, where in the no reference method (NR), no reference data is required. The reduced reference and no reference methods decrease the transmission burden when on-line quality monitoring is required.
33Such codecs do not try to recreate the original waveform, but the perceptual properties of the speech signal by attempting to model the vocal tract (like estimation of the pitch)
34To recall, a quality rating of '5' corresponds to excellent quality, '4' to good, '3' to fair, '2' to poor and '1' to bad quality
35Depending on the loudness of the signal, voice distortions can be perceived differently by human listeners.