搜档网
当前位置:搜档网 › SENDER-BASED RATE-DISTORTION OPTIMIZED STREAMING OF 3-D WAVELET VIDEO WITH LOW LATENCY

SENDER-BASED RATE-DISTORTION OPTIMIZED STREAMING OF 3-D WAVELET VIDEO WITH LOW LATENCY

SENDER-BASED RATE-DISTORTION OPTIMIZED STREAMING OF 3-D WAVELET VIDEO WITH LOW LATENCY
SENDER-BASED RATE-DISTORTION OPTIMIZED STREAMING OF 3-D WAVELET VIDEO WITH LOW LATENCY

SENDER-BASED RATE-DISTORTION OPTIMIZED STREAMING

OF3-D WA VELET VIDEO WITH LOW LATENCY

Chuo-Ling Chang,Sangeun Han and Bernd Girod

Information Systems Laboratory,Department of Electrical Engineering,Stanford University

{chuoling,sehan,bgirod}@https://www.sodocs.net/doc/bc9864390.html,

ABSTRACT

We propose a sender-based rate-distortion optimized framework to stream scalable bitstreams of3-D wavelet video stored at the sender to a remote receiver.Based on the requests and feedback from the receiver,the source rate-distortion pro?les,the desired playout latency and transmission rate,and the network charac-teristics,the sender optimizes the responses sent to the receiver throughout the video playout session in order to minimize the dis-tortion in the reconstructed frames.Rate-distortion optimized re-sponse is formulated as a convex optimization problem,and an of?ine-computation approach is proposed to further reduce the complexity at the sender.Experimental results show that the pro-posed sender-based approach outperform the receiver-based ap-proach,and the of?ine-computation approach very closely approx-imates the fully-optimized approach.

1.INTRODUCTION

Many attempts have been made to incorporate motion compensa-tion into the3-D wavelet video coding framework[1][2].Earlier works are somewhat unsatisfactory in terms of the rate-distortion coding performance because the motion vector?eld is severely re-stricted and the temporal transform is usually limited to the two-tap Haar wavelet.Recently,motion-compensated lifting has been pro-posed[3]-[5],which successfully incorporates unrestricted motion compensation into3-D wavelet coding and achieves compression ef?ciency approaching the state-of-the-art predictive video coding schemes.However,despite the increasing interest in3-D wavelet video,ef?cient streaming of such data sets that exploits the rate-distortion performance as well as the inherent support for scalabil-ity is seldom addressed.

In[6],Chou and Miao proposed a framework for rate-distortion optimized packet scheduling of video and audio data. In general,it can be applied to streaming of the3-D wavelet video data set.However,in their framework,the data set has to be as-sembled into packets before the optimization for packet schedul-ing takes place.Therefore,the packet content cannot dynamically adapt to the network characteristics as well as the state of the data previously transmitted to the receiver.Additionally,the optimiza-tion process is formulated as a combinatorial problem which re-quires high complexity to solve.

We have previously proposed a receiver-based rate-distortion optimized framework for streaming of3-D wavelet video with low latency[7].The source rate-distortion pro?les,the desired playout latency and transmission rate,and the network characteristics are 1This work was supported by Grant No.ECS-0225315of the National Science Foundation,Philips Corporation,and the Max Planck Institute.all taken into account to optimize the streaming strategy.Addi-tionally,due to the?ne scalability property of embedded wavelet coef?cient coding,the optimization process is approximated as a convex optimization problem,which can be ef?ciently solved by standard optimization techniques.

In the receiver-based framework,it is the responsibility of the receiver to ef?ciently select the data to be retrieved from the sender.Therefore,the rate-distortion pro?le of the coded data has to be available at the receiver ahead of the video playout session. In addition,the losses and excess delay in the backward channel experienced by the requests issued from the receiver could block the transmission.In this paper,we describe a sender-based frame-work where the sender selects the data to be transmitted.To reduce the computational complexity at the sender,an of?ine-computation approach is proposed.

The remainder of the paper is organized as follows:In Sec-tion2,we brie?y describe the structure of the3-D wavelet video coding scheme adopted in this work.In Section3,we present the proposed sender-based framework for streaming the3-D wavelet video over networks.The of?ine-computation approach is dis-cussed in Section4.Finally,experimental results are presented in Section5.

2.MOTION-COMPENSATED3-D WA VELET

VIDEO CODING

In this work,a3-D wavelet coder using motion-compensated lift-ing is adopted to encode the video sequence[3]-[5].A multi-level temporal discrete wavelet transform(DWT)implemented using motion-compensated lifting is?rst applied across the video frames to decompose them into temporal subbands,followed by a multi-level2-D spatial DWT decomposing the temporal subbands into wavelet coef?cients.The SPIHT(Set Partitioning in Hierarchical Trees)[8]algorithm is?nally applied to encode the wavelet co-ef?cients of each subband into a scalable bitstream.The SPIHT algorithm provides a scalable representation so that different re-construction qualities of the video frames can be obtained by trun-cating the coded bitstreams at different lengths.

To decode a particular video frame,only a few subbands rel-evant to synthesizing the frame need to be reconstructed.The truncated bitstreams of these subbands available at the decoder are decoded into reconstructed wavelet coef?cients by the inverse SPIHT algorithm.Then the inverse2-D spatial DWT is applied to reconstruct the temporal subbands.Finally,the inverse motion-compensated lifting procedure is performed to carry out the inverse temporal DWT in order to reconstruct the video frame.

3.SENDER-BASED RATE-DISTORTION OPTIMIZED

STREAMING

We consider the scenario where the video sequence consisting of N frames,F n,n=0,···,N?1,is encoded as N scalable bit-streams stored at the sender.Each bitstream corresponds to a tem-poral subband S n.In this paper,we do not consider streaming of the motion vectors and assume that they are transmitted to the re-ceiver before the playout session.The viewer at a remote receiver starts the playout session of K frames at time instant t st.One frame is displayed every time duration T d.Note that the frames can be displayed at a different frame rate as encoded,or even in the reverse direction to provide VCR-like functionalities.

A sequence of assigning instants,{a k},is de?ned as a k= t st+kT d where k=0,···,K?1.Given the desired play-out latency,D max,from t st to the time data required for the?rst frame to be displayed are being decoded,a sequence of decoding instants,{d k},is de?ned as d k=a k+D max.At each a k the receiver assigns the index of the frame to be reconstructed at d k, denoted by f(k),and transmits it to the sender in request req k. For instance,if the receiver decides to start the playout session with F8,then f(0)=8is assigned at a0and F8is going to be de-coded at a later instant d0.Along with the assigned frame index, a rich acknowledgement that indicates the indices of the packets currently at the receiver is also contained in req k[9].

Denote the backward trip time req k experienced in the back-ward channel from the receiver to the sender as BT T k.Note that BT T k=∞if req k is lost.We de?ne a sequence of re-sponding instants,{r i},where r0=a0+BT T0and r i=

min{a M

r i +BT T M

r i

,r i?1+M r T d}for i≥1.The factor

M r∈Z+determines the frequency of sending responses relative to the frame rate.At a responding instant r i,we consider a win-dow of frames to be displayed and hence a set of subbands related to reconstructing these frames,denoted as{S n:n∈Z i}.At r i, the sender initiates response res i which contains the data of these associated subbands.As an example,r i can be associated with the most recently assigned frame and the next several frames that either have not yet been assigned or the corresponding requests in-dicating the assigned frame indices have not arrived at the sender, in order to send the data in advance.The frame indices of the not-yet-assigned future frames,as well as those contained in the late or lost requests,can be interpolated or extrapolated from the known assigned frame indices at the sender.

Note that res i is issued when either req M

r i arrives at the

sender or M r T d after the previous response,whichever comes ear-lier.This push-and-pull method ensures that the responses are sent timely and regularly even when the requests experience excess de-lay or losses.

Each response is optimized such that the resulting distortion in the reconstructed video sequence after receiving the response is minimized.It takes T p processing time for the sender to gener-ate res i,and then res i is fragmented into N p responding packets issued at r i+T p through the forward channel of the network.

Both the backward and the forward channel are modelled as an independent time-invariant packet erasure channel with random delays.Each packet is delayed or lost independently from other packets.The cumulative distribution function of the forward trip time(FTT)is denoted as F F T T(τ)=P r{F T T≤τ}.Addi-tionally,we denote the target forward payload transmission rate by C f,measured in bit-per-second.As a result,the payload length

in each response is determined by L= M r T d C f

8 in bytes.

3.1.Packet Buffer and Decoding Buffer

The content of a responding packet can be described by two N-

vectors,s and e.The n-th element of s and e,s n and e n,denote

the preceding and ending position(in bytes)of the bitstream for

subband S n contained in the packet respectively.In other words,

the bitstream segment for S n starts at byte s n+1and ends at byte

e n o

f the bitstream.

Between two adjacent decoding instants,the receiver continu-ously receives packets and keeps them in the packet buffer which

holds up to N pb most recently received packets.In addition,the

decoding buffer stores the bitstreams for the N db most recently re-

ferred subbands.The decoding buffer state can also be described

with an N-vector,b,where the n-th element,b n,denotes the

length(in bytes)of the bitstream for S n in the decoding buffer.

At d k,we repeatedly update the decoding buffer by the packets in the packet buffer in the same order as they were issued from the

sender(according to the packet sequence number).A bitstream

segment in the packet can update the decoding buffer only if it

satis?es all of the following three conditions.First,the segment

belongs to one of the N db most recently referred subbands(in-

cluding those for reconstructing F k).Second,all the bits in the

bitstream preceding the segment are already in the decoding buffer,

i.e.,s n≤b n.Third,the segment extends the bitstream already in

the decoding buffer,i.e.,e n>b n.After d k,the packets are still

kept in the packet buffer since they may be needed in later decod-

ing instants,for instance,for those bitstream segments contained

in the packets that later satisfy the above conditions.

3.2.Rate-Distortion Optimized Response

We de?ne the rate for coding S n as the length of the bitstream

decoded to reconstruct S n.The distortion is de?ned as the mean

squared reconstruction error of the reconstructed S n in the wavelet

coef?cient domain.During SPIHT-encoding for S n,rate and dis-

tortion are recorded whenever there is a change in the distortion

due to the rate increment,resulting in a sequence of(R j n n,D j n n)

pairs with j n∈{0,···,j max

n

}indexing the recording order.The ?rst R0n bytes of a bitstream contain essential header information

for the bitstream to be decodable.Therefore,we assume that these

initial bytes are transmitted separately before the playout session.

Therefore,the viewer can always reconstruct a subband at the low-

est quality when no subsequent bitstream segments are received.

The recorded(R j n n,D j n n)pairs are usually densely located and they approximately form a decreasing convex function that

can be well?tted with a weighted sum of V terms of exponen-

tial functions,resulting in the continuous distortion-rate function

D n(R n):

D n(R n)=

V

v=1

c n,v·exp(?λn,v·R n),c n,v≥0(1)

D n(R n)is a convex function having analytically derivable gradi-

ent and Hessian,which greatly facilitate the optimization process.

Furthermore,only2V parameters instead of a long list of R-D

pairs for each subband now need to be stored at the sender.

To optimize res i,we assume it is given that the?rst?b n bytes of the bitstream for S n are already at the receiver by the time the

responding packets update the decoding buffer.In general,the de-

coding buffer state in the future is a variable.Prediction of the state

will be discussed in Section3.3.At r i,the task of rate-distortion

optimization is to determine the length of the additional bitstream

segment to respond for each S n,i.e.,R n??b n,n∈Z i,such that the resulting distortion in the reconstructed video sequence is minimized.

To associate the distortion in the reconstructed frames to that in the reconstructed wavelet coef?cients,we make several sim-pli?cations and assumptions.We assume that the distortion in a reconstructed temporal subband is proportional to that in its spa-tially decomposed wavelet coef?cients.In addition,by neglecting the effects from motion compensation,the inverse lifting proce-dure applied to synthesize a frame from the temporal subbands is identical to ordinary temporal wavelet synthesis.Therefore,a re-constructed frame can be modelled as a linear combination of the reconstructed subbands.We further assume the reconstruction er-ror in one subband is uncorrelated to that in any other subband,and hence the distortions from different subbands are additive in the reconstructed frame.Based on these assumptions,rate-distortion optimized response at r i that minimizes the expected total distor-tion is equivalent to:

minimize K?1

k=0

F F T T(d k?r i?T p)

n∈Z i

w n f(k)2D n(R n)(2a)

subject to

n∈Z i

(R n??b n)≤L(2b)

?b n ≤R n≤R j max

n

n

,n∈Z i(2c)

where w n f(k)denotes the wavelet synthesis?lter coef?cient asso-ciated to an input S n and an output F f(k),F F T T(d k?r i?T p) denotes the probability that res i arrives at the receiver by d k.The distortion in F f(k0is weighted by this probability since res i con-tributes to the distortion reduction in F f(k)only if it arrives by d k. Note that only a few k’s in(2a)having w n f(k)=0for n∈Z i need to be actually considered.The formulation in(2)consists of a con-vex cost function with linear constraints,thus formulates a convex optimization problem which be solved ef?ciently using standard optimization techniques.

3.3.Decoding Buffer State Prediction

From the rich acknowledgement[9]included in each request that indicates the packets available at the receiver,the sender can reli-ably reconstruct the state of the decoding buffer up to the time the last request received was issued.

To respond the bitstream segment of S n,the sender needs to optimize res i according to the predicted decoding buffer state at a future decoding instant d k(i,n),i.e.,use this prediction as the?b n in(2).d k(i,n)is chosen as the?rst among the decoding instants at which S n needs to be decoded and by which the responding packets are more probable to arrive the receiver than not,i.e.,

k(i,n)=min{k:w n f(k)=0,F F T T(d k?r i?T p)>0.5}(3) Assume there are N u packets that have been previously re-sponded but their arrival at the receiver has not yet been con?rmed by the rich acknowledgments until the last request received re-quest req l.The sender therefore needs to predict the arrival of these packets at the receiver in the time window between a l and d k(i,n).For each of these packets,the probability of arrival can be evaluated using the conditional probability distribution of the forward trip time[7].There are2N u possible arrival combinations of these packets,and each results in a possible decoding buffer state.Since packet delay and losses are assumed to be independent across packets,the probability of each possible decoding buffer state is simply the product of that of the outcome of each packet.

Ideally,optimization should be applied for each possible state at d k(i,n)and the response that results in the minimum expected distortion calculated over all possible arrival combinations is cho-sen.However,observing that the probability distribution of the possible states is highly skewed,the unlikely states can therefore be neglected.For simplicity,we directly choose the most proba-ble decoding buffer state at d k(i,n)as the prediction[7].Conse-quently,a packet is predicted to arrive by d k(i,n)if and only if the probability of arrival is larger than0.5.

Furthermore,we make a simpli?cation that the bitstream seg-ment for S n does not contribute to the distortion reduction in F f(k) if?b n does not match the actual state at a decoding instant d k. Therefore,D n(R n)in(2a)is further weighted by the probability that the prediction?b n actually occurs,at each d k.

4.OFFLINE-COMPUTATION APPROACH

To further reduce the computation required at the sender,we pro-pose an of?ine-computation approach.A set of transmission rates that covers the range of possible target payload transmission rate is de?ned and denoted as{C m:C1

n

and stored at the sender.

During the playout session,instead of solving the actual op-timization problem in(2)online,the variables R n,n∈Z i are determined by the algorithm described in Table1.In words,we

assign R(m)

n

to the subband S n being considered in the current re-sponse and keep increasing the value of m until the target payload length L is?lled up.If the frames are displayed at the same frame rate as encoded(either forward or reverse playback),the alloca-

tion R(m)

n

,n∈Z i minimizes the objective function in(2a)for the incurring amount of data to be transmitted without the latency constraint,since F F T T is substituted with1.If C1,···,C M are densely spaced,the amount of data to be transmitted at a certain C m in each response closely matches the target response length. Therefore,the of?ine-computation approach closely approximates the solution in(2)when the probability that res i arrives the re-ceiver by the decoding instants that use any of the S n,n∈Z i,is close to1.

Table1.Of?ine-Computation Approach

R n←?b n,n∈Z i;L ←0

for m←1,···,M

for z←1,···,|Z i|

n←z-th element in Z i

if R m n>R n

if R m n?R n

L ←L +R m n?R n;R n←R m n

else

R n←R n+L?L ;exit

5.EXPERIMENTAL RESULTS Experimental results are shown for the video sequence Foreman and Mobile,both consisting of288frames in QCIF format.The frames are displayed in the same order and frame rate as en-coded.Both the backward and the forward channel are modelled as an independent time-invariant packet erasure channel with ran-dom delays.The probability density function of the packet delay, given the packet is not lost,is shifted-gamma distributed with shift κ=10ms,meanμ=40ms,and varianceσ2=300ms2, and the packet loss rate is set to2%in both channels.The4-level Haar wavelet transform is adopted for the temporal DWT, resulting in N=288temporal subbands.We choose the pa-rameters N db=16,N pb=200,D max=150,T d=33ms, M r=2,T p=50ms,N p=4,V=5,and only the lumi-nance component is considered throughout the experiments.At a responding instant,we consider the most recently assigned frame and the next10frames that have not yet been assigned but can be inferred(extrapolated)from the previously received requests, and hence the related subbands(Section3).We choose the tar-get forward payload transmission rate(excluding rate for motion vectors)from64up to448kbps.The spacing between C m’s for of?ine-computation is16kbps.

The performance of the sender-based optimized,sender-based of?ine-computation,receiver-based optimized approach[7],and an upper-bound are compared in terms of the actual forward trans-mission rate and the video reconstruction quality in PSNR in Fig.

1.For the two sender-based approaches,the actual forward trans-mission rate is the target payload rate plus the rate of the header indicating the allocation of each subband in the payload,which is typically encoded at a rate less than1%of the payload rate.For the receiver-based approach,this header information is included in the request sent from the receiver in the backward channel and is not considered in Fig.1.The upper-bound is achieved using the optimal source rate allocation for the target payload transmission rate as discussed in Section4.Since the forward channel is an erasure channel the source rate is scaled by1.

For both video sequences,the sender-based approaches out-perform the receiver-based approach,and the of?ine-computation approach closely approximates the optimized approach.In the sender-based approaches,constantly informing the sender about the receiver-buffer state using the rich acknowledgements as well as the push-and-pull method that ensures the responses are sent timely and regularly(Section3)largely mitigate the effects of packet losses and excess delay in the backward channel.There-fore,the inef?ciency in the sender-based approaches compared to the upper-bound mainly results from dealing with the random losses and delay in the forward channel when performing rate-distortion optimized response at the sender.However,in the receiver-based approach,the receiver has to cope with the random-ness in both channels and hence the inferior performance com-pared to the sender-based approaches.

6.CONCLUSIONS

We propose a sender-based rate-distortion optimized framework to stream3-D wavelet-coded video.The source rate-distortion pro-?les,the user request,target playout latency and transmission rate, and network characteristics,are all taken into account to minimize the distortion in the reconstructed frames.An of?ine-computation is also proposed to reduce the computational complexity.Ex-

Fig.1.(top)Foreman(bottom)Mobile periments show that the proposed sender-based approach outper-form the previously proposed receiver-based approach,and the of?ine-computation approach very closely approximates the fully-optimized approach.

7.REFERENCES

[1] D.Taubman and A.Zakhor,“Multi-rate3-D subband coding of

video,”IEEE Trans.Image Processing,vol.3,no.5,pp.572–588, Sept.1994.

[2]J.-R.Ohm,“Three-dimensional subband coding with motion com-

pensation,”IEEE Trans.Image Processing,vol.3,no.5,pp.559–571, Sept.1994.

[3] A.Secker and D.Taubman,“Motion-compensated highly scalable

video compression using an adaptive3D wavelet transform based on lifting,”in Proc.IEEE Int.Conf.on Image Processing2001,Thessa-loniki,Greece,Oct.2001,vol.2,pp.1029–1032.

[4] B.Pesquet-Popescu and V.Bottreau,“Three dimensional lifting

schemes for motion compensated video compression,”in Proc.IEEE Int.Conf.on Acoustics,Speech,and Signal Processing2001,Salt Lake City,UT,USA,May2001,vol.3,pp.1793–1796.

[5]L.Luo,J.Li,S.Li,et al.,“Motion compensated lifting wavelet and its

application in video coding,”in Proc.IEEE Int.Conf.on Multimedia and Expo2001,Tokyo,Japan,Aug.2001,pp.481–484.

[6]P.A.Chou and Z.Miao,“Rate-distortion optimized streaming of pack-

etized media,”(submitted)IEEE Trans.Multimedia.

[7] C.-L.Chang,S.Han,and B.Girod,“Rate-distortion optimized stream-

ing for3-D wavelet video,”in(to appear)Intl.Conf.on Image Pro-cessing2004,Singapore,Oct.2004.

[8] A.Said and W.A.Pearlman,“A new fast and ef?cient image codec

based on Set Partitioning in Hierarchical Trees,”IEEE Trans.Circuits Syst.Video Technol.,vol.6,pp.243–250,June1996.

[9]J.Chakareski and B.Girod,“Rate-distortion optimized video stream-

ing with rich acknowledgments,”in Proc.SPIE Visual Communica-tions and Image Processing2004,San Jose,CA,USA,Jan.2004,pp.

661–668.

相关主题