
Faster Heuristic Search Algorithms for Planning with Uncertainty and Full Feedback

Blai Bonet

Computer Science Department
University of California, Los Angeles
Los Angeles, CA 90024, USA
bonet@cs.ucla.edu

Héctor Geffner

Departamento de Tecnología
ICREA – Universitat Pompeu Fabra
Barcelona 08003, España
hector.geffner@tecn.upf.es

Abstract

Recent algorithms like RTDP and LAO* combine the strength of Heuristic Search (HS) and Dynamic Programming (DP) methods by exploiting knowledge of the initial state and an admissible heuristic function for producing optimal policies without evaluating the entire space. In this paper, we introduce and analyze three new HS/DP algorithms. The first is a general algorithm schema: a simple loop in which 'inconsistent' reachable states (i.e., states with residuals greater than a given ε) are found and updated until no such states are left; it serves to make explicit the basic idea underlying HS/DP algorithms, leaving other commitments aside. The second algorithm builds on the first and adds a labeling mechanism for detecting solved states based on Tarjan's strongly-connected components procedure, and is very competitive with existing approaches. The third algorithm approximates the second by enforcing the consistency of the value function over the 'likely' reachable states only, and leads to great time and memory savings, with little apparent loss in quality, when transitions have probabilities that differ greatly in value.

1 Introduction

Heuristic search algorithms have been successfully used for computing optimal solutions to large deterministic problems (e.g., [Korf, 1997]). In the presence of non-determinism and feedback, however, solutions are not action sequences but policies, and while such policies can be characterized and computed by dynamic programming (DP) methods [Bellman, 1957; Howard, 1960], DP methods take all states into account and thus cannot scale up to large problems [Boutilier et al., 1999]. Recent algorithms like RTDP [Barto et al., 1995] and LAO* [Hansen and Zilberstein, 2001] combine the strength of Heuristic Search and Dynamic Programming methods by exploiting knowledge of the initial state s0 and an admissible heuristic function (lower bound) for computing optimal policies without having to evaluate the entire space.

In this paper we aim to contribute to the theory and practice of Heuristic Search/Dynamic Programming methods by formulating and analyzing three new HS/DP algorithms. The first algorithm, called FIND-and-REVISE, is a general schema that comprises a loop in which states reachable from s0 and the greedy policy that have Bellman residuals greater than a given ε (we call them 'inconsistent' states) are found and updated until none is left. FIND-and-REVISE makes explicit the basic idea underlying HS/DP algorithms, including RTDP and LAO*. We prove the convergence, complexity, and optimality of FIND-and-REVISE, and introduce the second algorithm, HDP, that builds on it and adds a labeling mechanism for detecting solved states based on Tarjan's efficient strongly-connected components procedure [Tarjan, 1972]. A state is solved when it reaches only 'consistent' states, and solved states are skipped in all future searches. HDP terminates when the initial state is solved. HDP inherits the convergence and optimality properties of the general FIND-and-REVISE schema and is strongly competitive with existing approaches. The third algorithm, HDP(i), is like HDP, except that while HDP computes a value function by enforcing its consistency over all reachable states (i.e., reachable from s0 and the greedy policy), HDP(i) enforces consistency over the 'likely' reachable states only. We show that this approximation, suitably formalized, can lead to great savings in time and memory, with little apparent loss in quality, when transitions have probabilities that differ greatly in value.

Our motivation is twofold: to gain a better understanding of HS/DP methods for planning with uncertainty, and to develop more effective HS/DP algorithms for both optimal and approximate planning.

2 Preliminaries

2.1 Model

We model non-deterministic planning problems with full feedback with state models that differ from those used in the classical setting in two ways: first, state transitions become probabilistic; second, states are fully observable. The resulting models are known as Markov Decision Processes (MDPs), and more specifically as Stochastic Shortest-Path Problems [Bertsekas, 1995], and they are given by:

M1. a discrete and finite state space S,
M2. an initial state s0 ∈ S,
M3. a set G ⊆ S of goal states,
M4. sets A(s) of actions applicable in each state s ∈ S,
M5. transition probabilities P_a(s'|s) of reaching state s' after applying action a ∈ A(s) in s,


M6. positive action costs c(a, s) > 0, and

M7. fully observable states.

Due to the presence of full feedback (M7) and the standard Markovian assumptions, the solution of an MDP takes the form of a function π mapping states s into actions a ∈ A(s). Such a function is called a policy. A policy π assigns a probability to every state trajectory s0, s1, s2, ... starting in state s0, which is given by the product of the transition probabilities P_{a_i}(s_{i+1}|s_i) where a_i = π(s_i). If we further assume that actions in goal states have no costs and produce no changes (i.e., c(a, s) = 0 and P_a(s|s) = 1 if s ∈ G), the expected cost associated with a policy π starting in state s0 is given by the weighted average of the probability of such trajectories times their cost. An optimal solution is a policy π* that has minimum expected cost for all possible initial states. An optimal solution is guaranteed to exist provided the following assumption holds [Bertsekas, 1995]:

M8. the goal is reachable from every state with non-zero probability.

Since the initial state s0 of the system is fixed, there is in principle no need to compute a complete policy but a partial policy prescribing the action to take in the states that can be reached from s0 by following the policy. Traditional dynamic programming methods like value or policy iteration compute complete policies, while recent heuristic search DP methods like RTDP and LAO* compute partial policies. They achieve this by means of suitable heuristic functions h(s) that provide admissible estimates (lower bounds) of the expected cost to reach the goal from any state s.
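A minimal C++ sketch of such a state model is given below, only to fix notation for the sketches that follow; the struct and field names (SSP, Outcome, etc.) are illustrative and not taken from the authors' implementation.

   // Illustrative C++ interface for the models M1-M7 (M8 is an assumption on the model).
   #include <cstddef>
   #include <vector>

   using State  = int;    // M1: states indexed 0 .. num_states-1
   using Action = int;    // index into the list of actions applicable at a state

   struct Outcome {       // one probabilistic transition (M5)
     State  next;
     double prob;
   };

   struct SSP {
     std::size_t num_states;                                  // M1
     State       init;                                        // M2: the initial state s0
     std::vector<bool> goal;                                  // M3: goal[s] is true for goal states
     std::vector<std::vector<Action>> actions;                // M4: indices of the actions in A(s)
     std::vector<std::vector<std::vector<Outcome>>> outcomes; // M5: outcomes[s][a] with probabilities
     std::vector<std::vector<double>> cost;                   // M6: cost[s][a] = c(a, s) > 0
     // M7 (full observability) is reflected in policies being functions of the current state.
   };

   // A partial policy only needs entries for the states reachable from init under the policy.
   using Policy = std::vector<Action>;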

2.2 Dynamic Programming

Any heuristic or value function V defines a greedy policy π_V:

    π_V(s) = argmin_{a ∈ A(s)} [ c(a, s) + Σ_{s' ∈ S} P_a(s'|s) V(s') ]        (1)

where the expected cost from the resulting states s' is assumed to be given by V(s'). We call π_V(s) the greedy action in s for the value function V. If we denote the optimal (expected) cost from a state s to the goal by V*(s), it is well known that the greedy policy π_V is optimal when V is the optimal cost function, i.e. V = V*.

While, due to the possible presence of ties in (1), the greedy policy is not unique, we will assume throughout the paper that these ties are broken systematically using a static ordering on actions. As a result, every value function V defines a unique greedy policy π_V, and the optimal cost function V* defines a unique optimal policy. We define the relevant states as the states that are reachable from s0 using this optimal policy; they constitute a minimal set of states over which the optimal value function needs to be defined.
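Equation (1) can be read directly as code. The sketch below assumes the illustrative SSP struct above and stores the value function V as a vector indexed by state; ties are broken by the fixed order in which actions are listed in A(s), matching the static ordering just mentioned.

   #include <limits>
   #include <vector>

   // Q_V(a, s) = c(a, s) + sum_{s'} P_a(s'|s) V(s')
   double QValue(const SSP& m, const std::vector<double>& V, State s, Action a) {
     double q = m.cost[s][a];
     for (const Outcome& o : m.outcomes[s][a])
       q += o.prob * V[o.next];
     return q;
   }

   // Greedy action of equation (1); the strict '<' keeps the first minimizer,
   // so every value function V defines a unique greedy policy.
   Action GreedyAction(const SSP& m, const std::vector<double>& V, State s) {
     Action best = m.actions[s].front();            // A(s) assumed non-empty (M4, M8)
     double best_q = std::numeric_limits<double>::infinity();
     for (Action a : m.actions[s]) {
       double q = QValue(m, V, s, a);
       if (q < best_q) { best_q = q; best = a; }
     }
     return best;
   }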

Value iteration (VI) is a standard dynamic programming method for solving MDPs and is based on computing the optimal cost function V* and plugging it into the greedy policy (1). This optimal cost function is the only solution to the fixed-point Bellman equation

    V(s) = min_{a ∈ A(s)} [ c(a, s) + Σ_{s' ∈ S} P_a(s'|s) V(s') ]

for non-goal states s, with V(s) = 0 for goal states.
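Since parallel value iteration is used as the baseline in Section 5, and the residual is the quantity behind ε-consistency, both are sketched below over the same illustrative C++ interface; this is not the authors' code.

   #include <algorithm>
   #include <cmath>
   #include <limits>
   #include <vector>

   // Residual of V at s: |V(s) - min_a Q_V(a, s)|; goal states are cost-free and absorbing.
   double Residual(const SSP& m, const std::vector<double>& V, State s) {
     if (m.goal[s]) return 0.0;
     double best = std::numeric_limits<double>::infinity();
     for (Action a : m.actions[s]) best = std::min(best, QValue(m, V, s, a));
     return std::fabs(V[s] - best);
   }

   // Parallel (synchronous) value iteration: sweep over all states until
   // every residual is below eps, i.e., V is eps-consistent everywhere.
   void ValueIteration(const SSP& m, std::vector<double>& V, double eps) {
     for (;;) {
       std::vector<double> newV(V);
       double max_res = 0.0;
       for (State s = 0; s < static_cast<State>(m.num_states); ++s) {
         if (m.goal[s]) { newV[s] = 0.0; continue; }
         double best = std::numeric_limits<double>::infinity();
         for (Action a : m.actions[s]) best = std::min(best, QValue(m, V, s, a));
         max_res = std::max(max_res, std::fabs(best - V[s]));
         newV[s] = best;                  // Bellman update at s
       }
       V.swap(newV);
       if (max_res <= eps) return;
     }
   }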

The FIND-and-REVISE schema, starting from an initial admissible value function V, is:

   repeat
      FIND a state s reachable from s0 in the greedy graph of V with Residual(s) > ε
      REVISE V at s
   until no such state is found
   return V
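This schema can be rendered in C++ as follows, reusing the illustrative SSP, QValue, GreedyAction and Residual sketches from Section 2. For brevity, this version revises every inconsistent state met during one depth-first traversal of the greedy graph rather than one state per iteration; either reading instantiates FIND and REVISE.

   #include <functional>
   #include <vector>

   // One FIND-and-REVISE pass: DFS over the states reachable from s0 through
   // greedy actions; returns true if some inconsistent state was revised.
   bool FindAndRevisePass(const SSP& m, std::vector<double>& V, double eps) {
     std::vector<bool> visited(m.num_states, false);
     bool revised = false;
     std::function<void(State)> dfs = [&](State s) {
       if (visited[s] || m.goal[s]) return;
       visited[s] = true;
       if (Residual(m, V, s) > eps) {                    // FIND an inconsistent state ...
         V[s] = QValue(m, V, s, GreedyAction(m, V, s));  // ... and REVISE V at it
         revised = true;
       }
       for (const Outcome& o : m.outcomes[s][GreedyAction(m, V, s)])
         dfs(o.next);                                    // stay inside the greedy graph
     };
     dfs(m.init);
     return revised;
   }

   void FindAndRevise(const SSP& m, std::vector<double>& V, double eps) {
     while (FindAndRevisePass(m, V, eps)) {}             // until no inconsistent state is found
   }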

2. a component C is solved iff C is consistent and all components C' reachable from C are solved.

The problem of labeling states in the cyclic graph can thus be mapped into the problem of labeling the components in the acyclic component graph, which can be done in a bottom-up fashion.

From Fig. 1 it is easy to visualize the component graph associated with the greedy graph. Thus, if a single state is the only inconsistent state, for example, we can label as solved the components that cannot reach it, while leaving the remaining components unsolved.
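As a small illustration of this bottom-up rule, the C++ sketch below labels the components of an explicit component graph, assuming they are numbered in the order produced by Tarjan's algorithm (each component appears after every component it can reach). HDP itself performs the same check on the fly during the depth-first search, as in Alg. 2 below; the explicit representation here is only for illustration.

   #include <cstddef>
   #include <vector>

   struct Component {
     bool consistent;          // no state in the component has residual greater than epsilon
     std::vector<int> succs;   // components reachable in one step in the (acyclic) component graph
   };

   // Components indexed in reverse topological order: successors before predecessors.
   std::vector<bool> LabelSolved(const std::vector<Component>& comps) {
     std::vector<bool> solved(comps.size(), false);
     for (std::size_t c = 0; c < comps.size(); ++c) {    // bottom-up sweep
       bool ok = comps[c].consistent;
       for (int d : comps[c].succs) ok = ok && solved[d];
       solved[c] = ok;                                   // solved iff consistent and all successors solved
     }
     return solved;
   }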

The code that simultaneously checks, in depth-first fashion, the consistency of the states and the possibility of labeling them is shown in Alg. 2. We call the resulting algorithm HDP. HDP inherits its convergence and optimality properties from the FIND-and-REVISE schema and the correctness of the labeling mechanism.

We do not have space to explain the HDP code in detail, yet it should be clear to those familiar with Tarjan's algorithm; note in particular the use of the state visit number, s.IDX, and the 'low-link' number, s.LOW, for detecting when a new component has been found. The flag and the (normal) propagation of the visit numbers prevent a component from being labeled as solved when it is inconsistent or can reach an inconsistent component.

Theorem 4 (Correctness) The value function computed by HDP for a planning model M1-M8, given an initial admissible and monotonic value function, is ε-consistent.

5 Experimental Results

We now evaluate the performance of HDP in comparison with other recent Heuristic Search/DP algorithms such as the second code for LAO* in [Hansen and Zilberstein, 2001], which we call Improved LAO* (ILAO*), and Labeled RTDP (LRTDP), a recent improvement of RTDP that accelerates its convergence [Bonet and Geffner, 2003]. We use parallel Value Iteration as the baseline. We have implemented all these algorithms in C++ and the experiments have been run on a Sun Fire-280R with 750 MHz and 1 Gb of RAM.

The domain that we use for the experiments is the racetrack from [Barto et al., 1995]. The states are tuples (x, y, dx, dy) that represent the position and speed of the car in the x, y dimensions. The actions are pairs (ax, ay) of instantaneous accelerations, where ax, ay ∈ {-1, 0, 1}. Uncertainty in this domain comes from assuming that the road is 'slippery' and, as a result, the car may fail to accelerate or decelerate. More precisely, an action has its intended effect with probability p, while with probability 1 - p the action effects correspond to those of the null action (0, 0). Also, when the car hits a wall, its velocity is set to zero and its position is left intact (this is different than in [Barto et al., 1995], where for some reason the car is moved back to the start position).
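These dynamics can be sketched in C++ as below. The track representation, the helper names, and the endpoint-only wall test are simplifying assumptions (the original domain checks the path swept by the car, not just the final cell).

   #include <utility>
   #include <vector>

   struct CarState { int x, y, dx, dy; };

   struct Track {
     std::vector<std::vector<bool>> wall;                 // wall[y][x], assumed in bounds
     bool Blocked(int x, int y) const { return wall[y][x]; }
   };

   // Outcomes of acceleration (ax, ay) with ax, ay in {-1, 0, 1}: the intended
   // effect with probability p, the null action (0, 0) with probability 1 - p.
   // Hitting a wall zeroes the velocity and leaves the position intact.
   std::vector<std::pair<CarState, double>>
   Successors(const Track& t, const CarState& s, int ax, int ay, double p) {
     auto move = [&](int eff_ax, int eff_ay) {
       CarState n { s.x, s.y, s.dx + eff_ax, s.dy + eff_ay };
       n.x += n.dx;
       n.y += n.dy;
       if (t.Blocked(n.x, n.y))                           // crash: stop the car where it was
         n = CarState{ s.x, s.y, 0, 0 };
       return n;
     };
     return { { move(ax, ay), p }, { move(0, 0), 1.0 - p } };
   }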

We consider the track large-b from [Barto et al., 1995], h-track from [Hansen and Zilberstein, 2001], and five other tracks (squares and rings of different sizes). Information about these instances can be found in the first three rows of Table 1.

HDP(s0)
begin
   while s0 is not labeled SOLVED do
      index := 0
      // perform DFS
      DFS(s0, index)
      [reset IDX to ∞ for the visited states]
      [clean the visited list and the stack]
   return V
end

DFS(s, index)
begin
   // base case
   if s is labeled SOLVED or s is a GOAL state then
      return false
   flag := false
   // check residual
   if RESIDUAL(s) > ε then
      flag := true
   // mark state as active
   PUSH(s, stack)
   PUSH(s, visited)
   s.IDX := s.LOW := index
   index := index + 1
   // recursive call on the successors of the greedy action at s
   for each s' in SUCCESSORS(s, greedy action at s) do
      if s'.IDX = ∞ then
         flag := DFS(s', index) or flag
         s.LOW := min(s.LOW, s'.LOW)
      else if s' is in the stack then
         s.LOW := min(s.LOW, s'.IDX)      // update if necessary
   if flag then
      REVISE V at s
   // try to label
   else if s.IDX = s.LOW then
      repeat
         s' := POP(stack)
         label s' as SOLVED
      until s' = s
   return flag
end

Algorithm 2: The HDP algorithm.

algorithm       h-track      square-2     ring-2       ring-4
|S|             53597        383950       33068        353991
V*(s0)          41.89504     11.13041     16.09833     27.78572
% relevant      17.12        0.25         11.10        9.98
h(s0)           36.0         10.0         14.0         24.0

time for h      13.422       72.853       4.558        85.766
VI              12.620       81.270       5.287        90.895
ILAO*           13.062       0.942        6.095        145.047
LRTDP           7.618        0.363        1.301        37.992
HDP             7.547        0.275        1.101        30.511

VI              14.275       89.248       5.704        100.130
ILAO*           26.057       68.028       9.801        174.089
LRTDP           12.987       7.736        2.480        152.200
HDP             26.663       20.952       3.536        98.033

VI              21.021       89.152       5.809        101.616
ILAO*           46.754       187.463      9.894        310.876
LRTDP           15.276       45.514       3.682        261.286
HDP             42.601       79.171       6.173        157.646

Table 1: Problem data and convergence time in seconds for the different algorithms with different heuristics (one block of timing rows per heuristic). Results are for a fixed ε and transition probability p. Faster times are shown in bold.

We plan to analyze these differences in the future.

6 Approximation

HDP, like FIND-and-REVISE, computes a value function by enforcing its consistency over the states reachable from s0 and the greedy policy. The final variation we consider, which we call HDP(i), works in the same way, yet it enforces the consistency of the value function only over the states that are reachable from s0 and the greedy policy with some minimum likelihood.

For efficiency, we formalize this notion of likelihood using a non-negative integer scale, where 0 refers to a normal outcome, 1 to a somewhat surprising outcome, 2 to a still more surprising outcome, and so on. We call these measures plausibilities, although it should be kept in mind that 0 refers to the most plausible outcomes; thus 'plausibility greater than i' means 'a plausibility measure smaller than or equal to i.' We obtain the transition plausibilities κ_a(s'|s) from the corresponding transition probabilities P_a(s'|s) by the following discretization:

    κ_a(s'|s)  =  round( log_b( max_{s''} P_a(s''|s) / P_a(s'|s) ) ),  for a fixed base b > 1,        (5)

with κ_a(s'|s) = 0 when P_a(s'|s) = max_{s''} P_a(s''|s). Plausibilities are thus 'normalized': the most plausible next states always have plausibility 0. These transition plausibilities are then combined by the rules of the κ-calculus [Spohn, 1988], which is a calculus isomorphic to the probability calculus (e.g., [Goldszmidt and Pearl, 1996]). The plausibility of a state trajectory given the initial state s0 is given by the sum of the transition plausibilities along the trajectory, and the plausibility of reaching a state is given by the plausibility of the most plausible trajectory reaching the state.
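A possible C++ reading of this discretization is given below; the logarithm base is an assumption (any fixed base yields an integer scale with the stated properties), and the essential point is only the normalization that gives the most likely outcomes of each action plausibility 0.

   #include <cmath>
   #include <vector>

   // Plausibility of a transition with probability 'prob', where 'max_prob' is the
   // largest probability among the outcomes of the same action (zero-probability
   // outcomes are assumed not to be represented).
   int Plausibility(double prob, double max_prob, double base = 2.0) {
     return static_cast<int>(std::lround(std::log(max_prob / prob) / std::log(base)));
   }

   // In the kappa-calculus, plausibilities add along a trajectory, as log-probabilities do;
   // the plausibility of reaching a state is the minimum over the trajectories reaching it.
   int TrajectoryPlausibility(const std::vector<int>& step_plausibilities) {
     int kappa = 0;
     for (int k : step_plausibilities) kappa += k;
     return kappa;
   }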

The HDP(i) algorithm, for a non-negative integer i, computes a value function by enforcing its (ε-)consistency over the states reachable from s0 with plausibility greater than or equal to i. HDP(i) produces approximate policies fast by pruning certain paths in the search. The simplest case results from i = 0, as the code for HDP(0) corresponds exactly to the code for HDP, except that the possible successors of a state in the greedy graph are replaced by its plausible successors.

HDP(i) computes lower bounds that tend to be quite tight over the states that can be reached with plausibility no smaller than i. At run time, however, executions may contain 'surprising' outcomes, taking the system 'out' of this envelope, into states where the quality of the value function and its corresponding policy are poor. To deal with those situations, we define a version of HDP(i), called HDP(i, j), that interleaves planning and execution as follows. HDP(i, j) plans from s0 by means of the HDP(i) algorithm, then executes this policy until a state trajectory with plausibility measure greater than or equal to j, and leading to a (non-goal) state s, is obtained. At that point, the algorithm replans from s with HDP(i), and the same execution and replanning cycle is followed until reaching the goal. Clearly, for sufficiently large i, HDP(i) reduces to HDP, and for large j, HDP(i, j) reduces to HDP(i).
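The plan-execute-replan cycle of HDP(i, j) can be sketched as follows, reusing the illustrative SSP and GreedyAction sketches from Section 2; the planner, the simulator, and the per-step plausibility are passed in as stand-ins, since their details are not given here.

   #include <functional>
   #include <vector>

   void RunHDPij(const SSP& m, std::vector<double>& V, int j,
                 const std::function<void(State)>& plan_with_hdp_i,           // run HDP(i) from a state
                 const std::function<Outcome(State, Action)>& sample_outcome, // simulate one step
                 const std::function<int(State, Action, const Outcome&)>& plausibility) {
     State s = m.init;
     while (!m.goal[s]) {
       plan_with_hdp_i(s);                    // (re)plan: make V consistent over the likely envelope of s
       int surprise = 0;                      // accumulated plausibility of the executed trajectory
       while (!m.goal[s] && surprise < j) {
         Action a = GreedyAction(m, V, s);    // act greedily on the current value function
         Outcome o = sample_outcome(s, a);    // the environment picks the next state
         surprise += plausibility(s, a, o);   // 0 for normal outcomes, > 0 for surprising ones
         s = o.next;
       }
     }                                        // goal reached
   }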

Table 2 shows the average cost for HDP(i, j) for i = 0 (i.e., most plausible transitions considered only) and several values of the replanning threshold j. Each entry in the table corresponds to an average over 100 independent executions. We also include the average cost for the greedy policy with respect to h as a bottom-line reference for the figures. Memory in the table refers to the number of evaluated states. As these results show, there is a smooth tradeoff between quality (average cost to the goal) and time (spent in initial planning and posterior replannings) as the parameter j varies. We also see that in this class of problems the heuristic delivers a very good greedy policy. Thus, further research is necessary to assess the goodness of HDP(i, j) and the heuristic.

7 Related Work

We have built on [Barto et al., 1995] and [Bertsekas, 1995], and more recently on [Hansen and Zilberstein, 2001] and [Bonet and Geffner, 2003].

                square-2                          ring-4
algorithm       quality    time      memory       quality    time      memory
HDP             11.130     7.547     35835        27.785     4.145     43358
HDP(0,j)        42.950     0.021     819          22.000     1.758     24195
HDP(0,j)        44.000     0.022     650          23.300     1.821     23857
HDP(0,j)        46.800     0.017     556          23.600     1.580     21823
HDP(0,j)        46.800     0.014     564          25.500     1.518     21823
greedy          47.150     N/A       104          25.390     N/A       241

Table 2: Results of HDP(i, j) for i = 0 and several values of the replanning threshold j, and of the greedy policy with respect to the heuristic h. Each value is the average over 100 executions. N/A in the time column for the greedy policy means "Not Applicable" since there is no planning.

The crucial difference between FIND-and-REVISE and general asynchronous value iteration is the focus of the former on the states that are reachable from the initial state and the greedy policy. In RTDP, the FIND procedure is not systematic and is carried out by a stochastic simulation that may take a long time when the inconsistent states are reachable only with low probability (this explains why RTDP's final convergence is slow; see [Bonet and Geffner, 2003]). LAO*, on the other hand, keeps track of a subset of states, which initially contains only s0, and over which it incrementally maintains an optimal policy through a somewhat expensive REVISE (full DP) procedure. This is then relaxed in the second algorithm in [Hansen and Zilberstein, 2001], called Improved LAO* here. The use of an explicit envelope that is gradually expanded is present also in [Dean et al., 1993] and [Tash and Russell, 1994]. Interestingly, these envelopes are expanded by including the most 'likely' reachable states not yet in the envelope. The algorithm HDP(i) exploits a similar idea but formulates it in a different form and has a crisp termination condition.

8 Summary

We have introduced and analyzed three HS/DP algorithms that exploit knowledge of the initial state and an admissible heuristic function for solving planning problems with uncertainty and feedback: FIND-and-REVISE, HDP, and HDP(i). FIND-and-REVISE makes explicit the basic idea underlying HS/DP algorithms: inconsistent states are found and updated until none is left. We have proved its convergence, complexity, and optimality. HDP adds a labeling mechanism based on Tarjan's SCC algorithm and is strongly competitive with current algorithms. Finally, HDP(i) and HDP(i, j) offer great time and memory savings, with little apparent loss in quality, in problems where transitions have probabilities that differ greatly in value, by focusing the updates on the states that are more likely to be reached.

Acknowledgements: We thank Eric Hansen and Shlomo Zilberstein for making the code for LAO* available to us. Blai Bonet is supported by grants from NSF, ONR, AFOSR, the DoD MURI program, and by a USB/CONICIT fellowship from Venezuela. Héctor Geffner is supported by grant TIC2002-04470-C03-02 from MCyT, Spain.

References

[Barto et al., 1995] A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138, 1995.

[Bellman, 1957] R. Bellman. Dynamic Programming. Princeton University Press, 1957.

[Bertsekas, 1995] D. Bertsekas. Dynamic Programming and Optimal Control (2 Vols). Athena Scientific, 1995.

[Bonet and Geffner, 2003] B. Bonet and H. Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. ICAPS-03. To appear, 2003.

[Boutilier et al., 1999] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1-94, 1999.

[Dean et al., 1993] T. Dean, L. Kaelbling, J. Kirman, and A. Nicholson. Planning with deadlines in stochastic domains. In R. Fikes and W. Lehnert, editors, Proc. 11th National Conf. on Artificial Intelligence, pages 574-579, Washington, DC, 1993. AAAI Press / MIT Press.

[Goldszmidt and Pearl, 1996] M. Goldszmidt and J. Pearl. Qualitative probabilities for default reasoning, belief revision, and causal modeling. Artificial Intelligence, 84:57-112, 1996.

[Hansen and Zilberstein, 2001] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35-62, 2001.

[Howard, 1960] R. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.

[Korf, 1997] R. Korf. Finding optimal solutions to Rubik's Cube using pattern databases. In B. Kuipers and B. Webber, editors, Proc. 14th National Conf. on Artificial Intelligence, pages 700-705, Providence, RI, 1997. AAAI Press / MIT Press.

[Puterman, 1994] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., 1994.

[Spohn, 1988] W. Spohn. A general non-probabilistic theory of inductive reasoning. In Proc. 4th Conf. on Uncertainty in Artificial Intelligence, pages 149-158, New York, NY, 1988. Elsevier Science Publishing Company, Inc.

[Tarjan, 1972] R. E. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146-160, 1972.

[Tash and Russell, 1994] J. Tash and S. Russell. Control strategies for a stochastic planner. In B. Hayes-Roth and R. Korf, editors, Proc. 12th National Conf. on Artificial Intelligence, pages 1079-1085, Seattle, WA, 1994. AAAI Press / MIT Press.
