Microscopic simulation based study of pedestrian safety applications at signalized urban crossings in a connected-automated vehicle environment and reinforcement learning based optimization of vehicle decisions
- Research Article
- 10.22119/ijte.2016.16180
- Apr 1, 2016
Pedestrians are among the most vulnerable road users, and vehicle speed is considered one of the major sources of danger for pedestrians crossing the street. It is therefore important to devise suitable solutions for reducing vehicle speeds; one such solution is the installation of Pedestrian Refuge Islands (PRIs) at very wide midblocks. Because pedestrian and vehicle volumes fluctuate across peak hours, vehicle-pedestrian collision patterns vary as well. This article determines the effect of PRIs constructed in Tehran on vehicle speeds and, consequently, on the probability of fatal accidents. Vehicle speeds are assessed in two phases, before and after arrival at the PRI, and speeds for unobserved combinations of vehicle and pedestrian volume are estimated with the Aimsun v6 simulation software. A paired t-test is applied to compare average vehicle speeds before and after the PRI. The results reveal that, except for traffic volumes of 3000-4000 veh/h combined with 400-600 ped/h, the reduction in average vehicle speed caused by the PRI is significant at all volumes. The results also show that, at all volumes, these installations reduce the probability of fatal accidents to below 10%. Accordingly, it is recommended that PRIs be installed at midblocks where the traffic volume in each lane is less than 750 veh/h.
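The paired t-test used in this study is straightforward to reproduce. Below is a minimal sketch with hypothetical speed samples (the paper's data are not shown), assuming NumPy and SciPy are available.

```python
# Minimal sketch of a paired t-test comparing mean vehicle speeds before
# and after the refuge island; the speed samples below are hypothetical.
import numpy as np
from scipy import stats

speed_before = np.array([48.2, 52.1, 45.7, 50.3, 47.9, 53.4])  # km/h, upstream of the PRI
speed_after  = np.array([41.5, 44.0, 39.8, 43.2, 40.7, 45.1])  # km/h, at the PRI

t_stat, p_value = stats.ttest_rel(speed_before, speed_after)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The reduction in average speed is statistically significant.")
```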
- Research Article
- 10.11113/mjce.v12.80
- Jan 1, 2000
Observations at midblock signalled pedestrian crossings in Malaysia showed that violation of the red signal aspect by both motorists and pedestrians is close to 70%. Motorists may drive through the red signal aspect when there is no pedestrian on the crossing, and pedestrians are observed to cross in vehicle gaps during the vehicles' right of way. Such behaviour is dangerous, especially in a system where traffic movements rely on the signal aspects shown to users. Four alternative algorithms were designed to enhance the operation and safety of midblock signalled pedestrian crossings in Malaysia, with the objectives of increasing pedestrian safety and drivers' compliance with the signal. The developed strategies require detectors to register vehicle and pedestrian demands, and the algorithms compute the appropriate times to start and terminate vehicle and pedestrian precedence. The computation aims to limit vehicle delay and to ensure that the vehicle flow does not exceed 85% of road capacity. Beyond vehicle demands, the developed strategies also emphasize the requirements and limits of pedestrian behaviour. Four measures of performance are compared and evaluated: the percentage of vehicles crossing on the red man, pedestrian mean delay, mean cycle time, and vehicle mean delay, all evaluated using a fully calibrated and validated simulation program. All of the new strategies increase pedestrian compliance. This indicates that the present system does not respond satisfactorily to gaps in vehicle flow and may contribute to users' disrespect of the system.
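The abstract does not give the four algorithms in detail; the sketch below only illustrates the kind of phase-termination rule it describes. The gap-out threshold, the minimum-green condition, and the exact use of the 85% limit are assumptions, not the paper's algorithms.

```python
# Hedged illustration (not the paper's actual algorithms): end the vehicle
# phase for a waiting pedestrian once a usable gap appears, but not while
# doing so would push the approach beyond 85 % of its capacity.
def grant_pedestrian_phase(ped_waiting: bool, min_green_elapsed: bool,
                           last_headway_s: float, flow_veh_h: float,
                           capacity_veh_h: float) -> bool:
    GAP_OUT_S = 4.0                     # assumed gap-out threshold
    MAX_DEGREE_OF_SATURATION = 0.85     # the 85 % capacity limit above
    if not (ped_waiting and min_green_elapsed):
        return False
    if flow_veh_h / capacity_veh_h > MAX_DEGREE_OF_SATURATION:
        return False                    # vehicle demand too close to capacity
    return last_headway_s >= GAP_OUT_S  # serve pedestrians in a natural gap
```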
- Research Article
- 10.3303/cet1972016
- Jan 31, 2019
- Chemical Engineering Transactions
As population and vehicle ownership increase, pollutant emissions also increase. The transportation sector accounted for about 21% of GHG emissions in 2015 (OECD), caused partly by the frequent stop-and-go behaviour and delay of vehicles at signalized intersections. These can generally be reduced by driving at constant speed or by cutting delay through efficient traffic signal control. With the advent of V2X (Vehicle-to-Everything) technology, researchers have accordingly tried to decrease vehicle delay and eliminate unnecessary stop-and-go behaviour at urban signalized intersections. In particular, under traditional pre-timed signal control, even autonomous vehicles cannot exhibit their maximum performance. A traffic signal control system that optimizes signalized traffic flow from real-time vehicle information could therefore benefit both traffic flow and the environment. In this research, assuming a V2X environment, changes in traffic flow and emissions are analysed from microscopic traffic information. Specifically, a reinforcement learning model based on deep learning is constructed that learns from real-time traffic information and displays the optimal traffic signal. The performance of the system was analysed with the microscopic traffic simulator Vissim. The proposed system is expected to contribute to the analysis of traffic flow and environmental effects, and to the construction of green smart cities as autonomous vehicles begin operating in a future V2X environment.
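The paper pairs deep RL with Vissim; as a hedged stand-in, the tabular Q-learning sketch below shows the shape of such a control loop. The discretized queue-length states, the keep/switch action set, and negative delay as the reward are all simplifying assumptions.

```python
# Tabular Q-learning skeleton for a two-phase signal controller (sketch
# only; the paper itself uses a deep network over richer V2X state).
import numpy as np

n_states, n_actions = 64, 2   # discretized queue lengths; actions: keep / switch phase
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

def act(s: int) -> int:
    # epsilon-greedy action selection over the learned Q-values
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

def update(s: int, a: int, r: float, s_next: int) -> None:
    # one-step Q-learning backup; r would be the negative total delay
    # observed in the simulator between successive decisions
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```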
- Research Article
- 10.4399/97888548415121
- Jul 1, 2011
- Advances in Transportation Studies
The objective of this research is to create an overall picture of pedestrians' road safety in European countries, to identify factors associated with increased pedestrian fatality risk, and to propose countermeasures for the improvement of pedestrians' road safety in Europe. In particular, data on pedestrian fatalities for the period 1997-2006 from 19 EU countries, extracted from the EU CARE database, are associated with basic road safety factors such as pedestrian age and gender (with particular focus on children and the elderly), lighting conditions, area type (inside/outside urban areas), and seasonality. Both overall trends and cross-country comparisons are examined, and pedestrian accident risk rates (fatalities per population) are estimated. The results suggest that, although pedestrian fatalities in Europe present a decreasing trend, pedestrian fatality rates remain elevated in Southern European countries as well as in the new Member States. Moreover, pedestrian fatalities are higher inside urban areas, at night-time, and during the winter, and children and the elderly remain the most vulnerable groups. These results are further discussed in relation to pedestrians' risk exposure. Finally, a review of road safety measures from international experience is carried out, and specific areas of action are proposed in light of these results.
- Research Article
- 10.14569/specialissue.2018.090155
- Jan 1, 2018
- International Journal of Advanced Computer Science and Applications
The university lecture hall is among the most crowded places on campus, occupied mostly by students. Students follow daily schedules that require them to move from one place to another in the shortest possible time. However, the unbalanced, scattered placement of key facilities (lecture and tutorial halls, general labs, the student centre, etc.) causes unbalanced use of the lecture hall exits and uneven population density within the hall. Hence, in a panic situation, the evacuation process leads to high physical contact between pedestrians at the heavily used exits and causes crowd bottlenecks. This research studies and simulates pedestrian movement in a university lecture hall to determine the most used exits and the reasons for their heavy usage. The simulation uses a cellular automata approach as a discrete model of microscopic pedestrian movement. Based on the results, the university will be offered solutions to this situation, and building design and construction planning are identified for future enhancement towards a sustainable and prudent learning space for the university's students.
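A minimal sketch of a cellular-automaton update of the kind this abstract describes follows. The grid encoding, single-cell moves, and nearest-exit target choice are illustrative assumptions, not the study's calibrated model.

```python
# One synchronous CA step: each pedestrian tries to move one cell toward
# its nearest exit (Manhattan distance); a move is taken only if the target
# cell is free. The grid is assumed to be bordered by WALL cells.
import numpy as np

FREE, PED, WALL = 0, 1, 2

def ca_step(grid: np.ndarray, exits: list[tuple[int, int]]) -> np.ndarray:
    new = grid.copy()
    for r, c in zip(*np.where(grid == PED)):
        er, ec = min(exits, key=lambda e: abs(e[0] - r) + abs(e[1] - c))
        dr, dc = int(np.sign(er - r)), int(np.sign(ec - c))
        for nr, nc in ((r + dr, c), (r, c + dc)):  # try row move, then column move
            if (nr, nc) != (r, c) and new[nr, nc] == FREE:
                new[r, c], new[nr, nc] = FREE, PED
                break
    return new
```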
- Supplementary Content
- 10.4225/03/58b64f258382d
- Mar 5, 2019
Pedestrian safety at rail level crossings (RLXs) is a concern for government, the transport industry and the community. Collisions between trains and pedestrians are traditionally viewed as the result of errors or violations committed by pedestrians. However, as RLXs are complex sociotechnical systems, collisions are better understood as emergent properties of interactions amongst and between human and technical components within the system. Cognitive work analysis (CWA) is a powerful analytical framework that offers a sophisticated understanding of the functioning of RLXs as relevant to pedestrian safety, through identifying the constraints on pedestrian behaviour in this context. It does not, however, provide its users with guidance about how the findings of the analysis can be used to improve sociotechnical system functioning. Accordingly, the aim of this research was the development and evaluation of a CWA-based approach to support the design of complex sociotechnical systems, and the application of this approach to provide recommendations for RLX design to improve pedestrian safety. A secondary aim of the research was to investigate pedestrian behaviour within the RLX system using CWA. Drawing on systems theory, and more specifically, sociotechnical systems theory, the outcome of this research is a CWA design toolkit (the CWA-DT). The development and refinement of the CWA-DT is illustrated through a proof of concept application in the domain of public transport ticketing. Positive evaluation results were obtained and necessary refinements to the toolkit were implemented in the second version. A full evaluation of the toolkit was then undertaken within the complex, safety critical domain of RLXs. This evaluation found that while the CWA-DT could be considered a useful method, its application did not lead to the creation of design concepts that fully aligned with sociotechnical systems theory. The application did, however, lead to designs that were rated by human factors experts as more effective than the existing system design. These findings suggest that the sociotechnical systems theory approach may not provide an appropriate design philosophy in a public safety context. However, merging ideas and concepts from sociotechnical systems theory with existing paradigms such as safety management can lead to innovation and has the potential to improve safety performance. Awards: Winner of the Mollie Holman Doctoral Medal for Excellence, Monash Injury Research Institute, 2015.
- Supplementary Content
- 10.17638/03090640
- Jun 16, 2020
- University of Liverpool
Video surveillance is one of the most essential topics in the computer vision field. With the rapid and continuous increase in the use of surveillance cameras to obtain portrait information in scenes, it has become a very important system for security and criminal investigations. A video surveillance system involves many key technologies, including object recognition, object localization, object re-identification, and object tracking, by which the system can identify and follow the movements of objects and persons. In recent years, person re-identification and visual object tracking have become hot research directions in computer vision. A re-identification system aims to recognize and identify a target with the required attributes, while a tracking system aims to follow and predict the movement of the target after identification. Researchers have used deep learning and computer vision techniques to significantly improve the performance of person re-identification. However, person re-identification remains challenging due to complex application environments such as lighting variations, complex background transformations, low-resolution images, occlusions, and the similar dressing of different pedestrians. The challenge also comes from the unavailability of bounding boxes for pedestrians and the need to search for a person over whole gallery images. To address these critical issues in modern person identification applications, we propose an algorithm that can accurately localize persons by learning to minimize intra-person feature variations. We build our model upon a state-of-the-art object detection framework, i.e., Faster R-CNN, so that high-quality region proposals for pedestrians can be produced in an online manner. In addition, to relieve the negative effects caused by varying visual appearances of the same individual, we introduce a novel center loss that increases the intra-class compactness of feature representations; the center loss encourages persons with the same identity to have similar feature characteristics. Besides the localization of a single person, we explore the more general visual object tracking problem. The main task of visual object tracking is to predict the location and size of the tracking target accurately and reliably in subsequent frames, given the target at the beginning of the sequence; a tracking algorithm with high accuracy, good stability, and fast inference speed is necessary. In this thesis, we study the model updating problem for two kinds of mainstream tracking algorithms and improve their robustness and accuracy. Firstly, we extend the siamese tracker with a model updating mechanism to improve its tracking robustness. A siamese tracker uses a deep convolutional neural network to obtain features and compares the features of the new frame with the target features from the first frame; the candidate region with the highest similarity score is taken as the tracking result. However, such trackers are not robust against large target variation due to the no-update matching strategy used throughout tracking. To combat this defect, we propose an ensemble siamese tracker, in which the final similarity score is also affected by the similarity with tracking results in recent frames instead of solely considering the first frame.
Tracking results from recent frames are used to adapt the model to continuous target change. Meanwhile, we combine an adaptive candidate sampling strategy with a large-displacement optical flow method to further improve performance. Secondly, we investigate the classic correlation filter based tracking algorithm and propose a better model selection strategy based on reinforcement learning. The correlation filter has proven to be a useful tool in visual tracking, particularly for seeking a good balance between tracking accuracy and speed. However, correlation filter based models are susceptible to wrong updates stemming from inaccurate tracking results, and to date little effort has been devoted to the correlation filter update problem. In our approach, we update and maintain multiple correlation filter models in parallel and use deep reinforcement learning to select the optimal model among them. To make the decision process efficient, we propose a decision-net for target appearance modeling, trained on hundreds of challenging videos using proximal policy optimization and a lightweight learning network. An exhaustive evaluation of the proposed approach on the OTB100 and OTB2013 benchmarks shows its effectiveness.
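The center-loss idea mentioned above is easy to state concretely. The PyTorch-style sketch below is the generic formulation (one learnable center per identity, mean squared feature-to-center distance), under assumed shapes; it is not the thesis's implementation.

```python
# Generic center loss: the loss is the mean squared distance between each
# feature and the center of its identity; minimizing it jointly with the
# detection loss increases intra-class compactness.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, num_identities: int, feat_dim: int):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_identities, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim); labels: (batch,) identity indices
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

# usage (the weighting factor is an assumption):
#   loss = detection_loss + 0.01 * center_loss(feats, identity_labels)
```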
- Supplementary Content
- 10.2312/14456
- Oct 23, 2014
- Dialnet (Universidad de la Rioja)
The objective of this thesis is to use reinforcement learning to generate plausible simulations of pedestrians in different environments. Methodology: a multi-agent framework was developed in which each virtual agent learns a navigation behaviour by interacting with the virtual world it inhabits together with the other agents. The virtual world is simulated with a physics engine (ODE) calibrated with human pedestrian parameters taken from the literature. The framework is flexible and supports different learning algorithms (specifically Q-Learning and Sarsa(λ)) in combination with different state-space generalization techniques (specifically vector quantization and tile coding). Fundamental diagrams (speed/density relation), density maps, chronograms, and performance (in terms of the percentage of agents that reach the goal) are used as tools to analyse the learned behaviours. Conclusions: after a battery of experiments in different scenarios (6 distinct scenarios in total) and the corresponding analyses of results, the conclusions are as follows: plausible pedestrian behaviours were achieved; the behaviours are robust to scaling and exhibit abstraction capabilities (behaviours at the tactical and planning levels); the learned behaviours can generate emergent collective behaviours; and the comparison with a standard pedestrian model (Helbing's model), together with the analyses at the level of fundamental diagrams, indicates that the learned dynamics are coherent and similar to real pedestrian dynamics.
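Tile coding, one of the two state-generalization techniques named above, can be sketched in a few lines. The tiling count, tile count, offsets, and value range below are illustrative assumptions.

```python
# Tile coding for a continuous scalar state: several overlapping tilings,
# each shifted slightly, map the value to one active tile index per tiling.
import numpy as np

def tile_indices(x: float, lo: float, hi: float,
                 n_tiles: int = 8, n_tilings: int = 4) -> list[int]:
    """Return one active (global) tile index per tiling for value x."""
    idx = []
    width = (hi - lo) / n_tiles
    for t in range(n_tilings):
        offset = t * width / n_tilings            # shift each tiling a little
        i = int(np.clip((x - lo + offset) / width, 0, n_tiles - 1))
        idx.append(t * n_tiles + i)               # unique index per (tiling, tile)
    return idx

# Q(s, a) is then approximated as the sum of the weights of the active tiles:
#   q = sum(w[a][i] for i in tile_indices(x, 0.0, 10.0))
```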
- Supplementary Content
- 10.1184/r1/9913883.v1
- Oct 2, 2019
- Figshare
Buildings account for a significant portion of the total energy consumption of many countries, and energy efficiency is one of the primary objectives of today's building projects. The whole-building energy model (BEM), a physics-based modeling method for building thermal and energy behaviour, is widely used by building designers to predict and improve the energy performance of building designs. BEM also has potential for developing supervisory control strategies for heating, ventilation, and air-conditioning (HVAC) systems; BEM-derived control strategies may significantly improve HVAC energy efficiency compared to the commonly used rule-based control strategies. However, using BEM for HVAC control is challenging. Firstly, BEM is a high-order model, so classical model-based control methods cannot be applied directly; heuristic search algorithms, such as genetic algorithms, are usually used for BEM-based control optimization. Secondly, BEM is computationally intensive compared to black-box or grey-box models, which limits its application to large-scale control optimization problems. Model-free reinforcement learning (RL) is an alternative way to use BEM for HVAC control. Model-free RL is a "trial-and-error" learning method applicable to any complex system, so BEM can serve as a simulator to train an RL agent offline to learn an energy-efficient supervisory control strategy. However, reinforcement learning for HVAC control has not been adequately studied: most existing studies are based on over-simplified HVAC systems and a limited number of experiment scenarios. This study develops a BEM-assisted reinforcement learning framework for HVAC supervisory control for energy efficiency. The control framework uses a design-stage BEM to "learn" a control strategy via model-free RL; the RL agent is a neural network model that performs as a function approximator. Through computer simulations, the control framework is evaluated in different scenarios covering four typical commercial HVAC systems, four climates, and two building thermal mass levels. The RL-trained control strategies are also evaluated for "versatility", i.e., tolerance of variations in HVAC operational conditions; multiple "perturbed" simulators are created for this purpose, with varying weather conditions, occupancy and plug-load schedules, and indoor air temperature setpoint schedules. The control framework achieved better-than-baseline control performance in a variable-air-volume (VAV) system (a common type of air-based secondary HVAC system) for both cooling and heating under different climates and building thermal mass levels. Compared to a baseline rule-based control strategy, the RL-trained strategies achieve clear energy savings and less "setpoint not met time" (i.e., the cumulative time during which indoor air temperature setpoints are not met). The RL-trained strategies can also tolerate variations in weather conditions and occupancy/plug-load schedules, but they show worse-than-baseline energy performance if the indoor air temperature setpoint schedules are changed significantly. The control framework also achieved reduced heating demand and improved-or-similar thermal comfort (compared to a baseline rule-based control) for a slow-response radiant heating system in all experiment scenarios, and the RL-trained strategies achieved improved control performance in the different perturbed simulators.
However, the reward function must include a specially designed heuristic to deal with the slow thermal response and the imperfect energy metric of this system. The heuristic encourages low supply water temperature setpoint values and rewards increasing trends of the predicted mean vote (PMV) when it is below the setpoint. This indicates that reward function design is crucial to the control performance of this framework. Control performance may be poor if the reward function is over-complicated, as shown in the experiments on a multi-chiller chilled water system. The reward function for that system consists of three complicated penalty functions corresponding to three operational constraints: the chiller cycling time, the chiller partial load ratio, and the system supply water temperature. The RL-trained control strategies violated some operational constraints significantly and achieved only a limited amount of energy savings. This thesis also studied the effects of the complexity of the neural network model (the RL agent's function approximator) on the control and convergence performance of the framework. It is found that a complex neural network model does not necessarily lead to better control performance than a simple one, and a complex model may make the reinforcement learning hard to converge. Thus, "deep" reinforcement learning is not always a suitable choice, even though it is a popular concept in recent literature. As a general guideline, this study recommends a narrow, shallow, non-linear neural network model for the control framework. In future work, the framework should be evaluated in more scenarios, such as more types of HVAC systems and more climate zones, and a more comprehensive versatility analysis of a trained RL control policy is needed. Future work should also develop an adaptive RL control method that can self-adapt to the changing characteristics of an HVAC system. Last but not least, theoretical investigations are needed to guide the future development of the control framework.
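As an illustration of the radiant-heating heuristic described above, the reward sketch below combines an energy penalty with the two heuristic terms. The weights, signature, and term shapes are assumptions for illustration, not the thesis's reward function.

```python
# Illustrative reward for the slow-response radiant heating case: penalize
# energy use, prefer low supply water temperature setpoints, and reward a
# rising PMV trend while comfort is still below the setpoint.
def reward(heating_demand_kw: float, supply_temp_setpoint_c: float,
           pmv: float, pmv_prev: float, pmv_setpoint: float = 0.0) -> float:
    r = -heating_demand_kw                   # base term: energy penalty
    r -= 0.1 * supply_temp_setpoint_c        # heuristic: low supply water temperature
    if pmv < pmv_setpoint and pmv > pmv_prev:
        r += 1.0                             # heuristic: reward the increasing PMV trend
    return r
```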
- Research Article
- 10.5075/epfl-thesis-7613
- Jan 1, 2017
- Infoscience (Ecole Polytechnique Fédérale de Lausanne)
Data-driven fundamental models for pedestrian movements
- Supplementary Content
- 10.26083/tuprints-00017536
- Mar 30, 2021
- TUprints (Technical University of Darmstadt)
Nowadays, industrial processes are vastly automated by means of robotic manipulators. In some cases, robots occupy a large fraction of the production line, performing a rich range of tasks. In contrast to their tireless ability to repeat the same tasks with millimetric precision, current robots exhibit low adaptability to new scenarios. This lack of adaptability often hinders closer human-robot interaction; furthermore, when a change must be applied to the production line, the robots need to be reconfigured by highly qualified personnel. Machine learning, and more particularly reinforcement learning, holds the promise of automated systems that can adapt to new situations and learn new tasks. Despite the overwhelming progress in the field in recent years, the vast majority of reinforcement learning is not directly applicable to real robotics: state-of-the-art reinforcement learning algorithms require intensive interaction with the environment and are unsafe in the early stage of learning, when the policy performs poorly and can potentially harm the system. For these reasons, reinforcement learning has mainly been applied successfully to simulated tasks such as computer and board games, where vast numbers of samples can be collected in parallel and there is no possibility of damaging a real system. To mitigate these issues, researchers proposed first employing imitation learning to obtain a reasonable policy, and subsequently refining it via reinforcement learning. In this thesis, we focus on two main issues that prevent this pipeline from working efficiently: (i) robotic movements are represented with a high number of parameters, which prevents both safe and efficient exploration; (ii) the policy improvement is usually on-policy, which also causes inefficient and unsafe updates. In Chapter 3 we propose an efficient method for dimensionality reduction of learned robotic movements that exploits redundancies in the movement space (which occur more commonly in manipulation tasks) rather than redundancies in the robot kinematics. The dimensionality reduction allows projection to latent spaces that represent, with high probability, movements close to the demonstrated ones. To make reinforcement learning safer and more efficient, we define the off-policy update in the movement's latent space in Chapter 4. In Chapter 5, we propose a novel off-policy gradient estimation that makes use of a particular non-parametric technique, Nadaraya-Watson kernel regression; building on a solid theoretical framework, we derive statistical guarantees. We believe that providing strong guarantees is at the core of safe machine learning, and in this spirit we further expand and analyze the statistical guarantees on Nadaraya-Watson kernel regression in Chapter 6. Usually, to avoid challenging exploration in reinforcement learning applied to robotics, one must define a highly engineered reward function; this limitation hinders the possibility of non-expert users defining new tasks, and exploration remains an open issue under high-dimensional and sparse rewards. To mitigate this issue, we propose a far-sighted exploration bonus built on information-theoretic principles in Chapter 7. To test our algorithms, we provide a full analysis in simulated environments and, in some cases, on real-world robotic tasks. The analysis supports our claims, showing that the proposed techniques can learn safely with a limited set of demonstrations and robotic interactions.
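Nadaraya-Watson kernel regression, the non-parametric estimator referenced in Chapters 5 and 6 above, has a compact standard form. The Gaussian kernel and bandwidth below are conventional choices for illustration, not necessarily the thesis's.

```python
# Standard Nadaraya-Watson estimator: E[Y|x] is approximated as a
# kernel-weighted average of the observed targets.
import numpy as np

def nadaraya_watson(x_query: np.ndarray, X: np.ndarray, Y: np.ndarray,
                    h: float = 0.5) -> float:
    """X: (n, d) inputs, Y: (n,) targets, h: kernel bandwidth."""
    d2 = np.sum((X - x_query) ** 2, axis=1)   # squared distances to the query
    w = np.exp(-d2 / (2 * h ** 2))            # Gaussian kernel weights
    return float(w @ Y / max(w.sum(), 1e-12)) # weighted mean, guarded against 0
```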
- Supplementary Content
- 10.1184/r1/11882796.v1
- Feb 24, 2020
- Figshare
Reinforcement learning (RL) focuses on an essential aspect of intelligent behavior: how an agent can learn to make good decisions given experience and rewards in a stochastic world. Yet popular RL algorithms that have enabled exciting successes in domains with good simulators (Go, Atari, etc.) still often fail to learn in other domains because they rely on simple heuristics for exploration. This provides additional empirical justification for essential questions around RL, specifically around algorithms that learn in a provably efficient manner through strategic exploration in any considered domain. This thesis provides new algorithms and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, probably approximately correct) and introduces new, stronger evaluation criteria that may be of particular interest as RL is applied to more real-world problems. For the first line of work, on probably approximately correct (PAC) RL algorithms, we introduce a series of algorithms for episodic tabular domains with substantially better PAC sample complexity bounds, culminating in a new algorithm with close to minimax optimal PAC and regret bounds. Lookup tables are required by most sample-efficient and computationally tractable algorithms, but cannot represent many practical domains. We therefore also present a new RL algorithm that can learn a good policy in environments with high-dimensional observations and hidden deterministic states; unlike its predecessors, this algorithm provably explores in not only a statistically but also a computationally efficient manner, assuming access to function classes with efficient optimization oracles. To make progress, it is critical to have the right measures of success. While empirical demonstrations are quite clear, we find that for theoretical properties, two of the most commonly used learning frameworks, PAC guarantees and regret guarantees, each allow undesirable algorithm behavior (e.g., ignoring new observations that could improve the policy). We present a new, stronger learning framework called Uniform-PAC that unifies the existing frameworks and prevents these undesirable algorithm properties. One caveat of all existing learning frameworks is that, for any particular episode, we do not know how well the algorithm will perform. To address this, we introduce the IPOC framework, which requires algorithms to provide a certificate before each episode bounding how suboptimal the current policy can be. Such certificates may be of substantial interest in high-stakes scenarios, where an organization may wish to track, or even pause, an online RL system should the potential expected performance bound drop below a required expected outcome.
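For reference, the standard (ε, δ)-PAC criterion that Uniform-PAC strengthens can be stated as follows; this is a common textbook form for episodic tabular MDPs (with S states, A actions, and horizon H), not the thesis's exact statement.

```latex
% An episodic RL algorithm is (\varepsilon,\delta)-PAC if, with probability
% at least 1-\delta, it plays an \varepsilon-suboptimal policy in at most
% polynomially many episodes:
\Pr\!\left[\,\bigl|\{\,k : V^{*} - V^{\pi_k} > \varepsilon\,\}\bigr|
      \le \operatorname{poly}\!\bigl(S, A, H, \tfrac{1}{\varepsilon},
      \log\tfrac{1}{\delta}\bigr)\right] \;\ge\; 1-\delta
```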
- Supplementary Content
- 10.6092/polito/porto/2604770
- Jan 1, 2015
- Politecnico di Torino
This thesis focuses on vehicular networks and outdoor pedestrian localization. In particular, it targets secure positioning in vehicular networks and pedestrian localization for safety services in outdoor environments. The former research topic must cope with three major challenges, concerning users' privacy, the computational cost of security, and the system's trust in user correctness. This thesis addresses those issues by proposing a new lightweight privacy-preserving framework for continuous tracking of vehicles. The proposed solution is evaluated in both dense and sparse vehicular settings through simulation and experiments in real-world testbeds. In addition, this thesis explores the benefit of using low-frequency bands for the transmission of control messages in vehicular networks. The latter topic is motivated by the significant number of traffic accidents involving pedestrians distracted by their smartphones. This thesis proposes two localization solutions specifically for pedestrian safety: a GPS-based approach and a shoe-mounted inertial sensor method. The GPS-based solution is suitable for rural and suburban areas but is not applicable in dense urban environments due to large positioning errors; the inertial sensor approach overcomes these limitations in urban environments. Indeed, by exploiting accelerometer data, this architecture is able to precisely detect transitions from safe to potentially unsafe walking locations without the need for any absolute positioning system.
- Supplementary Content
- 10.1184/r1/8397962.v1
- Jul 2, 2019
- Figshare
Unlike classic supervised learning, reinforcement learning (RL) is fundamentally interactive: an autonomous agent must learn how to behave in an unknown, uncertain, and possibly hostile environment, actively interacting with the environment to collect useful feedback and improve its sequential decision making. The RL agent also intervenes in the environment: the agent makes decisions which in turn affect the further evolution of the environment. Because of its generality (most machine learning problems can be viewed as special cases), RL is hard. As there is no direct supervision, one central challenge in RL is how to explore an unknown environment and collect useful feedback efficiently. In recent RL success stories (e.g., super-human performance on video games [Mnih et al., 2015]), we notice that most rely on random exploration strategies, such as ε-greedy. Similarly, policy gradient methods such as REINFORCE [Williams, 1992] perform exploration by injecting randomness into the action space, hoping the randomness leads to a good sequence of actions that achieves high total reward. The theoretical RL literature has developed more sophisticated algorithms for efficient exploration (e.g., [Azar et al., 2017]); however, the sample complexity of these near-optimal algorithms must scale exponentially with respect to key parameters of the underlying systems, such as the dimensions of the state and action spaces. Such exponential dependence prohibits a direct application of these theoretically elegant RL algorithms to large-scale applications. In summary, without further assumptions, RL is hard, both in practice and in theory. In this thesis, we attempt to gain purchase on the RL problem by introducing additional assumptions and sources of information.
The first contribution of this thesis comes from improving RL sample complexity via imitation learning. By leveraging an expert's demonstrations, imitation learning significantly simplifies the task of exploration. We consider two settings in this thesis: interactive imitation learning, where an expert is available to query during training, and imitation learning from observation alone, where we only have a set of demonstrations consisting of observations of the expert's states (no expert actions are recorded). We study, both in theory and in practice, how one can imitate experts to reduce sample complexity compared to a pure RL approach. The second contribution comes from model-free reinforcement learning. Specifically, we study policy evaluation by building a general reduction from policy evaluation to no-regret online learning, an active research area with a well-established theoretical foundation. Such a reduction creates a new family of algorithms for provably correct policy evaluation under very weak assumptions on the generating process. We then provide a thorough theoretical and empirical study of two model-free exploration strategies: exploration in action space and exploration in parameter space. The third contribution of this work comes from model-based reinforcement learning. We provide the first exponential sample complexity separation between model-based RL and general model-free RL approaches. We then provide a PAC model-based RL algorithm that achieves sample efficiency simultaneously for many interesting MDPs, such as tabular MDPs, factored MDPs, Lipschitz continuous MDPs, low-rank MDPs, and linear quadratic control. We also provide a more practical model-based RL framework, called Dual Policy Iteration (DPI), which integrates optimal control, model learning, and imitation learning. Furthermore, we show a general convergence analysis that extends the existing approximate policy iteration theories to DPI. DPI generalizes, and provides the first theoretical foundation for, recent successful practical RL algorithms such as ExIt and AlphaGo Zero [Anthony et al., 2017, Silver et al., 2017], and offers a theoretically sound and practically efficient way of unifying model-based and model-free RL approaches.
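The two model-free exploration strategies studied above differ only in where randomness is injected. The sketch below contrasts them on an assumed linear policy; all shapes and noise scales are illustrative.

```python
# Action-space exploration perturbs the chosen action each step;
# parameter-space exploration perturbs the policy weights (and can keep
# the same perturbation for a whole episode, giving coherent behavior).
import numpy as np

rng = np.random.default_rng(0)

def policy(theta: np.ndarray, s: np.ndarray) -> np.ndarray:
    return theta @ s                          # linear policy, for illustration

def explore_action_space(theta, s, sigma=0.1):
    # add Gaussian noise to the action itself
    return policy(theta, s) + sigma * rng.normal(size=theta.shape[0])

def explore_parameter_space(theta, s, sigma=0.1):
    # act greedily under a randomly perturbed copy of the parameters
    theta_pert = theta + sigma * rng.normal(size=theta.shape)
    return policy(theta_pert, s)
```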
- Supplementary Content
- 10.4225/03/5893fe61e42bf
- Feb 3, 2017
- Figshare
Swarm robotics studies how large numbers of relatively simple, physically embodied agents can be designed so that a desired collective behavior emerges from the local interactions among agents and between the agents and the environment. It is a novel approach to the coordination of large numbers of robots, inspired by the observation of social insects such as ants, termites, wasps, and bees, which stand as fascinating examples of how many simple individuals can interact to create collectively intelligent systems. Social insects are known to coordinate their actions to accomplish tasks beyond the capabilities of a single individual: termites build large and complex mounds, army ants organize impressive foraging raids, and ants can collectively carry large prey. Such coordination capabilities are still beyond the reach of current multi-robot systems. In this research, the most recent developments of swarm robot systems are reviewed by classifying the primary research axes in terms of the principal topic areas that have generated significant research, classifying the specific research scopes within each axis, and identifying the key open issues in those scopes. To bring objective grounding to these areas, this dissertation presents an empirical analysis of a multi-agent foraging task using reinforcement learning algorithms. Foraging is one of the most commonly used test applications in multi-agent systems: the agents must find pucks (prey) and bring them to the home (nest) location in an environment while satisfying given constraints; the task is also known as prey retrieval. Reinforcement learning, which has been used extensively in applications such as industrial control, time sequence prediction, and robot soccer competitions, is used to tackle the modeled foraging task. In this thesis, a multi-agent foraging task is modeled using the Webots (1996) simulation software, and reinforcement learning algorithms and policies are tested. One challenge that arises in reinforcement learning is the exploration-exploitation dilemma; a novel learning policy (the FIFO-list learning policy) is proposed and compared against the learning policies reported in the literature to tackle it. An improved reinforcement learning algorithm (the Cautious-Q learning algorithm), a combined strategy of on-policy and off-policy learning, is also proposed and its performance compared with available learning algorithms. The improved learning algorithm and learning policy are implemented in a real environment with Khepera 2 mobile robots, and results are presented.