Wednesday , 22 May 2013

France: Postdoc position : Fault tolerant distributed branch-and-bound on the grid

We are looking for Postdoc candidates on *Fault tolerant distributed branch-and-bound on the grid*.
Details : http://www.lifl.fr/~derbel/
Location : INRIA Lille Nord Europe
Team: Dolphin http://dolphin.lille.inria.fr
Starts : Autumn 2010
Supervisors/Contacts : bilel.derbel@lifl.fr, nouredine.melab@lifl.fr



Full description

Scientific context:
Many real-life problems coming from various application domains are reputed to be hard combinatorial optimisation problems (COPs), e.g., in high-energy physics, space and earth observation, life sciences and health, nanotechnology, etc. To efficiently solve such problems, one needs more and more ressources that are uselly aggregated into large scale computational grids. Having a huge amount of computational ressources at hand, it is however not straightforward to solve hard COPs in an efficient way. In particular, in a fully distributed environment (e.g., peer-to-peer networks, grids, cluster of clusters, desktops, etc.) communication latency, component heterogenity, ressource volatility, to mention a few, make it a challenging task to solve large COP instances. Designing high quality, scalable and fault-tolerant algorithms that can be effectively deployed to solve COPs, is in fact a hot research topics.

Generally speaking, the ‘quality’ of a parallel/distributed algorithm for solving COPs can be measured using several inter-dependent metrics. For instance, one may consider the time needed to solve a given COP, the load, or the congestion created at the network. The scalability of an algorithm can then be measered by its ability to be executed on an increasing number of processors without performance degradation. Within the context of large scale grids, the ability of a distributed algorithm to handle many simultaneous faults is also a fundamental issue. Failures can be of different types and have many origins, e.g., network crash, nodes volatility, dynamicity, etc.  Fault-tolerance mechanisms are thus mandatory,  but  they often imply additional communications and control overheads leading to efficiency loss if not handled very carefully.

Goal:
Having these issues in mind, we focus on designing exact methods to solve large COPs instances by taking benefit from the large scale property of computational grids while overcoming the limitations that prevent the underlying distributed algorithms to scale, to be time efficient and fault-resilient. More specifically, our team is designing distributed/parallel branch-and-bound (B&B) algorithms that were proved extremly efficient [1]. Very recently, we have investigated a fully distributed B&B framework. The preliminary results show that our framework is efficient in a static environment [2], i.e., when the computational nodes are fully available and do not vanish. Our distributed approach is based on organising the computational nodes according a logical peer-to-peer overlay [3] allowing them to balance B&B work units efficiently.

The purpose of this project is to extend our approach to a dynamic environment where nodes can fail, join or leave while the distributed B&B is in progress. This implies the design of efficicent and fully distributed faul-tolerant and/or self-stabilizing distributed algorithms. In addition, we want our solutions to be effective. In other words, we aim at designing fault-tolerant mechanisms that can be deployed in a grid environment without performance degradation. One hot issue is to design and maintain a faul-tolerant overlay connecting the computational nodes and allowing them to balance the load efficiently.

To validate our findings, both a formal/theoretical and an experimental study should be conducted. In fact, in the context of exact B&B algorithms, we must guaranty that the computed solution of a given COP is the optimal one. Thus, the proposed fault-tolerant mechanisms must be proved to be correct in a formal and rigorous way. Furthermore, since our goal is to tackle previously unsolved large instance COPs, the designed algorithms should be tested in a grid environment to experimentally validate their effectiveness.  More specifically,  extensive experiments should be conducted on top of the french national GRID’5000 plateform [4].

Required skills:  Distributed algorithms, fault-tolerance, dynamic peer-to-peer overlay, grid
computing

Bilbiographical References:
1. M. Mezmaz, N. Melab and E-G. Talbi. A Grid-enabled Branch and Bound Algorithm for Solving Challenging Combinatorial Optimization Problems. In Proc. of 21th IEEE Intl. Parallel and Distributed Processing Symp., Long Beach, California, March 26th-30th, 2007.
2. M. Dajmaï, B. Derbel, and N. Melab. Distributed Branch-and-Bound Algorithm: A Pure Peer-to-Peer Approach. Grid’5000 spring school. April 2010.
3. A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In Lecture Notes in Computer Science, pages 329–350, 2001.
4. Grid’5000 : https://www.grid5000.fr

To apply: Interested candidates are invited to send a detailed cv to : bilel.derbel@lifl.fr ; nouredine.melab@lifl.fr

Comments are closed.