Grupo de Tratamiento de Imágenes

 

 

News and Events 

"From traditional multi-stage machine learning to end-to-end deep learning for computer vision applications" 

Ana Maqueda

E.T.S. Ing. Telecomunicación, Universidad Politécnica de Madrid, Sept 2018, "Cum Laude".

Ph.D. thesis Directors: Narciso  García Santos y Carlos Roberto del Blanco Adán.

The renaissance of Deep Neural Networks in the era of big data, along with the use of high performance hardware that reduces computational time, have changed the paradigm of machine learning, specially in the field of computer vision. Whereas systems based on traditional machine learning rely on multiple stages and hand-crafted features to get the insight of the problem, Convolutional Neural Networks automatically learn the features that maximize the learning accuracy directly from raw images in an end-to-end manner. The purpose of this dissertation is to show the gap between traditional multi-stage learning systems and end-to-end deep learning systems, addressing different applications for a qualitative comparison.

First, an expert-knowledge recognition system has been developed to deal with dynamic hand gestures. The key aspects of this system are hand-crafted image and video descriptors, and also the pipeline of the whole system. These descriptors have been designed to face difficulties of vision-based approaches such as illumination changes, intra-class and inter-class variances, and multiple scales. The design of the multiple stages of the system solve intermediate steps that are necessary to successfully apply the previous descriptors. Since the proposed hand-gesture recognition system has been designed for a human-computer interface, it comprises detection and tracking stages to localize the object of interest, and a recognition stage to categorize the performed gesture.

Second, DL approaches have been proposed for different computer vision applications. Research efforts have focused on building these types of end-to-end systems to face the weaknesses present in traditional learning. Unlike previous approach, they do not need multiple stages to perform the target task, nor feature engineering. Their architecture designs rely on the task to be solved, its complexity, and the available amount of data. These guidelines have been applied to common vision-based applications such vehicle detection, and hand-gesture recognition, but also to more challenging situations, such as robotics applications.