This paper presents a novel approach for performing intuitive gesture-based interaction using depth data acquired by Kinect. The main challenge in enabling immersive gestural interaction is dynamic gesture recognition, which can be formulated as a combination of two tasks: gesture recognition and gesture pose estimation. Incorporating a fast and robust pose estimation method lessens this burden to a great extent. In this paper we propose a direct method for real-time hand pose estimation. Based on the range images, a new version of the optical flow constraint equation is derived, which can be utilized to directly estimate 3D hand motion without imposing other constraints. Extensive experiments illustrate that the proposed approach performs properly in real time with high accuracy. As a proof of concept, we demonstrate the system performance in 3D object manipulation on two different setups: desktop computing and a mobile platform. This reveals the system's capability to accommodate different interaction procedures. In addition, a user study is conducted to evaluate learnability, user experience and interaction quality of 3D gestural interaction in comparison to 2D touchscreen interaction.
A direct method for recovering three-dimensional (3D) head motion parameters from a sequence of range images acquired by Kinect sensors is presented. Based on the range images, a new version of the optical flow constraint equation is derived, which can be used to directly estimate 3D motion parameters without the need to impose other constraints. Since all calculations with the new constraint equation are based on the range images, Z(x, y, t), the existing techniques and experience accumulated on the topic of motion from optical flow can be applied directly, simply by treating the range images as normal intensity images I(x, y, t). In this reported work, it is demonstrated how to employ the new optical flow constraint equation to recover the 3D motion of a moving head from sequences of range images, and furthermore, how to use an old trick to handle the case when the optical flow is large. It is shown, in the end, that the performance of the proposed approach is comparable with that of some state-of-the-art approaches that use range data to recover 3D motion parameters.
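For concreteness, one common way to write such a constraint, sketched here under assumed notation in the spirit of the classical elevation-rate constraint for range images (the paper's exact form may differ): treating Z(x, y, t) like an intensity image, the total rate of change of depth at a tracked surface point must equal the z-component of that point's 3D velocity, which under orthographic projection gives

    \[ Z_x V_x + Z_y V_y - V_z + Z_t = 0. \]

Substituting the rigid-motion model \( \mathbf{V} = \mathbf{t} + \boldsymbol{\omega} \times (x, y, Z)^\top \) yields one equation per pixel that is linear in the six motion parameters \( (t_x, t_y, t_z, \omega_x, \omega_y, \omega_z) \):

    \[ Z_x (t_x + \omega_y Z - \omega_z y) + Z_y (t_y + \omega_z x - \omega_x Z) - (t_z + \omega_x y - \omega_y x) + Z_t = 0. \]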
The number of mobile devices such as mobile phones and PDAs has increased dramatically over recent years. New mobile devices are equipped with integrated cameras and large displays, which make interaction with the device easier and more efficient. Although most previous work on interaction between humans and mobile devices is based on 2D touch-screen displays, camera-based interaction opens a new way to manipulate in the 3D space behind the device, within the camera's field of view. This paper suggests the use of particular patterns of local image orientation, called Rotational Symmetries, to detect and localize human gestures. The relative rotation and translation of a gesture between consecutive frames are estimated by extracting stable features. This information can then be used to facilitate 3D manipulation of virtual objects in various mobile-device applications.
Head pose estimation plays an essential role in bridging the information gap between humans and computers. Conventional head pose estimation is mostly performed on images captured by cameras; however, accurate and robust pose estimation is often problematic. In this paper we present an algorithm for recovering the six degrees of freedom (DOF) of motion of a head from a sequence of range images taken by the Microsoft Kinect for Xbox 360. The proposed algorithm utilizes a least-squares minimization of the difference between the measured rate of change of depth at a point and the rate predicted by the depth rate constraint equation. We segment the human head from its surroundings and background, and then estimate the head motion. Our system is capable of recovering the six DOF of the head motion of multiple people in one image. The proposed system is evaluated in our lab and achieves superior results.
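Since this abstract describes a least-squares fit to the depth rate constraint, a minimal sketch of that step may help, assuming the orthographic rigid-motion form written after the second abstract above. The variable names, gradient and differencing choices, and unit time step are illustrative assumptions, not the authors' implementation.

    # Hedged sketch: stack one linear constraint per head pixel and solve
    # for the 6-DOF motion (t_x, t_y, t_z, w_x, w_y, w_z) by least squares.
    import numpy as np

    def estimate_motion(Z, Z_next, mask):
        """Z, Z_next: HxW depth maps; mask: boolean HxW head region."""
        Zy, Zx = np.gradient(Z)          # spatial depth gradients
        Zt = Z_next - Z                  # temporal derivative (unit dt)
        ys, xs = np.nonzero(mask)
        z = Z[ys, xs]
        zx, zy, zt = Zx[ys, xs], Zy[ys, xs], Zt[ys, xs]
        A = np.column_stack([
            zx, zy, -np.ones_like(zx),   # coefficients of t_x, t_y, t_z
            -(z * zy + ys),              # coefficient of w_x
            z * zx + xs,                 # coefficient of w_y
            xs * zy - ys * zx,           # coefficient of w_z
        ])
        b = -zt
        m, *_ = np.linalg.lstsq(A, b, rcond=None)
        return m                         # (t_x, t_y, t_z, w_x, w_y, w_z)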
Currently, the most common way to control an electric wheelchair is to use a joystick. However, some individuals, such as quadriplegia patients, are unable to operate joystick-driven electric wheelchairs due to severe physical disabilities. This paper proposes a novel head pose estimation method to assist such patients: head motion parameters are employed to control and drive an electric wheelchair. We introduce a direct method for estimating user head motion, based on a sequence of range images captured by Kinect. In this work, we derive a new version of the optical flow constraint equation for range images and show how it can be used to estimate head motion directly. Experimental results reveal that the proposed system works with high accuracy in real time. We also show simulation results for navigating the electric wheelchair by recovering the user's head motion.
This paper presents a novel approach for performing intuitive 3D gesture-based interaction using depth data acquired by Kinect. Unlike current depth-based systems that focus only on the classical gesture recognition problem, we also consider 3D gesture pose estimation for creating immersive gestural interaction. We formulate the gesture-based interaction system as a combination of two separate problems, gesture recognition and gesture pose estimation. We focus on the second problem and propose a direct method for recovering hand motion parameters. Based on the range images, a new version of the optical flow constraint equation is derived, which can be utilized to directly estimate 3D hand motion without imposing other constraints. Our experiments illustrate that the proposed approach performs properly in real time with high accuracy. As a proof of concept, we demonstrate the system performance in 3D object manipulation, an application intended to explore the system's capabilities in real-time biomedical applications. Finally, a system usability test is conducted to evaluate learnability, user experience and interaction quality of 3D interaction in comparison to 2D touch-screen interaction.
This paper proposes a simple and yet effective technique for shape-based scene analysis, in which detection and/or tracking of specific objects or structures in the image is desirable. The idea is based on using predefined binary templates of the structures to be located in the image. The template is matched to contours in a given edge image to locate the designated entity. These templates are allowed to deform in order to deal with variations in the structure's shape and size. Deformation is achieved by dividing the template into segments. The dynamic programming search algorithm is used to accomplish the matching process, achieving very robust results in cluttered and noisy scenes in the applications presented.
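A minimal sketch of the segment-wise deformable template matching described above may clarify the idea. The cost model, deformation penalty, and all names below are illustrative assumptions, not the authors' exact formulation: each template segment may shift independently, a smoothness term penalizes disagreement between consecutive segments, and dynamic programming picks the jointly cheapest placement.

    import numpy as np

    def match_cost(edge_img, segment_pts, offset):
        """Cost of placing one template segment at a given offset:
        fraction of segment points that miss an edge pixel."""
        h, w = edge_img.shape
        hits = 0
        for (x, y) in segment_pts:
            u, v = x + offset[0], y + offset[1]
            if 0 <= v < h and 0 <= u < w and edge_img[v, u]:
                hits += 1
        return 1.0 - hits / len(segment_pts)

    def dp_match(edge_img, segments, offsets, smooth=0.1):
        """Choose one offset per segment, minimizing local mismatch plus
        a penalty on offset changes between consecutive segments."""
        n, m = len(segments), len(offsets)
        cost = np.full((n, m), np.inf)
        back = np.zeros((n, m), dtype=int)
        for j, off in enumerate(offsets):
            cost[0, j] = match_cost(edge_img, segments[0], off)
        for i in range(1, n):
            for j, off in enumerate(offsets):
                local = match_cost(edge_img, segments[i], off)
                for k, prev in enumerate(offsets):
                    deform = smooth * np.hypot(off[0] - prev[0], off[1] - prev[1])
                    c = cost[i - 1, k] + local + deform
                    if c < cost[i, j]:
                        cost[i, j], back[i, j] = c, k
        j = int(np.argmin(cost[-1]))          # backtrack the best chain
        path = [j]
        for i in range(n - 1, 0, -1):
            j = back[i, j]
            path.append(j)
        return [offsets[j] for j in reversed(path)]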
A simple and cost-effective wearable gaze tracking system is designed to observe the reading pattern of patients with reading disorders, in order to facilitate the work of ophthalmologists and the multidisciplinary treating teams in making reliable diagnoses. The system consists of two miniaturized cameras mounted on a headset: one for eye tracking and one for the scene. The eye tracking information is combined with information extracted from the picture of the forward-looking camera to identify the gaze point online. When reading a text, the gaze point moves and a reading pattern is created.
To guarantee the quality of service (QoS) of a wireless network, a new packet scheduling algorithm using a cross-layer design technique is proposed in this article. First, the demand for packet scheduling in multimedia transmission over wireless networks and the deficiencies of existing packet scheduling algorithms are analyzed. Then the model of the QoS-guaranteed packet scheduling (QPS) algorithm for high speed downlink packet access (HSDPA) and the cost function of packet transmission are designed. The calculation of packet delay time for wireless channels is expounded in detail, and complete steps to realize the QPS algorithm are given. Simulation results show that the QPS algorithm, which schedules packets according to the calculated cost values, can effectively improve delay and throughput performance.
In high speed networks, data streams are rapid, bursty and continuous, which makes real-time clustering of data streams difficult. An improved SS-tree structure is designed in this paper to keep summarized information about data streams, and a high speed data stream clustering algorithm based on the improved SS-tree is proposed. In order to process bursting streams in time, caching and piggyback mechanisms are used: the chaining buffers in the improved SS-tree temporarily store the data stream objects that cannot be processed immediately, and the buffered contents are later piggybacked together with the following data. To cope with the high arrival rate of data streams, a two-phase clustering framework is adopted: a pre-aggregation phase produces local micro-clusters, which then take part in a global clustering phase based on the improved SS-tree. Experimental results show that the proposed algorithm has better clustering accuracy in high-speed networks, and that the improved SS-tree can effectively cluster high speed data streams with good applicability.
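For readers unfamiliar with two-phase stream clustering, a hedged sketch of the micro-cluster summaries typically kept at the pre-aggregation phase may help. The (count, linear sum, square sum) bookkeeping below is the standard cluster-feature idea from the stream-clustering literature, shown here as an assumption about what the tree nodes summarize; the SS-tree structure and chaining buffers themselves are omitted.

    import numpy as np

    class MicroCluster:
        """Summary of a local cluster: points can be absorbed in O(d)
        and the raw data never needs to be revisited."""
        def __init__(self, dim):
            self.n = 0
            self.ls = np.zeros(dim)      # linear sum of points
            self.ss = np.zeros(dim)      # elementwise square sum
        def absorb(self, x):
            self.n += 1
            self.ls += x
            self.ss += x * x
        def centroid(self):
            return self.ls / max(self.n, 1)
        def radius(self):
            c = self.centroid()
            var = np.maximum(self.ss / max(self.n, 1) - c * c, 0.0)
            return float(np.sqrt(var.sum()))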
Low visibility on expressways caused by heavy fog and haze is a major cause of traffic accidents. Real-time estimation of atmospheric visibility is an effective way to reduce traffic accident rates. With the development of computer technology, estimating atmospheric visibility via computer vision has become a research focus; however, estimation accuracy needs to be enhanced, since fog and haze are complex and time-varying. In this paper, a total bounded variation (TBV) approach to estimating low visibility (less than 300 m) is introduced. Surveillance images of fog and haze are processed as blurred images (pseudo-blurred images), while surveillance images at selected road points on sunny days are treated as clear images; fog and haze are thus considered as noise superimposed on the clear images. By combining the image spectrum and TBV, the features of foggy and hazy images can be extracted, and the extraction results are compared with the features of images taken on sunny days. First, low visibility surveillance images can be filtered out according to the spectrum features of foggy and hazy images: for foggy and hazy images with visibility less than 300 m, the high-frequency coefficient ratio of the Fourier (discrete cosine) transform is less than 20%, while the low-frequency coefficient ratio is between 100% and 120%. Second, the relationship between TBV and real visibility is established based on machine learning and piecewise stationary time series analysis; the resulting piecewise function can be used for visibility estimation. Finally, the proposed visibility estimation approach is validated on real surveillance video data, and the results are compared with those of an image contrast model. The video data were collected from the Tongqi expressway, Jiangsu, China; a total of 1,782,000 frames were used, and the relative errors of the proposed approach are less than 10%.
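A hedged sketch of the spectrum-feature screening step: compare the DCT coefficient mass of a surveillance frame against a clear-day reference at the same road point. The 20% and 100-120% thresholds follow the figures quoted in the abstract; the exact split between low- and high-frequency bands is an illustrative assumption.

    import numpy as np
    from scipy.fftpack import dct

    def dct2(img):
        """Separable 2D type-II DCT."""
        return dct(dct(img, axis=0, norm='ortho'), axis=1, norm='ortho')

    def band_ratios(frame, clear_ref, cutoff=0.25):
        """Ratios of high- and low-frequency |DCT| mass, frame vs. reference."""
        F, R = np.abs(dct2(frame)), np.abs(dct2(clear_ref))
        h, w = F.shape
        low = np.zeros_like(F, dtype=bool)
        low[:int(h * cutoff), :int(w * cutoff)] = True   # low-frequency corner
        hi_ratio = F[~low].sum() / R[~low].sum()
        lo_ratio = F[low].sum() / R[low].sum()
        return hi_ratio, lo_ratio

    def looks_like_low_visibility(frame, clear_ref):
        hi, lo = band_ratios(frame, clear_ref)
        return hi < 0.20 and 1.00 <= lo <= 1.20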
In this work, we present a technique that allows for accurate estimation of frequencies in higher dimensions than the original image content. The technique uses asymmetrical Principal Component Analysis together with the Discrete Wavelet Transform (aPCA-DWT). For example, high quality content can be generated from low quality cameras, since the necessary frequencies can be estimated through reliable methods. Within our research, we build models for interpreting facial images, with which super-resolution versions of human faces can be created. We have performed several different experiments, extracting the frequency content in order to create models with aPCA-DWT. The results are presented along with experiments on deblurring and zooming beyond the original image resolution. For example, when an image is enlarged 16 times in decoding, the proposed technique outperforms interpolation by more than 7 dB on average.
This paper proposes an intuitive wireless sensor/actuator based communication network for human-animal interaction in a digital zoo. In order to enable effective observation and control of wildlife, we have built a wireless sensor network: 25 video-transmitting nodes are installed for animal behavior observation, and experimental vibrotactile collars have been designed for effective control in an animal park.
The goal of our research is twofold. First, to provide interaction between digital users and animals, and to monitor animal behavior for safety purposes. Second, to investigate how animals can be controlled or trained using vibrotactile stimuli instead of electric stimuli.
We have designed a multimedia sensor network for human-animal-machine interaction and evaluated the effect of the human-animal-machine state communication model in field experiments.
A novel metric is proposed in the present report for evaluating the goodness of fit between the distribution functions of two samples. We extend the use of the proposed criterion to the case of the generalized Zipf distribution. A detailed mathematical analysis of the proposed metric, which is embodied in a hypothesis test, is also provided.
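For reference, one standard parameterization of the generalized Zipf (Zipf-Mandelbrot) distribution that such a goodness-of-fit test would target; the paper's exact parameterization is not reproduced here, so this form is an assumption:

    \[ f(k; s, q, N) = \frac{(k + q)^{-s}}{\sum_{i=1}^{N} (i + q)^{-s}}, \qquad k = 1, \dots, N, \]

with exponent \( s > 0 \) and shift \( q \ge 0 \); the classical Zipf distribution is the special case \( q = 0 \).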
This paper provides a transparent and speculative algorithm for content based web page prefetching. The algorithm relies on a profile based on the Internet browsing habits of the user. It aims at reducing the perceived latency when the user requests a document by clicking on a hyperlink. The proposed user profile relies on the frequency of occurrence of selected elements forming the web pages visited by the user. These frequencies are employed in a mechanism for predicting the user's future actions. For the anticipation of an adjacent action, the text anchored around each of the outbound links is used and weights are assigned to these links. Some of the linked documents are then prefetched and stored in a local cache according to the assigned weights. The proposed algorithm was tested against three different prefetching algorithms and yielded improved cache–hit rates at a moderate bandwidth overhead. Furthermore, the precision of inferring the user's preferences is evaluated through recall–precision curves. Statistical evaluation testifies that the achieved recall–precision improvement is significant.
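A minimal sketch of the link-weighting idea: score each outbound link by how well its surrounding anchor text matches the user's term frequencies, then prefetch the highest-weighted documents. The normalization and the top-k policy are illustrative assumptions, not the paper's exact mechanism.

    from collections import Counter

    def link_weight(anchor_terms, profile: Counter):
        """Weight = sum of profile-relative frequencies of the anchor terms."""
        total = sum(profile.values()) or 1
        return sum(profile[t] / total for t in anchor_terms)

    def pick_prefetch(links, profile, k=3):
        """links: list of (url, anchor_terms); returns top-k urls by weight."""
        ranked = sorted(links, key=lambda l: link_weight(l[1], profile),
                        reverse=True)
        return [url for url, _ in ranked[:k]]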
The present report provides a novel transparent and speculative algorithm for content based web page prefetching. The proposed algorithm relies on a user profile that is dynamically generated while the user browses the Internet and is updated over time. The objective is to reduce the user-perceived latency by anticipating future actions. To this end, the AdaBoost algorithm is used to automatically annotate the outbound links of a page with a predefined set of "labels". Afterwards, the links whose labels are relevant to the user's preferences are prefetched in an effort to reduce the perceived latency while the user is surfing the Internet. A comparison of the proposed algorithm against two other prefetching algorithms yields improved cache-hit rates at a moderate bandwidth overhead.
We present a robust vision-based technology for hand and finger detection and tracking that can be used in many CHI scenarios. The method works in real-life setups, does not assume any predefined conditions, and does not require any additional expensive hardware. It fits into the user's environment without major changes and can hence be used within the ambient intelligence paradigm. Another contribution is interaction using glass, a natural yet challenging environment to interact with: we introduce the concept of an "invisible information layer" embedded into normal window glass that is thereafter used as an interaction medium.
We propose a simple and yet effective technique for shape-based ear localization. The idea is based on using a predefined binary ear template that is matched to ear contours in a given edge image. To cope with changes in ear shapes and sizes, the template is allowed to deform. Deformation is achieved by dividing the template into segments. The dynamic programming search algorithm is used to accomplish the matching process, achieving very robust localization results in various cluttered and noisy setups.
We introduce an idea for connecting timekeeping devices through the Internet, aiming at assigning people their individual personal time, to loosen the strict rule of time synchronization that, in many cases, causes problems in accessing available resources. Information about these resources, the users, and their plans is utilized to accomplish the task. Time scheduling to assign users their individual time, and the readjustment of their timekeeping devices, is done implicitly so that they do not feel any abnormal changes during their day. This leads to a nonlinear relationship between real (absolute) time and personal time. We explain the concept, give examples, and suggest a framework for the system.
Video conferencing is a very effective tool for e-learning. Most available video conferencing systems suffer from a major drawback: the lack of eye contact between participants. In this paper we present a new scheme for establishing eye contact in e-learning sessions. The scheme assumes a video conferencing session with a "one teacher, many students" arrangement. In our system, eye contact is achieved without the need for any gaze estimation technique. Instead, we "generate the gaze" by letting the user communicate his visual attention to the system through head-eye coordination. To enable real-time and precise head-eye coordination, a head motion tracking technique is required. Unlike traditional head tracking systems, our procedure mounts the camera on the user's head rather than in front of it. This configuration achieves much better resolution and thus leads to better tracking results. Promising results obtained from both demo and real-time experiments demonstrate the effectiveness and efficiency of the proposed scheme. Although this paper concentrates on e-learning, the proposed concept can easily be extended to interaction with social robots, where introducing eye contact between humans and robots would be of great advantage.
The aim of this work is to introduce a prototype for monitoring tremor diseases using computer vision techniques. While vision has been used for this purpose before, the system we introduce differs intrinsically from traditional systems: the camera is placed on the user's body rather than in front of it, thus reversing the whole process of motion estimation. We call this active motion tracking. Active vision is simpler to set up and achieves more accurate results than traditional arrangements, which we refer to here as "passive". One main advantage of active tracking is its ability to detect even tiny motions with a simple setup, which makes it very suitable for monitoring tremor disorders.
Most of the electric wheelchairs available on the market are joystick-driven and therefore assume that the user is able to use hand motion to steer the wheelchair. This does not apply to many users, such as quadriplegia patients, who are only capable of moving the head. This paper presents a vision-based head motion tracking system to enable such patients to control the wheelchair. The novel approach that we suggest is to use active rather than passive vision to achieve head motion tracking: the camera is placed on the user's head rather than in front of it. This makes tracking easier and more accurate and enhances the resolution, as we demonstrate both theoretically and experimentally. The proposed tracking scheme is then used successfully to steer our electric wheelchair through a real-world environment.
In this paper we suggest a promising solution for overcoming the problems of delivering e-learning to areas with absent or deficient infrastructure for Internet and mobile communication. We present a simple, reasonably priced and efficient communication platform for providing e-learning, based on wireless ad-hoc networks. We also present a preemptive routing protocol suitable for real-time video communication over wireless ad-hoc networks. Our results show that this routing protocol can significantly improve the quality of the received video, making the suggested system not only able to overcome the infrastructure barrier but also capable of delivering high quality e-learning material.
In this paper we investigate important issues for real-time video over wireless ad-hoc networks at different layers. Many error control methods for this setting use multiple streams and multipath routing. We have therefore developed a new proactive link-state routing protocol that finds the available routes in the network without causing any interruption in the video traffic between the source and the destination. An open-source MPEG-4 codec is also employed to obtain efficient video quality.
Connectivity in ad-hoc networks is a fundamental, but to a large extent still unsolved, problem. In this paper we consider the connectivity problem when a number of nodes are uniformly distributed within a unit square, limiting ourselves to one-hop and two-hop connectivity. For one-hop connectivity we find the exact analytical solution; for two-hop connectivity we derive lower and upper bounds.
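As a numerical companion to such analysis, a Monte Carlo check is easy to sketch. This reads one-hop connectivity as every pair of nodes being within radio range r of each other, which is an assumption about the paper's definition; the parameter names are likewise illustrative.

    import numpy as np

    def p_one_hop(n, r, trials=10000, seed=0):
        """Estimate P(all n uniform nodes in the unit square are pairwise
        within range r), i.e. the network is fully one-hop connected."""
        rng = np.random.default_rng(seed)
        hits = 0
        for _ in range(trials):
            pts = rng.random((n, 2))
            d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
            if d.max() <= r:
                hits += 1
        return hits / trials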
Mobile TV is a new and interesting area in the telecommunication industry. The technology for sending live video to mobile clients is characterized by relatively low CPU processing power, low network resources, and low display resolution. In this paper we discuss a solution to all of these problems using application layer multicasting, which can significantly reduce the bitrate and computing resources required by each client while increasing the received video quality. Several different methods for splitting the video into substreams are discussed, and simulations for a local wireless ad-hoc network are performed. A system for application layer multicasting using layered H.264 is also presented.
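As an illustration of one possible splitting method (the paper discusses several; this temporal round-robin split is simply the most compact to sketch and is an assumption, not the paper's chosen scheme):

    def split_round_robin(frames, k):
        """Temporal splitting: frame i goes to substream i mod k, so any
        subset of substreams still decodes to a reduced-frame-rate video."""
        return [frames[i::k] for i in range(k)]

A client short on bandwidth can then subscribe to fewer substreams and still play a lower-frame-rate version of the stream.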
Sending video over wireless sensor networks is a challenging task. The encoding and transmission of video is very resource hungry, while sensor nodes have very limited resources in terms of communication bandwidth, memory and computation; in conventional codecs, the encoder is typically 5-10 times more complex than the decoder. In this paper we present a practical implementation of a Wyner-Ziv video codec in which the reverse asymmetry in complexity between encoder and decoder is achieved. We also present the sensor network platform used in this demonstration, known as Fleck-3, as well as two different co-processor daughterboards for image processing. The daughterboards are compared in terms of speed and energy consumption.
Wyner-Ziv video coding provides low complexity encoding and high complexity decoding and is therefore a promising approach for video coding in wireless sensor networks. We demonstrate our practical implementation of a Wyner-Ziv video codec. The hardware platform used in our camera sensor network is the Fleck camera developed by CSIRO ASL in Brisbane, Australia.
In this paper we present an approach to providing efficient low-complexity video encoding for wireless sensor networks. The method removes the most time-consuming task, motion estimation, from the encoder: instead, the decoder performs motion prediction based on the available decoded frames and sends the predicted motion vectors to the encoder. We present results based on a modified H.264 implementation. Our results show that this approach can provide rather good coding efficiency even for relatively high network delays.
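A hedged sketch of the decoder-side prediction loop: the decoder measures block motion between its two most recent decoded frames and extrapolates it (here, by simply assuming constant velocity) as the prediction for the next frame, sending those vectors back so the encoder can skip motion search. The block size, search range and extrapolation rule are illustrative assumptions, not the modified H.264 details.

    import numpy as np

    def block_motion(prev, cur, bs=16, rng=8):
        """Exhaustive-search motion vectors from prev to cur, per block."""
        prev, cur = prev.astype(int), cur.astype(int)
        h, w = cur.shape
        mvs = {}
        for by in range(0, h - bs + 1, bs):
            for bx in range(0, w - bs + 1, bs):
                blk = cur[by:by + bs, bx:bx + bs]
                best, best_mv = np.inf, (0, 0)
                for dy in range(-rng, rng + 1):
                    for dx in range(-rng, rng + 1):
                        y, x = by + dy, bx + dx
                        if 0 <= y <= h - bs and 0 <= x <= w - bs:
                            sad = np.abs(prev[y:y + bs, x:x + bs] - blk).sum()
                            if sad < best:
                                best, best_mv = sad, (dy, dx)
                mvs[(by, bx)] = best_mv
        return mvs

    def predict_next(mvs):
        """Constant-velocity assumption: reuse the last measured vectors
        as the prediction sent back to the encoder."""
        return dict(mvs)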
In this paper we present our approach of combining radio frequency identification (RFID) with a wireless camera sensor network to identify and track animals at a zoo. We have developed and installed 25 cameras covering the whole zoo. The cameras are fully autonomous and configure themselves into a wireless ad-hoc network. At strategic locations, RFID readers are deployed to identify animals in close proximity. The camera network deployed in the zoo continuously tracks animals in its field of view. By fusing data from the camera system and the RFID readers, we obtain semi-continuous tracking of individual animals. The camera network has been running in the zoo for more than one year, and about 5,000 hours of video have been captured and recorded, giving us a very large dataset for offline development and testing of computer vision algorithms for animal detection and tracking.
In this chapter we describe our work to set up a large scale wireless visual sensor network in a Swedish zoo. The zoo is located close to the Arctic Circle, which makes the environment very hard for this type of deployment. The goal is to make the zoo digitally enhanced, leading to a more attractive and interactive zoo. To reach this goal, the sensed data is processed and semantic information is used to support interaction design, a key component in providing a new type of experience for the visitors. We describe our research work related to the various aspects of a digital zoo.
In spite of the progress made in teleconferencing over the last decades, it is still far from a resolved issue. In this work, we present an intuitive video teleconferencing system, the Embodied Tele-Presence System (ETS), which is based on the concept of embodied interaction. This work presents the results of a user study addressing the hypothesis: "An embodied-interaction-based video conferencing system performs better than a standard video conferencing system in representing nonverbal behaviors, thus creating a 'feeling of presence' of a remote person among his/her local collaborators". Our ETS integrates standard audio-video conferencing with mechanical embodiment of the head gestures of a remote person (as nonverbal behavior) to enhance the level of interaction. To highlight the technical challenges and design principles behind such tele-presence systems, we have also performed a system evaluation, which shows the accuracy and efficiency of our ETS design. The paper further provides an overview of our case study and an analysis of our user evaluation. The user study shows that the proposed embodied interaction approach to video teleconferencing increases 'in-meeting interaction' and enhances the 'feeling of presence' between a remote participant and his collaborators.
Socially interactive systems are embodied agents that engage in social interactions with humans. From a design perspective, these systems are built following a biologically inspired (bio-inspired) design that can mimic and simulate human-like communication cues and gestures. The design of a bio-inspired system usually consists of (i) studying biological characteristics, (ii) designing a similar biological robot, and (iii) motion planning that can mimic the biological counterpart. In this article, we present the design, development, control strategy and verification of our socially interactive bio-inspired robot, the Telepresence Mechatronic Robot (TEBoT). The key contribution of our work is the embodiment of real human neck movements by (i) designing a mechatronic platform based on the dynamics of a real human neck and (ii) capturing the real head movements through our novel single-camera vision algorithm. Our socially interactive bio-inspired system is based on an intuitive integration-design strategy that combines a computer vision based geometric head pose estimation algorithm, a model based design (MBD) approach and real-time motion planning techniques. We have conducted extensive testing to demonstrate the effectiveness and robustness of the proposed system.
In this work we present an interactive video conferencing system specifically designed to enhance the experience of video teleconferencing for a pilot user. We use an Embodied Telepresence System (ETS), previously designed to enhance the video teleconferencing experience of the collaborators, and deploy it in a novel scenario to improve the experience of the pilot user during distance communication: the ETS is used to adjust the view of the pilot user at the remote location (e.g. a remotely held conference or meeting). A velocity profile control for the ETS is developed, implicitly controlled by the head of the pilot user. An experiment was conducted to test whether the view adjustment capability of the ETS increases the collaboration experience of video conferencing for the pilot user. In the user study, participants (pilot users) performed interaction using the ETS and using a traditional computer-based video conferencing tool. Overall, the user study suggests the effectiveness of our approach in enhancing the video conferencing experience for the pilot user.
In this paper we propose a simple and novel method for head pose estimation using 3D geometric modeling. Our algorithm initially employs Haar-like features to detect the face and facial feature areas (more precisely, the eyes). For robust tracking of these regions in a given video sequence, it also uses the Tracking-Learning-Detection (TLD) framework. Based on the two eye areas, we model a pivot point using a distance measure derived from anthropometric statistics and the MPEG-4 coding scheme. This simple geometric approach relies on the structure of the facial features in the camera-view plane to estimate the yaw, pitch and roll of the head. The accuracy and effectiveness of the proposed method are reported on live video sequences against a head-mounted inertial measurement unit (IMU).
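A hedged sketch of the geometric idea: with the two eye centers tracked on the image plane, roll can be read from the angle of the inter-ocular line, while yaw and pitch can be read from the displacement of the eye midpoint relative to a pivot point whose assumed depth offset comes from anthropometric statistics. All names, the small-angle reasoning and the pixel-unit depth are illustrative assumptions, not the paper's exact model.

    import numpy as np

    def head_pose_from_eyes(left, right, pivot, d):
        """left, right, pivot: (x, y) image points; d: assumed pivot
        depth offset expressed in pixels. Returns degrees."""
        roll = np.degrees(np.arctan2(right[1] - left[1], right[0] - left[0]))
        mid = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
        yaw = np.degrees(np.arctan2(mid[0] - pivot[0], d))
        pitch = np.degrees(np.arctan2(mid[1] - pivot[1], d))
        return yaw, pitch, roll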
Great progress in face recognition technology has been made recently. Such advances open up the possibility of building a new generation of search engine, a "Face Google" that searches through photos of persons. It is very challenging to find a person in a very large or extremely large database which might hold the face images of millions or hundreds of millions of people. The indexing technology used in most commercial search engines like Google is very efficient for text-based search; unfortunately, it is no longer useful for image search. A solution is to use partial information (a signature) about all the face images for the search; the retrieval speed is then approximately proportional to the size of a signature image. In this paper we study a totally new way to compress the signature images, based on the observation that face signature images and query images are highly correlated if they come from the same individual. A face signature image can be greatly compressed (by one or two orders of magnitude) by exploiting knowledge of the query images, and we can expect the new compression algorithm to speed up face search 10 to 100 times. The challenge is that the query images are not available when we compress their signature image. Our approach is to transform the face search problem into the so-called Wyner-Ziv coding problem, which can give the same compression efficiency even though the query images are not available until we decompress the signature images. A practical compression scheme based on LDPC codes is developed to compress face signature images.
Huge efforts have been devoted to face recognition technology and remarkable results have been achieved. Such advances open up the possibility of building a new generation of search engine for fetching photos of persons. It is a real computing challenge to find a person in a very large or extremely large database which might hold the face images of millions or hundreds of millions of people. A candidate solution is to use partial information (a signature) about all the face images for the search, making the retrieval speed approximately proportional to the size of a signature image. In this paper we investigate a totally new way to compress the signature images, based on the observation that face signature images and query images are highly correlated if they come from the same individual. A face signature image can be greatly compressed (by one or two orders of magnitude) by exploiting knowledge of the query images, and we can expect the new compression algorithm to speed up face search 10 to 100 times. The challenge is that the query images are not available when we compress their signature image. Our approach is to transform the face search problem into the so-called Wyner-Ziv coding problem, which can give the same compression efficiency even though the query images are not available until we decompress the signature images. A practical compression scheme based on LDPC codes is developed to compress and retrieve face signature images.
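A toy illustration of the Wyner-Ziv idea behind these two abstracts: the encoder sends only the syndrome of the signature bits under a parity-check matrix, and the decoder recovers them by searching the matching coset for the sequence closest to the correlated query bits. Real LDPC codes replace the brute-force search below with belief propagation over much longer blocks; the matrix H and the block size here are toy assumptions.

    import itertools
    import numpy as np

    H = np.array([[1, 1, 0, 1, 0, 0],
                  [0, 1, 1, 0, 1, 0],
                  [1, 0, 1, 0, 0, 1]])   # toy parity-check matrix over GF(2)

    def syndrome(bits):
        """Encoder side: 3 syndrome bits instead of 6 signature bits."""
        return H.dot(bits) % 2

    def decode(syn, side_info):
        """Decoder side: pick the coset member (same syndrome) nearest
        the correlated side information (the query bits)."""
        best, best_d = None, np.inf
        for cand in itertools.product([0, 1], repeat=H.shape[1]):
            cand = np.array(cand)
            if np.array_equal(syndrome(cand), syn):
                d = int(np.abs(cand - side_info).sum())
                if d < best_d:
                    best, best_d = cand, d
        return best

When the side information differs from the signature in few enough bit positions, the decoder recovers the signature exactly despite never receiving it in full, which is the compression gain the abstracts describe.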
Matching a query (reference) image to an image extracted from a database containing (possibly) transformed image copies is an important retrieval task. In this paper we present a general method based on matching the densities of the corresponding image feature vectors using Bregman distances. We consider statistical estimators for some quadratic entropy-type characteristics. In particular, the quadratic Bregman distances can be evaluated in image matching problems whenever images are modeled by random feature vectors in large image databases. Moreover, this method can be used for average-case analysis in optimizing joins of large databases.
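For reference, the Bregman distance generated by a strictly convex function \( \phi \) is the standard

    \[ D_\phi(p, q) = \phi(p) - \phi(q) - \langle \nabla \phi(q),\, p - q \rangle, \]

and the quadratic generator \( \phi(p) = \lVert p \rVert^2 \) gives \( D_\phi(p, q) = \lVert p - q \rVert^2 \). This is the textbook definition only; the specific entropy-type characteristics and their estimators studied in the paper are not reproduced here.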