Consistent with previous studies, our results show that hand-gesture recognition (HGR) performance depends strongly on the viewpoint. As a consequence, methods and results often do not generalize well to human-robot interaction (HRI) scenarios, where viewpoints vary considerably. This work proposes two methods for fusing complementary multi-view information for HGR. We evaluate the methods on HanCo, a multi-view hand pose dataset, and compare them to two standard baselines that rely on either a single viewpoint or fully calibrated stereo vision. We show that in HRI settings multiple complementary viewpoints are necessary, and that information fusion should be performed at the level of extracted features, as in our proposed network architecture. Additionally, we show that in some scenarios camera calibration can be avoided, leading to simplified acquisition protocols.
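
To illustrate what fusion at the extracted-features stage means, the following is a minimal sketch, not the paper's actual network: each view is encoded independently by a shared extractor, the per-view features are pooled, and the fused feature is classified. The input representation (flattened hand keypoints), feature dimensions, number of views, and the mean-pooling fusion operator are all illustrative assumptions.

```python
# Illustrative sketch of feature-level multi-view fusion (assumed design,
# not the architecture proposed in the paper).
import torch
import torch.nn as nn

class FeatureFusionHGR(nn.Module):
    def __init__(self, in_dim=63, feat_dim=128, num_classes=10):
        super().__init__()
        # Shared per-view feature extractor (here: an MLP over flattened
        # hand keypoints, e.g., 21 joints x 3 coordinates = 63 values).
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, views):
        # views: (batch, num_views, in_dim), one entry per camera viewpoint
        feats = self.encoder(views)      # (batch, num_views, feat_dim)
        fused = feats.mean(dim=1)        # order-invariant fusion across views
        return self.classifier(fused)    # (batch, num_classes)

# Usage: fuse features from 4 uncalibrated views of the same gesture.
logits = FeatureFusionHGR()(torch.randn(8, 4, 63))
```

Because fusion happens on learned features rather than on triangulated 3D points, this kind of design does not require camera calibration, which is the property the abstract highlights for simplified acquisition protocols.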