Arya Farkhondeh

I am a PhD student at EPFL (École polytechnique fédérale de Lausanne) & Idiap Research Institute in Switzerland, where I work on computer vision, deep learning, and human-centric scene understanding under the supervision of Jean-Marc Odobez. Previously, I was a Research Intern at the Computer Vision Center (CVC) in Barcelona, where I worked with Sergio Escalera. I received my M.Sc. degree in Data Science from Sapienza University of Rome, Italy.

Email  /  Google Scholar  /  LinkedIn  /  GitHub  /  Hugging Face 🤗  /  Twitter

profile photo
Research

I am interested in Computer Vision, Multimodal Learning, Unsupervised Learning, and Human-Centric Scene Understanding.

A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction
Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, Jean-Marc Odobez
Conference on Neural Information Processing Systems (NeurIPS), 2024
arXiv / code / video

We introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. It comprises: (i) a temporal, transformer-based architecture that, in addition to image tokens, handles person-specific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, that unifies annotation types across multiple gaze following and social gaze datasets. We show that our model trained on VSGaze can address all tasks jointly, achieving state-of-the-art results for multi-person gaze following and social gaze prediction.

ChildPlay-Hand: A Dataset of Hand Manipulations in the Wild
Arya Farkhondeh, Samy Tafasca, Jean-Marc Odobez
European Conference on Computer Vision (ECCV) Workshops, 2024
arXiv / code / video

We propose ChildPlay-Hand, a novel dataset that includes person and object bounding boxes, as well as manipulation actions. ChildPlay-Hand is unique in: (1) providing per-hand annotations; (2) featuring videos in uncontrolled settings with natural interactions; (3) including gaze labels from the ChildPlay-Gaze dataset for joint modeling of manipulations and gaze. The manipulation actions cover the main stages of a hand-object interaction (HOI) cycle. We introduce two tasks using these annotations: object in hand (OiH) and manipulation stages (ManiS), and benchmark various spatio-temporal and segmentation networks on both. Our findings suggest that ChildPlay-Hand is a challenging new benchmark for modeling HOI in the wild.

Exploring the Zero-Shot Capabilities of Vision-Language Models for Improving Gaze Following
Anshul Gupta, Pierre Vuillecard, Arya Farkhondeh, Jean-Marc Odobez
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024
Best Paper Award
arXiv / code / video

We investigate the zero-shot capabilities of Vision-Language Models (VLMs) for extracting a wide array of contextual cues to improve gaze following performance. We first evaluate various VLMs, prompting strategies, and in-context learning (ICL) techniques on zero-shot cue recognition. Our analysis indicates that BLIP-2 is the overall top-performing VLM and that ICL can improve performance. For gaze following, incorporating the extracted cues leads to better generalization, especially when a larger set of cues is considered.

CCDb-HG: Novel Annotations and Gaze-Aware Representations for Head Gesture Recognition
Pierre Vuillecard, Arya Farkhondeh, Michael Villamizar, Jean-Marc Odobez
IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2024
pdf / code / video

We introduce CCDb-HG, a novel and large dataset for head gesture recognition featuring diverse head gesture categories. We further explore gaze as an auxiliary cue, along with geometric and temporal data augmentation, to improve generalization, and evaluate various model architectures to establish baseline performance on CCDb-HG.

Towards Self-Supervised Gaze Estimation
Arya Farkhondeh, Cristina Palmero, Simone Scardapane, Sergio Escalera
British Machine Vision Conference (BMVC), 2022
arXiv / code / website

We propose SwAT, a self-supervised approach to gaze estimation that outperforms previous state-of-the-art methods and supervised baselines on several benchmarks.

Temporal Cues from Socially Unacceptable Trajectories for Anomaly Detection
Neelu Madan, Arya Farkhondeh, Kamal Nasrollahi, Sergio Escalera, Thomas B. Moeslund
IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021
pdf / video

We present a new approach for detecting anomalies in surveillance videos that leverages long-range dependencies from trajectories together with basic motion information.

