MMI Workshop Series

Open Problems

Research Questions and Future Directions
From the MMI Community

Purpose

This page is a growing collection of short research contributions from researchers participating in or connected to the MMI community. Each entry highlights an important open problem, challenge, or future research direction in Multimedia Intelligence, Multimodal AI, Information Retrieval, Trustworthy AI, Human-Centered AI, Intelligent Systems, or related areas.

Each entry captures what a researcher believes is worth working on next: a problem they keep encountering, a gap they wish more students would explore, or a direction that remains underinvestigated. The goal is to give students a map into the research landscape, written by researchers who are actively shaping it.

Unlike surveys or tutorials, these contributions focus on unanswered questions and emerging opportunities. They are intended to help students identify impactful research directions, understand why they matter, and discover where they can contribute.

Contributions

Search and filtering by research area will be available as the archive grows.
The entry below is a sample to illustrate the format. Real contributions will appear here.
Mohammad Dindoost
Cross-Modal Alignment in Long-Form Video Understanding
NJIT
Multimodal AI  ·  Multimedia Retrieval  ·  Information Retrieval
May 2026
The Open Problem

Long-form video understanding requires maintaining semantic coherence across thousands of frames while simultaneously integrating speech transcripts, visual events, and temporal context. Current vision-language models excel at short clips but struggle to reason about relationships between events separated by minutes or hours within a single recording.

The open problem is: How can a multimodal system maintain cross-modal alignment and temporal coherence across an hour-long video without losing contextual meaning or introducing factual inconsistencies?

Why It Matters

Video has become a dominant medium for information in education, medicine, science communication, and professional training. A system that can answer complex temporal queries over long recordings, for example by connecting what was said at minute 12 to what appeared visually at minute 47, would change how knowledge is accessed and retrieved from video archives.

This problem sits at the intersection of multimedia retrieval, multimodal reasoning, and memory-efficient AI, making it one of the most practically important open challenges in the field.

Key Challenges
  • Token budget constraints: transformer-based models cannot process hour-long video frame by frame within current context limits
  • Temporal grounding: locating when events occur, not merely whether they occur
  • Cross-modal consistency: detecting and resolving contradictions between spoken content and visual evidence
  • Evaluation: existing benchmarks test short-context understanding; genuine long-range benchmarks are sparse and expensive to annotate
  • Compression artifacts: aggressive temporal subsampling destroys the fine-grained cues needed for alignment
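To make the token budget and subsampling challenges concrete, here is a minimal back-of-the-envelope sketch. The tokens-per-frame figure and the context window size are illustrative assumptions, not properties of any particular model, and the function names are hypothetical.

```python
def visual_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Visual tokens produced by sampling a video uniformly at `fps` frames per second."""
    return int(duration_s * fps) * tokens_per_frame


def max_fps_within_budget(duration_s: float, tokens_per_frame: int, context_tokens: int) -> float:
    """Highest uniform sampling rate whose visual tokens still fit in the context window."""
    return context_tokens / (duration_s * tokens_per_frame)


# All three figures below are assumptions chosen for illustration.
DURATION_S = 3600          # one-hour recording
TOKENS_PER_FRAME = 256     # assumed visual tokens per sampled frame
CONTEXT_TOKENS = 128_000   # assumed context window, ignoring text tokens

tokens_at_1fps = visual_token_count(DURATION_S, fps=1.0, tokens_per_frame=TOKENS_PER_FRAME)
fps_limit = max_fps_within_budget(DURATION_S, TOKENS_PER_FRAME, CONTEXT_TOKENS)

print(f"tokens at 1 fps: {tokens_at_1fps}")
print(f"seconds between frames at the budget limit: {1 / fps_limit:.1f}")
```

Under these assumed numbers, even one frame per second exceeds the window several times over, and fitting the budget means keeping roughly one frame every 7.2 seconds, which is the kind of subsampling that discards fine-grained alignment cues.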
Suggested Starting Points for Students
  • Segment a 1-hour lecture video using a sliding window and measure the degradation in retrieval accuracy as window overlap decreases
  • Compare CLIP-based visual retrieval against a transcript-only BM25 baseline on temporal event queries — the gap reveals how much visual context helps
  • Implement a hierarchical summarization pipeline: first summarize each 5-minute segment, then reason over segment summaries
  • Study cross-modal grounding datasets such as HowTo100M and ActivityNet Captions as entry points into the evaluation landscape
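The first two starting points can be prototyped on the transcript side with no external dependencies. The sketch below cuts a recording into sliding windows and ranks them against a query with a small Okapi BM25 scorer. It is a toy setup under stated assumptions: the transcript, timestamps, and query are invented for illustration, the function names are hypothetical, and `k1` and `b` are common default values rather than tuned ones. The CLIP-based visual side of the comparison is not covered here.

```python
import math
from collections import Counter


def sliding_windows(duration_s, window_s, overlap_s):
    """Return (start, end) pairs covering [0, duration_s) with the given overlap."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window_s - overlap_s
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        start += step
    return windows


def tokenize(text):
    """Lowercase whitespace tokenizer; a real pipeline would do more."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of `query` against each document (a list of tokens)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, transcript, windows):
    """Return the (start, end) window whose transcript text best matches the query."""
    docs = [
        tokenize(" ".join(text for t, text in transcript if start <= t < end))
        for start, end in windows
    ]
    scores = bm25_scores(query, docs)
    best = max(range(len(windows)), key=scores.__getitem__)
    return windows[best]


# Hypothetical timestamped transcript: (start time in seconds, sentence).
transcript = [
    (30.0, "today we introduce gradient descent"),
    (290.0, "the learning rate controls the step size"),
    (310.0, "a large step size can make training diverge"),
    (900.0, "next we look at momentum"),
]

for overlap_s in (0, 150):
    windows = sliding_windows(1200, window_s=300, overlap_s=overlap_s)
    print(overlap_s, retrieve("step size diverge", transcript, windows))
```

The two sentences at 290 s and 310 s describe one idea but fall on opposite sides of a window boundary when the overlap is zero. Repeating the retrieval over a labeled set of queries while varying `overlap_s` is one way to measure how accuracy changes as overlap shrinks.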
Recommended Reading
  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (Zhang et al., 2023)
  • InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation (Wang et al., 2023)
  • LLaVA-NeXT: Improved Reasoning, OCR, and World Knowledge (Liu et al., 2024)
  • Temporal Grounding in Video Retrieval: A Survey (ACM Computing Surveys, 2024)
  • HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips (Miech et al., 2019)

How to Contribute

Submit an Open Problem

If there's a problem you'd like to add to this collection, we'd love to include it. Send your contribution using the checklist below as a guide. One email with everything included is ideal.

A complete submission includes:

  • A concise title
  • Research area tags (3–7 tags — these will power search and filtering, e.g. Multimodal AI, Trustworthy AI, Information Retrieval)
  • A clear statement of the open problem (2–3 paragraphs)
  • Why it matters — motivation and potential impact
  • Key technical and conceptual challenges
  • Suggested starting points for students new to the area
  • 3–5 recommended readings (papers, surveys, or tutorials)
  • A headshot photo, if you are not already featured on the MMI website
  • Your Google Scholar profile link, and optionally your personal webpage or LinkedIn
  • Optional: a short video talk or recorded presentation (10 minutes or less)
The Open Problems Challenge

Consider yourself challenged. Once your entry is live, pass it on: introduce us to 3 researchers you think should contribute next. Name them, connect us, and help the collection grow one problem at a time.

MMI Series  ·  Department of Computer Science  ·  NJIT