Open Problems

Purpose

This page is a growing collection of short research contributions from researchers participating in or connected to the MMI community. Each entry highlights an important open problem, challenge, or future research direction in Multimedia Intelligence, Multimodal AI, Information Retrieval, Trustworthy AI, Human-Centered AI, Intelligent Systems, or related areas.

Each entry captures what a researcher believes is worth working on next: a problem they keep encountering, a gap they wish more students would explore, or a direction that remains underinvestigated. The goal is to give students a map into the research landscape, written by researchers who are actively shaping it.

Unlike surveys or tutorials, these contributions focus on unanswered questions and emerging opportunities. They are intended to help students identify impactful research directions, understand why they matter, and discover where they can contribute.

Contributions


Search and filtering by research area will be available as the archive grows.

The Open Problem

Long-form video understanding requires maintaining semantic coherence across thousands of frames while simultaneously integrating speech transcripts, visual events, and temporal context. Current vision-language models excel at short clips but struggle to reason about relationships between events separated by minutes or hours within a single recording.

The open problem is: How can a multimodal system maintain cross-modal alignment and temporal coherence across an hour-long video without losing contextual meaning or introducing factual inconsistencies?

Why It Matters

Video has become a dominant medium for information in education, medicine, science communication, and professional training. A system that can answer complex temporal queries over long recordings, for example by connecting what was said at minute 12 to what appeared visually at minute 47, would change how knowledge is retrieved from video archives.

This problem sits at the intersection of multimedia retrieval, multimodal reasoning, and memory-efficient AI, making it one of the most practically important open challenges in the field.

Key Challenges

Token budget constraints: encoding every frame of an hour-long video produces far more tokens than current transformer context windows can hold
Temporal grounding: locating when events occur, not merely whether they occur
Cross-modal consistency: detecting and resolving contradictions between spoken content and visual evidence
Evaluation: existing benchmarks test short-context understanding; genuine long-range benchmarks are sparse and expensive to annotate
Subsampling loss: aggressive temporal subsampling discards the fine-grained cues needed for alignment
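
The token budget challenge listed above can be made concrete with a little arithmetic. The sketch below is illustrative only: the tokens-per-frame figure and the context budget are hypothetical placeholders chosen for the example, not measurements of any particular model, and the function names are made up for this page.

```python
def visual_token_count(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Tokens needed to encode every sampled frame of a video."""
    return int(duration_s * fps) * tokens_per_frame


def max_fps_within_budget(duration_s: float, tokens_per_frame: int, budget: int) -> float:
    """Highest uniform sampling rate whose frames still fit in the token budget."""
    return budget / (duration_s * tokens_per_frame)


# Illustrative assumptions, not properties of a real system.
DURATION_S = 3600          # one hour of video
TOKENS_PER_FRAME = 256     # hypothetical visual tokens per frame
CONTEXT_BUDGET = 128_000   # hypothetical context window, in tokens

needed = visual_token_count(DURATION_S, fps=1.0, tokens_per_frame=TOKENS_PER_FRAME)
affordable_fps = max_fps_within_budget(DURATION_S, TOKENS_PER_FRAME, CONTEXT_BUDGET)

print(f"tokens at 1 frame/s: {needed}")
print(f"seconds between frames that fit the budget: {1 / affordable_fps:.1f}")
```

Under these assumed values, sampling one frame per second already needs 921,600 tokens, about 7.2 times the budget, so the budget only allows roughly one frame every 7.2 seconds before any transcript tokens are added. Plugging in the numbers for a real model gives a quick sense of how much subsampling or summarization a given approach must absorb.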

Suggested Starting Points for Students

Segment a 1-hour lecture video using a sliding window and measure the degradation in retrieval accuracy as window overlap decreases
Compare CLIP-based visual retrieval against a transcript-only BM25 baseline on temporal event queries; the gap between the two shows how much visual context helps
Implement a hierarchical summarization pipeline: first summarize each 5-minute segment, then reason over segment summaries
Study cross-modal grounding datasets such as HowTo100M and ActivityNet Captions as entry points into the evaluation landscape
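
As a possible way into the first two starting points, the sketch below cuts a recording into overlapping windows and ranks those windows against a query with a small hand-written Okapi BM25 scorer over the transcript. Everything here is an assumption made for illustration: the function names, the (timestamp, token) transcript format, the toy transcript, and the default k1 and b values. A real experiment would use a proper tokenizer and actual lecture transcripts.

```python
import math
from collections import Counter


def sliding_windows(duration_s, window_s, overlap_s):
    """Return (start, end) windows covering [0, duration_s] with the given overlap."""
    if not 0 <= overlap_s < window_s:
        raise ValueError("overlap_s must satisfy 0 <= overlap_s < window_s")
    step = window_s - overlap_s
    windows, start = [], 0.0
    while start < duration_s:
        windows.append((start, min(start + window_s, duration_s)))
        if start + window_s >= duration_s:
            break  # the last window already reaches the end of the video
        start += step
    return windows


def window_docs(words, windows):
    """Group (timestamp_s, token) pairs into one token list per window."""
    return [[tok for t, tok in words if s <= t < e] for s, e in windows]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the tokenized query (Okapi BM25)."""
    n = len(docs)
    avgdl = (sum(len(d) for d in docs) / n) or 1.0  # avoid division by zero
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy transcript: (timestamp in seconds, lowercased token). Purely illustrative.
words = [
    (12.0, "gradient"), (14.0, "descent"),
    (310.0, "learning"), (312.0, "rate"),
    (700.0, "gradient"), (705.0, "clipping"),
]
windows = sliding_windows(duration_s=900, window_s=300, overlap_s=0)
docs = window_docs(words, windows)
scores = bm25_scores(["learning", "rate"], docs)
best = max(range(len(scores)), key=scores.__getitem__)
print("best window:", windows[best])
```

Re-running the same queries while varying overlap_s, and checking whether the top-ranked window contains the annotated event time, is one simple way to measure how retrieval accuracy changes with overlap. The same window list can then be reused to compare this transcript-only baseline against a visual retriever.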

Submit an Open Problem

If there's a problem you'd like to add to this collection, we'd love to include it. Send your contribution using the checklist below as a guide. One email with everything included is ideal.

A complete submission includes:

A concise title
Research area tags (3–7 tags — these will power search and filtering, e.g. Multimodal AI, Trustworthy AI, Information Retrieval)
A clear statement of the open problem (2–3 paragraphs)
Why it matters — motivation and potential impact
Key technical and conceptual challenges
Suggested starting points for students new to the area
3–5 recommended readings (papers, surveys, or tutorials)
A headshot photo, if you are not already featured on the MMI website
Your Google Scholar profile link, and optionally your personal webpage or LinkedIn
Optional: a short video talk or recorded presentation (10 minutes or less)

The Open Problems Challenge

Consider yourself challenged. Once your entry is live, pass it on: introduce us to 3 researchers you think should contribute next. Name them, connect us, and help the collection grow one problem at a time.
