highlighted that the predominant access mechanism was manual tagging via metadata
noted emerging automatic approaches are based on shot boundary detection, feature extraction and keyframe identification, followed by feature searching with keyframe browsing
noted there was no test collection for video
provided an overview of what 12 groups did on 11 hours of video in shot boundary detection and searching tasks
Last year was TV2001, this year is TV2002
New this year (1)
More participants and data:
17 participating teams (up from 12),
73 hours (up from 11)
Shot boundary determination (SBD)
new measures
3-week test window
New semantic feature extraction task
features defined jointly by the participants
task is to identify shots with those features
Several groups donated extracted features
identified features from test videos early
shared their output (in an MPEG-7 format defined by IBM) in time for others to use as part of their search systems
New this year (2)
25 topics for the search task,
developed by NIST
4 weeks between release and submission
text, video, image and/or audio
Average precision added as a measure – new emphasis on ranking (defined in the sketch after this list)
A common set of shot definitions
donated by CLIPS-IMAG, formatted by DCU
common units of retrieval for feature and search tasks
allowed pooling for assessment
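As a brief aside on the new measure: average precision rewards runs that rank relevant shots early. A standard formulation, assumed here to match the trec_eval definition (R is the number of relevant shots for a topic, rel(k) is 1 if the shot at rank k is relevant and 0 otherwise, and n is the result-list length):

    \[
      \mathrm{AvgP} = \frac{1}{R}\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k),
      \qquad
      P(k) = \frac{1}{k}\sum_{j=1}^{k}\mathrm{rel}(j)
    \]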
New this year (3)
Searching was:
Interactive: full human access and iterations, or
Manual: a human with no knowledge of the test data gets a single attempt at formulating the topic as a search query
No fully automatic topic-to-query translation
Elapsed search time was added as a measure of effort for interactive search; groups also gathered data on searcher characteristics
advertising, educational, industrial, amateur films 1930-1970
produced by corporations, non-profit organisations, trade groups, etc.
Noisy, strange color, but real archive data
73.3 hours partitioned as follows:
2. Shot Boundary Detection task
Not a new problem, but a challenge because of gradual transitions and false positives caused by photo flashes, rapid camera or object movement
4 hours, 51 minutes of documentary and educational material
Manually created ground truth of 2,090 transitions (thanks Jonathan) with 70% hard cuts, 25% dissolves, rest are fades to black and back, etc.
Up to 10 submissions per group, measured using precision and recall, with a bit of flexibility for matching gradual transitions
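For reference, a sketch of how the transition measures are defined (the actual matching rules, including the tolerance allowed for gradual transitions, follow the track guidelines rather than this simplification):

    \[
      \mathrm{Recall} = \frac{|\text{reference transitions correctly detected}|}{|\text{reference transitions}|},
      \qquad
      \mathrm{Precision} = \frac{|\text{correct detections}|}{|\text{all detections}|}
    \]

Frame-recall and frame-precision (charted below) apply the same ratios at the frame level within matched gradual transitions.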
Results charts, 2001 vs 2002: recall & precision for cuts; recall & precision for gradual transitions; frame-recall & frame-precision for GTs
So, who did what? The approaches…
Shot Boundary Detection: approaches by group
3. Feature Extraction
FE is
interesting in itself, but its importance increases when it serves video navigation and search
Objective was to
begin work on benchmarking FE
allow exchange of feature detection output among participants
Task is as follows:
given a small standard dataset (5.02 hours, 1,848 shots) with common shot bounds,
locate up to 1,000 shots for each of 10 binary features
Feature frequency varied from “rare” to “everywhere”
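To make the task shape concrete (a minimal sketch only; the shot identifiers, scores and function below are hypothetical, not the actual submission format), a run for one binary feature is essentially a confidence-ranked list of common shots truncated to 1,000:

    # Sketch: turn per-shot detector confidences into a ranked list for one
    # binary feature, capped at the task limit of 1,000 shots.
    MAX_SHOTS = 1000

    def rank_shots_for_feature(confidences):
        """confidences: dict mapping shot_id -> detector score for one feature."""
        ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
        return [shot_id for shot_id, _ in ranked[:MAX_SHOTS]]

    # Illustrative scores for three made-up common shots:
    scores = {"shot1_1": 0.92, "shot1_2": 0.15, "shot2_7": 0.58}
    print(rank_shots_for_feature(scores))  # ['shot1_1', 'shot2_7', 'shot1_2']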
The Features
1. Outdoors
2. Indoors
3. Face - 1+ human face with nose, mouth, 2 eyes
4. People - 2+ humans, each at least partially visible
5. Cityscape - city/urban/suburban setting
6. Landscape - natural inland setting with no human development such as ploughing or crops
7. Text Overlay - large enough to be read
8. Speech - human voice uttering words
9. Instrumental Sound - 1+ musical instruments
10. Monologue - 1 person, partially visible, speaking for a long time without interruption
True shots contributed uniquely by each run
Small values imply lots of overlap between runs
Likely due to relative size of result set (1,000 shots) and total test set (1,848 shots)
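The uniqueness numbers can be read as the following computation (a sketch with made-up run names; each run is represented by the set of true shots it returned for a feature):

    # Sketch: for each run, count the true shots that no other run found.
    def unique_contributions(runs):
        """runs: dict mapping run_id -> set of true (correct) shot ids."""
        unique = {}
        for run_id, shots in runs.items():
            others = set().union(*(s for r, s in runs.items() if r != run_id))
            unique[run_id] = len(shots - others)
        return unique

    runs = {"run_A": {"s1", "s2", "s3"},
            "run_B": {"s2", "s3", "s4"},
            "run_C": {"s3", "s5"}}
    print(unique_contributions(runs))  # {'run_A': 1, 'run_B': 1, 'run_C': 1}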
AvgP by feature (runs at median or above)
Groups and Features
4. The Search Task
Task is similar to its text analogue …
topics are formatted descriptions of an information need
task is to return up to 100 shots that meet the need
Test data: 40.12 hours (14,524 common shots)
Features and/or ASR donated by CLIPS, DCU, IBM, Mediamill and MSRA
NIST assessors
judged top 50 shots from each submitted result set
subsequent full judgements showed only minor variations in performance
Used trec_eval to calculate measures
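A minimal sketch of the pooling step (assuming, as stated above, that only the top 50 shots of each submitted result set were judged; run names and data structures are illustrative, not actual trec_eval formats):

    # Sketch: build the assessment pool for one topic from the top-N shots
    # of every submitted run; only pooled shots go to the assessors.
    POOL_DEPTH = 50

    def build_pool(runs, depth=POOL_DEPTH):
        """runs: dict mapping run_id -> ranked list of shot ids (best first)."""
        pool = set()
        for ranked in runs.values():
            pool.update(ranked[:depth])
        return pool

    runs = {"sysA": ["s3", "s7", "s1"], "sysB": ["s7", "s9"]}
    print(sorted(build_pool(runs)))  # ['s1', 's3', 's7', 's9']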
Search Topics
Topics (25) multimedia, created by NIST
22 had video examples (avg 2.7 each), 8 had image examples (avg 1.9 each)
Requested shots with specific/generic:
People: George Washington; football players
Things: Golden Gate Bridge; sailboats
Locations: ---; overhead views of cities
Activities : ---; rocket taking off
Combinations of the above:
People spending leisure time at the beach
Locomotive approaching the viewer
Microscopic views of living cells
Search Types: Interactive and Manual
Manual runs: Top 10 (of 27)
Interactive runs top 10 (of 13)
Mean AvgP vs mean elapsed time
Search: Unique relevant shots from each run
Distribution of relevant shots: top vs bottom halves of result sets
Max/median AvgP by topic - interactive
Max/median AvgP by topic - manual
Relevant shots by file id (topics 75-87)
Relevant shots by file id (topics 88-99)
The Groups and Searching
Groups doing the “Full Monty”
5. Conclusions
This track has grown significantly
… data, groups, tasks, measures, complexity
Donated features enabled many sites to take part and greatly enriched progress … this cannot be overstated … very collegial and beneficial all round
Common shot definition
implications for measurement need closer look
seems it was successful
The search task is becoming increasingly interactive, and we could do with guidance here
Evaluation framework has settled down – should be repeated on new data with only minor adjustments
Need more data (especially for feature extraction), more topics – looking at 120 hours of news video from 1998
Need to encourage progress on manual/automatic processing – how? focus evaluation more?
Probably ready to become one-day pre-TREC workshop with report-out/poster at TREC