Alan F. Smeaton, Dublin City University
Alan F. Smeaton & Paul Over, NIST
1. Introduction and Context
Last year's talk:
- gave an intro to video coding & compression
- highlighted the predominant access mechanism: manual tagging via metadata
- noted emerging automatic approaches based on shot boundary detection, feature extraction and keyframe identification, followed by feature searching with keyframe browsing
- noted there is no test collection of video
- provided an overview of what 12 groups did on 11 hours of video in the shot boundary detection and searching tasks
Last year was TV101, this year is TV201
New this year (1)
More participants and data:
- 17 participating teams (up from 12), 73 hours of video (up from 11)
Shot boundary determination (SBD):
- new measures
- 3-week test window
New semantic feature extraction task:
- features defined jointly by the participants
- task is to identify shots containing those features
- groups that identified features from the test videos early shared their output (in MPEG-7, as defined by IBM) in time for others to use as part of their search systems
New this year (2)
- 25 topics for the search task, developed by NIST
- 4 weeks between topic release and submission
- topics could include text, video, image and/or audio
- average precision added as a measure – new emphasis on ranking
- a common set of shot definitions, donated by CLIPS-IMAG and formatted by DCU:
  - common units of retrieval for the feature and search tasks
  - allowed pooling for assessment
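The pooling enabled by the common shot definitions can be sketched as below; the run names and shot ids are hypothetical, but the idea is that the union of the top-ranked shots from all submitted runs forms the set sent to assessors, with everything outside the pool assumed non-relevant:

```python
def build_pool(runs, depth):
    """Union of the top-`depth` shots from each run, in first-seen order.

    `runs` maps a run id to its ranked list of common shot ids.
    Only pooled shots are judged; unpooled shots are taken as non-relevant.
    """
    pool = []
    seen = set()
    for ranked_shots in runs.values():
        for shot in ranked_shots[:depth]:
            if shot not in seen:
                seen.add(shot)
                pool.append(shot)
    return pool


runs = {
    "sysA": ["shot12", "shot3", "shot7"],
    "sysB": ["shot3", "shot9", "shot12"],
}
print(build_pool(runs, depth=2))  # ['shot12', 'shot3', 'shot9']
```

Because all runs rank the same common shot units, duplicates across runs collapse in the pool, which keeps the assessors' workload manageable.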
New this year (3)
Searching was either:
- Interactive: full human access and iteration, or
- Manual: a human with no knowledge of the test data gets one shot at formulating the topic as a search query
- No fully automatic topic-to-query translation
- Elapsed search time was added as a measure of effort
- For interactive search, groups gathered data on searcher characteristics
The 17 groups and the tasks they completed
Video Data
- Difficult to get video data for use in TREC because of copyright
- Used mainly Internet Archive material: advertising, educational, industrial and amateur films, 1930-1970, produced by corporations, non-profit organisations, trade groups, etc.
- Noisy, with strange colour, but real archive data
- 73.3 hours, partitioned as follows:
2. Shot Boundary Detection task
- Not a new problem, but a challenge because of gradual transitions and false positives caused by photo flashes and rapid camera or object movement
- 4 hours, 51 minutes of documentary and educational material
- Manually created ground truth of 2,090 transitions (thanks Jonathan), with 70% hard cuts, 25% dissolves, and the rest fades to black and back, etc.
- Up to 10 submissions per group, measured using precision and recall, with some flexibility in matching gradual transitions
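The precision/recall scoring above can be sketched as follows. This is a minimal Python illustration, not the NIST evaluation code: transitions are assumed to be (start_frame, end_frame) pairs, and the "flexibility" for gradual transitions is approximated here by counting a detection as correct if it overlaps an as-yet-unmatched reference transition:

```python
def overlaps(a, b):
    """Two (start_frame, end_frame) transitions overlap if their spans intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def sbd_precision_recall(detected, reference):
    """Greedy one-to-one matching of detected transitions to ground truth.

    A loose stand-in for the flexible matching of gradual transitions:
    each reference transition can be claimed by at most one detection.
    """
    unmatched = list(reference)
    hits = 0
    for d in detected:
        for r in unmatched:
            if overlaps(d, r):
                unmatched.remove(r)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


ref = [(100, 100), (250, 280), (400, 400)]   # a cut, a dissolve, a cut
det = [(100, 100), (255, 270), (500, 500)]   # last detection is a false positive
print(sbd_precision_recall(det, ref))        # roughly (2/3, 2/3)
```

Note that for hard cuts the "interval" collapses to a single frame, which is why cuts and gradual transitions are reported separately in the charts that follow.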
2001: Recall and precision for cuts
2002: Recall and precision for cuts
2001: Gradual Transitions
2002: Gradual Transitions
2002: Frame-recall & -precision for GTs
So, who did what? The approaches…
Shot Boundary Detection:
3. Feature Extraction
- FE is interesting in itself, but its importance increases when it serves to help video navigation and search
- Objective was to begin work on benchmarking FE and to allow exchange of feature detection output among participants
- Task: given a small standard dataset (5.02 hours, 1,848 shots) with common shot bounds, locate up to 1,000 shots for each of 10 binary features
- Feature frequency varied from "rare" to "everywhere"
The Features
1. Outdoors
2. Indoors
3. Face - 1+ human face with nose, mouth and 2 eyes
4. People - 2+ humans, each at least partially visible
5. Cityscape - city/urban/suburban setting
6. Landscape - natural inland setting with no evidence of human development such as ploughing or crops
7. Text Overlay - text large enough to be read
8. Speech - human voice uttering words
9. Instrumental Sound - sound produced by one or more musical instruments
10. Monologue - 1 person, at least partially visible, speaking for a long time without interruption
True shots contributed uniquely by each run
- Small values imply lots of overlap between runs
- Likely due to the relative size of the result set (1,000 shots) and the total test set (1,848 shots)
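The "uniquely contributed" statistic above can be computed as sketched below; the run names and shot ids are hypothetical. For each run, it counts the relevant (true) shots that run returned which no other run found:

```python
def unique_true_shots(runs, relevant):
    """For each run, count relevant shots returned by that run and no other.

    `runs` maps a run id to the set of shot ids it returned;
    `relevant` is the set of shots judged to truly contain the feature.
    """
    counts = {}
    for name, shots in runs.items():
        others = set()
        for other, other_shots in runs.items():
            if other != name:
                others.update(other_shots)
        counts[name] = sum(1 for s in shots if s in relevant and s not in others)
    return counts


runs = {
    "r1": {"s1", "s2", "s3"},
    "r2": {"s2", "s3", "s4"},
}
relevant = {"s1", "s2", "s4"}
print(unique_true_shots(runs, relevant))  # {'r1': 1, 'r2': 1}
```

With result sets of 1,000 shots drawn from only 1,848, most runs necessarily return much of the same material, so these counts stay small regardless of how different the underlying detectors are.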
AvgP by feature (runs at median or above)
Groups and Features
4. The Search Task
- Task is similar to its text analogue:
  - topics are formatted descriptions of an information need
  - task is to return up to 100 shots that meet the need
- Test data: 40.12 hours (14,524 common shots)
- Features and/or ASR output donated by CLIPS, DCU, IBM, Mediamill and MSRA
- NIST assessors judged the top 50 shots from each submitted result set; subsequent full judgements showed only minor variations in performance
- Used trec_eval to calculate measures
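The average precision measure that trec_eval reports can be illustrated as below. This is a hedged sketch of the standard non-interpolated definition, with made-up shot ids; the real tool also handles run parsing, ties and per-topic averaging:

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision over one ranked result list.

    Precision is sampled at each rank where a relevant shot appears,
    and the sum is divided by the total number of relevant shots,
    so relevant shots missing from the list count against the score.
    """
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


ranked = ["s7", "s2", "s9", "s4", "s1"]
relevant = {"s2", "s4"}
print(average_precision(ranked, relevant))  # 0.5
```

Because each hit's precision is weighted by its rank, the measure rewards systems that place relevant shots early in the 100-shot result list, which is exactly the "new emphasis on ranking" noted earlier.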
Search Topics
- 25 multimedia topics, created by NIST
- 22 had video examples (avg 2.7 each), 8 had image examples (avg 1.9 each)
- Requested shots with specific/generic:
  - People: George Washington; football players
  - Things: Golden Gate Bridge; sailboats
  - Locations: ---; overhead views of cities
  - Activities: ---; rocket taking off
  - Combinations of the above: people spending leisure time at the beach; locomotive approaching the viewer; microscopic views of living cells
Search Types: Interactive and Manual
Manual runs: Top 10 (of 27)
Interactive runs: Top 10 (of 13)
Mean AvgP vs mean elapsed time
Search: Unique relevant shots from each run
Distribution of relevant shots: top vs bottom halves of result sets
Max/median AvgP by topic - interactive
Relevant shots by file id (topics 75-87)
Relevant shots by file id (topics 88-99)
The Groups and Searching
Groups doing the “Full Monty”
This track has grown significantly … data, groups, tasks, measures, complexity
- Donated features enabled many sites to take part and greatly enriched progress … this cannot be overstated … very collegiate and beneficial all round
- Common shot definition: implications for measurement need a closer look, but it seems it was successful
- The search task is becoming increasingly interactive, and we could do with guidance here
- Evaluation framework has settled down – should be repeated on new data with only minor adjustments
- Need more data (especially for feature extraction) and more topics – looking at 120 hours of news video from 1998
- Need to encourage progress on manual/automatic processing – how? Focus the evaluation more?
- Probably ready to become a one-day pre-TREC workshop with report-out/poster at TREC