Alan F. Smeaton, Dublin City University
Alan F. Smeaton & Paul Over, NIST
1. Introduction and Context
Last year's talk:
- gave an intro to video coding & compression
- highlighted the predominant access mechanism: manual tagging via metadata
- noted emerging automatic approaches based on shot boundary detection, feature extraction and keyframe identification, followed by feature searching with keyframe browsing
- noted there is no test collection of video
- provided an overview of what 12 groups did on 11 hours of video in the shot boundary detection and searching tasks
Last year was TV101, this year is TV201
New this year (1)
More participants and data:
- 17 participating teams (up from 12), 73 hours of video (up from 11)
Shot boundary determination (SBD):
- new measures
- 3-week test window
New semantic feature extraction task:
- features defined jointly by the participants
- task is to identify shots containing those features
- groups that identified features from the test videos early shared their output (in MPEG-7, as defined by IBM) in time for others to use as part of their search systems
New this year (2)
- 25 topics for the search task, developed by NIST
- 4 weeks between topic release and submission
- topics could include text, video, image and/or audio
- average precision added as a measure – new emphasis on ranking
- a common set of shot definitions, donated by CLIPS-IMAG and formatted by DCU:
  - common units of retrieval for the feature and search tasks
  - allowed pooling for assessment
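The pooling enabled by the common shot definitions can be sketched as below; the run names and shot ids are hypothetical, but the idea is that the union of the top-ranked shots from all submitted runs forms the set sent to assessors, with everything outside the pool assumed non-relevant:

```python
def build_pool(runs, depth):
    """Union of the top-`depth` shots from each run, in first-seen order.

    `runs` maps a run id to its ranked list of common shot ids.
    Only pooled shots are judged; unpooled shots are taken as non-relevant.
    """
    pool = []
    seen = set()
    for ranked_shots in runs.values():
        for shot in ranked_shots[:depth]:
            if shot not in seen:
                seen.add(shot)
                pool.append(shot)
    return pool


runs = {
    "sysA": ["shot12", "shot3", "shot7"],
    "sysB": ["shot3", "shot9", "shot12"],
}
print(build_pool(runs, depth=2))  # ['shot12', 'shot3', 'shot9']
```

Because all runs rank the same common shot units, duplicates across runs collapse in the pool, which keeps the assessors' workload manageable.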
New this year (3)
Searching was either:
- Interactive: full human access and iteration, or
- Manual: a human with no knowledge of the test data gets one shot at formulating the topic as a search query
- No fully automatic topic-to-query translation
- Elapsed search time was added as a measure of effort
- For interactive search, groups gathered data on searcher characteristics
The 17 groups and the tasks they completed
Video Data
- Difficult to get video data for use in TREC because of copyright
- Used mainly Internet Archive material: advertising, educational, industrial and amateur films, 1930-1970, produced by corporations, non-profit organisations, trade groups, etc.
- Noisy, with strange colour, but real archive data
- 73.3 hours, partitioned as follows:
2. Shot Boundary Detection task
- Not a new problem, but a challenge because of gradual transitions and false positives caused by photo flashes and rapid camera or object movement
- 4 hours, 51 minutes of documentary and educational material
- Manually created ground truth of 2,090 transitions (thanks Jonathan), with 70% hard cuts, 25% dissolves, and the rest fades to black and back, etc.
- Up to 10 submissions per group, measured using precision and recall, with some flexibility in matching gradual transitions
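The precision/recall scoring above can be sketched as follows. This is a minimal Python illustration, not the NIST evaluation code: transitions are assumed to be (start_frame, end_frame) pairs, and the "flexibility" for gradual transitions is approximated here by counting a detection as correct if it overlaps an as-yet-unmatched reference transition:

```python
def overlaps(a, b):
    """Two (start_frame, end_frame) transitions overlap if their spans intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def sbd_precision_recall(detected, reference):
    """Greedy one-to-one matching of detected transitions to ground truth.

    A loose stand-in for the flexible matching of gradual transitions:
    each reference transition can be claimed by at most one detection.
    """
    unmatched = list(reference)
    hits = 0
    for d in detected:
        for r in unmatched:
            if overlaps(d, r):
                unmatched.remove(r)
                hits += 1
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


ref = [(100, 100), (250, 280), (400, 400)]   # a cut, a dissolve, a cut
det = [(100, 100), (255, 270), (500, 500)]   # last detection is a false positive
print(sbd_precision_recall(det, ref))        # roughly (2/3, 2/3)
```

Note that for hard cuts the "interval" collapses to a single frame, which is why cuts and gradual transitions are reported separately in the charts that follow.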
2001: Recall and precision for cuts
2002: Recall and precision for cuts
2001: Gradual Transitions
2002: Gradual Transitions
2002: Frame-recall & -precision for GTs
So, who did what? The approaches…
Shot Boundary Detection:
3. Feature Extraction
- FE is interesting in itself, but its importance increases when it serves to help video navigation and search
- Objective was to begin work on benchmarking FE and to allow exchange of feature detection output among participants
- Task: given a small standard dataset (5.02 hours, 1,848 shots) with common shot bounds, locate up to 1,000 shots for each of 10 binary features
- Feature frequency varied from "rare" to "everywhere"
The Features
1. Outdoors
2. Indoors
3. Face - 1+ human face with nose, mouth and 2 eyes
4. People - 2+ humans, each at least partially visible
5. Cityscape - city/urban/suburban setting
6. Landscape - natural inland setting with no evidence of human development such as ploughing or crops
7. Text Overlay - text large enough to be read
8. Speech - human voice uttering words
9. Instrumental Sound - sound produced by one or more musical instruments
10. Monologue - 1 person, at least partially visible, speaking for a long time without interruption
True shots contributed uniquely by each run
- Small values imply lots of overlap between runs
- Likely due to the relative size of the result set (1,000 shots) and the total test set (1,848 shots)
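The "uniquely contributed" statistic above can be computed as sketched below; the run names and shot ids are hypothetical. For each run, it counts the relevant (true) shots that run returned which no other run found:

```python
def unique_true_shots(runs, relevant):
    """For each run, count relevant shots returned by that run and no other.

    `runs` maps a run id to the set of shot ids it returned;
    `relevant` is the set of shots judged to truly contain the feature.
    """
    counts = {}
    for name, shots in runs.items():
        others = set()
        for other, other_shots in runs.items():
            if other != name:
                others.update(other_shots)
        counts[name] = sum(1 for s in shots if s in relevant and s not in others)
    return counts


runs = {
    "r1": {"s1", "s2", "s3"},
    "r2": {"s2", "s3", "s4"},
}
relevant = {"s1", "s2", "s4"}
print(unique_true_shots(runs, relevant))  # {'r1': 1, 'r2': 1}
```

With result sets of 1,000 shots drawn from only 1,848, most runs necessarily return much of the same material, so these counts stay small regardless of how different the underlying detectors are.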
AvgP by feature (runs at median or above)
Groups and Features
4. The Search Task
- Task is similar to its text analogue:
  - topics are formatted descriptions of an information need
  - task is to return up to 100 shots that meet the need
- Test data: 40.12 hours (14,524 common shots)
- Features and/or ASR output donated by CLIPS, DCU, IBM, Mediamill and MSRA
- NIST assessors judged the top 50 shots from each submitted result set; subsequent full judgements showed only minor variations in performance
- Used trec_eval to calculate measures
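The average precision measure that trec_eval reports can be illustrated as below. This is a hedged sketch of the standard non-interpolated definition, with made-up shot ids; the real tool also handles run parsing, ties and per-topic averaging:

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision over one ranked result list.

    Precision is sampled at each rank where a relevant shot appears,
    and the sum is divided by the total number of relevant shots,
    so relevant shots missing from the list count against the score.
    """
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


ranked = ["s7", "s2", "s9", "s4", "s1"]
relevant = {"s2", "s4"}
print(average_precision(ranked, relevant))  # 0.5
```

Because each hit's precision is weighted by its rank, the measure rewards systems that place relevant shots early in the 100-shot result list, which is exactly the "new emphasis on ranking" noted earlier.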
Search Topics
- 25 multimedia topics, created by NIST
- 22 had video examples (avg 2.7 each), 8 had image examples (avg 1.9 each)
- Requested shots with specific/generic:
  - People: George Washington; football players
  - Things: Golden Gate Bridge; sailboats
  - Locations: ---; overhead views of cities
  - Activities: ---; rocket taking off
  - Combinations of the above: people spending leisure time at the beach; locomotive approaching the viewer; microscopic views of living cells
Search Types: Interactive and Manual
Manual runs: Top 10 (of 27)
Interactive runs: Top 10 (of 13)
Mean AvgP vs mean elapsed time
Search: Unique relevant shots from each run
Distribution of relevant shots: top vs bottom halves of result sets
Max/median AvgP by topic - interactive
Relevant shots by file id (topics 75-87)
Relevant shots by file id (topics 88-99)
The Groups and Searching
Groups doing the “Full Monty”
This track has grown significantly … data, groups, tasks, measures, complexity
- Donated features enabled many sites to take part and greatly enriched progress … this cannot be overstated … very collegiate and beneficial all round
- Common shot definition: implications for measurement need a closer look, but it seems it was successful
- The search task is becoming increasingly interactive, and we could do with guidance here
- Evaluation framework has settled down – should be repeated on new data with only minor adjustments
- Need more data (especially for feature extraction) and more topics – looking at 120 hours of news video from 1998
- Need to encourage progress on manual/automatic processing – how? Focus the evaluation more?
- Probably ready to become a one-day pre-TREC workshop with report-out/poster at TREC