Alan F. Smeaton Dublin City University





  • Alan F. Smeaton, Dublin City University

  • Paul Over, NIST


1. Introduction and Context

  • Last year’s talk…

    • gave an intro to video coding & compression
    • highlighted the predominant access mechanism as manual tagging via metadata
    • noted that emerging automatic approaches are based on shot boundary detection, feature extraction and keyframe identification, followed by feature searching with keyframe browsing
    • noted there was no test collection for video
    • provided an overview of what 12 groups did on 11 hours of video in shot boundary detection and searching tasks
  • Last year was TV101, this year is TV201



New this year (1)

  • More participants and data:

    • 17 participating teams (up from 12),
    • 73 hours (up from 11)
  • Shot boundary determination (SBD)

    • new measures
    • 3-week test window
  • New semantic feature extraction task

    • features defined jointly by the participants
    • the task is to identify shots containing those features
  • Several groups donated extracted features

    • identified features from test videos early
    • shared their output (in an MPEG-7 format defined by IBM) in time for others to use in their search systems


New this year (2)

  • 25 topics for the search task,

    • developed by NIST
    • 4 weeks between release and submission
    • text, video, image and/or audio
  • Average precision added as a measure – a new emphasis on ranking (sketched after this list)

  • A common set of shot definitions

    • donated by CLIPS-IMAG, formatted by DCU
    • common units of retrieval for feature and search tasks
    • allowed pooling for assessment
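
As an illustration of the new ranking measure, a minimal sketch of non-interpolated average precision over a ranked list of shots (this is not the trec_eval implementation used for the official scores, and the shot IDs are hypothetical):

```python
# Average precision: sum of precision@k at the ranks k where a relevant
# shot appears, divided by the total number of relevant shots.
def average_precision(ranked_shots, relevant):
    hits, precisions = 0, []
    for k, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

# Relevant shots at ranks 1 and 3, out of 2 relevant in total -> AP ~0.83.
print(average_precision(["shot12_3", "shot12_7", "shot44_1"],
                        {"shot12_3", "shot44_1"}))
```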


New this year (3)

  • Searching was:

    • Interactive: full human access and iterations, or
    • Manual: a human with no knowledge of the test data gets a single attempt at formulating the topic as a search query
    • No fully automatic topic-to-query translation
  • Elapsed search time was added as a measure of effort for interactive search; groups also gathered data on searcher characteristics



The 17 groups and the tasks they completed



Video Data

  • Difficult to get video data for use in TREC because of copyright restrictions

  • Used mainly the Internet Archive

    • advertising, educational, industrial, amateur films 1930-1970
    • produced by corporations, non-profit organisations, trade groups, etc.
    • Noisy, strange color, but real archive data
    • 73.3 hours in total, partitioned across the tasks


2. Shot Boundary Detection task

  • Not a new problem, but a challenge because of gradual transitions and false positives caused by photo flashes, rapid camera or object movement

  • 4 hours, 51 minutes of documentary and educational material

  • Manually created ground truth of 2,090 transitions (thanks Jonathan) with 70% hard cuts, 25% dissolves, the rest fades to black and back, etc.

  • Up to 10 submissions per group, measured using precision and recall, with a bit of flexibility for matching gradual transitions
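
A minimal sketch of precision/recall scoring with that flexibility: a submitted transition matches a reference transition if their frame spans overlap within a small tolerance (the 5-frame tolerance and the greedy one-to-one matching are illustrative assumptions, not the track's exact rules):

```python
# Transitions are (first_frame, last_frame) pairs; a hard cut spans two
# adjacent frames, a gradual transition spans a longer interval.
def overlaps(a, b, tolerance=5):
    return a[0] <= b[1] + tolerance and b[0] <= a[1] + tolerance

def precision_recall(submitted, reference, tolerance=5):
    matched_ref = set()
    for sub in submitted:
        for i, ref in enumerate(reference):
            if i not in matched_ref and overlaps(sub, ref, tolerance):
                matched_ref.add(i)          # greedy one-to-one matching
                break
    precision = len(matched_ref) / len(submitted) if submitted else 0.0
    recall = len(matched_ref) / len(reference) if reference else 0.0
    return precision, recall

# One detected cut and one detected dissolve against three reference transitions.
print(precision_recall([(100, 101), (240, 270)],
                       [(100, 101), (250, 280), (400, 401)]))
```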



2001: Recall and precision for cuts



2002: Recall and precision for cuts



2001: Gradual Transitions



2002: Gradual Transitions



2001: Frame-recall & -precision for GTs



2002: Frame-recall & -precision for GTs



So, who did what? The approaches…



Shot Boundary Detection: approaches by group



3. Feature Extraction

  • Feature extraction (FE) is

    • interesting in itself, but its importance increases when it serves to help video navigation and search
  • Objective was to

    • begin work on benchmarking FE
    • allow exchange of feature detection output among participants
  • Task is as follows:

    • given a small standard dataset (5.02 hours, 1,848 shots) with common shot bounds,
    • locate up to 1,000 shots for each of 10 binary features
    • Feature frequency varied from “rare” to “everywhere”
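
A minimal sketch of assembling a run for this task: rank the common shots by a detector's confidence for each feature and keep at most 1,000. The detector scores below are hypothetical placeholders, and the actual donated output was exchanged in the MPEG-7 format defined by IBM rather than this structure:

```python
# scores_by_feature maps a feature id (1-10) to {shot_id: confidence score}.
def build_feature_run(scores_by_feature, max_shots=1000):
    run = {}
    for feature_id, scores in scores_by_feature.items():
        ranked = sorted(scores, key=scores.get, reverse=True)  # highest confidence first
        run[feature_id] = ranked[:max_shots]                   # at most 1,000 shots
    return run

# Feature 3 ("Face") scored for two shots of the 1,848 in the test set.
print(build_feature_run({3: {"shot10_2": 0.91, "shot10_5": 0.40}}))
```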


The Features

  • 1. Outdoors

  • 2. Indoors

  • 3. Face - 1+ human face with nose, mouth, 2 eyes

  • 4. People - 2+ humans, each at least partially visible

  • 5. Cityscape - city/urban/suburban setting

  • 6. Landscape - natural inland setting with no human development such as ploughing or crops

  • 7. Text Overlay - large enough to be read

  • 8. Speech - human voice uttering words

  • 9. Instrumental Sound - 1+ musical instruments

  • 10. Monologue - 1 person, partially visible, speaking for a long time without interruption



True shots contributed uniquely by each run

  • Small values imply lots of overlap between runs

  • Likely due to the size of the result set (1,000 shots) relative to the total test set (1,848 shots)
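
A minimal sketch of the "uniquely contributed" computation: a true shot counts for a run only if no other run also returned it (run names and shot IDs are hypothetical):

```python
# runs: {run_id: set of shot IDs returned}; true_shots: set of correct shot IDs.
def unique_true_shots(runs, true_shots):
    unique = {}
    for run_id, returned in runs.items():
        others = set().union(*(r for rid, r in runs.items() if rid != run_id))
        unique[run_id] = len((returned & true_shots) - others)
    return unique

print(unique_true_shots({"runA": {"s1", "s2", "s3"}, "runB": {"s2", "s4"}},
                        {"s1", "s2", "s4"}))   # {'runA': 1, 'runB': 1}
```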



AvgP by feature (runs at median or above)



Groups and Features



4. The Search Task

  • Task is similar to its text analogue…

    • topics are formatted descriptions of an information need
    • task is to return up to 100 shots that meet the need
  • Test data: 40.12 hours (14,524 common shots)

  • Features and/or ASR donated by CLIPS, DCU, IBM, Mediamill and MSRA

  • NIST assessors

    • judged top 50 shots from each submitted result set
    • subsequent full judgements showed only minor variations in performance
  • Used trec_eval to calculate measures
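
A minimal sketch of those two mechanics: pooling the top of each result set for judging, and writing a run in the line format trec_eval expects ("topic Q0 shot-id rank score run-id"). The topic number, shot IDs and run names are hypothetical:

```python
# Merge the top `depth` shots of every submitted result set into one pool
# of shots for the assessors to judge (depth 50 was used this year).
def pool_top_shots(result_sets, depth=50):
    pool = set()
    for ranked in result_sets.values():
        pool.update(ranked[:depth])
    return pool

# Format one topic's ranked shot list as trec_eval result lines.
def trec_eval_lines(topic, ranked_shots, run_id):
    return [f"{topic} Q0 {shot} {rank} {len(ranked_shots) - rank} {run_id}"
            for rank, shot in enumerate(ranked_shots, start=1)]

runs = {"sysA_1": ["shot20_4", "shot21_9"], "sysB_1": ["shot21_9", "shot33_2"]}
print(sorted(pool_top_shots(runs, depth=2)))
print(trec_eval_lines(75, runs["sysA_1"], "sysA_1"))
```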



Search Topics

  • 25 multimedia topics, created by NIST

  • 22 had video examples (avg. 2.7 each), 8 had image examples (avg. 1.9 each)

  • Requested shots with specific/generic:

    • People: George Washington; football players
    • Things: Golden Gate Bridge; sailboats
    • Locations: ---; overhead views of cities
    • Activities : ---; rocket taking off
    • Combinations of the above:
      • People spending leisure time at the beach
      • Locomotive approaching the viewer
      • Microscopic views of living cells


Search Types: Interactive and Manual



Manual runs: Top 10 (of 27)



Interactive runs: Top 10 (of 13)



Mean AvgP vs mean elapsed time



Search: Unique relevant shots from each run



Distribution of relevant shots: top vs. bottom halves of result sets



Max/median AvgP by topic - interactive



Max/median AvgP by topic - manual



Relevant shots by file id (topics 75-87)



Relevant shots by file id (topics 88-99)



The Groups and Searching



Groups doing the “Full Monty”



5. Conclusions

  • This track has grown significantly

    • … data, groups, tasks, measures, complexity
  • Donated features enabled many sites to take part and greatly enriched progress; this cannot be overstated: very collegiate and beneficial all round

  • Common shot definition

    • implications for measurement need closer look
    • seems to have been successful
  • The search task is becoming increasingly interactive, and we could do with guidance here

  • Evaluation framework has settled down – should be repeated on new data with only minor adjustments

  • Need more data (especially for feature extraction), more topics – looking at 120 hours of news video from 1998

  • Need to encourage progress on manual/automatic processing – how? Focus the evaluation more?

  • Probably ready to become a one-day pre-TREC workshop with a report-out/poster at TREC




