(The German translation of this definition is not yet complete.)
Feature extraction techniques in which additional (invariant) information is calculated from certain image regions, patches, or points that are visually more relevant for comparison or matching.
Feature extraction techniques in which information from multiple local image patches can be combined into a joint descriptor by using an approach called “bag of features” (from its origin in text document matching), “bag of visual features”, or “bag of visual words”.
Notes – technical background
These notes provide more information about the technical subject matter that is classified in this place:
1. The image regions referred to in this place are called “salient regions”, and the points are called “keypoints”, “interest points” or “salient points”. The information assigned to these regions or points is referred to as a local descriptor, reflecting the inherently local scope of the image analysis.
A local descriptor aims to be invariant to transformations of the depicted object (e.g. affine transforms, object deformations, or changes in image capturing conditions such as contrast or scene illumination).
A local descriptor may capture image characteristics across different scales for reliably detecting objects at different sizes, distances, or resolutions. Typical descriptors of this kind include scale-invariant descriptors such as SIFT.
At a salient point, the pixels in its immediate neighbourhood have visual characteristics that differ from those of the vast majority of the other pixels. The visual appearance of patches around a salient point is therefore somewhat unique; this uniqueness increases the chance of finding a similar patch in other images showing the same object.
Generally, salient points can be expected to be located at boundaries of objects and at other image regions having a strong contrast.
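The boundary behaviour described above can be sketched in a few lines. The following is a minimal, self-contained illustration (not a practical detector such as Harris or SIFT): it marks as "salient" every pixel of a tiny synthetic image whose gradient magnitude exceeds a threshold, and the detected points indeed fall on the object boundary, where contrast is strong.

```python
# Illustration: salient points tend to lie on object boundaries,
# where local contrast (gradient magnitude) is strong.

def gradient_magnitude(img):
    """Central-difference gradient magnitude for a 2-D list of floats."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def salient_points(img, threshold):
    """Return (x, y) coordinates of pixels with strong local contrast."""
    mag = gradient_magnitude(img)
    return [(x, y)
            for y, row in enumerate(mag)
            for x, m in enumerate(row) if m > threshold]

# 16x16 image: dark background (0.0) with a bright square (1.0)
# covering coordinates 4..11 in both directions.
img = [[1.0 if 4 <= y <= 11 and 4 <= x <= 11 else 0.0 for x in range(16)]
       for y in range(16)]
points = salient_points(img, threshold=0.25)
```

Every detected point lies within one pixel of the square's boundary; the uniform interior and background produce no salient points at all.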
2. A “bag of visual words” is a histogram that indicates the frequencies of patches with particular visual properties. These visual properties are expressed by a codebook, which is commonly obtained by clustering a collection of typical feature descriptors (e.g. SIFT features) in the feature space; each bin of the histogram corresponds to one specific cluster in the codebook.
The process of generating a bag of features typically involves:
A training phase comprising: extracting local feature descriptors from a set of training images, and clustering these descriptors in the feature space to obtain a codebook of visual words.
And an operating phase comprising: extracting local feature descriptors from an input image, assigning each descriptor to its nearest visual word in the codebook, and accumulating the assignments into a histogram (the bag of visual words).
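The two phases above can be sketched as follows. This is a pure-Python sketch, with the descriptor-extraction step replaced by synthetic 2-D "descriptors" so the example stays self-contained; a real system would cluster e.g. 128-dimensional SIFT descriptors.

```python
# Training phase: build a codebook by k-means clustering of descriptors.
# Operating phase: histogram of nearest-visual-word assignments.
import random

random.seed(0)

def dist2(a, b):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(descriptors, k, iters=20):
    """Training phase: cluster descriptors into a codebook of k visual words."""
    centers = random.sample(descriptors, k)
    for _ in range(iters):
        # assign each descriptor to its nearest center
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[min(range(k), key=lambda j: dist2(d, centers[j]))].append(d)
        # move each center to the mean of its group
        for j, g in enumerate(groups):
            if g:
                centers[j] = tuple(sum(c) / len(g) for c in zip(*g))
    return centers

def bag_of_words(descriptors, codebook):
    """Operating phase: one histogram bin per visual word in the codebook."""
    hist = [0] * len(codebook)
    for d in descriptors:
        hist[min(range(len(codebook)), key=lambda j: dist2(d, codebook[j]))] += 1
    return hist

# Synthetic training descriptors drawn around three cluster centres.
train = [(random.gauss(cx, 0.1), random.gauss(cy, 0.1))
         for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(30)]
codebook = kmeans(train, k=3)
hist = bag_of_words(train[:10], codebook)
```

Note that the resulting histogram has one bin per codebook cluster and its bins sum to the number of descriptors, regardless of where those descriptors came from in the image; this is the geometry loss discussed further below.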
Examples
Defining key-patches for different object classes from a training set, computing features from them, and using a set of support vector machine (SVM) classifiers to detect those objects in new images.
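A minimal sketch of this example, under simplifying assumptions: the bag-of-words histograms are hand-made toy data rather than computed from key-patches, and the SVM is a linear one trained by a Pegasos-style stochastic sub-gradient method (bias term omitted, as in the classic Pegasos formulation) instead of a library SVM.

```python
# Toy linear SVM (Pegasos-style sub-gradient training) on bag-of-words
# histograms; labels are +1 / -1.
import random

random.seed(1)

def train_svm(samples, labels, lam=0.01, epochs=200):
    """Return the weight vector of a linear SVM trained on (samples, labels)."""
    w = [0.0] * len(samples[0])
    t = 0
    for _ in range(epochs):
        for i in random.sample(range(len(samples)), len(samples)):
            t += 1
            eta = 1.0 / (lam * t)  # standard Pegasos step size
            margin = labels[i] * sum(wj * xj for wj, xj in zip(w, samples[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularisation shrink
            if margin < 1:  # hinge-loss sub-gradient step
                w = [wj + eta * labels[i] * xj for wj, xj in zip(w, samples[i])]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy 3-bin histograms: class +1 mostly uses visual word 0, class -1 word 2.
X = [[8, 1, 1], [7, 2, 1], [9, 0, 1], [1, 1, 8], [2, 1, 7], [0, 2, 8]]
y = [1, 1, 1, -1, -1, -1]
w = train_svm(X, y)
```

In a real detector there would be one such classifier per object class, each applied to the bag-of-words histogram of a candidate image region.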
| Subject matter | Group |
| --- | --- |
| Colour feature extraction | G06V 10/56 |
| Image preprocessing for image or video recognition or understanding involving the determination of a region or volume of interest [ROI, VOI] | G06V 10/25 |
| Global feature extraction, global invariant features (e.g. GIST) | G06V 10/42 |
| Local feature extraction; Extracting of specific shape primitives, e.g. corners, intersections; Computing saliency maps with interactions such as reinforcement or inhibition | G06V 10/44 |
| Local feature extraction, descriptors computed by performing operations within image blocks (e.g. HOG, LBP) | G06V 10/50 |
| Organisation of the matching process; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries | G06V 10/75 |
| Obtaining sets of training patterns, e.g. bagging | G06V 10/774 |
| Extracting salient feature points for character recognition | G06V 30/18 |
| Image retrieval systems using metadata | G06F 16/583 |
The present group does not cover biologically inspired approaches to feature extraction based on modelling the receptive fields of visual neurons (e.g. Gabor filters), nor convolutional neural networks (CNNs).
The use of neural networks for image or video pattern recognition or understanding is classified in group G06V 10/82.
When a document presents details on a sampling technique and a clustering technique (bagging), it should also be classified in group G06V 10/774.
Classical “bag of words” techniques remove most image localisation information (geometry).
When local features are matched directly from one image to another without involving a bagging technique (thereby retaining geometric information), e.g. when triplets of features are matched under a geometric transformation using a RANSAC algorithm, the document should also be classified in group G06V 10/75.
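Direct matching with geometry retained can be sketched as follows. This is an idealised toy (synthetic 2-D descriptors, exact copies in the second image, and a pure translation as the geometric model); a real pipeline would use e.g. SIFT keypoints and estimate an affine transform or homography.

```python
# Direct feature matching: keypoints from image A are matched to image B
# by nearest descriptor, then a RANSAC-style loop estimates a translation
# and keeps the hypothesis explaining the most matches (inliers).
import random

random.seed(2)

# Image A: 20 keypoints with 2-D positions and 2-D descriptors.
pts_a = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(20)]
desc_a = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(20)]

# Image B: the same object shifted by (12, -7), plus unrelated clutter points.
pts_b = [(x + 12.0, y - 7.0) for x, y in pts_a]
desc_b = list(desc_a)
for _ in range(5):
    pts_b.append((random.uniform(0, 100), random.uniform(0, 100)))
    desc_b.append((random.uniform(0, 1), random.uniform(0, 1)))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Match each A-keypoint to the B-keypoint with the nearest descriptor.
matches = [(i, min(range(len(desc_b)), key=lambda j: dist2(desc_a[i], desc_b[j])))
           for i in range(len(desc_a))]

def ransac_translation(matches, pts_a, pts_b, trials=100, tol=1.0):
    """Estimate a 2-D translation from matches, robust to wrong matches."""
    best_shift, best_inliers = None, -1
    for _ in range(trials):
        i, j = random.choice(matches)  # one match determines a translation
        shift = (pts_b[j][0] - pts_a[i][0], pts_b[j][1] - pts_a[i][1])
        inliers = sum(
            1 for a, b in matches
            if dist2((pts_a[a][0] + shift[0], pts_a[a][1] + shift[1]),
                     pts_b[b]) < tol)
        if inliers > best_inliers:
            best_shift, best_inliers = shift, inliers
    return best_shift, best_inliers

shift, inliers = ransac_translation(matches, pts_a, pts_b)
```

Because the estimated transformation must be consistent across all inlier matches, the relative positions of the keypoints are exploited, which is exactly the geometric information that a bag-of-words histogram discards.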