Seat No.:  
Enrolment No.  
GUJARAT TECHNOLOGICAL UNIVERSITY  
BE - SEMESTER–VII (NEW) EXAMINATION – WINTER 2021  
Subject Code:3171614  
Date:29/12/2021  
Subject Name: Computer Vision  
Time: 10:30 AM TO 01:00 PM  
Total Marks: 70  
Q.1(a) What is Computer Vision? Enlist its applications and explain any two of them. [03 marks]
Computer vision is the field of computer science that deals with how  
computers can be made to gain a high-level understanding from digital  
images or videos. It involves the development of algorithms and models  
that can analyze, interpret, and understand visual data from the world.  
There are many applications of computer vision, including:  
1. Image and video analysis: Computer vision algorithms can be used  
to analyze and interpret images and videos, extracting useful  
information and insights from them.  
2. Object recognition and tracking: Computer vision algorithms can  
be used to recognize and track objects in images and videos, which  
has applications in areas such as surveillance and robotics.  
3. Image and video enhancement: Computer vision algorithms can be  
used to improve the quality of images and videos, such as by  
removing noise or enhancing contrast.  
4. Autonomous vehicles: Computer vision plays a critical role in the  
development of autonomous vehicles, allowing them to perceive  
and understand their environment in order to make decisions  
about how to navigate.  
5. Medical imaging: Computer vision algorithms can be used to  
analyze medical images, such as CT scans and MRIs, to help  
diagnose and treat diseases.  
6. Augmented reality: Computer vision algorithms can be used to  
create augmented reality experiences, in which digital information  
is overlaid on top of the real world in real-time.  
7. Industrial inspection: Computer vision algorithms can be used to  
automate the inspection of industrial products, reducing the need  
for human inspection and improving the accuracy and efficiency of  
the process.  
8. Agriculture: Computer vision algorithms can be used to analyze  
images and videos of crops and farm animals to help with tasks  
such as identifying pests and diseases, and optimizing irrigation  
and fertilization.  
(b) What is Radiometry? Explain photometric image formation in detail. [04 marks]
Radiometry is the field of science that deals with the measurement of  
electromagnetic radiation, including visible light. Photometric image  
formation refers to the process of capturing an image using a camera, and  
it involves the measurement of the intensity of light at each point in the  
image.  
The basic principle of photometric image formation is that the intensity of  
light at each point in an image is proportional to the amount of light that  
is received at that point by the camera's image sensor. The image sensor  
consists of an array of pixels, each of which is sensitive to light and can  
measure the intensity of light that is received.  
When an image is taken with a camera, the lens of the camera focuses  
light from the scene onto the image sensor. The intensity of the light at  
each point in the image is then measured by the corresponding pixel on  
the image sensor. The values measured by the pixels are then recorded  
and used to create a digital image.  
There are several factors that can affect the accuracy of the measurements  
made by the image sensor, including the sensitivity of the pixels, the  
dynamic range of the sensor, and the accuracy of the lens. To obtain high-  
quality images, it is important to use a camera with a high-quality image  
sensor and lens.  
In addition to measuring the intensity of light, cameras can also measure  
other properties of light, such as its color and polarization. This allows for  
the creation of images with a wide range of colors and tones, and it  
enables the use of specialized imaging techniques, such as polarimetric  
imaging.  
(c) What do you understand by geometric 2D transformation in image formation? Explain with examples. [07 marks]
Geometric 2D transformations refer to a class of operations that can be  
applied to 2D images to modify their geometric properties, such as size,  
shape, and orientation. Some examples of geometric 2D transformations  
include:  
1. Scaling: Scaling refers to the process of changing the size of an  
image, either by making it larger or smaller. This can be done  
uniformly in both dimensions (isotropic scaling), or independently  
in each dimension (anisotropic scaling).  
2. Translation: Translation refers to the process of shifting an image  
horizontally or vertically. This can be done by adding a fixed offset  
to the position of each pixel in the image.  
3. Rotation: Rotation refers to the process of rotating an image  
around a fixed point (the center of rotation). The angle of rotation  
can be specified in degrees or radians.  
4. Shear: Shear refers to the process of distorting an image by  
stretching it in one direction and compressing it in the orthogonal  
direction. This can be done either horizontally or vertically.  
5. Affine transformation: An affine transformation is a combination of  
translation, scaling, and rotation. It can also include shearing, but it  
preserves straight lines and parallelism.  
6. Projective transformation: A projective transformation is a more  
general type of transformation that can include perspective  
distortion. It maps lines to lines, but it does not preserve  
parallelism.  
Geometric 2D transformations are often used in image processing and  
computer vision to correct for distortion and alignment errors, or to  
transform images into a more convenient coordinate system for further  
analysis. They can also be used for artistic purposes, such as to create  
stylized or distorted images.  
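To make this concrete, here is a minimal MATLAB sketch that builds a combined scale-rotate-translate matrix and applies it with the Image Processing Toolbox; the file name 'image.jpg' and all parameter values are assumptions for illustration, not part of the question:

% Combine scaling, rotation, and translation into one affine transformation
I = imread('image.jpg');          % placeholder image

theta = 30 * pi / 180;            % rotation angle (30 degrees), assumed
s = 1.5;                          % isotropic scale factor, assumed
tx = 40; ty = 20;                 % translation in pixels, assumed

% 3x3 homogeneous matrix in the row-vector convention used by affine2d
T = [ s*cos(theta)   s*sin(theta)  0
     -s*sin(theta)   s*cos(theta)  0
            tx             ty      1 ];

tform = affine2d(T);              % affine transformation object
J = imwarp(I, tform);             % warp the image with the transformation
imshow(J);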
Q.2(a) Define the terms: Image Digitization, Normalized cut and kernel. [03 marks]
1. Image digitization: Image digitization is the process of converting  
an analog image (such as a photograph or a painting) into a digital  
format, typically by scanning it or taking a digital photograph of it.  
The resulting digital image is a numerical representation of the  
image that can be stored, transmitted, and processed by a  
computer.  
2. Normalized cut: Normalized cut is a technique used in image  
segmentation, which is the process of dividing an image into  
different regions or segments. The goal of normalized cut is to  
divide the image into regions such that the total weight of the  
edges connecting the regions is minimized, while the total weight  
of the edges within each region is maximized. This helps to ensure  
that the regions are coherent and homogeneous, and that the  
boundaries between the regions are well-defined.  
3. Kernel: In the context of image processing and computer vision, a  
kernel is a small matrix of numbers that is used to apply a  
convolutional operation to an image. The kernel is multiplied  
element-wise with a patch of the image, and the resulting values  
are summed to produce a single output pixel. Different types of  
kernels can be used to implement different types of image  
processing operations, such as blurring, sharpening, and edge  
detection. Kernels are also known as filters or convolution masks.  
(b) What is convolution? Explain the process of image convolution with example. [04 marks]
Convolution is a mathematical operation that is widely used in image  
processing and computer vision to modify the appearance of an image or  
to extract features from it. It involves the application of a kernel, which is a  
small matrix of numbers, to each patch of an image, in order to produce a  
new image.  
The process of image convolution can be described as follows:  
1. Define the kernel: The kernel is a small matrix of numbers that  
defines the convolution operation. It is typically square, with  
dimensions ranging from 3x3 to 7x7, although larger or smaller  
kernels can also be used.  
2. Slide the kernel over the image: The kernel is overlaid on top of the  
image, and it is moved from left to right and top to bottom,  
covering the entire image. At each position, the kernel is centered  
on a particular pixel in the image, and it is multiplied element-wise  
with the pixel values in a small patch of the image centered on that  
pixel.  
3. Sum the products: The element-wise products are summed to  
produce a single output value for the current position of the kernel.  
This output value is then assigned to the corresponding pixel in the  
output image.  
4. Repeat for all positions: The process is repeated for all positions of  
the kernel on the image, until the entire image has been convolved.  
For example, consider the following 3x3 kernel:

[ 1  0  -1 ]
[ 2  0  -2 ]
[ 1  0  -1 ]
If this kernel is applied to an image using convolution, it responds to horizontal changes in intensity and therefore highlights vertical edges in the image. At each position, the kernel is multiplied element-wise with the pixel values in the 3x3 patch of the image centered on the current position. The resulting products are summed, and the sum is assigned to the corresponding pixel in the output image. The output has a large magnitude where there are strong vertical edges in the input image, and values near zero in uniform regions or where the edges are not vertical.
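A minimal MATLAB sketch of this convolution, assuming an arbitrary input file 'image.jpg' (not specified in the question):

% Convolve a grayscale image with the 3x3 kernel from the example above
I = im2double(rgb2gray(imread('image.jpg')));

K = [1 0 -1; 2 0 -2; 1 0 -1];              % the example kernel

J = imfilter(I, K, 'conv', 'replicate');   % true convolution (kernel is flipped)
% Base-MATLAB alternative: J = conv2(I, K, 'same');

imshow(mat2gray(abs(J)));                  % visualize the magnitude of the response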
(c) Differentiate between low pass filtering and high pass filtering. [07 marks]
Definition: Low pass filtering removes high frequency components from an image; high pass filtering removes low frequency components from an image.
Purpose: Low pass filtering blurs or smooths an image; high pass filtering enhances or sharpens an image.
Kernel shape: A low pass kernel typically has a smooth, Gaussian shape; a high pass kernel typically has a sharp, angular shape.
Examples of applications: Low pass - blurring, smoothing, noise reduction; high pass - sharpening, edge detection, edge enhancement.
Low pass filtering is a type of image processing operation that is used to  
remove high frequency components from an image. It is typically used to  
blur or smooth an image, or to reduce noise. Low pass filters are  
implemented using kernels that have a smooth, Gaussian shape, and they  
are designed to pass low frequency components of the image while  
attenuating high frequency components.  
High pass filtering is a type of image processing operation that is used to  
remove low frequency components from an image. It is typically used to  
enhance or sharpen an image, or to highlight fine details and edges. High  
pass filters are implemented using kernels that have a sharp, angular  
shape, and they are designed to pass high frequency components of the  
image while attenuating low frequency components.  
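As a hedged illustration (the kernel choices, parameter values, and 'image.jpg' below are assumptions), the two kinds of filter can be compared in MATLAB as follows:

% Low pass versus high pass filtering with simple kernels
I = im2double(rgb2gray(imread('image.jpg')));

lp_kernel = fspecial('gaussian', 5, 1);    % smooth Gaussian kernel (low pass)
hp_kernel = fspecial('laplacian', 0.2);    % Laplacian kernel (high pass)

I_low  = imfilter(I, lp_kernel, 'replicate');   % blurred, smoothed image
I_high = imfilter(I, hp_kernel, 'replicate');   % edges and fine detail

imshowpair(I_low, I_high, 'montage');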
OR  
(c) How do you perform the filtering process in the frequency domain? Show the step-by-step process with a clear diagram. Explain the Butterworth Low Pass filter in the frequency domain. [07 marks]
Filtering in the frequency domain is based on the convolution theorem: convolution in the spatial domain corresponds to element-wise multiplication in the frequency domain. The filtering process therefore multiplies the Fourier transform of the input image by the frequency response (transfer function) of the filter, a 2D function that describes how the filter affects the different frequency components of the image.
Here is a step-by-step description of the process of filtering an image in  
the frequency domain:  
1. Compute the Fourier transform of the input image: The Fourier  
transform of an image is a complex-valued function that represents  
the image in the frequency domain. It is calculated by applying the  
discrete Fourier transform (DFT) to the image.  
2. Design the frequency response of the filter: The frequency  
response of the filter is a 2D function that specifies how the filter  
should modify the different frequency components of the image. It  
is typically designed to pass certain frequencies while attenuating  
others.  
3. Multiply the Fourier transform of the image by the frequency  
response of the filter: This is done element-wise, at each point in  
the frequency domain. The result of this multiplication is the  
filtered image in the frequency domain.  
4. Compute the inverse Fourier transform: The inverse Fourier  
transform is applied to the filtered image in the frequency domain  
to obtain the filtered image in the spatial domain. This is the final  
output of the filtering process.  
Here is an example of a Butterworth low pass filter in the frequency  
domain:  
The Butterworth low pass filter is a type of filter that is used to remove  
high frequency components from an image. It has a smooth, monotonic  
transition between the passband and the stopband, and it provides good  
attenuation of high frequency components while minimizing distortion of  
the low frequency components.  
The frequency response of a Butterworth low pass filter can be described  
by the following equation:  
H(u, v) = 1 / (1 + (D(u, v) / D0)^(2n))  
where H(u, v) is the frequency response at frequency (u, v), D(u, v) is the distance of the point (u, v) from the centre of the (shifted) frequency rectangle, D0 is the cutoff frequency, and n is the order of the filter.
[Figure: frequency response of a Butterworth low pass filter for different values of the cutoff frequency D0.]
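A minimal MATLAB sketch of the whole frequency-domain pipeline with a Butterworth low pass transfer function; the cutoff D0 = 30, the order n = 2, and the file name are assumed values for illustration:

% Frequency-domain filtering with a Butterworth low pass filter
I = im2double(rgb2gray(imread('image.jpg')));
[M, N] = size(I);

% Step 1: Fourier transform of the image, shifted so zero frequency is centred
F = fftshift(fft2(I));

% Step 2: Butterworth low pass transfer function H = 1 ./ (1 + (D/D0).^(2n))
D0 = 30;                                    % cutoff frequency (assumed)
n  = 2;                                     % filter order (assumed)
[u, v] = meshgrid(1:N, 1:M);
D = sqrt((u - N/2).^2 + (v - M/2).^2);      % distance from the centre
H = 1 ./ (1 + (D ./ D0).^(2 * n));

% Step 3: element-wise multiplication in the frequency domain
G = H .* F;

% Step 4: inverse transform back to the spatial domain
I_filtered = real(ifft2(ifftshift(G)));
imshow(I_filtered);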
Q.3(a) Discuss active contour technique for segmentation. [03 marks]
Active contour, also known as snakes, is a technique for image  
segmentation that involves the evolution of a curve or contour to fit a  
desired shape in an image. It is an iterative process that adjusts the shape  
of the contour at each step based on the gradient of the image intensity,  
as well as external constraints or forces that encourage the contour to  
take on a particular shape.  
The basic steps of the active contour technique are as follows:  
1. Initialize the contour: The contour is initialized at a starting position  
in the image, typically by specifying a set of control points or by  
drawing a rough outline of the desired shape.  
2. Compute the gradient of the image intensity: The gradient of the  
image intensity is calculated at each point on the contour, and it is  
used to guide the evolution of the contour towards regions of high  
image contrast.  
3. Update the position of the contour: The position of the contour is  
updated based on the gradient of the image intensity and the  
external forces acting on the contour. These forces can include  
terms that encourage the contour to take on a particular shape or  
size, or that penalize deviation from the desired shape.  
4. Repeat until convergence: The process is repeated until the contour  
reaches a desired level of convergence, or until a maximum  
number of iterations is reached.  
Active contour techniques are widely used in image segmentation and  
object tracking, and they have the advantage of being able to follow  
complex shapes and adapt to changes in the image over time. However,  
they can be sensitive to initialization and may require careful tuning of the  
external forces to obtain good results.  
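A minimal sketch using MATLAB's built-in activecontour function (which offers edge-based and Chan-Vese region-based evolution); the initial mask, the iteration count, and the file name are assumptions:

% Active contour segmentation starting from a rough initial region
I = rgb2gray(imread('image.jpg'));

% Step 1: initialize the contour as a rectangular mask over the central region
mask = false(size(I));
mask(round(end/4):round(3*end/4), round(end/4):round(3*end/4)) = true;

% Steps 2-4: evolve the contour for a fixed number of iterations
bw = activecontour(I, mask, 300, 'edge');   % 'edge' uses image gradient information

imshowpair(I, bw, 'montage');               % image and final segmentation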
(b) What is descriptor? Explain SIFT descriptor in detail. [04 marks]
A descriptor is a mathematical representation of a set of features or  
characteristics of an image or an image patch. It is used in image  
processing and computer vision to capture the appearance or shape of an  
object or a region in the image, and to enable comparison or matching  
with other images or regions.  
The Scale Invariant Feature Transform (SIFT) descriptor is a widely used  
descriptor in computer vision that is designed to be robust to changes in  
scale, orientation, and affine distortion. It was developed by David Lowe in  
1999 and has since become a standard method for feature extraction and  
matching in a variety of applications.  
The SIFT descriptor works by extracting a set of keypoints from an image,  
which are locations in the image that are distinctive and stable under  
various image transformations. These keypoints are then used to compute  
the SIFT descriptor, which is a vector of local image features around the  
keypoint.  
The SIFT descriptor is computed as follows:  
1. Detect keypoints: Keypoints are detected in the image using a  
scale-space extrema detection algorithm, which is designed to find  
locations in the image that are stable under scale changes and  
affine distortion.  
2. Compute the scale and orientation: The scale and orientation of  
each keypoint is determined using the Difference of Gaussians  
(DoG) scale-space representation of the image.  
3. Compute the gradient orientation histograms: The region around the keypoint is divided into a 4x4 grid of subregions, and for each subregion an 8-bin histogram of gradient orientations is accumulated, weighted by the gradient magnitudes of the pixels.
4. Normalize the histogram: The resulting vector is normalized to reduce the influence of illumination changes and to enhance the contrast of the features.
5. Construct the descriptor vector: The histograms of the 4x4 subregions are concatenated to form the final 128-dimensional SIFT descriptor for that keypoint.
The SIFT descriptor is a powerful tool for image matching and object  
recognition, and it has been used in a wide range of applications,  
including image retrieval, 3D reconstruction, and object tracking. It is  
particularly useful for handling difficult cases such as partial occlusion,  
clutter, and noise.  
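A minimal sketch of detecting keypoints and extracting SIFT descriptors with MATLAB's Computer Vision Toolbox (detectSIFTFeatures requires R2021b or later; 'image.jpg' is a placeholder):

% Detect SIFT keypoints and compute their 128-dimensional descriptors
I = rgb2gray(imread('image.jpg'));

points = detectSIFTFeatures(I);                          % keypoint detection
[descriptors, validPoints] = extractFeatures(I, points); % one descriptor per keypoint

% Overlay the 50 strongest keypoints on the image
imshow(I); hold on;
plot(validPoints.selectStrongest(50));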
(c) What is histogram? Explain histogram equalization algorithm. Write MATLAB code for calculation of histogram and histogram equalization. [07 marks]
A histogram is a graphical representation of the distribution of data in a  
dataset. It is a graph that shows the frequency or number of occurrences  
of different values in the dataset. In image processing and computer  
vision, histograms are often used to analyze the distribution of pixel  
intensities in an image, and to understand the global and local contrast  
and brightness of the image.  
Histogram equalization is an image processing technique that is used to enhance the contrast and improve the overall appearance of an image. It works by redistributing the intensity values of the pixels so that the resulting histogram is more uniformly distributed, with a more balanced spread of light and dark pixels. Concretely, each input intensity r is mapped to an output intensity s using the cumulative distribution function (CDF) of the image histogram, s = round((L - 1) * CDF(r)), where L is the number of grey levels. This stretches the dynamic range of the image, making details more visible and improving the contrast.
Here is an example of Matlab code for calculating the histogram and  
performing histogram equalization on an image:  
% Load the image
I = imread('image.jpg');
% Convert the image to grayscale
I = rgb2gray(I);
% Calculate the histogram of the image
[counts, bins] = imhist(I);
% Plot the histogram
bar(bins, counts);
% Perform histogram equalization
I_eq = histeq(I);
% Calculate the histogram of the equalized image
[counts_eq, bins_eq] = imhist(I_eq);
% Plot the histogram of the equalized image
bar(bins_eq, counts_eq);
The first block of code loads the image and converts it to grayscale, which  
is necessary for histogram equalization. The second block calculates the  
histogram of the image using the imhist function, and plots the histogram  
using the bar function. The third block performs histogram equalization  
on the image using the histeq function, and the fourth block calculates  
and plots the histogram of the equalized image.  
To summarize, histogram equalization is a technique that is used to  
enhance the contrast of an image by redistributing the intensity values of  
the pixels such that the resulting histogram is more uniformly distributed.  
It can be implemented in Matlab using the histeq function, which takes an  
input image and returns an equalized version of the image. The histogram  
of the equalized image can then be plotted using the imhist and bar  
functions, as shown in the example code above.  
OR  
Q.3(a) What is segmentation? Explain graph based segmentation in detail. [03 marks]
Segmentation is the process of dividing an image into distinct regions or  
segments, each of which corresponds to a different object or background  
in the image. It is an important step in image processing and computer  
vision, as it allows objects in the image to be separated and analyzed  
individually.  
Graph based segmentation is a type of image segmentation method that  
uses a graph representation of the image to divide it into segments. The  
graph consists of a set of vertices that represent the pixels in the image,  
and edges that connect the vertices and represent the relationships  
between the pixels. The goal of graph based segmentation is to find a  
partition of the graph into disjoint sets of vertices, such that the vertices  
within each set are more similar to each other than to vertices in other  
sets.  
The process of graph based segmentation can be divided into the  
following steps:  
1. Construct the graph: The graph is constructed by assigning a vertex  
to each pixel in the image, and connecting the vertices with edges  
based on some criterion of similarity or proximity. The edges can  
be weighted to reflect the degree of similarity between the pixels.  
2. Compute the affinity matrix: The affinity matrix is a square matrix  
that contains the weights of the edges in the graph. It is used to  
represent the relationships between the pixels in the image.  
3. Normalize the affinity matrix: The affinity matrix is normalized to  
ensure that it has certain desirable properties, such as symmetry  
and row stochasticity.  
4. Compute the degree matrix: The degree matrix is a diagonal matrix  
that contains the sum of the weights of the edges incident to each  
vertex. It is used to represent the importance of each vertex in the  
graph.  
5. Compute the Laplacian matrix: The Laplacian matrix is a matrix that  
encodes the structure of the graph. It is computed as the difference  
between the degree matrix and the affinity matrix.  
6. Compute the eigenvectors of the Laplacian matrix: The  
eigenvectors of the Laplacian matrix capture the inherent structure  
of the graph, and they can be used to partition the graph into  
segments.  
7. Assign each pixel to a segment: The pixels in the image are  
assigned to segments based on their corresponding vertices in the  
graph, and the resulting segmentation is output.  
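The steps above can be illustrated with a deliberately tiny, dense MATLAB sketch (a toy spectral example with assumed affinity parameters; real images need sparse matrices and the full normalized-cut machinery):

% Toy graph-based segmentation of a small image into two segments
I = im2double(imresize(rgb2gray(imread('image.jpg')), [20 20]));  % tiny image
n = numel(I);
[x, y] = meshgrid(1:20, 1:20);

% Steps 1-2: affinity matrix combining intensity similarity and spatial proximity
sigma_i = 0.1; sigma_x = 2;                 % assumed scale parameters
W = zeros(n);
for p = 1:n
    for q = 1:n
        d = hypot(x(p) - x(q), y(p) - y(q));
        if d < 5                            % connect only nearby pixels
            W(p, q) = exp(-(I(p) - I(q))^2 / sigma_i^2) * exp(-d^2 / sigma_x^2);
        end
    end
end

% Steps 4-5: degree matrix and graph Laplacian
Deg = diag(sum(W, 2));
Lap = Deg - W;

% Step 6: eigenvectors of the generalized problem Lap*v = lambda*Deg*v
[V, E] = eig(Lap, Deg, 'chol');
[~, order] = sort(diag(E));
v2 = V(:, order(2));                        % second-smallest ("Fiedler") eigenvector

% Step 7: threshold the eigenvector to assign each pixel to a segment
labels = reshape(v2 > 0, size(I));
imagesc(labels);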
(b) Explain region splitting and region merging in image segmentation. [04 marks]
Region splitting and region merging are two strategies that are commonly  
used in image segmentation algorithms to divide an image into multiple  
regions or segments.  
Region splitting involves dividing a large region or superpixel into smaller  
regions based on some criterion of dissimilarity or boundary strength. This  
is typically done by identifying points of high contrast or strong edges  
within the region, and using these points to split the region into multiple  
subregions. Region splitting can be useful for preserving fine details and  
boundaries in the image, and for improving the accuracy of the  
segmentation.  
Region merging, on the other hand, involves merging smaller regions or  
subregions into larger regions based on some criterion of similarity or  
homogeneity. This is typically done by comparing the properties of the  
regions, such as their intensity, color, or texture, and merging regions that  
are similar to each other. Region merging can be useful for reducing noise  
and eliminating small isolated regions, and for improving the smoothness  
and coherence of the segmentation.  
Both region splitting and region merging can be useful for improving the  
quality of the image segmentation, and they are often used in  
combination with other segmentation algorithms to achieve good results.  
However, they can also introduce errors and artifacts into the  
segmentation if they are not used carefully, and they may require fine-  
tuning of the parameters to obtain good results.  
(c) Explain K-means and Gaussian Mixture Model in detail. [07 marks]
K-means is an unsupervised machine learning algorithm that is used for  
clustering. It works by partitioning a dataset into a predefined number of  
clusters, based on the similarity of the data points within each cluster. The  
goal of K-means is to find a partition of the data that minimizes the sum  
of squared distances between the data points and the centroid (mean) of  
their respective clusters.  
The K-means algorithm consists of the following steps:  
1. Specify the number of clusters: The number of clusters to be  
formed is specified by the user. This is an important parameter that  
can affect the performance of the algorithm.  
2. Initialize the centroids: The centroids of the clusters are initialized  
at random locations in the feature space.  
3. Assign each data point to the nearest centroid: Each data point is  
assigned to the cluster whose centroid is closest to it, based on  
some distance measure.  
4. Update the centroids: The centroids of the clusters are updated to  
the mean of the data points assigned to each cluster.  
5. Repeat steps 3 and 4 until convergence: The process is repeated  
until the centroids converge, or until a maximum number of  
iterations is reached.  
K-means is a simple and efficient algorithm that is widely used for  
clustering and feature extraction in a variety of applications. It is sensitive  
to the initial location of the centroids, and it may not always find the  
global optimum solution. However, it is often used as a baseline method  
for comparison with other clustering algorithms.  
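A minimal sketch of K-means clustering of pixel intensities in MATLAB (kmeans is in the Statistics and Machine Learning Toolbox; K = 3, the downsampling, and 'image.jpg' are assumptions):

% Segment an image by K-means clustering of its gray-level intensities
I = im2double(rgb2gray(imread('image.jpg')));
I = imresize(I, 0.25);              % downsample to keep the toy example fast

data = I(:);                        % one feature (intensity) per pixel
K = 3;                              % number of clusters, chosen by the user

[idx, centroids] = kmeans(data, K); % steps 2-5: initialize, assign, update, repeat

segmented = reshape(idx, size(I));  % label image: each pixel gets its cluster index
imagesc(segmented); colorbar;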
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes  
that the data is generated from a mixture of several underlying Gaussian  
distributions. It is a flexible and powerful model that can be used for  
clustering, density estimation, and classification.  
A GMM consists of a set of K Gaussian distributions, each of which is  
characterized by its mean vector and covariance matrix. The parameters of  
the GMM are learned from the data using an expectation-maximization  
(EM) algorithm, which estimates the parameters of the model that  
maximize the likelihood of the data.  
The EM algorithm consists of the following steps:  
1. Initialize the parameters of the model: The means and covariances  
of the Gaussian distributions are initialized randomly or using some  
heuristic method.  
2. Compute the probabilities of the data points: For each data point,  
the probability of belonging to each of the K Gaussian distributions  
is computed using the current estimates of the model parameters.  
3. Update the parameters of the model: The means and covariances  
of the Gaussian distributions are updated using the probabilities  
computed in the previous step.  
4. Repeat steps 2 and 3 until convergence: The process is repeated  
until the model parameters converge, or until a maximum number  
of iterations is reached.  
GMM is a widely used model for clustering and density estimation, and it  
has the advantage of being able to capture complex distributions and  
handle mixed data types. It is also flexible and can be extended to  
incorporate additional constraints or priors on the model parameters.  
However, it can be sensitive to initialization and may require careful  
tuning of the parameters to obtain good results.  
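Similarly, a hedged sketch of fitting a GMM to pixel intensities with the EM algorithm (fitgmdist is in the Statistics and Machine Learning Toolbox; K = 3 and the regularization value are assumed):

% Fit a Gaussian Mixture Model to gray-level intensities and segment by component
I = im2double(rgb2gray(imread('image.jpg')));
data = I(:);

K = 3;                                                   % assumed number of components
gmm = fitgmdist(data, K, 'RegularizationValue', 1e-4);   % EM fitting (steps 1-4)

labels = cluster(gmm, data);                             % hard assignment of each pixel
segmented = reshape(labels, size(I));
imagesc(segmented);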
Q.4(a) What is watershed? Explain watershed segmentation. [03 marks]
Watershed is a type of image segmentation algorithm that is based on the  
idea of flooding basins in an image from markers or seeds, until the  
basins merge or reach certain criteria. It is a powerful method for  
extracting objects and boundaries from images, and it has been widely  
used in image processing and computer vision.  
The watershed algorithm consists of the following steps:  
1. Compute the gradient of the image: The gradient of the image is  
computed to identify the locations of strong edges or boundaries  
in the image.  
2. Identify the markers or seeds: The markers or seeds are locations in  
the image that correspond to objects or regions of interest. They  
can be chosen manually or automatically using some criterion, such  
as intensity or texture.  
3. Flood the basins: The basins around the markers are flooded with  
different colors or labels, until they reach the boundaries or other  
markers. The flooding process is typically performed using a  
priority queue or stack, which determines the order in which the  
basins are merged.  
4. Extract the watersheds: The watersheds or boundaries between the  
basins are extracted from the image, and the resulting  
segmentation is output.  
Watershed segmentation is a powerful and flexible method for image  
segmentation, and it has been used in a wide range of applications,  
including medical imaging, microscopy, and satellite imagery. It is  
particularly useful for handling complex and noisy images, and for  
extracting objects and boundaries with high accuracy. However, it can be  
sensitive to initialization and may require careful tuning of the parameters  
to obtain good results.  
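A minimal MATLAB sketch of gradient-based watershed segmentation (the minima-suppression height 0.05 and 'image.jpg' are assumed; in practice marker selection is needed to control over-segmentation):

% Watershed segmentation of the gradient magnitude image
I = im2double(rgb2gray(imread('image.jpg')));

g = imgradient(I);                  % step 1: gradient magnitude
g = imhmin(g, 0.05);                % simplified marker step: suppress shallow minima

Lw = watershed(g);                  % steps 3-4: flood the basins; 0 marks ridge lines
imshow(labeloverlay(I, Lw));        % overlay the catchment-basin labels on the image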
(b) Explain Pixel transform and color transform of image with an example. [04 marks]
Pixel transform is a type of image processing operation that involves  
modifying the pixel values of an image in some way, such as scaling,  
rotating, or thresholding. It is a basic operation that is used to modify the  
appearance or contrast of the image, or to extract certain features or  
characteristics of the image.  
An example of pixel transform is image scaling, which involves resizing the  
image by changing the number of pixels in the image. Image scaling can  
be performed using interpolation techniques, such as nearest neighbor,  
bilinear, or bicubic interpolation, which determine how the new pixels are  
calculated from the old ones.  
Here is an example of image scaling using MATLAB:

% Load the image
I = imread('image.jpg');
% Scale the image by a factor of 2
I_scaled = imresize(I, 2);
% Scale the image by a factor of 0.5
I_scaled = imresize(I, 0.5);
Color transform is a type of image processing operation that involves  
converting the color space of an image from one representation to  
another. It is used to adjust the appearance or contrast of the image, or to  
enable certain image processing operations that are sensitive to the color  
space of the image.  
An example of color transform is image color balance, which involves  
adjusting the relative proportions of the color channels in the image to  
achieve a desired balance or appearance. Image color balance can be  
performed using various techniques, such as global color balance, which  
adjusts the overall color balance of the image, or local color balance,  
which adjusts the color balance in different regions of the image.  
Here is an example of image color balance using MATLAB:

% Load the image
I = imread('image.jpg');
% Convert the image to the CIELAB color space
I_lab = rgb2lab(I);
% Adjust the color balance of the image
% (adjust_color_balance is not a built-in function; it stands for a
%  user-defined routine that, for example, rescales the a* and b* channels)
I_balanced = adjust_color_balance(I_lab);
% Convert the balanced image back to the RGB color space
I_balanced = lab2rgb(I_balanced);
(c) What is Edge detection? Explain Canny edge detection algorithm and write a MATLAB code to implement this algorithm. [07 marks]
Edge detection is a type of image processing operation that involves  
detecting the boundaries or edges of objects in an image. It is an  
important step in image analysis and computer vision, as it allows objects  
in the image to be separated and distinguished from each other, and it  
can be used to extract features or characteristics of the objects.  
Canny edge detection is an edge detection algorithm developed by John  
Canny in 1986. It is a widely used and effective algorithm that is known for  
its good performance and robustness to noise. The Canny edge detector  
consists of the following steps:  
1. Noise reduction: The image is smoothed using a Gaussian filter to  
reduce noise and smooth the edges.  
2. Gradient computation: The gradient of the image is computed  
using a Sobel operator or other edge detection filter to identify the  
locations of strong edges.  
3. Non-maximum suppression: The gradient magnitude image is thinned by keeping only pixels that are local maxima along the gradient direction, so that edges become one pixel wide.
4. Double thresholding: The remaining edge pixels are classified as strong or weak edges using two threshold values.
5. Edge tracking by hysteresis: Weak edge pixels are kept only if they are connected to strong edge pixels; the strong edges are traced along their length to form continuous edges.
Here is an example of Canny edge detection in MATLAB:

% Load the image
I = imread('image.jpg');
% Convert the image to grayscale
I = rgb2gray(I);
% Set the parameters for the Canny edge detector
sigma = 1;
low_threshold = 0.05;
high_threshold = 0.1;
% Apply the Canny edge detector
edges = edge(I, 'canny', [low_threshold, high_threshold], sigma);
% Display the resulting edge map
imshow(edges);
OR  
Q.4(a) Explain shape context descriptors. [03 marks]
Shape context descriptors are a type of feature descriptor that is used to  
represent the shape of an object in an image. They are based on the idea  
of comparing the relative positions of points on the shape, and they are  
robust to small variations in scale, orientation, and position.  
A shape context descriptor consists of a set of points on the shape, called  
keypoints, and a histogram that encodes the relative positions of the  
points. The keypoints are chosen based on some criterion, such as the  
locations of high curvature or corners, and they are used to represent the  
salient features of the shape. The histogram is calculated by comparing  
the relative positions of the points using a distance metric, such as the  
Euclidean distance or the angular distance, and it is used to capture the  
overall structure and layout of the shape.  
Shape context descriptors are used in a variety of applications, including  
object recognition, image matching, and shape analysis. They are  
particularly useful for handling shapes with complex or irregular contours,  
and for dealing with noise and occlusions. However, they can be sensitive  
to the choice of keypoints and may require careful tuning of the  
parameters to obtain good results.  
(b) Name two morphological operations and explain them with examples. [04 marks]
Morphological operations are image processing techniques that involve  
the manipulation of the shape and structure of objects in an image. They  
are based on the idea of applying simple operations, such as dilation,  
erosion, or opening, to the pixels of an image to extract or modify the  
shapes of the objects.  
Here are two examples of morphological operations:  
1. Dilation: Dilation is an operation that involves expanding the shape  
of an object by adding pixels to its boundaries. It is often used to  
fill in small gaps or holes in the object, or to connect isolated  
pixels.  
Here is an example of dilation in MATLAB:

% Load the image and create a structuring element
I = imread('image.jpg');
se = strel('square', 3);
% Apply dilation to the image
I_dilated = imdilate(I, se);
2. Erosion: Erosion is an operation that involves shrinking the shape of  
an object by removing pixels from its boundaries. It is often used to  
thin or skeletonize the object, or to remove small isolated pixels.  
Here is an example of erosion in MATLAB:

% Load the image and create a structuring element
I = imread('image.jpg');
se = strel('square', 3);
% Apply erosion to the image
I_eroded = imerode(I, se);
Morphological operations are simple but powerful tools for image  
processing and analysis, and they are widely used in a variety of  
applications, including object recognition, image segmentation, and  
pattern recognition. They are particularly useful for handling noise and  
complex shapes, and for extracting features and characteristics of objects  
in the image.  
(c) What is corner detection? Explain Moravec corner detection algorithm and write a MATLAB code to implement this algorithm. [07 marks]
Corner detection is a type of image processing operation that  
involves identifying the corners or interest points in an image.  
Corners are points in the image that have a high degree of  
uniqueness or distinctive features, and they are often used as  
keypoints for object recognition or image matching.  
Moravec corner detection is an algorithm developed by Hans Moravec in 1977 that is used to detect corners in an image. It is based on comparing a small window around a pixel with windows shifted by a small amount in several directions: a corner is a point where every such shift produces a large change in intensity. It is simple and fast, although it responds anisotropically and is sensitive to noise, which later detectors such as the Harris detector improved on. The Moravec corner detector consists of the following steps:
1. Place a window: For each pixel, consider a small window (for example 3x3 or 5x5) centered on the pixel.
2. Compute the corner strength: Shift the window by one pixel in several directions (horizontal, vertical, and the two diagonals) and compute the sum of squared intensity differences (SSD) between the original and shifted windows for each shift. The corner strength of the pixel is the minimum SSD over all shift directions.
3. Threshold: The corner strength map is thresholded to discard weak responses.
4. Non-maximum suppression: Only local maxima of the corner strength map are kept, and these are reported as corners. Optionally, corner locations can be refined to sub-pixel accuracy, for example by quadratic interpolation.
Here is an example of how to invoke Moravec corner detection in MATLAB:

% Load the image
I = imread('image.jpg');
% Convert the image to grayscale
I = rgb2gray(I);
% Set the parameters for the Moravec corner detector
window_size = 3;
threshold = 100;
% Apply the Moravec corner detector
% (corner_moravec is not a built-in MATLAB function; it must be
%  implemented by the user, following the steps described above)
corners = corner_moravec(I, window_size, threshold);
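Since corner_moravec is not provided by MATLAB, here is a hedged sketch of how it could be implemented, following the steps described above (save it as corner_moravec.m; the shift set and the suppression neighbourhood are implementation choices):

function corners = corner_moravec(I, window_size, threshold)
% CORNER_MORAVEC  Minimal Moravec corner detector (illustrative sketch).
I = double(I);
[rows, cols] = size(I);
r = floor(window_size / 2);
shifts = [1 0; 0 1; 1 1; 1 -1];              % shift directions (h, v, two diagonals)
strength = zeros(rows, cols);

for yy = r + 2 : rows - r - 1
    for xx = r + 2 : cols - r - 1
        w = I(yy-r:yy+r, xx-r:xx+r);         % window centred on (xx, yy)
        ssd = zeros(size(shifts, 1), 1);
        for k = 1:size(shifts, 1)
            dy = shifts(k, 1); dx = shifts(k, 2);
            ws = I(yy-r+dy:yy+r+dy, xx-r+dx:xx+r+dx);   % shifted window
            ssd(k) = sum((w(:) - ws(:)).^2);            % sum of squared differences
        end
        strength(yy, xx) = min(ssd);         % corner strength = minimum SSD
    end
end

% Threshold and non-maximum suppression over the local neighbourhood
localmax = strength == imdilate(strength, ones(window_size));
[ys, xs] = find(strength > threshold & localmax);
corners = [xs, ys];                          % [x y] coordinates of detected corners
end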
Q.5(a) Explain radial distortion in camera calibration. [03 marks]
Radial distortion is a type of geometric distortion that occurs in images captured by cameras with real lenses. It is caused mainly by the shape of the lens, which bends light rays more strongly near its edges than at its centre, and it results in straight lines appearing curved in the image (barrel or pincushion distortion).
Radial distortion is typically modeled as a function of the distance  
between the image point and the center of the image. It can be  
either positive or negative, depending on the direction of the  
distortion, and it can be described by a set of parameters that  
characterize the degree and form of the distortion.  
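As a concrete sketch, one common polynomial model (not spelled out in the answer above, so the notation here is an assumption) writes the distorted coordinates (x_d, y_d) of an ideal point (x, y) as:

x_d = x * (1 + k1*r^2 + k2*r^4 + k3*r^6)
y_d = y * (1 + k1*r^2 + k2*r^4 + k3*r^6)

where r^2 = x^2 + y^2 is the squared distance of the point from the distortion centre (usually the principal point), and k1, k2, k3 are the radial distortion coefficients estimated during calibration.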
Radial distortion can be corrected by calibrating the camera, which  
involves estimating the distortion parameters and applying a  
correction to the image points. Camera calibration is typically  
performed using a set of known 3D world points and their  
corresponding 2D image points, which are used to estimate the  
intrinsic and extrinsic parameters of the camera.  
Once the camera has been calibrated, the distortion parameters can  
be used to correct the image points and remove the distortion from  
the image. This can be done using a distortion model, such as the  
Brown or Kannala-Brandt model, which describes the relationship  
between the distorted and undistorted image points.  
Radial distortion is an important consideration in camera calibration  
and image processing, as it can affect the accuracy and quality of the  
images and the measurements made from them. It is particularly  
relevant for applications that require high accuracy or precision, such  
as machine vision, robotics, or surveying, where even small amounts  
of distortion can have significant consequences. Correcting for radial  
distortion is therefore an important step in many image processing  
pipelines, and it can improve the performance and reliability of the  
system.  
(b) What is camera calibration? Explain pinhole camera models in detail. [04 marks]
Camera calibration is the process of estimating the intrinsic and  
extrinsic parameters of a camera, which are used to transform 3D  
world points into 2D image points. It is an important step in  
computer vision and image processing, as it allows the camera to be  
modeled and its distortion to be corrected, which can improve the  
accuracy and quality of the images and the measurements made  
from them.  
The pinhole camera model is a simple and widely used model of a  
camera that is based on the idea of a pinhole aperture through  
which light enters the camera and forms an image on the sensor.  
The pinhole camera model is characterized by a set of intrinsic  
parameters that describe the internal properties of the camera, such  
as the focal length, principal point, and distortion, and a set of  
extrinsic parameters that describe the position and orientation of the  
camera in the world.  
The intrinsic parameters of the pinhole camera model can be  
estimated from a set of known 3D world points and their  
corresponding 2D image points, using techniques such as least  
squares or bundle adjustment. The extrinsic parameters can be  
estimated using additional information about the position and  
orientation of the camera, such as GPS or inertial measurements.  
Once the intrinsic and extrinsic parameters have been estimated,  
they can be used to transform 3D world points into 2D image points  
using the pinhole camera model, and vice versa. This can be done  
using a projection matrix, which encodes the intrinsic and extrinsic  
parameters of the camera, and a transformation matrix, which  
represents the orientation and position of the camera in the world.  
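As a brief sketch of this relationship, using standard notation that the answer does not define explicitly, a 3D point (X, Y, Z) in the camera coordinate frame projects to the image point (x, y) as:

x = f * X / Z + cx
y = f * Y / Z + cy

where f is the focal length and (cx, cy) is the principal point. Combining the intrinsic matrix K with the extrinsic rotation R and translation t gives the full projection s * [x, y, 1]^T = K * [R | t] * [X, Y, Z, 1]^T, where s is an arbitrary scale factor.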
The pinhole camera model is a simple and effective model of a camera, and it has been widely used in computer vision and image processing.
(c) What is object recognition? Explain different components of an object recognition system in detail. [07 marks]
Object recognition is the process of identifying and classifying  
objects in an image or video stream. It is an important and widely  
studied problem in computer vision and image processing, and it  
has many applications, such as robotics, surveillance, and  
augmented reality.  
An object recognition system typically consists of the following  
components:  
1. Feature extraction: This component is responsible for  
extracting features or characteristics from the image or video  
that can be used to represent the objects. These features may  
include edges, corners, textures, or colors, and they may be  
extracted using techniques such as edge detection, corner  
detection, or texture analysis.  
2. Feature matching: This component is responsible for  
comparing the extracted features to a database of known  
objects or prototypes, and for determining the similarity or  
match between them. This may involve calculating the  
distance or similarity between the features using a metric  
such as the Euclidean distance or the cosine similarity.  
3. Classification: This component is responsible for classifying  
the objects based on their features and their similarity to the  
known objects or prototypes. This may involve using a  
classification algorithm, such as k-nearest neighbors or  
support vector machines, to assign the objects to predefined  
classes or categories.  
4. Tracking: This component is responsible for tracking the  
objects as they move or change in the image or video stream.  
This may involve using techniques such as object tracking or  
visual odometry to estimate the motion of the objects and to  
maintain their identity over time.  
Object recognition systems can be implemented using a variety of  
techniques and algorithms, depending on the specific requirements  
of the application and the characteristics of the objects. They may  
also involve additional components or modules, such as  
preprocessing or postprocessing, to improve the performance or  
robustness of the system.  
OR  
Q.5(a) Which approaches are used for appearance based methods in object recognition? Explain them in brief. [03 marks]
Appearance-based methods in object recognition are methods that  
rely on the visual appearance of the objects in the image to  
recognize and classify them. They are based on the idea of  
extracting features or characteristics from the image that can be  
used to represent the objects and to compare them to a database of  
known objects or prototypes.  
Here are some approaches that are commonly used for appearance-  
based object recognition:  
1. Feature-based methods: These methods involve extracting  
features from the image, such as edges, corners, textures, or  
colors, and using them to represent the objects. The features  
may be extracted using techniques such as edge detection,  
corner detection, or texture analysis, and they may be  
matched to the known objects or prototypes using a distance  
or similarity metric.  
2. Template matching: This approach involves comparing the  
image to a set of predefined templates or models of the  
objects, and determining the best match based on some  
criterion, such as the minimum distance or maximum  
similarity. Template matching can be performed using  
techniques such as correlation or convolution, and it is often  
used for simple or small objects with distinctive features.  
3. Bag of words: This approach involves representing the image  
as a histogram of features, or a bag of words, and comparing  
it to a set of known bags of words using a distance or  
similarity metric. The bag of words model is based on the idea  
of quantizing the features into a fixed set of clusters or  
categories, and it is often used for large or complex objects  
with many features.  
4. Deep learning: This approach involves using deep neural networks to
learn the features and characteristics of the objects from a set of  
training examples. Deep learning methods have achieved state-of-  
the-art results on many object recognition tasks, and they are widely  
used in a variety of applications. They are particularly useful for  
handling large or complex objects with many features, and for  
learning to recognize objects in real-world images, which may be  
noisy or contain clutter.  
Appearance-based methods are popular for object recognition  
because they are simple and fast, and they can be implemented  
using a variety of techniques and algorithms. However, they can be  
sensitive to changes in the appearance of the objects, such as  
illumination, pose, or scale, and they may not be robust to noise or  
variations in the image. They may also require a large database of  
known objects or prototypes, and they may not be able to handle  
objects that have never been seen before.  
To address these limitations, object recognition systems may also  
use other types of information, such as shape, context, or motion, to  
improve the performance and robustness of the system. These  
methods are known as shape-based, context-based, or motion-  
based methods, respectively, and they may be combined with  
appearance-based methods to form a more robust and flexible  
object recognition system.  
(b) Explain Kalman filtering in motion tracking. [04 marks]
Kalman filtering is a method for estimating the state of a system  
over time based on a sequence of noisy measurements. It is a widely  
used technique in the field of control engineering and has also been  
applied to many problems in computer vision and image processing,  
including motion tracking.  
In motion tracking, Kalman filtering can be used to estimate the  
position and velocity of an object in an image or video stream based  
on a series of noisy or incomplete measurements. The Kalman filter  
consists of two main components: a prediction step and an update  
step.  
In the prediction step, the Kalman filter uses the previous state  
estimate and the motion model of the object to predict its current  
state. The motion model may be based on simple kinematic  
equations, such as constant velocity or acceleration, or it may be  
more complex, such as a differential equation or a neural network.  
In the update step, the Kalman filter uses the current measurement  
to correct or update the predicted state estimate. It does this by  
computing the error between the measurement and the prediction,  
and by adjusting the state estimate based on the error and the  
uncertainty of the measurement.  
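A minimal MATLAB sketch of these two steps for a constant-velocity model (all matrices, noise values, and measurements below are assumed for illustration only):

% Constant-velocity Kalman filter tracking a 1-D position
dt = 1;                            % time step
A = [1 dt; 0 1];                   % state transition for [position; velocity]
H = [1 0];                         % we measure position only
Q = 0.01 * eye(2);                 % process noise covariance (assumed)
R = 0.5;                           % measurement noise variance (assumed)

x = [0; 0];                        % initial state estimate
P = eye(2);                        % initial estimate covariance

z = [1.1 2.0 2.9 4.2 5.1];         % example noisy position measurements
for k = 1:numel(z)
    % Prediction step
    x = A * x;
    P = A * P * A' + Q;

    % Update step
    Kg = P * H' / (H * P * H' + R);    % Kalman gain
    x = x + Kg * (z(k) - H * x);       % correct with the measurement residual
    P = (eye(2) - Kg * H) * P;
    fprintf('step %d: position %.2f, velocity %.2f\n', k, x(1), x(2));
end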
The Kalman filter is a powerful and flexible tool for motion tracking: it is recursive and computationally cheap, which makes it well suited to real-time applications, and it can incorporate additional information or constraints, such as the shape or appearance of the object, to improve the accuracy and robustness of the tracking. However, it requires a good model of the motion and the measurement process, and it may be sensitive to initialization or outliers. The basic filter also assumes linear dynamics and Gaussian noise; non-linear or non-Gaussian problems require extensions such as the extended Kalman filter, the unscented Kalman filter, or particle filters.
(c) List the types of noise. Consider that image is corrupted by Gaussian noise. Suggest suitable method to minimize Gaussian noise from the image and explain that method. [07 marks]
There are several types of noise that can corrupt an image, including  
Gaussian noise, salt and pepper noise, speckle noise, and impulse  
noise. Each type of noise has different properties and characteristics,  
and different methods may be needed to minimize or remove it  
from the image.  
Gaussian noise is a type of noise that is characterized by a normal or  
Gaussian distribution of intensity values. It is often caused by  
electronic or thermal noise, and it can be modeled as a random  
variable with a mean and a variance. Gaussian noise is commonly  
encountered in images and can be difficult to remove, as it is  
distributed throughout the image and may be correlated with the  
underlying signal.  
To minimize Gaussian noise from an image, one approach is to use a  
smoothing or denoising filter. A smoothing filter is a linear or non-  
linear operation that averages or blends the intensity values of the  
pixels in the image to reduce the noise. There are many different  
types of smoothing filters, such as the mean filter, the median filter,  
or the Gaussian filter, which can be used depending on the specific  
requirements of the application.  
The mean filter is a simple and fast smoothing filter that replaces the intensity of each pixel with the average intensity of its neighbors. It is effective at reducing Gaussian noise, but it also blurs edges and removes fine details.
The median filter is a non-linear smoothing filter that replaces the  
intensity of each pixel with the median intensity of its neighbors. It is  
more robust to outliers and salt and pepper noise, but it may be  
slower and more complex to implement.  
The Gaussian filter is a linear smoothing filter that convolves the  
image with a Gaussian kernel. It is effective at removing Gaussian  
noise and preserving the edges and details of the image, but it may  
be sensitive to the size and shape of the kernel, and it may blur the  
image more than other filters.  
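A minimal MATLAB sketch of reducing Gaussian noise with a Gaussian smoothing filter (the noise variance 0.01, sigma = 1, and 'image.jpg' are assumed values):

% Add synthetic Gaussian noise and remove it with a Gaussian filter
I = im2double(rgb2gray(imread('image.jpg')));
I_noisy = imnoise(I, 'gaussian', 0, 0.01);   % zero-mean Gaussian noise, variance 0.01

I_denoised = imgaussfilt(I_noisy, 1);        % Gaussian smoothing, sigma = 1

imshowpair(I_noisy, I_denoised, 'montage');  % compare noisy and denoised images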
To choose the appropriate smoothing filter for a given image, it is  
important to consider the properties of the noise and the desired  
trade-off between noise reduction and image quality. In general, it is  
best to use a filter that is matched to the type of noise present in the  
image, as it will be more effective at removing the noise and  
preserving the image quality.  
In addition to smoothing filters, other methods may also be used to  
minimize Gaussian noise from an image, such as wavelet denoising,  
total variation denoising, or non-local means denoising. These  
methods are typically more complex and computationally intensive,  
but they may be more effective at removing noise and preserving  
image quality, particularly for high levels of noise or for images with  
complex structures or textures.  
*************  
Seat No.:  
Enrolment No.  
GUJARAT TECHNOLOGICAL UNIVERSITY  
BE - SEMESTER–VII (NEW) EXAMINATION – SUMMER 2022  
Subject Code:3171614  
Date:10/06/2022  
Subject Name:Computer Vision  
Time:02:30 PM TO 05:00 PM  
Total Marks: 70  
Q.1(a) What is Computer Vision? List any four applications of computer vision. [03 marks]
Computer vision is a field of artificial intelligence and computer  
science that aims to enable computers to interpret and understand  
visual data from the world around them, in the same way that human  
vision does. Some applications of computer vision include:  
1. Image and video analysis: Computer vision algorithms can be  
used to analyze images and videos to identify objects, people,  
and other features. This has a wide range of applications,  
including security and surveillance, medical image analysis, and  
autonomous vehicles.  
2. Augmented reality: Computer vision can be used to create  
augmented reality (AR) experiences, where digital information  
is overlaid onto the real world in real-time. This is used in  
applications such as AR games and smartphone apps that  
provide information about the world around you.  
3. Robotics: Computer vision is used to enable robots to see and  
understand their environment, which is important for tasks  
such as navigation, object recognition, and manipulation.  
4. Quality control: Computer vision can be used in manufacturing  
to inspect products for defects and ensure that they meet  
quality standards. This can be done using machine learning  
algorithms that have been trained to recognize defects in  
images of the products.  
(b) Describe two-dimensional convolution operation with the required equation. [04 marks]
In digital image processing, a convolution is a mathematical operation  
that is used to combine two sets of data to form a third set of data. In  
the case of a two-dimensional convolution, the operation is  
performed on a matrix (or image) to produce a new matrix as the  
result.  
The equation for a two-dimensional convolution is:  
(f * g)[m][n] = Σ_i Σ_j ( f[i][j] * g[m-i][n-j] )
where:  
f is the input matrix (or image)  
g is the convolution kernel (also called the filter or mask)  
m and n are the indices for the rows and columns of the output  
matrix  
i and j are the indices for the rows and columns of the input  
matrix and the kernel  
In this equation, the output matrix is formed by applying the kernel g  
to each "neighborhood" of values in the input matrix f, and summing  
the products of the corresponding entries. For example, if g is a 3x3  
kernel, then the output value at (m, n) in the output matrix would be  
the sum of the products of the values in the 3x3 neighborhood  
centered at (m, n) in the input matrix, with the corresponding values  
in the kernel.  
Convolution is a powerful technique that is widely used in image  
processing and computer vision for tasks such as image filtering,  
feature detection, and edge detection.  
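A small numerical illustration in MATLAB (the matrices are arbitrary values chosen only to show the mechanics of the equation):

% Two-dimensional convolution of a 3x3 matrix with a 2x2 kernel
f = [1 2 3; 4 5 6; 7 8 9];       % input matrix
g = [0 1; 1 0];                  % convolution kernel
y = conv2(f, g, 'full');         % full 2-D convolution, giving a 4x4 result
disp(y);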
(c) Describe digitization of the image with necessary figures. [07 marks]
In digital image processing, digitization is the process of converting a  
continuous image into a discrete digital representation. This is  
typically done by sampling the image at regular intervals and  
quantizing the samples, converting them into a digital format such as  
a binary image or a grayscale image.  
There are several factors that can affect the quality of the digitized  
image, including the resolution of the image, the bit depth of the  
samples, and the color space used to represent the image.  
Resolution refers to the number of pixels in the image, with higher  
resolutions resulting in more detailed images. Bit depth refers to the  
number of bits used to represent each sample, with higher bit depths  
allowing for more accurate representation of the original image. Color  
space refers to the range of colors that can be represented in the  
image, with some common examples including RGB (red, green, blue)  
and CMYK (cyan, magenta, yellow, black).  
Here is a simplified example of the digitization process:  
1. An image is captured by a camera or scanned from a physical  
photograph.  
2. The image is divided into a grid of pixels, with each pixel  
representing a sample of the original image.  
3. The intensity or color of each pixel is measured and quantized,  
converting it into a digital format such as a binary value or a  
grayscale value.  
4. The resulting digital image can be stored, edited, and displayed  
on a computer or other digital device.  
Q.2(a) Describe the pinhole imaging model in brief. [03 marks]
The pinhole imaging model is a simple model that describes how an  
image is formed by light passing through a small aperture, such as a  
pinhole or the aperture of a camera. According to this model, light  
rays from a single point on the object being imaged will pass through  
the aperture and form a focused image on the image plane.  
The position and size of the image on the image plane depends on  
the distance between the object and the aperture, as well as the size  
and shape of the aperture. In general, objects that are closer to the
aperture produce larger images, while objects that are further away
produce smaller images; the image size is inversely proportional to the
object's distance. Because the aperture is very small, the image remains
sharp over a wide range of depths (a smaller pinhole gives a sharper but
dimmer image).
The pinhole imaging model is often used as a starting point for more  
complex models of image formation, such as those used in optics and  
computer vision. It is also useful for understanding the basic principles  
of photography and the design of camera systems.  
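A small sketch of the ideal pinhole model (an illustrative assumption; image coordinates follow x = f·X/Z, y = f·Y/Z for an assumed focal length f):

    import numpy as np

    def pinhole_project(points_3d, f=1.0):
        # points_3d: (N, 3) array of (X, Y, Z) in camera coordinates, Z > 0
        X, Y, Z = points_3d[:, 0], points_3d[:, 1], points_3d[:, 2]
        return np.stack([f * X / Z, f * Y / Z], axis=1)   # perspective division

    pts = np.array([[0.2, 0.1, 2.0], [0.2, 0.1, 4.0]])    # same point, twice as far
    print(pinhole_project(pts))                           # projected image is half the size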
(b) Differentiate locally adaptive histogram equalization and block histogram equalization methods.
04
Here is a comparison of locally adaptive histogram equalization
(LAHE) and block histogram equalization (BHE) methods:
LAHE: The equalization mapping is computed from a local window
(neighbourhood) centred on each pixel, so every pixel effectively gets
its own transform. This gives fine-grained, localized control over the
contrast enhancement and helps avoid over- or under-enhancement in
particular areas of the image, but it is computationally expensive.
BHE: The image is divided into larger, non-overlapping blocks (tiles),
and one histogram equalization transform is applied to each block as a
whole. This is much faster, but it can produce blocking artifacts at the
tile boundaries and over- or under-enhancement within a block.
Both LAHE and BHE are methods for improving the contrast in  
images, by redistributing the intensity values of the pixels in the  
image. They can be useful for improving the visibility of features in  
images that are poorly contrasted or have low dynamic range.  
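For reference, OpenCV's CLAHE is a tile-based adaptive equalization in the same family as the methods described above; a minimal sketch (the file name is a placeholder):

    import cv2

    gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name

    # Global histogram equalization (one transform for the whole image)
    global_eq = cv2.equalizeHist(gray)

    # Contrast Limited Adaptive Histogram Equalization: one transform per 8x8 tile,
    # with clipping to limit noise amplification, and blending between tiles
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    local_eq = clahe.apply(gray)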
(c) What is a pixel? Discuss different pixel transformation methods with necessary equations.
07
A pixel is the smallest unit of a digital image that can be displayed or  
processed by a computer. It is usually represented as a small square  
or rectangle on a computer screen, and is made up of one or more  
color elements (such as red, green, and blue). The color and intensity  
of each pixel can be represented using a digital value, such as a binary  
value or a grayscale value.  
There are several methods for transforming pixels in an image,  
including:  
1. Scaling: Scaling involves changing the size of an image by
altering the number of pixels it contains. Using inverse mapping,
the equation for enlarging an image by a factor of S is:
I'[i][j] = I[i/S][j/S]
(with interpolation for non-integer coordinates); sampling
I'[i][j] = I[S·i][S·j] instead reduces the image by a factor of S.
2. Translation: Translation involves shifting the position of an
image by a certain number of pixels. The equation for
translating an image by Tx pixels in the x-direction and Ty
pixels in the y-direction is:
I'[i][j] = I[i-Tx][j-Ty]
3. Rotation: Rotation involves rotating an image around a certain  
point by a certain angle. The equation for rotating an image  
around the point (X0, Y0) by an angle θ is:  
I'[i][j] = I[X0 + (i-X0)*cos(θ) - (j-Y0)*sin(θ)][Y0 + (i-X0)*sin(θ) +  
(j-Y0)*cos(θ)]  
4. Flipping: Flipping involves reflecting an image across a certain  
axis. The equation for flipping an image horizontally (across the  
y-axis) is:  
I'[i][j] = I[i][M-j-1]  
where M is the number of columns in the image. The equation for  
flipping an image vertically (across the x-axis) is:  
I'[i][j] = I[N-i-1][j]  
where N is the number of rows in the image.  
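A small NumPy sketch of some of these transformations (illustrative only; integer shifts and flips, no interpolation):

    import numpy as np

    I = np.arange(25).reshape(5, 5)

    # Translation by (Tx, Ty) with zero padding: I'[i][j] = I[i - Tx][j - Ty]
    def translate(I, Tx, Ty):
        out = np.zeros_like(I)
        N, M = I.shape
        for i in range(N):
            for j in range(M):
                si, sj = i - Tx, j - Ty
                if 0 <= si < N and 0 <= sj < M:
                    out[i, j] = I[si, sj]
        return out

    flipped_h = I[:, ::-1]        # horizontal flip: I'[i][j] = I[i][M - j - 1]
    flipped_v = I[::-1, :]        # vertical flip:   I'[i][j] = I[N - i - 1][j]
    down2 = I[::2, ::2]           # scaling down by 2: I'[i][j] = I[2i][2j]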
OR
(c) What is the significance of Wiener filter in image processing? Discuss Wiener filter in detail.
07
The Wiener filter is a type of signal processing filter that is used to  
remove noise from signals, such as images or audio signals. It is based  
on the idea of Wiener deconvolution, which is a method for  
reconstructing a signal from a noisy version of the signal, by taking  
into account the known characteristics of the noise and the signal.  
The Wiener filter works by estimating the power spectral densities  
(PSDs) of the signal and the noise, and using these estimates to  
compute a filter that minimizes the mean squared error between the  
original signal and the filtered signal. The resulting filter is called a  
Wiener filter, and it can be expressed as:  
H(f) = S(f) / (N(f) + S(f))  
where:  
H(f) is the frequency response of the Wiener filter  
S(f) is the PSD of the original signal  
N(f) is the PSD of the noise  
The Wiener filter is often used in image processing to remove noise
from images, such as additive Gaussian noise. It adapts to the
signal-to-noise ratio at each frequency: where the signal dominates,
H(f) is close to 1 and the image content passes through almost
unchanged, while where the noise dominates, H(f) is close to 0 and the
noise is suppressed without degrading the image too much.
The Wiener filter can be implemented using various techniques, such  
as the Wiener-Hopf equation or the least squares method. It can also  
be extended to more complex scenarios, such as the case where the  
noise and signal are correlated or the case where the signal has a  
non-stationary PSD.  
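A minimal frequency-domain sketch of the Wiener denoising gain H(f) = S(f) / (S(f) + N(f)) (assumptions: additive white Gaussian noise with known variance, and the noisy image's periodogram used as a crude estimate of the signal PSD); scipy.signal.wiener offers a local spatial-domain variant.

    import numpy as np

    def wiener_denoise(noisy, noise_var):
        F = np.fft.fft2(noisy)
        power = np.abs(F) ** 2 / noisy.size                 # periodogram of the observed image
        signal_psd = np.maximum(power - noise_var, 1e-8)    # subtract the (flat) noise PSD
        H = signal_psd / (signal_psd + noise_var)           # Wiener gain per frequency
        return np.real(np.fft.ifft2(H * F))

    clean = np.ones((64, 64))
    noisy = clean + np.random.normal(0, 0.1, clean.shape)
    restored = wiener_denoise(noisy, noise_var=0.1 ** 2)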
Q.3(a) Discuss weak perspective projection in detail.  
03  
In computer vision and image processing, weak perspective projection  
is a model that describes how a three-dimensional scene is projected  
onto a two-dimensional image plane. This projection is typically done  
by a camera or other imaging device, such as a scanner or a satellite.  
The weak perspective projection model assumes that the depth variation
within the scene is small compared with its average distance from the
camera, so that the lines of sight to the scene points are approximately
parallel. Under this assumption, perspective projection reduces to an
orthographic projection followed by a uniform scaling: the relative shape
of the objects is preserved, and every point is scaled by the same factor
f/Z̄, where Z̄ is the average depth of the scene.
The weak perspective projection of a scene point (X', Y', Z') onto the
image point (x, y) can therefore be written as:
x = f·X'/Z̄ + c_x
y = f·Y'/Z̄ + c_y
where:
(X', Y', Z') are the coordinates of the point in the three-dimensional scene
f is the focal length of the camera (the distance between the
image plane and the optical centre of the camera)
Z̄ is the average depth of the scene
c_x and c_y are the coordinates of the principal point (the
centre of the image)
The weak perspective projection model can be used to reconstruct  
the three-dimensional structure of a scene from multiple images  
taken by a camera, or to perform other tasks such as image  
registration or stereo vision.  
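A tiny sketch of weak perspective projection under the assumptions above (a single scale factor s = f/Z̄; the focal length and principal point values are illustrative):

    import numpy as np

    def weak_perspective(points_3d, f=800.0, cx=320.0, cy=240.0):
        Z_bar = points_3d[:, 2].mean()        # average depth of the scene
        s = f / Z_bar                         # one common scale factor for all points
        x = s * points_3d[:, 0] + cx
        y = s * points_3d[:, 1] + cy
        return np.stack([x, y], axis=1)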
(b) What is the significance of morphological operation? Discuss  
erosion operation in detail.  
04  
Morphological operations are a set of image processing techniques
that modify the shape or form of objects in an image. They come from
mathematical morphology, in which an image is probed with a small
shape called a structuring element, and are typically applied to binary
images (images with only two intensity levels), although grayscale
variants also exist.
One common morphological operation is erosion, which is used to  
shrink or thin the objects in an image. The erosion operation works  
by "eroding" the pixels on the boundary of an object, replacing them  
with the background pixels if they meet certain criteria. This can be  
used to remove small features or noise from an image, or to  
separate touching objects.  
The erosion operation is typically performed using a structuring  
element, which is a small shape that is used to define the erosion  
process. The structuring element is placed at each pixel in the image,  
and the erosion operation is applied by comparing the pixel to the  
corresponding pixels in the structuring element. If the pixel meets  
the criteria defined by the structuring element, it is replaced with the  
background pixel.  
The equation for erosion with a structuring element B is:  
I'[i][j] = min{I[i+k][j+l]} for all (k,l) in B  
where:  
I is the input image  
I' is the output image  
B is the structuring element  
i and j are the indices for the rows and columns of the image  
k and l are the indices for the rows and columns of the  
structuring element  
Erosion is often used in combination with other morphological  
operations, such as dilation or opening, to perform tasks such as  
image segmentation or object recognition.  
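A minimal OpenCV sketch of binary erosion with a 3x3 structuring element (the file name is a placeholder):

    import cv2
    import numpy as np

    img = cv2.imread("binary_mask.png", cv2.IMREAD_GRAYSCALE)     # placeholder file name
    _, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)

    kernel = np.ones((3, 3), np.uint8)              # 3x3 square structuring element
    eroded = cv2.erode(binary, kernel, iterations=1)

    # Erosion followed by dilation (an "opening") removes small noise specks
    opened = cv2.dilate(eroded, kernel, iterations=1)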
(c) What is the use of SIFT feature in image processing? Explain SIFT feature in detail.
07
The SIFT (Scale-Invariant Feature Transform) feature is a method for  
extracting distinctive features from images that can be used for tasks  
such as image matching, object recognition, and 3D reconstruction.  
SIFT features are robust to image scale and rotation, and are invariant  
to image affine distortion and changes in 3D viewpoint.  
The SIFT feature is computed using a scale-space extrema detection  
algorithm, which is applied to the scale-space representation of the  
image. The scale-space representation is obtained by smoothing the  
image using a Gaussian kernel and increasing the standard deviation  
of the kernel at each scale. This results in a set of images that are  
increasingly smoothed and down-sampled, with each image  
corresponding to a different scale.  
The scale-space extrema detection algorithm searches for local  
extrema of the Difference of Gaussians (DoG) function, which is the  
difference between the scale-space images at adjacent scales. These  
extrema are considered to be potential SIFT features, and are then  
subjected to further refinement and selection to eliminate low-  
contrast and poorly localized features.  
The resulting SIFT features are represented as vectors of 128 floating-  
point numbers, which capture the local image gradient information at  
the scale and orientation of the feature. These vectors can be used to  
match features between images, or to build a vocabulary of features  
for use in tasks such as image classification or object recognition.  
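An illustrative OpenCV sketch of SIFT detection and matching (it assumes OpenCV 4.4 or later, where SIFT is exposed as cv2.SIFT_create; file names are placeholders):

    import cv2

    img1 = cv2.imread("scene1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
    img2 = cv2.imread("scene2.png", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Match descriptors and keep matches passing Lowe's ratio test
    bf = cv2.BFMatcher()
    matches = bf.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]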
OR  
Q.3(a) Discuss orthographic projection in detail.  
03  
Orthographic projection is a type of geometric projection that is used
to represent three-dimensional objects on a two-dimensional plane. It
is a parallel, non-perspective projection: the projection rays are
parallel to each other and perpendicular to the image plane, and the
depth coordinate is simply discarded. As a result, the projected size
of an object does not depend on its distance from the projection plane,
so there is no foreshortening with depth.
In orthographic projection, the objects in the scene are projected onto  
the image plane by drawing lines from the vertices of the objects to  
the image plane. The resulting image is a top-down or side-view of  
the objects, depending on the orientation of the projection plane.  
Orthographic projection can be classified into several types, based on
the orientation of the projection plane relative to the object:
1. Multiview projection: The projection plane is parallel to one of
the principal faces of the object, producing the familiar plan
(top view), front elevation, and side elevation views used in
engineering drawing.
2. Axonometric projection: The projection plane is inclined to all
three principal axes, so several faces are visible at once;
depending on the angles made with the axes it is called isometric,
dimetric, or trimetric projection.
(The related cavalier and cabinet projections keep the projection plane
parallel to a face but use projection rays that are not perpendicular
to it; strictly speaking these are oblique, not orthographic,
projections.)
Orthographic projection is often used in technical drawing, computer-  
aided design (CAD), and other fields where it is important to represent  
the objects in a scene with precise measurements and accurate  
proportions.  
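A minimal sketch: orthographic projection simply drops the depth coordinate, optionally after rotating the scene into the viewing direction (the rotation matrix R is an assumption):

    import numpy as np

    def orthographic_project(points_3d, R=np.eye(3)):
        cam = points_3d @ R.T          # rotate the scene into the camera frame
        return cam[:, :2]              # keep (X, Y), discard Z: size does not depend on depth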
(b) Discuss a Sobel operator to detect edges from the image.
04
The Sobel operator is a simple edge detection operator that is  
commonly used in image processing and computer vision. It is based  
on the idea of taking the gradient of the image, which is a measure of  
how the intensity of the image changes at each point. The gradient is  
calculated using a convolution operation, which involves taking the  
sum of the products of the image pixels and a set of weights (called  
the kernel or filter).  
The Sobel operator uses two 3x3 kernels, one approximating the gradient
in the horizontal (x) direction and one approximating the gradient in
the vertical (y) direction. The horizontal-gradient kernel Gx is
defined as:
-1   0  +1
-2   0  +2
-1   0  +1
and the vertical-gradient kernel Gy is defined as:
-1  -2  -1
 0   0   0
+1  +2  +1
To apply the Sobel operator to an image, the horizontal and vertical  
kernels are convolved with the image separately, resulting in two  
images: one representing the gradient in the x-direction and one  
representing the gradient in the y-direction. These two images can  
then be combined to form the final edge map of the image.  
The Sobel operator is a simple and effective method for detecting  
edges in images, and is widely used in various applications such as  
image enhancement, object recognition, and robotics.  
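A short OpenCV sketch of Sobel edge detection (the file name is a placeholder):

    import cv2
    import numpy as np

    gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)    # placeholder file name

    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)   # gradient in the x-direction
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)   # gradient in the y-direction

    magnitude = np.sqrt(gx ** 2 + gy ** 2)            # combined edge map
    edges = np.uint8(255 * magnitude / magnitude.max())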
(c) Discuss Harris corner detection method in detail.  
07  
The Harris corner detection method is a method for detecting corners  
in images, which are defined as points in an image with high spatial  
variation in all directions. Corners are often used as distinctive  
features in image processing and computer vision tasks, such as  
image matching, object recognition, and 3D reconstruction.  
The Harris corner detection method is based on measuring how the image
intensity changes when a small window around a point is shifted in
different directions. This is captured by the local autocorrelation
(second-moment) matrix:
M = ∑[i][j] w[i][j] [ Ix^2    Ix·Iy
                      Ix·Iy   Iy^2 ]
where:
w[i][j] is a weighting function (usually Gaussian) that gives more
weight to pixels closer to the centre of the window
Ix and Iy are the image gradients in the x and y directions at the
pixel (x+i, y+j)
From M, a cornerness (response) score is computed as:
R = det(M) - k·(trace(M))^2
where k is an empirical constant, typically 0.04-0.06. R is large and
positive at corners (both eigenvalues of M are large), negative along
edges, and small in flat regions. The response R is calculated for each
point in the image, and points whose response exceeds a threshold
(after non-maximum suppression) are taken as corners.
The Harris corner detection method is widely used because it is  
relatively simple to implement and is robust to noise and other image  
variations. However, it has some limitations, such as the sensitivity to  
the size and orientation of the window used to compute the  
cornerness measure, and the sensitivity to the choice of the weighting  
function.  
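An OpenCV sketch of Harris corner detection (the file name and the threshold value are illustrative):

    import cv2
    import numpy as np

    gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

    # blockSize: window size, ksize: Sobel aperture, k: Harris constant
    R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

    corners = np.argwhere(R > 0.01 * R.max())   # simple threshold on the response map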
Q.4(a) Discuss region splitting and region merging image segmentation method in brief.
03
Region splitting and region merging are two methods for image  
segmentation, which is the process of partitioning an image into  
regions or segments that correspond to different objects or features  
in the image.  
Region splitting is a top-down method for image segmentation, in  
which the image is initially divided into a set of regions, and then  
these regions are iteratively split into smaller regions based on some  
criterion, such as the intensity or color of the pixels. This process  
continues until the regions satisfy some stopping criterion, such as a  
minimum size or a maximum homogeneity.  
Region merging is a bottom-up method for image segmentation, in  
which the image is initially divided into a set of small regions or pixels,  
and then these regions are iteratively merged into larger regions  
based on some criterion, such as the similarity of the regions or the  
presence of an edge between them. This process continues until the  
regions satisfy some stopping criterion, such as a maximum size or a  
minimum homogeneity.  
Region splitting and region merging are often used in combination  
with other image segmentation methods, such as thresholding or  
edge detection, to improve the accuracy and efficiency of the  
segmentation process. They can be applied to various types of  
images, including grayscale, color, and texture images.  
(b) Explain graph-based segmentation in detail.
04  
Graph-based image segmentation is a method for partitioning an  
image into regions or segments, by constructing a graph  
representation of the image and applying graph theory algorithms to  
it. In this method, the pixels or superpixels in the image are treated as  
nodes in the graph, and the edges between the nodes represent the  
similarity or dissimilarity between the pixels. The graph is then  
partitioned into segments by applying graph partitioning algorithms,  
such as minimum cut or normalized cut.  
There are several steps involved in graph-based image segmentation:  
1. Preprocessing: This involves smoothing the image to reduce  
noise and reduce the number of nodes in the graph. This can  
be done using techniques such as Gaussian smoothing or  
bilateral filtering.  
2. Constructing the graph: This involves defining the nodes and  
edges of the graph based on the image pixels or superpixels.  
The edges can be defined based on various criteria, such as the  
intensity, color, or texture similarity of the pixels, or the  
presence of an edge between the pixels.  
3. Partitioning the graph: This involves applying a graph  
partitioning algorithm to the graph, to divide it into segments  
or regions. The algorithm may use various criteria, such as the  
size or shape of the segments, the strength of the edges  
between the segments, or the overall homogeneity of the  
segments.  
4. Refining the segments: This involves further refining the  
segments by applying techniques such as region merging or  
post-processing. This can help to improve the accuracy and  
smoothness of the segments.  
Graph-based image segmentation is a powerful method that can  
handle complex and varied images, and is often used in tasks such as  
image segmentation, object recognition, and image annotation.  
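One widely used graph-based method is the Felzenszwalb-Huttenlocher algorithm; a short sketch using its scikit-image implementation (an assumption; the file name and parameter values are illustrative, and an RGB image is assumed):

    from skimage import io, segmentation

    img = io.imread("input.png")                 # placeholder file name, assumed RGB

    # scale controls the preferred segment size, sigma the pre-smoothing,
    # min_size merges away very small components in post-processing
    labels = segmentation.felzenszwalb(img, scale=100, sigma=0.8, min_size=50)
    print(labels.max() + 1, "segments")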
(c) Describe feature-based motion field estimation technique in detail.
07
Feature-based motion field estimation is a technique for estimating  
the motion field of a scene from a sequence of images, by tracking a  
set of distinctive features in the images. The motion field is a map that  
describes the displacement of the pixels or features in the scene over  
time, and can be used to estimate the 3D structure and motion of the  
scene, or to stabilize the images or video.  
There are several steps involved in feature-based motion field  
estimation:  
1. Feature detection: This involves detecting a set of distinctive  
features in the images, such as corners, edges, or blobs. The  
features should be well-distributed in the images and should  
have good repeatability and discriminability. Common feature  
detection methods include Harris corner detector, SIFT, SURF,  
and ORB.  
2. Feature tracking: This involves tracking the detected features  
from one image to the next, by finding the corresponding  
features in the adjacent images. The tracking can be done  
using methods such as the Lucas-Kanade algorithm, the  
Kanade-Lucas-Tomasi (KLT) tracker, or the Pyramid Lucas-  
Kanade (PLK) tracker.  
3. Motion field estimation: This involves estimating the motion  
field from the tracked features, by fitting a motion model to  
the feature displacement. The motion model can be a simple  
model such as a constant velocity model, or a more complex  
model such as a homography or a fundamental matrix.  
4. Refinement and validation: This involves refining the motion  
field estimates by applying techniques such as outlier rejection  
or model fitting, and validating the estimates by comparing  
them with other sources of information, such as the image  
intensity or the scene geometry.  
Feature-based motion field estimation is a widely used technique in  
various applications such as video stabilization, object tracking, and  
3D reconstruction. It is particularly useful for scenes with complex  
motion or texture, where other methods such as optical flow may fail.  
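A compact OpenCV sketch of this pipeline (Shi-Tomasi corners tracked with pyramidal Lucas-Kanade between two frames; file names are placeholders):

    import cv2
    import numpy as np

    prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
    curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

    # 1. Feature detection (Shi-Tomasi "good features to track")
    p0 = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=7)

    # 2. Feature tracking with pyramidal Lucas-Kanade
    p1, status, err = cv2.calcOpticalFlowPyrLK(prev, curr, p0, None)

    # 3. The sparse motion field: displacement of each successfully tracked feature
    good_old = p0[status.flatten() == 1].reshape(-1, 2)
    good_new = p1[status.flatten() == 1].reshape(-1, 2)
    displacements = good_new - good_old

    # 4. A simple global motion model fitted to the tracks (homography with RANSAC)
    H, inliers = cv2.findHomography(good_old, good_new, cv2.RANSAC, 3.0)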
OR  
Q.4(a) Describe watershed segmentation method in brief.
03
Watershed segmentation is a method for image segmentation, which  
is the process of partitioning an image into distinct regions or  
segments that correspond to different objects or features in the  
image. The watershed segmentation method is based on the concept  
of the "watershed transform," which is a mathematical transformation  
that is used to identify the catchment basins or "watersheds" in a  
grayscale image.  
The watershed segmentation method consists of the following steps:  
1. Preprocessing: This involves preprocessing the image to reduce  
noise and enhance the contrast between the objects and the  
background. This can be done using techniques such as  
smoothing, histogram equalization, or gradient enhancement.  
2. Marker-based segmentation: This involves identifying the  
objects or features in the image by selecting points or regions  
in the image as "markers" and propagating them through the  
image using a watershed transform. The markers can be  
selected manually or automatically using techniques such as  
thresholding or region growing.  
3. Watershed transform: This involves applying the watershed  
transform to the image, which consists of two steps: (a)  
creating a topographic map of the image by applying a  
gradient operator, and (b) identifying the catchment basins or  
watersheds in the map by treating the markers as "seeds" and  
growing the watersheds from them.  
4. Segmentation: This involves partitioning the image into  
segments or regions based on the watersheds identified in the  
previous step. The segments are typically labeled with different  
colors or intensities to distinguish them from each other.  
Watershed segmentation is a useful method for image segmentation,  
particularly for images with uneven or noisy backgrounds, or for  
images with overlapping or touching objects. It is often used in tasks  
such as object recognition, image analysis, and medical imaging.  
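A condensed OpenCV sketch of marker-based watershed segmentation, in which thresholding and distance-transform peaks provide automatic markers (the file name and thresholds are placeholders):

    import cv2
    import numpy as np

    img = cv2.imread("cells.png")                            # placeholder file name
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Preprocessing + thresholding to separate foreground from background
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Markers: sure-foreground regions from distance-transform peaks, labelled individually
    dist = cv2.distanceTransform(binary, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
    n_labels, markers = cv2.connectedComponents(np.uint8(sure_fg))
    markers = markers + 1                                    # keep label 0 free for "unknown"
    markers[(binary > 0) & (sure_fg == 0)] = 0               # region still to be flooded

    # Watershed transform: basins grow from the markers; boundary pixels are labelled -1
    markers = cv2.watershed(img, markers)
    img[markers == -1] = (0, 0, 255)                         # draw watershed lines in red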
(b) Discuss basics of the motion field of rigid objects with necessary equations.
04
The motion field of a rigid object is a map that describes the  
displacement of the points on the object over time, under the  
assumption that the object is rigid and does not deform. The motion  
field can be used to estimate the 3D structure and motion of the  
object, or to stabilize the images or video of the object.  
For a rigid object moving with translational velocity T and angular
velocity ω, every point P on the object has the 3D velocity
V = T + ω × P
and the motion field is obtained by projecting these point velocities
onto the image plane; because the object is rigid, the entire field is
determined by the six parameters of T and ω.
The motion of the object itself over time can be described by a set of
motion equations, which relate its position and orientation at
different times and can be derived using, for example, the laws of
motion, the Euler-Lagrange equations, or the Newton-Euler equations.
The Newton-Euler equations describe the motion in terms of the forces
and torques acting on the object and the object's mass and inertia, and
can be written as:
F = ma  
T = Iα  
where:  
F is the force vector acting on the object  
m is the mass of the object  
a is the acceleration vector of the object  
T is the torque vector acting on the object  
I is the inertia tensor of the object  
α is the angular acceleration vector of the object  
The Newton-Euler equations can be used to describe the motion of a  
rigid object in various scenarios, such as uniform motion, nonuniform  
motion, or rotational motion. They can also be used to predict the  
motion of the object based on the forces and torques acting on it.  
(c) Discuss snake method for image segmentation with the necessary  
equations.  
07  
The snake method is a method for image segmentation, which is the  
process of partitioning an image into distinct regions or segments  
that correspond to different objects or features in the image. The  
snake method is based on the idea of a "snake," which is a  
deformable curve that can be adjusted to fit the contours of an object  
in the image.  
The snake method consists of the following steps:  
1. Initialization: This involves selecting a set of points or vertices  
to define the initial shape of the snake, and placing the snake  
on the image. The initial shape of the snake can be a straight  
line, a circle, or an arbitrary curve, depending on the shape of  
the object to be segmented.  
2. Evolution: This involves iteratively adjusting the shape of the  
snake to fit the contours of the object, by minimizing an energy  
function that consists of two terms: an internal energy term  
that penalizes large deformations of the snake, and an external  
energy term that attracts the snake to the object contours. The  
energy function can be written as:  
E = ∑[i=1]^n (ωI[i]d[i]^2 + ωE[i]C[i])  
where:  
E is the energy of the snake  
n is the number of vertices in the snake  
ωI[i] is the weight of the internal energy term at vertex i  
d[i] is the distance between the position of vertex i and its  
neighboring vertices  
ωE[i] is the weight of the external energy term at vertex i  
C[i] is the external energy at vertex i, which is a measure of the  
distance between the vertex and the object contours  
3. Segmentation: This involves extracting the segment or region  
corresponding to the object from the image, using the final  
shape of the snake. The segment can be represented as a  
binary mask or a set of pixels, and can be used for tasks such  
as object recognition, image analysis, or medical imaging.  
The snake method is a useful method for image segmentation,  
particularly for images with irregular or curved objects, or for images  
with noise or clutter. It is often used in tasks such as object tracking,  
image registration, and 3D reconstruction.  
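A brief sketch using the active-contour (snake) implementation in scikit-image (an assumption; the circular initialization, the file name, and the energy weights are illustrative):

    import numpy as np
    from skimage import io, filters, segmentation

    img = io.imread("object.png", as_gray=True)              # placeholder file name

    # Initial snake: a circle placed roughly around the object
    s = np.linspace(0, 2 * np.pi, 200)
    init = np.column_stack([100 + 60 * np.sin(s), 100 + 60 * np.cos(s)])  # (row, col) pairs

    # alpha and beta weight the internal (elasticity, smoothness) energy;
    # the smoothed image supplies the external (edge attraction) energy
    snake = segmentation.active_contour(filters.gaussian(img, sigma=3),
                                        init, alpha=0.015, beta=10, gamma=0.001)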
Q.5(a) Describe intrinsic parameters of camera calibration in brief.
03
Camera calibration is the process of estimating the intrinsic  
parameters of a camera, which are the parameters that describe the  
internal characteristics of the camera, such as the focal length,  
principal point, and image distortion. These parameters are necessary  
for many computer vision and image processing tasks, such as 3D  
reconstruction, object tracking, and image rectification.  
The intrinsic parameters of a camera can be represented by a 3x3  
matrix called the intrinsic matrix, which has the following form:  
| fx   0   cx |
|  0   fy  cy |
|  0   0    1 |
where:  
fx and fy are the focal lengths of the camera in the x and y  
directions, respectively  
cx and cy are the coordinates of the principal point of the  
camera, which is the point where the optical axis of the camera  
intersects the image plane  
The intrinsic matrix can be estimated from a set of images of a known  
calibration pattern, such as a checkerboard or a circular grid. The  
calibration pattern should be imaged from different viewpoints and  
under different lighting conditions, to ensure that the intrinsic  
parameters are accurately estimated.  
To estimate the intrinsic parameters, the following steps are typically  
followed:  
1. Detection: This involves detecting the calibration pattern in the  
images, by extracting the corners or features of the pattern and  
matching them to a reference model of the pattern.  
2. Correspondence: This involves pairing each detected corner or
feature with its known 3D coordinates, which are given by the
geometry of the calibration pattern (for example, the square size
of the checkerboard).
3. Estimation: This involves fitting a projection model to the
2D-3D correspondences, typically by minimizing the reprojection
error, which yields the intrinsic matrix (and, in practice, the
lens distortion coefficients as well).
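A condensed OpenCV sketch of this procedure for a 9x6 checkerboard (the pattern size, square size, and image file names are placeholders):

    import cv2
    import numpy as np
    import glob

    pattern = (9, 6)                                   # inner corners of the checkerboard
    objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)   # known 3D grid (Z = 0)

    objpoints, imgpoints = [], []
    for fname in glob.glob("calib_*.png"):             # placeholder file names
        gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
        found, corners = cv2.findChessboardCorners(gray, pattern)
        if found:
            objpoints.append(objp)                     # 3D points from the pattern geometry
            imgpoints.append(corners)                  # detected 2D corners

    # Estimation: fit the projection model, returning the intrinsic matrix K
    ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        objpoints, imgpoints, gray.shape[::-1], None, None)
    print(K)   # [[fx 0 cx], [0 fy cy], [0 0 1]]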
(b) Discuss the role of image eigenspaces in object identification.
04
Eigenspaces are mathematical constructs that are used to  
represent images in a compact and informative manner. In the  
context of object identification, eigenspaces can be used to  
represent the appearance or shape of objects, and can be used  
to compare and classify different objects based on their visual  
similarity.  
There are several methods for constructing eigenspaces for  
image representation and object identification, such as  
principal component analysis (PCA) and linear discriminant  
analysis (LDA). These methods involve calculating the  
eigenvectors and eigenvalues of the image data, and projecting  
the images onto a lower-dimensional space spanned by the  
eigenvectors. The resulting eigenspace representation of the  
images is more compact and discriminative than the original  
pixel representation, and can be used for tasks such as object  
recognition, image classification, and image retrieval.  
For example, in PCA, the eigenspace is constructed by  
calculating the eigenvectors of the image covariance matrix,  
and the eigenspace representation of an image is obtained by  
projecting the image onto the eigenvectors. In LDA, the  
eigenspace is constructed by maximizing the separation  
between the different classes or categories of objects, and the  
eigenspace representation of an image is obtained by  
projecting the image onto the eigenvectors that correspond to  
the largest class separation.  
Eigenspaces have several advantages for object identification,  
such as robustness to noise, illumination, and pose variations,  
and computational efficiency. However, they may not be  
suitable for all types of objects and images, and may require  
large amounts of training data to accurately represent the  
objects.  
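A compact NumPy sketch of building a PCA eigenspace ("eigenimages") from a set of training images and projecting a query image into it (illustrative; X is assumed to hold flattened, equally sized images):

    import numpy as np

    # X: one flattened training image per row, shape (num_images, num_pixels)
    X = np.random.rand(100, 64 * 64)                  # stand-in for real training data

    mean = X.mean(axis=0)
    Xc = X - mean                                     # centre the data

    # Eigenvectors of the covariance matrix via SVD (rows of Vt = principal components)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 20
    eigenspace = Vt[:k]                               # top-k "eigenimages"

    def project(img_vec):
        return (img_vec - mean) @ eigenspace.T        # k-dimensional representation

    # Nearest-neighbour identification in the eigenspace
    query = project(X[0])
    gallery = Xc @ eigenspace.T
    best_match = np.argmin(np.linalg.norm(gallery - query, axis=1))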
(c) Discuss the Kalman filter for motion tracking in detail.
07
The Kalman filter is a mathematical algorithm that is used to  
estimate the state of a system based on a series of noisy  
measurements. In the context of motion tracking, the Kalman  
filter can be used to estimate the motion of an object based on  
a series of noisy or incomplete observations of the object, such  
as its position, velocity, or acceleration.  
The Kalman filter works by combining two sources of  
information: a prediction of the object's motion based on its  
previous state, and an update of the object's state based on the  
current observations. The prediction and update are performed  
iteratively, at each time step, to produce an optimal estimate of  
the object's state.  
The Kalman filter consists of the following steps:  
1. Prediction: This involves predicting the object's state at  
the current time step based on its state at the previous  
time step and a motion model that describes the  
object's dynamics. The prediction is given by:
x̂[k|k-1] = A[k-1] x[k-1] + B[k-1] u[k-1]
where:
x̂[k|k-1] is the predicted state of the object at time k
x[k-1] is the state of the object at time k-1
u[k-1] is the control input to the object at time k-1 (e.g., acceleration)
A[k-1] is the state transition matrix that describes the object's motion
B[k-1] is the control input matrix that describes the effect of the
control input on the object's motion
2. Update: This involves updating the prediction with the  
current observations of the object, to produce an  
optimal estimate of the object's state. The update is  
given by:
x̂[k] = x̂[k|k-1] + K[k](z[k] - H[k] x̂[k|k-1])
where:
x̂[k] is the optimal estimate of the object's state at time k
z[k] is the current observation of the object at time k  
H[k] is the observation matrix that maps the object's  
state to the observation  
K[k] is the Kalman gain, which is a weighting factor that  
balances the prediction and the observation  
The Kalman filter can be applied to various types of motion  
tracking problems, such as linear or nonlinear motion, single or  
multiple object tracking, and stationary or moving cameras. It is  
widely used in applications such as robot navigation,  
surveillance, and autonomous systems.  
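A minimal NumPy sketch of one predict/update cycle for tracking a 1-D position with a constant-velocity state (all matrices and noise values are illustrative assumptions):

    import numpy as np

    dt = 1.0
    A = np.array([[1, dt], [0, 1]])      # state transition for state [position, velocity]
    H = np.array([[1, 0]])               # we observe position only
    Q = 0.01 * np.eye(2)                 # process noise covariance
    R = np.array([[0.5]])                # measurement noise covariance

    x = np.array([[0.0], [1.0]])         # initial state estimate
    P = np.eye(2)                        # initial state covariance

    def kalman_step(x, P, z):
        # Prediction
        x_pred = A @ x
        P_pred = A @ P @ A.T + Q
        # Update
        K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)   # Kalman gain
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (np.eye(2) - K @ H) @ P_pred
        return x_new, P_new

    for z in [1.1, 2.0, 2.9, 4.2]:                  # noisy position measurements
        x, P = kalman_step(x, P, np.array([[z]]))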
OR  
Q.5 (a) Discuss optical flow in brief.  
03
Optical flow is a method for estimating the motion of objects in
an image or video, by analyzing the displacement of pixels or  
features between successive frames. It rests on the brightness
constancy assumption, I(x, y, t) = I(x+u, y+v, t+1), which leads to the
optical flow constraint equation Ix·u + Iy·v + It = 0 relating the flow
(u, v) to the image gradients. Optical flow is a key component of many
computer vision and image processing tasks, such as object tracking,
action recognition, and scene flow estimation. It can be estimated
densely for every pixel (e.g. Horn-Schunck or Farneback) or sparsely at
selected features; sparse optical flow is typically estimated using the
following steps:
1. Feature detection: This involves detecting a set of  
distinctive features in the images, such as corners,  
edges, or blobs. The features should be well-distributed  
in the images and should have good repeatability and  
discriminability. Common feature detection methods  
include Harris corner detector, SIFT, SURF, and ORB.  
2. Feature tracking: This involves tracking the detected  
features from one frame to the next, by finding the  
corresponding features in the adjacent frames. The  
tracking can be done using methods such as the Lucas-  
Kanade algorithm, the Kanade-Lucas-Tomasi (KLT)  
tracker, or the Pyramid Lucas-Kanade (PLK) tracker.  
3. Motion estimation: This involves estimating the motion  
of the features based on their displacement between the  
frames. The motion can be described using various  
models, such as a constant velocity model, a  
homography, or a fundamental matrix.  
4. Refinement and validation: This involves refining the  
motion estimates by applying techniques such as outlier  
rejection or model fitting, and validating the estimates  
by comparing them with other sources of information,  
such as the image intensity or the scene geometry.  
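For a dense flow field, OpenCV's Farneback method is a common choice; a short sketch (file names are placeholders, and the parameter values are the commonly quoted defaults):

    import cv2

    prev = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
    curr = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

    # Dense optical flow: one (u, v) vector per pixel.
    # Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])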
(b) Describe linear dynamics model for constant velocity and constant acceleration of motion tracking.
04
The linear dynamics model is a mathematical model that  
describes the motion of an object in terms of its position and  
velocity, under the assumption that the motion is linear and the  
acceleration is constant. The linear dynamics model is often  
used in motion tracking applications, such as object tracking,  
surveillance, and robot navigation.  
There are two versions of the linear dynamics model: the  
constant velocity model and the constant acceleration model.  
The constant velocity model assumes that the velocity of the  
object is constant over time, and is given by:  
x[k] = x[k-1] + v[k-1]T  
where:  
x[k] is the position of the object at time k  
x[k-1] is the position of the object at time k-1  
v[k-1] is the velocity of the object at time k-1  
T is the time interval between time k-1 and k  
The constant acceleration model assumes that the acceleration
of the object is constant over time, and is given by:
x[k] = x[k-1] + v[k-1]T + (1/2)a[k-1]T^2
v[k] = v[k-1] + a[k-1]T
where:
a[k-1] is the acceleration of the object at time k-1
The linear dynamics model can be used to estimate the motion  
of an object based on a series of noisy or incomplete  
observations of the object, by fitting the model to the  
observations using techniques such as least squares or  
maximum likelihood.  
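A tiny sketch of both models written in state-space form, as they would be used inside a tracker (the matrices and state values are illustrative):

    import numpy as np

    dt = 1.0

    # Constant velocity: state [x, v]
    A_cv = np.array([[1, dt],
                     [0, 1]])

    # Constant acceleration: state [x, v, a]
    A_ca = np.array([[1, dt, 0.5 * dt ** 2],
                     [0, 1,  dt],
                     [0, 0,  1]])

    state = np.array([0.0, 2.0, 0.5])        # position, velocity, acceleration
    predicted = A_ca @ state                 # x[k] = x + vT + 0.5aT^2, v[k] = v + aT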
(c) Discuss invariant-based object recognition algorithm in detail.
07
Invariant-based object recognition is a method for recognizing  
and identifying objects in images or videos, based on their  
invariant or stable features, such as shape, color, texture, or  
appearance. Invariant-based object recognition is a key  
component of many computer vision and image processing  
tasks, such as object classification, object tracking, and scene  
understanding.  
There are several steps involved in an invariant-based object  
recognition algorithm:  
1. Preprocessing: This involves preprocessing the images or  
videos to reduce noise, enhance contrast, and extract  
relevant features. Preprocessing can involve techniques  
such as smoothing, histogram equalization, edge  
detection, or feature extraction.  
2. Feature extraction: This involves extracting a set of  
distinctive and stable features from the images or  
videos, which are robust to variations in pose, scale,  
orientation, lighting, or background. Common feature  
extraction methods include SIFT, SURF, ORB, or GIST.  
3. Feature matching: This involves matching the extracted  
features between the images or videos, to identify  
correspondences between the objects. Feature matching  
can be done using techniques such as nearest neighbor  
matching, ratio test, or RANSAC.  
4. Object recognition: This involves recognizing the objects  
based on the matched features, by comparing them to a  
database of known objects or by applying classification  
or clustering algorithms.  
5. Validation: This involves validating the recognition  
results by applying additional constraints or criteria, such  
as context, geometry, or appearance, to ensure that the  
objects are correctly identified.  
Invariant-based object recognition algorithms are widely used  
in applications such as image retrieval, object tracking,  
surveillance, and augmented reality. They are often preferred  
over template-based or model-based methods, due to their  
robustness to variations in pose, scale, and appearance.  
However, they may require significant computation for feature
extraction and matching, and a sufficiently representative set of
reference features for each object.
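An illustrative OpenCV sketch of the feature-extraction and matching stages using ORB features with a brute-force Hamming matcher (file names, thresholds, and the simple recognition rule are assumptions):

    import cv2

    query = cv2.imread("object.png", cv2.IMREAD_GRAYSCALE)    # placeholder file names
    scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(query, None)
    kp2, des2 = orb.detectAndCompute(scene, None)

    # Brute-force matching with cross-checking; Hamming distance suits binary descriptors
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)

    # A simple recognition rule: enough good matches => the object is considered present
    recognized = len([m for m in matches if m.distance < 50]) > 20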
*************  