1 Introduction
- application of Features from Accelerated Segment Test (FAST) keypoints [7] with the adaptive PCK (Percentage of Considered Keypoints) threshold presented in [5], together with the Normalised Local NBNN distance measure presented in [5], in order to enhance the performance of the method described in [6] when applied to manuscript images (see Sect. 4, Steps 1 and 5, for details);
- application of the resulting learning-free method to an actual research question from the humanities about palm-leaf manuscripts (see Sect. 3 for details);
- providing state-of-the-art results on two extremely challenging datasets, namely the AMADI_LontarSet dataset [8] of handwriting on palm leaves for word-spotting and the DocExplore dataset [3] of medieval manuscripts for pattern detection, together with a performance analysis to facilitate later comparisons;
- developing an easy-to-use implementation of the proposed method and releasing it as a free software tool to the public.
2 Related work
3 Use case from manuscript research
3.1 Research question
3.2 The importance of learning-free automatic pattern detection
4 The proposed method
- Step 1: Since patterns in manuscript research are mostly the result of handmade marks on writing supports, features on the resulting contours can be detected efficiently using the FAST keypoint detector [7] with the adaptive threshold PCK (Percentage of Considered Keypoints), after converting the colour images to grey-scale images, as demonstrated in [4]; an example is shown in Fig. 4. A circular neighbourhood of 16 pixels is examined around every pixel p in the image, and p is classified as a keypoint if there are n contiguous pixels in the surrounding circle satisfying one of the following conditions:
  - \(\forall i \in n: I_i > I_p + t\), or
  - \(\forall i \in n: I_i < I_p - t\),

  where \(I_p\) is the intensity of the candidate pixel and \(I_i\) is the intensity of any pixel belonging to the n contiguous pixels in the neighbourhood. t is a threshold to be selected manually. n is set to 9 following the recommendation in [7], and t is set to zero so that all detected keypoints are initially considered before being filtered by strength using the PCK parameter as described below. The strength of a keypoint is the maximum value of t for which the segment test of that corner point is still satisfied, and PCK is the percentage of considered FAST keypoints with the highest strength values; see Fig. 4. The keypoints detected using the FAST algorithm are inherently dependent on image resolution because of the fixed size of the circular neighbourhood. The detection performance is expected to degrade gracefully as the scale difference between the queries and the pattern instances in the images increases; see the degradation analysis of FAST keypoints in [4]. Nevertheless, limited scale invariance can be obtained by generating additional scales for each query sample. The descriptors of the detected features are then calculated using the Scale-Invariant Feature Transform (SIFT) algorithm [24]. The relative location of each detected feature is stored as a scaled offset with respect to the spatial centre of the labelled pattern; the keypoint size can be used as a scaling factor when a multi-scale keypoint detection algorithm is used. Local features are detected and described in the test images following the same procedure as for the query images, but without storing any relative locations. A minimal implementation sketch of this step is given below.
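The following sketch illustrates how Step 1 could be realised with OpenCV. It is a minimal illustration rather than the released tool: the function names (`detect_and_describe`, `query_offsets`), the use of OpenCV's FAST and SIFT implementations, and the default PCK value of 10% are assumptions made for the example.

```python
# Illustrative sketch of Step 1 (not the authors' released implementation).
# Assumes OpenCV (cv2) with SIFT available; pck = 0.1 (10%) is an example value.
import cv2

def detect_and_describe(image_bgr, pck=0.1):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # FAST segment test with n = 9 (TYPE_9_16) and t = 0, so all candidate
    # corners are kept before filtering by strength.
    fast = cv2.FastFeatureDetector_create(
        threshold=0,
        nonmaxSuppression=True,
        type=cv2.FAST_FEATURE_DETECTOR_TYPE_9_16)
    keypoints = fast.detect(gray, None)

    # PCK filtering: keep only the strongest pck fraction of keypoints.
    # kp.response is the corner strength (maximum t satisfying the segment test).
    keypoints = sorted(keypoints, key=lambda kp: kp.response, reverse=True)
    keypoints = keypoints[:max(1, int(len(keypoints) * pck))]

    # Describe the retained keypoints with SIFT.
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.compute(gray, keypoints)
    return keypoints, descriptors

def query_offsets(keypoints, pattern_centre):
    # For a labelled query pattern: store each feature's offset from the pattern
    # centre (these offsets drive the voting in Step 5).
    cx, cy = pattern_centre
    return [(kp.pt[0] - cx, kp.pt[1] - cy) for kp in keypoints]
```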
- Step 2: When colour images are converted to grey-scale images, pixels within the red part of the spectrum tend to have very low intensity values. As a consequence, the local contrast in those regions is low compared with other spatial regions. Since the proposed method detects keypoints and extracts features from grey-scale images, its performance could be negatively affected if the query image contains red parts, so the keypoint selection has to be modified accordingly. This is particularly relevant when dealing with manuscript images, because colours within the red range frequently appear in handwriting, decorations and drawings. The modification is done as follows: first, the range of red colour is defined as a range of Hue values after converting the image from Red–Green–Blue (RGB) format to Hue–Saturation–Value (HSV) format. Then a mask is created to define the spatial location of red pixels in the image. Finally, the keypoints located within this spatial region are sorted separately: once the strongest ten per cent of all keypoints have been selected as described in Step 1, the strongest ten per cent of the keypoints located in the red regions are added. This allows keypoints detected in low-contrast red regions to be included among the considered keypoints (PCK); a minimal sketch of this red-region handling is given below.
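A possible implementation of the red-region handling is sketched below. The Hue, Saturation and Value limits are assumptions chosen for illustration (red wraps around the Hue axis in OpenCV's 0–179 representation), as are the helper names; the paper only prescribes that red pixels be masked and their keypoints ranked separately.

```python
# Illustrative sketch of Step 2 (assumed thresholds, not the released implementation).
import cv2

def red_mask(image_bgr):
    # Red is defined as a Hue range; it wraps around 0/179 in OpenCV's HSV encoding.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    low = cv2.inRange(hsv, (0, 60, 40), (10, 255, 255))
    high = cv2.inRange(hsv, (170, 60, 40), (179, 255, 255))
    return cv2.bitwise_or(low, high)

def strongest(keypoints, pck):
    if not keypoints:
        return []
    ordered = sorted(keypoints, key=lambda kp: kp.response, reverse=True)
    return ordered[:max(1, int(len(ordered) * pck))]

def select_keypoints_with_red_boost(image_bgr, keypoints, pck=0.1):
    mask = red_mask(image_bgr)
    red_kps = [kp for kp in keypoints
               if mask[int(round(kp.pt[1])), int(round(kp.pt[0]))] > 0]
    # Strongest pck of all keypoints (Step 1) plus strongest pck of red-region keypoints.
    selected = {id(kp): kp for kp in strongest(keypoints, pck) + strongest(red_kps, pck)}
    return list(selected.values())
```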
- Step 3: The performance of the object detection method presented in [6] is sensitive to the kernel radius R, which is a free parameter of the method. Therefore, we propose to calculate it automatically from the image dimensions of the labelled patterns. This parameter is the radius of the kernel that is convolved with the detection matrix in order to generate the final detections; see Eq. 8. In our approach, the kernel radius is calculated adaptively from the average of the medians of width and height over all the examples of a given labelled pattern (class) as follows:

  $$\begin{aligned} R_c = 0.1 \times \left[ \dfrac{\mathrm{Med}_c^w + \mathrm{Med}_c^h}{2}\right], \end{aligned}$$ (1)

  where \(R_c\) is the calculated parameter R for pattern (class) c, and \(\mathrm{Med}_c^w\) and \(\mathrm{Med}_c^h\) are the medians of the widths and heights, respectively, calculated from all the samples of a given labelled pattern (class) c, which are typically no more than a few samples, or even just one. The average of the two medians is multiplied by a fixed factor to obtain the final kernel radius. This factor has been set to 0.1 (10%) in all our experiments; other values were tested in our preliminary experiments with no significant difference in the overall performance, but the performance starts to drop once the factor exceeds 0.5. A short illustration of this calculation follows below.
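For example, a class whose labelled samples have a median width of 120 px and a median height of 60 px yields \(R_c = 0.1 \times (120 + 60)/2 = 9\) pixels. Eq. 1 amounts to the small helper below; the function name and the NumPy dependency are assumptions made for illustration.

```python
# Adaptive kernel radius of Eq. 1: 10% of the mean of the median width and
# median height of a class's labelled samples.
import numpy as np

def kernel_radius(widths, heights, factor=0.1):
    return factor * (np.median(widths) + np.median(heights)) / 2.0
```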
- Step 4: In [6], two detection matrices of the same size as the test image are used for each class. One matrix (\(M_c^v\)) accumulates the number of matched features for the corresponding class at a location calculated by Eq. 3. The other matrix (\(M_c^s\)) accumulates the distances calculated between the features in the test image and the labelled query. These two matrices are then combined, after being convolved with their corresponding kernels, into the final detection matrix (\(M_c\)) using the manually selected parameter \(\alpha\) as a weight:

  $$\begin{aligned} M_c = M_c^s * K_{\mathrm{mask}} + \alpha (M_c^v * K_{\mathrm{dist}}), \end{aligned}$$ (2)

  where \(K_{\mathrm{mask}}\) and \(K_{\mathrm{dist}}\) are the kernels to be convolved with the corresponding matrices. In this work, only one detection matrix per pattern is created for each test image instead of the two matrices used in [6]. Our preliminary experiments showed that the matrix \(M_c^v\) does not contribute to the performance of the method on the datasets of digitised manuscripts used here, yet it adds to the total computational cost. Only the matrix \(M_c^s\) from [6] is used, renamed \(M_c^{d_i}\). As a result, the parameter \(\alpha\) is eliminated, together with the computations associated with the second matrix. The detection matrix \(M_c^{d_i}\) has the same size as the corresponding test image.
- Step 5: One of the main contributions of the original NBNN algorithm [25] is measuring the image-to-class distance instead of the image-to-image distance, thereby generalising image matching to class matching. The image-to-class distance is obtained by calculating the overall distance of the image features to the features of all the images in a given class, rather than to the features of a single image (image-to-image distance). In this work, we measure the feature-to-class distance in order to estimate the distance of each detected feature in the test image to the class distributions estimated from their labelled features. Each detected feature in the test image votes for the centre of an expected pattern in the detection matrix; see Fig. 5. The position of this expected centre is calculated using the relative location of the nearest-neighbour feature in the corresponding labelled pattern as follows:

  $$\begin{aligned} L_{i,c} = L_f(d_i) - \mathrm{Offset}(\mathrm{NN}_c(d_i)), \end{aligned}$$ (3)

  where \(L_{i,c}\) is the location of the centre expected by feature \(d_i\) in the detection matrix of class c, \(L_f(d_i)\) is the location of feature \(d_i\) in the test image, and \(\mathrm{Offset}(\mathrm{NN}_c(d_i))\) is the scaled offset of the nearest-neighbour feature from the centre of the labelled pattern of the corresponding class. The example in Fig. 5 shows five detected features; each feature in the test image votes for the centre of an expected (labelled) pattern (class) using the relative offsets. Circles represent the detected features, dots indicate the expected centres, and colours associate each detected feature with its expected centre. The feature marked in pink has clearly been matched to the wrong feature in this example. Only detected features in the second part of the word are used in this example, and PCK is set to one per cent for better visibility. The value of the vote is the distance of the detected feature in the test image to the features of the corresponding class (labelled pattern), using the Normalised Local NBNN distance presented in [5], which accounts for the prior of each class, approximated by the number of features in each class:

  $$\begin{aligned} M^d(L_{i,c}) = M^d(L_{i,c}) + \mathrm{Dist}_{N}(d_i, c), \end{aligned}$$ (4)

  where \(M^d(L_{i,c})\) is the detection matrix of class c and \(\mathrm{Dist}_{N}(d_i, c)\) is the normalised distance between the detected feature \(d_i\) in the test image and class c, calculated as in [5]:

  $$\begin{aligned} \mathrm{Dist}_{N}(d_i, c) = \dfrac{\mathrm{Dist}(d_i, c)}{K_c}, \end{aligned}$$ (5)

  where \(K_c\) is the number of features from the labelled patterns in class c, and \(\mathrm{Dist}(d_i, c)\) is the Local NBNN distance [26], which has been reformulated in [5] as follows:

  $$\begin{aligned} \mathrm{Dist}(d_i, c) = \sum_{i=1}^{n} \Big( \Vert d_i - \phi(\mathrm{NN}_c(d_i)) \Vert^2 - \Vert d_i - \mathrm{N}_{k+1}(d_i) \Vert^2 \Big), \end{aligned}$$ (6)

  with

  $$\begin{aligned} \phi(\mathrm{NN}_c(d_i)) = \left\{ \begin{array}{ll} \mathrm{NN}_c(d_i) & \quad \text{if } \mathrm{NN}_c(d_i) \le \mathrm{N}_{k+1}(d_i) \\ \mathrm{N}_{k+1}(d_i) & \quad \text{if } \mathrm{NN}_c(d_i) > \mathrm{N}_{k+1}(d_i), \end{array} \right. \end{aligned}$$

  where \(\mathrm{N}_{k+1}(d_i)\) is the \((k+1)\)th nearest neighbour of \(d_i\). In a similar way to the work in [26], we use the distance to the \((k+1)\)th nearest neighbour (\(k=10\)) as a "background distance" to estimate the distances of classes that were not found among the k nearest neighbours. According to Eq. 6, the larger the value of \(\mathrm{Dist}(d_i, c)\), the closer class c is to feature \(d_i\), because \(\mathrm{Dist}(d_i, c)\) measures the distance between class c and the background (\(k+1\)) relative to \(d_i\). The matrix \(M^d(L_{i,c})\) is therefore initialised with zeros in order to allow for the detection of local maxima. Search indices are created for all the classes using the kd-tree implementation provided by FLANN (Fast Library for Approximate Nearest Neighbours) [27] for efficient nearest-neighbour search. An example of a detection matrix is shown in Fig. 6; the darkest spot clearly corresponds to the centre of the correct pattern annotated in part (a) of Fig. 6. The detection matrices are smoothed using a Gaussian filter whose kernel size is \(R_c \times R_c\), where \(R_c\) is the adaptive parameter calculated in Eq. 1. A sketch of this voting procedure is given below.
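The voting of Steps 4–5 can be sketched as follows. This is one possible reading of Eqs. 3–6 rather than the released code: the container `class_db` (mapping each class to the SIFT descriptors and centre offsets of its labelled samples), the helper names, and the use of OpenCV's FLANN kd-tree wrapper with its default squared-L2 distances are all assumptions made for the example.

```python
# Illustrative sketch of the Normalised Local NBNN voting (Eqs. 3-6); assumed helpers.
import cv2
import numpy as np

def build_index(class_db):
    # Merge the labelled features of all classes into one FLANN kd-tree index.
    descs, labels, offsets = [], [], []
    for c, (d, off) in class_db.items():
        descs.append(np.asarray(d, np.float32))
        labels += [c] * len(d)
        offsets += list(off)
    index = cv2.flann_Index(np.vstack(descs), dict(algorithm=1, trees=4))  # kd-trees
    return index, labels, offsets

def vote(index, labels, offsets, class_db, test_kps, test_descs, image_shape, k=10):
    det = {c: np.zeros(image_shape, np.float32) for c in class_db}   # one matrix per class
    k_c = {c: len(d) for c, (d, _) in class_db.items()}              # features per class (K_c)
    idx, dist = index.knnSearch(np.asarray(test_descs, np.float32), k + 1, params={})
    for i, kp in enumerate(test_kps):
        background = dist[i, k]            # distance to the (k+1)th neighbour
        closest = {}                       # closest neighbour per class among the first k
        for j in range(k):
            c = labels[idx[i, j]]
            closest.setdefault(c, j)
        for c, j in closest.items():
            score = (dist[i, j] - background) / k_c[c]   # per-feature term of Eqs. 5-6
            ox, oy = offsets[idx[i, j]]
            x, y = int(round(kp.pt[0] - ox)), int(round(kp.pt[1] - oy))  # Eq. 3
            if 0 <= y < image_shape[0] and 0 <= x < image_shape[1]:
                det[c][y, x] += score                    # Eq. 4
    return det
```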
- Step 6: Each detection matrix is convolved with a kernel in order to produce the final detections. The detection kernel can be described as follows:

  $$\begin{aligned} K_c^{d_i}(x,y) = \left\{ \begin{array}{ll} 1 & \quad \text{if } \mathrm{Offset}_x^2 + \mathrm{Offset}_y^2 < R_c \\ 0 & \quad \text{otherwise}, \end{array} \right. \end{aligned}$$ (7)

  where \(K_c^{d_i}(x,y)\) is the detection kernel of class c for the detected feature \(d_i\) centred at location (x, y), and \(\mathrm{Offset}_x\) and \(\mathrm{Offset}_y\) are the differences along the x- and y-axis, respectively, between the kernel centre and the current location (x, y). The final detections \(D_c\) are calculated as follows:

  $$\begin{aligned} D_c = M_c^{d_i} * K_c^{d_i}. \end{aligned}$$ (8)

  The size of a detected pattern is set to the median height and width of the corresponding labelled pattern samples. A minimal sketch of this final step is given below.
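Step 6 reduces to a convolution with a disc-shaped kernel. The sketch below assumes SciPy for the filtering, treats the kernel of Eq. 7 as a disc of radius \(R_c\), and folds in the Gaussian smoothing mentioned in Step 5 (the choice of \(\sigma\) is an assumption, since the text specifies only the kernel size \(R_c \times R_c\)).

```python
# Illustrative sketch of Step 6 (assumed SciPy dependency and sigma choice).
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def disc_kernel(radius):
    # Binary disc of radius R_c, corresponding to the kernel of Eq. 7.
    r = int(np.ceil(radius))
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    return (x ** 2 + y ** 2 <= radius ** 2).astype(np.float32)

def final_detections(detection_matrix, radius):
    # Smooth the accumulated votes, then convolve with the disc kernel (Eq. 8).
    smoothed = gaussian_filter(detection_matrix, sigma=max(1.0, radius / 3.0))
    return convolve(smoothed, disc_kernel(radius), mode="constant")
```

Local extrema of the resulting \(D_c\), boxed with the median width and height of the class's labelled samples, then give the reported detections.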
5 Evaluation on relevant datasets
5.1 The École française d'Extrême-Orient (EFEO) dataset
5.2 The AMADI_LontarSet dataset
| Queries | mAP | Average F-score | Average recall at 0.3 FPPI |
| --- | --- | --- | --- |
| All 36 queries | 0.476 | 0.707 | 0.732 |
| The best-performing 30 queries | 0.560 | 0.780 | 0.810 |
| The worst-performing 6 queries | 0.053 | 0.344 | 0.343 |
5.3 The DocExplore dataset
| Categories | mAP per category |
| --- | --- |
| All 35 categories | 0.587 |
| The best-performing 29 categories | 0.700 |
| The worst-performing 6 categories | 0.041 |