International Journal of Automation and Computing  2018, Vol. 15 Issue (4): 454-461
MFSR: Maximum Feature Score Region-based Captions Locating in News Video Images
Zhi-Heng Wang, Chao Guo, Hong-Min Liu, Zhan-Qiang Huo
School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo 454003, China
Abstract: For news video images, caption recognition is a useful and important step for content understanding. Caption locating is usually the first step of caption recognition, and this paper proposes a simple but effective caption locating algorithm called the maximum feature score region (MFSR) based method, which consists of two main stages. In the first stage, the up/down boundaries are obtained by edge map projection. In the second stage, the maximum feature score region is defined and the left/right boundaries are obtained using MFSR. Experiments show that the proposed MFSR based method achieves superior and robust performance on news video images of different types.
Key words: News video images     captions recognizing     captions locating     content understanding     maximum feature score region (MFSR)
1 Introduction

Text characters embedded in images carry a great deal of useful information and are considered an important aspect of overall image understanding[1]. Generally, text in video images can be divided into two categories[2-5]: 1) Artificial text, also called "superimposed text", which is added manually; 2) Scene text, also called "graphics text", which exists naturally in the video images. Both kinds occur in news video images. In news videos, artificial text is added manually in a post-processing step to convey additional information related to the news video, and it can be further classified into caption text and subtitle text. As a brief summary of the news video, the caption is of great significance for content understanding and content-based news video retrieval[6-8]. Relative to caption text, subtitle text is less important for video content understanding[9]. Scene text is captured by a camera as part of the scene, such as advertising boards, address boards of houses and street landmarks; it is mostly random and usually does not contain useful content information[2]. Some examples of these three kinds of text in news video images are provided in Fig. 1, in which red boxes denote caption text, black boxes indicate scene text and yellow boxes represent subtitle text.

Fig. 1. Examples of three kinds of text in news video images

Caption recognizing is critical in news video processing for news content understanding, and locating is the first step for recognizing[10-12]. In recent years, many methods for text locating in images have been studied, most of which depend on text features such as color, edge or texture. Therefore, existing methods can be mainly classified into three categories: color-based, edge-based and texture-based methods.

1) Color-based methods, also known as connected component based methods, assume that the characters in an image share a uniform color. Through clustering, text is separated from the background as a single-color region. Shim et al.[13] discriminated text regions from the image based on the homogeneity of character pixels. These methods perform satisfactorily on images with monochrome text, but if the background is complex or the characters are polychrome, the result is imperfect. In addition, because of video compression, text in video images often suffers from local color bleeding, which also affects the results. To cope with the variety of text colors, Yan and Gao[14] proposed a new method. First, they separated the RGB image into different layers using the fuzzy c-means clustering algorithm. Then, they generated bounding boxes around candidate text regions based on connected components and eliminated some regions using heuristic rules such as the aspect ratio or size of each bounding box. Finally, the candidate regions in each layer were merged. This method is robust to the text color in the image, yet it has high computational complexity.

2) Edge-based methods are also considered useful for text locating and have been used by many researchers. They exploit the high contrast between text and background. The usual approach is to apply an edge filter (e.g., a Sobel operator) to the image and then determine text regions as those with high edge intensity and density[15]. Lyu et al.[8] located text regions using a local threshold on a Sobel-based edge map. Chen et al.[16] used the Canny operator to preprocess images and obtained the text regions from the edge information. Anthimopoulos et al.[17] also used the Canny operator to detect edges in an image; based on the edge map, they then determined the text regions using morphological operations and projection analysis. Compared with other approaches, edge-based methods are simple and have a low time cost. However, if the background also contains many strong edges, the results will not be satisfactory.

3) Texture-based methods are usually combined with machine learning and rely on the observation that text in an image has distinct textural properties that distinguish it from the background. First, a method such as the fast Fourier transform (FFT) or the wavelet transform is applied to extract the textural properties of each local region in an image. Then, these textural properties are fed into a classifier to estimate whether or not the local region contains text[18]. Ye et al.[19] suggested using wavelet decomposition to obtain the textural properties of different local regions, followed by an adaptive threshold to distinguish between text and non-text areas. Kim et al.[20] preprocessed every pixel using a support vector machine (SVM) and then exploited a continuously adaptive mean shift algorithm to ascertain the text regions. Texture-based approaches are more robust, but they have high computational complexity, since they typically need a small sliding window to scan the entire image with a step of 1 or 2 pixels and determine whether or not the window contains text[21].

Although most existing methods achieve good results in certain applications, they still have shortcomings. Color-based and edge-based methods can locate text regions quickly with low computational complexity, but they are usually sensitive to the background and have weak robustness[22]. Texture-based methods can reduce the interference of the background and are more robust, but their execution speed is relatively slow and it is quite difficult to design reliable machine learning classifiers that classify a small region as text or non-text[21].

Based on the above discussion of previous work on text locating and on the characteristics of caption text in news video images, this paper develops a new caption locating algorithm for news video images, called the maximum feature score region (MFSR) based method. The remainder of this paper is organized as follows: Section 2 details the proposed MFSR method. Experimental results and analysis are presented in Section 3, and the conclusion is given in Section 4.

2 The proposed method

Before introducing the proposed MFSR, we note that captions in news video images of different channels usually have the following characteristics: 1) Captions are mainly located at the bottom of the news video images (almost always in the lower $\frac{1}{4}$ part) and keep a certain distance from the image boundaries. 2) Captions are always aligned horizontally and arranged in one or two lines. 3) The color of caption text is uniform and clearly different from the background. 4) Caption text is larger than subtitle text. According to 1), it is acceptable to process only the lower $\frac{1}{4}$ part of a news video image in our method, which simplifies the problem and reduces the computational complexity.

As shown in Fig. 2, the proposed MFSR method for caption locating consists of four main steps: edge detection, up-down locating based on projection, left-right locating based on MFSR and non-caption removing. The method begins with edge detection. Because of the characters they contain, caption regions in news video images usually have rich edge information, so the positions of caption regions are indicated in the edge map. In our method, the up and down positions of the objective caption are estimated from the horizontal projection of the edge map. The Sobel operator is used to compute the gradient map of the original image, and the binary edge map is then obtained using a simple global threshold $T$, which is set to the mean gradient magnitude over the whole region. In the following discussion, an edge pixel is referred to as a foreground pixel and a non-edge pixel as a background pixel.
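The edge detection step can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' Matlab implementation: the function names `lower_quarter` and `sobel_edge_map` are hypothetical, the input is assumed to be a grayscale image stored as a list of rows, border pixels are assumed to have zero gradient, and a pixel is assumed to be foreground when its magnitude is strictly greater than $T$.

```python
def lower_quarter(gray):
    """Keep only the lower quarter of the rows, where captions are expected."""
    h = len(gray)
    return gray[3 * h // 4:]


def sobel_edge_map(gray):
    """Return a binary edge map: 1 = foreground (edge), 0 = background."""
    h, w = len(gray), len(gray[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):          # border pixels keep magnitude 0 (assumption)
        for x in range(1, w - 1):
            # Horizontal and vertical Sobel responses at (y, x).
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    # Global threshold T: mean gradient magnitude over the whole region.
    t = sum(sum(row) for row in mag) / (h * w)
    return [[1 if m > t else 0 for m in row] for row in mag]


# Tiny demo: a vertical step edge between columns 2 and 3.
demo = [[0, 0, 0, 100, 100, 100] for _ in range(6)]
edges = sobel_edge_map(demo)
```

In the demo, only the two interior columns next to the step are marked as foreground; everything else stays background.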

2.1 Up/Down locating based on projection

Fig. 3 illustrates the principle of up/down locating based on projection. The number of edge points in each row is computed, and a projection vector $V$, whose dimension equals the height $h$ of the region, is constructed. Then, up/down locations are estimated using two thresholds $T_{1}$ and $T_{2}$, where $T_{1}=\delta \times \sum_{i}V(i)/h$ is the minimum number of edge points a text row must contain, and $T_{2}$ ($T_{2}=10$ in our experiments) is the minimum length of a continuous interval. First, the vector $V$ is binarized using $T_{1}$ (one for a text row and zero for a non-text row), and then continuous intervals of ones whose length is larger than $T_{2}$ are accepted as candidate intervals. In practice, more than one candidate interval (pair of up/down boundaries) $[U_{i}, {D}_{i}], i=1, 2, \cdots, K$ is usually found in this step.
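A minimal sketch of this projection step is given below, assuming a binary edge map stored as a list of rows. The function name `up_down_intervals` is hypothetical, row indices are 0-based (the paper counts from 1), and the comparison against $T_{1}$ is assumed to be strict.

```python
def up_down_intervals(edges, delta=0.4, t2=10):
    """Return candidate (U, D) row intervals from the horizontal projection."""
    h = len(edges)
    v = [sum(row) for row in edges]        # projection vector V: edge points per row
    t1 = delta * sum(v) / h                # T1 = delta * mean of V
    is_text = [1 if count > t1 else 0 for count in v]

    intervals, start = [], None
    for i, flag in enumerate(is_text + [0]):   # trailing 0 closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > t2:                 # keep runs longer than T2 rows
                intervals.append((start, i - 1))
            start = None
    return intervals


# Demo: a 15-row text band (rows 5..19) and a 3-row band (rows 24..26).
rows = [[0] * 20 for _ in range(30)]
for y in list(range(5, 20)) + list(range(24, 27)):
    rows[y][:10] = [1] * 10
candidates = up_down_intervals(rows)
```

In the demo, the 3-row band is shorter than $T_{2}$ and is therefore discarded, leaving a single candidate interval.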

2.2 Left/Right locating based on MFSR

Considering the region defined by up/down boundaries determined in the previous section, it is obvious that the characters are closely clustered with uniform inter-character distance as in Fig. 3, and determining left/right boundaries of the caption region means finding a rectangular region containing most of the foreground pixels with the smallest size.

How to obtain the smallest rectangular region containing most of the foreground pixels? The problem is converted to searching the maximum value of the following defined feature score, and the attained region is called MFSR:

 \begin{align} {FS}({l, r})=\frac{{B}({l, r}) \times {{P}({l, r})}}{{(r-l+1)}} \end{align} (1)

where ${B}({l, r})$ and ${P}({l, r})$ are called the beneficial item and the penalty item, respectively, and $l$, $r$ are variables representing the left and right boundaries of the objective region.

Specifically, the beneficial item ${B}({l, r})$ and the penalty item ${P}({l, r})$ are defined as follows. Since our goal is to find a rectangular region containing most of the foreground pixels, a simple strategy is to base the beneficial item on the total number of foreground pixels. ${B}({l, r})$ is expressed as

 \begin{align} {B}(l, r)=\bigg(\sum\limits_{x\in{G}(l, r, {U}_{i}, {D}_{i})}e(x)\bigg)^{\gamma} \end{align} (2)

where ${G}(l, r, {U}_{i}, {D}_{i})$ refers to the rectangular region with boundaries $l$, $r$, ${U}_{i}$, ${D}_{i}$, and ${e}({x})$ is defined as

 \begin{align} {e}({x})=\begin{cases} 1, \quad {\rm if~} {x} {\rm ~is~a~foreground~pixel}\\ 0, \quad {\rm if~} {x} {\rm ~is~a~background~pixel}. \end{cases} \end{align} (3)

To achieve robustness, a robustifying factor $\gamma$ is introduced, which increases the weight of the beneficial item ${B}({l, r})$. In fact, the value of $\gamma$ only needs to be greater than 1. In this paper, $\gamma$ is set to 1.5.

The penalty item is expressed as

 \begin{align} {P}({l, r})=P_{L}(l, r) \times P_{R}(l, r) \times P_{M}(l, r) \end{align} (4)

where

 $P_{L}(l, r)=\alpha-\frac{\sum\nolimits_{x\in Line(l)}e(x)}{D_{i}-U_{i}+1}$ (5)
 $P_{R}(l, r)=\alpha-\frac{\sum\nolimits_{x\in Line(r)}e(x)}{D_{i}-U_{i}+1}.$ (6)

First we discuss ${P}_{L}({l, r})$ and ${P}_{R}({l, r})$, which are called the left penalty item and the right penalty item, respectively. The summation terms in (5) and (6) count the foreground pixels on the left and right boundary lines: the left or right boundary should lie in a blank region without foreground pixels, so a boundary located on a character is penalized. In the above formulas, $\alpha$ is a constant factor ensuring a non-negative value; since each fraction is at most 1, any $\alpha \geq 1$ suffices, and $\alpha$ is set to 1.5 in this paper.

Now we discuss the third penalty item ${P}_{M}({l, r})$, which constrains the internal structure of the objective region. It is motivated by the following consideration: characters in the objective caption region are nearly uniformly distributed, so the foreground pixels can be considered continuous. Therefore, if one or more blank regions without any foreground pixels exist in a test region, the region is rejected as a false one. In practice, this is achieved by the following steps: 1) A small window of size (${D}_{i}$-${U}_{i}$-10) $\times$ $\varepsilon$ is defined, where $\varepsilon$ is the width of the window; 2) The small window slides over the region defined by [${U}_{i}$, ${D}_{i}$] and the number of foreground pixels in the window is stored; 3) Positions at which the small window contains no foreground pixels are marked as blank positions, as shown in Fig. 4; 4) The third penalty item ${P}_{M}({l, r})$ is expressed as follows (where ${N}_{{B}}$ is the number of blank positions occurring in the test region):

 \begin{align} P_{M}(l, r)=\begin{cases} 1, \quad {N}_{B}=0 \\ 0, \quad N_B \geq 1. \end{cases} \end{align} (7)
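The sliding-window test behind ${P}_{M}({l, r})$ can be sketched as follows. Several details are assumptions rather than statements from the paper: the window of height ${D}_{i}-{U}_{i}-10$ is placed on rows ${U}_{i}+5$ to ${D}_{i}-6$, a blank position is identified by the window's left column, and a blank position counts toward ${N}_{B}$ only when the whole window lies inside $[l, r]$. The names `blank_positions` and `penalty_middle` are hypothetical, and indices are 0-based.

```python
def blank_positions(edges, u, d, eps=25):
    """Left columns x at which the eps-wide window holds no foreground pixel."""
    w = len(edges[0])
    top, bottom = u + 5, d - 6               # assumed placement, height d - u - 10
    col = [sum(edges[y][x] for y in range(top, bottom + 1)) for x in range(w)]
    prefix = [0]                             # prefix[k] = sum of col[0..k-1]
    for c in col:
        prefix.append(prefix[-1] + c)
    return [x for x in range(w - eps + 1) if prefix[x + eps] - prefix[x] == 0]


def penalty_middle(blanks, l, r, eps=25):
    """P_M(l, r): 1 if no blank window lies inside [l, r], else 0."""
    n_b = sum(1 for x in blanks if l <= x and x + eps - 1 <= r)
    return 1 if n_b == 0 else 0


# Demo: a 20 x 60 region whose columns 10..29 are filled with foreground pixels.
demo = [[1 if 10 <= x <= 29 else 0 for x in range(60)] for _ in range(20)]
blanks = blank_positions(demo, u=0, d=19, eps=5)
```

In the demo, a test region that hugs the filled columns has no blank window and keeps ${P}_{M}=1$, while a region extending well into the empty area on the right is rejected.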

After the up/down positions of the objective caption region have been determined as in Section 2.1, a candidate caption region is obtained by maximizing the feature score defined in (1). To clarify the details of MFSR, its pseudo-code is given in Algorithm 1.

Algorithm 1. MFSR

Require:   Up/down boundaries [${U, D}$]

Ensure:   Region [${U, D, L, R}$]

1) Initialization: ${L}=1; {R}={W}; \max{FS}=0$

2) for $l=1:{W}/2$

3)    for ${r=l+10:{W}}$

4)       compute feature score:

5)       ${FS}({l, r})={B}({l, r}) \cdot {P}({l, r})/({r-l+1})$

6)       if ${FS}({l, r}) > \max{FS}$

7)          $\max{FS}={FS}({l, r}), {L}=l, {R}=r$

8)       end if

9)    end for

10) end for

11) return [${U, D, L, R}$]
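Algorithm 1 can be turned into a short runnable sketch as shown below. It is an illustration under stated assumptions, not the authors' Matlab code: the function name `mfsr` is hypothetical, indices are 0-based, prefix sums are used so that each ${FS}({l, r})$ is evaluated in constant time, and the blank-window test for ${P}_{M}$ uses the full height of the region instead of the reduced window height ${D}_{i}-{U}_{i}-10$.

```python
def mfsr(edges, u, d, gamma=1.5, alpha=1.5, eps=25):
    """Search the left/right boundaries that maximize the feature score FS(l, r)."""
    w = len(edges[0])
    height = d - u + 1
    # col[x]: foreground pixels in column x between rows u and d.
    col = [sum(edges[y][x] for y in range(u, d + 1)) for x in range(w)]
    cum = [0]
    for c in col:
        cum.append(cum[-1] + c)
    # blank[x] = 1 if the eps-wide window starting at x is empty (full-height simplification).
    blank = [1 if cum[x + eps] - cum[x] == 0 else 0 for x in range(w - eps + 1)]
    bcum = [0]
    for b in blank:
        bcum.append(bcum[-1] + b)

    best, left, right = 0.0, 0, w - 1
    for l in range(w // 2):
        for r in range(l + 10, w):
            last = r - eps + 1                     # last window start inside [l, r]
            n_b = bcum[last + 1] - bcum[l] if last >= l else 0
            if n_b > 0:                            # P_M = 0, so FS = 0
                continue
            benefit = (cum[r + 1] - cum[l]) ** gamma      # B(l, r), Eq. (2)
            p_left = alpha - col[l] / height              # P_L, Eq. (5)
            p_right = alpha - col[r] / height             # P_R, Eq. (6)
            fs = benefit * p_left * p_right / (r - l + 1)  # Eq. (1)
            if fs > best:
                best, left, right = fs, l, r
    return u, d, left, right


# Demo: a 12 x 80 region with a solid block of foreground in columns 20..49.
demo = [[1 if 20 <= x <= 49 else 0 for x in range(80)] for _ in range(12)]
region = mfsr(demo, u=0, d=11, eps=5)
```

For the solid block in the demo, the best score is reached when both boundaries sit on the first blank column next to the block, because boundaries placed on foreground pixels are penalized by ${P}_{L}$ and ${P}_{R}$ and wider regions are penalized by the denominator.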

2.3 Non-caption removing and multi-captions recognizing

As shown in Fig. 5, since more than one group of up/down boundaries may be found in Section 2.1, MFSR may output multiple candidate caption regions. Some of them are false ones as in Fig. 5 (a), while others are due to multiple captions as in Fig. 5 (b). Therefore, further work is needed to distinguish them.

In news video images, caption text is usually larger than other characters, so it is reasonable to assume that the gradient information of the caption region is richer than that of other regions. Here the gradient information is measured by the following formula (the two summation terms denote the sums of the horizontal and vertical gradient values, respectively, over all pixels in a candidate caption region):

 \begin{align} {GI}= \sum\limits_{{X} \in {G}}{d}_{x}({X}) \times \sum\limits_{{X} \in {G}}{d}_{y}({X}). \end{align} (8)

The candidate region with the largest value of GI is directly regarded as the caption region. The remaining candidate regions can be classified as multi-captions or non-captions using the following facts: the heights of multi-caption regions are nearly equal, and their left positions do not differ much from each other. Finally, multi-caption regions are retained and non-caption regions are removed, as in Figs. 5 (d) and 5 (c).
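The GI measure of (8) and the selection of the main caption region can be sketched as follows. The paper does not specify the gradient operator used here, so this sketch assumes absolute central differences with clamped borders; the names `gradient_information` and `pick_caption` are hypothetical, and regions are given as 0-based inclusive tuples (U, D, L, R).

```python
def gradient_information(gray, region):
    """GI of Eq. (8): (sum of |d_x|) * (sum of |d_y|) over the region."""
    u, d, l, r = region
    h, w = len(gray), len(gray[0])
    sum_dx = sum_dy = 0
    for y in range(u, d + 1):
        for x in range(l, r + 1):
            # Central differences; neighbours are clamped at the image border.
            sum_dx += abs(gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)])
            sum_dy += abs(gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x])
    return sum_dx * sum_dy


def pick_caption(gray, regions):
    """Return the candidate region with the largest GI."""
    return max(regions, key=lambda reg: gradient_information(gray, reg))


# Demo: a bright 2 x 2 square in the left half, a flat right half.
gray = [[0] * 8 for _ in range(4)]
for y in (1, 2):
    for x in (1, 2):
        gray[y][x] = 10
textured, flat = (0, 3, 0, 3), (0, 3, 4, 7)
chosen = pick_caption(gray, [flat, textured])
```

The flat candidate has no gradient in either direction, so the textured candidate is selected.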

3 Experiments

To verify the effectiveness of the MFSR, the algorithm has been implemented in Matlab 2012a. A total of 314 news video images with a resolution of 640$\times$480 from different news channels are selected as the test-bed to evaluate the performance of the proposed algorithm. These test images contain 328 caption regions.

3.1 Parameters selection

The parameters significantly affect the performance of the algorithm: $T_{2}$ controls the minimum height of the caption line, $\delta$ controls the value of $T_{1}$, and $\varepsilon$ controls the width of the small window in Section 2.2. Therefore, an important task before using the algorithm is to select appropriate parameters. To this end, we use the over-location rate (OR) and the error-location rate (ER) to measure the effects of the parameters on the identified results, i.e.,

 ${{OR}}=\frac{{\rm {Number~of~over-location~caption~ regions}} }{{\rm{ Total~number~of~caption~ regions}}}$ (9)
 ${{ER}}=\frac{{\rm{Number~of~error-location~caption~ regions}}}{{\rm {Total~number~of~caption~regions}}}.$ (10)

An over-location caption region is one that is not completely surrounded by the bounding box (Fig. 6 top). An error-location caption region is one whose bounding box also surrounds background (Fig. 6 bottom). In Fig. 6, the blue dotted bounding boxes denote the ground truth caption regions and the red bounding boxes denote the experimental results with different parameters (the color figure can be seen in the electronic version). In our experiments, we take $T_{2}$=10 to ensure that all caption regions are detected in the news video images.

Fig. 6. Over-location caption region and error-location caption region

1) The parameters $\delta$ and $T_{1}$

The parameter $T_{1}$ is a benchmark: it is the minimum number of foreground pixels that a text line should contain, and it indirectly determines the positions of the up/down boundaries. The parameter $\delta$ adjusts the value of $T_{1}$, so determining $\delta$ is an important task. If $\delta$ is too small, error-location caption regions appear in the experimental results, i.e., background is located as caption region, as shown in Fig. 6 (c). If $\delta$ is too large, over-location caption regions appear, i.e., the caption region is not located completely, as in Fig. 6 (a). To select the optimal $\delta$, experiments are conducted with different values of $\delta$ on our 314 test images with 328 caption regions. As shown in Fig. 7 (a), as $\delta$ increases, the OR stays at zero for the first several values and then increases, whereas the ER drops toward zero. After $\delta$=0.4, the OR curve rises quickly while the ER curve declines only slowly. Moreover, when $\delta$=0.4, the OR is still equal to zero. Therefore, $\delta$=0.4 is selected in our experiments.

Fig. 7. Parameters determination: (a) OR and ER curves of different $\delta$; (b) OR and ER curves of different $\varepsilon$.

2) The determination of $\varepsilon$

In reality, the caption regions usually keep a certain distance from the image boundaries, as shown in Fig. 6. To accelerate execution, the small window defined in Section 2.2 is used to search for the MFSR, and its width is denoted by $\varepsilon$. However, punctuation may also occur in the caption region, such as the colon in Fig. 6, which makes the inter-character distance larger. If $\varepsilon$ is too small, over-location caption regions appear in the experimental results, i.e., the caption region is not located completely, as in Fig. 6 (b). If $\varepsilon$ is too large, error-location caption regions appear, i.e., background is located as caption region, as shown in Fig. 6 (d). As shown in Fig. 7 (b), the OR drops to zero once $\varepsilon$ reaches 25, after which the ER rises quickly. To ensure that all caption regions can be detected with a low ER, $\varepsilon$=25 is selected in our experiments.

3.2 Results and analysis

For convenient viewing, several experimental results on news video images from different channels are shown in Fig. 8, where red boxes denote the results. In addition, for a quantitative evaluation, the method of [19] is used to decide whether or not an identified caption region is correct: if the intersection of the identified caption region (ICR) and the ground truth caption region (GCR) covers more than 75% of the ICR and 95% of the GCR, the identified caption region is correct; otherwise, it is incorrect. The GCRs in the test news video images are localized manually. In the experiments, the identified results are classified into three kinds: correctly identified caption regions, incorrectly identified caption regions and mistakenly located non-caption regions. The incorrectly identified caption regions do contain caption text, but fail to meet the criterion of [19]. The mistakenly located non-caption regions are subtitle regions or background located as caption regions. As shown in Fig. 9, the white dotted bounding box refers to a correctly identified caption region, the green dotted bounding box to a mistakenly located non-caption region and the yellow dotted bounding box to an incorrectly identified caption region.
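The correctness criterion can be expressed as a small check on two boxes. This is a sketch under assumptions: boxes are inclusive pixel tuples (U, D, L, R), coverage is measured by pixel area, and the names `area` and `is_correct` are hypothetical.

```python
def area(box):
    """Pixel area of an inclusive box (U, D, L, R); empty boxes have area 0."""
    u, d, l, r = box
    return max(0, d - u + 1) * max(0, r - l + 1)


def is_correct(icr, gcr, icr_cover=0.75, gcr_cover=0.95):
    """True if the ICR/GCR intersection covers > 75% of ICR and > 95% of GCR."""
    inter = (max(icr[0], gcr[0]), min(icr[1], gcr[1]),
             max(icr[2], gcr[2]), min(icr[3], gcr[3]))
    a = area(inter)
    return a > icr_cover * area(icr) and a > gcr_cover * area(gcr)


# Demo with made-up boxes (not taken from the paper's data).
gcr = (10, 29, 100, 299)            # ground truth: 20 x 200 pixels
loose = (8, 31, 95, 305)            # slightly larger than the ground truth
clipped = (10, 29, 120, 299)        # misses the first 20 columns of the caption
```

In the demo, the slightly loose box still passes, whereas the clipped box covers only 90% of the ground truth and fails the 95% requirement.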

Here, based on the classification of experimental results, recall rate (RR) and false positive rate (FPR) are utilized to evaluate the algorithm quantitatively, i.e.,

 ${RR}=\frac{{c}}{{c}+{m}} \times 100\%$ (11)
 ${FPR}=\frac{{f}}{{c}+{m}+{f}} \times 100\%$ (12)

where $c$ is the number of correctly identified caption regions, $m$ is the number of incorrectly identified caption regions and $f$ is the number of mistakenly located non-caption regions. The statistics of the experimental results are given in Table 1.
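As a quick arithmetic illustration of (11) and (12), the two rates can be computed as below. The counts in the demo are made up for illustration only and are not the paper's results; the function names are hypothetical.

```python
def recall_rate(c, m):
    """RR of Eq. (11), in percent."""
    return 100.0 * c / (c + m)


def false_positive_rate(c, m, f):
    """FPR of Eq. (12), in percent."""
    return 100.0 * f / (c + m + f)


# Hypothetical counts: 90 correct, 10 incorrect, 25 mistakenly located regions.
rr = recall_rate(90, 10)
fpr = false_positive_rate(90, 10, 25)
```

With these hypothetical counts, RR is 90% and FPR is 20%.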

Table 1
The statistics of experimental results

Fig. 10 shows some experimental results of our method and the method in [23], which is also based on edge detection. In [23], the left/right boundaries of the caption region are obtained by vertical projection, which may result in inaccurate location; our algorithm overcomes this shortcoming. As shown in Table 2, the RR of our method is higher than that of the method in [23], while the FPR is lower. Our proposed method processes a frame in 0.1503 s on a PC with two Intel Celeron 2.60 GHz CPUs. The processing time of our method is longer than that of the method in [23], mainly because our method includes the non-caption removing step. Overall, our method achieves more accurate caption locating at the cost of a longer processing time.

Table 2
The statistics of experimental results

3.3 Experiments on images with different languages

MFSR can be used not only to locate captions in news video images, but also to locate dialogue subtitles in images from movies or teleplays. In our experiments, some movie and teleplay images with different languages, such as English, Korean and Japanese, are selected from the internet to test the proposed algorithm. Some test examples are shown in Fig. 11.