Tracking Players in Indoor Sports Using a Vision System Inspired in Fuzzy and Parallel Processing



Introduction
Sports are an important part of today's society and there is an increasing interest within the sports community in mechanisms that allow a better understanding of team dynamics (their own and their opponents'). This information is frequently extracted manually by operators who, after the game, view game recordings (frequently TV footage) and perform hand annotation, a time-consuming and error-prone task. There is a clear need for automatic mechanisms and methodologies that perform these tasks much faster and more systematically. The importance of such systems was first highlighted in the late 1980s by Franks et al. (Franks & Nagelkerke, 1988; Franks et al., 1987). In this chapter, we present an automatic and intelligent visual system for detecting and tracking handball players based on two cameras that cover the entire playing area. The methodology includes the identification of foreground pixels using dynamic background subtraction and the definition of colour subspaces for each team using a Fuzzy-inspired model that detects the players based on the colour properties of their clothes. Player tracking is further improved by using one Kalman filter per player (object to track). The resulting information is aggregated in an undistorted image view of the entire field that is meaningful to the target end-user. The generation of this video is a demanding computational task that takes advantage of parallel computing. The resulting videos include several pieces of information important to the human end-user. Tests were conducted on videos collected during the Portuguese Handball SuperCup, where the six best Portuguese teams competed for the 2010/11 trophy.
The chapter is structured as follows: the next section presents relevant background on image segmentation methodologies, parallel processing and implementations of automatic visual systems for detecting and tracking players, and describes the main methodologies used. Section 3 discusses the proposed architecture, providing an overview of the methodology. Section 4 presents the results achieved and, finally, Section 5 closes the chapter with the main conclusions.

Related research
This section focuses on the three main areas involved in the implementation of the system. It presents some of the most commonly used methodologies for video/image segmentation and for parallel processing, as well as an overview of systems able to detect and track players in indoor team sports.

Video/image segmentation
Video, and inherently image, segmentation is the first and probably most critical step in any vision system. Video segmentation can be subdivided into temporal segmentation and spatial segmentation. Temporal segmentation corresponds to segmenting the video into meaningful temporal sequences. This kind of segmentation is usually the first step of video annotation and tries to segment the video taking into account similarities/dissimilarities between successive frames (Koprinska & Carrato, 2001). On the other hand, spatial segmentation analyses the content of each frame and divides it into homogeneous regions that correspond to independent objects. The focus of this work is on spatial segmentation; for a detailed survey on temporal video segmentation please refer to (Koprinska & Carrato, 2001). Spatial video segmentation may be performed using the methodologies used for image segmentation, further enhanced using the temporal characteristics of video. In addition, when performing colour analysis there is also the need to choose a colour space. Regarding colour image segmentation, a detailed survey is provided by (Cheng et al., 2001). As they state, most of the existing colour image segmentation methodologies have their origins in grey-scale image segmentation, with the addition of a proper colour space choice. The main categories of image segmentation methodologies (Cheng et al., 2001) are summarized in Table 1. Nowadays there is a tendency to combine techniques from different categories in order to achieve better results. A good example of this tendency is the JSEG algorithm (Deng & Manjunath, 2001), which initially clusters colours into several representative classes, then replaces each pixel by its corresponding colour class label and finally applies a region growing process directly to the class map in order to identify homogeneous regions.
In video, in contrast to static images, besides the spatial (x and y coordinates) and colour information there is also a time component. Using this property it is possible to segment images based on motion over time. There are two main approaches to this task: background subtraction and optical flow. Background subtraction is usually used when a more or less fixed background exists and subdivides the image into foreground and background regions.
Several background subtraction techniques have been proposed in the literature. The main issue with these methods is obtaining a good estimate of the background. The simplest approach is to model the background with a single static image without objects; however, its efficiency is low because it does not account for changes such as lighting effects or shadows that may occur in the background. More robust methods estimate the background model using a moving average (Heikkila & Silvén, 1999), a median, or even a mixture of Gaussians (Grimson et al., 1998).
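The moving-average idea can be sketched as follows (a minimal illustration, not the cited authors' code; the learning rate `alpha` and the foreground threshold `tau` are assumed example values):

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (moving average)."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, tau=25.0):
    """Pixels whose absolute difference to the background exceeds tau are foreground."""
    return np.abs(frame - background) > tau

# Toy example: a flat 4x4 grey-level background with one bright 'object' pixel.
background = np.full((4, 4), 50.0)
frame = background.copy()
frame[1, 2] = 200.0                        # the moving object
mask = foreground_mask(background, frame)  # only (1, 2) is flagged
background = update_background(background, frame)
```

Because the object is blended into the background only at rate `alpha`, slow lighting changes are absorbed while fast-moving objects keep standing out.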

• Histogram Thresholding: determines the peaks or modes of the multi-dimensional histogram of a colour image.
• Feature Space Clustering: groups the image feature space into a set of meaningful groups or classes based on the intensity, colour or texture characteristics of pixels, and not on the spatial relations among them.
• Region Based: includes region growing, the Watershed transform and region split-and-merge. These methods try to divide the image domain based on the fact that adjacent pixels in the same region share similar visual features (colour, intensity, texture or motion).
• Edge Detection: segments the image by finding the edges of each region using one of the well-known edge detectors (Canny (Canny, 1986), Sobel (Sobel, 1970), Roberts (Roberts, 1963)).
• Fuzzy: these methods allow classes and regions to carry a certain uncertainty and ambiguity, which is generally the case in image processing.
• Neural Networks: allow parallel processing and the incorporation of non-linearities; they can be used for pattern recognition, classification or clustering.

Table 1. Main categories of colour image segmentation methodologies (Cheng et al., 2001).

Optical flow (Barron & Thacker, 2005) is based on the fact that when an object moves in front of a camera there is a corresponding change in the image; however, it assumes small displacements over time.
Following this tendency, we propose, for the video segmentation step, a methodology that combines background subtraction for detecting foreground regions, region growing, and a Fuzzy-inspired categorization for colour calibration.

Parallel processing
Initially, computing meant pure sequential processing, that is, sequential operations over time on a single dedicated processor. The hunger for usefulness and processing power led to advanced solutions such as concurrency and distributed computing. Given large "workloads", concurrency interleaves processing over time for the multiple "workloads" (even on a single processor). Distributed computing shares "workloads" (or parts of them) among different processors linked over a network. Recent advances in computer architecture introduced multi-core architectures that enable true parallel processing of several "workloads" in the same computer. The challenge remains unaltered: to maximize the usefulness of a computer system and harness as much computing power as possible, possibly circumventing technical limitations of an architecture, whilst maximizing performance but still keeping cost interesting.
When considering a single complex (large) computational job, parallel processing starts by (i) dividing the work into several pieces (this can be done by the programmer, at run time, or a mixture of both) and then (ii) distributing those pieces among the available processors and combining the partial results. Two classic obstacles arise:
• Synchronization problems: the explicit need for sequential operations caused by interdependence among work parts or by the use of shared resources;
• Load imbalance: the difficulty of splitting the workload into even amounts of work, with the likely problem that, if one intermediate calculation takes longer to process than the others, the optimization is not as effective as one could wish.
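The divide-then-distribute pattern can be sketched in Python (an illustration of the structure only; the function names are ours, and CPython threads do not provide true CPU parallelism for numeric work such as this, so the example shows the shape of the solution rather than a speed-up):

```python
from concurrent.futures import ThreadPoolExecutor

def process_piece(piece):
    """A stand-in 'work part': here, simply sum the numbers in the piece."""
    return sum(piece)

def parallel_sum(data, n_workers=4):
    # (i) divide the work into several roughly even pieces
    chunk = (len(data) + n_workers - 1) // n_workers
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # (ii) distribute the pieces among workers and combine the partial results
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(process_piece, pieces))
    return sum(partials)

total = parallel_sum(list(range(1000)))
```

Note how the even chunking addresses load imbalance and the final `sum(partials)` is the sequential combination step that synchronization problems stem from.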
In parallel computing, the scalability curve is defined as the speed-up in processing (performance gain) as a function of the number of available processors. Generally, it is not likely at all that a job done on a single processor in a time t1 can be done on N processors in t1/N of that time; that is, the time tN to solve a problem on N processors very frequently satisfies tN > t1/N. By adding too many processors (large N), the scalability curve drops heavily when too many work parts are generated and the computational overhead of parallel processing outweighs the benefit of having many "workers" (processors). Additionally, adding processors will likely have a severe financial impact on the final overall cost of the computational solution.
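The saturation behaviour described above is commonly approximated with Amdahl's law (a model we introduce here for illustration; the chapter does not use it explicitly). If a fraction p of the job is parallelizable, the ideal speed-up on N processors is 1 / ((1 - p) + p / N):

```python
def amdahl_speedup(p, n):
    """Ideal speed-up on n processors when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, speed-up saturates well below N:
s8 = amdahl_speedup(0.95, 8)        # noticeably less than 8
s1024 = amdahl_speedup(0.95, 1024)  # capped near 1 / (1 - 0.95) = 20
```

This is why tN > t1/N in practice, and why very large N yields diminishing returns even before overhead and cost are considered.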
OpenMP (OpenMP, 2011) is a cross-platform, shared-memory application programming interface (API) for taking advantage of the run-time distribution of work parts across several processors (Chapman et al., 2007). OpenMP (OMP) was formally introduced in 1997 and is maintained by the Architecture Review Board (ARB), a consortium of industry and academia; its licence is essentially free. Advertised benefits include, among others, good scalability, implicit communication and programming at a (somewhat) high level. The technological limitations on performance include transferring code and data among several types of memory and assigning work parts (threads) to the available resources (processors). CUDA (NVIDIA, 2011) stands for Compute Unified Device Architecture and is a software platform for massively parallel high-performance computing on NVIDIA's video graphics hardware. The CUDA software toolkit is proprietary to NVIDIA, essentially allows a free licence of use and was formally introduced in 2006. A large portion of the (recent, high-end) hardware from that manufacturer is usable with CUDA. The intent is to turn the resources of the video card into a number of generic processors and memory (Halfhill, 2008). The actual number of processors and the amount of memory available for generic use on the video card are hardware dependent and highly volatile over time. Sometimes 16 or more independent processors are available for custom parallel programming (with working frequencies frequently below 1 GHz), and about 512 MB of memory is also frequently usable. These resources are usable even while the video card is being used to display 2D images. Communication with the video card has inherent technological limitations, as code and data must be transferred into the video card, an "external" element when compared to the CPUs and the memory system.
While using the video card(s) as generic processors is sensitive to hardware changes, it is most interesting for applications sharing lots of data, with the added benefit of freeing up the main processor(s) for other tasks. High-end video cards have risen in complexity, performance and cost, and may currently be more expensive than the "main" general-purpose CPU of the computer. The reader should be aware that, at the time of writing, parallel programming is still in its infancy. A promising framework exists, however: OpenCL (Open Computing Language) is an open, royalty-free standard for general-purpose parallel programming of heterogeneous systems across different hardware (Group, 2011b). OpenCL provides a uniform programming environment for software developers to write efficient, portable code for high-performance computing using a diverse mix of multi-core CPUs, GPUs, etc. OpenCL is recent, first introduced in late 2009 by the Khronos Group (Group, 2011a), a consortium of companies including Intel, AMD, ATI, ARM and NVIDIA, among others. Its performance will likely be inferior to pure CUDA, as OpenCL calls CUDA when possible. These topics are currently of great interest to the scientific and technological communities because they promise great performance benefits; the most important topics and dominant toolkits have been briefly addressed in this section.
Prospective users of parallel computing should take into consideration that there is a non-negligible effort in learning the concepts at stake, and an even greater one in porting large conventional algorithms to take advantage of parallel processing. While OpenMP is not far from common programming techniques, CUDA's approach of offering massive parallelism over shared data requires different thinking, and most likely very large portions of the algorithm will have to be rewritten almost from scratch. As with other optimization techniques, profiling the application is of great interest to find which parts of the code take the most time to execute and which will benefit from parallel computing.

Player detection and tracking using vision systems
Although there are devices that detect and track players using methodologies other than vision (Radio Frequency Identification, Local Position Measurement (Stelzer & Fischer, 2004) or Global Positioning Systems), these methodologies are beyond the scope of this chapter and will not be addressed.
The sport that has received the most attention from the scientific community is soccer; nevertheless, the focus of this section is on work developed in indoor sports, since our object of study is handball and the challenges posed by indoor and outdoor sports are quite different. Outdoor sports usually receive more attention from the media, and therefore it is possible to use images provided by TV broadcast cameras; however, the lighting conditions are much worse and the environment is less controllable. On the other hand, indoor sports usually need a dedicated camera system and, despite the more controlled environment, players tend to be closer to each other (the playing area is smaller), which brings added difficulties since merging and occlusion situations occur more often. In indoor team sports it is possible to find works on basketball, handball and indoor soccer, using either broadcast video footage or dedicated cameras placed at strategic locations, and several methodologies for player detection and tracking.
Concerning basketball games, we can find (Hu et al., 2011) using broadcast videos. Players are detected by extracting the field (through dominant colour detection) and generating a player mask. Afterwards, players are tracked in image coordinates using a CamShift-based tracking algorithm and their positions are converted into real-world coordinates using an automatic calibration methodology. Results are presented for 48 consecutive frames and show high precision and recall, 91.38% and 91.34% respectively. However, occlusion and merging between players of the same team affect the detection and are not taken into consideration.
A very promising project is the Autonomous Production of Images based on Distributed and Intelligent Sensing (APIDIS) (APIDIS, 2008), which equipped a basketball court with an acquisition network composed of microphones and conventional and (arrays of) omnidirectional cameras in order to provide a basketball dataset. This dataset was used by (Alahi et al., 2009; Delannay et al., 2009). Taking advantage of the flexibility of the setup and the number of cameras, (Alahi et al., 2009) were able to minimize occlusion and merging problems and determine the 3D positions of the players. Player detection is performed via a sparsity-constrained binary occupancy map based on severely degraded silhouettes, obtained through a basic background subtraction method. Results show that using more than one camera can tremendously increase both the precision and recall of player detection, from 57% and 62% with a single camera to 76% and 72% when an omnidirectional camera is added. (Delannay et al., 2009) detect the players from foreground activity masks determined on multiple views. They take into consideration that players occupy a 3D space and therefore sum the cumulative projections of the multiple views' foreground activity masks on a set of planes parallel to the ground plane. Afterwards, regions with larger projection values are considered players and scanned for digits in order to detect each player's number. The tracking propagation is performed frame by frame and is based on the Munkres general assignment algorithm (Munkres, 1957). They compare their methodology with methods that project the activity masks onto a single plane and conclude that projecting onto multiple planes increases the detection rate and minimizes shadow effects.
We found two works (Kristan et al., 2009; Monier et al., 2009) that explore both basketball and handball, use closed-world assumptions and rely on two dedicated cameras placed at the ceiling of the sports hall. Placing the cameras at the ceiling has the advantage of minimizing occlusion problems.
In the first work, players are detected by applying a background mask obtained by thresholding the differences between the background (obtained by a simple method, e.g. a median filter) and the current frame, with a dynamic threshold specific to each player. The tracking algorithm is formulated as a closed-world problem based on a bootstrap particle filter, and each tracker is initialized manually. Multiple-player tracking is achieved by partitioning the world into Voronoi cells, one per player, that are updated at each time step. Results indicate low failure rates per player per minute, less than 1.1 in the worst test case. The second work follows some of the assumptions of (Kristan et al., 2009); however, foreground pixels are detected by building a dynamic background model rather than acting on the threshold level or generating a background mask, and template matching is then applied to track the players. The templates for each player are manually initialized by the user, and tracking is performed by searching a region of interest determined by the image resolution and restrictions on the players' speed. Multiple tracking is also achieved by partitioning the world into Voronoi regions in which each tracker acts. The fusion between the two images is performed in real-world coordinates. Results indicate average correction rates between 0.0019 and 0.00677 corrections/frame/player.

Regarding indoor soccer, there are the works of (Needham & Boyle, 2001) and (Kasiri-Bidhendi & Safabakhsh, 2009). (Needham & Boyle, 2001) use a single stationary camera and apply a multiple-object Condensation algorithm. Initially a bounding box for each player is detected; then, through a propagation algorithm, the fitness of each bounding box is evaluated and adjusted. The prediction stage of the Condensation algorithm takes into consideration position estimates from a Kalman filter.
They report trajectory-based results compared with hand-annotated values and indicate distance errors of less than 1 metre, which is quite a high value given that a minimum indoor soccer field measures 15x25 metres.
On the other hand, (Kasiri-Bidhendi & Safabakhsh, 2009) use TV broadcast images. Initially, the background colour is determined through clustering; afterwards the entire field region is extracted using the Mahalanobis distance between pixels in the frame and the background colour distribution. Once the image is background-free, the field lines are extracted using a Canny edge operator. The remaining information in the image corresponds to the players and the ball, which are further refined using physical constraints of area and roundness. Player and ball tracking is based on level-set contours; with this technique, occluded players are tracked by a single contour instead of two during the occlusion period and are again correctly detected after splitting.

Our work follows (Santiago et al., 2011), which presents a colour-based methodology for detecting handball players. In addition, we have included a dynamic background model to take light changes into consideration and a Fuzzy-inspired methodology to calibrate the team colours. Moreover, a vector of Kalman filters is used for the tracking process, and a HyperVideo with the resulting information is generated.
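Since several of the works above, and the system presented in this chapter, rely on one Kalman filter per tracked player, a minimal constant-velocity sketch may help fix ideas (the state is [x, y, vx, vy]; the noise matrices Q and R are illustrative assumptions, not tuned values from any of the cited systems):

```python
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # constant-velocity state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only position is measured
Q = np.eye(4) * 1e-2                        # process noise (assumed)
R = np.eye(2) * 1.0                         # measurement noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P

# One such tracker per player: predict the player's next position, then
# correct it with the detected blob centre (a fabricated measurement here).
x = np.array([0.0, 0.0, 1.0, 0.5])          # start at origin, moving right/up
P = np.eye(4)
x, P = predict(x, P)                        # predicted position near (1.0, 0.5)
x, P = update(x, P, np.array([1.1, 0.4]))   # corrected towards the measurement
```

The prediction step is also what defines the search region for detection, as used later in the background subtraction of this chapter.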

Engineering solution
The challenges in defining a vision system able to identify and track the players in a team game are huge due to the dynamic and spatial characteristics of the game itself. Indoor team games are usually quite dynamic, with intense physical contact among players and rapid movements; a player can reach velocities of more than 5 m/s, for example. These characteristics impose a careful choice of the system's architecture, which includes choosing not only the cameras and their placement but also the software design. In order to minimize crowd interference, merging and occlusion situations, and to have a good view of the entire field, we propose using two cameras placed at the ceiling of the sports hall, each covering half of the field (a solution also adopted by (Kristan et al., 2009; Monier et al., 2009)). The chosen cameras are GigEthernet (DFK 31BG03.H model from Imaging Source), which allows placing the recording unit far away from the cameras while maintaining signal integrity; they have a resolution of 1024x768 pixels and can supply images at 30 frames per second. The images collected from the cameras are shown in Fig. 1.
We propose a two-module software system: one module responsible for acquiring the images from the two cameras (Acquisition System) and another responsible for the offline processing of the two video streams (Processing System). The latter is responsible for detecting and tracking the players as well as for generating a HyperVideo, which consists of a unified image of the two streams overlaid with the positions of the players. Additionally, it generates log files with the players' positions so that the sports community can use them to perform game analysis and infer game statistics. Figure 2 presents a scheme of the proposed architecture.

Player detection
Following previous work (Santiago et al., 2011), player detection is achieved through colour identification and is composed of three steps. The first step is the colour calibration of each team using a region growing method allied with a Fuzzy-inspired categorization methodology. This calibration subdivides the colour space into subspaces, which are not necessarily disjoint since some colours may be common to both teams (for example, many teams have uniforms with white stripes). Afterwards, the user manually indicates the location of each player on the field by clicking on the images. Although some works perform this initialization automatically, we chose this approach because both handball and basketball allow unlimited player substitutions; manual initialization makes it possible to always assign the same number to a player and to discard a player when he/she leaves the field.
The second step consists of detecting foreground regions through a dynamic background subtraction method, which uses an empty image of the field and a dynamic threshold that is continuously and locally updated at each new frame (as described later in this chapter).
After the foreground pixels are identified, their colours are compared against the colour subspaces and classified into one of the teams. In case of a tie in belonging between teams, the adjacent pixels are inspected in order to break the tie. Additionally, the teams' colour subspaces are updated with the new information. Finally, pixels are aggregated into blobs and categorized as player or non-player according to size and density restrictions. The centre of mass of the blob is taken as the player's position, which is then transformed into real-world (court) coordinates using the cameras' homographies.
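The final mapping step can be sketched as follows: a blob centre in image pixels is mapped to court coordinates with a 3x3 homography in homogeneous coordinates. The matrix below is a made-up example, not the calibration of the actual cameras:

```python
import numpy as np

# Hypothetical homography for illustration only (not the real calibration).
H = np.array([[0.05, 0.0, -2.0],
              [0.0, 0.05, -1.5],
              [0.0, 0.0, 1.0]])

def image_to_court(h, u, v):
    """Apply homography h to pixel (u, v) using homogeneous coordinates."""
    p = h @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]   # divide out the scale factor

x, y = image_to_court(H, 512, 384)    # centre of a 1024x768 image
```

The division by the third homogeneous component is what distinguishes a homography from a plain affine transform and is essential when the camera views the court at an angle.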

Colour calibration
The colour calibration is performed under the supervision of the user and is achieved using a region growing method (Santiago et al., 2011). Let us define the colour subspace Sc as the set of RGB colour triplets tagged as having the colours of the vests of team c. The initial colour seeds C(xs, ys) for each colour subspace Sc are set manually by clicking with the mouse on the objects to be segmented; afterwards, the surrounding pixels' colours C(xa, ya) are agglomerated around these seeds using colour distance criteria. The colour expansion is performed in the HSL (Hue, Saturation, Luminance) colour space in order to minimize the effects of shadows and light variations. The region growth proceeds in all directions (using an 8-neighbour mask n8) in a recursive way until reaching a pixel whose colour is further from the seed than a global threshold (C_ThresG) or further from its previous neighbour C(xp, yp) than a local threshold (C_ThresL), according to the following definition (both thresholds are user definable):

C(xa, ya) ∈ Sc ⇔ ∀(xa, ya) ∈ n8(xp, yp): Δ(C(xa, ya), C(xp, yp)) < C_ThresL ∧ Δ(C(xa, ya), C(xs, ys)) < C_ThresG,

where
• C(x, y) is the HSL colour of the pixel at location (x, y),
• n8(x, y) are the eight neighbours of the pixel at location (x, y),
• Δ(C1, C2) is a configurable weighted distance function involving the HSL components of colours C1 and C2.
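The dual-threshold growing process above can be sketched on a toy grey-level image (a minimal illustration: the weighted HSL distance Δ is reduced to an absolute difference, and the recursion is replaced by an equivalent queue-based traversal to avoid recursion limits):

```python
from collections import deque

def grow_region(img, seed, thres_local, thres_global):
    """Grow a region from `seed` while both the local and global distance
    conditions hold, visiting the 8-neighbourhood of each accepted pixel."""
    rows, cols = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    region, queue = {seed}, deque([seed])
    while queue:
        px, py = queue.popleft()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                xa, ya = px + dx, py + dy
                if (xa, ya) in region or not (0 <= xa < rows and 0 <= ya < cols):
                    continue
                local_ok = abs(img[xa][ya] - img[px][py]) < thres_local    # C_ThresL
                global_ok = abs(img[xa][ya] - seed_val) < thres_global     # C_ThresG
                if local_ok and global_ok:
                    region.add((xa, ya))
                    queue.append((xa, ya))
    return region

# The left two columns drift gently from the seed value 10; the right
# column (50s) is rejected by both thresholds.
img = [[10, 11, 50],
       [12, 13, 52],
       [14, 15, 55]]
region = grow_region(img, (0, 0), thres_local=5, thres_global=10)
```

The local threshold allows gradual shading changes along the vest, while the global threshold prevents the region from drifting arbitrarily far from the seed colour.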
During the colour expansion, each colour value is attributed a belonging degree to the subspace being calibrated. This value is stored in a lookup table that contains, for each colour triplet, the belonging degree to each subspace. Although the expansion is performed in the HSL colour space, the colour lookup table is built in the RGB (red, green, blue) colour space, as are the remaining image processing operations. The fuzzy belonging µ() of the colour C() of a pixel P with coordinates (xP, yP) to a given colour subspace Sc is µ_Sc(C(xP, yP)) and can assume four levels: when not belonging to the colour subspace, µ() = 0 and B() = C0 (by default, before the calibration takes place, all colours are categorized with no belonging degree to every subspace); for a low belonging degree to the colour subspace, µ() = 0.5 and B() = CL; for a full belonging degree, µ() = 1 and B() = CF; and, additionally, a full belonging degree with the characteristic of also being a colour seed, µ() = 1 and B() = CS. The B_Sc function maps the four colour categories (C0, CL, CF and CS) into the fuzzy belonging according to Table 2.
Colour                 B_Sc   µ_Sc
Not the colour         C0     0
Resembles the colour   CL     0.5
Is the colour          CF     1
Is a seed colour       CS     1

Table 2. Mapping of B_Sc to fuzzy belonging µ().
In order to determine the belonging degree of a colour triplet, the following rules are applied sequentially during the region growing process:
• Rule 1: if the pixel was assigned to the subspace, is physically quite close to the initial seed pixel and its colour distance to the initial seed pixel is less than one fifth of the maximum allowed distance for the growing process (less than (1/5) C_ThresG), then it is also assumed to be a seed pixel with a full belonging degree:
B_Sc(C(xa, ya)) = CS ⇐ C(xa, ya) ∈ Sc ∧ Δ((xa, ya), (xs, ys)) < (1/5) C_ThresG
• Rule 2: if the colour distance to the initial seed pixel is less than two fifths of the maximum allowed distance for the growing process, then the pixel is categorized with a full belonging degree but without being a seed:
B_Sc(C(xa, ya)) = CF ⇐ C(xa, ya) ∈ Sc ∧ Δ(C(xa, ya), C(xs, ys)) < (2/5) C_ThresG
• Rule 3: otherwise, if the pixel obeys the region growing conditions, it is categorized with a low belonging degree (CL).
By the end of the calibration process the colour space is subdivided into subspaces, which are not necessarily disjoint since the same colour can belong to different subspaces. The motivation for allowing non-disjoint subspaces is that teams frequently share colours; for example, uniforms with white stripes are common, so the exact same colour may belong to the two opposing teams. Hence the Fuzzy Logic inspiration of the methodology; implementation issues, however, make it uninteresting to allow continuous degrees of belonging. The belonging degrees attributed to each colour triplet, as will be seen later, allow ties to be broken but also generate dynamic subspaces that can adapt (grow or shrink) during the game. Subspaces do not have, nor ever acquire, any predefined shape, as they are created from user-selected seeds in the image, grown both in the user-selected video frames and in colour space.

Background subtraction
Since the background is more or less static, due to the semi-controlled environment of an indoor game, the subtraction is performed using an empty image of the viewed scene; only the threshold used to distinguish between foreground and background pixels is dynamic and specific to each pixel.
The background subtraction is performed in the RGB colour space, because tests showed that, for some pixels, a small difference between the RGB colour components of the background and the processed images corresponded to a large difference in the Hue component (HSL colour space). In fact, non-linear colour spaces suffer from the non-removable singularity problem, as stated by (Cheng et al., 2001). Also, in order to shorten the processing time, the subtraction is executed locally and not over the entire image. In other words, only predefined regions, defined by the Kalman Filter predictive stage, undergo this process. The threshold applied to each pixel is only updated if the pixel is classified as background; otherwise its value remains unchanged. The update follows Eq. 1 and the value is never allowed to go below 4% or above 23.5% of the entire colour range (0-255) for each colour component. These values were obtained experimentally. α is a learning constant, which in our specific case was set to 0.02. Pixels whose colour difference to the background image is less than the respective threshold are labelled as background; the others are labelled as foreground.
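A minimal sketch of this per-pixel adaptive thresholding follows. Since Eq. 1 is not reproduced in this extract, the exponential update form below is an assumption (a standard running-average rule using the learning constant α); the clamping bounds come directly from the text (4% and 23.5% of the 0-255 range), and the per-channel maximum difference is an illustrative choice.

```python
ALPHA = 0.02                               # learning constant from the chapter
T_MIN, T_MAX = 0.04 * 255, 0.235 * 255     # ~10.2 and ~59.9, per the text

def classify_pixel(pixel_rgb, bg_rgb, threshold):
    """True -> background: the colour difference is below the pixel's threshold."""
    diff = max(abs(p - b) for p, b in zip(pixel_rgb, bg_rgb))
    return diff < threshold

def update_threshold(threshold, diff, is_background):
    """Adapt the threshold only for background pixels (assumed form of Eq. 1)."""
    if not is_background:
        return threshold
    new_t = (1 - ALPHA) * threshold + ALPHA * diff
    return min(max(new_t, T_MIN), T_MAX)   # never below 4% nor above 23.5%
```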

Team identification
After the foreground pixels are identified, their colour is compared against the colour lookup table that resulted from the calibration process (Section 3.2.1) and classified into one of the teams. Since the same colour can belong to different teams (subspaces), a pixel may be classified into more than one team. To break this tie, information not only from the belonging degree itself but also from adjacent pixels that have already been classified is used. Spatial information is used by counting the number of adjacent pixels that belong to each team; if the count of the team with the highest value is at least 1.5 times that of the lowest team, then that pixel is assigned a weight of 2 for that team, otherwise a weight of 1 is assigned to both teams. Colour calibration information is used by adding the corresponding fuzzy belonging values to the previous weight (as shown in Table 2). The team with the highest final weight is the one assigned to the pixel. This way, a team may win the pixel due to the neighbourhood characteristics even when its belonging degree from the colour calibration information is lower than that of the other team. Additionally, if the winning team has a full belonging for that colour triplet and it corresponds to a seed colour (B(·) = C_S), then a region growing process is triggered and the colour lookup table that contains the information concerning the colour subspaces is updated. This auto expansion is more restrictive than the one performed during the manual initialization and is performed at time intervals (t_expans), an adjustable setting. This setting may be changed to take into account the speed at which light changes in the pavilion, and yields better overall performance for the application.
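The tie-breaking step described above can be sketched as follows. The 1.5 ratio and the weights of 2 and 1 come from the text; the function name, the tie-handling when the final weights are equal, and the numeric fuzzy values are illustrative assumptions.

```python
def assign_team(neigh_a, neigh_b, fuzzy_a, fuzzy_b):
    """Break a tie for a pixel whose colour belongs to both team subspaces.

    neigh_a, neigh_b: already-classified adjacent pixels counted per team;
    fuzzy_a, fuzzy_b: belonging values from the colour lookup table (Table 2).
    Returns 'A' or 'B' (ties resolved towards 'A' by assumption).
    """
    w_a = w_b = 1
    if neigh_a > 0 and neigh_a >= 1.5 * neigh_b:
        w_a = 2                      # spatial evidence clearly favours team A
    elif neigh_b > 0 and neigh_b >= 1.5 * neigh_a:
        w_b = 2                      # spatial evidence clearly favours team B
    w_a += fuzzy_a                   # add colour-calibration evidence
    w_b += fuzzy_b
    return 'A' if w_a >= w_b else 'B'
```

For example, a pixel surrounded mostly by team-A pixels can be won by team A even when both fuzzy belonging values are equal.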
In order for this update not only to add colour triplets to the subspaces but also to remove them (otherwise subspaces would grow too much), each colour triplet has an associated persistence (p_{S_c}(R, G, B)) in that subspace. Colours with lower belonging have lower persistence and colours with higher belonging have higher persistence. The initial persistence given to the colour is proportional to the time between auto expansions, according to Eq. 2.
The persistence is maximum (with the values defined in Eq. 2) when the colour is added to the subspace and diminishes whenever it is not detected in a frame; seed colours, however, have infinite persistence and therefore always remain in the subspace. Whenever the persistence value reaches zero, the colour triplet is removed from the subspace.
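The persistence mechanism can be sketched as below. Since Eq. 2 is not reproduced here, the initial value of (1/8)·t_expans is taken from the worked example later in the chapter (3.75 for C_L triplets with t_expans = 30); the unit decrement per undetected frame and the reset-on-detection behaviour are assumptions.

```python
T_EXPANS = 30                         # frames between auto expansions
INITIAL_PERSISTENCE = T_EXPANS / 8    # 3.75, per the C_L example in the text

def step_persistence(entry, detected):
    """Advance one frame for a colour-triplet entry {'degree', 'persistence'}.

    Returns False when the triplet should be removed from the subspace.
    """
    if entry['degree'] == 'C_S':
        return True                               # seeds persist forever
    if detected:
        entry['persistence'] = INITIAL_PERSISTENCE  # back to maximum (assumed)
    else:
        entry['persistence'] -= 1                   # diminish when unseen
    return entry['persistence'] > 0
```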
With the introduction of this dynamic it is possible to have mutable subspaces that adapt to light changes occurring either at different regions of the same frame or between frames. While the foreground pixels are classified, they are also aggregated horizontally to form run length encoding (RLE) structures, characterized by the y, x_min and x_max positions of the RLE. Finally, the RLEs are merged vertically to form blobs. A full description of this pixel aggregation can be found in (Santiago et al., 2011), with the particularity that small horizontal RLEs are ignored and do not pass to the vertical merging process, in order to minimize noise. The blobs resulting from this pixel aggregation are further refined according to size and colour density constraints: blobs that are too small or too large, or blobs that have low colour density, are discarded. The colour density is measured as the ratio between the number of pixels inside the blob's bounding box that belong to the team and the total number of pixels of the bounding box. The remaining blobs are considered players belonging to a given team (S_c) and have an (x, y) position in image and world coordinates. The position in image coordinates is calculated as the center of mass of the blob, according to Eq. 3.
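A compact sketch of this two-stage aggregation (horizontal RLEs, then vertical merging into blobs) is shown below. The minimum run length used to suppress noise, and the simplification that each RLE merges into at most one existing blob, are illustrative assumptions; the full algorithm is in (Santiago et al., 2011).

```python
MIN_RLE_LEN = 3  # assumed noise threshold for horizontal runs

def row_rles(mask_row, y):
    """Aggregate consecutive foreground pixels of one row into (y, x_min, x_max) RLEs."""
    rles, start = [], None
    for x, fg in enumerate(mask_row):
        if fg and start is None:
            start = x
        elif not fg and start is not None:
            if x - start >= MIN_RLE_LEN:        # small RLEs are ignored
                rles.append((y, start, x - 1))
            start = None
    if start is not None and len(mask_row) - start >= MIN_RLE_LEN:
        rles.append((y, start, len(mask_row) - 1))
    return rles

def merge_blobs(rles):
    """Merge vertically adjacent, horizontally overlapping RLEs into blobs."""
    blobs = []
    for y, x0, x1 in sorted(rles):
        for blob in blobs:
            ly, lx0, lx1 = blob[-1]
            if y == ly + 1 and x0 <= lx1 and x1 >= lx0:
                blob.append((y, x0, x1))        # extend an existing blob
                break
        else:
            blobs.append([(y, x0, x1)])         # start a new blob
    return blobs
```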
(x_cm_blob, y_cm_blob) = ( (1/N) Σ_{(x,y)∈blob} x , (1/N) Σ_{(x,y)∈blob} y )   (3)

, where the sums run over the N pixels of the blob and c is the team the blob belongs to.
The world coordinates 1 are obtained by first removing the barrel effect produced by the lens (only the radial component was considered, since the tangential component can, in most cases, be neglected) using Eq. 4. The unknowns in these equations (k_1, k_2, k_3, x_c and y_c) are determined using the information extracted from the field lines.

x_u = x_d + (x_d − x_c)(k_1 r² + k_2 r⁴ + k_3 r⁶)
y_u = y_d + (y_d − y_c)(k_1 r² + k_2 r⁴ + k_3 r⁶)   (4)

, where
• r² = (x_d − x_c)² + (y_d − y_c)²,
• (x_u, y_u) are the undistorted coordinates,
• (x_d, y_d) are the distorted coordinates,
• (x_c, y_c) are the coordinates of the center of distortion of the lens,
• k_1, k_2 and k_3 are the radial coefficients for barrel distortion.
Fig. 3 illustrates the images before and after removing the barrel effect for the two cameras.
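Eq. 4 translates directly into code. The function name is illustrative; the formula is exactly the radial model above, with the tangential component neglected as in the chapter.

```python
def undistort(xd, yd, xc, yc, k1, k2, k3):
    """Remove radial (barrel) distortion per Eq. 4.

    (xd, yd): distorted pixel; (xc, yc): center of distortion;
    k1..k3: radial coefficients. Returns the undistorted (xu, yu).
    """
    r2 = (xd - xc) ** 2 + (yd - yc) ** 2
    f = k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3   # k1 r^2 + k2 r^4 + k3 r^6
    return xd + (xd - xc) * f, yd + (yd - yc) * f
```

Note that a point at the center of distortion is left unchanged, and points move radially outward for positive coefficients.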
Once the barrel effect is removed from the players' positions, it is possible to apply the pinhole camera model in order to obtain the world coordinates of the players. This model uses intrinsic parameters (K) and extrinsic parameters (R and T) to relate world coordinates (X) and image coordinates (x) according to Eq. 5.
x = K[R|T]X ⇔ x = H_w X   (5)

The H matrix is defined according to Eq. 6:

            [ cosφ·cosα   sinω·sinφ·cosα − cosω·sinα   cosω·sinφ·cosα + sinω·sinα   T_x ]
H_w = K ·   [ cosφ·sinα   sinω·sinφ·sinα + cosω·cosα   cosω·sinφ·sinα − sinω·cosα   T_y ]   (6)
            [ −sinφ       sinω·cosφ                    cosω·cosφ                    T_z ]

, where
• f is the focal length,
• c_x and c_y are the coordinates of the optical center,
• φ, ω and α are the rotations around the y, x and z axes, respectively,
• T_x, T_y and T_z are the translations along the x, y and z directions.
The coordinates are projected at 1.2 m from the ground, which corresponds to a best-effort estimate of the height of the center of mass of an average player in the most frequent playing postures. This projection yields a more accurate measure of the players' positions and enables information fusion between the cameras.
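Mapping a pixel back to world coordinates on the 1.2 m plane can be sketched by restricting Eq. 5 to that plane, which induces an invertible 3×3 homography. This is a reconstruction under the standard pinhole relations, not the authors' code: the helper names are assumptions, K is taken with square pixels and no skew, and the real system additionally applies the undistortion of Eq. 4 first.

```python
import numpy as np

def camera_matrix(f, cx, cy):
    """Intrinsics K with focal length f and optical center (cx, cy)."""
    return np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])

def plane_homography(K, R, T, z_plane=1.2):
    """Homography mapping world (X, Y) on the plane Z = z_plane to pixels.

    On that plane, [R|T](X, Y, z, 1)^T = [r1, r2, r3*z + T](X, Y, 1)^T.
    """
    RT = np.column_stack([R[:, 0], R[:, 1], R[:, 2] * z_plane + T])
    return K @ RT

def image_to_world(H, u, v):
    """Invert the plane homography to recover world (X, Y) from a pixel."""
    X = np.linalg.inv(H) @ np.array([u, v, 1.0])
    return X[0] / X[2], X[1] / X[2]
```

Projecting a known world point with K[R|T] and mapping it back recovers the original coordinates, which is the property that enables fusing the two cameras' measurements.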

Player tracking
The player tracking is based on a vector of Kalman filters (Kalman, 1960; Welch & Bishop, 2002), one per player, with state x_k (Eq. 7) and measurement z_k (Eq. 8) at time instant k.
x_k = [x  y  v_x  v_y]^T   (7)

, where
• x and y are the player's center-of-mass position in real-world coordinates,
• v_x and v_y are the player's velocity in real-world coordinates.
The filter is modelled according to the following linear stochastic difference equations (Eqs. 9, 10):

x_k = A x_{k−1} + w_{k−1}   (9)
z_k = H x_k + v_k   (10)

where A is the state model matrix and assumes the form of Eq. 11 (∆t being the time between frames, in this case 1/30 s), H is the observation model matrix (Eq. 12), and the random variables w_k and v_k represent the process and measurement noise, respectively. The usage of real-world coordinates allows for transparent tracking between the two video streams.
Whenever the user indicates a player (using the mouse), a new Kalman filter is added to the vector with the real-world position of the player and a default velocity of 0 m/s. Afterwards, the player's location in the subsequent frames is predicted using Eq. 9. The area around the predicted position is searched according to the process explained in 3.2.3 to generate a measurement (z_k) to update the estimate. In addition, since the players' velocity is not constant throughout the game, each player's velocity is updated at every frame. By predicting the positions of the players in the subsequent frames it is possible to reduce the computational cost, because only a few regions of the entire image are searched for players.
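The per-player filter can be sketched as follows. Since Eqs. 11 and 12 are referenced but not reproduced in this extract, the constant-velocity A and position-only H below are assumed forms consistent with the state [x, y, v_x, v_y]^T; the noise magnitudes q and r and the class name are illustrative, not the chapter's values.

```python
import numpy as np

DT = 1 / 30  # time between frames (30 fps)

# Assumed constant-velocity form of Eq. 11 and observation model of Eq. 12.
A = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1, 0],
              [0, 0, 0, 1.0]])
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0.0]])

class PlayerTracker:
    """One Kalman filter per player, state x_k = [x, y, v_x, v_y]^T."""

    def __init__(self, x, y, q=1e-2, r=1e-1):
        self.x = np.array([x, y, 0.0, 0.0])  # default velocity of 0 m/s
        self.P = np.eye(4)
        self.Q = q * np.eye(4)
        self.R = r * np.eye(2)

    def predict(self):
        """Predictive stage (Eq. 9): where to search for the blob next frame."""
        self.x = A @ self.x
        self.P = A @ self.P @ A.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Correct with the measured blob position; keep the prediction if none."""
        if z is None:
            return self.x[:2]
        z = np.asarray(z, dtype=float)
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - H @ self.x)
        self.P = (np.eye(4) - K @ H) @ self.P
        return self.x[:2]
```

Returning the prediction when no blob is found mirrors the fallback behaviour described later for missed detections.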

Generation of human meaningful video
This section is primarily concerned with the generation of a single complete-image video stream, which is of the utmost importance for the possible human end-users of our system, for example sports scientists, educators or coaches. Generating a single, high-quality, "undistorted" video stream is a complex task, due to the usage of complex optical systems (wide-angle zoom lenses) and the accuracy requirements of the system, which involve dealing with two sets of intrinsic and extrinsic camera parameters for the so-called pinhole models of the cameras. High-accuracy mapping of image pixels onto the real world is needed, with the added difficulty of covering large real-world areas. The task is made harder still by the interest in high resolution and high frame rate, which produce large amounts of data (2 cameras at 1024 × 768 resolution, RGB colour depth, at 30 fps). By using advanced camera calibration techniques, the two different sets of parameters were found, thus allowing accurate image-to-world mapping (on separate images), as seen in Fig. 3. In order to produce the "undistorted" Human Meaningful Video stream, the first task is to optimize the algorithm used. A pair of "static", off-line created Look-Up Tables (LUTs) is created to map real-world pixels back to their origin in the original images. Mapping non-overlapped image areas is not complex if all data is available. Additional processing is necessary for the overlapped parts of the image, in order to obtain a human meaningful image (with neither repeated nor cut objects). LUTs are very useful because they trade repeated complex mathematical operations for memory lookups, which is very attractive in terms of overall performance, and of parallel computing in particular. One of the issues to be solved is how to show an interesting view of the overlapped portions of the images shown in Fig. 1.
This is an issue because, near the centre line of the handball field, where the images overlap, the same player is seen tilted in opposite directions by the two cameras (as shown in Fig. 4). The strategy is to always take complete objects from one of the cameras; this is always possible because the largest object is smaller in image size than the overlap between the images (see Fig. 4). It is not computationally immediate, though, because objects and background have to be identified and reconstructed.
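The LUT idea can be sketched as below: an off-line step precomputes, for every output pixel, which camera and source pixel feed it, and the on-line step is then pure table lookups. This is a hypothetical illustration: the `mapping` callback stands in for the calibrated world-to-image mapping, and the naive fixed camera split used in practice here ignores the per-object overlap resolution that the chapter describes.

```python
import numpy as np

def build_luts(h, w, mapping):
    """Off-line: record, per output pixel, the feeding camera and source pixel.

    mapping(y, x) -> (cam_index, src_y, src_x) is a stand-in for the
    calibrated world-to-image mapping of the chapter.
    """
    cam = np.empty((h, w), np.uint8)
    src_y = np.empty((h, w), np.int32)
    src_x = np.empty((h, w), np.int32)
    for y in range(h):
        for x in range(w):
            cam[y, x], src_y[y, x], src_x[y, x] = mapping(y, x)
    return cam, src_y, src_x

def remap(frames, cam, src_y, src_x):
    """On-line: compose the undistorted output frame with pure LUT lookups."""
    out = np.empty(cam.shape + (3,), np.uint8)
    for c, frame in enumerate(frames):
        sel = cam == c                      # pixels fed by camera c
        out[sel] = frame[src_y[sel], src_x[sel]]
    return out
```

Because each output pixel is independent, the `remap` step is trivially data-parallel, which is what makes the OpenMP and CUDA ports discussed later effective.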

Player detection and tracking
In order to validate the approach, the system was mounted at the public sports hall of Portimão to film the games of the Portuguese SuperCup. The collected video footage validated the engineering solution, since we were able to cover the entire field with good resolution and a good overlap zone, as shown in Fig. 1. Initially, two distinct teams (team A and team B; examples of both can be seen in Fig. 6(b)) were calibrated as explained in Section 3.2.1. The original seeds were selected by clicking on the players of both teams, which resulted in the initial colour subspaces illustrated in Fig. 6(a). After 1200 frames, with a time between expansions (t_expans) of 30 frames (which corresponds to 34 expansions), it is possible to verify that the teams' colour subspaces have updated and grown around the initial seeds, resulting in the new colour subspaces of Fig. 7. Table 3 provides an overview of the colour subspaces' dimensions during the auto calibration process. @F represents the frame number (recall that the auto expansion occurs at every 30th frame). The table clearly illustrates that, from the initial colour subspaces (Fig. 6(a)) to the final ones (Fig. 7), the number of colour triplets that are seeds (C_S) increases, so that the colour subspaces can adapt to the colour conditions of that team throughout the entire field and over time.
In addition, when the colour subspace is more condensed, triplets that resemble the colour (C_L) and triplets that fully match it (C_F) tend to exist in higher numbers (Team B) than when the colour seeds are more spread out in the colour space (Team A).

Cutting Edge Research in New Technologies, www.intechopen.com

        @F1  @F5  @F31  @F35  @F61  @F301  @F421  @F691  @F811  @F1021
A  C_L  34   28   12    10    10    10     3      1      4      0
(the remaining rows of the table, for C_F and for Team B, were lost in extraction)

Table 3. Evolution of the number of colour triplets that belong to each team and the respective belonging degree. The expansion is performed at every 30th frame (t_expans = 30).

It is also possible to verify the effect of colour persistence: the number of colour triplets that are C_L decreases for Team A from frame 1 to frame 5 (the persistence of these colours is 3.75, since p_{S_c} = (1/8)·t_expans); they also tend to have a more erratic behaviour, because they belong to the periphery of the colour subspace and therefore do not appear as often. Likewise, triplets with C_F decrease for Team A from frame 61 to frame 301 and for Team B from frame 811 to frame 1021.
However, it is possible to verify that the initial seed choice influences how well the colour subspace adapts to the environment conditions. In fact, the initial seeds for team A resulted in a faster adaptation: the colour subspace growth is higher at the initial expansions and tends to stabilize, while for team B there is still reasonable growth even at frame 1021. The initial seed choice is a well-known problem of region growing methods. Comparing the results with and without the Fuzzy inspired model of colour expansion, it is possible to verify that the players' detection achieves better results with the mutable colour subspaces, as depicted in Fig. 8. These images show the players' detection at frame 762, where the green crosses correspond to detected players from team A and the purple crosses to detected players from team B. The green and red highlighted pixels correspond to pixels labelled as belonging to one of the teams (green for team A and red for team B). Analysing the two images, it is possible to verify that, using the Fuzzy inspired auto expansion model, all field players from both teams were detected (Fig. 8(b)), while using the initial colour subspaces (Fig. 6(a)) the system is unable to detect four players of team B (Fig. 8(a)). In addition, the detected area of the players is larger with the Fuzzy model, which yields a better measure of the players' center of mass. Moreover, the detection rate increases along with the colour subspace updates, as illustrated in Table 4. This increase is more visible for players from team B, since its initial colour subspace did not reflect the colour properties of the team as well.
Table 4. Player detection of three players from teams A and B from frame 1 until frame 800.
The usage of Kalman filters to perform the tracking not only shortens the processing time but also minimizes missed detections, because the predictive stage (Eq. 9) determines the next position of the player (taking into account the model of the player's movement). This position is used in the next frame to search for the blob; if no measurement is found, the predicted value is used as the player's position. Table 5 shows how the tracking of players from team B is improved using the Kalman filter.
Additionally, by performing the tracking in real-world coordinates it is possible to track the players between cameras in a transparent way. Figure 9 illustrates players from team B crossing the middle line and being identified initially by the left camera and afterwards by the right camera, without the need for user intervention.

Generation of human meaningful video
In order to ascertain the importance of parallel processing in this application, the same algorithm was ported to the OpenMP and CUDA frameworks, yielding the results shown in Table 6. Naturally, parallel processing is heavily dependent on hardware, and results are shown for two laptop solutions, only one of which is able to run the proprietary CUDA toolkit for GPU processing. By studying Table 6 and Fig. 10, it can be seen that a single better CPU improves performance only marginally: a much better CPU with a clock at least 1.49 times faster yields about a 17% performance gain. This is probably due to the large amount of data being manipulated in this application; the limitation lies in memory access, not in raw single-CPU power. The same results also demonstrate the advantage of parallel computing with the joint usage of OpenMP (OMP) and CUDA, with a performance gain of a little over 2.6 times (on the same computer). Even larger gains were initially expected, but the complexity of the algorithm that selects the objects to show in the overlapped portion of the initial images limited the overall performance considerably.

Fig. 10. Execution times for several parallel computing techniques (see Table 6).

As recent processors offer high-frequency dual/quad (or more) cores, the interest of using 16 much slower processors is also to be considered; still, using the GPU processors of the video card frees up the main CPU for other tasks. The previous study concerns producing a single frame of the human meaningful video stream with parallel computing. Future work includes using distributed computing to produce the stream; for example, using 2 computers to produce alternating frames is expected to yield excellent performance gains (almost halving the processing time). This expectation is justified because few dependencies exist among consecutive frames and, as such, parallel processing is most effective, while the "slow" transmission over the network would not introduce significant performance loss compared to the expected benefit of doubling the computational power available for processing. A similar approach (several frames at once) would also be possible on a single computer, but would likely not be as interesting, since most available resources are already fully used most of the time. As mentioned earlier, actual performance gains are very much hardware dependent. In order to really have a meaningful video for humans, the found objects (handball players, etc.) have to be marked onto the image, a task that was not parallelized.

Conclusions
This chapter presented an automatic visual system for detecting and tracking players in indoor games. The main objective was to implement a system that is adaptive and takes into consideration the light changes (and subsequent colour changes) that frequently occur in sports pavilions (for example due to windows, clouds, sun orientation, ...). The ultimate goal is to further improve the information available to sports agents and the sporting quality. Players are identified by colour segmentation. The techniques used include foreground detection and classification of team colours using a Fuzzy inspired categorization model that allows a single colour to belong to both teams simultaneously. The colour subspaces have no particular shape and grow or shrink over time in order to take into account spatial and temporal changes of the recognized colours. Player tracking is improved by using one Kalman Filter per player, and the resulting information is organized and shown in a single undistorted image view of the entire field, adequate for human studies. Although the system is multi-camera, the usage of parallel processing allows the efficient generation of this final image. The proposed methodology was validated with real game footage filmed at an important Portuguese handball championship. Results show that, using simple features such as colour combined with a powerful tracking tool (the Kalman Filter), it is possible to detect and track players throughout the game area with very limited user intervention: initial colour calibration by the user and little more. The usage of adaptive colour subspaces generated by a Fuzzy inspired methodology makes it possible to better define the teams' colour properties during the game and to increase detection rates.

Future work
Future work includes exploring methodologies to automatically adjust the time between expansions during the game, according to the colour subspace dynamics and the detection rate; incorporating more information into the Kalman Filter model in order to make it more robust to long merging and occlusion situations; and extending both the detection and the tracking to benefit from parallel processing. Additionally, interesting game sequences are intended to be detected automatically (taking the players' positions into consideration) and the generated video tagged accordingly. This video is to be made "active" (searchable, "jumpable", ...) and integrated into a human interface navigation tool, to allow easy usage of the extracted information by the final end users.

Acknowledgments
We would like to thank Fundação Calouste Gulbenkian for the support given through a PhD scholarship (ref. 104410).