Our system consists of three main components: preprocessing, feature extraction with a classifier, and a result display module, as shown in Fig. 2. In the first step, we collect the web pages for preprocessing. In the second step, preprocessing segments each web page into five areas (top, bottom, left, right and center), as described in Section 3.1, and passes the block information, such as the coordinates (width, height) and the essential data, to feature extraction. In the third step, the feature extraction process derives three feature sets: spatial, location and presentation.
1) The spatial feature set uses the block coordinates to calculate the area ratio of the blocks in each region (top, bottom, left, right and center), together with the size ratio of each block (width divided by height).
2) The location feature set counts the occurrences of e-commerce blocks, with all their attributes, for statistical collection. To avoid ambiguity between news and advertisements, we use an internal-or-external source condition for classification: news is usually generated from an internal source, whereas advertisements usually come from an external source.
3) The presentation feature set checks the characteristics of the information inside a block, such as the format of the navigation search and customer service menu, to prepare the data for the classification step.
In the fourth step, the classifier uses the three feature sets above to build a learning model for classifying the data set. In the final step, the result display module reports the accuracy of the proposed system and prepares it for future integration.
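To make the spatial and location features more concrete, the following is a minimal sketch in Python. It is only an illustration under stated assumptions, not our implementation: the `Block` record, the function names, and the choice to normalize block area by the whole page area are all hypothetical, and the internal/external test simply compares host names.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# The five page areas produced by segmentation.
REGIONS = ("top", "bottom", "left", "right", "center")


@dataclass
class Block:
    """Hypothetical record for one segmented block."""
    region: str      # one of REGIONS
    width: float     # block width in pixels
    height: float    # block height in pixels
    src: str = ""    # URL the block content is loaded from (may be relative)


def area_ratios(blocks, page_width, page_height):
    """Area covered by the blocks of each region, divided by the page area.

    Assumption: the ratio is taken relative to the whole page.
    """
    page_area = page_width * page_height
    ratios = {}
    for region in REGIONS:
        area = sum(b.width * b.height for b in blocks if b.region == region)
        ratios[region] = area / page_area if page_area else 0.0
    return ratios


def size_ratio(block):
    """Width divided by height; 0.0 for a degenerate block of zero height."""
    return block.width / block.height if block.height else 0.0


def is_external(block_src, page_url):
    """True when the block content is served from a different host than the page.

    Relative URLs carry no host and are therefore treated as internal.
    """
    src_host = urlparse(block_src).netloc.lower()
    page_host = urlparse(page_url).netloc.lower()
    return bool(src_host) and src_host != page_host
```

Comparing full host names is a simplification: content served from a different subdomain of the same site would be counted as external, so a real system would probably compare registered domains instead.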
Experimental setup
We assume that the web sites listed in the Internet Retailer’s Top 500 Guide are professionally designed. Therefore, we collect the first page (main page) of sites from both the Internet Retailer’s Top 500 Guide and other randomly selected e-commerce web sites. Two hundred pages were collected into our corpus: one hundred from the Internet Retailer’s Top 500 Guide and another one hundred from randomly selected e-commerce web sites. We segment the web pages according to Section 3.1 and filter out web sites that cannot be segmented, such as Flash, animation or fully graphical pages.
We then pass the three main feature sets (spatial, location and presentation) to the classifier. To create our model, we label the training set according to the attributes in the three feature sets. We adopt three classification algorithms [8] to create our training model: decision tree (J48), Naïve Bayes, and Support Vector Machines trained with Sequential Minimal Optimization (SMO). We use the Weka machine learning tool to perform our experiments with 10-fold cross-validation.
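The 10-fold cross-validation protocol can be sketched as follows. This is a plain Python illustration of the evaluation idea, not the Weka procedure used in our experiments: the function names are hypothetical, the folds are not stratified by class, and the majority-class learner is only a placeholder where J48, Naïve Bayes or SMO would be plugged in.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    if n_samples < k:
        raise ValueError("need at least k samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, y, train_fn, predict_fn, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out for testing once."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict_fn(model, X[j]) == y[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Placeholder learner (an assumption for illustration only):
# always predicts the most frequent training label.
def train_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]


def predict_majority(model, x):
    return model
```

With a corpus of two hundred pages, each of the ten folds holds twenty pages, so every model is trained on 180 pages and tested on the remaining 20.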