Extracting Web Navigation Patterns Using Association Rule Mining

Abstract: Due to the rapid increase in the size of the web and in the number of users, it is essential for website owners to better understand their customers so that they can provide better service and also enhance the quality of the website. The use of data mining methods and knowledge discovery on the web is therefore now a major concern for most researchers. To achieve this, web access log files are required. The web access log files can be mined to extract interesting patterns so that user behavior can be understood. Web usage mining (WUM) is a kind of data mining method that can be useful in recommending web usage patterns with the help of users' sessions and behavior. The goal of this research is to use Association Rule Mining (ARM) to extract web navigation patterns from web session logs, which can then be used to recommend a list of web pages that the user has not previously visited. Our main application area is a web recommendation system that infers the intent of the user when he/she visits a website. The web log dataset used for the experiment is from DePaul University CTI and is filtered and sessionized. In this paper, the proposed approach uses ARM to find patterns in transactional data using the FP-Growth algorithm. We examined different support/confidence thresholds and analyzed the resulting rules; as a result, we found some interesting relationships among web pages. The results were evaluated by measuring the accuracy of the generated patterns.

Keywords: Web usage mining; pattern extraction; user navigation; association rule mining; FP-Growth

I. Introduction

Web mining [1] is the use of data mining techniques to automatically discover and extract information from Web documents and services (Etzioni, 1996). Web mining is categorized into three types: 1. Content mining (examines the content of web pages as well as the results of web searching). 2. Structure mining (exploits the hyperlink structure). 3. Usage mining (analyzes user web navigation).

The Web is huge, diverse, and dynamic, which raises scalability, multimedia data, and temporal issues, respectively. To mine interesting information from this huge pool, data mining techniques can be applied. Data mining techniques cannot be applied directly, however, because web data is unstructured or semi-structured. Web mining is used to discover interesting patterns that can be applied to many real-world problems, such as improving websites, better understanding visitors' behavior, and product recommendation. Typical applications are those based on user modeling techniques, such as web personalization, adaptive websites and user modeling.

Web usage mining (WUM) is the part of web mining that deals with the extraction of knowledge from server log files; the source data mainly consist of the (textual) logs that are collected when users access Web servers and might be represented in standard formats (e.g., Common Log Format [2], Extended Log Format [3], Log [4]). WUM comprises three steps, namely preprocessing, pattern discovery and pattern analysis. Data preprocessing techniques include data cleaning, sessionization, user identification, transaction identification, data integration, data transformation and data reduction. After processing the web log files, the next step is to discover web navigation patterns by applying data mining techniques. These techniques include statistical analysis, association rule mining, clustering, classification and sequential pattern mining.

Several issues addressed in this paper are distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers.

The need for web usage mining arises for the following reasons: 1. To personalize the experience of a user by keeping track of the pages the user previously accessed. 2. To identify the needed links so as to improve the overall performance of future accesses. 3. To improve the actual design of web pages and to make other alterations to a Web site. 4. To use usage patterns for business intelligence, in order to improve sales and advertisement by providing product recommendations.

The focus of this paper is to provide an overview of how to use frequent pattern mining techniques for discovering different types of patterns in a Web log database. I have used the FP-Growth algorithm for extracting web navigation patterns. I have used the RapidMiner tool for extracting interesting rules and MATLAB for measuring the accuracy of the generated rules. The advantages of using FP-Growth are: 1. Less execution time. 2. Lower memory requirements, due to its compact structure and the absence of candidate generation. 3. A focused search of smaller databases.

This paper presents a detailed discussion of the proposed approach and is organized as follows. Section II presents association rule mining with its limitations and solutions. Section III proposes a methodology and algorithm to predict web pages for the user. Section IV explains the dataset used and the experiments. Section V evaluates the performance of the proposed method by measuring the accuracy of the generated results. Section VI presents the conclusion.

II. Association Rule Mining

The formal statement of the association rule mining problem was first given by Agrawal et al. [Agrawal et al. 1993]. Association rules are a data mining technique that searches for relationships between attributes in large data sets. In the context of WUM, once sessions have been identified, association rules can be used to relate pages that are most often referenced together in a single server session. Such rules indicate the possible relationship between pages that are often viewed together even if they are not directly connected, and can reveal associations between groups of users with specific interests. Since such transaction databases usually contain extremely large amounts of data, current association rule discovery techniques try to prune the search space according to the support for the items under consideration. Support is a measure based on the number of occurrences of user transactions within transaction logs. Association rules can be formally represented as:

$$X \Rightarrow Y \quad [\text{Support}, \text{Confidence}] \tag{1}$$

This means that the presence of item (page) X leads to the presence of item (page) Y, with [Support]% occurrence of [X, Y] in the whole database and [Confidence]% occurrence of [Y] in the set of records where [X] occurred.

For example, if one discovers that 80% of the users accessing /computer/products/printer.html and /computer/products/scanner.html also accessed, but only 30% of those who accessed /computer/products also accessed /computer/products/scanner.html, then it is likely that some information in printer.html leads users to access scanner.html.

Let T denote the set of all transactions t in the database D, so that t ∈ T. If an attribute X is present in a transaction t, X ⊆ t, then an attribute Y is likely to be present in t as well, Y ⊆ t. The probability of this occurrence is called the confidence of the association rule, denoted by c, and is measured as the percentage of transactions containing Y along with X relative to the overall number of transactions containing X:

$$\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \tag{2}$$

Another important parameter describing the derived association rule is its support, denoted by s. It is calculated as the percentage of transactions containing both X and Y relative to the overall number of transactions:

$$\text{Support}(X \cup Y) = \frac{\text{support count of } X \cup Y}{\text{total number of transactions in } D} \tag{3}$$
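As a minimal sketch of equations (2) and (3), the two measures can be computed directly from a list of sessions. The function names and the toy sessions below are assumptions made for illustration; they are not taken from the paper or from the dataset. Confidence is computed from raw counts, which is equivalent to dividing the two supports because the total number of transactions cancels out.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Equation (3): support count divided by the total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Equation (2): Support(X u Y) / Support(X), computed from raw counts."""
    x = frozenset(x)
    count_x = support_count(x, transactions)
    if count_x == 0:
        return 0.0  # the rule never fires; treated as zero confidence in this sketch
    return support_count(x | frozenset(y), transactions) / count_x


# Hypothetical sessions; the page labels are placeholders, not dataset pages.
sessions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B", "C"}),
]

print(support({"A", "B"}, sessions))       # 3 of the 5 sessions contain A and B
print(confidence({"A"}, {"B"}, sessions))  # 3 of the 4 sessions with A also contain B
```

In this toy example the rule A ⇒ B has s = 3/5 and c = 3/4.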

These two metrics determine the significance of an association rule. Additional constraints on interesting rules can also be specified by users, such as Laplace, Gain, Conviction, Lift and p-s. Since association rules tend to find relationships in large datasets, it would be very time- and resource-consuming to search for rules among all the data. Because of this, each algorithm for discovering association rules begins with the identification of so-called frequent itemsets. The most popular algorithms use two approaches for determining these itemsets. The first approach is BFS (breadth-first search), which is based on knowing all support values of the (k-1)th itemsets before calculating the support of the kth itemsets. DFS (depth-first search) algorithms determine frequent itemsets based on a tree structure. The best known algorithms for mining association rules are Apriori, AprioriTID, STEM, DIC, Partition-Algorithm, Eclat, FP-growth, etc.

In web usage mining, association rules are used to discover pages that are visited together quite often. Knowledge of these associations can be used either in marketing and business or as guidelines for web designers for (re)structuring Web sites. Transactions for mining association rules differ from those in market basket analysis (MBA), as they cannot be represented as easily as in MBA (items bought together). Association rules are mined from user sessions containing the remote host, the user ID, and a set of URLs. As a result of mining for association rules we can get, for example, the rule X, Y ⇒ Z (c = 85%, s = 1%). This means that visitors who viewed pages X and Y also viewed page Z in 85% (confidence) of cases, and that this combination makes up 1% of all transactions in the preprocessed logs. In (Cooley et al., 1999) a distinction is made between association rules based on the type of pages appearing in them. They identify auxiliary-content transactions and content-only transactions. The second type is far more meaningful, as association rules are found only among pages that contain data important to visitors.

Another interesting application of association rules is the discovery of so-called negative associations. In mining negative association rules (X ⇒ ¬Y), items that have less than minimum support are not discarded. Algorithms for finding negative association rules can also find indirect associations.

Recommendation models with association rules

In the context of this paper [5], a recommendation model M outputs a set of items as recommendations R, given a set of observable items O. In our case, the model M is a set of association rules with support and confidence. To generate the recommendations, we build the set R as follows:

$$R = \{\, \text{consequent}(r_i) \mid r_i \in M \text{ and } \text{antecedent}(r_i) \subseteq O \text{ and } \text{consequent}(r_i) \notin O \,\} \tag{4}$$

If we want the N best recommendations (top N), we select from R the recommendations corresponding to the rules with the highest confidence.
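The construction of R in equation (4) and the top-N selection can be sketched as follows. This is only an illustration under stated assumptions: each rule has a single-page consequent, the `Rule` tuple and the `recommend` function are hypothetical names, and the support and confidence values in the example are invented placeholders rather than mined results.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: str  # assumption: one page per consequent
    support: float
    confidence: float


def recommend(rules: Iterable[Rule], observed: Iterable[str], n: int) -> List[str]:
    """Top-n pages from R = {consequent(r) | antecedent(r) within O, consequent(r) not in O}."""
    seen = set(observed)
    best = {}  # page -> highest confidence among the rules that fire for it
    for rule in rules:
        if rule.antecedent <= seen and rule.consequent not in seen:
            if rule.confidence > best.get(rule.consequent, -1.0):
                best[rule.consequent] = rule.confidence
    # Sort by descending confidence; ties are broken by page name for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:n]]


# Placeholder model M; the numbers are illustrative only.
model = [
    Rule(frozenset({"a"}), "b", 0.20, 0.90),
    Rule(frozenset({"a", "c"}), "d", 0.10, 0.95),
    Rule(frozenset({"a"}), "c", 0.30, 0.85),
    Rule(frozenset({"e"}), "f", 0.15, 0.99),
]

print(recommend(model, {"a", "c"}, 2))
```

With O = {a, c}, the rule a ⇒ c is skipped because c has already been visited, and e ⇒ f does not fire because e is not in O, so only d and b are recommended.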

Limitations of ARM

One of the major drawbacks of association rule mining [6] is that too many rules are generated, with no guarantee that all generated rules are relevant. The minimum support and minimum confidence parameters are set so as to eliminate false discoveries. When the minimum support is too small, every rule gets a chance to be true, leading to wrong recommendations; when the minimum support is too large for a small data set, incorrect predictions may occur.

Solution to the limitations of ARM

Clustering is the process of grouping objects with similar behavior into clusters. Clustering reduces the size of the input data set for association rule mining; consequently, the number of rules is reduced and the extracted rules are highly relevant and meaningful.

III. Proposed Methodology

The proposed approach for web usage mining is shown in the figure below.

Figure 1. Proposed Approach for Pattern Discovery

The steps are:

1) Raw log data, which are in unstructured form, are collected from the web server.

2) Data preprocessing techniques are then applied to the log data to filter out unnecessary data. The steps involved are:

Figure 2. Block diagram for pre-processing

3) After pre-processing, the log data in transaction format are given to the FP-Growth algorithm to discover interesting rules based on support and confidence.

4) The discovered patterns or rules are then passed as input to the prediction model, which predicts the page most likely to be visited next.

5) The result generated by the prediction model is then passed to the recommendation engine, which recommends a list of pages to the user based on the pages he/she has already visited.
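Step 3 can be illustrated on a small scale with the sketch below. Note that this is not FP-Growth: it is a plain level-wise (Apriori-style) search, swapped in because it is short enough to read in full, and it returns the same frequent itemsets as FP-Growth on small inputs while being far slower on large ones. The function names, the single-item consequents, and the toy transactions are all assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets with support >= min_support."""
    threshold = min_support * len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # level-1 candidates
    frequent = {}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: k for c, k in counts.items() if k >= threshold}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have exactly k+1 items.
        keys = list(level)
        size = len(keys[0]) + 1 if keys else 0
        current = list({a | b for a, b in combinations(keys, 2) if len(a | b) == size})
    return frequent


def association_rules(frequent, n_transactions, min_confidence):
    """Rules (premises, conclusion, support, confidence) with one-item conclusions."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for conclusion in itemset:
            premises = itemset - {conclusion}
            # Every subset of a frequent itemset is frequent, so the lookup is safe.
            conf = count / frequent[premises]
            if conf >= min_confidence:
                rules.append((premises, conclusion, count / n_transactions, conf))
    return rules


# Toy transactions with placeholder page IDs.
toy = [frozenset(t) for t in ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3})]
itemsets = frequent_itemsets(toy, min_support=0.5)
rules = association_rules(itemsets, len(toy), min_confidence=0.7)
for premises, conclusion, sup, conf in rules:
    print(sorted(premises), "=>", conclusion, sup, conf)
```

The resulting rules are what steps 4 and 5 would consume: the prediction model matches rule premises against the pages seen so far, and the recommendation engine ranks the matching conclusions.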

IV. Dataset and Experiments

The dataset [7] used for experimentation is DePaul University's CTI web log data. This data set contains preprocessed and filtered sessionized data for the main DePaul CTI Web server (http://www.cs.depaul.edu). The data is based on a random sample of users visiting this site over a 2-week period during April of 2002. The original (unfiltered) data contained a total of 20950 sessions from 5446 users. The filtered data files were produced by filtering out low-support page views and eliminating sessions of size 1. The filtered data contains 13745 sessions and 683 page views.

The dataset contains three kinds of data: preprocessed data, non-preprocessed data and cleaned data. I use the preprocessed data, to which all preprocessing steps have been applied. It contains the cti.tra file, which is in transactional format: each row corresponds to the sequence of pages visited during one session.

Table 1. cti.tra file

/news/default.asp, /people/search.asp?sort=pt
/news/default.asp, /resources/tutoring.asp
/people/search.asp, /people/facultyinfo.asp?id=332

I have converted the cti.tra file into an itemset file, in which each page is represented by its integer page ID, using Java (Eclipse) so that the FP-Growth algorithm can be applied to it.
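The conversion itself was done in Java and its code is not given; the following Python sketch only illustrates the idea. It assumes that each session is a comma-separated list of URLs, as in Table 1, and it numbers pages in order of first appearance, so the IDs it produces are arbitrary and will not match the page IDs used in the rest of the paper. The function name `encode_sessions` is hypothetical.

```python
def encode_sessions(lines):
    """Map each URL to an integer ID and rewrite sessions as lists of IDs."""
    page_ids = {}   # URL -> integer ID, assigned in order of first appearance
    encoded = []
    for line in lines:
        # Assumption: URLs are separated by commas and contain no commas themselves.
        pages = [p.strip() for p in line.split(",") if p.strip()]
        row = []
        for page in pages:
            pid = page_ids.setdefault(page, len(page_ids) + 1)
            if pid not in row:  # an itemset holds each page at most once
                row.append(pid)
        encoded.append(row)
    return encoded, page_ids


# The three sessions shown in Table 1.
table1 = [
    "/news/default.asp, /people/search.asp?sort=pt",
    "/news/default.asp, /resources/tutoring.asp",
    "/people/search.asp, /people/facultyinfo.asp?id=332",
]

rows, ids = encode_sessions(table1)
for row in rows:
    print(" ".join(str(pid) for pid in row))  # one space-separated itemset per line
```

Printing one space-separated row per session gives the same layout as the itemset file that follows.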

Table 2. Frequent itemset file

Items/web pages
5 6 7 8 9 10 2 11 12 13
5 10 21 26 27
5 31 32 34 35 36 37 38 39 15
5 13 45 46 6 47 2 31 8 48 49

The frequent itemset file is given as input to the SPMF tool [8], created by Philippe Fournier-Viger. It is an open-source data mining library that is used here for association rule mining. With support = 0.1 and confidence = 0.8, some of the interesting rules generated are shown in the table below.

Table 3. Interesting rules (sup=0.1, conf=0.8)

Rule    Support count
/programs/ ⇒ /news/default.asp
/courses/ ⇒ /news/default.asp
/authenticate/login.asp? ⇒
/news/default.asp, /courses/syllabilist.asp ⇒ /courses/

So, the pages appearing in the interesting generated rules are /news/default.asp, /courses/, /authenticate/login.asp?section=mycti&title=mycti&urlahead=studentprofile/studentprofile, and /cti/studentprofile/studentprofile.asp?section=mycti, i.e. pages 5, 9, 31, and 32. The resulting pages, indexed according to their page IDs (PIDs), are shown in the table below.

Table 4. Page indexing according to PIDs

PID    Page
5      /news/default.asp
9      /courses/
31     /authenticate/login.asp?section=mycti&title=mycti&urlahead=studentprofile/studentprofile
32     /cti/studentprofile/studentprofile.asp?section=mycti

Now, the frequent itemset file mentioned above is converted into CSV format, which RapidMiner takes as input for generating rules.

The steps are:

1) First, the generated CSV file is converted into binomial (true/false) format, because the FP-Growth operator in RapidMiner takes binomial values as input.

2) Next, the FP-Growth operator is used to generate frequent itemsets, and the Create Association Rules operator is used to generate association rules.
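The binomial layout in step 1 can be sketched as below: one column per page ID and one row per session, with true where the page occurs in the session. In the paper this conversion is done inside RapidMiner; the Python helper names and the "p" column prefix here are assumptions for illustration, and the two example rows are the first two rows of Table 2.

```python
import csv
import io


def to_binomial(transactions):
    """Return (sorted page IDs, one list of booleans per transaction)."""
    columns = sorted({item for t in transactions for item in t})
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows


def binomial_csv(transactions):
    """Render the true/false table as CSV text with a header row."""
    columns, rows = to_binomial(transactions)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f"p{c}" for c in columns])  # hypothetical column naming
    for row in rows:
        writer.writerow(["true" if value else "false" for value in row])
    return buffer.getvalue()


# First two rows of Table 2.
itemsets = [
    {5, 6, 7, 8, 9, 10, 2, 11, 12, 13},
    {5, 10, 21, 26, 27},
]

columns, rows = to_binomial(itemsets)
print(binomial_csv(itemsets))
```

With all 683 page views as columns the real table is wide and sparse, which is one reason the low-support page views were filtered out beforehand.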

The rules generated by RapidMiner are listed in the table below.

Table 5. Interesting rules (sup=0.1, conf=0.8)

Rules [premises ⇒ conclusion]
[p31, p32 ⇒ p5]
[p31 ⇒ p5]
[p32 ⇒ p5]
[p16 ⇒ p5]
[p5, p16 ⇒ p9]
[p5, p31 ⇒ p32]
[p31 ⇒ p32]
[p9, p16 ⇒ p5]
[p10 ⇒ p5]
[p2 ⇒ p5]
[p9 ⇒ p5]
[p5, p32 ⇒ p31]
[p32 ⇒ p31]

V. Performance Evaluation

For evaluation purposes, I have split the data into training (75%) and testing (25%) sets. When applying ARM in web usage mining, we have to take the active session window into account. We use a fixed-size sliding window [9] over the current active session to capture the current user's history depth. For example, if the current session (with a window size of 3) is <A, B, C>, and the user accesses the page view D, then the new active session becomes <B, C, D>. Therefore, a sliding window of size n over the active session allows only the last n visited pages to influence the recommendation value of items in the recommendation set. We call this sliding window the user's active session window.
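A fixed-size sliding window of this kind can be sketched with a bounded double-ended queue. The class name `ActiveSession` and its `visit` method are hypothetical; the example reproduces the <A, B, C> to <B, C, D> transition described above.

```python
from collections import deque


class ActiveSession:
    """Keeps only the last `size` pages visited in the current session."""

    def __init__(self, size):
        # A deque with maxlen drops the oldest page automatically when full.
        self._window = deque(maxlen=size)

    def visit(self, page):
        """Record a page view and return the current window, oldest page first."""
        self._window.append(page)
        return tuple(self._window)


session = ActiveSession(size=3)
for page in ("A", "B", "C"):
    session.visit(page)
print(session.visit("D"))  # ('B', 'C', 'D')
```

The pages in the returned window play the role of the observable set O when rules are matched for recommendation.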

I have considered window sizes of W = 1 and W = 2 for my experiment, in line with the generated results.

Table 6. Rules generated when W=1 and W=2

W = 1            W = 2
p31 ⇒ p5         p31, p32 ⇒ p5
p32 ⇒ p5         p5, p16 ⇒ p9
p16 ⇒ p5         p5, p31 ⇒ p32
p31 ⇒ p32        p9, p16 ⇒ p5
p10 ⇒ p5         p5, p32 ⇒ p31
p2 ⇒ p5
p9 ⇒ p5
p32 ⇒ p31

Now, to measure the accuracy of the generated rules, I have written a small piece of MATLAB code that determines how accurate the rules are when they are applied to the test data. The accuracy is computed according to the following formula:

$$\text{Accuracy} = \frac{P(\text{Premises} \cup \text{Conclusion})}{P(\text{Premises})} \tag{5}$$

Here P(Premises ∪ Conclusion) is the fraction of test sessions that contain both the premises and the conclusion of a rule, and P(Premises) is the fraction that contain the premises. That is, first count the occurrences of the premises part in the test data, then count the occurrences of both the premises and the conclusion parts, and finally divide the latter count by the former.
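The MATLAB code is not reproduced in the paper, so the following Python sketch is only an assumed equivalent of equation (5), not the author's implementation. The function names, the ordered (non-random) split, and the toy test sessions are illustrative assumptions. Rounding the training size up reproduces the split sizes reported for 13745 sessions (10309/3436 and 12371/1374), although how the original split was drawn is not stated.

```python
def split_sessions(sessions, train_percent):
    """Split into (train, test); the training size is rounded up."""
    cut = (len(sessions) * train_percent + 99) // 100  # integer ceiling
    return sessions[:cut], sessions[cut:]


def rule_accuracy(premises, conclusion, test_sessions):
    """Equation (5): sessions with premises and conclusion / sessions with premises."""
    premises = frozenset(premises)
    with_premises = [s for s in test_sessions if premises <= s]
    if not with_premises:
        return None  # the rule never applies in the test data
    hits = sum(1 for s in with_premises if conclusion in s)
    return hits / len(with_premises)


# Sizes only: placeholders standing in for the 13745 filtered sessions.
placeholder = list(range(13745))
train75, test25 = split_sessions(placeholder, 75)
train90, test10 = split_sessions(placeholder, 90)
print(len(train75), len(test25), len(train90), len(test10))

# Hypothetical test sessions using page labels in the style of Table 5.
toy_test = [
    frozenset({"p31", "p32", "p5"}),
    frozenset({"p31", "p5"}),
    frozenset({"p31", "p9"}),
    frozenset({"p9", "p5"}),
]
print(rule_accuracy({"p31"}, "p5", toy_test))  # 2 of the 3 sessions with p31 contain p5
```

Averaging `rule_accuracy` over all rules for which it is defined gives a single figure comparable to the average accuracy reported per split.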

Table 7. Accuracy on test cases

Train = 10309 (75%), Test = 3436 (25%)
Train = 12371 (90%), Test = 1374 (10%)

Avg. Acc. = 0.88
Avg. Acc. = 0.88
Avg. Acc. = 0.89
Avg. Acc. = 0.88

From the above results, we can say that the generated rules are accurate with a probability of approximately 88%, i.e. the pages they predict are the ones most likely to be visited.

VI. Conclusion

The web server log data of DePaul University CTI has been studied and analyzed. In this research work, an ARM-based approach is proposed and applied to the web log dataset. The results generated are fairly accurate. In future, we will try to implement the approach using Conditional Random Fields (CRF) and Hidden Markov Models (HMM), and the results will be compared against the existing approaches.

  1. Anand, Sarabjot Singh, and Bamshad Mobasher. “Intelligent techniques for web personalization.” Proceedings of the 2003 International Conference on Intelligent Techniques for Web Personalization. Springer-Verlag, 2003.
  2. Masseglia, Florent, et al. “Web usage mining: extracting unexpected periods from web logs.” Data Mining and Knowledge Discovery 16.1 (2008): 39-65.
  3. Kosala, Raymond, and Hendrik Blockeel. “Web mining research: A survey.” ACM SIGKDD Explorations Newsletter 2.1 (2000): 1-15.
  4. Al Murtadha, Y. M., et al. “Mining web navigation profiles for recommendation system.” Information Technology Journal 9.4 (2010): 790-796.
  5. Jorge, Alipio, Mario Amado Alves, and P. J. Azevedo. “Recommendation with association rules: A web mining application.” Proceedings of Data Mining and Warehousing, Conference of Information Society. 2002.
  6. Langhnoja, Shaily G., Mehul P. Barot, and Darshak B. Mehta. “Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery.” International Journal of Data Mining Techniques and Applications 2.01 (2013).
  7. DePaul University web log data, available at [URL: http://facweb.cs.depaul.edu/mobasher/classes/ect584/resource.html], accessed on [25 Sep, 2014]
  8. SPMF tool, available at [URL: http://www.philippe-fournier-viger.com/spmf/], accessed on [1 March, 2015]
  9. Nakagawa, Miki, and Bamshad Mobasher. “Impact of site characteristics on recommendation models based on association rules and sequential patterns.” Proceedings of the IJCAI. Vol. 3. 2003.