Bayesian Content Manager, Part II


Background

The amount of information available via the Internet grows rapidly every day, and the debate over website filtering grows with it; responsibility for controlling or monitoring this content ultimately falls on Internet users themselves. Several commercial products on the market sift through web information and decide for the consumer what is "good" or "bad," but these products present issues. For example, most of the filtering process is left to the software, with no user involvement. In addition, consumers are not connected to other consumers of the same product, which limits discussion of individual filtering purposes and prevents assistance between users with similar objectives.

Solution and Goals

These issues can be overcome using Bayesian filtering methods of the kind most commonly applied today to email spam detection. The overall project goal is to design, implement, and deploy a Bayesian filtering system that calculates the suitability of web content for an online community of like-minded users. In this environment, community leaders can create web filters that prevent access to content the community deems inappropriate (e.g., in corporate offices or church groups). Our current research focuses on producing an efficient technique for enabling a filter and achieving fast performance when calculating a web page's rating.
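To make the rating idea concrete, the sketch below shows the kind of word-level naive Bayes scoring such a system could use. It assumes per-word occurrence counts from community-labeled "blocked" and "allowed" training pages; all names (WordStat, rate_page, and the sample data) are our illustrative assumptions, not the project's actual code.

    /* Minimal sketch of a naive Bayes page rating, assuming word counts
     * from community-blocked and community-allowed training pages are
     * already available. Names here are illustrative only. */
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        const char *word;
        int bad_count;   /* occurrences in pages the community blocked */
        int good_count;  /* occurrences in pages the community allowed */
    } WordStat;

    /* Laplace-smoothed log-likelihood ratio for one word. */
    static double word_score(const WordStat *w, int bad_total, int good_total)
    {
        double p_bad  = (w->bad_count  + 1.0) / (bad_total  + 2.0);
        double p_good = (w->good_count + 1.0) / (good_total + 2.0);
        return log(p_bad / p_good);
    }

    /* Sum the log-likelihood ratios of the page's words; a positive
     * total means the page resembles the blocked training set. */
    static double rate_page(const char **words, int n_words,
                            const WordStat *stats, int n_stats,
                            int bad_total, int good_total)
    {
        double total = 0.0;
        for (int i = 0; i < n_words; i++)
            for (int j = 0; j < n_stats; j++)
                if (strcmp(words[i], stats[j].word) == 0)
                    total += word_score(&stats[j], bad_total, good_total);
        return total;
    }

    int main(void)
    {
        WordStat stats[] = { {"casino", 40, 2}, {"homework", 1, 30} };
        const char *page[] = { "casino", "homework", "casino" };
        double r = rate_page(page, 3, stats, 2, 100, 100);
        printf("page rating: %f (%s)\n", r, r > 0 ? "block" : "allow");
        return 0;
    }

Summing log-likelihood ratios rather than multiplying raw probabilities is the standard design choice for this kind of classifier: it avoids floating-point underflow on long pages and makes the per-word contributions easy to inspect.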

To begin creating a system that blocks selected web pages, we set up an Apache 2 server with its proxy module (mod_proxy) enabled. The design proxies the client web browser through the server and filters content as it passes through. The filtering itself is done in a custom Apache module that we developed: the module receives web page data from the proxy module, extracts text and other useful information from the page, and then calculates a rating for the page's appropriateness based on the extracted information. At this point, our prototype returns either the original page to the requesting user or an error page stating that the page has been blocked.
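The skeleton below illustrates how such a module can be structured as an Apache 2 output filter that sees proxied page data before it reaches the client. The module name, filter name, and the placeholder where the text extractor and rater would plug in are our assumptions for illustration, not the project's published source.

    /* Skeletal Apache 2 output filter, sketching how a custom module
     * might receive proxied page data. Module and filter names are
     * illustrative assumptions. */
    #include "httpd.h"
    #include "http_config.h"
    #include "apr_buckets.h"
    #include "util_filter.h"

    static apr_status_t content_rater_filter(ap_filter_t *f,
                                             apr_bucket_brigade *bb)
    {
        apr_bucket *b;

        /* Walk the brigade and read each data bucket; a full filter
         * would accumulate the text here, extract useful features,
         * and compute the Bayesian rating before deciding whether to
         * pass the page through or replace it with an error page. */
        for (b = APR_BRIGADE_FIRST(bb);
             b != APR_BRIGADE_SENTINEL(bb);
             b = APR_BUCKET_NEXT(b)) {
            const char *data;
            apr_size_t len;
            if (!APR_BUCKET_IS_METADATA(b)) {
                apr_bucket_read(b, &data, &len, APR_BLOCK_READ);
                /* ...feed (data, len) to the extractor/rater... */
            }
        }

        /* Pass the (possibly replaced) content down the filter chain. */
        return ap_pass_brigade(f->next, bb);
    }

    static void register_hooks(apr_pool_t *p)
    {
        ap_register_output_filter("CONTENT_RATER", content_rater_filter,
                                  NULL, AP_FTYPE_RESOURCE);
    }

    module AP_MODULE_DECLARE_DATA content_rater_module = {
        STANDARD20_MODULE_STUFF,
        NULL, NULL, NULL, NULL, NULL,
        register_hooks
    };

In a deployment, a filter registered this way would be attached to proxied responses, for example with Apache's SetOutputFilter directive, so every page fetched through the proxy flows through the rater.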

Full Paper
Publication on ACM Portal.