Page Association Analysis Using Web Logs

 


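0. Overview
The hypothesis: if we use the referer field of the web log to map the link structure between pages,
we can find requests that hit a URL directly instead of following page links, as well as temporary pages and vulnerable pages. The analysis below starts from that hypothesis.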

1. Web log
Target: web logs that record the referer

2. map reduce
Extract only the needed fields from the web log
– exclude static resources (.gif, .jpg, .png, .swf, .css, .js, .dwr, .htc, .flv, .xml)
– exclude external referers
– the output has the following format (the key, then the count)
referer|requesturi^status cnt

input weblog

xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /btn_byline.gif HTTP/1.1"	200	2931	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	410
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /pop_close.jpg HTTP/1.1"	200	1633	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	680
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /ico_arrow.gif HTTP/1.1"	200	57	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	450
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /include/common/calendar.jsp HTTP/1.1"	200	7584	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	4670
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /js/counter/counter.js HTTP/1.1"	200	6121	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	720
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /calendar/calendar.js HTTP/1.1"	200	2073	"http://domain.com/include/common/calendar.jsp"			HTTP/1.1	1350
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"POST /info.do HTTP/1.1"	200	299	"http://domain.com/info.do?cmd=main&mid=84"			HTTP/1.1	97320
xxx.xxx.xxx.xxx	-	-	[01/Feb/2013:00:00:51 +0900]	"GET /include/common/tags.js HTTP/1.1"	404	881	"http://domain.com/include/common/calendar.jsp"			HTTP/1.1	3360

output

/|/index.jsp^200	1150
/|/index.jsp^500	1
/|/info.do^200	4
./index.jsp|/combi.do^200	2
./index.jsp|/notice.do^200	1
./info.do|/notice.do^200	1
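
The map reduce job itself is not shown in this post, so the following is only a minimal Python sketch of the filtering and counting rules listed above. It assumes the tab-separated field order of the sample input (request in column 5, status in column 6, referer in column 8). The names `map_line` and `reduce_counts` and the `site_host` parameter are made up for illustration, and the sketch does not try to reproduce the `./index.jsp` style referers seen in the sample output.

```python
from collections import Counter
from urllib.parse import urlsplit

# Extensions treated as static resources (taken from the list above).
STATIC_EXT = (".gif", ".jpg", ".png", ".swf", ".css", ".js",
              ".dwr", ".htc", ".flv", ".xml")


def map_line(line, site_host):
    """Return 'referer|requesturi^status' for one log line, or None if filtered out.

    site_host is a hypothetical parameter: the host name that counts as "internal".
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        return None                      # malformed line
    request = cols[4].strip('"').split(" ")
    if len(request) < 2:
        return None                      # no request URI
    status = cols[5]
    uri = urlsplit(request[1]).path      # drop the query string
    if uri.lower().endswith(STATIC_EXT):
        return None                      # static resource
    referer = urlsplit(cols[7].strip('"'))
    if referer.netloc != site_host:
        return None                      # external or missing referer
    return "%s|%s^%s" % (referer.path or "/", uri, status)


def reduce_counts(keys):
    """Count how often each key was emitted (the reduce step)."""
    return Counter(k for k in keys if k is not None)
```

Calling `reduce_counts(map_line(l, "domain.com") for l in open(path))` on a log file would then give a key-to-count mapping comparable to the output listing above.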

3. python
A program that converts the map reduce output into JSON that d3 can use

]$ cat page_link.py
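#!/usr/bin/python

import subprocess
import sys
import time
from datetime import timedelta, date
from os.path import exists


def getYesterday():
    # yesterday's date as YYYYMMDD; used to name the working files
    yd = date.today() - timedelta(days=1)
    return yd.strftime("%Y%m%d")

def writeLog(data):
    # append a timestamped line to /page_link/<date>.log
    log_file = "/page_link/" + stxDate + ".log"
    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    fp = open(log_file, 'a')
    fp.write("[" + now + "] " + str(data) + "\n")
    fp.close()

def fail_check(fail):
    # abort the script on a non-zero exit status
    if fail:
        writeLog("Failed =>" + str(fail))
        sys.exit(1)

def command(cmd):
    # log a shell command, run it, and return its output
    # (subprocess replaces the Python 2 only "commands" module)
    writeLog(cmd)
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    res = proc.communicate()[0]
    fail_check(proc.returncode)
    return res.rstrip('\n')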


stxDate = getYesterday()
inputfile = sys.argv[1]
outputfile = sys.argv[2]


# download the map reduce output from HDFS, replacing any stale local copy
if exists("/page_link/"+stxDate):
        command("rm -rf /page_link/"+stxDate)
command("/home/hadoop/hadoop/bin/hadoop fs -get "+inputfile+" /page_link/"+stxDate)


# parse the map reduce output: one "referer|requesturi^status<TAB>cnt" record per line
nodes = []
links = []
fo = open("/page_link/"+stxDate,"r")
for line in fo:
        line = line.rstrip('\n')
        if not line:
                continue
        cols = line.split('\t')
        # cols[0] : referer | request ^ status, cols[1] : count
        node = cols[0].split('|', 1)
        nodes.append(node[1])
        links.append(dict(source=node[0], target=node[1]))
fo.close()

# unique node info
nodes = list(set(nodes))


#make node json format
# nodes2 holds the bare URL of each node at the same index as nodes,
# so that a referer (which has no ^status suffix) can be looked up later
nodes2 = []
json = "{ \"nodes\": ["
for node in nodes :
        if node.rfind("^", 1) > 0 :
                url = node.rsplit("^",1)[0].replace("\\","")
                status = node.rsplit("^",1)[1]
                nodes2.append(url)
                json = json + "{\"name\":\""+ url +"\", \"count\":3, \"group\":\""+status+"\"},"
        else :
                nodes2.append(node)  # keep nodes2 aligned with nodes
                json = json + "{\"name\":\""+ node +"\", \"count\":3, \"group\":\"ref\"},"

json = json.rstrip(",") + " ],\n\"links\": ["


#make link json format
for link in links :
        idx_target = nodes.index(link["target"])
        try :
                idx_source = nodes2.index(link["source"])
        except ValueError :
                # the referer never appears as a requested page; fall back to node 0
                idx_source = 0

        json = json + "{\"source\":"+ str(idx_source) +", \"target\":"+str(idx_target)+"},"

json = json.rstrip(",") + " ] }"

#print("//=================================")
#print(json)
#print("//=================================")

#make json file
json_file = "/page_link/"+outputfile+".json"
fp = open(json_file, 'w')
fp.write(json)
fp.close()
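
For comparison, here is a minimal Python 3 sketch of the same conversion built on the standard `json` module, so that quoting and escaping are handled by the library rather than by string concatenation. It is not the script above: the function name `to_d3_graph` is hypothetical, it keeps one node per URL (the script above keeps one node per URL and status pair), and it adds a `value` field to each link and a real `count` to each node, both of which are assumptions about what would be useful to d3.

```python
import json


def to_d3_graph(lines):
    """Convert 'referer|requesturi^status<TAB>cnt' lines into a nodes/links dict."""
    index = {}            # URL -> position in the nodes list
    nodes, links = [], []

    def node_id(url):
        # Create the node on first sight; "ref" marks a page seen only as referer.
        if url not in index:
            index[url] = len(nodes)
            nodes.append({"name": url, "count": 0, "group": "ref"})
        return index[url]

    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, cnt = line.split("\t")
        referer, target = key.split("|", 1)
        url, status = target.rsplit("^", 1)

        t = node_id(url)
        nodes[t]["count"] += int(cnt)
        if nodes[t]["group"] == "ref":
            nodes[t]["group"] = status   # first status seen for this URL
        s = node_id(referer)
        links.append({"source": s, "target": t, "value": int(cnt)})

    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    sample = ["/|/index.jsp^200\t1150", "/|/info.do^200\t4"]
    print(json.dumps(to_d3_graph(sample), indent=2))
```

Because every referer gets its own node, no link has to fall back to node 0 when the referer was never requested itself.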

4. d3
I chose d3 as the visualization tool. http://d3js.org/
There are many d3 examples at https://github.com/mbostock/d3/wiki/Gallery;
I started from Fisheye Distortion (http://bost.ocks.org/mike/fisheye/).
I only used the link-structure part of that sample and did not use its actual Fisheye Distortion feature.

Source modified from the Fisheye Distortion example

<!DOCTYPE html>
<head>
<meta charset="utf-8">
<style>
text {
  font: 10px sans-serif;
}

.node {
  stroke: #fff;
  stroke-width: 1.5px;
}

.link {
  stroke: #999;
  stroke-opacity: .6;
}

.background {
  fill: none;
  pointer-events: all;
}

#chart1 {
  width: 1000px;
  height: 500px;
  border: solid 1px #ccc;
}

#chart1 .node {
  stroke: #fff;
  stroke-width: 1.5px;
}

#chart1 .link {
  stroke: #999;
  stroke-opacity: .6;
  stroke-width: 1.5px;
}

#chart1 .status {
  stroke: #999;
  stroke-opacity: .6;
  stroke-width: 1.5px;
}

circle.node.200{
    fill:black;
} 

</style>
<script src="/lib/d3.v2.min.js?2.9.4"></script>
<script src="/lib/jquery-1.9.0.min.js"></script>
</head>

<body>
<select id="date">
	<option value="20130131">20130131</option>
	<option value="20130201">20130201</option>
</select>
<select id="domain">
	<option value="www.aaa.com">www.aaa.com</option>
	<option value="www.bbb.co.kr">www.bbb.co.kr</option>
    <option value="www.ccc.com">www.ccc.com</option>
</select>

<input type="button" id="search" name="search" value="search"/>
<div id="msg"></div>

<p id="chart1"></p>

<script>
var width = 1000,
height = 500;

var color = d3.scale.category20();

var force = d3.layout.force()
.charge(-20)
.linkDistance(20)
.size([width, height]);


var svg = d3.select("#chart1").append("svg")
.attr("width", width)
.attr("height", height);

svg.append("rect")
.attr("class", "background")
.attr("width", width)
.attr("height", height);

$("#msg").text('select date and domain and search!');

function draw(data) {
    var n = data.nodes.length;

    var domain=[];
    data.nodes.forEach(function(d){
           domain.push(d.group);
    });
    domain = domain.sort()
    color.domain(domain); 
    var legend_height = 22 * domain.unique().length;
    var legend = d3.select("#chart1").append("svg").attr("class", "legend").attr("width", 100).attr("height",legend_height).attr("x", width).attr("y", height)
    .selectAll("g").data(color.domain()).enter().append("g").attr("transform", function(d, i) {
        return "translate(0," + i * 20 + ")";
    });
    
    legend.append("rect").attr("width", 18).attr("height", 18).style("fill", color);
    
    legend.append("text").attr("x", 24).attr("y", 9).attr("dy", ".35em").text(function(d) {
        return d;
    });
    
    
    force.nodes(data.nodes).links(data.links);

    // Initialize the positions deterministically, for better results.
    data.nodes.forEach(function(d, i) { d.x = d.y = width / n * i; });

    // Run the layout a fixed number of times.
    // The ideal number of times scales with graph complexity.
    // Of course, don't run too long, or you'll hang the page!
    force.start();
    for (var i = n; i > 0; --i) force.tick();
    force.stop();

    // Center the nodes in the middle.
    var ox = 0, oy = 0;
    data.nodes.forEach(function(d) { ox += d.x, oy += d.y; });
    ox = ox / n - width / 2, oy = oy / n - height / 2;
    data.nodes.forEach(function(d) { d.x -= ox, d.y -= oy; });
    
    var link = svg.selectAll(".link")
        .data(data.links)
      .enter().append("line")
        .attr("class", "link")
        .attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; })
        .style("stroke-width", function(d) { return Math.sqrt(d.value || 1); }); // links without a value get width 1

    var node = svg.selectAll(".node")
        .data(data.nodes)
        .enter().append("circle")
        .attr("class", "node")
        .attr("cx", function(d) { return d.x; })
        .attr("cy", function(d) { return d.y; })
        .attr("r", 3)
        .style("fill", function(d) {  return color(d.group); })
        .call(force.drag);

    node.append("title")
    .text(function(d) { return d.name; });
}

$("#search").click(function(){
    var file = "/resource/"+$("#domain").val() +"_"+$("#date").val()+".json";
    $("#msg").text('');
    //alert(file);
    d3.selectAll('.legend').remove();
    d3.selectAll('.node').remove();
    d3.selectAll('.link').remove();
    
    d3.json(file, draw);
});

Array.prototype.unique=function() {
    var newArray=[], len=this.length;
    label:for(var i=0; i<len; i++) {
      for(var j=0; j<newArray.length; j++)
        if(newArray[j]==this[i]) continue label;
      newArray[newArray.length] = this[i];
    }
    return newArray;
  }

</script>

</body>

Data format:

{
    "links": [
        {
            "source": 0, 
            "target": 1
        }, 
        {
            "source": 0, 
            "target": 2
        }, 
        {
            "source": 1, 
            "target": 3
        }, 
        {
            "source": 1, 
            "target": 4
        }, 
        {
            "source": 1, 
            "target": 5
        }, 
        {
            "source": 2, 
            "target": 6
        }, 
        {
            "source": 2, 
            "target": 7
        }
    ], 
    "nodes": [
        {
            "name": "/",
            "count": 30
        }, 
        {
            "name": "/a.html",
            "count": 20
        }, 
        {
            "name": "/b.html",
            "count": 10
        }, 
        {
            "name": "/a/a1.html",
            "count": 7
        }, 
        {
            "name": "/a/a2.html",
            "count": 10
        }, 
        {
            "name": "/a/a3.html",
            "count": 3
        }, 
        {
            "name": "/b/b1.html",
            "count": 7
        }, 
        {
            "name": "/b/b2.html",
            "count": 3
        }
    ]
}
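
As far as I understand, the force layout looks up each numeric `source` and `target` in the `nodes` array, so an index that is out of range is likely to break the drawing. A small Python check like the one below can catch that before the JSON file is published. This is only a sketch; `check_graph` is a hypothetical helper and not part of the script above.

```python
import json


def check_graph(graph):
    """Return a list of (link position, 'source' or 'target', bad index) tuples."""
    n = len(graph["nodes"])
    problems = []
    for i, link in enumerate(graph["links"]):
        for end in ("source", "target"):
            idx = link.get(end)
            # A valid end point is an integer index into the nodes list.
            if not isinstance(idx, int) or not 0 <= idx < n:
                problems.append((i, end, idx))
    return problems


if __name__ == "__main__":
    text = '{"nodes": [{"name": "/"}, {"name": "/a.html"}], "links": [{"source": 0, "target": 1}]}'
    print(check_graph(json.loads(text)))
```

An empty list means every link points at an existing node.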

The result looks like the following.

[Images 1-4: Page Association Analysis Using Web Logs]

 
