웹로그를 이용한 페이지 연관분석
0. 개요
웹로그의 referer 정보를 이용하여 페이지간의 연결구조를 파악하면,
레이지 링크를 따라움직이지 않고 직접 URL에 접근 또는 임시 페이지 및 취약한 페이지를 찾을수 있다 는 가설을 세우고 접근
1. 웹로그
대상 referer 가 존재하는 웹로그
2. map reduce
웹로그에서 필요한 정보만 출력
– 정적인 페이지 제외(.gif, .jpg, .png, .swf, .css, .js, .dwr, .htc, .flv, .xml)
– 외부 referer 는 제외
– output 은 아래와 같은 format
referer|requesturi^status cnt
input weblog
xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /btn_byline.gif HTTP/1.1" 200 2931 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 410 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /pop_close.jpg HTTP/1.1" 200 1633 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 680 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /ico_arrow.gif HTTP/1.1" 200 57 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 450 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /include/common/calendar.jsp HTTP/1.1" 200 7584 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 4670 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /js/counter/counter.js HTTP/1.1" 200 6121 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 720 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /calendar/calendar.js HTTP/1.1" 200 2073 "http://domain.com/include/common/calendar.jsp" HTTP/1.1 1350 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "POST /info.do HTTP/1.1" 200 299 "http://domain.com/info.do?cmd=main&mid=84" HTTP/1.1 97320 xxx.xxx.xxx.xxx - - [01/Feb/2013:00:00:51 +0900] "GET /include/common/tags.js HTTP/1.1" 404 881 "http://domain.com/include/common/calendar.jsp" HTTP/1.1 3360
output
/|/index.jsp^200 1150 /|/index.jsp^500 1 /|/info.do^200 4 ./index.jsp|/combi.do^200 2 ./index.jsp|/notice.do^200 1 ./info.do|/notice.do^200 1
3. python
map reduce의 결과 output을 d3에서 사용할수 있는 json 으로 변환해주는 프로그램
]$ cat page_link.py #!/usr/bin/python import sys, glob import commands import time from datetime import timedelta, date from os.path import * def getYesterday(): d=date.today() td=timedelta(days=-1) yd=d+td return yd.strftime("%Y%m%d") def writeLog(data) : log_date = stxDate log_file = "/page_link/" + log_date + ".log" fp = file(log_file, 'a+') now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time())) comment = "["+str(now)+"] "+str(data)+"\n" fp.write(comment) fp.close() def faile_check(fail): if fail: writeLog("Failed =>" + str(fail)) sys.exit(1) def command(cmd): writeLog(cmd) fail, res = commands.getstatusoutput(cmd) faile_check(fail) return res stxDate = getYesterday() inputfile = sys.argv[1] outputfile = sys.argv[2] # hadoop file down if exists("/page_link/"+stxDate): cmd="rm -rf /page_link/"+stxDate command(cmd) cmd="/home/hadoop/hadoop/bin/hadoop fs -get "+inputfile+" /page_link/"+stxDate command(cmd) else : cmd="/home/hadoop/hadoop/bin/hadoop fs -get "+inputfile+" /page_link/"+stxDate command(cmd) # get all file count cmd="wc -l /page_link/"+stxDate+" |awk '{print $1}'" cnt = command(cmd) i = int(cnt) # log line process nodes = []; links = []; fo = open("/page_link/"+stxDate,"r") while i > 0 : line = fo.readline().rstrip('\n') cols = line.split('\t') # cols : refer | request ^ status \t 1 node = cols[0].split('|') nodes.append(str(node[1])) links.append(dict(source=str(node[0]), target=str(node[1]))) i-= 1 fo.close(); # unique node info nodes = list(set(nodes)) #make node json format nodes2 = []; json = "{ \"nodes\": ["; for node in nodes : if node.rfind("^", 1) > 0 : url = node.rsplit("^",1)[0].replace("\\","") status = node.rsplit("^",1)[1] nodes2.append(url) json = json + "{\"name\":\""+ url +"\", \"count\":3, \"group\":\""+status+"\"}," else : json = json + "{\"name\":\""+ node +"\", \"count\":3, \"group\":\"ref\"}," json = json.rstrip(",") + " ],\n\"links\": ["; #make link json format for link in links : idx_target = nodes.index(link["target"]) try : idx_source = nodes2.index(link["source"]) except : idx_source = 0 json = json + "{\"source\":"+ str(idx_source) +", \"target\":"+str(idx_target)+"}," json = json.rstrip(",") + " ] }"; #print("//=================================") #print(json) #print("//=================================") #make json file json_file = "/page_link/"+outputfile+".json" fp = file(json_file, 'w') fp.write(json) fp.close()
4. d3
시각화 도구로 d3를 선택했다. http://d3js.org/
https://github.com/mbostock/d3/wiki/Gallery 여기에 가면 많은 d3 예제가 있는데,
나는 특별히 Fisheye Distortion (http://bost.ocks.org/mike/fisheye/) 를 사용했다.
여기서 연결 구조만 사용하고 실제 샘플의 Fisheye Distortion 기능은 사용하지 않았다.
Fisheye Distortion 예제를 기초로 수정한 소스
<!DOCTYPE html> <head> <meta charset="utf-8"> <style> text { font: 10px sans-serif; } .node { stroke: #fff; stroke-width: 1.5px; } .link { stroke: #999; stroke-opacity: .6; } .background { fill: none; pointer-events: all; } #chart1 { width: 1000px; height: 500px; border: solid 1px #ccc; } #chart1 .node { stroke: #fff; stroke-width: 1.5px; } #chart1 .link { stroke: #999; stroke-opacity: .6; stroke-width: 1.5px; } #chart1 .status { stroke: #999; stroke-opacity: .6; stroke-width: 1.5px; } circle.node.200{ fill:black; } </style> <script src="/lib/d3.v2.min.js?2.9.4"></script> <script src="/lib/jquery-1.9.0.min.js"></script> </head> <body> <select id="date"> <option value="20130131">20130131</option> <option value="20130201">20130201</option> </select> <select id="domain"> <option value="www.aaa.com">www.aaa.com</option> <option value="www.bbb.co.kr">www.bbb.co.kr</option> <option value="www.ccc.com">www.ccc.com</option> </select> <input type="button" id="search" name="search" value="search"/> <div id="msg"></div> <p id="chart1"></p> <script> var width = 1000, height = 500; var color = d3.scale.category20(); var force = d3.layout.force() .charge(-20) .linkDistance(20) .size([width, height]); var svg = d3.select("#chart1").append("svg") .attr("width", width) .attr("height", height); svg.append("rect") .attr("class", "background") .attr("width", width) .attr("height", height); $("#msg").text('select date and domain and search!'); function draw(data) { var n = data.nodes.length; var domain=[]; data.nodes.forEach(function(d){ domain.push(d.group); }); domain = domain.sort() color.domain(domain); var lengend_height = 22 * domain.unique().length; var legend = d3.select("#chart1").append("svg").attr("class", "legend").attr("width", 100).attr("height",lengend_height).attr("x", width).attr("y", height) .selectAll("g").data(color.domain()).enter().append("g").attr("transform", function(d, i) { return "translate(0," + i * 20 + ")"; }); legend.append("rect").attr("width", 18).attr("height", 18).style("fill", color); legend.append("text").attr("x", 24).attr("y", 9).attr("dy", ".35em").text(function(d) { return d; }); force.nodes(data.nodes).links(data.links); // Initialize the positions deterministically, for better results. data.nodes.forEach(function(d, i) { d.x = d.y = width / n * i; }); // Run the layout a fixed number of times. // The ideal number of times scales with graph complexity. // Of course, don't run too long?you'll hang the page! force.start(); for (var i = n; i > 0; --i) force.tick(); force.stop(); // Center the nodes in the middle. var ox = 0, oy = 0; data.nodes.forEach(function(d) { ox += d.x, oy += d.y; }); ox = ox / n - width / 2, oy = oy / n - height / 2; data.nodes.forEach(function(d) { d.x -= ox, d.y -= oy; }); var link = svg.selectAll(".link") .data(data.links) .enter().append("line") .attr("class", "link") .attr("x1", function(d) { return d.source.x; }) .attr("y1", function(d) { return d.source.y; }) .attr("x2", function(d) { return d.target.x; }) .attr("y2", function(d) { return d.target.y; }) .style("stroke-width", function(d) { return Math.sqrt(d.value); }); var node = svg.selectAll(".node") .data(data.nodes) .enter().append("circle") .attr("class", "node") .attr("cx", function(d) { return d.x; }) .attr("cy", function(d) { return d.y; }) .attr("r", 3) .style("fill", function(d) { return color(d.group); }) .call(force.drag); node.append("title") .text(function(d) { return d.name; }); } $("#search").click(function(){ file = "/resource/"+$("#domain").val() +"_"+$("#date").val()+".json" $("#msg").text(''); //alert(file); d3.selectAll('.legend').remove(); d3.selectAll('.node').remove(); d3.selectAll('.link').remove(); d3.json(file, draw); }); Array.prototype.unique=function() { var newArray=[], len=this.length; label:for(var i=0; i<len; i++) { for(var j=0; j<newArray.length; j++) if(newArray[j]==this[i]) continue label; newArray[newArray.length] = this[i]; } return newArray; } </script> </body>
데이터의 포맷..
{ "links": [ { "source": 0, "target": 1 }, { "source": 0, "target": 2 }, { "source": 1, "target": 3 }, { "source": 1, "target": 4 }, { "source": 1, "target": 5 }, { "source": 2, "target": 6 }, { "source": 2, "target": 7 } ], "nodes": [ { "name": "/", "count": 30 }, { "name": "/a.html", "count": 20 }, { "name": "/b.html", "count": 10 }, { "name": "/a/a1.html", "count": 7 }, { "name": "/a/a2.html", "count": 10 }, { "name": "/a/a3.html", "count": 3 }, { "name": "/b/b1.html", "count": 7 }, { "name": "/b/b2.html", "count": 3 } ] }
결과는 아래와 같은 형태로 나온다.