從過年前到目前為止~ 都一直和學弟忙於將以前所實作的東西要轉換到線上版,我還需要點時間呀~ ><"
而在轉換的過程之中~ 突然想到一個問題!! 那就是全世界URL的平均長度究竟約多長?
我想這個答案只有大型搜尋引擎(Google、Yahoo、Cuil)能給出一個較接近的答案吧~
下述是一個簡單計算這樣結果的MapReduce小程式:
URLList
http://l.yimg.com/f/a/tw/ivychang/708971_020409_420x80_0202_yahoo-elite.swf http://l.yimg.com/tw.yimg.com/a/tw/ivychang/712756_1231_1231new350_100.swf http://l.yimg.com/tw.yimg.com/a/tw/erinlin/721493_0123_350x200.swf http://www.kriesi.at/wp-content/themes/dark_rainbow/js/Particles.swf http://tw.promo.yahoo.com/2008auction/shpticket/images/top.swf http://l.yimg.com/tw.yimg.com/a/tw/fanny/658216_101508_420x80_4.swf http://l.yimg.com/f/a/tw/vikii/606895_shopping_center_20090203r.swf http://l.yimg.com/f/a/tw/hedy/697827_e3_hp_012109.swf http://l.yimg.com/tw.yimg.com/a/tw/ivychang/708334_0120_350x200_certificate_081224.swf http://l.yimg.com/tw.yimg.com/a/tw/ivychang/708334_0120_350x100_linux_080826.swf http://www.ysed.org.tw/3rd_upLoad/4156/index.swf
URLAvgLength
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class URLAvgLength extends Configured implements Tool {
static enum Counter {
URL_COUNT
}
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
private final static Text word = new Text("Len");
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String key2 = value.toString();
reporter.incrCounter(Counter.URL_COUNT, 1);
output.collect(word, new IntWritable(key2.length()));
}
}
public static class Reduce extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public int run(String[] args) throws Exception {
String input = "/usr/Ring/urllist/*";
String output = "/usr/Ring/urlavglen";
JobConf conf = new JobConf(getConf(), URLAvgLength.class);
FileSystem fs = FileSystem.get(conf);
fs.delete(new Path(output), true);
conf.setJobName("URLAvgLength");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setNumReduceTasks(1);
TextInputFormat.setInputPaths(conf, new Path(input));
TextOutputFormat.setOutputPath(conf, new Path(output));
RunningJob running = JobClient.runJob(conf);
Counters ct = running.getCounters();
long count = ct.getCounter(Counter.URL_COUNT);
InputStream in = fs.open(new Path("hdfs://localhost:9000"+output+"/part-00000"));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String line = br.readLine();
Integer value = Integer.parseInt(line.split("\t")[1]);
System.out.println("Avg:" + value/count);
return 0;
}
public static void main(String[] args) {
try {
int res = ToolRunner.run(new Configuration(), new URLAvgLength(),args);
System.exit(res);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Avg:67

哈哈,我也在思考这个问题。
“全世界URL的平均長度究竟約多長?”
2009-06-26 11:11:02