2009 January

Memcached - a distributed memory object caching system

Memcached (pronounced mem-cache-dee) is a distributed in-memory caching system. These days everything seems to have to be distributed. XD

Google is the prime example: Google File System, MapReduce, Bigtable... all of them are distributed.. = =" (since the latter two are both built on top of GFS)

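According to the memcached article on Wikipedia, it was originally developed by Danga Interactive for LiveJournal. It can be used on most database-driven websites (this blog should probably switch to Memcached too, XD, when I find the time); in other words, we can cache data that would otherwise be fetched from the database over and over. However, it provides no security or authentication features, so Memcached needs to sit behind a firewall. Plenty of big-name sites use it, such as YouTube, Digg, Wikipedia and Slashdot. Facebook alone uses more than 800 servers providing 28 TB of memory for caching (I imagine Google has even more..). The highlight: Jimmy Lin, an assistant professor at UMD, quickly integrated it into Hadoop and even wrote a technical report, "Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework". A true pioneer.. Orz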
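To make the "cache what you keep fetching from the database" idea concrete, here is a minimal cache-aside sketch. It is only an illustration under stated assumptions: a plain in-process map stands in for the Memcached server, and the `database` function is a hypothetical loader, not a real query. With a real client the two map calls would become the client's get and set operations.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CacheAsideSketch {
    // Stand-in for a memcached client: a plain in-process map (assumption for illustration).
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Stand-in for the expensive database query (hypothetical loader).
    private final Function<String, String> database;
    // Counts how often the loader was actually called.
    private int databaseHits = 0;

    public CacheAsideSketch(Function<String, String> database) {
        this.database = database;
    }

    public String get(String key) {
        // 1. Try the cache first.
        String cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        // 2. On a miss, fall back to the database...
        databaseHits++;
        String value = database.apply(key);
        // 3. ...and store the result so the next lookup is served from the cache.
        if (value != null) {
            cache.put(key, value);
        }
        return value;
    }

    public int getDatabaseHits() {
        return databaseHits;
    }

    public static void main(String[] args) {
        CacheAsideSketch posts = new CacheAsideSketch(id -> "post body for " + id);
        System.out.println(posts.get("42"));
        System.out.println(posts.get("42"));
        System.out.println("database hits: " + posts.getDatabaseHits());
    }
}
```

The second lookup of the same key never reaches the loader, which is the whole point of putting a cache in front of the database. A real deployment would also need expiry and invalidation when the underlying row changes, which this sketch leaves out.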

There are already plenty of resources online about installing Memcached; if you are interested, just Google it, or see the related resources below.

I mainly use "spymemcached", a Java API. If you need an API for another language, see: Memcached Clients.

The following program prints the status of the Memcached server(s):

import java.net.SocketAddress;
import java.util.Map;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class MemcachedTest
{
	public static void main(String arg[]) throws Exception
	{
		long total_items = 0;
		// Replace xxx.xxx.xxx.xxx with the address of your memcached server.
		MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("xxx.xxx.xxx.xxx:11211"));

		try
		{
			// getStats() returns one map of statistics per connected server.
			Map<SocketAddress, Map<String, String>> stats = mc.getStats();

			for (Map.Entry<SocketAddress, Map<String, String>> e : stats.entrySet())
			{
				System.out.println("memcached server: " + e.getKey().toString());

				for (Map.Entry<String, String> s : e.getValue().entrySet())
				{
					System.out.println(" - " + s.getKey() + ": " + s.getValue());

					// Sum the number of cached items across all servers.
					if (s.getKey().equals("curr_items"))
						total_items += Long.parseLong(s.getValue());
				}
			}

			System.out.println("Total number items in memcache: " + total_items);
		}
		finally
		{
			// Always release the client's connections and I/O thread.
			mc.shutdown();
		}
	}
}

Related news and resources

．Scaling memcached at Facebook

．How to install memcache on Debian Etch

2009-01-13 22:31:18 | Comments (2)

Java class File Format

In Java

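Yesterday, while tidying up my computer, I found a long-forgotten Word file containing my old notes from studying the Java class file format.

I still remember smiling the first time I saw "0xCAFEBABE". Cafe babe? XDDD Yes, that's right! Java uses these four bytes as its file signature. How creative!!

I vaguely recall that the class format went through one major revision, when Java jumped straight from "Java 1.4.x" to "Java 5.0". Back then the big thing was the "two tigers": one was Java 5.0, code-named "Tiger", and the other was "Mac OS X 10.4", also code-named "Tiger". Only last year were they displaced by "nature", as all sorts of "Cloud" and "Air" names started showing up; if it isn't a cloud, it's the atmosphere. Too many to list XDD (Cloud Computing, Tag Cloud, Adobe AIR, MacBook Air...)

The source file for the format shown above is: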

Hello.java

public class Hello
{
	public static void main(String arg[])
	{
		String s = "Hello";
	}
}

If you are interested, you can dissect the compiled class by following along with "VM Spec The class File Format".

The class format changes made in Java 5.0 can be downloaded from "JSR 202: Java™ Class File Specification Update".
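As a small starting point for that kind of dissection, the sketch below reads only the first eight bytes of a class file: the four-byte magic number, followed by the two-byte minor and major version. The class and method names are made up for illustration, and the sample bytes in `main` are a hand-written header rather than a real Hello.class; major version 49 is the value used by Java 5.0.

```java
import java.nio.ByteBuffer;

public class ClassHeaderSketch {
    // The file signature every class file starts with.
    public static final int MAGIC = 0xCAFEBABE;

    /**
     * Reads the class file header and returns {minor_version, major_version}.
     * Throws IllegalArgumentException if the bytes do not start with 0xCAFEBABE.
     */
    public static int[] readVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("A class file header needs at least 8 bytes");
        }
        // Class files are big-endian, which is also ByteBuffer's default byte order.
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        int magic = buf.getInt();
        if (magic != MAGIC) {
            throw new IllegalArgumentException(
                    "Not a class file, magic = 0x" + Integer.toHexString(magic));
        }
        // Both version fields are unsigned 16-bit values.
        int minor = buf.getShort() & 0xFFFF;
        int major = buf.getShort() & 0xFFFF;
        return new int[] { minor, major };
    }

    public static void main(String[] args) {
        // Hand-written header: magic, minor version 0, major version 49 (Java 5.0).
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49
        };
        int[] version = readVersion(header);
        System.out.println("minor=" + version[0] + ", major=" + version[1]);
    }
}
```

To try it on a real file, the bytes could come from `java.nio.file.Files.readAllBytes` on a compiled class; everything after the version (constant pool, access flags, and so on) is where the spec becomes necessary.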

My notes: Hello.class format analysis (posting the whole image here would be far too long...)

2009-01-06 12:54:34 | Add Comment

Pairwise Document Similarity in Large Collections with MapReduce

In Hadoop

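Pairwise Document Similarity in Large Collections with MapReduce．This is a short paper published at "ACL-08: HLT" by Tamer M. Elsayed, a PhD student at UMD, together with his advisor. It uses "MapReduce" to compute the similarity between large numbers of documents. If you are interested in the paper, please follow the link to it; I won't go into detail here.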
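For readers who want the shape of the computation before looking at any Hadoop code, here is a small in-memory sketch of the two phases: build an inverted index of term counts, then, for every term, multiply the counts of each pair of documents sharing it and sum the products per pair. This is a simplification under stated assumptions: it uses raw term counts as weights and runs in a single process, so it only mirrors the idea, not the paper's weighting or its distributed execution. All names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class PairwiseSimilaritySketch {

    /** Maps "docA,docB" to the sum over shared terms of count(A) * count(B). */
    public static Map<String, Integer> similarity(Map<String, String> docs) {
        // Phase 1 (indexing): term -> (document name -> term count).
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            StringTokenizer tokens = new StringTokenizer(doc.getValue());
            while (tokens.hasMoreTokens()) {
                index.computeIfAbsent(tokens.nextToken(), t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // Phase 2 (pairwise similarity): every pair of documents sharing a term
        // contributes the product of their counts for that term.
        Map<String, Integer> scores = new TreeMap<>();
        for (Map<String, Integer> postings : index.values()) {
            List<Map.Entry<String, Integer>> list = new ArrayList<>(postings.entrySet());
            for (int i = 0; i < list.size(); i++) {
                for (int j = i + 1; j < list.size(); j++) {
                    String pair = list.get(i).getKey() + "," + list.get(j).getKey();
                    int product = list.get(i).getValue() * list.get(j).getValue();
                    scores.merge(pair, product, Integer::sum);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "a b b");
        docs.put("d2", "a b c");
        docs.put("d3", "c c");
        System.out.println(similarity(docs));
    }
}
```

In this toy input, d1 and d2 share "a" (1 x 1) and "b" (2 x 1), giving 3; d2 and d3 share "c" (1 x 2), giving 2; d1 and d3 share nothing and get no entry. The postings are kept sorted by document name so that a pair is always written in one orientation ("d1,d2", never "d2,d1"), otherwise the same pair could be split across two keys.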

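The verification program I wrote below requires "Cloud9 - A MapReduce Library for Hadoop". "Cloud9" is developed at UMD, mainly as a teaching tool for courses and for research in text processing. It is released under the Apache License, so you can simply check it out with Subversion and use it. The example below mainly uses "PairOfIntString" and "ArrayListWritable".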

Pairwise Document Similarity

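In the "Map" phase of "Pairwise Similarity", I simply use a regular expression to parse the text output of the indexing job. This is not the best approach (I admit I was being lazy); ideally one would write dedicated "OutputFormat" and "Writable" classes to handle it, which would be considerably more efficient! (e.g. the Tuple provided by Cloud9)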
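To show what that regular-expression step does in isolation, here is a standalone sketch that applies the same pattern as the Map2 class in the listing below to one line of text. The sample line in `main` is an assumption about what the indexing job writes (a term, a tab, then a bracketed list of "(count, docname)" pairs), and the class name is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PostingLineParser {
    // Matches "(count, docname)"; compiled once instead of once per input line.
    private static final Pattern POSTING =
            Pattern.compile("\\(([0-9]+), ([a-z0-9.]+)\\)");

    /** Parses one line into docname -> count, in order of appearance. */
    public static Map<String, Integer> parse(String line) {
        Map<String, Integer> postings = new LinkedHashMap<>();
        Matcher m = POSTING.matcher(line);
        while (m.find()) {
            // group(1) is the count, group(2) is the document name.
            postings.put(m.group(2), Integer.valueOf(m.group(1)));
        }
        return postings;
    }

    public static void main(String[] args) {
        // Hypothetical indexing output line: term, tab, list of postings.
        String line = "hadoop\t[(2, doc1), (1, doc2)]";
        System.out.println(parse(line));
    }
}
```

Note the limits of the pattern: document names containing uppercase letters, hyphens or underscores would not match and would be dropped without any error, which is one more reason a dedicated Writable is the more robust choice.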

Because this example needs two MapReduce passes, I use "job2.addDependingJob(job1);" to make the second job depend on the first; that is, JobControl only starts job2 after job1 has completed.

import java.io.IOException;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import edu.umd.cloud9.io.ArrayListWritable;
import edu.umd.cloud9.io.PairOfIntString;

public class PairwiseDS extends Configured implements Tool
{

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, PairOfIntString>
    {
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, PairOfIntString> output, Reporter reporter)
                throws IOException
        {
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String fileName = fileSplit.getPath().getName();
            fileName = fileName.substring(0, fileName.length() - 4);

            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, new PairOfIntString(1, fileName));
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, PairOfIntString, Text, ArrayListWritable>
    {
        public void reduce(Text key, Iterator<PairOfIntString> values,
                OutputCollector<Text, ArrayListWritable> output,
                Reporter reporter) throws IOException
        {

            ArrayList<PairOfIntString> al = new ArrayList<PairOfIntString>();
            // Count of this term per document name.
            HashMap<String, Integer> map = new HashMap<String, Integer>();

            while (values.hasNext())
            {
                PairOfIntString psi = values.next();
                String doc = psi.getRightElement();
                Integer count = map.get(doc);
                // Add the emitted count (1 per token from the mapper) to the running total.
                map.put(doc, count == null ? psi.getLeftElement()
                        : count + psi.getLeftElement());
            }
            // Emit one (count, docname) pair per document containing the term.
            for (java.util.Map.Entry<String, Integer> m : map.entrySet())
            {
                al.add(new PairOfIntString(m.getValue(), m.getKey()));
            }
            output.collect(key, new ArrayListWritable<PairOfIntString>(al));

        }
    }

    public static class Map2 extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable>
    {
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
        {
            String line = value.toString().trim();

            ArrayList<String> keyList = new ArrayList<String>();
            ArrayList<Integer> valList = new ArrayList<Integer>();

            String p = "\\(([0-9]+), ([a-z0-9.]+)\\)";
            Pattern r = Pattern.compile(p);
            Matcher m = r.matcher(line);
            while (m.find())
            {
                String k = m.group(2);
                String v = m.group(1);
                keyList.add(k);
                valList.add(Integer.valueOf(v));
            }

            if (keyList.size() > 1)
            {
                String[] key_arr = keyList.toArray(new String[0]);
                Integer[] val_arr = valList.toArray(new Integer[0]);
                int klen = key_arr.length;
                for (int i = 0; i < klen; i++)
                {
                    for (int j = i + 1; j < klen; j++)
                    {
                        word.set(key_arr[i] + "," + key_arr[j]);
                        output.collect(word, new IntWritable(val_arr[i]
                                * val_arr[j]));
                    }
                }
            }

        }
    }

    public static class Reduce2 extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException
        {
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception
    {
        // ===================== Indexing =====================
        JobConf conf = new JobConf(getConf(), PairwiseDS.class);
        conf.setJobName("Indexing");
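        conf.setOutputKeyClass(Text.class);
        // The mapper emits PairOfIntString; the reducer emits ArrayListWritable,
        // which TextOutputFormat writes out via toString().
        conf.setMapOutputValueClass(PairOfIntString.class);
        conf.setOutputValueClass(ArrayListWritable.class);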

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.setNumReduceTasks(1);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        TextOutputFormat.setOutputPath(conf, new Path(args[1]));

        Job job1 = new Job(conf);
        // ===================== Pairwise Similarity =====================
        JobConf conf2 = new JobConf(getConf(), PairwiseDS.class);
        conf2.setJobName("Pairwise Similarity");
        conf2.setOutputKeyClass(Text.class);
        conf2.setOutputValueClass(IntWritable.class);

        conf2.setMapperClass(Map2.class);
        conf2.setReducerClass(Reduce2.class);

        conf2.setInputFormat(TextInputFormat.class);
        conf2.setOutputFormat(TextOutputFormat.class);
        conf2.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(conf2, new Path(args[1] + "/p*"));
        TextOutputFormat.setOutputPath(conf2, new Path(args[2]));
        Job job2 = new Job(conf2);

        job2.addDependingJob(job1);
        JobControl controller = new JobControl("Pairwise Document Similarity");
        controller.addJob(job1);
        controller.addJob(job2);
        new Thread(controller).start();

        while (!controller.allFinished())
        {
            System.out.println("Jobs in waiting state: "+ controller.getWaitingJobs().size());
            System.out.println("Jobs in ready state: "+ controller.getReadyJobs().size());
            System.out.println("Jobs in running state: "+ controller.getRunningJobs().size());
            System.out.println("Jobs in success state: "+ controller.getSuccessfulJobs().size());
            System.out.println("Jobs in failed state: "+ controller.getFailedJobs().size());
            System.out.println();

            try
            {
                Thread.sleep(20000);
            } catch (Exception e)
            {
                e.printStackTrace();
            }
        }
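        // Stop the JobControl thread and report failure if any job failed.
        controller.stop();
        return controller.getFailedJobs().isEmpty() ? 0 : 1;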

    }

    public static void main(String[] args) throws Exception
    {
        int res = ToolRunner.run(new Configuration(), new PairwiseDS(), args);
        System.exit(res);
    }
}

after Indexing:

after Pairwise Similarity:

Related resources

．Hadoop常用SDK系列四 JobControl (Commonly Used Hadoop SDKs, Part 4: JobControl)

2009-01-05 22:30:04 | Comments (4)

PDFBox - Extracting plain text from PDF files

In Java, Open Source

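PDFBox．An open-source Java PDF library that helps with various PDF processing tasks, such as extracting plain text from a PDF file or converting a PDF file to images. (iText could also be used, but it does not seem to support plain-text extraction; see "iText in Action", p. 576.)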

It is also commonly paired with "Lucene": PDFs are converted to plain text with PDFBox and then indexed.

The following extracts the plain text from a PDF file (FontBox-0.1.0-dev.jar and PDFBox-0.7.3.jar required!):

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFTextExtractor
{
    public String extractText(String file) throws IOException
    {
        // The loaded document holds file resources, so it must be closed.
        PDDocument document = PDDocument.load(file);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            // Extract every page, from the first to the last.
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());
            return stripper.getText(document);
        }
        finally
        {
            document.close();
        }
    }

    public static void main(String[] args)
    {
        PDFTextExtractor extractor = new PDFTextExtractor();
        try
        {
            String text = extractor.extractText("C:\\test.pdf");
            System.out.println(text);
        }catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

．Glyph & Cog: Text Extraction - explains why extracting plain text from a PDF is not that easy

2009-01-03 18:17:22 | Comments (3)