Open Source

Memcached - a distributed memory object caching system

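Memcached (pronounced mem-cache-dee) is a distributed memory caching system. These days it seems everything has to be distributed. XD

This is especially true of Google: Google File System, MapReduce, Bigtable... all of them are distributed..= =" (the latter two are built on top of GFS).

According to the memcached (wiki) article, it was originally developed by Danga Interactive for LiveJournal. It can be used with most database-driven websites (this blog should probably switch to Memcached too, XD, when I have time); in other words, we can cache data that would otherwise be fetched from the database over and over. However, it provides no security or authentication features, which means Memcached needs to sit behind a firewall.

Plenty of big-name companies use it, such as YouTube, Digg, Wikipedia, and Slashdot. Facebook runs more than 800 servers that provide 28 TB of memory for caching (I would guess Google has even more).

The highlight: Jimmy Lin, an assistant professor at UMD, quickly integrated it into Hadoop and even wrote a technical report, "Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework." A true pioneer.. Orz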
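To make the idea of caching frequently queried database data more concrete, here is a minimal sketch of the usual "cache-aside" pattern. It is only an illustration under stated assumptions: the `ConcurrentHashMap` is a stand-in for a memcached server, the `loader` function is a hypothetical stand-in for a database query, and the class and method names are made up for this example. With spymemcached, the map lookups would roughly correspond to `mc.get(key)`, `mc.set(key, expirySeconds, value)` and `mc.delete(key)`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Cache-aside sketch. Assumption: the in-memory map stands in for memcached,
// and "loader" stands in for a (slow) database query.
class CacheAsideSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader;
    private final AtomicInteger loads = new AtomicInteger();

    CacheAsideSketch(Function<String, String> loader) {
        this.loader = loader;
    }

    // Look in the cache first; only on a miss go to the backing store
    // and remember the result for later requests.
    String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            loads.incrementAndGet();
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Call this after the underlying row changes so stale data is not served.
    void invalidate(String key) {
        cache.remove(key);
    }

    // Number of times the backing store was actually queried.
    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        CacheAsideSketch posts = new CacheAsideSketch(key -> "row-for-" + key);
        System.out.println(posts.get("post:42")); // miss: goes to the loader
        System.out.println(posts.get("post:42")); // hit: served from the cache
        System.out.println("backing store queries: " + posts.loadCount());
    }
}
```

The important design point is the invalidation step: since memcached knows nothing about the database, the application has to delete or overwrite a cached entry whenever the underlying data changes, or accept stale reads until the entry expires.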

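Plenty of resources on installing Memcached are already available online; anyone interested can just Google it, or see the related resources below.

I mainly use "spymemcached", a Java API. If you need an API for another language, see: Memcached Clients.

The following program prints the status of the Memcached server(s):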

import java.net.SocketAddress;
import java.util.Map;

import net.spy.memcached.AddrUtil;
import net.spy.memcached.MemcachedClient;

public class MemcachedTest
{
	public static void main(String[] args) throws Exception
	{
		long totalItems = 0;
		// Replace the placeholder with your memcached server's host:port.
		MemcachedClient mc = new MemcachedClient(AddrUtil.getAddresses("xxx.xxx.xxx.xxx:11211"));

		try
		{
			// getStats() returns one map of statistics per connected server.
			Map<SocketAddress, Map<String, String>> stats = mc.getStats();

			for (Map.Entry<SocketAddress, Map<String, String>> e : stats.entrySet())
			{
				System.out.println("memcached server: " + e.getKey().toString());

				for (Map.Entry<String, String> s : e.getValue().entrySet())
				{
					System.out.println(" - " + s.getKey() + ": " + s.getValue());

					// curr_items is the number of items currently stored on this server.
					if ("curr_items".equals(s.getKey()))
						totalItems += Long.parseLong(s.getValue());
				}
			}

			System.out.println("Total number of items in memcached: " + totalItems);
		}
		finally
		{
			// Always shut the client down, even if a call above fails.
			mc.shutdown();
		}
	}
}

Related news and resources

．Scaling memcached at Facebook

．How to install memcache on Debian Etch

2009-01-13 22:31:18 | Comments (2)

PDFBox - Extracting plain text from PDF files

In Java, Open Source

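PDFBox is an open source Java PDF library that helps with a range of PDF-processing tasks, for example extracting the plain text from a PDF file or converting a PDF to images. (iText would also work, but it does not seem to support plain-text extraction; see "iText in Action", p. 576.)

It is also what is commonly used together with "Lucene" to convert PDFs to plain text before indexing.

The following extracts the plain text from a PDF file (requires FontBox-0.1.0-dev.jar and PDFBox-0.7.3.jar):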

import java.io.IOException;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PDFTextExtractor
{
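    public String extractText(String file) throws IOException
    {
        PDDocument document = PDDocument.load(file);
        try
        {
            PDFTextStripper stripper = new PDFTextStripper();
            // Page numbers are 1-based; extract from the first page through the last.
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());
            return stripper.getText(document);
        }
        finally
        {
            // Close the document so the underlying file is released.
            document.close();
        }
    }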

    public static void main(String[] args)
    {
        PDFTextExtractor extractor = new PDFTextExtractor();
        try
        {
            String text = extractor.extractText("C:\\test.pdf");
            System.out.println(text);
        }catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}

．Glyph & Cog: Text Extraction - explains why extracting plain text from a PDF is not as easy as it sounds

2009-01-03 18:17:22 | Comments (3)

SQLite JDBC

In Java, Open Source

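Since the official SQLite site does not provide a JDBC library, I had to turn to Google. So far I have found two options.

One is "SQLite ODBC Driver". Judging by its name, it would fall under JDBC Type 1 (going through the JDBC-ODBC bridge)... not considering it for now.

The other is "SQLiteJDBC". Hmm, let's use this one to connect to SQLite.

Below is the example provided on its website: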

import java.sql.*;

public class Test {
  public static void main(String[] args) throws Exception {
      Class.forName("org.sqlite.JDBC");
      Connection conn = DriverManager.getConnection("jdbc:sqlite:test.db");
      Statement stat = conn.createStatement();
      stat.executeUpdate("drop table if exists people;");
      stat.executeUpdate("create table people (name, occupation);");
      PreparedStatement prep = conn.prepareStatement(
          "insert into people values (?, ?);");

      prep.setString(1, "Gandhi");
      prep.setString(2, "politics");
      prep.addBatch();
      prep.setString(1, "Turing");
      prep.setString(2, "computers");
      prep.addBatch();
      prep.setString(1, "Wittgenstein");
      prep.setString(2, "smartypants");
      prep.addBatch();

      conn.setAutoCommit(false);
      prep.executeBatch();
      conn.setAutoCommit(true);

      ResultSet rs = stat.executeQuery("select * from people;");
      while (rs.next()) {
          System.out.println("name = " + rs.getString("name"));
          System.out.println("job = " + rs.getString("occupation"));
      }
      rs.close();
      conn.close();
  }
}
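The `Class.forName("org.sqlite.JDBC")` call in the example throws `ClassNotFoundException` when the driver jar is not on the classpath, which is the most common reason the example fails to start. A small check like the following can make that failure easier to diagnose. It is only a sketch: the class and method names are hypothetical, and the only thing taken from the example above is the driver class name.

```java
// Sketch: check whether a JDBC driver class can be loaded before using it.
// "DriverCheckSketch" and "isClassAvailable" are illustrative names only.
class DriverCheckSketch {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "org.sqlite.JDBC"; // driver class name from the example above
        if (isClassAvailable(driver)) {
            System.out.println(driver + " found; DriverManager can use jdbc:sqlite: URLs.");
        } else {
            System.out.println(driver + " not found; add the SQLiteJDBC jar to the classpath.");
        }
    }
}
```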

2008-12-08 02:18:00 | Add Comment

Katta - distribute lucene indexes in a grid

In Open Source, Hadoop, Lucene, Katta

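In the field of information retrieval, "Lucene" is arguably the most outstanding full-text search library; naturally, this site's search feature is built with "Lucene" as well.

Still, as handy as Lucene is... what if the index gets too big? What if the index is larger than an entire hard disk? Does the work need to be load-balanced across several machines?...

"Katta" aims to address exactly these problems. It is built on top of "Hadoop" and "Zookeeper", is released under the Apache License, Version 2.0, and "katta-0.1.0" was released on September 17 of this year.

The figure above shows the test results I got by following "katta : Getting started" in a Cygwin environment. In short, it is yet another project worth watching; I hope it follows in Zookeeper's footsteps and joins the Apache projects.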
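The general idea behind searching an index that is too large for one machine is to split it into shards, search each shard separately, and merge the per-shard hits into one ranked list. The sketch below shows only that merge step in plain Java. It is a conceptual illustration, not Katta's API: the `Hit` class, the `mergeTopK` method, and the scores are all made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch of merging search results from several index shards.
// Assumption: each shard has already returned its own hits with scores.
class ShardMergeSketch {

    // One search hit: a document id plus its relevance score.
    static final class Hit {
        final String docId;
        final double score;

        Hit(String docId, double score) {
            this.docId = docId;
            this.score = score;
        }

        @Override
        public String toString() {
            return docId + " (" + score + ")";
        }
    }

    // Combine the hits of all shards and keep the k highest-scoring ones.
    static List<Hit> mergeTopK(List<List<Hit>> perShardHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShardHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shardA = Arrays.asList(new Hit("a1", 0.9), new Hit("a2", 0.2));
        List<Hit> shardB = Arrays.asList(new Hit("b1", 0.5));
        System.out.println(mergeTopK(Arrays.asList(shardA, shardB), 2));
    }
}
```

One caveat with this approach: scores from different shards are only comparable if every shard computes them in a consistent way, which is something a real distributed search layer has to take care of.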

What is a katta?

．Ring-tailed lemur

Related resources

．How Katta works.

．Install and configure Katta

．find23.net: katta-overview(pdf)

．find23.net: katta, pig and hadoop in production - experience report slides

2008-09-29 19:11:27 | Comments (2)

cpdetector - Character encoding detection (Java)

In Java, Open Source

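"cpdetector, free java code page detection." is another encoding-detection solution for Java; it also includes Mozilla's chardet (jchardet).

In addition, according to the encoding tests of cpdetector in "Shared Development: Character encoding detection", its results are indeed quite impressive. Give it a try if you need it.

Sample program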

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.Charset;

import cpdetector.io.CodepageDetectorProxy;
import cpdetector.io.HTMLCodepageDetector;
import cpdetector.io.JChardetFacade;

public class CPdetector
{
	private static CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
	
	static
	{
		detector.add(new HTMLCodepageDetector(false));
		detector.add(JChardetFacade.getInstance());
	}
	
	public String getEncoding(File f)throws Exception
	{
		return getEncoding(f.toURI().toURL());
	}
	public String getEncoding(URL url)throws IOException
	{
		Charset charset = detector.detectCodepage(url);

		if (charset != null)
			return charset.name();

		return null;
	}

	public static void main(String[] args)
	{
		CPdetector detector = new CPdetector();
		try
		{
			String encoding = detector.getEncoding(new File("Big5.txt"));
			System.out.println("encoding:"+encoding);
			encoding = detector.getEncoding(new URL("http://www.google.com.tw"));
			System.out.println("encoding:"+encoding);
		}catch(Exception e)
		{
			e.printStackTrace();
		}
	}
}

Sample output:

encoding:UTF-8
encoding:Big5
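To give a feel for what encoding detection involves, here is a deliberately naive sketch that uses only the JDK: it tries a list of candidate charsets in order and returns the first one under which the bytes decode without any malformed or unmappable input. This is not how cpdetector or jchardet work internally (they use more sophisticated heuristics), and the class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Naive encoding guess by strict decoding. Illustrative only; not cpdetector's algorithm.
class StrictDecodeSketch {

    // True if the bytes decode under the charset with no malformed or unmappable input.
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the name of the first candidate that decodes cleanly, or null if none does.
    // Candidates not supported by the running JVM are skipped.
    static String guessEncoding(byte[] data, String... candidates) {
        for (String name : candidates) {
            if (!Charset.isSupported(name)) {
                continue;
            }
            Charset charset = Charset.forName(name);
            if (decodesCleanly(data, charset)) {
                return charset.name();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes and encoded as UTF-8.
        byte[] utf8 = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);
        System.out.println("encoding:" + guessEncoding(utf8, "US-ASCII", "UTF-8", "Big5"));
    }
}
```

The order of the candidates matters: strict encodings such as US-ASCII and UTF-8 should come first, because permissive ones (ISO-8859-1 accepts every byte sequence) will "match" almost anything. That limitation is exactly why dedicated detectors like cpdetector are worth using.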

2008-09-29 14:55:43 | Add Comment
