2008 September

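JavaScript - Highlighting text with Range

I wonder how many people are in the habit of marking up books with a highlighter?

Suppose we now need to bring that same act of highlighting passages in a book over to the Web. OK, so how do we do it?

To get at the text a user has selected on a web page, we inevitably need the help of the "Range" object.

So what approach will work across IE, Firefox, Chrome and the other browsers?

Here is my approach, for reference: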

<html>
<head>
<script>
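function labelText()
{
	// Wrapper element that gives the selected text a highlighter-style background.
	var node = document.createElement("span");
	node.style.backgroundColor = 'yellow';
	var range;
	
	if(document.selection)
	{
		// Legacy IE: TextRange API.
		range = document.selection.createRange();
		if(!range.htmlText) return; // plain click, nothing selected
		var container = document.createElement("div");
		container.appendChild(node);
		node.innerHTML = range.htmlText;
		// Replace the selection with the same markup wrapped in the span.
		range.pasteHTML(container.innerHTML);
	
	}else{
		// Firefox, Chrome and other browsers: W3C DOM Range API.
		var selection = window.getSelection();
		if(selection.rangeCount === 0 || selection.isCollapsed) return; // plain click
		range = selection.getRangeAt(0);
		
		// Throws if the range only partially selects a non-text node.
		range.surroundContents(node);
	}
}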
</script>
</head>
<body onmouseup="labelText()">
This is a test paragraph.<br/>
This is a test paragraph.<br/>
This is a test paragraph.<br/>
This is a test paragraph.<br/>
This is a test paragraph.<br/>
</body>
</html>

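Result:

Judging from the result, there is no major problem. But what happens if the text to be highlighted is marked up with <p> (paragraph) elements?

Indeed, this approach still runs into problems in Firefox and Chrome in that case. When a selection starts in one <p> and ends in another, the range only partially contains those elements, and range.surroundContents() throws an exception in that situation. I'll leave that for a later post.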

References

．JavaScript Rangeの使い方の差分

．range.surroundContents - MDC

．百度空间发帖快捷键设置代码高亮

．Rich HTML editing in the browser: part 1

．Inserting text into Firefox rich text editor - Jeff's Junk

．Document Object Model Range

2008-09-30 22:45:59 | Add Comment

Katta - distribute lucene indexes in a grid

In Open Source, Hadoop, Lucene, Katta

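In the field of information retrieval, "Lucene" is arguably the most outstanding full-text search library, and naturally the search feature on this site is built with Lucene.

But however handy Lucene is, what if the index gets too big? What if the index is larger than an entire hard disk? Do we then need load balancing to spread the work?

"Katta" sets out to address exactly this problem. It is built on top of "Hadoop" and "Zookeeper", is released under the Apache License, Version 2.0, and shipped "katta-0.1.0" on September 17 of this year.

The screenshot above shows the test results I got by following "katta : Getting started" in a Cygwin environment. In short, this is another project worth watching, and I hope it follows in Zookeeper's footsteps and becomes part of Apache.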
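To make the "index too big for one disk" problem a bit more concrete, here is a minimal Python 3 sketch of the general idea of splitting an index into shards, querying each shard, and merging the partial results. This is not Katta's API. The shard contents, the term-count scoring, and the names SHARDS, search_shard and search_all are all made up for illustration.

```python
import heapq

# Hypothetical toy "shards": each maps a document id to its text.
# In a real deployment each shard would be a separate Lucene index on its own node.
SHARDS = [
    {"doc1": "hadoop distributed file system", "doc2": "lucene full text search"},
    {"doc3": "katta distributes lucene indexes", "doc4": "zookeeper coordination"},
]


def search_shard(shard, term):
    """Score each document in one shard by how often the term occurs."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return hits


def search_all(shards, term, top_k=10):
    """Query every shard, then merge the partial hit lists into one ranking."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    # Highest score first; equal scores fall back to comparing document ids.
    return heapq.nlargest(top_k, all_hits)


print(search_all(SHARDS, "lucene"))
```

In this toy version the shards are searched one after another in a single process. The point of a distributed setup is that each shard can live on a different machine and be searched in parallel.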

What is a katta?

．Ring-tailed lemur

Related resources

．How Katta works.

．Install and configure Katta

．find23.net: katta-overview(pdf)

．find23.net: katta, pig and hadoop in production - experience report slides

2008-09-29 19:11:27 | Comments (2)

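cpdetector - Encoding detection (Java)

In Java, Open Source

"cpdetector, free java code page detection." is another encoding detection solution, this one for Java. It also bundles Mozilla's chardet (jchardet).

In addition, according to the encoding tests run against cpdetector in "Shared Development: Character encoding detection", its results are indeed impressive. Give it a try if you need this.

Sample code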

import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.nio.charset.Charset;

import cpdetector.io.CodepageDetectorProxy;
import cpdetector.io.HTMLCodepageDetector;
import cpdetector.io.JChardetFacade;

public class CPdetector
{
	// Shared proxy that delegates to the detectors registered below.
	private static CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
	
	static
	{
		// Register the HTML detector first, then the jchardet-based one.
		detector.add(new HTMLCodepageDetector(false));
		detector.add(JChardetFacade.getInstance());
	}
	
	// Detect the encoding of a local file by handing its URL to the proxy.
	public String getEncoding(File f)throws IOException
	{
		return getEncoding(f.toURI().toURL());
	}

	// Returns the charset name, or null if no detector could decide.
	public String getEncoding(URL url)throws IOException
	{
		Charset charset = detector.detectCodepage(url);

		if (charset != null)
			return charset.name();

		return null;
	}

	public static void main(String[] args)
	{
		CPdetector cp = new CPdetector();
		try
		{
			// One local file and one remote page.
			String encoding = cp.getEncoding(new File("Big5.txt"));
			System.out.println("encoding:"+encoding);
			encoding = cp.getEncoding(new URL("http://www.google.com.tw"));
			System.out.println("encoding:"+encoding);
		}catch(Exception e)
		{
			e.printStackTrace();
		}
	}
}

Sample output:

encoding:UTF-8
encoding:Big5
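The sample registers several detectors on one proxy. Assuming the proxy simply asks each registered detector in turn and takes the first answer that is not null, the pattern can be sketched in a few lines of Python 3. This is only an illustration of that idea, not cpdetector's code. The two toy detectors and the names detect_bom, detect_strict_utf8 and detect_first are hypothetical.

```python
import codecs


def detect_bom(raw):
    """Return an encoding name if the data starts with a known byte order mark."""
    if raw.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16"
    return None


def detect_strict_utf8(raw):
    """Return "UTF-8" if the bytes decode cleanly as UTF-8, otherwise None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "UTF-8"


def detect_first(raw, detectors):
    """Ask each detector in turn and return the first non-None answer."""
    for detector in detectors:
        encoding = detector(raw)
        if encoding is not None:
            return encoding
    return None


# Cheap, certain checks go first; more speculative ones go later.
DETECTORS = [detect_bom, detect_strict_utf8]

print(detect_first("中文".encode("utf-8"), DETECTORS))
print(detect_first("中文".encode("big5"), DETECTORS))
```

Neither toy detector knows anything about Big5, so the second call ends up with no answer. That gap is exactly where a statistical detector such as jchardet would be added at the end of the chain.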

2008-09-29 14:55:43 | Add Comment

Universal Encoding Detector - Encoding detection (Python)

In Open Source, Python

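When do we need to do "encoding detection"? The most obvious example is the browser. Suppose our web page does not include a line such as "<meta http-equiv="Content-Type" content="text/html;charset=utf-8">". Is the browser still able to work out which encoding the page uses?

Another example: when we write a crawler to crawl web pages, how do we know the encoding of those pages once we have downloaded them?

So "encoding detection" is a necessary step before processing text, and "Universal Encoding Detector" is a fine tool for the job. It is open source too, of course. It targets Python, though; for other solutions, see the related resources at the end of this post!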
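Before guessing, it is worth checking whether the page declares its own encoding. The following is a rough Python 3 sketch of that check, using only the standard library. The regular expression, the 1024-byte scan window, and the name declared_charset are my own assumptions for illustration, and real browsers follow a far more detailed procedure.

```python
import re

# Matches charset=... inside a <meta> tag, for example:
# <meta http-equiv="Content-Type" content="text/html;charset=utf-8">
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def declared_charset(raw, scan_bytes=1024):
    """Return the charset named in a <meta> tag, or None if nothing is declared."""
    match = META_CHARSET.search(raw[:scan_bytes])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"></head></html>'
print(declared_charset(page))
print(declared_charset(b"<html><body>no declaration here</body></html>"))
```

Only when this returns None does statistical detection have to take over.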

Universal Encoding Detector

The current version of "Universal Encoding Detector" is 1.0.1, and it has to be installed on your machine before you can use it.

Download: chardet-1.0.1.tgz

Installation steps:

tar zxvf chardet-1.0.1.tgz
cd chardet-1.0.1
python setup.py build
python setup.py install

Next, write a simple test program:

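# Python 2 script: urllib2, raw_input and the print statement are Python 2 only.
import urllib2, chardet

if __name__ == "__main__":
	# Download the raw, undecoded bytes of a page.
	urlread = lambda url: urllib2.urlopen(url).read()
	while True:
		url = raw_input('Please enter a url: ')
		if url == 'q':
			break
		# detect() returns a dict with the guessed encoding and a confidence value.
		print chardet.detect(urlread(url))
	print 'Done'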

Test output:

Please enter a url: http://www.google.com.cn
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
Please enter a url: http://blog.ring.idv.tw
{'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
Please enter a url: http://www.cnn.com
{'confidence': 1.0, 'encoding': 'ascii'}
Please enter a url: q
Done
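Once we have a result dict like the ones printed above, the next step is usually to decode the downloaded bytes with it. Here is a small Python 3 sketch of that step. It does not import chardet and only assumes a dict with 'encoding' and 'confidence' keys, as in the output above. The 0.5 threshold, the UTF-8 fallback, and the name decode_detected are assumptions for illustration.

```python
def decode_detected(raw, result, min_confidence=0.5, fallback="utf-8"):
    """Decode raw bytes using a detection result shaped like the dicts above."""
    encoding = result.get("encoding")
    confidence = result.get("confidence", 0.0)
    # Fall back when no encoding was guessed or the guess is too weak.
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw.decode(encoding, errors="replace")


sample = "中文".encode("gb2312")
print(decode_detected(sample, {"confidence": 0.99, "encoding": "GB2312"}))
print(decode_detected(b"plain ascii", {"confidence": 0.0, "encoding": None}))
```

An encoding name that Python does not know would still raise LookupError here, so a crawler would probably want to catch that as well.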

Related resources

．中文編碼偵測 || William's Blog

．A composite approach to language/encoding detection

．大步向前走: Programming 自動偵測編碼

．Shared Development: Character encoding detection

．Java port of Mozilla charset detector

．cpdetector, free java code page detection.

2008-09-29 03:11:15 | Add Comment