Adding synonyms to ik in Elasticsearch

Posted on 2016-04-06

Configure synonym.txt

Create an analysis directory under config, then create a synonym.txt file inside it with the following content:

beijing,北京,帝都
上海,魔都

Configure elasticsearch.yml

Add the following to elasticsearch.yml:

index:
    analysis:
        filter:
            my_synonym:
                type: synonym
                synonyms_path: analysis/synonym.txt
        analyzer:
            ik_smart_syno:
                type: custom
                tokenizer: ik_smart
                filter: [my_synonym]
            ik_max_word_syno:
                type: custom
                tokenizer: ik_max_word
                filter: [my_synonym]

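Test

Create an index with curl -XPUT 'localhost:9200/test?pretty', then open http://localhost:9200/test/_analyze?analyzer=ik_max_word_syno&text=上海外滩. If the synonym filter is loaded, the response should list 魔都 alongside 上海.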
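
When the request is sent from code instead of being pasted into a browser, the Chinese text in the query string has to be percent-encoded first. Below is a minimal sketch, assuming the same host, index, and analyzer name as above. The class and method names (AnalyzeUrl, build) are made up for illustration, and the code only builds the URL; it does not contact Elasticsearch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class AnalyzeUrl {
    // Builds the _analyze URL for the given index, analyzer and text.
    // Hypothetical helper: host and port are the ones used in this post.
    static String build(String index, String analyzer, String text) {
        try {
            // Percent-encode as UTF-8 so non-ASCII characters survive.
            return "http://localhost:9200/" + index + "/_analyze"
                    + "?analyzer=" + URLEncoder.encode(analyzer, "UTF-8")
                    + "&text=" + URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The text is 上海外滩, written with Unicode escapes so the
        // source file compiles regardless of its encoding.
        System.out.println(build("test", "ik_max_word_syno",
                "\u4e0a\u6d77\u5916\u6ee9"));
    }
}
```

The printed URL can then be passed to curl or any HTTP client.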

Contact the author

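Importing data from MySQL into Solr

By robinjia

Posted on 2016-04-04

I planned to build a search service with Solr, and the company's data lives in a MySQL database, so I looked through the documentation and found DataImportHandler (see https://wiki.apache.org/solr/DataImportHandler). This post uses importing WordPress data as the example.

Create data-config.xml in the conf directory

The content of data-config.xml is: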

<dataConfig>
  <dataSource type="JdbcDataSource" 
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/blog" 
              user="blog" 
              password="12345678"/>
  <document>
    <entity name="post" pk="ID"
            query="select ID,post_title,post_content from wp_posts where post_status='publish'"
            deltaImportQuery="select ID,post_title,post_content from wp_posts where ID='${dih.delta.ID}'"
            deltaQuery="select ID from wp_posts where post_status='publish' and post_modified_gmt > '${dih.last_index_time}'">
      <field column="ID" name="id"/>
      <field column="post_title" name="title"/>
      <field column="post_content" name="content"/>
    </entity>
  </document>
</dataConfig>

Configure schema.xml

 <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="title" type="text_general" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="content" type="text_general" indexed="true" stored="true" required="true" multiValued="false" />

Modify solrconfig.xml

Add the following to solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
This avoids the solr.Dataimport Class not found error.

Add the MySQL JDBC driver

Put mysql-connector-java-5.1.38.jar in server/lib. I downloaded 5.1.38, but other versions work as well.

Create core.properties

Create a core.properties file in the blog directory with the following content:

#Written by CorePropertiesLocator
#Wed Mar 23 10:55:00 UTC 2016
numShards=1
collection.configName=blog
#name=blog_shard1_replica1
shard=shard1
collection=blog
coreNodeName=core_node1

Start Solr

Start Solr with bin/solr start -s server/solr/blog.

Run a full import

Request http://127.0.0.1:8983/solr/blog/dataimport?command=full-import

Run a delta import

Request http://127.0.0.1:8983/solr/blog/dataimport?command=delta-import

Problems encountered

nohup: can’t detach from console: Inappropriate ioctl for device

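I ran into this one while setting up SolrCloud, but it is worth mentioning here. It appeared when starting ZooKeeper. According to posts online, the cause was that I had started it inside tmux, so I opened a new terminal and started ZooKeeper there, and this time it started normally.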

/Users/long/program/java/solr-5.5.0/solr/server/logs/solr.log: No such file or directory

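This error appeared when running bin/solr start -s server/solr/blog, for no obvious reason. I guessed that, once again, the script could not be run inside tmux, so I opened a new terminal and ran it again, and this time it started normally.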

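Starting Elasticsearch in IntelliJ

By robinjia

Posted on 2016-04-03

Sometimes it is really frustrating: I want to do custom development on Solr and Elasticsearch, but I have no idea how to launch them from Eclipse or IntelliJ. The official sites say nothing, so the only options are searching the web or figuring it out myself, and searching takes a lot of time. Couldn't they just document this on the official site? Here are the problems I ran into. I currently use IntelliJ for Java development, so only the IntelliJ case is recorded.

Download the source

The official site does not offer a source download, so I went to the GitHub repository. I tried git clone -b 2.3 https://github.com/elastic/elasticsearch.git, but that gave me 2.3.1, and I struggled with how to get 2.3.0. In the end I asked a former colleague from the search team, who told me it can be downloaded from https://github.com/elastic/elasticsearch/releases.

Main entry point

Looking at the elasticsearch script shows that the entry point is org.elasticsearch.bootstrap.Elasticsearch.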

path.home is not configured

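See the post elasticsearch2.0源码在开发环境eclipse中启动的问题及解决方案 (problems and fixes when starting the elasticsearch 2.0 source in an Eclipse development environment).

Check the arguments that the ./elasticsearch script adds at startup, and set VM options to: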

-Xms256m -Xmx1g -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Djna.nosys=true -Des.path.home=/Users/long/elasticsearch

The important part is es.path.home; the directory can be anywhere. Set Program arguments to start.

“java.lang.IllegalStateException” jar hell!

See https://github.com/elastic/elasticsearch/pull/13465:

I stripped the SDK classpath in IntelliJ down to the default sun.boot.class.path and I am not seeing jar hell failures anymore. Specifically:

jre/lib/charsets.jar
jre/lib/jce.jar
jre/lib/jfr.jar
jre/lib/jsse.jar
jre/lib/resources.jar
jre/lib/rt.jar

Only then did I remember that IntelliJ adds many jars to the Classpath when a JDK is imported. Go to File->Other Settings->Default Project Structure and change the JDK's Classpath to:

jre/lib/charsets.jar
jre/lib/jce.jar
jre/lib/jfr.jar
jre/lib/jsse.jar
jre/lib/resources.jar
jre/lib/rt.jar

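The config directory cannot be found

Create a config directory under /Users/long/program/java/elasticsearch-2.3.0/core and copy into it the config directory from the official Elasticsearch binary distribution.

After that, launching org.elasticsearch.bootstrap.Elasticsearch succeeds.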

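Implementing the second page of search results

By robinjia

Posted on 2016-03-16

In a search engine, the first page of results can be obtained with a heap. The post 最小的K个数 (the K smallest numbers) has an example of this; here it has to be changed from the K smallest to the K largest. That is, build a min-heap of size K from the first K elements, then compare each later element with the heap top. If it is smaller than the top, it cannot be one of the K largest; if it is larger, replace the top with it and restore the min-heap. The K elements left at the end are the K largest, and the heap top is the smallest of them. Then pop the top to get the smallest of these K elements, restore the heap, pop again to get the second smallest, and so on until the heap is empty. Reading the popped elements in reverse order gives the page from best to worst.

The second page works the same way. With K elements per page, build a min-heap of size 2K and use the same procedure to get the 2K largest elements. Ordered from largest to smallest, the last K of these 2K elements are the second page.

The mergeIds function in Solr's QueryComponent.java does it this way.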
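
The procedure can be sketched with java.util.PriorityQueue, which is a min-heap by default. This is not Solr's mergeIds code, only a small illustration of the idea on plain integer scores. The class and method names (SecondPage, page) are made up, and it assumes k and page are at least 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class SecondPage {
    // Returns the given page (1-based) of scores in descending order,
    // keeping at most page * k candidates in a min-heap.
    static List<Integer> page(int[] scores, int k, int page) {
        int capacity = page * k;
        if (capacity <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int s : scores) {
            if (heap.size() < capacity) {
                heap.offer(s);
            } else if (s > heap.peek()) {
                heap.poll();   // drop the smallest candidate
                heap.offer(s); // the heap restores its order itself
            }
        }
        // Popping yields ascending order; reverse it for best-first order.
        List<Integer> best = new ArrayList<>();
        while (!heap.isEmpty()) {
            best.add(heap.poll());
        }
        Collections.reverse(best);
        int from = Math.min((page - 1) * k, best.size());
        int to = Math.min(page * k, best.size());
        return new ArrayList<>(best.subList(from, to));
    }

    public static void main(String[] args) {
        int[] scores = {5, 1, 9, 3, 7, 8, 2, 6};
        System.out.println(page(scores, 3, 1)); // [9, 8, 7]
        System.out.println(page(scores, 3, 2)); // [6, 5, 3]
    }
}
```

The heap never holds more than page * k elements, so memory grows with the page number rather than with the number of matching documents.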

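Adding markdown to Django

By robinjia

Posted on 2016-02-10

In the post 在Django后台添加markdown编辑器 (adding a markdown editor to the Django admin) I described how to add a markdown editor to the Django admin. I later found a problem with the pagedown added there: line breaks. In markdown a single line break is replaced by a space, but pagedown does not do this. After tracing it, I found that the problem is in pagedown-extra. The fix is to add str = str.replace(/\n/g, " "); in the _FormParagraphs function of pagedown/Markdown.Converter.js, at line 1168, just before the comment //if this is an HTML marker, copy it.

With that, the markdown editor in the admin is done. The front end also needs to render markdown when displaying posts, which can be done with a custom filter.

Install markdown with pip install markdown.

Following the guide on custom template tags and filters, create a templatetags directory inside the app directory, create an __init__.py file in it, and then write my_markdown.py with the following content: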

from django import template
from markdown import markdown
register = template.Library()
@register.filter(name='mark')
def mark(value):
    return markdown(value, extensions=['markdown.extensions.extra', 'markdown.extensions.codehilite'])

Use it in a template:

{% load my_markdown %}
<p>{{ post.content|mark|safe }}</p>

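Writing an Analyzer for Lucene

By robinjia

Posted on 2015-12-23

Some applications need an Analyzer tailored to their own characteristics; this post uses Lucene 5.0 as the example. Many Chinese search applications need to segment text into words, and single-character segmentation is not good enough, so another segmenter is needed. MMSEG is one of them.

I found mmseg4j, written by chenbl, online. After learning how to use mmseg4j, I started writing the Analyzer. The introduction to the Analysis package shows that the main work is implementing a Tokenizer and then calling it from the Analyzer. So I wrote the following MMSegAnalyzer: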

public class MMSegAnalyzer extends Analyzer {
    public MMSegAnalyzer() {
    }
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // TODO Auto-generated method stub
        return new TokenStreamComponents(new MMSegTokenizer());
    }
}

Then I wrote MMSegTokenizer:

public class MMSegTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    Dictionary dic;
    Seg seg;
    MMSeg mmSeg;

    public MMSegTokenizer() {
        dic = Dictionary.getInstance();
        seg = new ComplexSeg(dic);
        mmSeg = new MMSeg(input, seg);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // TODO Auto-generated method stub
        Word word = null;
        while((word = mmSeg.next())!=null) {
            termAtt.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
            return true;
        }
        return false;
    }
    @Override
    public void close() throws IOException {
        super.close();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
    }
}

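Here,

private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

these two attributes set the token's text and its offsets in the input.
I then tested with AnalyzerDemo.java from chapter 4 of Lucene in Action (2nd edition), and it threw java.lang.IllegalStateException: TokenStream contract violation.
Reading the TokenStream class showed that reset is called before incrementToken and mainly does initialization work. I guessed that MMSeg had some initialization left undone, and looking at the MMSeg class I found a reset function that does exactly that.
So I changed the reset function of MMSegTokenizer as follows: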

public class MMSegTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    Dictionary dic;
    Seg seg;
    MMSeg mmSeg;

    public MMSegTokenizer() {
        dic = Dictionary.getInstance();
        seg = new ComplexSeg(dic);
        mmSeg = new MMSeg(input, seg);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        // TODO Auto-generated method stub
        Word word = null;
        while((word = mmSeg.next())!=null) {
            termAtt.copyBuffer(word.getSen(), word.getWordOffset(), word.getLength());
            offsetAtt.setOffset(word.getStartOffset(), word.getEndOffset());
            return true;
        }
        return false;
    }
    @Override
    public void close() throws IOException {
        super.close();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        mmSeg.reset(input);
    }
}

MMSegAnalyzer can now segment text. Reading the mmseg4j implementation afterwards, I realized that implementing an efficient MMSEG segmenter is no easy task.
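
A likely reason the fix works, as far as I understand Lucene's Tokenizer: until reset() runs, the input field holds a placeholder reader that throws the contract-violation exception, and the real reader only becomes visible through input after super.reset(). MMSeg built in the constructor therefore captured the placeholder. The following is a simplified stand-alone model of that behaviour, not Lucene's or mmseg4j's code; every class and method name in it (ResetContractModel, ModelTokenizer, ModelSegmenter, firstChar) is hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ResetContractModel {
    // Placeholder reader that fails on use, standing in for a
    // tokenizer's input before reset() has been called.
    static final Reader NOT_READY = new Reader() {
        @Override
        public int read(char[] buf, int off, int len) {
            throw new IllegalStateException("contract violation: reset() missing");
        }
        @Override
        public void close() { }
    };

    // Hypothetical stand-in for a segmenter such as MMSeg: it keeps
    // whichever reader it was last given.
    static class ModelSegmenter {
        private Reader reader;
        ModelSegmenter(Reader reader) { this.reader = reader; }
        void reset(Reader reader) { this.reader = reader; }
        int nextChar() throws IOException { return reader.read(); }
    }

    // Hypothetical stand-in for a Tokenizer subclass.
    static class ModelTokenizer {
        Reader input = NOT_READY;
        private Reader pending = NOT_READY;
        final ModelSegmenter segmenter;
        private final boolean fixApplied;

        ModelTokenizer(boolean fixApplied) {
            this.fixApplied = fixApplied;
            segmenter = new ModelSegmenter(input); // captures NOT_READY
        }
        void setReader(Reader r) { pending = r; }
        void reset() {
            input = pending; // the real reader becomes visible here
            if (fixApplied) {
                segmenter.reset(input); // hand the real reader over
            }
        }
    }

    // Returns the first character read, or -2 if the placeholder fired.
    static int firstChar(boolean fixApplied) {
        ModelTokenizer t = new ModelTokenizer(fixApplied);
        t.setReader(new StringReader("abc"));
        t.reset();
        try {
            return t.segmenter.nextChar();
        } catch (IllegalStateException e) {
            return -2;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstChar(false)); // -2: segmenter still holds the placeholder
        System.out.println(firstChar(true));  // 97: the character 'a'
    }
}
```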

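Creating threads in Java

By robinjia

Posted on 2015-12-21

There are two ways to create a thread in Java: implement the Runnable interface, or extend the Thread class.

Implementing the Runnable interface

Move the task code into the run method of a class that implements Runnable:

class MyRunnable implements Runnable {
    public void run() {
        // task code
    }
}

Create an instance of the class:
Runnable r = new MyRunnable();
Create a Thread object from the Runnable:
Thread t = new Thread(r);

Start the thread:
t.start();
A complete example: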

public class MyRunnable implements Runnable {
    @Override
    public void run() {
        System.out.println("Child: " + Thread.currentThread().getId());
        
    }
    public static void main(String[] arg) {
        for (int i = 0; i < 5; i++) {
            Runnable r = new MyRunnable();
            Thread t = new Thread(r);
            t.start();
        }
        System.out.println("Parent: " + Thread.currentThread().getId());
    }
}

The output is as follows (the order can differ from run to run):
Child: 8
Child: 9
Child: 10
Child: 11
Parent: 1
Child: 12
run must not be called directly here, because that does not create a new thread:

public class MyRunnable implements Runnable {
    @Override
    public void run() {
        System.out.println("Child: " + Thread.currentThread().getId());
    }
    public static void main(String[] arg) {
        for (int i = 0; i < 5; i++) {
            Runnable r = new MyRunnable();
//          Thread t = new Thread(r);
//          t.start();
            r.run();
        }
        System.out.println("Parent: " + Thread.currentThread().getId());
    }
}

The output is as follows:
Child: 1
Child: 1
Child: 1
Child: 1
Child: 1
Parent: 1
As you can see, the ids are all the same, so no new thread was created.

Extending the Thread class

class MyThread extends Thread {
    public void run() {
        // task code
    }
}

A complete example:

public class MyThread extends Thread {
    @Override
    public void run() {
        System.out.println("Child: " + Thread.currentThread().getId()); 
    }
    public static void main(String[] arg) {
        for (int i = 0; i < 5; i++) {
            Thread t = new MyThread();
            t.start();
        }
        System.out.println("Parent: " + Thread.currentThread().getId());
    }
}

The output is as follows:
Child: 8
Child: 9
Child: 10
Child: 11
Parent: 1
Child: 12
t.run() must not be called directly here, because that does not create a new thread:

public class MyThread extends Thread {
    @Override
    public void run() {
        System.out.println("Child: " + Thread.currentThread().getId()); 
    }
    public static void main(String[] arg) {
        for (int i = 0; i < 5; i++) {
            Thread t = new MyThread();
//          t.start();
            t.run();
        }
        System.out.println("Parent: " + Thread.currentThread().getId());
    }
}

The result is as follows:
Child: 1
Child: 1
Child: 1
Child: 1
Child: 1
Parent: 1
As you can see, the ids are all the same.
The source of the Thread class shows why.

public synchronized void start() {
    /**
     * This method is not invoked for the main method thread or "system"
     * group threads created/set up by the VM. Any new functionality added
     * to this method in the future may have to also be added to the VM.
     *
     * A zero status value corresponds to state "NEW".
     */
    if (threadStatus != 0)
        throw new IllegalThreadStateException();

    /* Notify the group that this thread is about to be started
     * so that it can be added to the group's list of threads
     * and the group's unstarted count can be decremented. */
    group.add(this);

    boolean started = false;
    try {
        start0();
        started = true;
    } finally {
        try {
            if (!started) {
                group.threadStartFailed(this);
            }
        } catch (Throwable ignore) {
            /* do nothing. If start0 threw a Throwable then
              it will be passed up the call stack */
        }
    }
}

private native void start0();

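The start method calls the native method start0(). Its implementation is not visible here, but presumably it creates the new thread, which then calls run. The run method itself contains no thread-creation code: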

public void run() {
    if (target != null) {
        target.run();
    }
}
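
The threadStatus check at the top of start() can also be observed directly: a Thread object can only be started once, and a second call throws IllegalThreadStateException. Below is a minimal sketch; the class and method names (StartTwice, secondStartRejected) are made up for illustration.

```java
public class StartTwice {
    // Starts the same Thread object twice and reports whether the
    // second call was rejected.
    static boolean secondStartRejected() {
        Thread t = new Thread(new Runnable() {
            @Override
            public void run() {
                System.out.println("Child: " + Thread.currentThread().getId());
            }
        });
        t.start();
        try {
            t.join(); // wait until the child has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        try {
            t.start(); // the thread is no longer in state "NEW"
            return false;
        } catch (IllegalThreadStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("Second start rejected: " + secondStartRejected());
    }
}
```

Calling run() a second time, by contrast, raises nothing, because it is an ordinary method call on the current thread.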

For the difference between the two approaches, see http://stackoverflow.com/questions/541487/implements-runnable-vs-extends-thread. Implementing the Runnable interface is the recommended approach.

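Installing the JDK source

By robinjia

Posted on 2015-12-21

Reading the JDK source is essential for getting better at Java, so the source needs to be installed. Fortunately the process is simple.

The JDK directory (for example /home/long/jdk1.7.0_80) contains a src.zip file, which holds the JDK source. Install it as follows:

Enter the JDK directory
cd /home/long/jdk1.7.0_80
Create a src subdirectory
mkdir src
Enter the src subdirectory
cd src
Unpack the JDK source
unzip ../src.zip

After this, holding Ctrl and clicking a class name in Eclipse shows its source.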

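Finding the minimum

By robinjia

Posted on 2015-12-12

Problem

An array is arranged in circular order: there is some index i such that x[0] < x[1] < x[2] < … < x[i - 1] and x[i] < x[i + 1] < … < x[n - 1] < x[0]. For example, the 6 elements 8, 10, 14, 15, 2, 6 are in circular order: they increase starting from 2, wrap around from the last element to the first, and keep increasing. In other words, if x[i], x[i + 1], …, x[n - 1] are taken out and attached to the front of the array, the result is sorted in ascending order (isn't this just a rotation?). Write a program that takes an array in circular order and finds its minimum. For the data above, the program should output 2.

Notes

Because the order is increasing from x[0] until the minimum appears, where it breaks immediately, many people quickly come up with this approach:
for (i = 1; i < n && x[i] >= x[i - 1]; i++);
Once the loop stops, if i equals n then every element is greater than the one before it, so the minimum is x[0]; if i < n, then x[i] < x[i - 1], so the minimum is x[i].
This is correct but not efficient enough, because in the worst case it takes n - 1 comparisons. Strictly speaking, though, the array still has order, and that property should allow a better, faster method. Give it a try.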
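
For reference, the linear scan described above can be written as a complete method. This is a small sketch that assumes a non-empty array; the class name LinearMin is made up for illustration.

```java
public class LinearMin {
    // Walks from the left while the order is still increasing. The
    // element where the order breaks is the minimum; if it never
    // breaks, the array is fully sorted and the minimum is x[0].
    static int findMin(int[] x) {
        int n = x.length;
        int i;
        for (i = 1; i < n && x[i] >= x[i - 1]; i++) {
            // nothing to do, the loop condition does the work
        }
        return i == n ? x[0] : x[i];
    }

    public static void main(String[] args) {
        System.out.println(findMin(new int[]{8, 10, 14, 15, 2, 6})); // 2
        System.out.println(findMin(new int[]{1, 2, 3}));             // 1
    }
}
```

It is handy as a baseline for checking the faster solution on random rotated arrays.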

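Solution

The answer is binary search. One might object that the array is not fully sorted, so binary search cannot be used. But as long as the problem can be split into two parts and there is a way to tell which part contains the answer, it is a binary search.

Consider the elements between x[L] and x[R] (both ends included) and take the middle element x[M], M = (R - L) / 2 + L. There are two cases:

x[M] < x[R]: every element of the first increasing run is greater than every element of the second, so M cannot be in the first run while R is in the second. M and R are therefore in the same run, the elements from M to R are increasing, and the minimum cannot be to the right of M. The next R is M.
x[M] >= x[R]: this can only happen when M is in the first increasing run and R is in the second, so the minimum must be to the right of M. The next L is M + 1.

Repeat until L = R; x[L] is then the minimum.

Code

The code is as follows: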

public class MinimumInRotatedSortedArray {
    public static int findMin(int[] nums) {
        int left = 0;
        int right = nums.length - 1;
        int mid = 0;
        while (left < right) {
            mid = (right - left) / 2 + left;
            if (nums[mid] < nums[right]) {
                right = mid;
            } else {
                left = mid + 1;
            }
        }
        return nums[left];
    }
    
    public static void main(String[] args) {
        int[] temp = {6, 7, 1, 2, 3, 4};
        System.out.println(findMin(temp));
    }
}

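Deploying Hexo on a VPS

By robinjia

Posted on 2015-12-11

Hexo is usually deployed to GitHub, but I have a VPS, so why not use it.

For deploying to the VPS, I first planned to run Hexo server behind an Nginx reverse proxy. On reflection that wastes resources, so I found the post 在VPS上部署hexo (deploying hexo on a VPS) online, which serves the generated pages directly from Nginx; that saves resources and is faster. But I still wanted to manage the Hexo code with git, as I did when writing MV小站. I am not familiar with git, though, and did not take notes last time. Then I found the post VPS上(debian8 jessie)部署hexo(Nginx代理+git部署) online, which is exactly what I wanted. See that post for details; only the problems I ran into are recorded here.

SSH key login to the VPS fails

After generating a key pair with ssh-keygen, I copied the contents of the public key id_rsa.pub into authorized_keys on the VPS, but login kept failing. I finally found the answer in the post linux ssh 使用深度解析（key登录详解）: the cause was the permissions of the authorized_keys file, which must be set to 600 for ssh key login to work. The log file /var/log/secure offers some help.

git push is rejected

Running git config receive.denyCurrentBranch ignore on master (in the repository that receives the push) fixes it.

CSS files generated by Hexo are not updated

I am not sure why, but sometimes the CSS is updated and sometimes it is not. So I simply run hexo clean before hexo g. Also, git hooks are very handy.

Adding hooks to the git repository

In the .git/hooks directory, add the following, based on the post-receive script: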

GIT_REPO=/home/dengsl/program/nodejs/blog
DEPLOY_DIR=/home/dengsl/program/html/blog/note

# Get the latest commit subject
SUBJECT=$(git log -1 --pretty=format:"%s")

cd $GIT_REPO
env -i git reset --hard

# Update only, or deploy as well
IF_DEPLOY=$( echo $SUBJECT | grep 'deploy')
if [ -z "$IF_DEPLOY" ]; then
    echo >&2 "Success. Repo update only"
    exit 0;
fi

# Check whether the deploy dir exists
if [ ! -d $DEPLOY_DIR ] ; then
    echo >&2 "fatal: post-receive: DEPLOY_DIR_NOT_EXIST: \"$DEPLOY_DIR\""
    exit 1
fi

# Deploy the static site
hexo clean
hexo g
cp -r public/* $DEPLOY_DIR

Pages can now be published through git, which is fun.