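I had grown so used to databases like MySQL that, the first time I used Elasticsearch (ES), I walked straight into a big pit.
When I posted the mapping, I did not properly distinguish analyzed from not_analyzed on one field, and through that one slip every value in the column ended up tokenized. The data set was roughly 150 million documents.
I naively assumed I could simply alter the field's attributes, as in MySQL. But ES is built on Lucene, and the indexing settings of an existing field cannot be changed in place, so there is no shortcut. Put plainly, you either delete the index and re-import everything, or you reindex. Reindexing means creating a new index and copying the old index's data into it. There are plenty of tutorials on this online, for example: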
http://blog.csdn.net/loveyaqin1990/article/details/77684599
https://www.cnblogs.com/wmx3ng/p/4112993.html
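Before any data is copied, the new index has to exist with the corrected mapping. The sketch below only builds the parameter array for the PHP client's `indices()->create()` call. The helper name `buildReindexTarget` and the field name `my_field` are placeholders, not something from the original setup, and the mapping syntax assumes Elasticsearch 2.x or earlier (on 5.x and later, the equivalent would be a `keyword` field).

```php
<?php
// Hypothetical helper: builds the parameters for $client->indices()->create().
// Assumes Elasticsearch 2.x or earlier, where a string field is kept
// un-tokenized with "index": "not_analyzed".
function buildReindexTarget($sourceIndex, $type, $field)
{
    return array(
        // Same "-reindex" suffix that the migration script writes to.
        'index' => $sourceIndex . '-reindex',
        'body'  => array(
            'mappings' => array(
                $type => array(
                    'properties' => array(
                        $field => array(
                            'type'  => 'string',
                            'index' => 'not_analyzed',
                        ),
                    ),
                ),
            ),
        ),
    );
}

// Usage against a live cluster ("my_field" is a placeholder field name):
// $client->indices()->create(buildReindexTarget('index-01', 'raw', 'my_field'));
```

Creating the target index explicitly matters here: if the bulk request is allowed to auto-create it, ES will infer the mapping dynamically and the field will most likely be analyzed again.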
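As far as I could find, concrete implementation code is scarce online; after a long search I only turned up a Python version. This post gives a migration implementation built on the official ES PHP SDK and the bulk API.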
<?php
require 'vendor/autoload.php';

// A single node, described in the client's extended host format.
$hosts = array(
    array('host' => '127.0.0.1', 'port' => '9200', 'scheme' => 'http'),
);
$client = Elasticsearch\ClientBuilder::create()
    ->setSSLVerification(false)
    ->setHosts($hosts)
    ->build();

// The source indices are named index-01 ... index-10.
for ($i = 1; $i <= 10; $i++) {
    $params = array(
        'index'  => sprintf('index-%02d', $i),
        'type'   => 'raw',
        'scroll' => '120s',
        'size'   => 50000,
        // An empty object, so that the query serializes as {"match_all": {}}.
        'body'   => array('query' => array('match_all' => new \stdClass())),
    );
    echo $params['index'] . "\r\n";

    $response = $client->search($params);
    $step = 1;
    while (isset($response['hits']['hits']) && count($response['hits']['hits']) > 0) {
        echo $step++ . "\t";

        // Index the current batch first, so that the hits returned by the
        // initial search are copied as well, and only then fetch the next page.
        $bulk = array('index' => $params['index'] . '-reindex', 'type' => $params['type']);
        foreach ($response['hits']['hits'] as $val) {
            $bulk['body'][] = array('index' => array('_id' => $val['_id']));
            $bulk['body'][] = $val['_source'];
        }
        // insert reindex
        $res = $client->bulk($bulk);
        unset($bulk);

        // Fetch the next page; an empty page ends the loop.
        $scroll_id = $response['_scroll_id'];
        unset($response);
        $response = $client->scroll(array('scroll_id' => $scroll_id, 'scroll' => '120s'));
    }
}
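The script stores the bulk response in `$res` but never looks at it, so with 150 million documents a rejected batch could go unnoticed. As a minimal sketch, the hypothetical helper below (`collectBulkFailures` is my own name, not part of the SDK) reads the bulk response's top-level `errors` flag and per-action `items` list and returns the documents that failed.

```php
<?php
// Hypothetical helper: lists the documents a bulk request failed to index.
// Relies on the bulk response shape: a top-level "errors" flag and one
// entry per action under "items".
function collectBulkFailures(array $res)
{
    $failed = array();
    if (empty($res['errors'])) {
        return $failed; // nothing was rejected
    }
    foreach ($res['items'] as $item) {
        // Each item is keyed by its action name ("index" in this script).
        $result = reset($item);
        if (isset($result['error'])) {
            $failed[$result['_id']] = $result['error'];
        }
    }
    return $failed;
}
```

In the loop above it could be called right after `$client->bulk($bulk)`, for instance to log the failed `_id` values and retry them later instead of silently losing them.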