msg: the TiDB port 4000 is not up

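Was the cluster stopped before the directory was deleted? If starting it through ansible reports an error, you can try starting it manually: run start.sh to start the tidb-server node, then check the log for anything abnormal. If it starts normally, just let it keep running.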

I took the brute-force route: I rebooted the machines directly and then redeployed the cluster. The log keeps reporting:
 [2020/01/07 15:32:45.854 +08:00] [WARN] [session.go:1113] ["compile SQL failed"] [error="[schema:1146]Table 'mysql.tidb' doesn't exist"] [SQL="SELECT HIGH_PRIORITY VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME=\"bootstrapped\""]
With the error above, I cannot log in with the MySQL client:
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.4' (111)

Please post the exact steps you used to redeploy.

Rebooted all servers
rm -rf  /deploy/*

sudo systemctl stop ntpd.service && \
sudo ntpdate pool.ntp.org && \
sudo systemctl start ntpd.service

ansible-playbook rolling_update.yml 
ansible-playbook deploy.yml
ansible-playbook start.yml
TASK [wait until the TiDB port is up] ****************************************************************************************************************************
fatal: [10.0.0.3]: FAILED! => changed=false 
  elapsed: 300
  msg: the TiDB port 4000 is not up
ERROR MESSAGE SUMMARY ********************************************************************************************************************************************
[10.0.0.3]: Ansible Failed! ==>
  changed=false 
  elapsed: 300
  msg: the TiDB port 4000 is not up

[tidb@tidb1 tidb-ansible]$ ps -ef|grep tidb-server
tidb     29571     1  1 06:53 ?        00:00:00 bin/tidb-server -P 4000 --status=10080 --advertise-address=10.0.0.3 --path=10.0.0.5:2379,10.0.0.6:2379,10.0.0.7:2379 --config=conf/tidb.toml --log-slow-query=/home/tidb/deploy/log/tidb_slow_query.log --log-file=/home/tidb/deploy/log/tidb.log

[tidb@tidb1 tidb-ansible]$ mysql -u root -h 10.0.0.3 -P 4000
ERROR 2003 (HY000): Can't connect to MySQL server on '10.0.0.3' (111)

[tidb@tidb1 tidb-ansible]$ ansible-playbook deploy.yml --tags=pump -l pump1,pump2,pump3
[tidb@tidb1 tidb-ansible]$ ansible-playbook start.yml --tags=pump
[tidb@tidb1 tidb-ansible]$ ansible-playbook rolling_update.yml --tags=tidb

Could you check again following these steps?

Please help look at this problem: when starting the cluster, the TiDB port does not come up, and the log reports
  [error="[kv:8004]Transaction is too large, size: 13541" 


 [2020/01/08 16:15:08.470 +08:00] [INFO] [ddl_worker.go:309] ["[ddl] finish DDL job"] [worker="worker 1, tp general"] [job="ID:4, Type:create schema, State:synced, Sch
    emaState:public, SchemaID:3, TableID:0, RowCount:0, ArgLen:0, start time: 2020-01-08 16:15:08.352 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"]
    [2020/01/08 16:15:08.482 +08:00] [INFO] [ddl.go:622] ["[ddl] DDL job is finished"] [jobID=4]
    [2020/01/08 16:15:08.482 +08:00] [INFO] [domain.go:595] ["performing DDL change, must reload"]
    [2020/01/08 16:15:08.483 +08:00] [INFO] [session.go:2028] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=2] [cur_db=] [sql="CREATE TABLE if not exists mysql.user (\
\t
    \tHost\t\t\t\tCHAR(64),\
\t\tUser\t\t\t\tCHAR(32),\
\t\tPassword\t\t\tCHAR(41),\
\t\tSelect_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tInsert_priv\t\t\tENUM('
    N','Y') NOT NULL DEFAULT 'N',\
\t\tUpdate_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tDelete_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_priv\t\t
    \tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tDrop_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tProcess_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tGrant_pr
    iv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tReferences_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAlter_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t
    \tShow_db_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tSuper_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_tmp_table_priv\t\tENUM('N','Y') NOT NULL 
    DEFAULT 'N',\
\t\tLock_tables_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tExecute_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_view_priv\t\tENUM('N'
    ,'Y') NOT NULL DEFAULT 'N',\
\t\tShow_view_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_routine_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAlter_ro
    utine_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tIndex_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_user_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N'
    ,\
\t\tEvent_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tTrigger_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_role_priv\t\tENUM('N','Y') NOT NULL 
    DEFAULT 'N',\
\t\tDrop_role_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAccount_locked\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tShutdown_priv\t\t\tENUM('N
    ','Y') NOT NULL DEFAULT 'N',\
\t\tPRIMARY KEY (Host, User));"] [user=]
    [2020/01/08 16:15:08.495 +08:00] [INFO] [tidb.go:192] ["rollbackTxn for ddl/autocommit failed"]
    [2020/01/08 16:15:08.495 +08:00] [WARN] [session.go:1028] ["run statement failed"] [schemaVersion=2] [error="[kv:8004]Transaction is too large, size: 13541"] [session
    ="{\
  \"currDBName\": \"\",\
  \"id\": 0,\
  \"status\": 2,\
  \"strictMode\": true,\
  \"user\": null\
}"]
    [2020/01/08 16:15:08.495 +08:00] [FATAL] [bootstrap.go:1042] ["mustExecute error"] [error="[kv:8004]Transaction is too large, size: 13541"] [stack="github.com/pingcap
    /tidb/session.mustExecute\
\t/home/jenkins/agent/workspace/tidb_master/go/src/github.com/pingcap/tidb/session/bootstrap.go:1042\
github.com/pingcap/tidb/session.doDDL
    Works\
\t/home/jenkins/agent/workspace/tidb_master/go/src/github.com/pingcap/tidb/session/bootstrap.go:944\
github.com/pingcap/tidb/session.bootstrap\
\t/home/jenkins
    /agent/workspace/tidb_master/go/src/github.com/pingcap/tidb/session/bootstrap.go:303\
github.com/pingcap/tidb/session.runInBootstrapSession\
\t/home/jenkins/agent/wor
    kspace/tidb_master/go/src/github.com/pingcap/tidb/session/session.go:1678\
github.com/pingcap/tidb/session.BootstrapSession\
\t/home/jenkins/agent/workspace/tidb_mast
    er/go/src/github.com/pingcap/tidb/session/session.go:1588\
main.createStoreAndDomain\
\t/home/jenkins/agent/workspace/tidb_master/go/src/github.com/pingcap/tidb/tidb-
    server/main.go:247\
main.main\
\t/home/jenkins/agent/workspace/tidb_master/go/src/github.com/pingcap/tidb/tidb-server/main.go:185\
runtime.main\
\t/usr/local/go/src/r
    untime/proc.go:203"]

This error means the transaction size limit was exceeded. For details, see https://pingcap.com/docs-cn/stable/faq/tidb/#433-transaction-too-large-是什么原因怎么解决

This is a freshly started cluster, and it is still creating the MySQL system tables:

 [sql="CREATE TABLE if not exists mysql.user (\
\t
\tHost\t\t\t\tCHAR(64),\
\t\tUser\t\t\t\tCHAR(32),\
\t\tPassword\t\t\tCHAR(41),\
\t\tSelect_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tInsert_priv\t\t\tENUM('
N','Y') NOT NULL DEFAULT 'N',\
\t\tUpdate_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tDelete_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_priv\t\t
\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tDrop_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tProcess_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tGrant_pr
iv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tReferences_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAlter_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t
\tShow_db_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tSuper_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_tmp_table_priv\t\tENUM('N','Y') NOT NULL 
DEFAULT 'N',\
\t\tLock_tables_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tExecute_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_view_priv\t\tENUM('N'
,'Y') NOT NULL DEFAULT 'N',\
\t\tShow_view_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_routine_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAlter_ro
utine_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tIndex_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_user_priv\t\tENUM('N','Y') NOT NULL DEFAULT 'N'
,\
\t\tEvent_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tTrigger_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tCreate_role_priv\t\tENUM('N','Y') NOT NULL 
DEFAULT 'N',\
\t\tDrop_role_priv\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tAccount_locked\t\t\tENUM('N','Y') NOT NULL DEFAULT 'N',\
\t\tShutdown_priv\t\t\tENUM('N
','Y') NOT NULL DEFAULT 'N',\
\t\tPRIMARY KEY (Host, User));"] [user=]
The deployment steps followed this document exactly:
https://pingcap.com/docs-cn/stable/how-to/deploy/orchestrated/ansible/

Did this hit some kind of bug? It is very strange.

It looks like something is wrong with the operation steps. You can use unsafe_cleanup to destroy the cluster and then redeploy:
https://pingcap.com/docs-cn/stable/faq/tidb/#311-ansible-常见运维操作有那些
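
For reference, a minimal sketch of the destroy-and-redeploy sequence, assuming it is run from the tidb-ansible directory on the control machine. The function name `redeploy_cluster` and the `DRY_RUN` switch are made up for this illustration; the playbook names are the ones shipped with tidb-ansible. Note that unsafe_cleanup.yml deletes the cluster data, so this only makes sense for a cluster you intend to rebuild from scratch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop, destroy, deploy, and start the cluster in order.
# Set DRY_RUN=1 to print the commands instead of running them.
redeploy_cluster() {
  local run="${DRY_RUN:+echo}"
  local playbook
  for playbook in stop.yml unsafe_cleanup.yml deploy.yml start.yml; do
    # Abort on the first failing playbook so the failure scene is preserved.
    $run ansible-playbook "$playbook" || return 1
  done
}
```

Running `DRY_RUN=1 redeploy_cluster` first prints the four commands, which makes it easy to compare the order with what was actually executed.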

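I deleted the instances and bought new ones, and the problem is the same. These are Google Cloud servers. The steps themselves are fine: the same deployment on my own test machines works.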

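So deploying the cluster on Google Cloud fails, while deploying on the test machines works, right? Because the steps above went straight to a reboot, this is hard to troubleshoot. If convenient, please try the steps below, and if the problem appears again, keep the environment as it is so we can analyze it.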

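Troubleshooting approach:

  1. Confirm whether the tidb server status is normal and whether it was started through the script.
  2. Check the tidb server status and whether the log contains a specific error.
  3. Check whether the tidb server service or port has a conflict that makes the tidb server fail to start.
  4. Check whether the startup parameters of the tidb server are configured correctly.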

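If the checks above show no problem, start the server manually on the tidb node and see what specific errors appear in tidb.log. If you cannot track it down, post the details on TUG. Please use the default parameter values for now. Thanks.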
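
The port and log checks above could be scripted roughly like this. It is only a sketch: the helper names `port_is_up` and `last_errors` are invented for this example, the host, port, and log path in the usage note are taken from the output earlier in this thread, and the log pattern assumes the bracketed level format shown in the pasted logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host:port
# succeeds within 2 seconds (uses bash's /dev/tcp redirection).
port_is_up() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Hypothetical helper: print the last N (default 5) FATAL or ERROR
# lines from a TiDB log file.
last_errors() {
  local logfile="$1" count="${2:-5}"
  grep -E '\[(FATAL|ERROR)\]' "$logfile" | tail -n "$count"
}
```

For example, `port_is_up 10.0.0.3 4000 || last_errors /home/tidb/deploy/log/tidb.log` prints the most recent fatal entries only when the port is not reachable. To look for a port conflict, `ss -ltnp | grep ':4000'` on the tidb node shows which process, if any, is already listening there.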

Yes. It fails on Google Cloud, while the other self-hosted deployments succeed.

If time permits, please rebuild the cluster following the description above, and if there is a problem, provide the details. Thanks.