Monthly Archives: June 2014

MongoDB not preallocate journal log

在安装mongodb 2.6.2的时候发现一个奇怪的问题 RS 节点的journal log并没有提前分配.我们知道在安装mongodb的时候
mongo总是会预先分配journal log 启用smallfile的时候默认为128MB 否则会分配1GB的journal log

下面是仲裁节点的日志:

2014-06-17T11:50:09.842+0800 [initandlisten] MongoDB starting : pid=4749 port=27017 dbpath=/data/mongodb/data 64-bit host=vm-3-57
2014-06-17T11:50:09.844+0800 [initandlisten] db version v2.6.2
2014-06-17T11:50:09.844+0800 [initandlisten] git version: 4d06e27876697d67348a397955b46dabb8443827
2014-06-17T11:50:09.844+0800 [initandlisten] build info: Linux build10.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-06-17T11:50:09.844+0800 [initandlisten] allocator: tcmalloc
2014-06-17T11:50:09.844+0800 [initandlisten] options: { config: "/data/mongodb/mongod.cnf", net: { http: { enabled: false }, maxIncomingConnections: 5000, port: 27017, unixDomainSocket: { pathPrefix: "/data/mongodb/data" } }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 500 }, processManagement: { fork: true, pidFilePath: "/data/mongodb/data/mongod.pid" }, replication: { replSet: "rs1" }, security: { authorization: "enabled", keyFile: "/data/mongodb/data/rs1.keyfile" }, storage: { dbPath: "/data/mongodb/data", directoryPerDB: true, journal: { enabled: true }, repairPath: "/data/mongodb/data", syncPeriodSecs: 10.0 }, systemLog: { destination: "file", path: "/data/mongodb/log/mongod_data.log", quiet: true } }
2014-06-17T11:50:09.863+0800 [initandlisten] journal dir=/data/mongodb/data/journal
2014-06-17T11:50:09.864+0800 [initandlisten] recover : no journal files present, no recovery needed
2014-06-17T11:50:10.147+0800 [initandlisten] preallocateIsFaster=true 3.52
2014-06-17T11:50:10.378+0800 [initandlisten] preallocateIsFaster=true 3.4
2014-06-17T11:50:11.662+0800 [initandlisten] preallocateIsFaster=true 2.9
2014-06-17T11:50:11.662+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.0
2014-06-17T11:50:14.009+0800 [initandlisten]        File Preallocator Progress: 629145600/1073741824    58%
2014-06-17T11:50:26.266+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.1
2014-06-17T11:50:29.009+0800 [initandlisten]        File Preallocator Progress: 723517440/1073741824    67%
2014-06-17T11:50:40.751+0800 [initandlisten] preallocating a journal file /data/mongodb/data/journal/prealloc.2
2014-06-17T11:50:43.020+0800 [initandlisten]        File Preallocator Progress: 597688320/1073741824    55%
2014-06-17T11:50:55.830+0800 [FileAllocator] allocating new datafile /data/mongodb/data/local/local.ns, filling with zeroes...

mongo默认创建了3个1GB 的journal log

再来看下RS 节点的日志:

2014-06-17T14:31:31.095+0800 [initandlisten] MongoDB starting : pid=8630 port=27017 dbpath=/storage/sas/mongodb/data 64-bit host=db-mysql-common01a
2014-06-17T14:31:31.096+0800 [initandlisten] db version v2.6.2
2014-06-17T14:31:31.096+0800 [initandlisten] git version: 4d06e27876697d67348a397955b46dabb8443827
2014-06-17T14:31:31.096+0800 [initandlisten] build info: Linux build10.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-06-17T14:31:31.096+0800 [initandlisten] allocator: tcmalloc
2014-06-17T14:31:31.096+0800 [initandlisten] options: { config: "/storage/sas/mongodb/mongod.cnf", net: { http: { enabled: false }, maxIncomingConnections: 5000, port: 27017, unixDomainSocket: { pathPrefix: "/storage/sas/mongodb/data" } }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 500 }, processManagement: { fork: true, pidFilePath: "/storage/sas/mongodb/data/mongod.pid" }, replication: { replSet: "rs1" }, security: { authorization: "enabled", keyFile: "/storage/sas/mongodb/data/rs1.keyfile" }, storage: { dbPath: "/storage/sas/mongodb/data", directoryPerDB: true, journal: { enabled: true }, repairPath: "/storage/sas/mongodb/data", syncPeriodSecs: 10.0 }, systemLog: { destination: "file", path: "/storage/sas/mongodb/log/mongod_data.log", quiet: true } }
2014-06-17T14:31:31.101+0800 [initandlisten] journal dir=/storage/sas/mongodb/data/journal
2014-06-17T14:31:31.102+0800 [initandlisten] recover : no journal files present, no recovery needed
2014-06-17T14:31:31.130+0800 [FileAllocator] allocating new datafile /storage/sas/mongodb/data/local/local.ns, filling with zeroes...
2014-06-17T14:31:31.130+0800 [FileAllocator] creating directory /storage/sas/mongodb/data/local/_tmp
2014-06-17T14:31:31.132+0800 [FileAllocator] done allocating datafile /storage/sas/mongodb/data/local/local.ns, size: 16MB,  took 0 secs
2014-06-17T14:31:31.137+0800 [FileAllocator] allocating new datafile /storage/sas/mongodb/data/local/local.0, filling with zeroes...
2014-06-17T14:31:31.138+0800 [FileAllocator] done allocating datafile /storage/sas/mongodb/data/local/local.0, size: 64MB,  took 0 secs
2014-06-17T14:31:31.141+0800 [initandlisten] build index on: local.startup_log properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.startup_log" }

没有创建journal log 直接创建了datafile 这个问题很奇怪 之前认为是ext4的问题,咨询朋友之后发现mongodb在创建journal log 之前会有一个判断,部分源码如下:

// @file dur_journal.cpp writing to the writeahead logging journal

  bool _preallocateIsFaster() {
            bool faster = false;
            boost::filesystem::path p = getJournalDir() / "tempLatencyTest";
            if (boost::filesystem::exists(p)) {
                try {
                    remove(p);
                }
                catch(const std::exception& e) {
                    log() << "Unable to remove temporary file due to: " << e.what() << endl;
                }
            }
            try {
                AlignedBuilder b(8192);
                int millis[2];
                const int N = 50;
                for( int pass = 0; pass < 2; pass++ ) {
                    LogFile f(p.string());
                    Timer t;
                    for( int i = 0 ; i < N; i++ ) { 
                        f.synchronousAppend(b.buf(), 8192);
                    }
                    millis[pass] = t.millis();
                    // second time through, file exists and is prealloc case
                }
                int diff = millis[0] - millis[1];
                if( diff > 2 * N ) {
                    // at least 2ms faster for prealloc case?
                    faster = true;
                    log() << "preallocateIsFaster=true " << diff / (1.0*N) << endl;
                }
            }
            catch (const std::exception& e) {
                log() << "info preallocateIsFaster couldn't run due to: " << e.what()
                      << "; returning false" << endl;
            }
            if (boost::filesystem::exists(p)) {
                try {
                    remove(p);
                }
                catch(const std::exception& e) {
                    log() << "Unable to remove temporary file due to: " << e.what() << endl;
                }
            }
            return faster;
        }
        bool preallocateIsFaster() {
            Timer t;
            bool res = false;
            if( _preallocateIsFaster() && _preallocateIsFaster() ) { 
                // maybe system is just super busy at the moment? sleep a second to let it calm down.  
                // deciding to to prealloc is a medium big decision:
                sleepsecs(1);
                res = _preallocateIsFaster();
            }
            if( t.millis() > 3000 ) 
                log() << "preallocateIsFaster check took " << t.millis()/1000.0 << " secs" << endl;
            return res;
        }
        
 int diff = millis[0] - millis[1];
                if( diff > 2 * N ) {
                    // at least 2ms faster for prealloc case?
                    faster = true;
                    log() << "preallocateIsFaster=true " << diff / (1.0*N) << endl;
                }   

如果diff> 2*N 那么mongo 将会认为 preallocate 是更好的选择,将会预先分配log ,这在仲裁节点日志也体现出来了,都大于2ms

2014-06-17T11:50:10.147+0800 [initandlisten] preallocateIsFaster=true 3.52
2014-06-17T11:50:10.378+0800 [initandlisten] preallocateIsFaster=true 3.4
2014-06-17T11:50:11.662+0800 [initandlisten] preallocateIsFaster=true 2.9

不过个人认为这个设计很无聊,我相信没人会介意初始化日志的那么点时间吧,何况如果出现峰值的之后再去分配log,对IO又是一个冲击。

xtrabackup拷贝文件的原理

innodb在很多方面都是模仿oracle的,xtrabackup的机制也跟rman很像,可以先看看rman是怎么copy block的:

http://docs.oracle.com/cd/E11882_01/backup.112/e10642/rcmcncpt.htm

Unlike user-managed tools, RMAN does not require extra logging or backup mode because it knows the format of data blocks. RMAN is guaranteed not to back up fractured blocks. During an RMAN backup, a database server session reads each data block and checks whether it is fractured by comparing the block header and footer. If a block is fractured, then the session rereads the block. If the same fracture is found, then the block is considered permanently corrupt. Also, RMAN does not need to freeze the data file header checkpoint because it knows the order in which the blocks are read, which enables it to capture a known good checkpoint for the file.

http://docs.oracle.com/cd/B19306_01/backup.102/b14191/rcmconc1.htm

RMAN does not require that you put datafiles into backup mode. During an RMAN backup, a database server session reads each block of the datafile and checks whether each block is fractured by comparing the block header and footer. If a block is fractured, the session re-reads the block. If the same fracture is found, then the block is considered permanently corrupt. If MAXCORRUPT is exceeded, the backup stops.

Xtrabackup备份恢复原理可以如下概括:
备份innodb表时,Xtrabackup若干个线程拷贝独立表空间的.ibd文件,并不停监视此过程中redo log的变化,添加到自己的事务日志文件(xtrabackup_logfile)中。在此过程中,发生的物理写操作越多,xtrabackup_logfile越大。在拷贝完成后的第一个prepare阶段,Xtrabackup采用类似于innodb崩溃恢复的方法,把数据文件恢复到与日志文件一致的状态,并把未提交的事务回滚。如果同时需要备份myisam表以及innodb表结构等文件,那么就需要用flush tables with lock来获得全局锁,开始拷贝这些不再变化的文件,同时获得binlog位置,拷贝结束后释放锁,也停止对redo log的监视。

很多同学对上面的理解有混淆,以为拷贝.ibd文件就跟操作系统拷贝文件一样。其实这里涉及到fractured page的问题,他应该会重新读取(应该也有重试次数,超过后备份不成功)。

其实这个原理很简单,了解下doublewrite就可以理解了“本段摘录自《MySQL技术内幕:InnoDB存储引擎》”:
如果说插入缓冲带给InnoDB存储引擎的是性能,那么两次写带给InnoDB存储引擎的
是数据的可靠性。当数据库宕机时,可能发生数据库正在写一个页面,而这个页只写了一部分(比如16K的页,只写前4K的页)的情况,我们称之为部分写失效(partial page write)。在InnoDB存储引擎未使用double write技术前,曾出现过因为部分写失效而导致数据丢失的情况。
有人也许会想,如果发生写失效,可以通过重做日志进行恢复。这是一个办法。但是
必须清楚的是,重做日志中记录的是对页的物理操作,如偏移量800,写’aaaa’记录。如果这个页本身已经损坏,再对其进行重做是没有意义的。这就是说,在应用(apply)重做日志前,我们需要一个页的副本,当写入失效发生时,先通过页的副本来还原该页,再进行重做,这就是doublewrite。

其实也就是我们在oracle恢复时如果数据块损坏,要求有备份,然后滚日志才行,没有无缘无故的恨,一切都是有原因的。。。。

Good news for MySQL HA solutions


Oracle Clusterware is portable cluster software that allows clustering of independent servers so that they cooperate as a single system. Oracle Clusterware was first released with Oracle Database 10g Release 1 as the required cluster technology for the Oracle multi-instance database,  Oracle Real Application Clusters (RAC).

Oracle Clusterware 12c Release 1 is the integrated foundation for Oracle Real Application Clusters (RAC) and the High Availability (HA) and resource management framework for all applications on any major platform.  Oracle Clusterware 12c builds on the innovative technology introduced with Oracle Clusterware 11g by providing comprehensive multi-tiered HA and resource management for consolidated environments. The idea is to leverage Oracle Clusterware in the cloud to provide enterprise-class resiliency where required and dynamic, online allocation of compute resources where needed, when needed.


NEW in Oracle Clusterware 12c

  •  Oracle Flex Cluster 12c: a new clustering solution for agile, large-scale deployments
  •  Server Categorization: support for heterogeneous server pool management
  •  Cluster Configuration Policy Sets: automated management of business critical workloads
  •  Utility Cluster Concept: centralized, data center management cluster
  •  Oracle Generic Agent: a simple way to integrate applications into Oracle Clusterware
  •  Oracle Grid Infrastructure Generic standalone and Bundled Agents: 
    • Currently available for Siebel, Oracle GoldenGate, Peoplesoft, Apache and MySQL
    • Currently supported on AIX, Solaris and Linux

From MySQL Blog

MySQL has an extensive range of high-availability solutions to suit many different use cases and deployment needs.  This list spans from the time-tested – yet continuously-improved – MySQL replication to the just-released MySQL Fabric, giving users many certified solutions for highly available MySQL deployments.  The list is growing yet again, with Oracle Clusterware adding support for MySQL.
Oracle’s Clusterware product is the foundation for the Oracle RAC, and has been battle-tested for high availability support for Oracle database, as well as other Oracle applications.  This technology is now available as part of the MySQL Enterprise subscription, and – like all Oracle commercial products – is freely available for evaluation purposes.  This post will explain Oracle Clusterware architecture and the benefits to MySQL users, and will be followed by a later post focusing on how to deploy Clusterware agents with MySQL.

Link:http://www.oracle.com/technetwork/database/database-technologies/clusterware/overview/index.html