On Partial Writes in MySQL and Oracle

I read this thread a long time ago and didn't look into it closely at the time. These past few days, while reading up on MySQL internals, the question suddenly came back to me; after discussing it with three or four friends, I'd like to walk through it in detail.

The original thread: https://community.oracle.com/thread/1087650?start=0&tstart=0

In the thread, a poster asks how Oracle avoids partial writes — the so-called "half I/O" problem, where an I/O is torn partway through (a momentary disconnect, a sudden crash, a storage or host failure). Although this rarely happens, it does come up in non-standard recovery work, for example when a block becomes internally inconsistent — what we call a logically corrupt block.
Going through the thread again carefully, I found several of the answers to be inaccurate, or to mix up which mechanism does what.

First, let's look at what MySQL's doublewrite is:

—– from the MySQL Performance Blog

Why use doublewrite?
The purpose is to guarantee data safety in the event of a partial page write — a failure that occurs when a data page has only been half written. InnoDB does not log whole data pages to its log files; instead it uses so-called "physiological" logging, in which a log record contains only the page number, the operation applied to the page (e.g. updating a row), and a log sequence number. The advantage of this log structure is that it requires less data to be written to the log, but it also requires pages to be internally consistent. The page version does not matter: the page may be the "current" version (in which case InnoDB skips the update during crash recovery) or a "former" version (in which case InnoDB replays the update). But if the page is internally inconsistent, recovery cannot proceed.

Partial page writes
What is a partial page write, and why does it happen? A partial page write occurs when a page write handed to the operating system only partially completes. For example, only 4K of a 16K InnoDB page might be updated, with the rest of the page still holding old data. Most partial page writes happen on power failure; they can also happen on an OS crash, because the OS may split a 16K write into several smaller writes (file fragmentation can cause this), and the failure can strike between them. With software RAID, a page may straddle a stripe boundary, again requiring multiple I/Os and again making a partial write possible. The same goes for hardware RAID without a battery-backed cache on power failure. When only a single write operation is sent to the disk itself, the hardware can in theory complete it even on power loss, since the drive should hold enough residual power internally to finish the operation. Frankly, I do not know whether that is really the case; it is hard to verify, and it is not the only cause of partial writes anyway. What I do know is that partial writes can happen, and before InnoDB implemented doublewrite I saw many cases of data corrupted for exactly this reason.
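The failure mode above — a 16K page of which only the first 4K reaches disk — can be illustrated with a toy sketch (this is not InnoDB code; the constants and helper are made up for illustration):

```python
PAGE_SIZE = 16 * 1024   # InnoDB default page size
OS_BLOCK = 4 * 1024     # typical OS/filesystem block size

def torn_write(on_disk: bytes, new: bytes, blocks_written: int) -> bytes:
    """Simulate a crash after only `blocks_written` OS blocks of a page
    write reached disk: the page becomes a mix of new and old data."""
    cut = blocks_written * OS_BLOCK
    return new[:cut] + on_disk[cut:]

old_page = b"O" * PAGE_SIZE  # image currently on disk
new_page = b"N" * PAGE_SIZE  # image the database intended to write

page = torn_write(old_page, new_page, 1)  # power lost after the first 4K
assert page[:OS_BLOCK] == b"N" * OS_BLOCK                # first 4K updated
assert page[OS_BLOCK:] == b"O" * (PAGE_SIZE - OS_BLOCK)  # remaining 12K stale
```

The resulting page is neither the old version nor the new one, which is exactly the state physiological logging cannot repair.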

How does doublewrite work?
You can think of the doublewrite buffer as a short-term log file inside the InnoDB system tablespace, large enough to hold 100 data pages. When InnoDB flushes pages from the buffer pool it writes several at a time: the pages are first written sequentially into the doublewrite buffer and fsync()'d to disk, and only then written to their real locations and fsync()'d again. During crash recovery InnoDB examines both the doublewrite buffer and the pages' original locations: a page that is inconsistent in the doublewrite buffer is simply discarded, while a page that is inconsistent in its original location is restored from the doublewrite buffer.
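The recovery-time decision just described can be sketched as follows. This is a minimal toy model, not InnoDB's actual page format: here a page carries a trailing CRC32 of its body, whereas real InnoDB checksums are laid out differently.

```python
import zlib

PAGE = 16 * 1024  # toy page size, last 4 bytes reserved for a checksum

def make_page(body: bytes) -> bytes:
    """Build a toy page: body padded out, CRC32 of the body appended."""
    body = body.ljust(PAGE - 4, b"\x00")
    return body + zlib.crc32(body).to_bytes(4, "big")

def page_consistent(page: bytes) -> bool:
    body, stored = page[:-4], page[-4:]
    return stored == zlib.crc32(body).to_bytes(4, "big")

def recover(dblwr_copy: bytes, datafile_copy: bytes) -> bytes:
    """Doublewrite crash-recovery decision for one page."""
    if not page_consistent(dblwr_copy):
        # Crash hit the doublewrite-buffer write: the in-place write was
        # never started, so the datafile still holds the old, good page.
        return datafile_copy
    if not page_consistent(datafile_copy):
        # Crash hit the second, in-place write: restore the good copy
        # from the doublewrite buffer.
        return dblwr_copy
    return datafile_copy  # both consistent: nothing to repair

new = make_page(b"new row data")
torn = new[: PAGE // 2] + make_page(b"old row data")[PAGE // 2 :]  # fractured
assert not page_consistent(torn)
assert recover(new, torn) == new  # datafile repaired from doublewrite copy
```

Either way a self-consistent copy of every page survives the crash, which is the whole point of writing twice.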

What does the doublewrite buffer cost MySQL?
Although doublewrite writes every page twice, its overhead is far less than 2x. Writes to the doublewrite buffer are sequential, so they are cheap. Doublewrite also lets InnoDB reduce fsync() calls: instead of fsync()'ing after every page, it can submit a batch of writes and fsync() once at the end, which lets the operating system optimize write ordering and use several storage devices in parallel. These optimizations could be had without doublewrite too — in fact they were implemented at the same time as doublewrite. Overall, I would expect the performance overhead of doublewrite to be no more than 5-10%.

Can doublewrite be disabled?
If you do not care about data consistency (say, you are on RAID 0), or your filesystem guarantees that partial writes cannot happen, you can disable doublewrite by setting the innodb_doublewrite parameter to 0. Usually, though, that invites bigger trouble.
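For reference, the switch lives in the server configuration; a minimal my.cnf fragment (illustrative only — the value shown disables doublewrite, which is rarely advisable):

```ini
[mysqld]
# Disable the doublewrite buffer. Only safe when the storage stack
# guarantees atomic 16K page writes; otherwise torn pages become
# unrecoverable by InnoDB crash recovery.
innodb_doublewrite = 0
```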

I will not go into here why MySQL without doublewrite is prone to torn writes. In MySQL the unit of a data write is the page (1 page = 16KB); in Oracle it is the block, whose size is configurable. Either way, writes reach the operating system in units of OS blocks — which means that if a partial write occurs at the OS block level, the same logical corruption arises.

Now let's look at one of the answers:

It’s an interesting question. I think the clue is in the link you provided: “Such logging structure is great as it requires less data to be written to the log, however it requires pages to be internally consistent.”

What that’s saying (I think!) is that the contents of the innodb transaction log can only be replayed to datafile pages which are ‘clean’ -and that’s true for Oracle, too. You can’t apply Oracle redo to an Oracle database block that is internally corrupted because some of its constituent “os pages” were written at a different time from others. When such partial writes happen, you get what’s called a “fractured block”, warnings in the alert log …and the data file is regarded as corrupt from that point on.

Oracle’s fix to this potential problem, however, is also hinted at in the article you linked to: “Innodb does not log full pages to the log files”. That’s an interesting sentence because, you see, Oracle does write full pages to the logs! I should immediately qualify that: it only does so when you take a “hot backup” using O/S copy commands -because it’s only then that you have to worry about the problem. In other words, you only have to worry about the fact that you can only apply redo to an internally consistent database block if you’re actually in the business of applying redo… and you’re only doing that in the event of a recovery. And complete recoveries in Oracle (as opposed to mere crash recoveries) require you to have restored something from backup. So, it’s only during the backup process that you have to worry about the problem of fractured blocks -and so it’s only then that Oracle says, ‘if you have put the tablespace into hot backup mode (alter tablespace X begin backup), then the first time a block of data is changed, the entire block should be written in a consistent state into the redo (transaction) logs. Then, if the datafile copy of the block in the backup turns out to be fractured, we’ve got a known good copy in the redo we can restore in its place. And once you have a clean block as a starting point, you can continue to apply redo from that point on’.

Oracle has an alternative (and more up to date) mechanism for achieving this “I know your data block is clean” starting state, though. It’s called RMAN -the Oracle backup and recovery tool. Unlike your OS copy command, it’s an Oracle utility… so it understands the concept of Oracle blocks, and it can therefore check that a block that’s been copied has been copied consistently, with all its constituent OS ‘pages’ written coherently to disk in the same state. It knows how to compare input and output in a way no OS command could ever hope to do. So when RMAN copies a data block hot, it reads the copy, compares it with the original -and if it sees the copy is fractured, it just has another go copying the block again. Repeat until the copy is indeed verified as a good copy of the original. No need to write the block into the transaction log at all, because you know that the backup file itself contains the necessary clean block copy.

So, putting that into practice. Let’s say your server corrupts data on the disk for whatever reason and, in the process, your Oracle instance dies. You try and restart Oracle, but you get told that recovery is needed (you might get a message that file 16, for example, can’t be read). So you restore file 16 from your hot backup taken with OS commands. In that backup, one of the blocks is fractured, because only part of the Oracle block had hit disk at the point the backup was performed. So you restore a fractured block. But that’s not a problem, because as redo is replayed, you’ll find the clean copy of the block in the redo stream, and restore that over the top of the fractured block. The rest of the redo can then be replayed without a problem. Or, you restore file 16 using RMAN… and what it restores cannot be fractured, because it checks for that before it reports the original backup a success. Therefore, you restore a clean copy of file 16, and can apply redo to it without drama. Either way, you get your database recovered.

So, the article you linked to nails the important matter: “It does not matter which page version it is – it could be “current” version in which case Innodb will skip page update operation or “former” in which case Innodb will perform update. If page is inconsistent recovery can’t proceed.” Absolutely true of Oracle, too. But Oracle has two alternatives for ensuring that a clean version of the block is always available: write a whole block into redo if it’s changed whilst the database is being backed up with OS commands, or make sure you only write clean blocks into the backup if you’re using RMAN -and you achieve that by multiple reads of the block, as many as are necessary to ensure the output is clean.

Oracle’s solutions in these regards are, I think, a lot more efficient than double-writing every block all the time, because the only time you have to worry that what’s on disk isn’t consistent is, as your linked article again points out, when ‘power failure’ or ‘os crash’ happens. That is, during some sort of failure. And the response to failure that involves corruption is always to restore something from backup… so, it’s really only that backup that needs to worry about ‘clean pages’. Instead of writing everything twice to disk during normal running (which sounds like a potentially enormous overhead to me!), therefore, Oracle only has to employ protective measures during the backup process itself (which should, ordinarily, be a mere fraction of ‘normal running’ time). The overhead is therefore only encountered sparingly and not something you need worry about as a potentially-constant performance problem.

In closing, I’ll second Aman’s observation that it is generally and usually the case that any variation away from the default 8K block size is a bad idea. Not always, and there may be justification for it in extremis… but you will certainly be at risk of encountering more and weirder bugs than if you stick to the defaults.

This answer does not really hit the point: it mostly describes how Oracle avoids partial writes in one specific scenario — during backup — rather than as a general database mechanism. For comparison, let's see how MySQL avoids the same backup-time problem —> (cf. Oracle hot backup / RMAN backup).
In a traditional Oracle hot backup, the backup's reads run concurrently with DBWR writing the data files, so the backup can pick up a fractured (internally inconsistent) block. The remedy: for files being backed up, the full image of a changed block is also written into the redo log, to be used later to repair any fractured block. An RMAN backup instead checks the consistency of every page it reads, and re-reads any page that is inconsistent. Percona's XtraBackup takes an approach similar to RMAN's.
How does the backup tell whether a page it read is consistent? The method is simple: both Oracle and InnoDB pages store a matching SCN / checksum in the page header and in the page trailer. If the header and trailer values agree, the page is consistent; if not, the page is fractured, and the backup just reads it again.
So this part of the answer is really about avoiding partial writes while taking a backup — the moment when fractured blocks or pages are most likely to be observed — whereas the original question was whether Oracle has a doublewrite-like mechanism against partial writes in normal operation.
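The re-read loop used by RMAN and XtraBackup can be sketched like this (toy code with made-up helper names; the real tools compare the SCN/checksum fields stored in the page header and trailer, here reduced to a 4-byte stamp at each end):

```python
PAGE_SIZE = 16 * 1024

def is_consistent(page: bytes) -> bool:
    """Toy check: the page is whole if the 4-byte stamp at the head
    matches the one at the tail (Oracle/InnoDB really compare an
    SCN/checksum stored in the page header and trailer)."""
    return page[:4] == page[-4:]

def backup_page(read_page, page_no: int, max_retries: int = 100) -> bytes:
    """RMAN/XtraBackup-style copy: keep re-reading a page until the
    copy is not fractured by a concurrent in-place write."""
    for _ in range(max_retries):
        page = read_page(page_no)
        if is_consistent(page):
            return page
    raise IOError(f"page {page_no} still fractured after {max_retries} reads")

# A reader that returns a torn page once, then a clean one:
clean = b"\xab" * 4 + b"x" * (PAGE_SIZE - 8) + b"\xab" * 4
torn = clean[: PAGE_SIZE // 2] + b"\x00" * (PAGE_SIZE // 2)
reads = [torn, clean]
assert backup_page(lambda n: reads.pop(0), 0) == clean
```

Because the writer always finishes its write eventually, retrying the read is enough; no second copy of the page ever needs to be persisted.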

Let's continue with the next answer:

user646745 wrote:
Thanks HJR for detailed analysis.

But the double write mechanism works without restore from a backup and apply redo: before write the dirty buffer blocks, innodb flush the blocks in the double write buffer to disk which is a sequential disk area (so it’s fast),
————————-
before write the dirty buffer blocks, logwr flush the blocks in the redo buffer to disk which is a sequential disk area (so it’s fast),

so even if partial write happen, the blocks in double write buffer already persistent in disk,
————————-
so even if partial write happen, the blocks in redo buffer already persistent in disk,

and when mysql restart, innodb can compare the blocks flushed from double write buffer with the datafile blocks,
————————-
and when mysql restart, smon can compare control file scn with the datafile blocks scn,

if they matched (means no partial write), then everything is fine, if not matched (partial write happen), just overwrite the datafile blocks with the blocks from double write buffer.
————————-
if they matched (means no partial write), then everything is fine, if not matched (partial write happen), just apply the redo from the redo logs.

So no media recover is required.
————————-
sounds like media recovery to me

Based on your analysis, oracle needs media recover.
————————-
Based on your analysis, so does mysql. It just applies it in a very slightly different fashion, and calls it all something else.

This reply again contains a mistake. The line "so even if partial write happen, the blocks in redo buffer already persistent in disk" is plainly misleading: when a partial write happens, redo cannot repair a block that is internally inconsistent. Redo drives recovery, but at the transaction or media level — it does not repair the inside of a single block.

Oracle recovery divides into instance/crash recovery and media recovery. The essential difference: instance recovery needs only the online redo log files — it applies the redo generated since the last incremental checkpoint — while media recovery, which further divides into datafile and block media recovery, restores an older datafile or block and may need archived logs. Note the precondition: a backup must exist. The restore of a datafile or block can only come from a backup, and the subsequent recover applies the archived logs. It is media recovery that corresponds to the cure for a partial write.
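Block media recovery is driven from RMAN. A brief illustration — the file and block numbers are made up, and the syntax shown is the 11g-style RECOVER ... BLOCK command (earlier releases spell it BLOCKRECOVER):

```sql
-- Restores the corrupt block from a backup, then applies (archived)
-- redo to just that block; the rest of the datafile stays online.
RECOVER DATAFILE 16 BLOCK 123;
```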

MySQL cannot use redo to repair an internally inconsistent page: applying redo presupposes consistent data. If a failure strikes while a page is being flushed to disk — say only 4K of the 16K page made it — and the redo records are change vectors, i.e. logical records, then at InnoDB recovery time redo has no consistent base page to start from.

For Oracle, likewise, redo alone is not enough for an internally inconsistent block. We can detect the logically corrupt block in the datafile and repair it — provided a backup exists — using block media recovery, a form of media recovery.
We can simulate such logical corruption with bbed or OS-level tools, and we often find that the database still opens normally; the logical error only surfaces when the corrupt block is actually accessed. Hence: a backup matters more than anything. Against these "half I/Os", or silent corruption in the storage layer, Oracle without a backup is largely helpless as well.

vmcd
