Abstract
Hot standby systems often have to trade safety (i.e., not losing committed work) and freshness (i.e., having access to recent updates) for performance. Guaranteeing safety requires synchronous log shipping that blocks the primary until the log records are durably replicated on one or more backups; maintaining freshness necessitates fast log replay on backups, but is often defeated by the dual-copy architecture and serial replay: a backup must generate the "real" data from the log to make recent updates accessible to read-only queries.
This paper proposes Query Fresh, a hot standby system that provides both safety and freshness while maintaining high performance on the primary. The crux is an append-only storage architecture used in conjunction with fast networks (e.g., InfiniBand) and byte-addressable, non-volatile memory (NVRAM). Query Fresh avoids the dual-copy design and treats the log as the database, enabling lightweight, parallel log replay that does not block the primary.
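To make the "log as the database" idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a single in-memory list standing in for the shipped log and a dictionary standing in for an index; the names `ToyBackup`, `ship`, `replay`, and `read` are hypothetical. The point it illustrates is that replay only installs log offsets in the index rather than materializing a second copy of the data.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, str]  # (key, value); a hypothetical record layout


class ToyBackup:
    """Toy backup node: the received log is the only copy of the data."""

    def __init__(self) -> None:
        self.log: List[Record] = []      # append-only log received from the primary
        self.index: Dict[str, int] = {}  # key -> offset of latest replayed record
        self.replayed = 0                # log offset up to which replay has run

    def ship(self, records: List[Record]) -> None:
        # Receiving log records only appends them; no data is copied elsewhere.
        self.log.extend(records)

    def replay(self) -> None:
        # Replay installs offsets in the index; record payloads stay in the log.
        for offset in range(self.replayed, len(self.log)):
            key, _ = self.log[offset]
            self.index[key] = offset
        self.replayed = len(self.log)

    def read(self, key: str) -> Optional[str]:
        # Read-only queries follow the index into the log itself.
        offset = self.index.get(key)
        return None if offset is None else self.log[offset][1]


backup = ToyBackup()
backup.ship([("a", "1"), ("b", "7"), ("a", "2")])
before = backup.read("a")  # not yet visible: replay has not run
backup.replay()
after = backup.read("a")   # latest shipped version of "a"
```

In this toy model, replay work per record is one index update regardless of record size, which is one way to read the abstract's claim that replay can be lightweight. Parallel replay and durability are deliberately left out.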
Experimental results using the TPC-C benchmark show that under Query Fresh, backup servers can replay log records faster than they are generated by the primary server, using one quarter of the available compute resources. With a 56 Gbps network, Query Fresh can support up to 4-5 synchronous replicas, each of which receives and replays ~1.4 GB of log records per second, with up to 4-6% overhead on the primary compared to a standalone server that achieves 620 kTPS without replication.
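As a rough sanity check on the replica count, the sketch below computes how many full log streams fit in the link's raw bandwidth. It assumes that the primary sends the whole log stream to every replica over one 56 Gbps link, that units are decimal (1 GB = 1000 MB), and that protocol and encoding overhead are ignored; the function name `max_replicas` is hypothetical.

```python
def max_replicas(link_gbps: int, log_mb_per_s: int) -> int:
    """Upper bound on replicas that one outbound link can feed.

    Assumes each replica receives the full log stream and ignores
    protocol overhead, so this is an optimistic bound.
    """
    link_mb_per_s = link_gbps * 1000 // 8  # gigabits/s -> megabytes/s
    return link_mb_per_s // log_mb_per_s


# 56 Gbps is 7000 MB/s of raw bandwidth; each replica needs about 1400 MB/s.
bound = max_replicas(56, 1400)
```

Under these assumptions the bound is 5 replicas, which is in line with the reported 4-5 once real-world overhead is taken into account.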