[3]
Xia F., Dou Y., Lei G., and Tan Y., FPGA accelerator for protein secondary structure prediction based on the GOR algorithm, BMC Bioinformatics, vol. 12, no. S1, p. S5, 2011.
[4]
Jiang J., Mirian V., Tang K. P., Chow P., and Xing Z., Matrix multiplication based on scalable macro-pipelined FPGA accelerator architecture, in 2009 International Conference on Reconfigurable Computing and FPGAs, 2009, pp. 48-53.
[5]
Liu L., Neal O., Chitlur B., Wang Q., Alvin C., Shen W., Yu Z., Arthur S., Ian M., Joseph G., et al., High-performance, energy efficient platforms using in-socket FPGA accelerators, in Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009, pp. 261-264.
[6]
Wehner M., Oliker L., and Shalf J., Towards ultra-high resolution models of climate and weather, International Journal of High Performance Computing Applications, vol. 22, no. 2, pp. 149-165, 2008.
[7]
Wehner M., Oliker L., and Shalf J., Green flash: Designing an energy efficient climate supercomputer, in IEEE International Symposium on Parallel & Distributed Processing, 2009.
[8]
Shaw D. E., Dror R. O., Salmon J. K., Grossman J. P., Mackenzie K. M., Bank J. A., and Chow E., Millisecond-scale molecular dynamics simulations on Anton, in Proceedings of the ACM/IEEE Conference on Supercomputing, 2009, pp. 1-11.
[9]
Hameed R., Qadeer W., Wachs M., Azizi O., Solomatnikov A., Lee B. C., and Horowitz M., Understanding sources of inefficiency in general-purpose chips, in Proceedings of the 37th Annual International Symposium on Computer Architecture, vol. 38, no. 3, pp. 37-47, 2010.
[11]
Cong J., Sarkar V., Reinman G., and Bui A., Customizable domain-specific computing, IEEE Design and Test of Computers, vol. 28, no. 2, pp. 5-15, 2011.
[13]
Levesque J., Larkin J., Foster M., Glenski J., Geissler G., Whalen S., and Wasserman H., Understanding and mitigating multicore performance issues on the AMD Opteron architecture, Technical Report, Lawrence Berkeley National Laboratory, 2007.
[14]
Atasu K., Luk W., Mencer O., Ozturan C., and Dundar G., FISH: Fast instruction synthesis for custom processors, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, no. 1, pp. 52-65, 2012.
[15]
Grad M. and Plessl C., Pruning the design space for just-in-time processor customization, in International Conference on Reconfigurable Computing and FPGAs (ReConFig), 2010, pp. 67-72.
[16]
Datta K., Murphy M., Volkov V., Williams S., Carter J., Oliker L., and Yelick K., Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008, p. 4.
[17]
Membarth R., Hannig F., Teich J., and Köstler H., Towards domain-specific computing for stencil codes in HPC, in High Performance Computing, Networking, Storage and Analysis (SCC), Salt Lake City, UT, USA, 2012, pp. 1133-1138.
[18]
Berger M. and Oliger J., Adaptive mesh refinement for hyperbolic partial differential equations, Journal of Computational Physics, vol. 53, pp. 484-512, 1984.
[19]
Zhang Y. and Mueller F., Auto-generation and auto-tuning of 3-D stencil codes on GPU clusters, in Proceedings of the Tenth International Symposium on Code Generation and Optimization, 2012, pp. 155-164.
[20]
Kamil S., Datta K., Williams S., Oliker L., Shalf J., and Yelick K., Implicit and explicit optimizations for stencil computations, in ACM SIGPLAN Workshop Memory Systems Performance and Correctness, San Jose, CA, USA, 2006, pp. 51-60.
[21]
Rivera G. and Tseng C., Tiling optimizations for 3-D scientific computations, in Proceedings of ACM/IEEE 2000 Conference on Supercomputing, 2000, p. 32.
[22]
Coleman S. and McKinley K., Tile size selection using cache organization and data layout, ACM SIGPLAN Notices, vol. 30, no. 6, pp. 279-290, 1995.
[23]
Bondhugula U., Hartono A., Ramanujam J., and Sadayappan P., A practical automatic polyhedral parallelizer and locality optimizer, ACM SIGPLAN Notices, vol. 43, no. 6, pp. 101-113, 2008.
[24]
Phillips E. and Fatica M., Implementing the Himeno benchmark with CUDA on GPU clusters, in 2010 IEEE International Symposium on Parallel and Distributed Processing (IPDPS), 2010, pp. 1-10.
[25]
Yang Y., Cui H., Feng X., and Xue J., A hybrid circular queue method for iterative stencil computations on GPUs, Journal of Computer Science and Technology, vol. 27, pp. 57-74, 2012.
[26]
Araya-Polo M., Cabezas J., Hanzich M., Pericas M., Rubio F., Gelado I., and Cela J. M., Assessing accelerator based HPC reverse time migration, IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 147-162, 2011.
[27]
Niu X., Jin Q., Luk W., Liu Q., and Pell O., Exploiting runtime reconfiguration in stencil computation, in Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012, pp. 173-180.