This CNN series consists mainly of notes on Stanford University's course “CS231n: Convolutional Neural Networks for Visual Recognition”. Stanford's computer vision courses include CS131, CS231a, CS231n, CS331, and CS431.
A Brief History of Computer Vision
- 1959: Hubel & Wiesel [1];
- 1963: Larry Roberts, Block World [2];
- 1966: The Summer Vision Project;
- 1970s: David Marr, “Vision”, Stages of Visual Representation [3];
- 1973: Fischler & Elschlager, Pictorial Structure [4];
- 1979: Brooks & Binford, Generalized Cylinder [5];
- 1987: David Lowe [6];
- 1997: Shi & Malik, Normalized Cut [7];
- 1999: David Lowe, SIFT & Object Recognition [8];
- 2001: Viola & Jones, Face Detection [9];
- 2005: Dalal & Triggs, HOG (Histogram of Oriented Gradients) [10];
- 2005–2012: PASCAL Visual Object Challenge [11], [12];
- 2006: Lazebnik, Schmid & Ponce, Spatial Pyramid Matching [13];
- 2009: Felzenszwalb, McAllester & Ramanan, Deformable Part Model [14];
- 2009: ImageNet: Large Scale Visual Recognition Challenge [15], [16].
In 2006, Fujifilm used the Viola & Jones method [9] to produce the first digital camera with face detection.
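Introduction to Image Classification
Image classification is related to a whole range of visual recognition problems, such as object recognition, image captioning, and action recognition. The convolutional neural network (CNN) is an important tool for object recognition.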
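In the ILSVRC competition, the 2011 approach was classical feature extraction combined with a linear classifier [17]. Since 2012, the winning teams have all used deep neural networks [18], [19], [20], [21], and in 2015 MSRA's deep neural network reached 152 layers.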
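The deep neural network that Krizhevsky used in 2012 in fact changed little from LeCun's network [22], but thanks to increased computing power and larger amounts of data, it won the ImageNet ILSVRC competition.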
Visual intelligence aims far beyond object recognition: the goal is not only to recognize objects but also to understand what an image means [23].
References
- [1]D. H. Hubel and T. N. Wiesel, “Receptive fields of single neurones in the cat’s striate cortex,” The Journal of physiology, vol. 148, no. 3, pp. 574–591, 1959.
- [2]L. G. Roberts, “Machine perception of three-dimensional solids,” PhD thesis, Massachusetts Institute of Technology, 1963.
- [3]D. Marr, Vision: A computational investigation into the human representation and processing of visual information. The MIT Press, 2010.
- [4]M. A. Fischler and R. A. Elschlager, “The representation and matching of pictorial structures,” IEEE Transactions on computers, no. 1, pp. 67–92, 1973.
- [5]R. A. Brooks, R. Greiner, and T. O. Binford, “The ACRONYM model-based vision system,” in Proceedings of the 6th international joint conference on Artificial intelligence, 1979, pp. 105–113.
- [6]D. G. Lowe, “Three-dimensional object recognition from single two-dimensional images,” Artificial intelligence, vol. 31, no. 3, pp. 355–395, 1987.
- [7]J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
- [8]D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
- [9]P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, 2001, vol. 1, pp. I–511.
- [10]N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2005, vol. 1, pp. 886–893.
- [11]M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
- [12]M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, Jan. 2015.
- [13]S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” in IEEE Conference on Computer Vision and Pattern Recognition, 2006, vol. 2, pp. 2169–2178.
- [14]P. Felzenszwalb, D. McAllester, and D. Ramanan, “A discriminatively trained, multiscale, deformable part model,” in IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
- [15]J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- [16]O. Russakovsky et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [17]Y. Lin et al., “Large-scale image classification: fast feature extraction and svm training,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1689–1696.
- [18]A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- [19]C. Szegedy et al., “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- [20]K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [21]K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
- [22]Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [23]L. Fei-Fei, A. Iyer, C. Koch, and P. Perona, “What do we perceive in a glance of a real-world scene?,” Journal of vision, vol. 7, no. 1, pp. 10–10, 2007.