[X:AI] RetinaNet; Focal Loss for Dense Object Detection

Focal Loss for Dense Object Detection

paper| https://arxiv.org/abs/1708.02002

source code| GitHub - facebookresearch/Detectron: FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

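Object detection model: estimates object regions in an image, splits them into positive/negative samples according to an IoU threshold, and trains on those samples

But an image usually contains only a few objects, so positive samples (object regions) are far fewer than negative samples (background regions) ⇒ the large gap between the number of positive and negative samples causes the class imbalance problem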
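
The IoU-based split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's code: the box format `(x1, y1, x2, y2)`, the helper names `iou` and `label_candidate`, and the single 0.5 threshold are all assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_candidate(candidate, gt_boxes, threshold=0.5):
    """Return 1 (positive) if the candidate overlaps any ground-truth box
    with IoU >= threshold, else 0 (negative). The threshold is illustrative."""
    best = max((iou(candidate, gt) for gt in gt_boxes), default=0.0)
    return 1 if best >= threshold else 0


gt_boxes = [(0, 0, 10, 10)]  # one object in the image (hypothetical)
candidates = [(0, 0, 10, 10), (5, 0, 15, 10), (50, 50, 60, 60)]
labels = [label_candidate(c, gt_boxes) for c in candidates]
```

With one object and three candidates, only the first candidate is labeled positive; once the candidates number in the tens of thousands, the same rule produces the imbalance described above.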

Abstract

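The highest-accuracy object detectors to date are based on the two-stage approach popularized by R-CNN, where a classifier is applied to a sparse set of candidate object locations

In contrast, one-stage detectors, which are applied over a regular, dense sampling of possible object locations, have the potential to be faster and simpler, but have so far trailed the accuracy of two-stage detectors

This paper investigates why this is the case

Finding: the extreme foreground-background class imbalance encountered while training dense detectors is the central cause

Proposal: address the class imbalance by reshaping the standard cross entropy loss so that it down-weights the loss assigned to well-classified examples

The new Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training

To evaluate the effectiveness of the loss, the authors design and train a simple dense detector called RetinaNet

When trained with the focal loss, RetinaNet matches the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art (SOTA) two-stage detectors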

1. Introduction

Current SOTA object detectors are based on a two-stage, proposal-driven mechanism

As popularized in the R-CNN framework, the first stage generates a sparse set of candidate object locations, and the second stage uses a convolutional neural network to classify each candidate location as one of the foreground classes or as background

Through a sequence of advances (Fast R-CNN, Faster R-CNN, Mask R-CNN, etc.), this two-stage framework consistently achieves top accuracy on the challenging COCO benchmark

→ Could a simple one-stage detector achieve similar accuracy?

One-stage detectors are applied over a regular, dense sampling of object locations, scales, and aspect ratios

Recent work on one-stage detectors such as YOLO and SSD shows promising results, yielding faster detectors with accuracy within 10-40% of SOTA two-stage methods

This paper presents a one-stage object detector that matches the SOTA COCO AP of more complex two-stage detectors, such as the Feature Pyramid Network (FPN) or Mask R-CNN variants of Faster R-CNN

Class imbalance during training is identified as the main obstacle that keeps one-stage detectors from reaching SOTA accuracy, and a new loss function is proposed to remove it

In R-CNN-like detectors, class imbalance is addressed by a two-stage cascade and sampling heuristics

The proposal stage (e.g., Selective Search, EdgeBoxes, DeepMask, RPN) rapidly narrows the number of candidate object locations down to a small number (e.g., 1-2k), filtering out most background samples

In the second classification stage, sampling heuristics such as a fixed foreground-to-background ratio (1:3) or online hard example mining (OHEM) are applied to keep a balance between foreground and background

In contrast, a one-stage detector must process a much larger set of candidate object locations regularly sampled across an image

In practice this amounts to enumerating ~100k locations that densely cover spatial positions, scales, and aspect ratios
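
A rough count shows where a figure on the order of 100k comes from. The sketch below assumes the configuration described in the paper's RetinaNet section (pyramid levels P3 to P7, i.e. strides 8 to 128, and 9 anchors per spatial position); the function name `count_anchors`, the rounding of feature-map sizes, and the input sizes are hypothetical choices for illustration.

```python
import math

# Assumed configuration: feature pyramid strides for levels P3..P7 and
# 9 anchors (3 scales x 3 aspect ratios) per spatial position.
STRIDES = (8, 16, 32, 64, 128)
ANCHORS_PER_POSITION = 9


def count_anchors(height, width, strides=STRIDES, per_position=ANCHORS_PER_POSITION):
    """Total number of anchor boxes a dense detector scores for one image."""
    total = 0
    for s in strides:
        # Feature map size at this level (rounding up is an assumption).
        fh, fw = math.ceil(height / s), math.ceil(width / s)
        total += fh * fw * per_position
    return total


n_small = count_anchors(640, 640)  # 76,725 candidate boxes
n_large = count_anchors(800, 800)  # 120,087 candidate boxes
```

Even a single mid-sized image yields roughly 10^5 candidates under these assumptions, almost all of which are background.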

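Similar sampling heuristics can also be applied, but they are inefficient because the training procedure is still dominated by easily classified background examples

This inefficiency is a classic problem in object detection that is typically addressed with techniques such as bootstrapping or hard example mining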

Figure 1. A new loss, called the Focal Loss, that adds a factor $(1-p_t)^{\gamma}$ to the standard cross entropy criterion. Setting $\gamma > 0$ reduces the relative loss for well-classified examples ($p_t > .5$) and puts more focus on misclassified examples. The proposed focal loss makes it possible to train highly accurate dense object detectors in the presence of vast numbers of easy background examples

This paper proposes a new loss function to deal with class imbalance

loss function → scaled cross entropy loss

The scaling factor decays to zero as confidence in the correct class increases (see Figure 1)

Intuitively, this scaling factor automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples

Experiments show that the proposed Focal Loss trains a high-accuracy one-stage detector that outperforms the alternatives of training with sampling heuristics or hard example mining, the previous SOTA techniques for training one-stage detectors

Finally, the exact form of the focal loss is not crucial: other instantiations can achieve similar results

Figure 2. Speed (ms) versus accuracy (AP) on COCO test-dev

Enabled by the focal loss, the simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including Faster R-CNN

Blue line: RetinaNet with ResNet-50-FPN / orange line: RetinaNet with ResNet-101-FPN

Ignoring the low-accuracy regime (AP<25), RetinaNet forms an upper envelope of all the detectors / an improved variant (not shown) achieves 40.8 AP

Details are given in Section 5

To demonstrate the effectiveness of the proposed focal loss, the authors design a simple one-stage object detector called RetinaNet

→ dense sampling of object locations in the input image

Uses an in-network feature pyramid and anchor boxes

Draws on a variety of ideas from SSD, Faster R-CNN, and Feature Pyramid Networks

RetinaNet ⇒ efficient and accurate

The best model is based on a ResNet-101-FPN backbone → achieves a COCO test-dev AP of 39.1 → surpasses the previously published single-model results of both one-stage and two-stage detectors (see Figure 2)

2. Related Work

Classic Object Detectors:

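The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history

One of the earliest successes → e.g., applying convolutional neural networks to handwritten digit recognition

There were many others as well → the sliding-window approach was the leading detection paradigm in classic computer vision, but two-stage detectors quickly came to dominate object detection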

Two-stage Detectors:

Modern object detection is mostly based on a two-stage approach

As pioneered in the Selective Search work,

First stage → generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations

Second stage → classifies the proposals into foreground classes / background

R-CNN upgraded the second-stage classifier to a convolutional network → large gains in accuracy, ushering in the modern era of object detection

R-CNN has since been improved both in speed and by using learned object proposals

Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework

Many extensions to this framework have followed

One-stage Detectors:

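OverFeat → one of the first modern one-stage object detectors based on deep networks

More recently, SSD and YOLO have renewed interest in one-stage methods

SSD has 10-20% lower AP than two-stage methods, while YOLO focuses on an even more extreme speed/accuracy trade-off (see Figure 2)

Recent work showed that two-stage detectors can be made faster simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget

In contrast, the aim of this paper is to understand whether a one-stage detector can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds

The RetinaNet detector shares many similarities with previous dense detectors

In particular, the concept of anchors introduced by RPN and the use of feature pyramids as in SSD and FPN

The point emphasized is that this simple detector achieves its strong results thanks to the new loss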

Class Imbalance:

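Both classic one-stage object detection methods, such as boosted detectors and DPMs, and more recent methods such as SSD face a large class imbalance during training

These detectors evaluate $10^4$-$10^5$ candidate locations per image, but only a few of those locations contain objects

→ Problems caused by class imbalance

Training is inefficient, because most locations are easy negatives that contribute no useful learning signal
En masse, the easy negatives can overwhelm training and degrade the model

A common solution is to perform some form of hard negative mining during training, or to use more complex sampling/reweighing schemes

In contrast, the proposed focal loss naturally handles the class imbalance faced by a one-stage detector and makes it possible to train efficiently on all examples, without sampling and without easy negatives overwhelming the loss and computed gradients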

Robust Estimation:

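There has been much interest in designing robust loss functions that reduce the contribution of outliers by down-weighting the loss of examples with large errors (hard examples)

In contrast, rather than addressing outliers, the focal loss is designed to address class imbalance by down-weighting inliers (easy examples) so that their contribution to the total loss is small even if their number is large

→ In other words, the focal loss plays the opposite role of a robust loss: it focuses training on a sparse set of hard examples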

3. Focal Loss

Designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training (e.g., 1:1000)

Start from the cross entropy (CE) loss for binary classification

$$CE(p,y) = \begin{cases} -\log(p) & \text{if } y=1 \\ -\log(1-p) & \text{otherwise.} \end{cases}$$

Here $y \in \{\pm1\}$ specifies the ground-truth class and $p \in [0,1]$ is the model's estimated probability for the class with label $y=1$

For notational convenience, define $p_t$ and write $CE(p,y) = CE(p_t) = -\log(p_t)$

$$p_t = \begin{cases} p & \text{if } y=1 \\ 1-p & \text{otherwise.} \end{cases}$$

The CE loss is the blue curve in Figure 1

One notable property of this loss, easily seen in the plot, is that even easily classified examples ($p_t \gg .5$) incur a loss of non-trivial magnitude

When summed over a large number of easy examples, these small loss values can overwhelm the rare class
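
A small back-of-the-envelope calculation makes this concrete. The numbers below are made up for illustration (100 hard foreground examples against 100,000 easy background examples, with hypothetical $p_t$ values of 0.1 and 0.99); only the formula $CE(p_t) = -\log(p_t)$ comes from the text.

```python
import math


def cross_entropy(p_t):
    """CE(p_t) = -log(p_t), where p_t is the probability of the true class."""
    return -math.log(p_t)


# Hypothetical batch with a 1:1000 foreground-to-background ratio.
n_hard, p_t_hard = 100, 0.1        # poorly classified foreground examples
n_easy, p_t_easy = 100_000, 0.99   # confidently, correctly classified background

hard_total = n_hard * cross_entropy(p_t_hard)
easy_total = n_easy * cross_entropy(p_t_easy)
easy_share = easy_total / (easy_total + hard_total)
```

Under these assumed numbers each easy example costs only about 0.01, yet together the easy examples account for roughly four fifths of the total loss.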

3.1 Balanced Cross Entropy

A common method for addressing class imbalance is to introduce a weighting factor $\alpha \in [0,1]$ for class 1 and $1-\alpha$ for class -1

In practice, $\alpha$ may be set by inverse class frequency or treated as a hyperparameter set by cross validation

For notational convenience, $\alpha_t$ is defined analogously to how $p_t$ was defined

α-balanced CE loss

$$CE(p_t) = -\alpha_t \log(p_t).$$

This loss is a simple extension of CE that serves as the experimental baseline for the proposed focal loss
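
As a minimal sketch of the α-balanced CE for a single example (the function name `alpha_balanced_ce` and the value $\alpha = 0.75$ are assumptions chosen only to show the weighting):

```python
import math


def alpha_balanced_ce(p, y, alpha):
    """alpha-balanced cross entropy for one example.

    p     : model probability for the class y = 1
    y     : ground-truth label, +1 or -1
    alpha : weight for class +1; class -1 gets (1 - alpha)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * math.log(p_t)


loss_pos = alpha_balanced_ce(0.5, +1, alpha=0.75)  # class +1 weighted by 0.75
loss_neg = alpha_balanced_ce(0.5, -1, alpha=0.75)  # class -1 weighted by 0.25
```

Note that $\alpha$ re-weights by class only: two examples of the same class receive the same weight regardless of how easy or hard they are.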

3.2 Focal Loss Definition

$$FL(p_t) = -(1-p_t)^{\gamma} \log(p_t).$$

$$FL(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$$
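
The two forms above differ only in the $\alpha_t$ weight. As a minimal sketch (not the Detectron implementation), both can be written for a single example in plain Python; the function name `focal_loss` and the example probabilities are assumptions, while $\gamma = 2$ and $\alpha = 0.25$ are the settings the paper reports as working best.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=None):
    """Focal loss for one example.

    p     : model probability for the class y = 1
    y     : ground-truth label, +1 or -1
    gamma : focusing parameter; gamma = 0 recovers cross entropy
    alpha : optional class weight for y = 1 (1 - alpha for y = -1)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = 1.0 if alpha is None else (alpha if y == 1 else 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


ce_easy = focal_loss(0.9, +1, gamma=0.0)  # plain cross entropy
fl_easy = focal_loss(0.9, +1, gamma=2.0)  # scaled by (1 - 0.9)^2 = 0.01
ce_hard = focal_loss(0.1, +1, gamma=0.0)
fl_hard = focal_loss(0.1, +1, gamma=2.0)  # scaled by (1 - 0.1)^2 = 0.81
fl_balanced = focal_loss(0.9, +1, gamma=2.0, alpha=0.25)
```

With $\gamma = 2$, the well-classified example ($p_t = 0.9$) contributes 100× less than under CE, while the hard example ($p_t = 0.1$) keeps 81% of its CE loss, so the hard examples end up carrying most of the total.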

3.3 Class Imbalance and Model Initialization

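By default, binary classification models are initialized to output $y=-1$ or $1$ with equal probability

Under such an initialization, in the presence of class imbalance, the loss from the frequent class can dominate the total loss and cause instability early in training

To counter this, the authors introduce the concept of a "prior" for the value of p estimated by the model for the rare class (foreground) at the start of training

The prior is denoted by $\pi$ and set so that the model's estimated p for examples of the rare class is low (e.g., 0.01)

→ Note that this is a change in model initialization (see Section 4.1), not in the loss function

+ This was found to improve training stability for both cross entropy and focal loss in the case of heavy class imbalance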
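
One way to realize such a prior, following the initialization the paper describes in its Section 4.1, is to set the bias of the final classification layer to $b = -\log((1-\pi)/\pi)$ so that the sigmoid output starts near $\pi$. The sketch below only checks that arithmetic; `sigmoid` and `bias_for_prior` are hypothetical helper names, and the final-layer weights are assumed to start close to zero.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bias_for_prior(pi):
    """Bias b such that sigmoid(b) == pi, i.e. b = -log((1 - pi) / pi)."""
    return -math.log((1.0 - pi) / pi)


pi = 0.01
b = bias_for_prior(pi)   # about -4.6
initial_p = sigmoid(b)   # foreground probability at the start of training
```

With every location starting at a foreground probability near 0.01, a background example has $p_t \approx 0.99$ and incurs only a small loss, so the many negatives do not produce a large, destabilizing loss in the first iterations.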

3.4 Class Imbalance and Two-stage Detectors

Two-stage detectors are usually trained with the cross entropy loss, without α-balancing or the focal loss

Instead, they address class imbalance through two mechanisms

two-stage cascade
biased mini-batch sampling

The first cascade stage is an object proposal mechanism that reduces the nearly infinite set of possible object locations down to 1,000 or 2,000

Importantly, the selected proposals are not random but are likely to correspond to true object locations, which removes the vast majority of easy negatives

When training the second stage, biased sampling is typically used to construct mini-batches that contain a 1:3 ratio of positive to negative examples

This ratio acts like an implicit $\alpha$-balancing factor implemented via sampling

The proposed focal loss is designed to address these mechanisms in a one-stage detection system directly via the loss function
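
The biased mini-batch sampling used by two-stage detectors can be sketched as follows. This is an illustrative version, not code from any specific detector: the function name `sample_minibatch`, the batch size of 64, and the proposal counts are assumptions.

```python
import random


def sample_minibatch(pos_ids, neg_ids, batch_size=64, pos_fraction=0.25, seed=0):
    """Draw a mini-batch with at most `pos_fraction` positives (1:3 when 0.25).

    If there are too few positives, the remainder is filled with negatives.
    """
    rng = random.Random(seed)
    n_pos = min(len(pos_ids), int(batch_size * pos_fraction))
    n_neg = min(len(neg_ids), batch_size - n_pos)
    return rng.sample(pos_ids, n_pos), rng.sample(neg_ids, n_neg)


# Hypothetical proposals: 100 positives among 2,000 candidates.
pos_ids = list(range(100))
neg_ids = list(range(100, 2000))
batch_pos, batch_neg = sample_minibatch(pos_ids, neg_ids)
```

Although positives make up 5% of the proposals here, they make up 25% of the mini-batch, which is the sense in which the ratio behaves like an implicit $\alpha$.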

4. RetinaNet Detector

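RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks

The backbone is responsible for computing a convolutional feature map over the entire input image and is an off-the-shelf convolutional network

The first subnet performs convolutional object classification on the backbone's output

The second subnet performs convolutional bounding box regression

The two subnetworks feature a simple design that the authors propose specifically for one-stage, dense detection (see Figure 3)

There are many possible choices for the details of these components, but most design parameters are not particularly sensitive to the exact values shown in the experiments

Each component of RetinaNet is described next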

Feature Pyramid Network Backbone:

Anchors:

Classification Subnet:

Box Regression Subnet:

4.1 Inference and Training

Focal Loss:

Initialization:

Optimization:

5. Experiments

5.1 Training Dense Detection

Network Initialization:

Balanced Cross Entropy:

Focal Loss:

Analysis of the Focal Loss:

Online Hard Example Mining (OHEM):

Hinge Loss:

5.2 Model Architecture Design

Anchor Density:

Speed versus Accuracy:

5.3 Comparison to State of the Art

6. Conclusion

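In this work, class imbalance is identified as the primary obstacle preventing one-stage object detectors from surpassing top-performing two-stage methods

The source code is also publicly available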

reference| https://herbwood.tistory.com/19

reference| https://deep-learning-study.tistory.com/504

reference| https://velog.io/@cha-suyeon/Focal-Loss-for-Dense-Object-Detection
