Logにおいて、ある行とある行の時間差を集計する

背景

レガシーなシステムで、レスポンスタイムを出していない箇所のレスポンスタイムのスパイク(急上昇)があるか知りたい。

解決方法

bashで無理やり。

個人的にワンライナーを好まない（後々読めないし読める人も限られちゃうし、満足に書けもしない）ので.shファイルを書く。

結果

※新規登録・変更処理の開始ログと、その共用の終了ログを取得し、経過時間を出力している。

#!/bin/bash
timegrep='[0-2][0-9]:[0-5][0-9]:[0-5][0-9]'

targetFiles=`find /home/app/logs/2019/12/ -type f`
search='Request with URI = "hogehoge/complete"|Request with URI = "hogehoge/changeComplete"|complete.php\(846\)\] execComplete end'

convert_seconds () {
  declare -a timeset=();
  timeset=(${1//:/ })
  hour=`expr ${timeset[0]} \* 60 \* 60`
  minutes=`expr ${timeset[1]} \* 60`
  seconds=${timeset[2]}

  result=`expr $hour + $minutes + $seconds`
  echo $result 
}

declare -a results=()
while read file; do
  # echo "files: ${file}"
  targets=`cat ${file} | grep -P "${search}"`
  while read line; do
    # begin(insert)
    if [[ $line == *hogehoge/complete* ]]; then
      begintime=`echo "$line" | grep -Po "$timegrep"`
      results+=( "I,${begintime}" )
    fi
    # begin(change)
    if [[ $line == *changeComplete* ]]; then
      begintime=`echo "$line" | grep -Po "$timegrep"`
      results+=( "C,${begintime}" )
    fi
    # end
    if [[ $line == *execComplete* ]]; then
      endtime=`echo "$line" | grep -Po "$timegrep"`
      index=`echo ${#results[@]}`
      value=${results[$index-1]}
      results[$index-1]="${value},${endtime}"
    fi
  done < <(echo "$targets")
done < <(echo "$targetFiles")

# 開始・終了時間の差分計算結果
for timeset in ${results[@]}; do
  # split
  declare -a resultArray=();
  resultArray=(${timeset//,/ })
  
  index=`echo ${#resultArray[@]}`
  if [ $index -eq 3 ]; then
    # 計算可能なフォーマット
    beginTime=`convert_seconds ${resultArray[1]}`
    endTime=`convert_seconds ${resultArray[2]}`
    subTime=`expr $endTime - $beginTime`
    echo "${resultArray[0]} $subTime"
    # else
    # echo "INVALID FORMAT:${timeset}"
  fi
done

手がかり1

まず、レスポンスタイムを計測するために必要な開始点と終了点を決める。
当然、それぞれ1トランザクションで一意でなければならない。
例えば、下記をそれぞれ開始・終了とみなす。

#開始
INFO - 2020-01-14 00:33:26 --> Fuel\Core\Request::__construct - Creating a new Request with URI = "hogehoge/complete"

#終了
INFO - 2020-01-14 00:33:35 --> execComplete end

これを一つのgrepで引っ掛けられるようにする。

grep -P 'Request with URI = "hogehoge/complete"|execComplete end' ./2020/01/14.php

これによって、エラーを考えなければ開始・終了のセットが手に入ることになる。開始->開始になる場合の考慮は別途必要。

手がかり2

開始、終了のログが手に入れば、次はそれを秒数に変換する必要がある。そのためにはまず、開始と終了時間という単位に情報を加工しなければならない。
例えば、下記のように書ける。

timegrep='[0-2][0-9]:[0-5][0-9]:[0-5][0-9]'
if [[ $line == *reserve/complete* ]]; then
  begintime=`echo "$line" | grep -Po "$timegrep"`
  results+=( "I,${begintime}" )
fi
# end
if [[ $line == *execComplete* ]]; then
  endtime=`echo "$line" | grep -Po "$timegrep"`
  index=`echo ${#results[@]}`
  value=${results[$index-1]}
  results[$index-1]="${value},${endtime}"
fi

Grepの結果を1行ずつ読み出し、それを開始・終了ごとに処理する。
開始では配列に値を保存し、終了では末尾の要素に終了時間を付け足している。

これによって、終了ログがない(途中でエラーが起きたとか)の場合は終了時間がなくなるため、フォーマットエラーを判断することができる。
ちなみにgrepのオプションPは正規表現を、oは一致箇所のみを取得を意味する。今回の場合はtimegrepで指定している時間フォーマットだけを検索し、値を取得している。

手がかり3

ここまでで開始・終了時間の組を手にすることができた。
次は、それぞれの時間を秒数に変換する。

変換方法は単純なものである。(日またぎは考慮できないが)

convert_seconds () {
  declare -a timeset=();
  timeset=(${1//:/ })
  hour=`expr ${timeset[0]} \* 60 \* 60`
  minutes=`expr ${timeset[1]} \* 60`
  seconds=${timeset[2]}

  result=`expr $hour + $minutes + $seconds`
  echo $result 
}

：をdelimiterとして用い、それぞれを秒数に変換し和を取る関数である。
これを開始、終了時間に対し用い、終了時間-開始時間を計算すれば良い。

なお、exprを使用すれば01 * 03のような計算でも正しく処理することができる。

これを用いて下記のようにして計算結果をEchoしてやる。
あとは実行結果をリダイレクトすれば、計算結果の入ったテキストファイルが手に入る。

# 開始・終了時間の差分計算結果
for timeset in ${results[@]}; do
  # split
  declare -a resultArray=();
  resultArray=(${timeset//,/ })
  
  index=`echo ${#resultArray[@]}`
  if [ $index -eq 3 ]; then
    # 計算可能なフォーマット
    beginTime=`convert_seconds ${resultArray[1]}`
    endTime=`convert_seconds ${resultArray[2]}`
    subTime=`expr $endTime - $beginTime`
    echo "${resultArray[0]} $subTime"
  fi
done

ヒストグラム

問題は解決したが、このままでは人に見せるには不十分である。かといってエクセルやら高等なプログラミング言語を持ち出すほどのものでもない。

そんなときはシェルスクリプトでヒストグラムを作れば良い。下記が参考になった。

https://qiita.com/kkdd/items/88581a2d04843c3c7ce7

これを使用すれば、下記のようなワンライナーでヒストグラムを出力できる。
(grep ‘I’はいろんな都合の結果、新規登録と予約処理のレスポンス結果を一緒にとってしまったのでここで出し分けるために使っている。)

awkのprintfのフォーマットに文字列(%s)を使用した場合、は見た目が狂う可能性がある。数値とわかっているなら%dを使ったほうが良い。

bash-4.1$ cat response.txt | grep 'I' | awk '{print $2}' | sort | uniq -c | sort -k1nr | awk '{printf "%4d %5d\n",$2,$1;}' | awk '!max{max=$2;}{f=50/max;if(f>1)f=1;i=$2*f;r="";while(i-->0)r=r"#";printf "%s %s\n",$0,r;}'
               5 1197 ##################################################
               6  986 ##########################################
               4  552 ########################
               7  430 ##################
               3  137 ######
               8   83 ####
               9   20 #
              10    7 #
               1    3 #
               2    3 #
              11    1 #
              35    1 #

というわけで、このレガシーなシステムは普段は5秒前後で安定しているものの、最悪30秒を超えるスパイクを出すことがわかった。

あとは、その最悪値を出した原因を追うことになったりすると思うが、それは適当に途中の処理をEchoして該当の行数・ファイルを見つけてやれば良い。

Chalkboard