#SRE論文紹介 (SRE paper introduction): Detection is Better Than Cure: A Cloud Incidents Perspective
V. Ganatra et al., ESEC/FSE'23
Waroom Meetup #1
https://topotal.connpass.com/event/317285/
Ganatra V, Parayil A, Ghosh S, Kang Y, Ma M, Bansal C, Nath S, Mace J. Detection Is Better Than Cure: A Cloud Incidents Perspective. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2023 (pp. 1891-1902).
Yuuki Tsubouchi (yuuk1)
June 04, 2024
Transcript
Title slide: #SRE論文紹介 (SRE paper introduction)
Yuuki Tsubouchi / @yuuk1t, Topotal Technology Advisor
Waroom Meetup #1, 2024/06/04
Detection is Better Than Cure: A Cloud Incidents Perspective, V. Ganatra et al., ESEC/FSE'23
Slide 3: Metadata of the paper being introduced
・Authors: affiliated with Microsoft India, China, and US respectively
・Venue: ESEC/FSE, a top conference in software engineering (CORE Rank A)
・Year: 2023
Note: Among the mega cloud vendors, Microsoft has published many papers in the field of incident management.
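Slide 4: The problem in incident detection: Monitoring Gap
(1) Monitors are missing (miss-detection, i.e. overlooked incidents)
・The early symptoms of an incident cannot be detected
(2) Unnecessary monitors exist
・After an incident is resolved, monitors tend to be created ad hoc
・This triggers alert storms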
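Slide 5: Isn't anyone doing something similar?
Empirical studies of large-scale cloud services
・Identified the common root causes of software bugs from an analysis of Azure incidents [Liu+, HotOS2019]
・Identified shared root causes and mitigations from an analysis of Teams incidents [Ghosh+, SoCC2022]
Empirical studies that analyze alert storms
・Analyzed alert storms in a large-scale banking system [Zhao+, ICSE-SEIP2020]
・Proposed an algorithm for selecting representative alerts [Zhao+, ICSE-SEIP2020]
There are no empirical studies on miss-detection (overlooked incidents).
https://x.com/yuuk1t/status/1648558134481547264
[Ghosh+, SoCC2022]: How to fight production incidents? An empirical study on a large-scale cloud service.
[Liu+, HotOS2019]: What bugs cause production cloud incidents?
[Zhao+, ICSE-SEIP2020]: Understanding and handling alert storm for online service systems.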
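Slide 8: The six major causes of miss-detection
(1) Missing/improper signal: the required telemetry does not exist
(2) Missing monitor/alert: the telemetry exists, but there is no monitor or alert on it
(3) Improper monitor coverage: the monitor does not cover the incident
(4) Incorrect alerting logic: the logic is inappropriate, for example a threshold that is too high
(5) Buggy monitor: a bug in the monitor configuration (for example, not using the new version of a metric)
(6) Others: the alert documentation (runbook) is missing or wrong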
Slide 10: How large is the impact of miss-detection?
・27.5% of miss-detections led to a service outage
・For errors in alerting logic or documentation, more than 40% led to an outage
Figure 5: (a) Proportion of incidents from each miss-detection class that led to outages
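The per-class outage proportion plotted in Figure 5 (a) can be computed from labeled incident records. The following is a minimal sketch, assuming each record is a pair of a miss-detection class and a flag saying whether the incident led to an outage. The names `MissDetectionClass` and `outage_proportions` and the sample records are hypothetical and are not data from the paper.

```python
from collections import Counter
from enum import Enum


class MissDetectionClass(Enum):
    """The six miss-detection classes listed on slide 8."""
    MISSING_SIGNAL = "Missing/improper signal"
    MISSING_MONITOR = "Missing monitor/alert"
    IMPROPER_COVERAGE = "Improper monitor coverage"
    INCORRECT_LOGIC = "Incorrect alerting logic"
    BUGGY_MONITOR = "Buggy monitor"
    OTHERS = "Others"


def outage_proportions(incidents):
    """Return {class: fraction of that class's incidents that led to an outage}.

    `incidents` is an iterable of (MissDetectionClass, led_to_outage) pairs.
    Classes with no incidents are left out of the result.
    """
    totals = Counter()
    outages = Counter()
    for cls, led_to_outage in incidents:
        totals[cls] += 1
        if led_to_outage:
            outages[cls] += 1
    return {cls: outages[cls] / totals[cls] for cls in totals}


# Made-up records for illustration only; these are not figures from the paper.
sample = [
    (MissDetectionClass.INCORRECT_LOGIC, True),
    (MissDetectionClass.INCORRECT_LOGIC, False),
    (MissDetectionClass.MISSING_MONITOR, False),
    (MissDetectionClass.MISSING_MONITOR, False),
    (MissDetectionClass.MISSING_MONITOR, True),
    (MissDetectionClass.MISSING_MONITOR, False),
]
proportions = outage_proportions(sample)
```

With these made-up records, half of the "Incorrect alerting logic" incidents and a quarter of the "Missing monitor/alert" incidents count as outages.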
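Slide 11: How large is the impact of miss-detection?
Figure 5: (b) Time to Detect (TTD) and Time to Mitigate (TTM) for cloud incidents that were not detected properly.
・TTD is largest when the monitor or telemetry is missing
・TTM becomes especially high when telemetry or documentation is missing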
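As a reminder of what the two metrics measure, here is a minimal sketch of computing TTD and TTM from incident timestamps. It assumes both are measured from the start of customer impact, which is one common convention; the slides do not state the paper's exact definition. The `Incident` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started_at: datetime    # when the impact began
    detected_at: datetime   # when a monitor or a human noticed it
    mitigated_at: datetime  # when the impact was mitigated


def time_to_detect(incident: Incident) -> timedelta:
    # Assumption: TTD is measured from the start of impact to detection.
    return incident.detected_at - incident.started_at


def time_to_mitigate(incident: Incident) -> timedelta:
    # Assumption: TTM is also measured from the start of impact.
    return incident.mitigated_at - incident.started_at


# Made-up timestamps for illustration.
incident = Incident(
    started_at=datetime(2024, 6, 4, 10, 0),
    detected_at=datetime(2024, 6, 4, 10, 45),
    mitigated_at=datetime(2024, 6, 4, 13, 0),
)
ttd = time_to_detect(incident)
ttm = time_to_mitigate(incident)
```

Under this convention a late detection inflates both numbers, which is consistent with missing monitors or telemetry showing the largest TTD.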
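Slide 13: How can we do better monitoring?
First incident
1. A DB in one region went down
2. Services in the same region could no longer open new connections to the DB
3. DB connection failures were not being monitored, so the incident could not be detected
4. As an action item, a monitor for DB connections was added
Second incident
1. In another service, enqueueing took a long time and jobs piled up
2. The service ran into SQL timeouts
3. The incident was not detected by any monitor and was observed manually
The second service shared more than 40% of its dependencies with the first service and used a large number of common monitors.
Improvement: with knowledge of the dependencies and monitors common to the two services, adding the monitor could have been proposed automatically.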
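The idea on this slide, proposing monitors to a service based on what a similar service already watches, can be sketched as follows. This is an illustrative sketch and not the paper's method: the `Service` record, the `suggest_monitors` function, and the use of 0.4 as a cut-off (echoing the "more than 40% shared dependencies" observation) are all assumptions, as are the service and monitor names.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical service description; names are illustrative only."""
    name: str
    dependencies: set = field(default_factory=set)
    # Maps monitor name -> the dependency that the monitor watches.
    monitors: dict = field(default_factory=dict)


def suggest_monitors(source: Service, target: Service, threshold: float = 0.4):
    """Suggest monitors from `source` that `target` lacks.

    Suggestions are made only when the share of `target`'s dependencies that
    `source` also has reaches `threshold`, and only for monitors that watch
    a dependency both services share.
    """
    if not target.dependencies:
        return []
    shared = source.dependencies & target.dependencies
    if len(shared) / len(target.dependencies) < threshold:
        return []
    return sorted(
        monitor
        for monitor, dependency in source.monitors.items()
        if dependency in shared and monitor not in target.monitors
    )


# Made-up services loosely modelled on the two incidents above.
first = Service(
    name="first-service",
    dependencies={"regional-db", "cache", "auth", "queue"},
    monitors={
        "db_connection_failures": "regional-db",  # added after incident 1
        "cache_hit_rate": "cache",
    },
)
second = Service(
    name="second-service",
    dependencies={"regional-db", "queue", "blob-store"},
    monitors={"queue_depth": "queue"},
)
suggestions = suggest_monitors(first, second)
```

Here the second service shares two of its three dependencies with the first, so the DB connection monitor added after the first incident is suggested before the second incident occurs. The cache monitor is not suggested because the second service does not depend on the cache.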
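Slide 15: Thoughts: toward better incident management
[Diagram of the incident management cycle. Labels: monitor configuration, monitoring, alert, detection, on-call, incident response, resolution, retrospective, action, incident metrics.]
・Accumulate data on wasted alerts/calls and on miss-detections
・Use it for things such as automatically proposing useful and unnecessary monitors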