  • Comparing file type detection libraries :: Apache Tika, JMimeMagic, SimpleMagic
    PROGRAMMING/Other 2021. 2. 1. 22:46

    When programming, you sometimes have to work with files.

    Several libraries can tell you what type a file is, so I compared a few of them myself.

     

    1. Test environment:: IntelliJ, Gradle, Kotlin, project SDK 15.0.2

     

    The libraries and their detection methods are as follows.

     

    1) Apache Tika (tika.apache.org/)

    ๋ฐฉ์‹:: FIle MetaData ์™€ ํŒŒ์ผ ๋‚ด์šฉ์„ ํŒŒ์‹ฑํ•ด์„œ ํ™•์ธ

    ํŠน์ง•:: ์ด์ „์—๋Š” ์˜์กด์„ฑ์ด ๋งŽ์•„์„œ ๋ถˆํŽธํ–ˆ์ง€๋งŒ, ๊พธ์ค€ํ•œ ์„ฑ๋Šฅ ๊ฐœ์„ ์œผ๋กœ ํ•˜๋‚˜์˜ dependency ๋งŒ ์ถ”๊ฐ€ํ•ด์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

    ์‚ฌ์šฉํ•œ dependency : org.apache.tika:tika-parsers:1.18

     

    ์œ ์˜ ์‚ฌํ•ญ:: ๊ทธ๋ƒฅ dependency ์— ์ถ”๊ฐ€ํ–ˆ๋‹ค๊ฐ€๋Š” ๊ธฐ์กด์˜ dependency ์™€ ์ถฉ๋Œ์„ ์ผ์œผํ‚ฌ ํ™•๋ฅ ์ด ๋†’์œผ๋‹ˆ, ๋ฒ„์ „์„ ๊ผญ ์ž˜ ํ™•์ธํ•  ๊ฒƒ !!

     

    (๋‚˜์˜ ๊ฒฝ์šฐ lombck ์ด 1.18.16 ์ด์—ˆ๋Š”๋ฐ, Apache Tika ๋ฅผ ์ด๊ฑฐ๋ž‘ ๋งž์ถฐ์ฃผ๋‹ˆ๊นŒ Conflict ๊ฐ€ ํ•ด๊ฒฐ๋๋‹ค.. ์ฐธ๊ณ )

     

    2) JMimeMagic (github.com/arimus/jmimemagic)

    ๋ฐฉ์‹:: ํŒŒ์ผ ํ™•์žฅ์ž & ํ—ค๋” ์ •๋ณด๋กœ ํ™•์ธ

    ์‚ฌ์šฉํ•œ dependency : net.sf.jmimemagic:jmimemagic:0.1.5

     

    3) SimpleMagic (github.com/j256/simplemagic)

    ๋ฐฉ์‹:: ํŒŒ์ผ ํ™•์žฅ์ž & ํ—ค๋” ์ •๋ณด๋กœ ํ™•์ธ

    ์‚ฌ์šฉํ•œ dependency : com.j256.simplemagic:simplemagic:1.16
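
    To make "header information" concrete: many formats start with a fixed byte sequence (a "magic number"), and a detector can compare the first bytes of a file against a table of known signatures. The sketch below uses only the Kotlin standard library. The names `signatures`, `sniff`, and `sniffFile` are made up for illustration, and the three-entry table is an assumption; it is not how JMimeMagic or SimpleMagic are implemented, and their rule sets are far larger.

```kotlin
import java.io.File

// Hypothetical signature table for illustration only.
val signatures: List<Pair<ByteArray, String>> = listOf(
    byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) to "image/png",
    "%PDF-".toByteArray(Charsets.US_ASCII) to "application/pdf",
    // docx and pptx are ZIP containers, so they share this header
    byteArrayOf(0x50, 0x4B, 0x03, 0x04) to "application/zip"
)

// Returns the MIME type whose signature is a prefix of the header, or null if none matches.
fun sniff(header: ByteArray): String? =
    signatures.firstOrNull { (magic, _) ->
        header.size >= magic.size && magic.indices.all { header[it] == magic[it] }
    }?.second

// Reads at most the first 8 bytes of a file and sniffs them.
fun sniffFile(file: File): String? =
    file.inputStream().use { input ->
        val buffer = ByteArray(8)
        val read = input.read(buffer)
        sniff(buffer.copyOf(maxOf(read, 0)))
    }
```

    Because docx and pptx both begin with the ZIP header, a header check alone cannot tell them apart; a detector has to look inside the container or fall back on the extension.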

     

     

    2. Test setup

     

    1) Add the dependencies to Gradle

    (Adjust these to match your own build.)

    implementation("org.apache.tika:tika-parsers:1.18")
    implementation("net.sf.jmimemagic:jmimemagic:0.1.5")
    implementation("com.j256.simplemagic:simplemagic:1.16")
    

     

    2) Write the test code

    (Path of the folder holding the files to test: E:\\tmp)

     

    2-1) Apache Tika test code

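    import java.io.File
    import org.apache.tika.Tika

    // Apache Tika: print the detected MIME type of every file under rootPath
    fun fileType1() {
        val rootPath = "E:\\tmp"
        val root = File(rootPath)
        require(root.exists()) { "no ${root.absolutePath}" }
        val tika = Tika()

        // walk() also yields directories, so keep regular files only
        root.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            println(tika.detect(it) + "\n")
        }
    }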
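
    The test above passes a File, so Tika sees both the name and the content. To see what each signal contributes on its own, the Tika facade also has overloads that take only a name or only the leading bytes. The following is a small sketch, assuming the same tika-parsers dependency is on the classpath; the file names and the helper `detectThreeWays` are hypothetical.

```kotlin
import org.apache.tika.Tika

// The 8-byte signature that every PNG file starts with
val pngHeader = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)

// Detects the same "file" from three different kinds of evidence.
fun detectThreeWays(tika: Tika = Tika()): List<String> = listOf(
    tika.detect("photo.png"),            // file name only, no bytes are read
    tika.detect(pngHeader),              // leading bytes only, no name
    tika.detect(pngHeader, "notes.txt")  // bytes plus a deliberately misleading name
)
```

    Printing the returned list shows which signal wins when the name and the content disagree.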

     

    2-2 JMimeMagic test code

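    import java.io.File
    import net.sf.jmimemagic.Magic
    import net.sf.jmimemagic.MagicMatch
    import net.sf.jmimemagic.MagicMatchNotFoundException

    // JMimeMagic: print the extension and MIME type of every file under rootPath
    fun fileType2() {
        val rootPath = "E:\\tmp"
        val root = File(rootPath)
        require(root.exists()) { "no ${root.absolutePath}" }

        // walk() also yields directories, so keep regular files only
        root.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            // Arguments: extensionHints = true, onlyMimeMatch = false.
            // getMagicMatch can throw when nothing matches, so treat that as "no match".
            val match: MagicMatch? = try {
                Magic.getMagicMatch(it, true, false)
            } catch (e: MagicMatchNotFoundException) {
                null
            }
            if (match == null) {
                println("file match x\n")
            } else {
                println("extension: ${match.extension} / mimeType: ${match.mimeType}\n")
            }
        }
    }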

     

    2-3 SimpleMagic test code

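    import java.io.File
    import com.j256.simplemagic.ContentInfo
    import com.j256.simplemagic.ContentInfoUtil

    // SimpleMagic: print the content type and MIME type of every file under rootPath
    fun fileType3() {
        val util = ContentInfoUtil()
        val rootPath = "E:\\tmp"
        val root = File(rootPath)
        require(root.exists()) { "no ${root.absolutePath}" }

        // walk() also yields directories, so keep regular files only
        root.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            // findMatch returns null when no rule matches the file
            val info: ContentInfo? = util.findMatch(it)
            if (info == null) {
                println("file match x \n")
            } else {
                println("contentType: ${info.contentType} / mimeType: ${info.mimeType} \n")
            }
        }
    }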
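
    The three functions above print their results separately. To collect them side by side, as in the table in section 4, the loop can be written once and the detectors passed in as functions. This is a sketch using only the standard library; `compareDetectors` is a hypothetical helper, and treating both a null result and an exception as "x" is my own simplification.

```kotlin
import java.io.File

// Hypothetical helper: runs every detector on every regular file under root.
// Each row holds the file name followed by one answer per detector ("x" when none).
fun compareDetectors(
    root: File,
    detectors: Map<String, (File) -> String?>
): List<List<String>> =
    root.walk()
        .filter { it.isFile }
        .sortedBy { it.path }
        .map { file ->
            listOf(file.name) + detectors.values.map { detect ->
                // A detector may return null or throw; both are reported as "x"
                runCatching { detect(file) }.getOrNull() ?: "x"
            }
        }
        .toList()
```

    Each library call from the test functions above can be wrapped in a lambda of type (File) -> String? and passed in the map, one entry per library.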

     

     

    3. Results

     

    3-1 Apache Tika

     

    3-2 JMimeMagic

    3-3 SimpleMagic

     

     

    4. Analysis of the results

     

    (Please note that these results apply only to the test code, sample files, and library versions used above; other files, code, or versions may give different results.)

     

    (Blue bold: identified correctly, by my own criteria)

    File type | Apache Tika | JMimeMagic | SimpleMagic
    .sin | text/plain | text/plain | x
    .html | text/html | text/html | text/html
    .png | image/png | image/png | image/png
    .cpp | text/x-csrc | text/plain | x
    .cpp (empty content, only the extension changed) | text/x-c++src | text/plain | x
    .pdf | application/pdf | application/pdf | application/pdf
    .css | text/css | text/plain | x
    .js | application/javascript | text/plain | x
    .pptx | [warning] application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation
    .txt | text/plain | text/plain | x
    .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document
    .hwp | application/x-hwp-v5 | application/msword | null

     

     

    For most of the files, Apache Tika was the most accurate.

    However, if you only handle files such as png, pptx, and docx, which all three libraries identified the same way, there is no real need to use Apache Tika.

     

    In short, pick whichever one fits your needs.

     
