  • Comparing File Type Detection Libraries :: Apache Tika, JMimeMagic, SimpleMagic
    PROGRAMMING/Etc. 2021. 2. 1. 22:46

    When programming, you sometimes have to work with files.

    Several libraries exist that determine what type a file is, so I compared a few of them myself.

     

    1. Test environment:: IntelliJ, Gradle, Kotlin, project SDK 15.0.2

     

    The libraries and the approach each one takes are as follows.

     

    1) Apache Tika (tika.apache.org/)

    Approach:: Parses the file metadata and the file content to determine the type

    Notes:: It used to be awkward to use because of its many dependencies, but after steady improvements it can now be used by adding a single dependency.

    Dependency used : org.apache.tika:tika-parsers:1.18

     

    Caution:: Simply adding it to your dependencies is likely to cause conflicts with dependencies you already have, so be sure to check the versions carefully!

     

    (For reference: in my case Lombok was at 1.18.16, and the conflict was resolved once I aligned the Apache Tika version with it.)
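
    One way to make such clashes visible early, sketched here under the assumption of a Gradle Kotlin DSL build (build.gradle.kts), is to have Gradle fail on version conflicts instead of resolving them silently. Whether this would have flagged the specific conflict described above is not something this post confirms.

```kotlin
// build.gradle.kts (fragment, assumes the Kotlin DSL)
configurations.all {
    resolutionStrategy {
        // Fail the build when different versions of the same module are requested,
        // instead of letting Gradle quietly pick one of them.
        failOnVersionConflict()
    }
}
```

    Running `./gradlew dependencyInsight --dependency tika --configuration runtimeClasspath` then shows which dependencies request a matching module and which version was selected.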

     

    2) JMimeMagic (github.com/arimus/jmimemagic)

    Approach:: Uses the file extension & header information to determine the type

    Dependency used : net.sf.jmimemagic:jmimemagic:0.1.5

     

    3) SimpleMagic (github.com/j256/simplemagic)

    Approach:: Uses the file extension & header information to determine the type

    Dependency used : com.j256.simplemagic:simplemagic:1.16
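
    To illustrate what "header information" means here, the following is a minimal sketch using only the Kotlin standard library. It is not how JMimeMagic or SimpleMagic are implemented; `sniffMagic` and the `signatures` list are hypothetical names made up for this example, and only three well-known signatures are included.

```kotlin
// Hypothetical helper, not part of any of the libraries compared here:
// maps a few well-known leading byte sequences ("magic numbers") to MIME types.
private val signatures: List<Pair<ByteArray, String>> = listOf(
    byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) to "image/png",
    "%PDF-".toByteArray(Charsets.US_ASCII) to "application/pdf",
    // ZIP container; .docx and .pptx files start with these bytes as well
    byteArrayOf(0x50, 0x4B, 0x03, 0x04) to "application/zip",
)

// Returns the MIME type of the first signature that prefixes `header`,
// or null when none of the known signatures matches.
fun sniffMagic(header: ByteArray): String? =
    signatures.firstOrNull { (magic, _) ->
        header.size >= magic.size && magic.indices.all { header[it] == magic[it] }
    }?.second
```

    Calling `sniffMagic(File(path).readBytes())` on a PNG or PDF file returns the matching MIME type, while plain source text such as a .cpp or .css file has no distinguishing leading bytes and yields null. That is presumably why extension hints or deeper content parsing are needed for such files.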

     

     

    2. Test setup

     

    1) Add the dependencies to Gradle

    (Adapt this to your own build.)

    implementation("org.apache.tika:tika-parsers:1.18")
    implementation("net.sf.jmimemagic:jmimemagic:0.1.5")
    implementation("com.j256.simplemagic:simplemagic:1.16")
    

     

    2) Write the test code

    (Path of the folder holding the files to test : E:\\tmp)

     

    2-1) Apache Tika test code

    import org.apache.tika.Tika
    import java.io.File

    fun fileType1() {
        // Apache Tika
        val rootPath = "E:\\tmp"
        val file = File(rootPath)
        if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")
        val tika = Tika()

        // Visit regular files only; directories have no content to detect
        file.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            println(tika.detect(it) + "\n")
        }
    }

     

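    2-2) JMimeMagic test code

    import net.sf.jmimemagic.Magic
    import net.sf.jmimemagic.MagicMatch
    import net.sf.jmimemagic.MagicMatchNotFoundException
    import java.io.File

    fun fileType2() {
        // JMimeMagic
        val rootPath = "E:\\tmp"
        val file = File(rootPath)
        if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")

        // Visit regular files only, skipping directories
        file.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            // Arguments: extensionHints = true, onlyMimeMatch = false
            val match: MagicMatch? = try {
                Magic.getMagicMatch(it, true, false)
            } catch (e: MagicMatchNotFoundException) {
                null // treat "no pattern matched" the same as a null result
            }
            if (match == null) {
                println("file match x\n")
            } else {
                println("extension: ${match.extension} / mimeType: ${match.mimeType}\n")
            }
        }
    }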

     

    2-3) SimpleMagic test code

    import com.j256.simplemagic.ContentInfo
    import com.j256.simplemagic.ContentInfoUtil
    import java.io.File

    fun fileType3() {
        // SimpleMagic
        val util = ContentInfoUtil()
        val rootPath = "E:\\tmp"
        val file = File(rootPath)
        if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")

        // Visit regular files only, skipping directories
        file.walk().filter { it.isFile }.forEach {
            println(it.absolutePath)
            // findMatch returns null when no magic entry matches the file
            val info: ContentInfo? = util.findMatch(it)
            if (info == null) {
                println("file match x \n")
            } else {
                println("contentType: ${info.contentType} / mimeType: ${info.mimeType} \n")
            }
        }
    }
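
    The three test functions differ only in the detection call, so the shared part can be factored out. The following is a minimal sketch using only the Kotlin standard library; `detectAll` is a hypothetical helper name, and the `detect` parameter stands in for a call such as `tika.detect(it)` or `util.findMatch(it)?.mimeType`.

```kotlin
import java.io.File

// Hypothetical helper: walks rootPath and applies `detect` to every regular file.
// Returns a map from each file's path (relative to the root) to the detected type.
fun detectAll(rootPath: String, detect: (File) -> String?): Map<String, String?> {
    val root = File(rootPath)
    require(root.exists()) { "no ${root.absolutePath}" }
    return root.walk()
        .filter { it.isFile }                  // skip directories, including the root itself
        .sortedBy { it.absolutePath }          // stable output order
        .associate { file ->
            // A detector that throws is recorded as null ("no match") for that file
            file.relativeTo(root).path to runCatching { detect(file) }.getOrNull()
        }
}
```

    For example, `detectAll("E:\\tmp") { it.extension }` maps every file to its extension, and swapping the lambda for one of the library calls runs the same walk with that library.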

     

     

    3. Results

     

    3-1 Apache Tika

     

    3-2 JMimeMagic

    3-3 SimpleMagic

     

     

    4. Analysis of the results

     

    (Please note that these results apply only to the test code, sample files, and library versions used above; they may differ with other files, code, or versions.)

     

    (Blue bold : results that, by my own standard, were identified correctly)

    File type | Apache Tika | JMimeMagic | SimpleMagic
    .sin | text/plain | text/plain | x
    .html | text/html | text/html | text/html
    .png | image/png | image/png | image/png
    .cpp | text/x-csrc | text/plain | x
    .cpp (empty content, only the extension changed) | text/x-c++src | text/plain | x
    .pdf | application/pdf | application/pdf | application/pdf
    .css | text/css | text/plain | x
    .js | application/javascript | text/plain | x
    .pptx | [warning] application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation
    .txt | text/plain | text/plain | x
    .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document
    .hwp | application/x-hwp-v5 | application/msword | null

     

     

    For most of the files, Apache Tika was the most accurate.

    That said, if you only deal with files such as png, pptx, and docx, on which all three libraries agreed, you can get by without Apache Tika.

     

    In short, pick whichever fits your needs.

     
