
File Type Detection Library Comparison :: Apache Tika, JMimeMagic, SimpleMagic

2021. 2. 1. 22:46

When programming, you sometimes have to work with files.

Several libraries exist that identify what type a file is, so I compared them myself.

 

1. Test environment:: IntelliJ, Gradle, Kotlin, project SDK 15.0.2

 

The libraries and the approach each one takes are as follows.

 

1) Apache Tika (tika.apache.org/)

Approach:: parses the file metadata and the file contents

Characteristics:: it used to be inconvenient because of its many dependencies, but after steady performance improvements it can now be used by adding a single dependency.

Dependency used : org.apache.tika:tika-parsers:1.18

 

Caution:: simply adding it to your dependencies is likely to cause a conflict with your existing dependencies, so be sure to check the versions carefully!

 

(In my case Lombok was at 1.18.16, and aligning the Apache Tika version with it resolved the conflict. Just for reference.)

 

2) JMimeMagic (github.com/arimus/jmimemagic)

Approach:: checks the file extension & header information

Dependency used : net.sf.jmimemagic:jmimemagic:0.1.5

 

3) SimpleMagic (github.com/j256/simplemagic)

Approach:: checks the file extension & header information

Dependency used : com.j256.simplemagic:simplemagic:1.16
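To make "header information" concrete, here is a minimal sketch of header (magic number) sniffing using only the Kotlin standard library. It is not how JMimeMagic or SimpleMagic are implemented internally; the `signatures` table and the `sniff` / `sniffFile` names are hypothetical, and only three well-known signatures are listed.

```kotlin
import java.io.File

// Hypothetical signature table: leading bytes -> MIME type.
// Real libraries ship far larger tables; these three are for illustration only.
val signatures: List<Pair<ByteArray, String>> = listOf(
    byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A) to "image/png",
    "%PDF-".toByteArray(Charsets.US_ASCII) to "application/pdf",
    byteArrayOf(0x50, 0x4B, 0x03, 0x04) to "application/zip"
)

// Returns the MIME type whose signature is a prefix of `header`, or null if none matches.
fun sniff(header: ByteArray): String? =
    signatures.firstOrNull { (magic, _) ->
        header.size >= magic.size && magic.indices.all { header[it] == magic[it] }
    }?.second

// Reads at most the first 8 bytes of a file and sniffs them.
fun sniffFile(file: File): String? =
    file.inputStream().use { input ->
        val buf = ByteArray(8)
        val n = input.read(buf)            // -1 for an empty file
        sniff(buf.copyOf(maxOf(n, 0)))
    }
```

Note that .docx and .pptx files are ZIP containers, so a check like this sees only the ZIP signature; telling them apart requires looking inside the archive or falling back on the extension.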

 

 

2. Test preparation

 

1) Add the dependencies to Gradle

(Adapt this to your own build.)

implementation("org.apache.tika:tika-parsers:1.18")
implementation("net.sf.jmimemagic:jmimemagic:0.1.5")
implementation("com.j256.simplemagic:simplemagic:1.16")
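Since version conflicts were the main pitfall mentioned earlier, one way to surface them early is to make Gradle fail on a conflict instead of silently picking a version. This is only a sketch for a `build.gradle.kts` file, assuming the build uses the Kotlin DSL as the lines above suggest; it is not taken from the original project, and the `force` coordinates are hypothetical placeholders.

```kotlin
// build.gradle.kts (sketch, not from the original project)
configurations.all {
    resolutionStrategy {
        // Fail the build when two dependencies request different versions
        // of the same module, so the clash is visible instead of hidden.
        failOnVersionConflict()

        // Hypothetical coordinates: pin one version once you know which module clashes.
        // force("com.example:some-library:1.2.3")
    }
}
```

With a library that pulls in as many transitive dependencies as tika-parsers, this setting may report many conflicts at once, so it is probably best used temporarily while diagnosing.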

 

2) Write the test code

(Path of the folder containing the files to test : E:\tmp)

 

2-1) Apache Tika test code

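import java.io.File
import org.apache.tika.Tika

fun fileType1() {
    // Apache Tika
    val rootPath = "E:\\tmp"
    val file = File(rootPath)
    if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")
    val tika = Tika()

    // walk() also yields directories, so keep regular files only
    file.walk().filter { it.isFile }.forEach {
        println(it.absolutePath)
        println(tika.detect(it) + "\n")
    }
}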

 

2-2) JMimeMagic test code

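import java.io.File
import net.sf.jmimemagic.Magic
import net.sf.jmimemagic.MagicMatch
import net.sf.jmimemagic.MagicMatchNotFoundException

fun fileType2() {
    // JMimeMagic
    val rootPath = "E:\\tmp"
    val file = File(rootPath)
    if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")

    // walk() also yields directories, so keep regular files only
    file.walk().filter { it.isFile }.forEach {
        println(it.absolutePath)
        try {
            // arguments: extensionHints = true, onlyMimeMatch = false
            val match: MagicMatch? = Magic.getMagicMatch(it, true, false)
            if (match == null) {
                println("file match x\n")
            } else {
                println("extension: ${match.extension} / mimeType: ${match.mimeType}\n")
            }
        } catch (e: MagicMatchNotFoundException) {
            // JMimeMagic can also report "no match" by throwing
            println("file match x\n")
        }
    }
}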

 

2-3) SimpleMagic test code

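import java.io.File
import com.j256.simplemagic.ContentInfo
import com.j256.simplemagic.ContentInfoUtil

fun fileType3() {
    // SimpleMagic
    val util = ContentInfoUtil()
    val rootPath = "E:\\tmp"
    val file = File(rootPath)
    if (!file.exists()) throw IllegalArgumentException("no ${file.absolutePath}")

    // walk() also yields directories, so keep regular files only
    file.walk().filter { it.isFile }.forEach {
        println(it.absolutePath)
        // findMatch returns null when no entry matches
        val info: ContentInfo? = util.findMatch(it)
        if (info == null) {
            println("file match x\n")
        } else {
            println("contentType: ${info.contentType} / mimeType: ${info.mimeType}\n")
        }
    }
}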
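The three test functions above repeat the same directory traversal and differ only in the detection call. As a rough sketch (the `scan` helper and its signature are hypothetical, not part of any of the three libraries), the traversal could be factored out so that each library is passed in as a function:

```kotlin
import java.io.File

// Hypothetical helper: walks `root`, skips directories, and pairs each file with
// whatever the supplied detector returns (null means "no match").
fun scan(root: File, detect: (File) -> String?): List<Pair<File, String?>> {
    require(root.exists()) { "no ${root.absolutePath}" }
    return root.walk()
        .filter { it.isFile }          // walk() also yields directories
        .map { it to detect(it) }
        .toList()
}
```

With the dependencies from this post on the classpath, a call might look like `scan(File("E:\\tmp")) { Tika().detect(it) }` or `scan(File("E:\\tmp")) { ContentInfoUtil().findMatch(it)?.mimeType }`.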

 

 

3. Results

 

3-1 Apache Tika

 

3-2 JMimeMagic

3-3 SimpleMagic

 

 

4. Result analysis

 

(Please note that these results apply only to the test code, sample files, and library versions used above; different files, code, or versions may produce different results.)

 

(Blue bold : identified correctly [by my own standard])

| FileType / Library | Apache Tika | JMimeMagic | SimpleMagic |
|---|---|---|---|
| .sin | text/plain | text/plain | x |
| .html | text/html | text/html | text/html |
| .png | image/png | image/png | image/png |
| .cpp | text/x-csrc | text/plain | x |
| .cpp (empty contents, only the extension changed) | text/x-c++src | text/plain | x |
| .pdf | application/pdf | application/pdf | application/pdf |
| .css | text/css | text/plain | x |
| .js | application/javascript | text/plain | x |
| .pptx | [warning] application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| .txt | text/plain | text/plain | x |
| .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| .hwp | application/x-hwp-v5 | application/msword | null |

 

 

For most of the files, Apache Tika turned out to be the most accurate.

That said, if you only deal with files such as png, pptx, and docx, there is no real need to use Apache Tika.
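The observation above can be checked mechanically against the table. The sketch below copies the table rows into a list (the `Row` class and `agreedTypes` function are hypothetical names introduced here) and keeps the file types for which all three libraries reported the same MIME type.

```kotlin
// One row of the comparison table: file type plus each library's reported result.
// "x" and "null" are kept exactly as they appear in the table.
data class Row(val type: String, val tika: String, val jmime: String, val simple: String)

// The .pptx entry uses the standard spelling "openxmlformats" in all three columns.
val pptx = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
val docx = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

val rows = listOf(
    Row(".sin", "text/plain", "text/plain", "x"),
    Row(".html", "text/html", "text/html", "text/html"),
    Row(".png", "image/png", "image/png", "image/png"),
    Row(".cpp", "text/x-csrc", "text/plain", "x"),
    Row(".cpp (empty)", "text/x-c++src", "text/plain", "x"),
    Row(".pdf", "application/pdf", "application/pdf", "application/pdf"),
    Row(".css", "text/css", "text/plain", "x"),
    Row(".js", "application/javascript", "text/plain", "x"),
    Row(".pptx", pptx, pptx, pptx),
    Row(".txt", "text/plain", "text/plain", "x"),
    Row(".docx", docx, docx, docx),
    Row(".hwp", "application/x-hwp-v5", "application/msword", "null")
)

// File types on which all three libraries gave the same answer.
fun agreedTypes(rows: List<Row>): List<String> =
    rows.filter { it.tika == it.jmime && it.jmime == it.simple }.map { it.type }
```

For this table, `agreedTypes(rows)` yields .html, .png, .pdf, .pptx, and .docx, which is the set of formats where the choice of library made no difference in this test.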

 

In short, pick whichever one fits your needs.