WO2015140562A1 - Steganographic document alteration - Google Patents

Steganographic document alteration

Info

Publication number
WO2015140562A1
WO2015140562A1 PCT/GB2015/050813 GB2015050813W WO2015140562A1 WO 2015140562 A1 WO2015140562 A1 WO 2015140562A1 GB 2015050813 W GB2015050813 W GB 2015050813W WO 2015140562 A1 WO2015140562 A1 WO 2015140562A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
data
encoded
geometrical features
parameter values
Prior art date
Application number
PCT/GB2015/050813
Other languages
French (fr)
Inventor
Ralph Mahmoud Omar
Original Assignee
Omarco Network Solutions Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Omarco Network Solutions Limited
Publication of WO2015140562A1

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32149Methods relating to embedding, encoding, decoding, detection or retrieval operations
    • H04N1/32203Spatial or amplitude domain methods
    • H04N1/32219Spatial or amplitude domain methods involving changing the position of selected pixels, e.g. word shifting, or involving modulating the size of image components, e.g. of characters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
    • G06F21/16Program or content traceability, e.g. by watermarking
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B42BOOKBINDING; ALBUMS; FILES; SPECIAL PRINTED MATTER
    • B42DBOOKS; BOOK COVERS; LOOSE LEAVES; PRINTED MATTER CHARACTERISED BY IDENTIFICATION OR SECURITY FEATURES; PRINTED MATTER OF SPECIAL FORMAT OR STYLE NOT OTHERWISE PROVIDED FOR; DEVICES FOR USE THEREWITH AND NOT OTHERWISE PROVIDED FOR; MOVABLE-STRIP WRITING OR READING APPARATUS
    • B42D1/00Books or other bound products
    • GPHYSICS
    • G07CHECKING-DEVICES
    • G07DHANDLING OF COINS OR VALUABLE PAPERS, e.g. TESTING, SORTING BY DENOMINATIONS, COUNTING, DISPENSING, CHANGING OR DEPOSITING
    • G07D7/00Testing specially adapted to determine the identity or genuineness of valuable papers or for segregating those which are unacceptable, e.g. banknotes that are alien to a currency
    • G07D7/005Testing security markings invisible to the naked eye, e.g. verifying thickened lines or unobtrusive markings or alterations
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32149Methods relating to embedding, encoding, decoding, detection or retrieval operations
    • H04N1/32203Spatial or amplitude domain methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/32Circuits or arrangements for control or supervision between transmitter and receiver or between image input and image output device, e.g. between a still-image camera and its memory or between a still-image camera and a printer device
    • H04N1/32101Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title
    • H04N1/32144Display, printing, storage or transmission of additional information, e.g. ID code, date and time or title embedded in the image data, i.e. enclosed or integrated in the image, e.g. watermark, super-imposed logo or stamp
    • H04N1/32149Methods relating to embedding, encoding, decoding, detection or retrieval operations
    • H04N1/3232Robust embedding or watermarking

Definitions

  • the present invention relates to methods and systems for encoding data steganographically within documents. Naturally, the invention also extends to methods and systems for decoding data from such documents, and also the documents themselves that contain steganographically encoded data therein.
  • Steganography is the practice of concealing data within other data.
  • the data to be hidden such as a message, is concealed within a generally open document so that the existence of the hidden data is not suggested by or apparent in the open document.
  • Steganography can be applied to both physical documents and electronic documents.
  • a 24-bit bitmap image file encodes the colour of each pixel using 8 bits for each colour component (red, green and blue). The least significant bit of each colour component of each pixel can thereby be altered to encode hidden data at a data density of three bits per pixel without the image change being perceptible to the human eye. If different identifiers are steganographically embedded within different versions of the same photograph, it is also possible for the photographer to track or verify the source of each distributed photograph. This requires analysis of each photograph to extract the data from the least significant bit of each colour component of each pixel.
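The least-significant-bit scheme described above can be sketched as follows. This is a minimal illustration only: the function names `embed_lsb` and `extract_lsb` and the representation of an image as a list of `(r, g, b)` tuples are assumptions made for the example, not part of the source.

```python
def embed_lsb(pixels, bits):
    """Hide bits in the least significant bit of each colour component.

    pixels: list of (r, g, b) tuples with 8-bit components (assumed layout).
    bits:   sequence of 0/1 values, at most three per pixel.
    """
    flat = [component for pixel in pixels for component in pixel]
    bits = list(bits)
    if len(bits) > len(flat):
        raise ValueError("payload exceeds three bits per pixel")
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the payload bit.
        flat[i] = (flat[i] & ~1) | bit
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def extract_lsb(pixels, n_bits):
    """Read back the first n_bits least significant bits."""
    flat = [component for pixel in pixels for component in pixel]
    return [component & 1 for component in flat[:n_bits]]
```

Each colour component changes by at most one level out of 256, which is why the alteration is not perceptible.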
  • the method comprises at least one of the steps of:
  • the method may also comprise identifying at least one group of geometrical features of the rendered form of the document. Ideally the method comprises registering parameter values that are associated with and define geometrical characteristics of the geometrical features. Thus, carrying out the document alteration instructions may involve altering the parameter values associated with and defining the geometrical
  • the method further comprises selecting a reference.
  • a reference set of parameter values is selected using a predetermined reference selection process.
  • the method further comprises mapping the data to be steganographically encoded within the document to a set of relative differences by using a predetermined mapping process. Furthermore it is preferred that the method comprises encoding the data within the document by modifying the identified geometrical features of the document using a predetermined document alteration process. Ideally this is so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values, and ideally by an amount dependent on the set of relative differences.
  • the geometrical features are typographical features, such as those associated with text.
  • the geometrical features may comprise at least one of: baselines, mean lines, cap height, descender height, ascender height, character size, character position, character kerning, letter spacing, sentence spacing, size of dots and/or diacritics, size and position of superscript and/or subscript text, paragraph position and margin size.
  • alterations to the geometrical features include at least one of: translation, scaling and rotation of those geometrical features. Ideally, said alterations are carried out with respect to a predetermined reference. Ideally, said alterations are made to otherwise regularly repeated geometrical features.
  • the method further comprises receiving at least one user input; and wherein the data to be encoded is derived, at least in part, from the at least one user input.
  • receiving at least one user input comprises receiving at least one of: biometric data and an alphanumeric code.
  • the method further comprises determining at least one metric
  • the data to be encoded is derived, at least in part, from the at least one metric.
  • the method further comprises treating at least part of the data to be encoded prior to steganographically encoding the data within the document.
  • Said treating may comprise encrypting it using an encryption process.
  • the encryption process comprises receiving as an input at least one of: a user input and a metric representing the content of the document.
  • Said treating may comprise appending a verifier to the data, such as a checksum.
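Appending a verifier can be sketched as below. The source names only "a checksum"; the choice of CRC-32 (via Python's standard `zlib.crc32`), the four-byte big-endian framing and the function names are assumptions made for illustration.

```python
import zlib


def append_checksum(payload: bytes) -> bytes:
    """Return the payload followed by its CRC-32 as four big-endian bytes."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def verify_and_strip(data: bytes) -> bytes:
    """Check the trailing CRC-32 and return the payload, or raise ValueError."""
    if len(data) < 4:
        raise ValueError("data too short to contain a checksum")
    payload, stored = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch")
    return payload
```

A decoder can call `verify_and_strip` on the extracted data to detect corruption before the data is output.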
  • the method further comprises choosing an encoding strategy. Ideally, this is so as to determine how to map the data to be steganographically encoded within the document to a set of document alteration instructions.
  • the method further comprises checking the altered document to determine that data has been successfully encoded therein.
  • This checking step may comprise comparing the data extracted by a decoding method with the original data to be encoded.
  • the altered document is first treated with a degradation process prior to the checking step so as to test the resistance of the encoded data to corruption.
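The checking step, with an optional degradation pass, can be sketched on a toy model. Here a document is reduced to a list of baseline positions in pixels and each data-carrying spacing holds four bits relative to the first spacing; this representation, the halving degradation and all function names are assumptions chosen for brevity rather than the method of the source.

```python
def decode_spacings(baselines):
    """Read four bits from each spacing, relative to the first spacing."""
    spacings = [b - a for a, b in zip(baselines, baselines[1:])]
    if len(spacings) < 2:
        raise ValueError("need a reference spacing and at least one data spacing")
    reference = spacings[0]
    nibbles = []
    for spacing in spacings[1:]:
        diff = spacing - reference
        if not 0 <= diff <= 15:
            raise ValueError("spacing outside the encodable range")
        nibbles.append(format(diff, "04b"))
    return "".join(nibbles)


def halve_resolution(baselines):
    """Toy degradation: down-sample by two, so positions land on a coarser grid."""
    return [b // 2 for b in baselines]


def encoding_survives(payload, baselines, degrade=None):
    """Decode the (optionally degraded) document and compare with the payload."""
    candidate = degrade(baselines) if degrade else baselines
    try:
        return decode_spacings(candidate) == payload
    except ValueError:
        return False
```

In this toy model a one-pixel-per-step encoding passes the plain check but fails once the resolution is halved, which is the kind of weakness the degradation test is meant to expose.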
  • a method of processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans.
  • the method comprises at least one of the steps of:
  • decoding data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process.
  • a computer program arranged to carry out the method of the first or second aspects of the present invention.
  • a computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising at least one executable portion which, when executed, carries out a method according to the first or second aspect of the present invention.
  • the system comprises at least one of:
  • a reader ideally arranged to read in the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified and altered;
  • a processor ideally arranged to: analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;
  • a system for processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans.
  • the system comprises at least one of:
  • a reader ideally arranged to read the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified;
  • a processor ideally arranged to:
  • decode data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process;
  • the system is in the form of a document replication apparatus, such as a photocopier, wherein the reader is a scanner, and the output module is a printer.
  • the system is arranged to control replication of a document in dependence on scanning and decoding steganographically embedded data therein. For example, if a certain item of steganographic data indicates that a document is a protected document, then the system can cease replication of that document.
  • a document within which data is steganographically encoded.
  • the document comprises a medium (such as paper) supporting geometrical features thereon.
  • the geometrical features may be printed on the medium.
  • the geometrical features are arranged to encode data steganographically in a way that is machine-readable yet is substantially imperceptible to humans.
  • the document comprises at least one group of geometrical features having associated parameter values that define geometrical characteristics of the geometrical features.
  • the document comprises a reference - for example, via features supporting a reference set of parameter values.
  • a set of relative differences between the parameter values of the at least one group of geometrical features and the reference set of parameter values steganographically encodes data within the document.
  • the geometrical features are typographical features.
  • the medium supports geometrical features in a way that is machine readable via visible light scanning of the document.
  • the document is in the form of at least one of: a bank note, a cheque, an agreement document, a deed document, an assignment document, a power of attorney document, a book, an identity document, a legal document, a business document and a government document.
  • steps of the encoding method according to the first aspect of the present invention can be applied to the decoding method according to the second aspect of the present invention, optionally with a complementary or inverse operation of that step being performed.
  • a decoding process of the first aspect used to check the integrity of data steganographically encoded within a document may utilise the decoding method of the second aspect of the present invention.
  • Figure 1 is a schematic diagram of a system according to various embodiments of the present invention, the system being arranged to process documents so as to steganographically encode data therein, and/or to decode data already
  • Figure 2 is an encoding flow diagram according to various embodiments of the present invention.
  • Figure 3 is a decoding flow diagram according to various embodiments of the present invention.
  • Figure 4 is a sample from a rendered form of a source document of Figure 1, the sample including a number of different geometrical features associated with text of that source document.
  • Figure 1 is a schematic diagram of a system 1 according to various embodiments of the present invention, the system 1 being arranged to process source documents 5, 5a by modifying them to steganographically encode data 7 therein to form modified documents 6, 6a. To do this, the system 1 is arranged to carry out the encoding workflow as summarised in the flow diagram of Figure 2. The system 1 is also capable of decoding and extracting data 7 steganographically encoded within modified documents 6, 6a. To this end, the system 1 is arranged to carry out the decoding workflow as summarised in the flow diagram of Figure 3.
  • the system 1 comprises a computer 2, an image capture device in the form of a flat-bed scanner 3, and a printer 4.
  • the scanner 3 and printer 4 are communicatively linked to the computer 2, for example via a wired or wireless connection.
  • the scanner 3 enables a physical document to be converted into an electronic form which can then be processed electronically by the computer 2.
  • the printer 4 enables an electronic form of a document to be converted into a physical, printed document.
  • the computer 2 may comprise other features such as a display that provides a user with information (e.g. an output of an encoding or decoding process and/or the presentation of electronic documents), and input means, for example a mouse and keyboard (e.g.
  • the computer 2 may also comprise features such as a processor and/or processing modules for carrying out the processing steps (such as steps to encode or decode data).
  • the computer 2 may comprise memory and/or memory modules for storing and/or registering information such as data to be encoded, values and parameters.
  • the system may comprise a portable electronic computing device, such as a smart-phone (or a tablet).
  • the image capture device may be in the form of a camera of the smart-phone.
  • the computer may be a photocopier.
  • a first step 21 of the encoding process 20 is to obtain a source document 5
  • this may include converting a physical, printed form of the document 5a into an electronic form by scanning it in using the scanner 3.
  • other conversions may be carried out, possibly entirely in the electronic domain.
  • an electronic file such as an HTML file, or a Microsoft Word® file may first need to be rendered so that the
  • the rendered form of a document 5 may be a bitmapped version of the document. It will also be understood that a rendered document 5 may comprise more than one page.
  • a second step 22 of the process is analysis of the document. This is primarily to determine whether and how data 7 may be steganographically encoded within the document.
  • the second step 22 may include an assessment of the quantity and/or nature of content within a document, and a determination of how much data can be steganographically encoded within that content. This assessment or determination can be fed back to a user so that if the source document is unsuitable - for example, it does not have a sufficient quantity or type of content - the user can be provided with guidance on how to improve the source document 5. This is so that more data can be steganographically encoded within a processed, modified document 6, and/or data can be encoded in a way that makes the presence of the encoded data substantially imperceptible to humans.
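A capacity determination of this kind might look as follows for the baseline-spacing scheme used in the worked example of this document. The rule that one spacing is reserved as the reference and that each further spacing carries four bits is taken from that example; the function names and the idea of reducing the assessment to a baseline count are assumptions.

```python
def capacity_bits(n_baselines: int, bits_per_spacing: int = 4) -> int:
    """Estimate payload capacity when data is carried in baseline spacings.

    Assumes the first spacing (two baselines) is reserved as the reference
    and every further baseline adds one data-carrying spacing.
    """
    data_spacings = max(0, n_baselines - 2)
    return data_spacings * bits_per_spacing


def enough_content(n_baselines: int, payload_bits: int) -> bool:
    """Report whether the document can hold the payload under this scheme."""
    return capacity_bits(n_baselines) >= payload_bits
```

With four baselines the estimate is eight bits, matching the eight-bit payload of the worked example; a negative result could trigger the user guidance described above.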
  • determining whether the document has content in the form of text. Another assessment could be determining the amount of text and/or image within the document, by determining the proportion of whitespace in the image. This can be done using a relatively low resolution version of the document, and so is relatively less processor-intensive than operations such as OCR that involve identifying the formation of particular characters of text.
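The whitespace assessment can be sketched as a simple pixel count. The grayscale row-list representation, the threshold of 200 and the function name are assumptions made for the example.

```python
def whitespace_fraction(gray, threshold=200):
    """Return the proportion of pixels at or above the whiteness threshold.

    gray: rows of 0-255 grayscale values from a (possibly low-resolution)
    rendering of the document.
    """
    total = sum(len(row) for row in gray)
    if total == 0:
        raise ValueError("empty image")
    white = sum(1 for row in gray for value in row if value >= threshold)
    return white / total
```

Because it needs only one pass over a down-sampled image, a count like this is far cheaper than character recognition.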
  • the second step 22 includes an identification of predetermined types of geometrical features of the rendered form of the document.
  • geometrical features may include the arrangement and formation of text and/or images.
  • the second step 22 includes registering how those geometrical features are defined and characterised.
  • each geometrical feature may be at least partly defined by a number of parameters, and values of those parameters can be generated from analysis of the geometrical features.
  • registered parameter values associated with an identified geometrical feature can define the geometrical characteristics of that geometrical feature.
  • Parameter values also quantify the geometrical characteristics of geometrical features in a way that allows alteration of those geometrical features by shifting the associated parameter values.
  • the second step 22 of the encoding process also involves selecting a reference using a predetermined reference selection process.
  • This may include selecting one or more of the identified geometric features as a reference, and setting a reference set of parameter values associated with that geometric feature as the reference.
  • the second step 22 of the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used. Where a document comprises multiple pages, different geometrical features, parameter values and references may be used for each page.
  • the third step 23 of the encoding process 20 is obtaining the data 7 to be encoded. This may involve generating or deriving the data 7 from the content of the original document 5, from another predetermined source and/or from user-provided information.
  • the data may be encrypted and/or include a checksum.
  • the fourth step 24 of the encoding process involves choosing an encoding strategy, ideally in response to the result of carrying out the second step 22.
  • the encoding strategy is dependent on the analysed characteristics of the document of the second step - specifically, the determined characteristics of the geometrical features of the document 5 as represented by the parameter values, and as measured relative to a selected reference.
  • Each encoding strategy is effectively a predetermined document alteration process that maps the data to be steganographically encoded to alterations to the source document 5 so as to arrive at the modified document 6. As will be described in greater detail below, this involves shifting the parameter values relative to the selected reference in dependence on the data to be steganographically encoded. This allows that data to be hidden within the formatting of the document. This makes that data machine-readable, yet the presence of that data is substantially imperceptible to humans.
  • the fifth step 25 of the encoding process is carrying out the chosen encoding strategy to alter the document. This involves applying the changes to the parameter values so that the geometrical features of the original document 5 are altered, thereby encoding the data in a modified document 6.
  • the modified document 6 can then be re-rendered and outputted in a form which ensures that the formatting cannot be altered, for example as a non-editable PDF document, or simply printed.
  • the sixth step 26 of the encoding process is checking that the data has been properly encoded. This can simply be done by running a decoding process, as will be described below, and comparing the data extracted by the decoding process with the original data to be encoded (i.e. of the third step 23). To test the resistance of the encoded data to corruption, this sixth step may first include applying a degradation process to the modified document 6 - for example, reducing the colour-space by applying a thresholding algorithm, reducing the resolution of the document and/or transforming the image content of the document by rotation, scaling or x-y translation.
  • Decoding data overview
  • the decoding process is an inverse of the encoding process 20 to enable data 7 steganographically encoded within a modified document 6 to be extracted.
  • the first step 31 and second step 32 of the decoding process 30 are similar to the first and second steps 21, 22 of the encoding process 20, but applied to a modified document 6 (sometimes referred to as the "package").
  • the second step 32 of the decoding process 30 is analysis of the modified document 6 to determine how data 7 may have been steganographically encoded within the document 6. Again, this includes an identification of predetermined types of geometrical features, ideally with each geometrical feature being at least partly defined by a number of parameters, and the values of those parameters being measured relative to a reference selected by a predetermined reference selection process.
  • the second step 32 of the decoding process 30 follows the same series of rules as the encoding process 20 so that the order of the geometrical features, parameter values and references used to encode data can be reliably determined. In any case, a set of relative differences between the non-reference parameter values and the reference set of parameter values can be determined.
  • the third step 33 of the decoding process 30 is extraction of the data from the modified document 6. This is achieved by mapping the determined relative differences to the steganographically encoded data via a predetermined mapping process. The mapping is dependent on, and effectively the inverse of the encoding strategy employed.
  • the fourth step 34 of the decoding process 30 is outputting of the extracted data. This may involve verifying the data via a checksum component of the extracted data, and/or receiving a user input to decrypt the extracted data.
  • Document analysis examples
  • Figure 4 shows a sample from a rendered form of a source document 5, the sample including a number of different geometrical features associated with text 40 of that source document.
  • These geometrical features are generally typographical features having definitions that are well-known in the art of typography.
  • baselines 41 are generally defined as the lines upon which most letters of a standard body of text sit. From the sample of Figure 4, in the word "Steganography", nine of its letters "Ste-ano-ra-h-" sit on a respective baseline 41a and four of its letters "-g-g-p-y" also sit on the same baseline 41a but have descenders extending below the baseline.
  • Image analysis can be carried out on the rendered form of the document 5 to identify these baselines 41. Moreover, image analysis can also be used to determine the position, arrangement and extent of these baselines 41, and to populate the appropriate parameter values associated with these characteristics.
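The source does not say which image analysis is used. One common heuristic, offered here only as an assumed stand-in, is a horizontal projection profile: count the ink pixels in each row, treat each run of inked rows as a line of text, and take the last row that is still densely inked as the baseline (rows below it hold only sparse descenders). The function name and the `ratio` parameter are hypothetical.

```python
def find_baselines(image, ratio=0.5):
    """Estimate baseline rows from a binary image (1 = ink, 0 = background).

    Heuristic: within each run of inked rows, the baseline is the last row
    whose ink count is at least `ratio` times the densest row of that run.
    """
    profile = [sum(row) for row in image]
    baselines, run = [], []
    for y, ink in enumerate(profile + [0]):  # trailing 0 closes a final run
        if ink > 0:
            run.append(y)
        elif run:
            peak = max(profile[r] for r in run)
            body = [r for r in run if profile[r] >= ratio * peak]
            baselines.append(body[-1])
            run = []
    return baselines
```

The returned row indices could then populate baseline position parameters; a practical system would also need to handle skew, noise and lines without descenders more carefully.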
  • a "bx" position of a baseline in the sample may be registered.
  • This simple notation and example relies on the assumption that all the baselines are parallel to one another and the top edge of the sample.
  • the exact "bx" positions of these baselines can then be varied slightly relative to one another so that data can be encoded within the document in a way that is not easily perceptible. In the present example, this requires a reference to be set against which the parameter values associated with baseline position can be varied.
  • data can be encoded in the document by slightly varying the non-reference "bx" parameter values of the third and fourth baselines 41c, 41d.
  • This can be achieved using a predetermined mapping process which specifies how the "bx" values are to be altered to encode different data.
  • the mapping process may specify that the first four bits of a payload message can be mapped to the "bx" spacing between the second and third baselines 41b, 41c, and the second four bits of a payload message can be mapped to the "bx" spacing between the third and fourth baselines 41c, 41d.
  • These "bx" spacings can be subsequently compared against the reference spacing (of 400 pixels) to obtain the payload message based on the variance from the reference spacing.
  • a "bx" relative spacing of 400 pixels can represent the four bit message "0000"
  • a relative spacing of 415 pixels can represent the four bit message "1111" and so forth with intermediate pixel variances representing corresponding intermediate values between "0000" and "1111".
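The mapping from a four bit message to a spacing can be written out directly. The 400-pixel reference and the one-pixel-per-value step come from the example above; the function name and parameter names are assumptions.

```python
def spacing_for_nibble(bits: str, reference: int = 400, step: int = 1) -> int:
    """Map a four bit message such as "0011" to a baseline spacing in pixels."""
    if len(bits) != 4 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly four binary digits")
    return reference + int(bits, 2) * step
```

For instance, `spacing_for_nibble("0011")` gives a spacing three pixels larger than the reference.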
  • the "bx" positions of the baselines would be modified by the mapping process as follows:
  • a modified document 6 can be created, with the lines of text 40 being shifted to ensure that the baselines are positioned according to the above- specified parameter values.
  • the eight-bit payload message can be
  • a similar process can be carried out: firstly to determine the reference spacing of 400 pixels (between the first and second baselines); secondly to determine the non-reference spacings of: 403 pixels (between the second and third baselines) and 402 pixels (between the third and fourth baselines); and then thirdly mapping the differences in spacing "3" and "2" to two four bit messages "0011" and "0010". These can then be concatenated to form the eight bit message "00110010".
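The three decoding stages above can be sketched as one function. The spacings of 400, 403 and 402 pixels are from the example; the absolute baseline positions used in the usage note (starting at row 100) and the function name are assumptions.

```python
def decode_baselines(bx):
    """Decode four bits per spacing from baseline positions listed top to bottom.

    The first spacing is taken as the reference; each later spacing encodes
    its difference from the reference as a four bit message.
    """
    spacings = [lower - upper for upper, lower in zip(bx, bx[1:])]
    if len(spacings) < 2:
        raise ValueError("need a reference spacing and at least one data spacing")
    reference = spacings[0]
    message = []
    for spacing in spacings[1:]:
        diff = spacing - reference
        if not 0 <= diff <= 15:
            raise ValueError("difference does not fit in four bits")
        message.append(format(diff, "04b"))
    return "".join(message)
```

Calling `decode_baselines([100, 500, 903, 1305])` measures spacings of 400, 403 and 402 pixels and returns the eight bit message of the example.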
  • the maximum baseline shift from the standard reference spacing is 15 pixels out of 400. This is a difference of 3.75%, and so is very unlikely to be perceived by a user.
  • the use of a relative difference ensures that any document alteration that is carried out is relatively difficult to perceive. It will be understood that data could, in principle, be encoded in a document using absolute values. However, employing this approach does not make the method flexible enough to account for documents having significantly different geometrical features - for example, characters of different sizes, types and arrangements.
  • the reference employed in the present example is to a geometric feature that repeats regularly (namely the regular spacing of 400 pixels between adjacent baselines). So that a reference can be reliably employed, the selection of a reference is dependent on that reference being applicable to such regularly repeating geometrical features. This is so that data can be encoded in the small variations deviating from the reference. If there were large spatial variations between consecutive baselines 1 to 3, then the spacing between baselines 1 and 2 could not be used as a reference for modulating the spacing between baselines 2 and 3 to encode data. In view of this, methods and systems according to various embodiments of the present invention may employ different references for significantly varying geometrical features, even if they are of the same type (e.g. baselines).
  • baselines may be specified more accurately by including the two-dimensional coordinates of the start point and end point of a baseline. With this information, the length of each baseline can be determined along with the orientation of the baseline. Accordingly, these geometrical characteristics can be modulated relative to a reference to encode data.
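Deriving length and orientation from the two end points is elementary geometry and can be sketched as below. The coordinate convention (x to the right, y downwards, in pixels) and the function name are assumptions.

```python
import math


def baseline_length_and_angle(start, end):
    """Return (length in pixels, orientation in degrees) of a baseline.

    start, end: (x, y) coordinates of the baseline's two end points.
    The angle is measured from the positive x axis.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))
```

Either value could then be compared with that of a reference baseline, in the same way as the spacing in the earlier example.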
  • the perceptibility of changes made to geometrical features of a document will vary in dependence on factors such as the type of geometrical feature and the characteristic of it that is being modulated to encode data.
  • the amount of data that can be encoded by a variation increases with the number of different (machine-distinguishable) variations; in the baseline example above, 16 distinguishable spacings carry four bits.
  • this consideration also is related to the resistance to corruption of data encoded as geometric feature variations.
  • a variation of one pixel equates to a difference of one bit of information
  • a down-sampling of the document would cause corruption of the data encoded in the spacing of the baselines.
  • a coarser mapping may be used - e.g. a shift of 3 pixels representing a difference of 1 bit of information.
  • aspects of the present embodiments are particularly advantageous and applicable to geometrical features in the form of typographical features - i.e. those that are associated with text.
  • non-typographical geometrical features for example, predetermined shapes.
  • Mean line - generally a line above and parallel to the baseline which forms the upper boundary of most of the lowercase letters in a body of text.
  • Parameters which may be varied include the spacing between the mean line and a corresponding baseline.
  • the relative height between a capital letter and a consecutive lower case letter (as defined by the cap height and mean line) can be used to encode data.
  • Descender height - the vertical height of parts of a lower case letter which extend below the baseline.
  • parts of the letters “g”, “p” and “y” of the word “Steganography” have a descender that extends below the baseline to a beard line that is parallel to and below the baseline. Parameters which may be varied include the spacing between the beard line and a
  • the height of consecutive letters having a descender can be used to encode a payload message.
  • Ascender height - the vertical height of parts of a lower case letter which extend above the mean line.
  • parts of the letters “t” and “h” of the word “Steganography” extend above the mean line. Parameters similar to those in respect of the descender height can be used to encode data.
  • Character size and position - the vertical height, horizontal width and/or orientation between consecutive letters of the same type may be modulated to encode data.
  • Sentence spacing - i.e. the space size after a sentence.
  • parameters of these and other geometrical features which may be modulated to encode data relate to the positioning, size and/or orientation ideally relative to a reference, or another geometrical feature.
  • this ensures that the data being encoded within a document is resistant to corruption, especially when the document's colour-space is significantly reduced (e.g. reduced to monotone via a thresholding operation as typically carried out on a black-and-white photocopier).
  • the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used.
  • the rules effectively lead to the predetermined selection of geometrical features, parameter values and references.
  • the rules may determine that the first two baselines encountered in a document are to be used as a reference. Extending further, the rules may determine that baselines be used to encode a first portion of data 7, followed by letter spacing to encode a following second portion of that data, followed by sentence spacing for a third portion, and so forth. This ensures the different types of geometrical features can be utilised to maximise the data to be steganographically encoded within the document. Moreover, these rules may codify the manner in which geometrical features can be used to encode data. For example, if fonts are too small, or the spacing between sequential lines of text is too compact, then baseline modulation may be used to encode a smaller range of data (e.g. 2 bits instead of 4 bits), or be completely disregarded as a means of encoding data altogether so as to avoid data corruption or likely human perception of encoded data.
  • the third step 23 of the encoding process 20 involves obtaining the data 7 to be encoded. Once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy.
  • the data to be encoded can be generated or derived from a number of different sources:
  • the data to be encoded may be based on the content of the original document. This serves as an indicator of the integrity of the content of the document. For example, assuming there is text within the document, a series of metrics can be generated which relate to the content. Thus, if the content of the document were to be slightly changed (but retaining the formatting encoding the data), then this can be flagged by the metrics which are part of the data encoded within the document. These metrics may include at least one of: the number of pages in the document; the number of words, sentences, paragraphs and/or letters per document and/or page; and the instances of certain characters. For example, the letter "e" is one of the most frequently occurring letters in the English language.
  • the number of letters "e" in an English language document can be enumerated as a metric which represents the content of the document.
  • a different character or series of characters may be used for different languages and/or topics.
  • these metrics act as a checksum for the content, or content portions of the document.
  • These metrics can be beneficial, especially when they are spread across the entire document and/or include redundant data relating to the content of the document.
  • each page of a document may include metrics representing the content of the whole document (thereby enabling the checking of the integrity of a document as a whole, based on the data within any one page).
  • metrics embedded within one page can include a reference to the content on other pages, allowing a cross-check of the integrity of various pages of the document to be performed.
  • the data to be encoded may be based on user-provided information.
  • the data may include alphanumeric data entered by the user.
  • Such data may include an identifier, a PIN, a password or pass-phrase.
  • the data to be encoded may be based on information derived from another user input - for example, from biometric data generated from scanning a user's fingerprint or iris.
  • the data to be encoded may include information not dependent on a user or the document.
  • this may include a time stamp, a random number, a unique identifier and/or a version number of a program used to encode the data.
  • the data can then be further treated prior to steganographically encoding it (or part of that data) within the document.
  • portions of the data may be encrypted using one of a number of different techniques known in the art with the resulting cipher-text being passed to the encoding process that steganographically encodes it within the document.
  • One portion of the data e.g. the user-dependent data
  • the encryption of the other e.g. the data based on the content of the document.
  • a user-provided password can be used as an encryption key to a cryptographic process.
  • the metrics based on the content of the document may be used as a cryptographic salt.
  • many different types of cryptographic process may be employed, such as that disclosed by the Applicant in document: PCT/IB2011/052799 the content of which is hereby incorporated by reference to the extent permitted by applicable law.
  • An additional treatment of the data to be encoded could be to append a verifier (such as a checksum) which can be used to verify subsequent successful extraction of steganographically encoded data, for example via a decoding process such as the decoding method of the present embodiment.
  • the fourth step 24 of the encoding process involves choosing an encoding strategy. This may be dependent on second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. As mentioned, once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy. If the quantity of data and/or encoding strategy means that the data cannot be encoded, a sequence of auxiliary encoding strategies can be employed instead to encode the data. The user may be provided feedback about which encoding strategy is being used, and whether or not data can be encoded using that strategy. A user can be provided with a choice of encoding strategies, and provided with a means to choose one. This choice can be made automatically for the user in response to a password or other user-provided input. Advantageously, this means that both the data and the encoding strategy used to encode that data can be kept a secret.
  • an encoding strategy generally involves the alteration of a source document 5 in a way that is dependent on second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. In general, this involves changing the parameter values so that the geometrical features of the original document 5 are altered.
  • the encoding strategy may also alter a document in another way that assists subsequent decoding and extraction of the data 7 from a modified document 6.
  • the encoding strategy may alter a document to provide it with markings that indicate which encoding strategy has been used. Ideally, these markings are not associated with any of the geometrical features of the document, and so are independent of the content of a document 5.
  • markings thus define a clue which can facilitate a subsequent decoding process by specifying the encoding strategy used.
  • the second step 32 of the decoding process 30 can be driven following an identification of such markings.
  • These markings can be provided at a predetermined location within the document to facilitate detection of those markings, thereby speeding up the decoding process. Effectively, this means that the decoding process 30 can bypass the step of guessing which encoding process has been used.
  • portions of the data to be encoded could also be rendered within a modified document without being associated with or affecting the geometrical features.
  • a timestamp can be rendered in plaintext at a predetermined location of the document. Again, this can aid a subsequent decoding process and/or act as a verifier of data integrity and/or successful data extraction.
  • a technique may be provided whereby characteristics of the document can be determined and used to change the format of the document in a manner which is not readily perceptible to the human eye but which is detectable by an imaging and decoding technique.
  • the measured degree of format change acts as a unique identifier of the document and can be used to identify the original recipient of that copy.
  • Format change can take many forms. For example, in text, word spacing, font adjustments, line spacing, border size, indentation etc. can all be used either individually or in combination to slightly modify the document in a manner which is representative of the data to be encoded (e.g. a unique identifier). The modification is so slight that it is not readily perceivable to the human eye. However, such format changes are faithfully reproduced in any low-quality reproduction of that document. For example, no matter how many times the document is photocopied the format information is always reproduced faithfully. As mentioned, there are many different ways in which a conversion can occur to represent payload data.
  • One set of scenarios involves encoding a unique identifier.
  • the conversion is carried out by a conversion algorithm which analyses the document.
  • the conversion algorithm derives several non-format parameters (typically parameters based on content). Once the non-format parameters have been established by this process, they can be used as inputs into another algorithm (a formatting algorithm) for adjusting the formatting parameters of the document.
  • the document is then recreated either by printing or in electronic format (such as a PDF format) using the new formatting parameters as determined by the formatting algorithm.
  • the conversion algorithm can sum the words per page and count the letters in the page to arrive at two content parameters for the document. Also the letters can be summed according to a predetermined value given to each letter and an algorithmic rendition of any of these sums can be used to create a new summed parameter. These parameters can individually be provided to the formatting algorithm or alternatively the new summed parameter can be provided.
  • the formatting algorithm then spaces the words and letters within a given page so as not to change the layout (format) as perceived by the human eye, but with sufficient change to create an encoding to re-render into a value once the page formatting is analysed from an image of the newly-created document.
  • the document content can get summed automatically by normal search engine parameters and a digitised result created and then this digitised result submitted to a randomised time-based algorithm (with the time base element further concealed according to the base units time setting as is described in our co-pending international patent application no PCT/IB2011/052799). This not only allows for secure encryption but also post facto detection of the authoring machine. These results would be expressed in the spacing of the document.
  • a variant of this method is to scan a fingerprint on a portable device e.g. a mobile phone, render a digital value that is also encrypted not only through normal methods but via time-based encryption and sent to a central database (akin to that described in our co-pending international patent application no PCT/IB2011/052799).
  • the rate of decay in the time base renders even the mobile phone incapable of decrypting the fingerprint moments after it has sent it.
  • This digitised value can be used simultaneously to vary an electronic document that had been received on the mobile device so that each page of that document can be "signed" individually by the user in such a way that it is personal only to the user.
  • the method can be used to sign documents which require proof that the user has agreed to every page and that no page has been "slipped" in intentionally or accidentally after he/she has agreed to the whole document. This is the electronic equivalent of "initialling" every page, but can now be done totally electronically using portable devices.
  • the unique digital reference of the authoring device e.g. laptop or mobile phone
  • the digital signature now expressed in the document in a concealed way.
  • a time reference is printed in the open on the page and a concealed time reference is printed using the above method, with the central database alone able to decipher, from the unique identifier of the authoring machine, the appropriate time decay differential between the two (as per our co-pending international patent application number PCT/IB2011/052799). In this way, the document is further authorised as coming from a valid authoring machine.
  • This method can be used with all forms of value documents (including banknotes) to authorise that they have been printed by an authorised machine.
  • a time reference can be printed in the open on the document and used as an algorithm reference in a serial number as well as a set of micro text which would be printed at the same time as the serial number and spaced etc. as above with a document linking all three items but with the appropriate time decay known only to the central database.
  • the exact position of the side margin and the extent of the header and footer margins represent additional features which can be used in the determination of the unique identifier of the document. Besides the spacing of letters within words and words themselves, minute changes to punctuation mark spacings as well as line heights can be used.
  • the conversion algorithm re-renders the same document into a digital print version, such as in an Adobe PDF format, without any readily perceivable visual changes to the format created for the standard (original) document. However, these subtle format changes are present, but not perceivable by the human eye.
  • An unauthorised copier will not know whether their document has been sanitised and labelled for their use only or is a standard document which hasn't been treated by this aspect of the present invention.
  • the document when photocopied will faithfully reproduce the document format, such as the concealed spacing differences, so that when the illegally obtained document is scanned and the image processed by the formatting algorithm (operating in reverse) a set of parameters or a summed parameter can be created representing a unique identifier of the document.
  • These reconstituted parameters (which can be termed as a unique identifier of the document or its intended recipient) can be compared to the original parameters and if they are equal, this indicates that the scanned document is from the original which is linked to an intended recipient. Accordingly, this process enables linking of any processed document back to its original intended recipient.
  • a digital signature of a document can also be created and stored in addition to, or separately from, the converted document creation. This would be carried out by simply scanning the entire multiple page document, creating a signature code via the central database with date and other descriptive information about the document being used and storing this unique signature code at the central database. If it was an electronic document, the unique signature code would be created directly from the electronic version of the document. The signature code could be integrated into the document thus making the whole document "safe", not just its signature page. This embodiment can be used with mobile devices to type and sign electronic documents using the mobile device.
  • an app which scans the user's fingerprint using the camera function of the mobile phone, renders this into a digital encoding and then using the document altering embodiment described above imperceptibly alters every page of a document sent by word processing attachment or e-mail to the mobile phone and then sent back as a printed or PDF (or some other form of unalterable visual representation) file.
  • the app when conducting page by page verification of a document to a mobile device, can also have a vendor-specific code and a customer-specific code that is combined with the fingerprint code plus also using relativity co-ordinates (as described in our co-pending international patent application no PCT/IB2011/052799) to link them so that the vendor knows the verification code he gets from the app is personal to him and his customer.
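By way of illustration only, the sensitivity of a one-pixel mapping to down-sampling, and the benefit of a coarser mapping, can be sketched as follows. The sketch assumes the 400 pixel reference spacing of the example above; the function names, the starting position of 100 pixels and the crude model of down-sampling (positions snapping to even pixels) are hypothetical and are not taken from the described embodiments.

```python
from typing import List

REFERENCE_SPACING = 400  # pixels between adjacent baselines, as in the example above


def encode(symbols: List[int], pixels_per_step: int, first: int = 100) -> List[int]:
    """Baseline positions: two reference baselines, then one modulated spacing per symbol."""
    ys = [first, first + REFERENCE_SPACING]
    for s in symbols:
        ys.append(ys[-1] + REFERENCE_SPACING + s * pixels_per_step)
    return ys


def downsample_by_two(ys: List[int]) -> List[int]:
    """Assumed stand-in for halving the resolution: positions snap to even pixels."""
    return [(y // 2) * 2 for y in ys]


def decode(ys: List[int], pixels_per_step: int) -> List[int]:
    """Recover symbols from spacings measured relative to the reference spacing."""
    reference = ys[1] - ys[0]
    return [
        round(((b - a) - reference) / pixels_per_step)
        for a, b in zip(ys[1:], ys[2:])
    ]


symbols = [5, 10, 3]
fine = decode(downsample_by_two(encode(symbols, 1)), 1)    # one pixel per step
coarse = decode(downsample_by_two(encode(symbols, 3)), 3)  # three pixels per step
```

Under these assumptions the one-pixel mapping is corrupted by the simulated down-sampling, whereas the three-pixel mapping is recovered intact, at the cost of a larger and therefore potentially more perceptible alteration.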

Abstract

Methods and systems relating to encoding data steganographically within a document, and decoding such steganographically encoded data are disclosed. Data to be steganographically encoded is mapped to a set of document alteration instructions which, when carried out, alter geometrical features of the document. Data is decoded by analysing characteristics of such geometrical features. Computer programs and documents associated with these methods and systems are also disclosed.

Description

Steganographic Document Alteration
FIELD OF THE INVENTION
The present invention relates to methods and systems for encoding data steganographically within documents. Naturally, the invention also extends to methods and systems for decoding data from such documents, and also the documents themselves that contain steganographically encoded data therein.
BACKGROUND OF THE INVENTION
Steganography is the practice of concealing data within other data. The data to be hidden, such as a message, is concealed within a generally open document so that the existence of the hidden data is not suggested by or apparent in the open document. Steganography can be applied to both physical documents and electronic documents.
One of the better-known applications of steganography is watermarking of digital images. For example, a photographer can steganographically embed authorship data within photographs by subtly altering predetermined components of the digital image. For example, a 24-bit bitmap image file encodes the colour of each pixel using 8 bits for each colour component (red, green and blue). The least significant bit of each colour component of each pixel can thereby be altered to encode hidden data at a data density of three bits per pixel without the image change being perceptible to the human eye. If different identifiers are steganographically embedded within different versions of the same photograph, it is also possible for the photographer to track or verify the source of each distributed photograph. This requires analysis of each photograph to extract the data from the least significant bit of each colour component of each pixel.
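By way of illustration only, the least-significant-bit technique described above can be sketched as follows. The function names and the representation of an image as a list of (red, green, blue) tuples are assumptions made for the purpose of the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), 8 bits per colour component


def embed_lsb(pixels: List[Pixel], bits: List[int]) -> List[Pixel]:
    """Overwrite the least significant bit of each colour component with a payload bit."""
    if len(bits) > 3 * len(pixels):
        raise ValueError("payload exceeds the capacity of three bits per pixel")
    flat = [component for pixel in pixels for component in pixel]
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | (bit & 1)
    return [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]


def extract_lsb(pixels: List[Pixel], n_bits: int) -> List[int]:
    """Read back the least significant bit of each colour component."""
    flat = [component for pixel in pixels for component in pixel]
    return [component & 1 for component in flat[:n_bits]]


cover = [(200, 120, 33), (14, 255, 90)]
payload = [1, 0, 1, 1, 0, 0]
stego = embed_lsb(cover, payload)
recovered = extract_lsb(stego, len(payload))
```

Each colour component changes by at most one unit out of 256, which is why the alteration is not perceptible, but it is also why the hidden data does not survive an unfaithful copy.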
Similar steganography techniques can be used to watermark or fingerprint other documents, effectively providing them with identification marks which cannot be easily detected or noticed by the human eye. So long as the digital file is copied, the identification marks are also copied.
One of the problems in the art is that unfaithful copies of a document do not reliably retain the steganographically hidden data, especially if a poor quality reproduction technique is employed.
Thus, there is an inherent problem with providing confidential information to uncontrolled parties in that they may copy it and provide it to third parties in such a way that the source of a leak cannot be proved. Whilst watermarking and fingerprinting techniques are known for providing identification marks in documents which cannot be detected by the human eye, reproduction techniques (such as photocopying) can degrade content and these security features can be lost. This is particularly the case with documents containing text but also applies to image documents. There is a need therefore to encode data within a document, for example with a reference to the recipient of that document, without enabling the receiver to identify that reference and circumvent it - for example by simply covering up the reference in any given photocopy. There is also a need to avoid degradation of any security features with low-quality mass reproduction techniques such as photocopying.
It is an object of the present invention to ameliorate the above-mentioned problems, at least in part.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention, there is provided a method of altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans.
Ideally, the method comprises at least one of the steps of:
mapping the data to be steganographically encoded within the document to a set of document alteration instructions; and
carrying out the document alteration instructions to alter geometrical features of the document to steganographically encode data therein.
The method may also comprise identifying at least one group of geometrical features of the rendered form of the document. Ideally the method comprises registering parameter values that are associated with and define geometrical characteristics of the geometrical features. Thus carrying out the document alteration instructions may involve altering the parameter values associated with and defining the geometrical characteristics of the geometrical features.
Ideally, the method further comprises selecting a reference. Ideally a reference set of parameter values are selected using a predetermined reference selection process.
Ideally, the method further comprises mapping the data to be steganographically encoded within the document to a set of relative differences by using a predetermined mapping process. Furthermore it is preferred that the method comprises encoding the data within the document by modifying the identified geometrical features of the document using a predetermined document alteration process. Ideally this is so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values, and ideally by an amount dependent on the set of relative differences.
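By way of illustration only, one possible mapping of data to relative differences, and a corresponding alteration of baseline positions, can be sketched as follows. The sketch assumes that each modulated spacing carries 4 bits, that each unit of data corresponds to a shift of 3 pixels, and that the first two baselines provide the reference spacing; these values and all function names are hypothetical.

```python
from typing import List

PIXELS_PER_STEP = 3  # assumed coarse mapping: 3 pixels per unit of data


def data_to_relative_differences(data: bytes) -> List[int]:
    """Split each byte into two 4-bit symbols and map each symbol to a pixel offset."""
    diffs = []
    for byte in data:
        for symbol in (byte >> 4, byte & 0x0F):
            diffs.append(symbol * PIXELS_PER_STEP)
    return diffs


def alter_baselines(first_baseline: int, reference_spacing: int,
                    diffs: List[int]) -> List[int]:
    """Return baseline y-coordinates in pixels.

    The first two baselines are left unaltered and act as the reference;
    each later spacing equals the reference spacing plus one relative difference.
    """
    baselines = [first_baseline, first_baseline + reference_spacing]
    for diff in diffs:
        baselines.append(baselines[-1] + reference_spacing + diff)
    return baselines


diffs = data_to_relative_differences(b"\x5a")
baselines = alter_baselines(100, 400, diffs)
```

In this sketch one byte consumes two modulated spacings, so a page with few lines of text carries correspondingly little data by this feature alone.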
Ideally, the geometrical features are typographical features, such as those associated with text. For example, the geometrical feature may comprise at least one of: baselines, mean lines, cap height, descender height, ascender height, character size, character position, character kerning, letter spacing, sentence spacing, size of dots and/or diacritics, size and position of superscript and/or subscript text, paragraph position and margins size.
Ideally, alterations to the geometrical features include at least one of: translation, scaling and rotation of those geometrical features. Ideally, said alterations are carried out with respect to a predetermined reference. Ideally, said alterations are made to otherwise regularly repeated geometrical features.
Ideally, the method further comprises receiving at least one user input, wherein the data to be encoded is derived, at least in part, from the at least one user input. Ideally, receiving at least one user input comprises receiving at least one of: biometric data and an alphanumeric code.
Ideally, the method further comprises determining at least one metric representative of at least a portion of the content of the document. Ideally, the data to be encoded is derived, at least in part, from the at least one metric.
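By way of illustration only, metrics representative of the content of a document might be derived as sketched below. The particular metrics (words, sentences, letters and occurrences of a tracked character such as "e"), the simple regular expressions used to count them and the function name are assumptions made for the purpose of the example.

```python
import re
from typing import Dict


def content_metrics(text: str, tracked_char: str = "e") -> Dict[str, int]:
    """Derive simple counts that act as a checksum-like summary of the text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "letters": sum(ch.isalpha() for ch in text),
        "tracked": text.lower().count(tracked_char),
    }


original = content_metrics("The fee is ten pounds. Pay by Friday.")
tampered = content_metrics("The fee is two pounds. Pay by Friday.")
```

Here the altered text retains the same number of words, sentences and letters, yet the count of the tracked character differs, so the alteration would be flagged when the metrics are compared.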
Ideally, the method further comprises treating at least part of the data to be encoded prior to steganographically encoding the data within the document. Said treating may comprise encrypting it using an encryption process. Ideally, the encryption process comprises receiving as an input at least one of: a user input and a metric representing the content of the document. Said treating may comprise appending a verifier to the data, such as a checksum.
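By way of illustration only, appending a verifier to the data, and using it to confirm successful extraction, can be sketched as follows. The use of CRC-32 (via the Python standard library function `zlib.crc32`) and the four-byte big-endian framing are assumptions; any suitable checksum could be substituted.

```python
import zlib


def append_verifier(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (assumed choice of checksum)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_and_strip_verifier(framed: bytes) -> bytes:
    """Return the payload if the trailing checksum matches, otherwise raise."""
    if len(framed) < 4:
        raise ValueError("extracted data is too short to contain a verifier")
    payload, tail = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != tail:
        raise ValueError("verifier mismatch: extracted data is corrupt")
    return payload


framed = append_verifier(b"ID-0042")
restored = check_and_strip_verifier(framed)
```

A decoder that obtains a mismatch can conclude that extraction failed or that the wrong encoding strategy was assumed, rather than outputting corrupt data.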
Ideally, the method further comprises choosing an encoding strategy. Ideally, this is so as to determine how to map the data to be steganographically encoded within the document to a set of document alteration instructions.
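By way of illustration only, the choice between a primary encoding strategy and auxiliary encoding strategies, based on the quantity of data to be encoded, can be sketched as follows. The strategy names, the feature counts and the number of bits carried per feature are hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    bits_per_feature: Dict[str, int]  # feature type -> bits carried by each instance


def capacity_bits(strategy: Strategy, feature_counts: Dict[str, int]) -> int:
    """Total bits the strategy can carry given the features found by document analysis."""
    return sum(
        bits * feature_counts.get(feature, 0)
        for feature, bits in strategy.bits_per_feature.items()
    )


def choose_strategy(payload_bits: int, feature_counts: Dict[str, int],
                    strategies: List[Strategy]) -> Optional[Strategy]:
    """Try the primary strategy first, then each auxiliary strategy in order."""
    for strategy in strategies:
        if capacity_bits(strategy, feature_counts) >= payload_bits:
            return strategy
    return None  # the data cannot be encoded in this document


counts = {"baseline_spacing": 30, "sentence_spacing": 12}
primary = Strategy("baselines only", {"baseline_spacing": 2})
auxiliary = Strategy("baselines and sentences",
                     {"baseline_spacing": 2, "sentence_spacing": 1})
chosen = choose_strategy(64, counts, [primary, auxiliary])
```

A return value of `None` corresponds to the case in which the user would be informed that the data cannot be encoded using any available strategy.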
Ideally, the method further comprises checking the altered document to determine that data has been successfully encoded therein. This checking step may comprise comparing the data extracted by a decoding method with the original data to be encoded. Ideally, the altered document is first treated with a degradation process prior to the checking step so as to test the resistance of the encoded data to corruption.
According to a second aspect of the present invention, there is provided a method of processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the method comprises at least one of the steps of:
identifying at least one group of geometrical features of the rendered form of the document;
registering parameter values that are associated with and define geometrical characteristics of the geometrical features;
selecting a reference set of parameter values using a predetermined reference selection process;
analysing the geometrical features to determine a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and
decoding data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process.
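By way of illustration only, the decoding steps listed above can be sketched for the case of baseline spacing. The sketch assumes that the spacing between the first two baselines is the reference, that each later spacing carries one 4-bit symbol and that each unit of data corresponds to 3 pixels; these values and the function names are hypothetical.

```python
from typing import List

PIXELS_PER_STEP = 3  # assumed; must match the mapping used when encoding


def relative_differences(baselines: List[int]) -> List[int]:
    """Use the first spacing as the reference and return later spacings minus it."""
    if len(baselines) < 3:
        raise ValueError("need a reference spacing and at least one modulated spacing")
    reference = baselines[1] - baselines[0]
    return [(b - a) - reference for a, b in zip(baselines[1:], baselines[2:])]


def differences_to_data(diffs: List[int]) -> bytes:
    """Quantise each difference to the nearest 4-bit symbol and pair symbols into bytes."""
    symbols = [max(0, min(15, round(d / PIXELS_PER_STEP))) for d in diffs]
    pairs = zip(symbols[0::2], symbols[1::2])
    return bytes((high << 4) | low for high, low in pairs)


# Measured baseline positions, each modulated one carrying a pixel of scanning error.
measured = [100, 500, 916, 1345]
diffs = relative_differences(measured)
decoded = differences_to_data(diffs)
```

Quantising to the nearest step means that a measurement error of one pixel in each spacing does not change the decoded data under the assumed 3 pixel mapping.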
According to a third aspect of the present invention there is provided a computer program arranged to carry out the method of the first or second aspects of the present invention.
According to a fourth aspect of the present invention there is provided a computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising at least one executable portion which, when executed, carries out a method according to the first or second aspect of the present invention.
According to a fifth aspect of the present invention there is provided a system for altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans. Ideally, the system comprises at least one of:
a reader ideally arranged to read in the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified and altered;
a processor ideally arranged to:
analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;
select a reference set of parameter values;
map the data to be steganographically encoded within the document to a set of relative differences; and/or
encode the data within the document by modifying the identified geometrical features of the document so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values by an amount dependent on the set of relative differences; and
a writer to write out the modified document with the data steganographically encoded therein.
According to a sixth aspect of the present invention, there is provided a system for processing a document to decode data steganographically encoded within it. Ideally, the data is encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the system comprises at least one of:
a reader ideally arranged to read the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified;
a processor ideally arranged to:
analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;
select a reference set of parameter values;
calculate a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and/or
decode data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process;
and
an output module to output the decoded data.
Ideally, the system is in the form of a document replication apparatus, such as a photocopier, wherein the reader is a scanner, and the output module is a printer. Ideally, the system is arranged to control replication of a document in dependence on scanning and decoding steganographically embedded data therein. For example, if a certain item of steganographic data indicates that a document is a protected document, then the system can cease replication of that document.
The system may be arranged to issue an authorisation signal in dependence on decoding steganographically embedded data within a document.
According to a seventh aspect of the present invention, there is provided a document within which data is steganographically encoded. Ideally the document comprises a medium (such as paper) supporting geometrical features thereon. For example, the geometrical features may be printed on the medium. Ideally, the geometrical features are arranged to encode data steganographically in a way that is machine-readable yet is substantially imperceptible to humans. Ideally, the document comprises at least one group of geometrical features having associated parameter values that define geometrical characteristics of the geometrical features. Ideally, the document comprises a reference - for example, via features supporting a reference set of parameter values. Ideally, a set of relative differences between the parameter values of the at least one group of geometrical features and the reference set of parameter values steganographically encodes data within the document.
Ideally, the geometrical features are typographical features. Ideally, the medium supports geometrical features in a way that is machine readable via visible light scanning of the document. Ideally, the document is in the form of at least one of: a bank note, a cheque, an agreement document, a deed document, an assignment document, a power of attorney document, a book, an identity document, a legal document, a business document and a government document.
Further aspects of the present invention may reside in features of the various aspects of the present invention. Furthermore, it will be understood that features and/or advantages of the different aspects of the present invention may be combined and/or substituted where context allows, with the necessary changes being applied.
For example, steps of the encoding method according to the first aspect of the present invention can be applied to the decoding method according to the second aspect of the present invention, optionally with a complementary or inverse operation of that step being performed. By way of further example, a decoding process of the first aspect used to check the integrity of data steganographically encoded within a document may utilise the decoding method of the second aspect of the present invention.
BRIEF DESCRIPTION OF THE DRAWINGS
In order that the invention may be more readily understood, reference will now be made, by way of example, to the accompanying drawings in which:
Figure 1 is a schematic diagram of a system according to various embodiments of the present invention, the system being arranged to process documents so as to steganographically encode data therein, and/or to decode data already steganographically encoded therein;
Figure 2 is an encoding flow diagram according to various embodiments of the present invention;
Figure 3 is a decoding flow diagram according to various embodiments of the present invention; and
Figure 4 is a sample from a rendered form of a source document of Figure 1, the sample including a number of different geometrical features associated with text of that source document.
DESCRIPTION OF PREFERRED EMBODIMENTS
Figure 1 is a schematic diagram of a system 1 according to various embodiments of the present invention, the system 1 being arranged to process source documents 5, 5a by modifying them to steganographically encode data 7 therein to form modified documents 6, 6a. To do this, the system 1 is arranged to carry out the encoding workflow as summarised in the flow diagram of Figure 2. The system 1 is also capable of decoding and extracting data 7 steganographically encoded within modified documents 6, 6a. To this end, the system 1 is arranged to carry out the decoding workflow as summarised in the flow diagram of Figure 3.
The system 1 comprises a computer 2, an image capture device in the form of a flat-bed scanner 3, and a printer 4. The scanner 3 and printer 4 are communicatively linked to the computer 2, for example via a wired or wireless connection. The scanner 3 enables a physical document to be converted into an electronic form which can then be processed electronically by the computer 2. Conversely, the printer 4 enables an electronic form of a document to be converted into a physical, printed document. The computer 2 may comprise other features such as a display that provides a user with information (e.g. an output of an encoding or decoding process and/or the presentation of electronic documents), and input means, for example a mouse and keyboard (e.g. to receive inputs to an encoding or decoding process - for example, a password or a fingerprint). The computer 2 may also comprise features such as a processor and/or processing modules for carrying out the processing steps (such as steps to encode or decode data). The computer 2 may comprise memory and/or memory modules for storing and/or registering information such as data to be encoded, values and parameters.
It will be understood by a person skilled in the art that functional alternatives to the system 1 are possible. For example, the system may comprise a portable electronic computing device, such as a smart-phone (or a tablet). In such a case, the image capture device may be in the form of a camera of the smart-phone. In other alternatives, the computer may be a photocopier.
The following description focusses primarily on the processes carried out by the computer on the electronic forms of documents 5, 6. However, it will be understood that similar features, functions and advantages may be applied to the physical form of the documents 5a, 6a, where context allows.
Encoding data overview
An overview of the data encoding process 20 is summarised in Figure 2.
A first step 21 of the encoding process 20 is to obtain a source document 5 (sometimes referred to as the "carrier") and ensure that it is in a suitable form. This requires the rendered form of the document (i.e. one including geometrical features) to be available, so that image analysis and modification of the subsequent steps of the process can be carried out effectively. In the system 1 of Figure 1, this may include converting a physical, printed form of the document 5a into an electronic form by scanning it in using the scanner 3. However, other conversions may be carried out, possibly entirely in the electronic domain. For example, an electronic file, such as an HTML file, or a Microsoft Word® file may first need to be rendered so that the geometrical features of that document (e.g. the shape and arrangement of text and images) are available for processing. The rendered form of a document 5 may be a bitmapped version of the document. It will also be understood that a rendered document 5 may comprise more than one page.
A second step 22 of the process is analysis of the document. This is primarily to determine whether and how data 7 may be steganographically encoded within the document. The second step 22 may include an assessment of the quantity and/or nature of content within a document, and a determination of how much data can be steganographically encoded within that content. This assessment or determination can be fed back to a user so that if the source document is unsuitable - for example, it does not have a sufficient quantity or type of content - the user can be provided with guidance on how to improve the source document 5. This is so that more data can be steganographically encoded within a processed, modified document 6, and/or data can be encoded in a way that makes the presence of the encoded data substantially imperceptible to humans. One of the assessments that may be carried out is determining whether the document has content in the form of text. Another assessment could be determining the amount of text and/or image within the document, by determining the proportion of whitespace in the image. This can be done using a relatively low resolution version of the document, and so is relatively less processor-intensive than operations such as OCR that involve identifying the formation of particular characters of text.
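By way of illustration only, the whitespace assessment described above might be sketched as follows. The representation of a rendered page as rows of 0/1 pixels, the function names `whitespace_proportion` and `downsample`, and the sampling factor are assumptions made for this sketch and are not taken from the embodiments.

```python
def downsample(bitmap, factor):
    """Crude low-resolution copy: keep every `factor`-th pixel in each direction."""
    return [row[::factor] for row in bitmap[::factor]]


def whitespace_proportion(bitmap):
    """Fraction of pixels that carry no ink (value 0) in a binary bitmap."""
    total = sum(len(row) for row in bitmap)
    if total == 0:
        raise ValueError("bitmap has no pixels")
    ink = sum(1 for row in bitmap for px in row if px)
    return (total - ink) / total


# Hypothetical 4x4 page: 1 = ink, 0 = blank.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(whitespace_proportion(page))                 # 0.625
print(whitespace_proportion(downsample(page, 2)))  # 1.0 (the sample misses both inked rows)
```

The second result shows the cost of the cheaper assessment: an overly aggressive reduction in resolution can miss content altogether, so the sampling factor would need to be chosen with the expected text size in mind.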
The second step 22 includes an identification of predetermined types of geometrical features of the rendered form of the document. For example, geometrical features may include the arrangement and formation of text and/or images. In addition to this, the second step 22 includes registering how those geometrical features are defined and characterised. Specifically, each geometrical feature may be at least partly defined by a number of parameters, and values of those parameters can be generated from analysis of the geometrical features. Thus, registered parameter values associated with an identified geometrical feature can define the geometrical characteristics of that geometrical feature. Parameter values also quantify the geometrical characteristics of geometrical features in a way that allows alteration of those geometrical features by shifting the associated parameter values. The second step 22 of the encoding process also involves selecting a reference using a predetermined reference selection process. This may include selecting one or more of the identified geometric features as a reference, and setting a reference set of parameter values associated with that geometric feature as the reference. Ideally, the second step 22 of the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used. Where a document comprises multiple pages, different geometrical features, parameter values and references may be used for each page.
The third step 23 of the encoding process 20 is obtaining the data 7 to be encoded. This may involve generating or deriving the data 7 from the content of the original document 5, from another predetermined source and/or from user-provided information. The data may be encrypted and/or include a checksum.
The fourth step 24 of the encoding process involves choosing an encoding strategy, ideally in response to the result of carrying out the second step 22. Thus, the encoding strategy is dependent on the analysed characteristics of the document of the second step - specifically, the determined characteristics of the geometrical features of the document 5 as represented by the parameter values, and as measured relative to a selected reference. Each encoding strategy is effectively a predetermined document alteration process that maps the data to be steganographically encoded to alterations to the source document 5 so as to arrive at the modified document 6. As will be described in greater detail below, this involves shifting the parameter values relative to the selected reference in dependence on the data to be steganographically encoded. This allows that data to be hidden within the formatting of the document. This makes that data machine-readable, yet the presence of that data is substantially imperceptible to humans.
Moreover, as data is encoded in the formatting of a document, it is particularly resistant to corruption as a result of unfaithful or otherwise poor quality reproduction of that document, especially if a threshold operation is applied during a copying process. It should be noted that the document alteration is carried out with respect to the reference selected during the second step 22 of the encoding process. Thus, a reference set of parameter values will not be shifted, but rather the other non-reference parameter values will be shifted relative to the reference set of parameter values.
The fifth step 25 of the encoding process is carrying out the chosen encoding strategy to alter the document. This involves applying the changes to the parameter values so that the geometrical features of the original document 5 are altered, thereby encoding the data in a modified document 6. The modified document 6 can then be re-rendered and outputted in a form which ensures that the formatting cannot be altered, for example as a non-editable PDF document, or simply printed.
The sixth step 26 of the encoding process is checking that the data has been properly encoded. This can simply be done by running a decoding process, as will be described below, and comparing the data extracted by the decoding process with the original data to be encoded (i.e. of the third step 23). To test the resistance of the encoded data to corruption, this sixth step may first include applying a degradation process to the modified document 6 - for example, reducing the colour-space by applying a thresholding algorithm, reducing the resolution of the document and/or transforming the image content of the document by rotation, scaling or x-y translation.
Decoding data overview
An overview of the data decoding process 30 is summarised in Figure 3. Effectively, the decoding process is an inverse of the encoding process 20 to enable data 7 steganographically encoded within a modified document 6 to be extracted. The first step 31 and second step 32 of the decoding process 30 are similar to the first and second steps 21, 22 of the encoding process 20, but applied to a modified document 6 (sometimes referred to as the "package").
In particular, the second step 32 of the decoding process 30 is analysis of the modified document 6 to determine how data 7 may have been steganographically encoded within the document 6. Again, this includes an identification of predetermined types of geometrical features, ideally with each geometrical feature being at least partly defined by a number of parameters, and the values of those parameters being measured relative to a reference selected by a predetermined reference selection process. The second step 32 of the decoding process 30 follows the same series of rules as the encoding process 20 so that the order of the geometrical features, parameter values and references used to encode data can be reliably determined. In any case, a set of relative differences between the non-reference parameter values and the reference set of parameter values can be determined.
The third step 33 of the decoding process 30 is extraction of the data from the modified document 6. This is achieved by mapping the determined relative differences to the steganographically encoded data via a predetermined mapping process. The mapping is dependent on, and effectively the inverse of the encoding strategy employed.
The fourth step 34 of the decoding process 30 is outputting of the extracted data. This may involve verifying the data via a checksum component of the extracted data, and/or receiving a user input to decrypt the extracted data.
Document analysis examples
There are many different ways that a document may be analysed to identify geometrical features therein, especially as there can be many different types of geometrical features within a document. Aspects of the present embodiments are particularly advantageous and applicable to geometrical features in the form of typographical features - i.e. those that are associated with text. Accordingly, the following description will focus on the processing of text documents. However, it will be appreciated that the same principles and advantages can be extended to include non-textual features of a document.
Figure 4 shows a sample from a rendered form of a source document 5, the sample including a number of different geometrical features associated with text 40 of that source document. These geometrical features are generally typographical features having definitions that are well-known in the art of typography.
For example, one group of geometrical features are baselines 41. Baselines 41 are generally defined as the lines upon which most letters of a standard body of text sit. From the sample of Figure 4, in the word "Steganography", nine of its letters "Ste-ano-ra-h-" sit on a respective baseline 41a and four of its letters "---g---g--p-y" also sit on the same baseline 41a but have descenders extending below the baseline.
Image analysis can be carried out on the rendered form of the document 5 to identify these baselines 41. Moreover, image analysis can also be used to determine the position, arrangement and extent of these baselines 41, and to populate the appropriate parameter values associated with these characteristics.
By way of a trivial example, a "bx" position of a baseline in the sample may be registered. In the present case, the "bx" position is a notional distance in pixels from the top edge 42 of the sample (which is at position bx=0). Accordingly, the first baseline 41a of the first line of text of the sample may be registered as having the parameter/value: bx=1080, the second 41b as bx=1480, the third 41c as bx=1880, the fourth 41d as bx=2280. This simple notation and example relies on the assumption that all the baselines are parallel to one another and the top edge of the sample.
The exact "bx" positions of these baselines can then be varied slightly relative to one another so that data can be encoded within the document in a way that is not easily perceptible. In the present example, this requires a reference to be set against which the parameter values associated with baseline position can be varied. For example, the relative spacing between the first and second baselines 41a, 41b can be selected as a reference. In the present example, this reference value is calculated as 1480 - 1080 = 400 pixels. It will be noted that the spacing between the second and third baselines 41b, 41c is 400 pixels, as is the spacing between the third and fourth baselines 41c, 41d.
Accordingly, data can be encoded in the document by slightly varying the non-reference "bx" parameter values of the third and fourth baselines 41c, 41d. This can be achieved using a predetermined mapping process which specifies how the "bx" values are to be altered to encode different data. For example, the mapping process may specify that the first four bits of a payload message can be mapped to the "bx" spacing between the second and third baselines 41b, 41c, and the second four bits of a payload message can be mapped to the "bx" spacing between the third and fourth baselines 41c, 41d. These "bx" spacings can be subsequently compared against the reference spacing (of 400 pixels) to obtain the payload message based on the variance from the reference spacing. For example, a "bx" relative spacing of 400 pixels (zero variance) can represent the four bit message "0000" whereas a relative spacing of 415 pixels (variance of 15 pixels) can represent the four bit message "1111" and so forth with intermediate pixel variances representing corresponding intermediate values between "0000" and "1111". Thus, using this convention, to encode the message "00110010", the "bx" positions of the baselines would be modified by the mapping process as follows:
Baseline 41a: bx=1080;
Baseline 41b: bx=1480;
Baseline 41c: bx=1883;
Baseline 41d: bx=2285.
Accordingly, a modified document 6 can be created, with the lines of text 40 being shifted to ensure that the baselines are positioned according to the above-specified parameter values. Thus the eight-bit payload message can be steganographically encoded within the modified document 6.
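The worked example above can be reproduced with a short sketch. The function name `encode_baselines`, the list representation of "bx" values and the handling of unused baselines are assumptions made for illustration. The sketch also assumes, as in the example, that every gap being modulated initially equals the reference spacing.

```python
def encode_baselines(baselines, bits, bits_per_gap=4):
    """Shift non-reference baselines so that each gap carries `bits_per_gap` bits.

    The gap between the first two baselines is the reference and is never altered.
    """
    if len(bits) % bits_per_gap:
        raise ValueError("payload length must be a multiple of bits_per_gap")
    chunks = [bits[i:i + bits_per_gap] for i in range(0, len(bits), bits_per_gap)]
    if len(chunks) > len(baselines) - 2:
        raise ValueError("not enough non-reference gaps for this payload")

    reference = baselines[1] - baselines[0]
    modified = list(baselines[:2])
    for i in range(2, len(baselines)):
        k = i - 2
        if k < len(chunks):
            gap = reference + int(chunks[k], 2)    # variance from the reference = data
        else:
            gap = baselines[i] - baselines[i - 1]  # unused line keeps its original gap
        modified.append(modified[-1] + gap)
    return modified


print(encode_baselines([1080, 1480, 1880, 2280], "00110010"))
# [1080, 1480, 1883, 2285]
```

Because each baseline is positioned relative to the one before it, a shift applied to one line carries through to every subsequent line, which is why baseline 41d moves by 5 pixels in total although its own gap encodes only the value 2.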
To decode this data from the modified document 6, a similar process can be carried out: firstly to determine the reference spacing of 400 pixels (between the first and second baselines); secondly to determine the non-reference spacings of: 403 pixels (between the second and third baselines) and 402 pixels (between the third and fourth baselines); and then thirdly mapping the differences in spacing "3" and "2" to two four bit messages "0011" and "0010". These can then be concatenated to form the eight bit message "00110010".
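The three decoding steps just described can be sketched in the same style. The function name `decode_baselines` and the `n_gaps` argument (how many gaps carry data) are hypothetical; in practice that count would follow from the predetermined rules shared by the encoder and decoder.

```python
def decode_baselines(baselines, n_gaps, bits_per_gap=4):
    """Recover the payload from the variance of each gap relative to the reference gap."""
    if len(baselines) < n_gaps + 2:
        raise ValueError("not enough baselines for the requested number of gaps")
    reference = baselines[1] - baselines[0]           # step 1: reference spacing
    payload = []
    for i in range(2, 2 + n_gaps):
        spacing = baselines[i] - baselines[i - 1]     # step 2: non-reference spacing
        variance = spacing - reference
        if not 0 <= variance < 2 ** bits_per_gap:
            raise ValueError(f"variance {variance} is outside the encodable range")
        payload.append(format(variance, f"0{bits_per_gap}b"))  # step 3: map to bits
    return "".join(payload)


print(decode_baselines([1080, 1480, 1883, 2285], n_gaps=2))
# 00110010
```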
Using this convention, the maximum baseline shift from the standard reference spacing is 15 pixels out of 400. This is less than a 4% difference, and so is very unlikely to be perceived by a user. Advantageously, the use of a relative difference (which employs a comparison to a reference derived from the document itself) ensures that any document alteration that is carried out is relatively difficult to perceive. It will be understood that data could, in principle, be encoded in a document using absolute values. However, employing this approach does not make the method flexible enough to account for documents having significantly different geometrical features - for example, characters of different sizes, types and arrangements.
It should also be noted that the reference employed in the present example is to a geometric feature that repeats regularly (namely the regular spacing of 400 pixels between adjacent baselines). So that a reference can be reliably employed, the selection of a reference is dependent on that reference being applicable to such regularly repeating geometrical features. This is so that data can be encoded in the small variations deviating from the reference. If there were large spatial variations between consecutive baselines 1 to 3, then the spacing between baselines 1 and 2 could not be used as a reference for modulating the spacing between baselines 2 and 3 to encode data. In view of this, methods and systems according to various embodiments of the present invention may employ different references for significantly varying geometrical features, even if they are of the same type (e.g. baselines). For example, if there were a significant difference in the spacing between the baselines of the end and start of adjacent paragraphs (e.g. 700 pixels vs. the 400 pixel spacing between baselines of adjacent lines within the same paragraph), then a different reference would be selected, applicable only to adjacent baselines of different paragraphs.
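One possible way of assigning different references to significantly different spacings is sketched below. The grouping rule (a gap joins an existing class when it lies within a relative tolerance of that class's first-seen gap), the 10% tolerance and the name `group_gaps` are all assumptions of this sketch, not a method stated in the embodiments.

```python
def group_gaps(baselines, tolerance=0.1):
    """Assign each gap between adjacent baselines to a reference class.

    Returns (references, labels): `references[c]` is the first-seen gap of class c,
    and `labels[i]` is the class of the gap between baselines i and i + 1.
    """
    gaps = [b - a for a, b in zip(baselines, baselines[1:])]
    references, labels = [], []
    for gap in gaps:
        for idx, ref in enumerate(references):
            if abs(gap - ref) <= tolerance * ref:
                labels.append(idx)
                break
        else:
            references.append(gap)  # no class fits: this gap starts a new reference
            labels.append(len(references) - 1)
    return references, labels


# Lines within a paragraph (about 400 px apart) and one paragraph break (700 px).
refs, labels = group_gaps([1080, 1480, 1883, 2285, 2985, 3385])
print(refs)    # [400, 700]
print(labels)  # [0, 0, 0, 1, 0]
```

For such a scheme to survive encoding, the tolerance would need to exceed the largest variance used to carry data (15 pixels in 400, or 3.75%, in the example above), so that a modulated gap is still matched to its own reference.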
It should also be noted that whilst the above example uses a single one-dimensional parameter ("bx") to define the position of baselines, a more extensive set of parameters and values may be used to more accurately define the formation of geometrical features. For example, baselines may be specified more accurately by including the two-dimensional coordinates of the start point and end point of a baseline. With this information, the length of each baseline can be determined along with the orientation of the baseline. Accordingly, these geometrical characteristics can be modulated relative to a reference to encode data.
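As a brief illustration of the two-dimensional case, length and orientation follow directly from the start and end coordinates. The `(x, y)` tuple convention and the function name `baseline_geometry` are assumptions of this sketch.

```python
import math


def baseline_geometry(start, end):
    """Return (length, angle in degrees) of a baseline given its (x, y) end points."""
    dx = end[0] - start[0]
    dy = end[1] - start[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


# A level baseline 2000 pixels long at a notional vertical position of 1080.
length, angle = baseline_geometry((100, 1080), (2100, 1080))
print(length, angle)  # 2000.0 0.0
```

Either value could then be compared with the corresponding value of a reference baseline, in the same way that "bx" spacings are compared with the reference spacing.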
The perceptibility of changes made to geometrical features of a document will vary in dependence on factors such as the type of geometrical feature and the characteristic of it that is being modulated to encode data. However, the amount of data that can be encoded by a variation increases with the number of different (machine-distinguishable) variations. Furthermore, this consideration also is related to the resistance to corruption of data encoded as geometric feature variations. In the above example where a variation of one pixel equates to a difference of one bit of information, a down-sampling of the document would cause corruption of the data encoded in the spacing of the baselines. Thus, there is a trade-off to be made between perceptibility, data density and resistance to data corruption in view of unfaithful reproduction. For example, in the above case relating to baselines, a coarser mapping may be used - e.g. a shift of 3 pixels representing a difference of 1 bit of information.
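The coarser mapping can be sketched as follows, assuming a step of 3 pixels per encoded value and nearest-step rounding on decode; the function names and the rounding rule are illustrative assumptions.

```python
def value_to_shift(value, step=3):
    """Pixel variance that represents `value` under a coarse mapping."""
    return value * step


def shift_to_value(shift, step=3, levels=16):
    """Map a measured pixel variance back to the nearest encoded value."""
    value = (shift + step // 2) // step  # integer rounding to the nearest step
    if not 0 <= value < levels:
        raise ValueError(f"shift {shift} does not map to a valid value")
    return value


# A measurement error of one pixel either way no longer corrupts the data.
for measured in (8, 9, 10):
    print(measured, shift_to_value(measured))  # each maps to 3
```

The trade-off is visible in the numbers: with a 3 pixel step, a four-bit value needs a variance of up to 45 pixels rather than 15, which is 11.25% of a 400 pixel reference spacing and therefore more likely to be perceived.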
In view of this, it is useful to use many different geometrical features (and characteristics of those features) to encode data, so as to increase data density, whilst minimising human perceptibility and data corruption. The different types of text-based geometrical features are listed below by way of example.
As mentioned previously, aspects of the present embodiments are particularly advantageous and applicable to geometrical features in the form of typographical features - i.e. those that are associated with text. However, it will be appreciated that the same principles and advantages can be extended to non-typographical geometrical features, for example, predetermined shapes.
Example typographical features and parameters
a. Baselines - as set out above.
b. Mean line - generally a line above and parallel to the baseline which forms the upper boundary of most of the lowercase letters in a body of text.
Parameters which may be varied include the spacing between the mean line and a corresponding baseline.
c. Cap height - the vertical height of a capital letter relative to the baseline.
The relative height between a capital letter and a consecutive lower case letter (as defined by the cap height and mean line) can be used to encode data.
d. Descender height - the vertical height of parts of a lower case letter which extend below the baseline. For example, parts of the letters "g", "p" and "y" of the word "Steganography" have a descender that extends below the baseline to a beard line that is parallel to and below the baseline. Parameters which may be varied include the spacing between the beard line and a corresponding baseline. Alternatively, the height of consecutive letters having a descender can be used to encode a payload message.
e. Ascender height - the vertical height of parts of a lower case letter which extend above the mean line. For example, parts of the letters "t" and "h" of the word "Steganography" extend above the mean line. Parameters similar to those in respect of the descender height can be used to encode data.
f. Character size and position - the vertical height, horizontal width and/or orientation between consecutive letters of the same type may be modulated to encode data.
g. Kerning - the degree of overlap between adjacent characters.
h. Letter spacing - overall spacing of a word.
i. Sentence spacing - i.e. the space size after a sentence.
j. Size of dots or diacritics - e.g. different size period marks can encode different data.
k. Size and position of superscript and subscript - relative to normal text.
l. Paragraph adjustments - including alignment and justification. For example, if text is not fully justified (e.g. left-aligned only), then it is possible to use letter spacing to ensure that sequential lines of text can be used to encode different values.
m. Margin size - these are less helpful as they are likely to vary significantly, especially with physical documents which are often shifted in registration, especially when copying between media of different proportions (e.g. A4 to "US letter").
As can be seen, generally, parameters of these and other geometrical features which may be modulated to encode data relate to the positioning, size and/or orientation ideally relative to a reference, or another geometrical feature. Advantageously, this ensures that the data being encoded within a document is resistant to corruption, especially when the document's colour-space is significantly reduced (e.g. reduced to monotone via a thresholding operation as typically carried out on a black-and-white photocopier).
As mentioned, the encoding process follows a series of rules that determine which geometrical features, parameter values and references are to be used to encode data, and in which order they are to be used. Thus, the rules effectively lead to the predetermined selection of geometrical features, parameter values and references.
For example, the rules may determine that the first two baselines encountered in a document are to be used as a reference. Extending further, the rules may determine that baselines be used to encode a first portion of data 7, followed by letter spacing to encode a following second portion of that data, followed by sentence spacing for a third portion, and so forth. This ensures the different types of geometrical features can be utilised to maximise the data to be steganographically encoded within the document. Moreover, these rules may codify the manner in which geometrical features can be used to encode data. For example, if fonts are too small, or the spacing between sequential lines of text is too compact, then baseline modulation may be used to encode a smaller range of data (e.g. 2 bits instead of 4 bits), or be completely disregarded as a means of encoding data altogether so as to avoid data corruption or likely human perception of encoded data.
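A rule set of this kind could be represented as an ordered list of feature types, each with a number of usable slots and a number of bits per slot. The sketch below is one hypothetical representation; the feature names, slot counts and bit widths are invented for illustration, and the final chunk is left short if the payload does not fill it.

```python
def allocate(payload, slots):
    """Split `payload` (a bit string) across feature slots in rule order.

    `slots` is an ordered list of (feature_name, n_slots, bits_per_slot).
    Returns a list of (feature_name, bits) assignments.
    """
    plan, pos = [], 0
    for name, n_slots, bits in slots:
        for _ in range(n_slots):
            if pos >= len(payload):
                return plan  # payload fully placed
            plan.append((name, payload[pos:pos + bits]))
            pos += bits
    if pos < len(payload):
        raise ValueError("payload exceeds the capacity allowed by the rules")
    return plan


# Hypothetical rules: two baseline gaps at 4 bits each, then letter spacing at 1 bit each.
rules = [("baseline", 2, 4), ("letter_spacing", 3, 1)]
for feature, bits in allocate("0011001011", rules):
    print(feature, bits)
```

Reducing baseline modulation from 4 bits to 2 bits for compact text, as described above, would amount to changing the third element of the "baseline" rule; disregarding baselines altogether would amount to removing that rule.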
Examples of data to be encoded
As mentioned, the third step 23 of the encoding process 20 involves obtaining the data 7 to be encoded. Once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy.
As alluded to earlier, the data to be encoded can be generated or derived from a number of different sources:
Firstly, the data to be encoded may be based on the content of the original document. This serves as an indicator of the integrity of the content of the document. For example, assuming there is text within the document, a series of metrics can be generated which relate to the content. Thus, if the content of the document were to be slightly changed (but retaining the formatting encoding the data), then this can be flagged by the metrics which are part of the data encoded within the document. These metrics may include at least one of: the number of pages in the document; the number of words, sentences, paragraphs and/or letters per document and/or page; and the instances of certain characters. For example, the letter "e" is one of the most frequently occurring letters in the English language. Accordingly, the number of letters "e" in an English language document can be enumerated as a metric which represents the content of the document. A different character (or series of characters) may be used for different languages and/or topics. Effectively, these metrics act as a checksum for the content, or content portions of the document. These metrics can be beneficial, especially when they are spread across the entire document and/or include redundant data relating to the content of the document. For example, each page of a document may include metrics representing the content of the whole document (thereby enabling the checking of the integrity of a document as a whole, based on the data within any one page). Similarly, metrics embedded within one page can include a reference to the content on other pages, allowing a cross-check of the integrity of various pages of the document to be performed.
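A minimal sketch of such content metrics is given below, assuming the text of each page is already available as a string (for example from OCR). The choice of metrics, the dictionary keys and the function name `content_metrics` are illustrative assumptions.

```python
import re


def content_metrics(pages):
    """Simple content metrics for a document given as a list of page strings."""
    text = "\n".join(pages)
    return {
        "pages": len(pages),
        "words": len(re.findall(r"\w+", text)),
        "letter_e": text.lower().count("e"),
    }


original = ["Steganography hides data.", "See the text."]
altered = ["Steganography hides data.", "See text."]

print(content_metrics(original))  # {'pages': 2, 'words': 6, 'letter_e': 6}
print(content_metrics(original) != content_metrics(altered))  # True
```

Metrics of this kind are a weak check: an alteration that preserves every count (for example swapping one word for another of similar spelling) would pass unnoticed, which is one reason to combine several metrics and spread them across the document.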
Secondly, the data to be encoded may be based on user-provided information. For example, the data may include alphanumeric data entered by the user. Such data may include an identifier, a PIN, a password or pass-phrase. Similarly, the data to be encoded may be based on information derived from another user input - for example, from biometric data generated from scanning a user's fingerprint or iris.
Thirdly, the data to be encoded may include information not dependent on a user or the document. For example, this may include a time stamp, a random number, a unique identifier and/or a version number of a program used to encode the data.
Regardless of the source used, the data can then be further treated prior to steganographically encoding it (or part of that data) within the document.
For example, portions of the data may be encrypted using one of a number of different techniques known in the art with the resulting cipher-text being passed to the encoding process that steganographically encodes it within the document. One portion of the data (e.g. the user-dependent data) may be used to control the encryption of the other (e.g. the data based on the content of the document). For example, a user-provided password can be used as an encryption key to a cryptographic process. In a similar fashion, the metrics based on the content of the document may be used as a cryptographic salt. As mentioned, many different types of cryptographic process may be employed, such as that disclosed by the Applicant in document PCT/IB2011/052799, the content of which is hereby incorporated by reference to the extent permitted by applicable law.
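The combination of a password with a content-derived salt can be illustrated with a standard key-derivation function. The sketch below uses PBKDF2 from the Python standard library purely as a stand-in; it is not the cryptographic process of the referenced application, and the salt serialisation, iteration count and function name `derive_key` are assumptions. The cipher that would consume the derived key is outside the scope of the sketch.

```python
import hashlib


def derive_key(password, metrics, iterations=100_000, length=32):
    """Derive a key from a user password, salted with document content metrics."""
    # Serialise the metrics deterministically so the decoder can rebuild the same salt.
    salt = ",".join(f"{name}={metrics[name]}" for name in sorted(metrics)).encode("utf-8")
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=length
    )


metrics = {"pages": 2, "words": 6, "letter_e": 6}  # hypothetical content metrics
key = derive_key("correct horse", metrics)
print(len(key))  # 32
```

A consequence of this design is that the decoder must be able to recompute the same metrics from the document before it can derive the key, so any change to the measured content would also prevent decryption.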
An additional treatment of the data to be encoded could be to append a verifier (such as a checksum) which can be used to verify subsequent successful extraction of steganographically encoded data, for example via a decoding process such as the decoding method of the present embodiment.
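One simple way to realise such a verifier, offered purely as a sketch, is to append a CRC-32 checksum to the payload before encoding and to check it after extraction. CRC-32 is an assumption here; the text only requires "a verifier (such as a checksum)", and the helper names are hypothetical.

```python
import zlib

def append_verifier(payload: bytes) -> bytes:
    """Return the payload followed by its 4-byte big-endian CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_verifier(framed: bytes):
    """Return the payload if the trailing CRC-32 matches, otherwise None."""
    if len(framed) < 4:
        return None
    payload, tail = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == tail:
        return payload
    return None

framed = append_verifier(b"ID-0042")
recovered = check_verifier(framed)   # the original payload when extraction succeeded
```

A decoder that receives `None` knows that extraction failed or that the data was corrupted, and can, for example, retry with a different encoding strategy.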
Encoding strategies
The fourth step 24 of the encoding process involves choosing an encoding strategy. This may be dependent on the second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. As mentioned, once the data to be encoded has been obtained, its quantity can be enumerated for the purpose of determining whether that data can be encoded within the source document using a primary encoding strategy. If the quantity of data and/or the encoding strategy means that the data cannot be encoded, a sequence of auxiliary encoding strategies can be employed instead to encode the data. The user may be provided with feedback about which encoding strategy is being used, and whether or not data can be encoded using that strategy. A user can be provided with a choice of encoding strategies, and provided with a means to choose one. Alternatively, this choice can be made automatically for the user in response to a password or other user-provided input. Advantageously, this means that both the data and the encoding strategy used to encode that data can be kept secret.
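The fall-back from a primary strategy to auxiliary strategies can be sketched as a capacity check, as below. The `Strategy` record, the feature names and the bits-per-feature figures are hypothetical values invented for this example; an actual implementation would take them from the document analysis of step 22.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

@dataclass(frozen=True)
class Strategy:
    name: str
    feature: str            # kind of geometrical feature the strategy alters
    bits_per_feature: int   # assumed payload capacity of one such feature

def choose_strategy(payload_bits: int,
                    feature_counts: Mapping[str, int],
                    strategies: Sequence[Strategy]) -> Optional[Strategy]:
    """Return the first strategy (primary first) that can hold the payload."""
    for strategy in strategies:
        capacity = feature_counts.get(strategy.feature, 0) * strategy.bits_per_feature
        if capacity >= payload_bits:
            return strategy
    return None  # no strategy fits: report this back to the user

strategies = [
    Strategy("line-shift", "line", 1),       # primary
    Strategy("word-shift", "word_gap", 1),   # auxiliary
    Strategy("char-shift", "char", 2),       # auxiliary
]
counts = {"line": 40, "word_gap": 400, "char": 2000}
chosen = choose_strategy(128, counts, strategies)
```

With these assumed counts, a 128-bit payload does not fit in 40 line shifts, so the first auxiliary strategy is selected; the returned value can also drive the feedback given to the user.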
Once an appropriate encoding strategy is chosen, it can be carried out according to the fifth step 25 of the encoding process, and then checked according to the sixth step 26 of the encoding process.
As mentioned, an encoding strategy generally involves the alteration of a source document 5 in a way that is dependent on the second step 22 of document analysis, and/or the third step 23 of obtaining the data to be encoded. In general, this involves changing the parameter values so that the geometrical features of the original document 5 are altered. However, the encoding strategy may also alter a document in another way that assists subsequent decoding and extraction of the data 7 from a modified document 6. For example, the encoding strategy may alter a document to provide it with markings that indicate which encoding strategy has been used. Ideally, these markings are not associated with any of the geometrical features of the document, and so are independent of the content of a document 5. These markings thus define a clue which can facilitate a subsequent decoding process by specifying the encoding strategy used. Specifically, the second step 32 of the decoding process 30 can be driven following an identification of such markings. These markings can be provided at a predetermined location within the document to facilitate detection of those markings, thereby speeding up the decoding process. Effectively, this means that the decoding process 30 can bypass the step of guessing which encoding process has been used.
Similarly, portions of the data to be encoded could also be rendered within a modified document without being associated with or affecting the geometrical features. For example, a timestamp can be rendered in plaintext at a predetermined location of the document. Again, this can aid a subsequent decoding process and/or act as a verifier of data integrity and/or successful data extraction.
The above-described embodiment is flexible enough to be applied to a variety of different scenarios as will be described below.
Further examples and scenarios
A technique may be provided whereby characteristics of the document can be determined and used to change the format of the document in a manner which is not readily perceptible to the human eye but which is detectable by an imaging and decoding technique. The measured degree of format change acts as a unique identifier of the document and can be used to identify the original recipient of that copy.
Format change can take many forms. For example, in text, word spacing, font adjustments, line spacing, border size, indentation, etc. can all be used either individually or in combination to slightly modify the document in a manner which is representative of the data to be encoded (e.g. a unique identifier). The modification is so slight that it is not readily perceivable to the human eye. However, such format changes are faithfully reproduced even in a low-quality reproduction of that document. For example, no matter how many times the document is photocopied, the format information is always reproduced faithfully. As mentioned, there are many different ways in which a conversion can occur to represent payload data.
One set of scenarios involves encoding a unique identifier. The conversion is carried out by a conversion algorithm which analyses the document. The conversion algorithm derives several non-format parameters (typically parameters based on content). Once the non-format parameters have been established by this process, they can be used as inputs into another algorithm (a formatting algorithm) for adjusting the formatting parameters of the document. The document is then recreated either by printing or in electronic format (such as a PDF format) using the new formatting parameters as determined by the formatting algorithm.
For example, in one scenario, the conversion algorithm can sum the words per page and count the letters in the page to arrive at two content parameters for the document. Also, the letters can be summed according to a predetermined value given to each letter, and an algorithmic rendition of any of these sums can be used to create a new summed parameter. These parameters can be provided individually to the formatting algorithm or, alternatively, the new summed parameter can be provided. The formatting algorithm then spaces the words and letters within a given page so as not to change the layout (format) as perceived by the human eye, but with sufficient change to create an encoding that can be re-rendered into a value once the page formatting is analysed from an image of the newly-created document.
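The scenario above can be sketched as follows, under several stated assumptions: letters are valued a=1 to z=26, the "algorithmic rendition" is an arbitrary weighted sum reduced to a fixed number of bits, and each bit is expressed as a small widening or narrowing of one inter-word gap (in hypothetical point units). None of these particular choices comes from the specification; they only make the two-stage flow (conversion algorithm, then formatting algorithm) concrete.

```python
def content_parameters(page_text):
    """Conversion algorithm: word count, letter count and valued letter sum."""
    words = page_text.split()
    letters = [c.lower() for c in page_text if c.isascii() and c.isalpha()]
    valued_sum = sum(ord(c) - ord("a") + 1 for c in letters)  # a=1 ... z=26 (assumption)
    return len(words), len(letters), valued_sum

def summed_parameter(page_text, bits=16):
    """Combine the parameters into one value; the weights are arbitrary."""
    word_count, letter_count, valued_sum = content_parameters(page_text)
    return (word_count * 31 + letter_count * 17 + valued_sum) % (1 << bits)

def gap_offsets(value, bits=16, delta=0.2):
    """Formatting algorithm: one offset per inter-word gap, most significant
    bit first; +delta encodes a 1 and -delta encodes a 0."""
    return [delta if (value >> (bits - 1 - i)) & 1 else -delta
            for i in range(bits)]

value = summed_parameter("abc de", bits=8)
offsets = gap_offsets(value, bits=8)
```

The offsets would be added to the nominal word spacing when the page is re-rendered; `delta` needs to be large enough to survive printing and scanning yet small enough not to be noticed by a reader.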
The document content can be summed automatically by normal search engine parameters to create a digitised result, which is then submitted to a randomised time-based algorithm (with the time base element further concealed according to the base units time setting, as described in our co-pending international patent application no. PCT/IB2011/052799). This not only allows for secure encryption but also post facto detection of the authoring machine. These results would be expressed in the spacing of the document.
A variant of this method is to scan a fingerprint on a portable device, e.g. a mobile phone, and render a digital value that is encrypted not only through normal methods but also via time-based encryption, and that is sent to a central database (akin to that described in our co-pending international patent application no. PCT/IB2011/052799). In this case the rate of decay in the time base renders even the mobile phone incapable of decrypting the fingerprint moments after it has sent it. This digitised value can be used simultaneously to vary an electronic document that has been received on the mobile device so that each page of that document can be "signed" individually by the user in such a way that it is personal only to the user. The method can be used to sign documents which require proof that the user has agreed to every page and that no page has been "slipped" in intentionally or accidentally after he/she has agreed to the whole document. This is the electronic equivalent of "initialling" every page, but can now be done entirely electronically using portable devices.
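The per-page "initialling" idea can be illustrated with a keyed tag per page. The sketch below is an assumption-laden stand-in: it treats the digitised fingerprint value as a secret byte string and uses HMAC-SHA-256 over a whole-document digest, the page index and the page text. It does not model the time-based encryption of PCT/IB2011/052799, and the names `page_tags` and `verify_pages` are hypothetical. Each short tag would then be the payload encoded steganographically in its page.

```python
import hashlib
import hmac

def page_tags(pages, user_secret: bytes, tag_bytes: int = 4):
    """Return one short keyed tag per page, bound to the whole document."""
    doc_digest = hashlib.sha256("\f".join(pages).encode("utf-8")).digest()
    tags = []
    for index, text in enumerate(pages):
        message = doc_digest + index.to_bytes(4, "big") + text.encode("utf-8")
        tag = hmac.new(user_secret, message, hashlib.sha256).digest()
        tags.append(tag[:tag_bytes])
    return tags

def verify_pages(pages, tags, user_secret: bytes, tag_bytes: int = 4) -> bool:
    """True only if every page carries the tag expected for this user."""
    expected = page_tags(pages, user_secret, tag_bytes)
    return len(expected) == len(tags) and all(
        hmac.compare_digest(e, t) for e, t in zip(expected, tags))

secret = b"digitised-fingerprint-value"   # placeholder for the biometric value
pages = ["Page one terms.", "Page two terms.", "Signature page."]
tags = page_tags(pages, secret)
```

Because every tag depends on the digest of the whole document, inserting, removing or altering a page after signing causes verification to fail for the document as a whole.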
As a further security measure (in a further scenario), the unique digital reference of the authoring device, e.g. a laptop or mobile phone, is added into the digital signature now expressed in the document in a concealed way. As a further security measure (in a further scenario), a time reference is printed in the open on the page and a concealed time reference is printed using the above method, with the central database alone able to decipher, from the unique identifier of the authoring machine, the appropriate time decay differential between the two (as per our co-pending international patent application number PCT/IB2011/052799). In this way, the document is further authorised as coming from a valid authoring machine.
This method can be used with all forms of value documents (including banknotes) to confirm that they have been printed by an authorised machine. For example, a time reference can be printed in the open on the document and used as an algorithm reference in a serial number, as well as in a set of micro text which would be printed at the same time as the serial number and spaced etc. as above, with a document linking all three items but with the appropriate time decay known only to the central database. This would allow a check to be undertaken at ATMs or other forms of electronic value document sorters for physical documents, to determine not only whether the document has been printed by an authorised machine but also whether there has been an "overprint", i.e. an unauthorised print run using valid machines.
In addition, in any document, ordinary or otherwise, the exact position of the side margin and the extent of the header and footer margins represent additional features which can be used in the determination of the unique identifier of the document. Besides the spacing of letters within words and of the words themselves, minute changes to punctuation mark spacings as well as line heights can be used. The conversion algorithm re-renders the same document into a digital print version, such as an Adobe PDF format, without any readily perceivable visual changes to the format created for the standard (original) document. However, these subtle format changes are present, but not perceivable by the human eye.
An unauthorised copier will not know whether their document has been sanitised and labelled for their use only or is a standard document which hasn't been treated by this aspect of the present invention. The document when photocopied will faithfully reproduce the document format, such as the concealed spacing differences, so that when the illegally obtained document is scanned and the image processed by the formatting algorithm (operating in reverse) a set of parameters or a summed parameter can be created representing a unique identifier of the document. These reconstituted parameters (which can be termed as a unique identifier of the document or its intended recipient) can be compared to the original parameters and if they are equal, this indicates that the scanned document is from the original which is linked to an intended recipient. Accordingly, this process enables linking of any processed document back to its original intended recipient.
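The reverse operation described above, recovering an identifier from a scanned copy and comparing it with the original, might look like the following sketch. It assumes that inter-word gaps have already been measured from the scanned image (in hypothetical point units), that a nominal reference gap is known, and that a gap wider than the reference encodes a 1. The function names and the sample measurements are invented for illustration.

```python
def decode_gaps(measured_gaps, reference):
    """Map each measured gap to a bit (wider than reference = 1) and pack
    the bits, most significant first, into an integer identifier."""
    value = 0
    for gap in measured_gaps:
        value = (value << 1) | (1 if gap > reference else 0)
    return value

def matches_recipient(measured_gaps, reference, expected_id):
    """Compare the reconstituted identifier with the one on record."""
    return decode_gaps(measured_gaps, reference) == expected_id

# Gaps measured from a photocopy: nominal 3.0, with small scanning noise.
measured = [3.21, 2.79, 3.18, 2.83, 2.80, 2.78, 3.22, 2.81]
recovered = decode_gaps(measured, reference=3.0)   # 0b10100010
```

Thresholding against the reference tolerates measurement noise smaller than the encoded offset, which is why the copy can be traced even after repeated photocopying, provided the offsets remain resolvable by the scanner.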
Other format changes include character size alteration in such a way as to be imperceptible to the human eye but nonetheless containing information by reference to the variation to the standard size in the document. In all these incarnations, the document would be produced in the digital word processing form so that the character style and character size could be assessed in order that a reference point be established prior to it being rendered in a print style as in Adobe® software.
Conceptually, the conversion algorithm, the formatting algorithm and the inverse of the formatting algorithm can be incorporated into modern photocopiers such that certain documents with standardised algorithmic values will be refused for reproduction or scanning if they do not match a required intended recipient or document identity. This would represent an extra element to the above scenario, where part of the document setting would be used to produce the standardised algorithm to trigger the software-enabled scanner/photocopier, and would no doubt be used for banknotes or government documents etc.
In a further scenario, a digital signature of the document can also be created and stored in addition to, or separately from, the creation of the converted document. This would be carried out by simply scanning the entire multiple page document, creating a signature code via the central database with date and other descriptive information about the document being used, and storing this unique signature code at the central database. If it was an electronic document, the unique signature code would be created directly from the electronic version of the document. The signature code could be integrated into the document, thus making the whole document "safe", not just its signature page. This embodiment can be used with mobile devices to type and sign electronic documents using the mobile device.
In a further scenario, an app is provided which scans the user's fingerprint using the camera function of the mobile phone, renders this into a digital encoding and then, using the document altering embodiment described above, imperceptibly alters every page of a document sent to the mobile phone as a word processing attachment or by e-mail. The altered document is then sent back as a printed or PDF file (or some other form of unalterable visual representation). This would be used for very sensitive documents where the customer has to signify that they have agreed to each and every page.
Also, in another scenario, when conducting page-by-page verification of a document to a mobile device, the app can also have a vendor-specific code and a customer-specific code that are combined with the fingerprint code, also using relativity co-ordinates (as described in our co-pending international patent application no. PCT/IB2011/052799) to link them, so that the vendor knows the verification code he gets from the app is personal to him and his customer.

Claims

1. A method of altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans, the method comprising:
mapping the data to be steganographically encoded within the document to a set of document alteration instructions; and
carrying out the document alteration instructions to alter geometrical features of the document to steganographically encode data therein.
2. The method of claim 1, further comprising:
identifying at least one group of geometrical features of the rendered form of the document; and
registering parameter values that are associated with and define geometrical characteristics of the geometrical features; wherein
carrying out the document alteration instructions involves altering the parameter values associated with and defining the geometrical characteristics of the geometrical features.
3. The method of claim 1 or claim 2, further comprising selecting a reference set of parameter values using a predetermined reference selection process.
4. The method of any one of claims 1 to 3, further comprising:
mapping the data to be steganographically encoded within the document to a set of relative differences by using a predetermined mapping process; and
encoding the data within the document by modifying the identified geometrical features of the document using a predetermined document alteration process so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values by an amount dependent on the set of relative differences.
5. The method of any preceding claim, wherein the geometrical features are typographical features associated with text.
6. The method of claim 5, wherein the geometrical features comprise at least one of: baselines, mean lines, cap height, descender height, ascender height, character size, character position, character kerning, letter spacing, sentence spacing, size of dots and/or diacritics, size and position of superscript and/or subscript text, paragraph position and margin size.
7. The method of any preceding claim, wherein alterations to the geometrical features include at least one of: translation, scaling and rotation of those geometrical features.
8. The method of claim 7, wherein said alterations are carried out with respect to a predetermined reference.
9. The method of claim 7 or claim 8, wherein said alterations are made to otherwise regularly repeated geometrical features.
10. The method of any preceding claim, further comprising receiving at least one user input; and wherein the data to be encoded is derived, at least in part, from the at least one user input.
11. The method of claim 10, wherein receiving at least one user input comprises receiving at least one of: biometric data and an alphanumeric code.
12. The method of any preceding claim, further comprising determining at least one metric representative of at least a portion of the content of the document; and wherein the data to be encoded is derived, at least in part, from the at least one metric.
13. The method of any preceding claim, further comprising treating at least part of the data to be encoded prior to steganographically encoding the data within the document.
14. The method of claim 13, wherein treating at least part of the data to be encoded comprises encrypting it using an encryption process.
15. The method of claim 14, wherein the encryption process comprises receiving as an input at least one of: a user input and a metric representing the content of the document.
16. The method of any one of claims 13 to 15, wherein treating at least part of the data to be encoded comprises appending a verifier to the data.
17. The method of any preceding claim, further comprising choosing an encoding strategy so as to determine how to map the data to be steganographically encoded within the document to a set of document alteration instructions.
18. The method of any preceding claim, further comprising checking the altered document to determine that data has been successfully encoded therein.
19. The method of claim 18, wherein said checking step comprises comparing the data extracted by a decoding method with the original data to be encoded.
20. The method of claim 18 or 19, wherein the altered document is first treated with a degradation process prior to the checking step so as to test the resistance of the encoded data to corruption.
21. A method of processing a document to decode data steganographically encoded within it, the data being encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans, the method comprising:
identifying at least one group of geometrical features of the rendered form of the document;
registering parameter values that are associated with and define geometrical characteristics of the geometrical features;
selecting a reference set of parameter values using a predetermined reference selection process;
analysing the geometrical features to determine a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and
decoding data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process.
22. A computer program arranged to carry out the method of any preceding claim.
23. A computer program product comprising at least one non-transitory computer-readable storage medium having computer-readable program code portions stored therein, the computer-readable program code portions comprising at least one executable portion which, when executed, carries out a process according to any one of claims 1 to 21.
24. A system for altering a document to encode data steganographically within it, so that the encoded data, from a rendered form of that document, is machine-readable yet is substantially imperceptible to humans, the system comprising:
a reader arranged to read in the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified and altered;
a processor arranged to:
analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;
select a reference set of parameter values;
map the data to be steganographically encoded within the document to a set of relative differences; and
encode the data within the document by modifying the identified geometrical features of the document so as to alter the parameter values associated with those geometrical features relative to the reference set of parameter values by an amount dependent on the set of relative differences; and
a writer to write out the modified document with the data steganographically encoded therein.
25. A system for processing a document to decode data steganographically encoded within it, the data being encoded in a rendered form of that document in a way that is machine-readable yet is substantially imperceptible to humans, the system comprising:
a reader arranged to read the document and convert it into an electronic format in which at least one group of geometrical features of the rendered form of the document can be identified;
a processor arranged to:
analyse the at least one group of geometrical features to derive therefrom parameter values that are associated with and define geometrical characteristics of the geometrical features;
select a reference set of parameter values;
calculate a set of relative differences between the parameter values associated with the geometrical features and the reference set of parameter values; and
decode data steganographically encoded within the document by mapping the set of relative differences into said data using a predetermined mapping process; and
an output module to output the decoded data.
26. The system of claim 25 in the form of a document replication apparatus, wherein the reader is a scanner, and the output module is a printer.
27. The system of claim 26 arranged to control replication of a document in dependence on scanning and decoding steganographically embedded data therein.
28. The system of any one of claims 25 to 27, arranged to issue an authorisation signal in dependence on decoding steganographically embedded data within a document.
29. A document comprising a medium supporting geometrical features thereon, the geometrical features arranged to encode data steganographically in a way that is machine-readable yet is substantially imperceptible to humans, wherein the document comprises:
at least one group of geometrical features having associated parameter values that define geometrical characteristics of the geometrical features; and
features supporting a reference set of parameter values;
wherein a set of relative differences between the parameter values of the at least one group of geometrical features and the reference set of parameter values steganographically encodes data within the document.
30. The document of claim 29, wherein the geometrical features are typographical features.
31. The document of claim 29 or claim 30, wherein the medium supports geometrical features in a way that is machine readable via visible light scanning of the document.
32. The document of any one of claims 29 to 31, in the form of at least one of: a bank note, a cheque, an agreement document, a deed document, an assignment document, a power of attorney document, a book, an identity document, a legal document, a business document and a government document.
PCT/GB2015/050813 2014-03-19 2015-03-19 Steganographic document alteration WO2015140562A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1404959.7 2014-03-19
GB1404959.7A GB2524724B (en) 2014-03-19 2014-03-19 Steganographic document alteration

Publications (1)

Publication Number Publication Date
WO2015140562A1 true WO2015140562A1 (en) 2015-09-24

Family

ID=50635075

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2015/050813 WO2015140562A1 (en) 2014-03-19 2015-03-19 Steganographic document alteration

Country Status (2)

Country Link
GB (1) GB2524724B (en)
WO (1) WO2015140562A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0629972A2 (en) * 1993-04-23 1994-12-21 Hewlett-Packard Company Method and apparatus for embedding identification codes in printed documents
US5467447A (en) * 1990-07-24 1995-11-14 Vogel; Peter S. Document marking system employing context-sensitive embedded marking codes
US5629770A (en) * 1993-12-20 1997-05-13 Lucent Technologies Inc. Document copying deterrent method using line and word shift techniques

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0401258B1 (en) * 1988-02-11 1994-11-30 VOGEL, Peter Samuel Document marking system
US7286684B2 (en) * 1994-03-17 2007-10-23 Digimarc Corporation Secure document design carrying auxiliary machine readable information
US8014557B2 (en) * 2003-06-23 2011-09-06 Digimarc Corporation Watermarking electronic text documents
JP4532331B2 (en) * 2004-12-08 2010-08-25 株式会社リコー Information embedding device, information extracting device, information embedding method, information extracting method, information embedding program, and information extracting program
SG155791A1 (en) * 2008-03-18 2009-10-29 Radiantrust Pte Ltd Method for embedding covert data in a text document using character rotation

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037469B2 (en) 2019-05-20 2021-06-15 Advanced New Technologies Co., Ltd. Copyright protection based on hidden copyright information
US11042612B2 (en) 2019-05-20 2021-06-22 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information
US11056023B2 (en) 2019-05-20 2021-07-06 Advanced New Technologies Co., Ltd. Copyright protection based on hidden copyright information
US11062000B2 (en) 2019-05-20 2021-07-13 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information
US11080671B2 (en) 2019-05-20 2021-08-03 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information
US11227351B2 (en) 2019-05-20 2022-01-18 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information
US11409850B2 (en) 2019-05-20 2022-08-09 Advanced New Technologies Co., Ltd. Identifying copyrighted material using embedded copyright information

Also Published As

Publication number Publication date
GB2524724A (en) 2015-10-07
GB2524724B (en) 2021-07-07
GB201404959D0 (en) 2014-04-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15728078

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15728078

Country of ref document: EP

Kind code of ref document: A1